Before getting started, make sure that you have installed the package.
To install the package from CRAN:
install.packages("DiDforBigData")
To install the package from Github:
devtools::install_github("setzler/DiDforBigData")
If the package is installed, make sure that you load it:
1. Choosing the Control Group
The control_group
argument
Depending on empirical context, the parallel-trends assumption may only hold for one of these possible control groups:
-
"never-treated"
: These are the control units that are never observed receiving the treatment within the sampling time frame. -
"future-treated"
: These are the control units that are observed receiving the treatment within the sampling time frame. -
"all"
: These include both the “never-treated” and “future-treated” groups.
DiDforBigData
makes it easy to choose among these
possibilities by setting the control_group
argument of the
DiD
function.
Example
We start by simulating some data to work with:
sim = SimDiD(seed=123, sample_size = 1000)
simdata = sim$simdata
print(simdata)
#> id year cohort Y
#> <int> <int> <num> <num>
#> 1: 1 2003 2012 9.556685
#> 2: 1 2004 2012 10.838554
#> 3: 1 2005 2012 10.639781
#> 4: 1 2006 2012 10.768935
#> 5: 1 2007 2012 10.613825
#> ---
#> 10996: 1000 2009 2010 6.640274
#> 10997: 1000 2010 2010 10.571372
#> 10998: 1000 2011 2010 11.282597
#> 10999: 1000 2012 2010 13.403816
#> 11000: 1000 2013 2010 12.542082
We can check what the true ATT is in the simulated data. In particular, let’s focus on the ATT for Cohort 2007:
print(sim$true_ATT[cohort==2007])
#> cohort event ATTge
#> <char> <num> <num>
#> 1: 2007 0 1
#> 2: 2007 1 2
#> 3: 2007 2 3
#> 4: 2007 3 4
#> 5: 2007 4 5
#> 6: 2007 5 6
#> 7: 2007 6 7
We also set up the variable names as a list so that
DiDforBigData
knows which variable is the outcome, which
variable is the treatment cohort, etc.:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
Now, we are ready to estimate DiD. We focus on the ATT for the 2007
cohort at event times -3,…,5. Initially, the control group is not set,
so the function will use the default option
control_group = "all"
:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol
#> <num> <num> <num> <num> <num> <num> <int>
#> 1: 2007 -3 -1 2004 0.2212255 0.10491102 751
#> 2: 2007 -2 -1 2005 0.1274586 0.10260112 751
#> 3: 2007 -1 -1 2006 0.0000000 0.00000000 751
#> 4: 2007 0 -1 2007 1.3464834 0.11215961 751
#> 5: 2007 1 -1 2008 2.2323825 0.09443787 751
#> 6: 2007 2 -1 2009 3.3705987 0.10350708 751
#> 7: 2007 3 -1 2010 4.2668717 0.10723955 501
#> 8: 2007 4 -1 2011 5.3241860 0.10738958 501
#> 9: 2007 5 -1 2012 6.1964127 0.13611288 251
#> Ntreated
#> <int>
#> 1: 249
#> 2: 249
#> 3: 249
#> 4: 249
#> 5: 249
#> 6: 249
#> 7: 249
#> 8: 249
#> 9: 249
We now consider changing the estimation to only use the “never-treated” control group:
did = DiD(inputdata = simdata, varnames = varnames,
control_group = "never-treated", min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol
#> <num> <num> <num> <num> <num> <num> <int>
#> 1: 2007 -3 -1 2004 0.10444344 0.1269592 251
#> 2: 2007 -2 -1 2005 -0.05942059 0.1298412 251
#> 3: 2007 -1 -1 2006 0.00000000 0.0000000 251
#> 4: 2007 0 -1 2007 1.27025637 0.1322487 251
#> 5: 2007 1 -1 2008 2.15386918 0.1171451 251
#> 6: 2007 2 -1 2009 3.35116597 0.1288344 251
#> 7: 2007 3 -1 2010 4.11246994 0.1195819 251
#> 8: 2007 4 -1 2011 5.34013939 0.1225179 251
#> 9: 2007 5 -1 2012 6.19641269 0.1361129 251
#> Ntreated
#> <int>
#> 1: 249
#> 2: 249
#> 3: 249
#> 4: 249
#> 5: 249
#> 6: 249
#> 7: 249
#> 8: 249
#> 9: 249
Note that the sample size for the control group,
Ncontrol
, is now substantially lower. Finally, we consider
changing this to only use the “future-treated” group:
did = DiD(inputdata = simdata, varnames = varnames,
control_group = "future-treated", min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol
#> <num> <num> <num> <num> <num> <num> <int>
#> 1: 2007 -3 -1 2004 0.2798501 0.1117079 500
#> 2: 2007 -2 -1 2005 0.2212720 0.1081106 500
#> 3: 2007 -1 -1 2006 0.0000000 0.0000000 500
#> 4: 2007 0 -1 2007 1.3847494 0.1183855 500
#> 5: 2007 1 -1 2008 2.2717963 0.1022833 500
#> 6: 2007 2 -1 2009 3.3803539 0.1094793 500
#> 7: 2007 3 -1 2010 4.4218911 0.1294090 250
#> 8: 2007 4 -1 2011 5.3081689 0.1292680 250
#> Ntreated
#> <int>
#> 1: 249
#> 2: 249
#> 3: 249
#> 4: 249
#> 5: 249
#> 6: 249
#> 7: 249
#> 8: 249
We see that, because there are no future-treated cohorts left at event time +5, it is no longer possible to provide an ATT estimate at event time +5 if using the future-treated control group.
2. Avoiding anticipation
The base_event
argument
In DiD designs, the researcher must choose a base pre-period against
which differences are measured. A natural choice is
base_event = -1
, which means to use the time period just
before treatment onset at event time 0.
As discussed by Callaway & Sant’Anna (2021), one possibility is
that the treatment group begins responding to treatment prior to
treatment onset, which is called “anticipation.” If anticipation begins
at period -1, then differences measured relative to
base_event = -1
yield inconsistent DiD estimates.
Fortunately, there is an easy solution: choose a base pre-period from
before anticipation began. If it seems reasonable to assume that the
treatment group could not have anticipated treatment 3 years before
treatment onset, then base_event = -3
should be free of
anticipation.
Example
We can simulate data that is subject to anticipation using the
anticipation
argument in the SimDiD
function:
sim = SimDiD(seed=123, sample_size = 200, anticipation = 2)
simdata = sim$simdata
print(simdata)
#> id year cohort Y
#> <int> <int> <num> <num>
#> 1: 1 2003 Inf 11.238356
#> 2: 1 2004 Inf 11.520443
#> 3: 1 2005 Inf 13.252991
#> 4: 1 2006 Inf 11.940777
#> 5: 1 2007 Inf 13.120440
#> ---
#> 2196: 200 2009 Inf 10.250944
#> 2197: 200 2010 Inf 8.812227
#> 2198: 200 2011 Inf 9.926592
#> 2199: 200 2012 Inf 11.188476
#> 2200: 200 2013 Inf 9.270495
Let’s focus on the average ATT across all cohorts. There are two periods of treatment effects prior to treatment, which we can verify by checking the true ATT from the simulation:
print(sim$true_ATT[cohort=="Average"])
#> cohort event ATTge
#> <char> <num> <num>
#> 1: Average -2 0.500000
#> 2: Average -1 0.500000
#> 3: Average 0 1.503356
#> 4: Average 1 2.503356
#> 5: Average 2 3.252525
#> 6: Average 3 4.252525
#> 7: Average 4 5.000000
#> 8: Average 5 6.000000
#> 9: Average 6 7.000000
We set up the varnames
to prepare for estimation:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
If we estimate DiD using the default argument that
base_event = -1
, the estimate will be biased and
inconsistent:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=3)
print(did$results_average)
#> Key: <EventTime>
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <int> <int>
#> 1: -3 -1 -0.52346600 0.1534229 303 149
#> 2: -2 -1 -0.04556457 0.1380282 303 149
#> 3: -1 -1 0.00000000 0.0000000 303 149
#> 4: 0 -1 0.75946293 0.1495045 303 149
#> 5: 1 -1 1.80044696 0.1392958 303 149
#> 6: 2 -1 2.50749407 0.1718781 202 99
#> 7: 3 -1 3.33212214 0.1917002 152 99
We now set base_event = -3
to avoid the anticipation at
-1 and -2:
did = DiD(inputdata = simdata, varnames = varnames,
base_event = -3, min_event = -3, max_event=3)
print(did$results_average)
#> Key: <EventTime>
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <int> <int>
#> 1: -3 -3 0.0000000 0.0000000 303 149
#> 2: -2 -3 0.4779014 0.1521152 303 149
#> 3: -1 -3 0.5234660 0.1534229 303 149
#> 4: 0 -3 1.2829289 0.1441747 303 149
#> 5: 1 -3 2.3239130 0.1485972 303 149
#> 6: 2 -3 2.9653357 0.1643872 202 99
#> 7: 3 -3 3.8282455 0.1773802 152 99
We see that the estimate is now close to the true ATT.
3. Controlling for Time-varying Covariates
The covariate_names
entry
We sometimes worry that treatment cohorts are selected on time-varying observables, and those time-varying observables also directly affect the outcome. If the growth rate in the observables differs between the treatment and control groups, it creates a violation of the parallel-trends assumption: the treatment and control groups would have experienced different growth profiles in the absence of treatment due to their different observables.
Fortunately, this is easy to fix: Since the confounding variables are
observable, we just need to control for those observables.
DiDforBigData
requires only that the list of variable
names, varnames
, is modified to include a
covariate_names
entry. For example,
varnames$covariate_names = c("X1","X2")
tells it to control
linearly for time-variation in X1 and X2.
Example
There is an option in SimDiD()
to add a couple of
covariates.
sim = SimDiD(sample_size=1000, time_covars=TRUE)
simdata = sim$simdata
print(simdata)
#> id year cohort Y X1 X2
#> <int> <int> <num> <num> <num> <num>
#> 1: 1 2003 2007 11.763806 -5.28713552 -0.546247081
#> 2: 1 2004 2007 14.216536 -3.16836592 0.882383754
#> 3: 1 2005 2007 12.898204 -3.52698674 -0.007446627
#> 4: 1 2006 2007 14.117949 -3.83264390 -0.307454720
#> 5: 1 2007 2007 9.932859 0.04643156 0.153863150
#> ---
#> 10996: 1000 2009 2010 10.096541 -0.96662668 -0.375274801
#> 10997: 1000 2010 2010 6.679059 4.38379275 0.229111354
#> 10998: 1000 2011 2010 8.328832 4.11630763 0.242690466
#> 10999: 1000 2012 2010 9.171007 3.05442888 2.269272560
#> 11000: 1000 2013 2010 9.044345 4.46350435 0.510888022
print(sim$true_ATT[cohort==2007])
#> cohort event ATTge
#> <char> <num> <num>
#> 1: 2007 0 1
#> 2: 2007 1 2
#> 3: 2007 2 3
#> 4: 2007 3 4
#> 5: 2007 4 5
#> 6: 2007 5 6
#> 7: 2007 6 7
There are two covariates, X1 and X2. In particular, X2 differs in growth rates across treatment cohorts, which means that it causes violations of parallel trends if not controlled.
We set up the varnames
to prepare for estimation,
ignoring the covariates for now:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
We verify that DiD gives the wrong answer for cohort 2007:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol
#> <num> <num> <num> <num> <num> <num> <int>
#> 1: 2007 -3 -1 2004 0.1677000 0.1877317 751
#> 2: 2007 -2 -1 2005 0.2901911 0.1778623 751
#> 3: 2007 -1 -1 2006 0.0000000 0.0000000 751
#> 4: 2007 0 -1 2007 1.1509604 0.1816170 751
#> 5: 2007 1 -1 2008 2.8765749 0.1862649 751
#> 6: 2007 2 -1 2009 4.3001933 0.1904476 751
#> 7: 2007 3 -1 2010 5.4489463 0.2012098 501
#> 8: 2007 4 -1 2011 6.8211665 0.1907504 501
#> 9: 2007 5 -1 2012 7.9496252 0.2256687 251
#> Ntreated
#> <int>
#> 1: 249
#> 2: 249
#> 3: 249
#> 4: 249
#> 5: 249
#> 6: 249
#> 7: 249
#> 8: 249
#> 9: 249
To control linearly for X1 and X2, we just need to add them to the
covariate_names
argument of varnames
:
varnames$covariate_names = c("X1","X2")
Now we check DiD with controls for time-variation in X1 and X2:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)
print(did$results_cohort[Cohort==2007])
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol
#> <num> <num> <num> <num> <num> <num> <int>
#> 1: 2007 -3 -1 2004 -0.1225027 0.1074379 751
#> 2: 2007 -2 -1 2005 0.1021934 0.1042450 751
#> 3: 2007 -1 -1 2006 0.0000000 0.0000000 751
#> 4: 2007 0 -1 2007 0.9427569 0.1066192 751
#> 5: 2007 1 -1 2008 2.0692206 0.1117700 751
#> 6: 2007 2 -1 2009 2.8921399 0.1144014 751
#> 7: 2007 3 -1 2010 3.9086404 0.1233646 501
#> 8: 2007 4 -1 2011 5.1380269 0.1275067 501
#> 9: 2007 5 -1 2012 5.9657885 0.1503580 251
#> Ntreated
#> <int>
#> 1: 249
#> 2: 249
#> 3: 249
#> 4: 249
#> 5: 249
#> 6: 249
#> 7: 249
#> 8: 249
#> 9: 249
We see that the bias has been removed thanks to the control variables.
4. Controlling for Time-varying Fixed-effects
The fixedeffect_names
entry
Similar to time-varying covariates, we may be worried that units
belong to discrete categories which are subject to shocks. For example,
if different units \(i\) belong to
different regions, and some regions have higher treatment shares than
others, then region-specific shocks may correlate with treatment and
bias the DiD estimates. We can correct for time-specific shocks to
regions by specifying
varnames$fixedeffect_names = c("region")
.
5. Robust and Clustered Standard Errors
The cluster_names
entry
By default, this package always provides heteroskedasticity-robust standard errors. However, in difference-in-differences applications, it is often the case that treatment is assigned to groups of individuals (e.g., a change in state-wide policy treats all individuals in a state simultaneously). If those groups are also subject to common shocks, this induces correlation in the estimation errors within cluster, and standard errors will tend to be too small.
Fortunately, this is easy to fix: If the groups within which the
estimation errors are correlated are known to the researcher, we just
need to cluster standard errors by group. DiDforBigData
requires only that the list of variable names, varnames
, is
modified to include a cluster_names
entry. For example,
varnames$cluster_names = c("group1","group2")
tells it to
use multi-way clustering in a way that accounts for common shocks to
each of observable groups, “group1” and “group2”.
Note: When estimating a regression that combines multiple treatment
cohorts and/or multiple event times, it is necessary to always cluster
on unit (individual). DiDforBigData
adds this clustering by
default.
Example
There is an option in SimDiD()
using
clusters=TRUE
to group individuals into bins that are
differentially selected for treatment and that also face common shocks
within each bin:
sim = SimDiD(sample_size=1000, clusters = TRUE)
simdata = sim$simdata
print(simdata)
#> id year cohort Y cluster
#> <int> <int> <num> <num> <int>
#> 1: 1 2003 2007 8.266417 9
#> 2: 1 2004 2007 12.657646 9
#> 3: 1 2005 2007 9.373441 9
#> 4: 1 2006 2007 10.851528 9
#> 5: 1 2007 2007 9.306055 9
#> ---
#> 10996: 1000 2009 2010 10.317600 6
#> 10997: 1000 2010 2010 9.623423 6
#> 10998: 1000 2011 2010 11.509660 6
#> 10999: 1000 2012 2010 9.718423 6
#> 11000: 1000 2013 2010 11.042858 6
print(sim$true_ATT[cohort=="Average"])
#> cohort event ATTge
#> <char> <num> <num>
#> 1: Average 0 1.500668
#> 2: Average 1 2.500668
#> 3: Average 2 3.250501
#> 4: Average 3 4.250501
#> 5: Average 4 5.000000
#> 6: Average 5 6.000000
#> 7: Average 6 7.000000
We set up the varnames
to prepare for estimation:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
We check the usual standard errors, which are clustered on unit based
on varnames$id_name
by default:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -1, max_event=3)
print(did$results_average)
#> Key: <EventTime>
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <int> <int>
#> 1: -1 -1 0.000000 0.00000000 1503 749
#> 2: 0 -1 1.654672 0.07813674 1503 749
#> 3: 1 -1 2.710981 0.09333436 1503 749
#> 4: 2 -1 3.319803 0.11998209 1002 499
#> 5: 3 -1 4.597428 0.12376782 752 499
Next, we cluster on the “cluster” variable by adding it to the
varnames
and re-estimating:
varnames$cluster_names = "cluster"
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3)
print(did$results_average)
#> Key: <EventTime>
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <int> <int>
#> 1: -1 -1 0.000000 0.00000000 1503 749
#> 2: 0 -1 1.654672 0.08714482 1503 749
#> 3: 1 -1 2.710981 0.11885828 1503 749
#> 4: 2 -1 3.319803 0.13355642 1002 499
#> 5: 3 -1 4.597428 0.25745379 752 499
6. Parallelization
The parallel_cores
argument
If you have the parallel
package installed, it is
trivial to execute your DiD estimation in parallel by setting the
parallel_cores
argument in the DiD()
command.
For example, DiD(..., parallel_cores = 4)
will utilize 4
cores in parallel. To determine how many cores are available on your
system, just run the command parallel::detectCores()
in R.
(I suggest leaving at least 1 core free to keep your system from
freezing.)
While parallel processing usually does not interact well with the
progress bar, I have written a modified version of R’s parallelization
protocol that correctly updates the progress bar as it completes its
tasks. Thus, if you have the progress
package installed,
you will correctly see a progress bar and predicted completion time
while your DiD
estimation is executing in parallel.
Example
We simulate some data:
sim = SimDiD(seed=123, sample_size = 1000)
simdata = sim$simdata
We set up the varnames
to prepare for estimation:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
We run the estimation not in parallel:
We run the estimation in parallel with 2 processes:
7. Average across Event Times
The Esets
argument
Suppose you wish to average the DiD estimate across a few event
times, with a corresponding standard error for the average across event
times. This is done by providing a list to the Esets
argument in the DiD()
command.
For example, if Esets = list(c(1,2,3))
, then the output
will include the average DiD for event times e=1,2,3
.
Multiple sets of event times can be provided, e.g.,
Esets = list(c(1,2,3), c(1,3))
will provide the average
across e=1,2,3
as well as the average for e=1
and e=3
.
Event set averages are returned in the $results_Esets
argument of the output list. Sample size is not provided, as it is
unclear how to define the sample size when averaging multiple
statistics, each of which has a different sample size.
Example
We simulate some data:
sim = SimDiD(seed=123, sample_size = 1000)
simdata = sim$simdata
We set up the varnames
to prepare for estimation:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
We run the estimation with two event set averages:
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3, Esets = list(c(1,2,3), c(1,3)))
print(did)
#> $results_cohort
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol
#> <num> <num> <num> <num> <num> <num> <int>
#> 1: 2007 -1 -1 2006 0.000000 0.00000000 751
#> 2: 2007 0 -1 2007 1.346483 0.11215961 751
#> 3: 2007 1 -1 2008 2.232383 0.09443787 751
#> 4: 2007 2 -1 2009 3.370599 0.10350708 751
#> 5: 2007 3 -1 2010 4.266872 0.10723955 501
#> 6: 2010 -1 -1 2009 0.000000 0.00000000 501
#> 7: 2010 0 -1 2010 1.506723 0.10838000 501
#> 8: 2010 1 -1 2011 2.660030 0.10325447 501
#> 9: 2010 2 -1 2012 3.637140 0.12739621 251
#> 10: 2010 3 -1 2013 4.398071 0.12737026 251
#> 11: 2012 -1 -1 2011 0.000000 0.00000000 251
#> 12: 2012 0 -1 2012 1.920898 0.13066712 251
#> 13: 2012 1 -1 2013 2.786013 0.12683651 251
#> Ntreated
#> <int>
#> 1: 249
#> 2: 249
#> 3: 249
#> 4: 249
#> 5: 249
#> 6: 250
#> 7: 250
#> 8: 250
#> 9: 250
#> 10: 250
#> 11: 250
#> 12: 250
#> 13: 250
#>
#> $results_average
#> Key: <EventTime>
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <int> <int>
#> 1: -1 -1 0.000000 0.00000000 1503 749
#> 2: 0 -1 1.591695 0.06629046 1503 749
#> 3: 1 -1 2.559912 0.06200891 1503 749
#> 4: 2 -1 3.504136 0.08179983 1002 499
#> 5: 3 -1 4.332603 0.08219426 752 499
#>
#> $results_Esets
#> Eset ATT_Eset ATT_Eset_SE
#> <char> <num> <num>
#> 1: 1,2,3 3.465550 0.05887067
#> 2: 1,3 3.446257 0.06172373