Skip to contents

Before getting started, make sure that you have installed the package.

To install the package from CRAN:

install.packages("DiDforBigData")

To install the package from Github:

devtools::install_github("setzler/DiDforBigData")

If the package is installed, make sure that you load it:

1. Choosing the Control Group

The control_group argument

Depending on empirical context, the parallel-trends assumption may only hold for one of these possible control groups:

  • "never-treated": These are the control units that are never observed receiving the treatment within the sampling time frame.
  • "future-treated": These are the control units that are observed receiving the treatment within the sampling time frame.
  • "all": These include both the “never-treated” and “future-treated” groups.

DiDforBigData makes it easy to choose among these possibilities by setting the control_group argument of the DiD function.

Example

We start by simulating some data to work with:

sim = SimDiD(seed=123, sample_size = 1000)
simdata = sim$simdata
print(simdata)
#>          id year cohort         Y
#>     1:    1 2003   2012  9.556685
#>     2:    1 2004   2012 10.838554
#>     3:    1 2005   2012 10.639781
#>     4:    1 2006   2012 10.768935
#>     5:    1 2007   2012 10.613825
#>    ---                           
#> 10996: 1000 2009   2010  6.640274
#> 10997: 1000 2010   2010 10.571372
#> 10998: 1000 2011   2010 11.282597
#> 10999: 1000 2012   2010 13.403816
#> 11000: 1000 2013   2010 12.542082

We can check what the true ATT is in the simulated data. In particular, let’s focus on the ATT for Cohort 2007:

print(sim$true_ATT[cohort==2007])
#>    cohort event ATTge
#> 1:   2007     0     1
#> 2:   2007     1     2
#> 3:   2007     2     3
#> 4:   2007     3     4
#> 5:   2007     4     5
#> 6:   2007     5     6
#> 7:   2007     6     7

We also set up the variable names as a list so that DiDforBigData knows which variable is the outcome, which variable is the treatment cohort, etc.:

varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"

Now, we are ready to estimate DiD. We focus on the ATT for the 2007 cohort at event times -3,…,5. Initially, the control group is not set, so the function will use the default option control_group = "all":

did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)

print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime     ATTge   ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 0.2212255 0.10491102      751
#> 2:   2007        -2        -1         2005 0.1274586 0.10260112      751
#> 3:   2007        -1        -1         2006 0.0000000 0.00000000      751
#> 4:   2007         0        -1         2007 1.3464834 0.11215961      751
#> 5:   2007         1        -1         2008 2.2323825 0.09443787      751
#> 6:   2007         2        -1         2009 3.3705987 0.10350708      751
#> 7:   2007         3        -1         2010 4.2668717 0.10723955      501
#> 8:   2007         4        -1         2011 5.3241860 0.10738958      501
#> 9:   2007         5        -1         2012 6.1964127 0.13611288      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249

We now consider changing the estimation to only use the “never-treated” control group:

did = DiD(inputdata = simdata, varnames = varnames, 
          control_group = "never-treated", min_event = -3, max_event=5)

print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime       ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004  0.10444344 0.1269592      251
#> 2:   2007        -2        -1         2005 -0.05942059 0.1298412      251
#> 3:   2007        -1        -1         2006  0.00000000 0.0000000      251
#> 4:   2007         0        -1         2007  1.27025637 0.1322487      251
#> 5:   2007         1        -1         2008  2.15386918 0.1171451      251
#> 6:   2007         2        -1         2009  3.35116597 0.1288344      251
#> 7:   2007         3        -1         2010  4.11246994 0.1195819      251
#> 8:   2007         4        -1         2011  5.34013939 0.1225179      251
#> 9:   2007         5        -1         2012  6.19641269 0.1361129      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249

Note that the sample size for the control group, Ncontrol, is now substantially lower. Finally, we consider changing this to only use the “future-treated” group:

did = DiD(inputdata = simdata, varnames = varnames, 
          control_group = "future-treated", min_event = -3, max_event=5)

print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime     ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 0.2798501 0.1117079      500
#> 2:   2007        -2        -1         2005 0.2212720 0.1081106      500
#> 3:   2007        -1        -1         2006 0.0000000 0.0000000      500
#> 4:   2007         0        -1         2007 1.3847494 0.1183855      500
#> 5:   2007         1        -1         2008 2.2717963 0.1022833      500
#> 6:   2007         2        -1         2009 3.3803539 0.1094793      500
#> 7:   2007         3        -1         2010 4.4218911 0.1294090      250
#> 8:   2007         4        -1         2011 5.3081689 0.1292680      250
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249

We see that, because there are no future-treated cohorts left at event time +5, it is no longer possible to provide an ATT estimate at event time +5 if using the future-treated control group.

2. Avoiding anticipation

The base_event argument

In DiD designs, the researcher must choose a base pre-period against which differences are measured. A natural choice is base_event = -1, which means to use the time period just before treatment onset at event time 0.

As discussed by Callaway & Sant’Anna (2021), one possibility is that the treatment group begins responding to treatment prior to treatment onset, which is called “anticipation.” If anticipation begins at period -1, then differences measured relative to base_event = -1 yield inconsistent DiD estimates.

Fortunately, there is an easy solution: choose a base pre-period from before anticipation began. If it seems reasonable to assume that the treatment group could not have anticipated treatment 3 years before treatment onset, then base_event = -3 should be free of anticipation.

Example

We can simulate data that is subject to anticipation using the anticipation argument in the SimDiD function:

sim = SimDiD(seed=123, sample_size = 200, anticipation = 2)
simdata = sim$simdata
print(simdata)
#>        id year cohort         Y
#>    1:   1 2003    Inf 11.238356
#>    2:   1 2004    Inf 11.520443
#>    3:   1 2005    Inf 13.252991
#>    4:   1 2006    Inf 11.940777
#>    5:   1 2007    Inf 13.120440
#>   ---                          
#> 2196: 200 2009    Inf 10.250944
#> 2197: 200 2010    Inf  8.812227
#> 2198: 200 2011    Inf  9.926592
#> 2199: 200 2012    Inf 11.188476
#> 2200: 200 2013    Inf  9.270495

Let’s focus on the average ATT across all cohorts. There are two periods of treatment effects prior to treatment, which we can verify by checking the true ATT from the simulation:

print(sim$true_ATT[cohort=="Average"])
#>     cohort event    ATTge
#> 1: Average    -2 0.500000
#> 2: Average    -1 0.500000
#> 3: Average     0 1.503356
#> 4: Average     1 2.503356
#> 5: Average     2 3.252525
#> 6: Average     3 4.252525
#> 7: Average     4 5.000000
#> 8: Average     5 6.000000
#> 9: Average     6 7.000000

We set up the varnames to prepare for estimation:

varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"

If we estimate DiD using the default argument that base_event = -1, the estimate will be biased and inconsistent:

did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=3)

print(did$results_average)
#>    EventTime BaseEvent        ATTe   ATTe_SE Ncontrol Ntreated
#> 1:        -3        -1 -0.52346600 0.1534229      303      149
#> 2:        -2        -1 -0.04556457 0.1380282      303      149
#> 3:        -1        -1  0.00000000 0.0000000      303      149
#> 4:         0        -1  0.75946293 0.1495045      303      149
#> 5:         1        -1  1.80044696 0.1392958      303      149
#> 6:         2        -1  2.50749407 0.1718781      202       99
#> 7:         3        -1  3.33212214 0.1917002      152       99

We now set base_event = -3 to avoid the anticipation at -1 and -2:

did = DiD(inputdata = simdata, varnames = varnames, 
          base_event = -3, min_event = -3, max_event=3)

print(did$results_average) 
#>    EventTime BaseEvent      ATTe   ATTe_SE Ncontrol Ntreated
#> 1:        -3        -3 0.0000000 0.0000000      303      149
#> 2:        -2        -3 0.4779014 0.1521152      303      149
#> 3:        -1        -3 0.5234660 0.1534229      303      149
#> 4:         0        -3 1.2829289 0.1441747      303      149
#> 5:         1        -3 2.3239130 0.1485972      303      149
#> 6:         2        -3 2.9653357 0.1643872      202       99
#> 7:         3        -3 3.8282455 0.1773802      152       99

We see that the estimate is now close to the true ATT.

3. Controlling for Time-varying Covariates

The covariate_names entry

We sometimes worry that treatment cohorts are selected on time-varying observables, and those time-varying observables also directly affect the outcome. If the growth rate in the observables differs between the treatment and control groups, it creates a violation of the parallel-trends assumption: the treatment and control groups would have experienced different growth profiles in the absence of treatment due to their different observables.

Fortunately, this is easy to fix: Since the confounding variables are observable, we just need to control for those observables. DiDforBigData requires only that the list of variable names, varnames, is modified to include a covariate_names entry. For example, varnames$covariate_names = c("X1","X2") tells it to control linearly for time-variation in X1 and X2.

Example

There is an option in SimDiD() to add a couple of covariates.


sim = SimDiD(sample_size=1000, time_covars=TRUE)
simdata = sim$simdata
print(simdata)
#>          id year cohort         Y          X1           X2
#>     1:    1 2003   2007 11.763806 -5.28713552 -0.546247081
#>     2:    1 2004   2007 14.216536 -3.16836592  0.882383754
#>     3:    1 2005   2007 12.898204 -3.52698674 -0.007446627
#>     4:    1 2006   2007 14.117949 -3.83264390 -0.307454720
#>     5:    1 2007   2007  9.932859  0.04643156  0.153863150
#>    ---                                                    
#> 10996: 1000 2009   2010 10.096541 -0.96662668 -0.375274801
#> 10997: 1000 2010   2010  6.679059  4.38379275  0.229111354
#> 10998: 1000 2011   2010  8.328832  4.11630763  0.242690466
#> 10999: 1000 2012   2010  9.171007  3.05442888  2.269272560
#> 11000: 1000 2013   2010  9.044345  4.46350435  0.510888022

print(sim$true_ATT[cohort==2007])
#>    cohort event ATTge
#> 1:   2007     0     1
#> 2:   2007     1     2
#> 3:   2007     2     3
#> 4:   2007     3     4
#> 5:   2007     4     5
#> 6:   2007     5     6
#> 7:   2007     6     7

There are two covariates, X1 and X2. In particular, X2 differs in growth rates across treatment cohorts, which means that it causes violations of parallel trends if not controlled.

We set up the varnames to prepare for estimation, ignoring the covariates for now:

varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"

We verify that DiD gives the wrong answer for cohort 2007:

did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)

print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime     ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 0.1677000 0.1877317      751
#> 2:   2007        -2        -1         2005 0.2901911 0.1778623      751
#> 3:   2007        -1        -1         2006 0.0000000 0.0000000      751
#> 4:   2007         0        -1         2007 1.1509604 0.1816170      751
#> 5:   2007         1        -1         2008 2.8765749 0.1862649      751
#> 6:   2007         2        -1         2009 4.3001933 0.1904476      751
#> 7:   2007         3        -1         2010 5.4489463 0.2012098      501
#> 8:   2007         4        -1         2011 6.8211665 0.1907504      501
#> 9:   2007         5        -1         2012 7.9496252 0.2256687      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249

To control linearly for X1 and X2, we just need to add them to the covariate_names argument of varnames:

varnames$covariate_names = c("X1","X2")

Now we check DiD with controls for time-variation in X1 and X2:

did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5)

print(did$results_cohort[Cohort==2007])
#>    Cohort EventTime BaseEvent CalendarTime      ATTge  ATTge_SE Ncontrol
#> 1:   2007        -3        -1         2004 -0.1225027 0.1074379      751
#> 2:   2007        -2        -1         2005  0.1021934 0.1042450      751
#> 3:   2007        -1        -1         2006  0.0000000 0.0000000      751
#> 4:   2007         0        -1         2007  0.9427569 0.1066192      751
#> 5:   2007         1        -1         2008  2.0692206 0.1117700      751
#> 6:   2007         2        -1         2009  2.8921399 0.1144014      751
#> 7:   2007         3        -1         2010  3.9086404 0.1233646      501
#> 8:   2007         4        -1         2011  5.1380269 0.1275067      501
#> 9:   2007         5        -1         2012  5.9657885 0.1503580      251
#>    Ntreated
#> 1:      249
#> 2:      249
#> 3:      249
#> 4:      249
#> 5:      249
#> 6:      249
#> 7:      249
#> 8:      249
#> 9:      249

We see that the bias has been removed thanks to the control variables.

4. Controlling for Time-varying Fixed-effects

The fixedeffect_names entry

Similar to time-varying covariates, we may be worried that units belong to discrete categories which are subject to shocks. For example, if different units \(i\) belong to different regions, and some regions have higher treatment shares than others, then region-specific shocks may correlate with treatment and bias the DiD estimates. We can correct for time-specific shocks to regions by specifying varnames$fixedeffect_names = c("region").

5. Robust and Clustered Standard Errors

The cluster_names entry

By default, this package always provides heteroskedasticity-robust standard errors. However, in difference-in-differences applications, it is often the case that treatment is assigned to groups of individuals (e.g., a change in state-wide policy treats all individuals in a state simultaneously). If those groups are also subject to common shocks, this induces correlation in the estimation errors within cluster, and standard errors will tend to be too small.

Fortunately, this is easy to fix: If the groups within which the estimation errors are correlated are known to the researcher, we just need to cluster standard errors by group. DiDforBigData requires only that the list of variable names, varnames, is modified to include a cluster_names entry. For example, varnames$cluster_names = c("group1","group2") tells it to use multi-way clustering in a way that accounts for common shocks to each of observable groups, “group1” and “group2”.

Note: When estimating a regression that combines multiple treatment cohorts and/or multiple event times, it is necessary to always cluster on unit (individual). DiDforBigData adds this clustering by default.

Example

There is an option in SimDiD() using clusters=TRUE to group individuals into bins that are differentially selected for treatment and that also face common shocks within each bin:


sim = SimDiD(sample_size=1000, clusters = TRUE)
simdata = sim$simdata
print(simdata)
#>          id year cohort         Y cluster
#>     1:    1 2003   2007  8.266417       9
#>     2:    1 2004   2007 12.657646       9
#>     3:    1 2005   2007  9.373441       9
#>     4:    1 2006   2007 10.851528       9
#>     5:    1 2007   2007  9.306055       9
#>    ---                                   
#> 10996: 1000 2009   2010 10.317600       6
#> 10997: 1000 2010   2010  9.623423       6
#> 10998: 1000 2011   2010 11.509660       6
#> 10999: 1000 2012   2010  9.718423       6
#> 11000: 1000 2013   2010 11.042858       6

print(sim$true_ATT[cohort=="Average"])
#>     cohort event    ATTge
#> 1: Average     0 1.500668
#> 2: Average     1 2.500668
#> 3: Average     2 3.250501
#> 4: Average     3 4.250501
#> 5: Average     4 5.000000
#> 6: Average     5 6.000000
#> 7: Average     6 7.000000

We set up the varnames to prepare for estimation:

varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"

We check the usual standard errors, which are clustered on unit based on varnames$id_name by default:

did = DiD(inputdata = simdata, varnames = varnames, min_event = -1, max_event=3)

print(did$results_average)
#>    EventTime BaseEvent     ATTe    ATTe_SE Ncontrol Ntreated
#> 1:        -1        -1 0.000000 0.00000000     1503      749
#> 2:         0        -1 1.654672 0.07813674     1503      749
#> 3:         1        -1 2.710981 0.09333436     1503      749
#> 4:         2        -1 3.319803 0.11998209     1002      499
#> 5:         3        -1 4.597428 0.12376782      752      499

Next, we cluster on the “cluster” variable by adding it to the varnames and re-estimating:


varnames$cluster_names = "cluster" 

did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3)

print(did$results_average)
#>    EventTime BaseEvent     ATTe    ATTe_SE Ncontrol Ntreated
#> 1:        -1        -1 0.000000 0.00000000     1503      749
#> 2:         0        -1 1.654672 0.08714482     1503      749
#> 3:         1        -1 2.710981 0.11885828     1503      749
#> 4:         2        -1 3.319803 0.13355642     1002      499
#> 5:         3        -1 4.597428 0.25745379      752      499

6. Parallelization

The parallel_cores argument

If you have the parallel package installed, it is trivial to execute your DiD estimation in parallel by setting the parallel_cores argument in the DiD() command. For example, DiD(..., parallel_cores = 4) will utilize 4 cores in parallel. To determine how many cores are available on your system, just run the command parallel::detectCores() in R. (I suggest leaving at least 1 core free to keep your system from freezing.)

While parallel processing usually does not interact well with the progress bar, I have written a modified version of R’s parallelization protocol that correctly updates the progress bar as it completes its tasks. Thus, if you have the progress package installed, you will correctly see a progress bar and predicted completion time while your DiD estimation is executing in parallel.

Example

We simulate some data:

sim = SimDiD(seed=123, sample_size = 1000)
simdata = sim$simdata

We set up the varnames to prepare for estimation:

varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"

We run the estimation not in parallel:

did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3)

We run the estimation in parallel with 2 processes:

did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3, parallel_cores = 2)

7. Average across Event Times

The Esets argument

Suppose you wish to average the DiD estimate across a few event times, with a corresponding standard error for the average across event times. This is done by providing a list to the Esets argument in the DiD() command.

For example, if Esets = list(c(1,2,3)), then the output will include the average DiD for event times e=1,2,3. Multiple sets of event times can be provided, e.g., Esets = list(c(1,2,3), c(1,3)) will provide the average across e=1,2,3 as well as the average for e=1 and e=3.

Event set averages are returned in the $results_Esets argument of the output list. Sample size is not provided, as it is unclear how to define the sample size when averaging multiple statistics, each of which has a different sample size.

Example

We simulate some data:

sim = SimDiD(seed=123, sample_size = 1000)
simdata = sim$simdata

We set up the varnames to prepare for estimation:

varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"

We run the estimation with two event set averages:

did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3, Esets = list(c(1,2,3), c(1,3)))

print(did)
#> $results_cohort
#>     Cohort EventTime BaseEvent CalendarTime    ATTge   ATTge_SE Ncontrol
#>  1:   2007        -1        -1         2006 0.000000 0.00000000      751
#>  2:   2007         0        -1         2007 1.346483 0.11215961      751
#>  3:   2007         1        -1         2008 2.232383 0.09443787      751
#>  4:   2007         2        -1         2009 3.370599 0.10350708      751
#>  5:   2007         3        -1         2010 4.266872 0.10723955      501
#>  6:   2010        -1        -1         2009 0.000000 0.00000000      501
#>  7:   2010         0        -1         2010 1.506723 0.10838000      501
#>  8:   2010         1        -1         2011 2.660030 0.10325447      501
#>  9:   2010         2        -1         2012 3.637140 0.12739621      251
#> 10:   2010         3        -1         2013 4.398071 0.12737026      251
#> 11:   2012        -1        -1         2011 0.000000 0.00000000      251
#> 12:   2012         0        -1         2012 1.920898 0.13066712      251
#> 13:   2012         1        -1         2013 2.786013 0.12683651      251
#>     Ntreated
#>  1:      249
#>  2:      249
#>  3:      249
#>  4:      249
#>  5:      249
#>  6:      250
#>  7:      250
#>  8:      250
#>  9:      250
#> 10:      250
#> 11:      250
#> 12:      250
#> 13:      250
#> 
#> $results_average
#>    EventTime BaseEvent     ATTe    ATTe_SE Ncontrol Ntreated
#> 1:        -1        -1 0.000000 0.00000000     1503      749
#> 2:         0        -1 1.591695 0.06629046     1503      749
#> 3:         1        -1 2.559912 0.06200891     1503      749
#> 4:         2        -1 3.504136 0.08179983     1002      499
#> 5:         3        -1 4.332603 0.08219426      752      499
#> 
#> $results_Esets
#>     Eset ATT_Eset ATT_Eset_SE
#> 1: 1,2,3 3.340838  0.05596817
#> 2:   1,3 3.251005  0.05801106