There are only 3 functions in this package:
- DiDge(): This function estimates DiD for a single cohort and a single event time.
- DiD(): This function estimates DiD for all available cohorts and event times.
- SimDiD(): This function simulates data.
We now demonstrate the simplest application of the 3 functions.
Detailed documentation for each of these functions is available from the Reference tab above.
0. Installation
To install the package from CRAN:
install.packages("DiDforBigData")
To install the package from Github:
devtools::install_github("setzler/DiDforBigData")
To use the package after it is installed:
library(DiDforBigData)
It is recommended to also make sure these optional packages have been installed:
1. Prepare Data
I provide a simple data simulator as follows:
sim = SimDiD(sample_size = 400, seed=123)
# true ATTs in the simulation
print(sim$true_ATT)
#> cohort event ATTge
#> <char> <num> <num>
#> 1: 2007 0 1.000000
#> 2: 2007 1 2.000000
#> 3: 2007 2 3.000000
#> 4: 2007 3 4.000000
#> 5: 2007 4 5.000000
#> 6: 2007 5 6.000000
#> 7: 2007 6 7.000000
#> 8: 2010 0 1.500000
#> 9: 2010 1 2.500000
#> 10: 2010 2 3.500000
#> 11: 2010 3 4.500000
#> 12: 2012 0 2.000000
#> 13: 2012 1 3.000000
#> 14: Average 0 1.501672
#> 15: Average 1 2.501672
#> 16: Average 2 3.251256
#> 17: Average 3 4.251256
#> 18: Average 4 5.000000
#> 19: Average 5 6.000000
#> 20: Average 6 7.000000
#> cohort event ATTge
# simulated data
simdata = sim$simdata
print(simdata)
#> id year cohort Y
#> <int> <int> <num> <num>
#> 1: 1 2003 2010 8.773933
#> 2: 1 2004 2010 9.846116
#> 3: 1 2005 2010 9.963274
#> 4: 1 2006 2010 9.997385
#> 5: 1 2007 2010 10.060080
#> ---
#> 4396: 400 2009 2007 8.035127
#> 4397: 400 2010 2007 14.438798
#> 4398: 400 2011 2007 11.973035
#> 4399: 400 2012 2007 13.033367
#> 4400: 400 2013 2007 13.552533
Your real data needs to have this “long” format, i.e., there need to be variables for the individual identifier (e.g. id), the time variable (e.g. year), the cohort at which treatment begins (e.g. cohort), and the outcome variable (e.g. Y). No other variables are required. These variables can have any names you prefer.
The never-treated cohort should be coded as infinity (cohort = Inf). If the cohort value is missing (cohort = NA), then the cohort will be automatically re-coded as infinity.
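As a minimal sketch of this data preparation, the base R snippet below builds a small toy panel, codes missing cohorts as Inf explicitly, and checks for duplicated rows. The toy values are invented for illustration, and the one-row-per-(id, year) check is an assumption about what a well-formed long panel looks like, not a documented requirement of the package. Since the package re-codes NA cohorts automatically, the explicit recode is optional and only makes the coding visible.

```r
# Toy long-format panel; column names mirror the simulated data (id, year, cohort, Y).
toy <- data.frame(
  id     = rep(1:3, each = 2),
  year   = rep(c(2009, 2010), times = 3),
  cohort = rep(c(2010, NA, Inf), each = 2),  # unit 2 has a missing cohort
  Y      = c(1.0, 2.5, 0.8, 1.1, 1.2, 1.3)
)

# Explicitly code missing cohorts as never-treated (Inf).
toy$cohort[is.na(toy$cohort)] <- Inf

# Assumption: a long panel has at most one row per (id, year) pair.
has_duplicates <- any(duplicated(toy[, c("id", "year")]))

print(toy)
print(has_duplicates)
```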
Before going to the estimation, we need to prepare a list of the variable names:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
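Because the variables can have any names, it may help to confirm that every entry in the list matches a column of the data before estimating. The sketch below uses a hypothetical helper, missing_columns(), which is not part of DiDforBigData, together with invented column names (firm, quarter, first_treated, sales).

```r
# Hypothetical helper (not part of DiDforBigData):
# returns the varnames entries that are not columns of `data`.
missing_columns <- function(data, varnames) {
  setdiff(unlist(varnames, use.names = FALSE), names(data))
}

# Toy data with custom column names, chosen only for illustration.
firm_panel <- data.frame(firm = c(1, 1), quarter = c(1, 2),
                         first_treated = c(2, 2), sales = c(10, 12))

firm_varnames <- list(
  time_name    = "quarter",
  outcome_name = "sales",
  cohort_name  = "first_treated",
  id_name      = "firm"
)

# An empty character vector means every name was found in the data.
print(missing_columns(firm_panel, firm_varnames))
```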
2. Estimate DiD for a Single Cohort
We choose an event time (+3) and a cohort of treated units (2010), then estimate DiD:
did_2010 = DiDge(inputdata = simdata, varnames = varnames,
cohort_time = 2010, event_postperiod = 3)
print(did_2010)
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <num> <num> <int> <int>
#> 1: 2010 3 -1 2013 4.629839 0.1962355 101 100
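Comparing this estimate (4.63, with a standard error of 0.20) to the true ATT of 4.5 for cohort 2010 at event time 3, we see that the estimation performed well.
Note that it used event time -1 as the base period by default (see the BaseEvent column). This is easy to change.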
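To build intuition for what a single-cohort, single-event-time estimate measures, here is a sketch of the textbook two-group, two-period DiD comparison in base R. It is an illustration on invented toy numbers, not the package's internal implementation: each unit's outcome change from the base period to the post period is computed, and the mean change among control units is subtracted from the mean change among treated units.

```r
# Toy wide data: one row per unit, with the outcome at the base period
# (event time -1) and at the post period (event time +3).
toy <- data.frame(
  treated = c(TRUE, TRUE, FALSE, FALSE),
  Y_base  = c(10, 12, 9, 11),
  Y_post  = c(15, 16, 10, 12)
)

# Within-unit change from the base period to the post period.
toy$change <- toy$Y_post - toy$Y_base

# DiD = mean change among treated units minus mean change among control units.
did_estimate <- mean(toy$change[toy$treated]) - mean(toy$change[!toy$treated])
print(did_estimate)
```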
3. Estimate DiD for All Cohorts and Event Times
Suppose we want to estimate the ATT at each event time from -3 to +5. We can do so as follows:
did_all = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event = 5)
The output of DiD() is a list. One object in the list is results_average, which includes the average ATT across cohorts:
print(did_all$results_average)
#> Key: <EventTime>
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <int> <int>
#> 1: -3 -1 -0.03472821 0.10802340 603 299
#> 2: -2 -1 -0.06416254 0.09847063 603 299
#> 3: -1 -1 0.00000000 0.00000000 603 299
#> 4: 0 -1 1.44852075 0.10387376 603 299
#> 5: 1 -1 2.67299583 0.09964407 603 299
#> 6: 2 -1 3.17946138 0.12477922 402 199
#> 7: 3 -1 4.27349270 0.12596253 302 199
#> 8: 4 -1 4.98423853 0.17470913 201 99
#> 9: 5 -1 5.66743134 0.21029573 101 99
The other output from DiD() is results_cohort, which includes all combinations of event times and cohorts. It is too large to print here, so let’s just print the results for event times 1 and 2:
print(did_all$results_cohort[EventTime==1 | EventTime==2])
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol Ntreated
#> <num> <num> <num> <num> <num> <num> <int> <int>
#> 1: 2007 1 -1 2008 2.263430 0.1498733 301 99
#> 2: 2007 2 -1 2009 3.083096 0.1666782 301 99
#> 3: 2010 1 -1 2011 2.474058 0.1733037 201 100
#> 4: 2010 2 -1 2012 3.274863 0.1863323 101 100
#> 5: 2012 1 -1 2013 3.277404 0.2117916 101 100
Note: the simulated data ends in 2013, so event time 2 is not available for treatment cohort 2012.
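In the printed output, the average estimate at event time 1 appears to coincide with the cohort-specific estimates weighted by their number of treated units. The sketch below checks this using only numbers copied from the results above. It is a reading of the output, not a statement about how the package computes the average internally.

```r
# Cohort-level estimates at event time 1, copied from the printed
# results_cohort output (cohorts 2007, 2010, 2012).
attge_e1    <- c(2.263430, 2.474058, 3.277404)
ntreated_e1 <- c(99, 100, 100)

# Weight each cohort's estimate by its number of treated units.
atte_e1 <- weighted.mean(attge_e1, w = ntreated_e1)

# Compare with the ATTe reported in results_average at event time 1.
print(atte_e1)
print(abs(atte_e1 - 2.67299583) < 1e-5)
```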
To take an average across multiple event times, use the Esets argument. It accepts a list, in which each item is a vector of event times over which to average:
did_all = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event = 5,
Esets = list(c(1,2), c(1,2,3)))
print(did_all$results_Esets)
#> Eset ATT_Eset ATT_Eset_SE
#> <char> <num> <num>
#> 1: 1,2 2.926229 0.08930124
#> 2: 1,2,3 3.375317 0.08822397
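The point estimates in this output appear to match simple unweighted means of the event-time averages printed earlier in results_average. The sketch below checks that using only numbers copied from the output. It says nothing about the standard errors, which cannot be recovered this way because the event-time estimates are generally correlated.

```r
# Average ATT estimates by event time, copied from the results_average output.
atte <- c(`1` = 2.67299583, `2` = 3.17946138, `3` = 4.27349270)

# Unweighted means over each event set.
att_12  <- mean(atte[c("1", "2")])
att_123 <- mean(atte[c("1", "2", "3")])

# Compare with the ATT_Eset values reported in results_Esets.
print(abs(att_12 - 2.926229) < 1e-6)
print(abs(att_123 - 3.375317) < 1e-6)
```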