grantmcdermott / etwfe Goto Github PK

View Code? Open in Web Editor NEW

50.0 50.0 11.0 6.39 MB

Extended two-way fixed effects

Home Page: https://grantmcdermott.com/etwfe/

License: Other

R 100.00%

etwfe's People

Contributors

Stargazers

Watchers

Forkers

s3alfisc jameslairdsmith yfnian fhollenbach shizelong1985 rv990002 frederickluser tyleransom vincentarelbundock joepvanderplas ih4al

etwfe's Issues

Two errors: (1) "Warning message: The demeaning algorithm did not converge, the results are not reliable. (tol: 1e-06, iter: 2036)" and (2) "Error: Class of the `.Dtreat` variable is class is not supported."

Hi, this is a really cool package, and I'm trying to make it work for my dataset. Please note that I'm a relative beginner in R.

Brief description of my data:

I have a dataset with ~7000 neighbourhoods in the Netherlands, two periods (2012 and 2021), and a treatment that happens (or does not happen) in a neighbourhood at some point in between 2012 and 2021. The treatment variable thus takes the values 1 and 0.
All relevant variables are numeric variables, or have been converted to numeric variables.
I do not know in what year the treatment happens per neighbourhood, so I have created a first.treat variable that is the same for all observations.

Brief description of my problems

In the code below, I get two error messages:
1. In step 4, R tells me: "Warning message: The demeaning algorithm did not converge, the results are not reliable. (tol: 1e-06, iter: 2036)". Furthermore, all the estimates shown are NaN; and
2. In step 5, R tells me: "Error: Class of the .Dtreat variable is class is not supported."
Finally, I have one more question: is it also possible to use the staggered treatment with, in some cases, repeated treatments? To illustrate: suppose I have the relevant data for 2010, 2012, 2017 and 2021, but some neighbourhoods get treated multiple times (e.g. in between 2010 and 2012, but again in 2012 and 2017). Do I then just create a separate first.treat value for those groups, or is there another way around this?

My code:

First, I impute some values for NAs of controls

suppressPackageStartupMessages(library(mice))
missing_vars <- c("age", "p_n_w_al", "p_eenp_hh", "p_ink_hi", "p_00_14_jr", "g_woz")
df_missing <- tk_cbs_test_12_21_80_t12_hosp[, missing_vars]
mids_obj <- mice(df_missing)
df_imputed <- complete(mids_obj)
tk_cbs_test_12_21_80_t12_hosp_imp <- cbind(tk_cbs_test_12_21_80_t12_hosp[, -which(names(tk_cbs_test_12_21_80_t12_hosp) %in% missing_vars)], df_imputed)

Second, create first.treat variable that is the same for all treated neighbourhoods

tk_cbs_test_12_21_80_t12_hosp_imp$first.treat <- ifelse(tk_cbs_test_12_21_80_t12_hosp_imp$evertr == 1, 2015, 0)

Third, convert all controls to numeric variables where this was not yet the case

tk_cbs_test_12_21_80_t12_hosp_imp$p_eenp_hh <- as.numeric(tk_cbs_test_12_21_80_t12_hosp_imp$p_eenp_hh)
tk_cbs_test_12_21_80_t12_hosp_imp$p_00_14_jr <- as.numeric(tk_cbs_test_12_21_80_t12_hosp_imp$p_00_14_jr)

Fourth, specify the model

twoway_FE_12_21_80_co =
  etwfe(
    fml = p_prrp ~ treated + urban + age + p_n_w_al + p_eenp_hh + p_ink_hi + p_00_14_jr + g_woz,
    tvar = jaar.x,
    gvar = first.treat,
    ivar = buurt2021,
    data = tk_cbs_test_12_21_80_t12_hosp_imp
  )


# Have a look at the results
twoway_FE_12_21_80_co

Fifth, recover the Average Treatment Effect (ATE)

emfx(twoway_FE_12_21_80_co)

I hope you can help me out! Regardless of the answer, many thanks for this cool package.

Query RE Heterogeneous treatment effects

Dear Grant,

Thank you very much for creating a great package. I have a query regarding the behaviour of the xvar argument for heterogeneous treatment effects:

In the case of a categorical xvar with more than two levels, I noticed what might be some unexpected behaviour when the variables are demeaned. The demeaned dataset created by etwfe() seems to have a number of variables with NA column names that don't appear to enter the estimation. As an example, I have augmented the quickstart example on the package homepage to estimate heterogeneous treatment effects across quartiles of lpop:

library(data.table)
data("mpdta", package = "did")
setDT(mpdta)
# Create quartiles of lpop
mpdta[, lpop_quartile := cut(lpop,
                        breaks = quantile(lpop, probs = 0:4/4),
                        labels = 1:4, include.lowest=TRUE)]

mod =
  etwfe(
    fml  = lemp ~ 0, 
    tvar = year,        
    gvar = first.treat, 
    data = mpdta,   
    xvar = lpop_quartile,
    vcov = ~countyreal  
  )

# etwfe_dat() is etwfe with the final line set to return(data) instead of return(est)  
mod_dat = etwfe_dat(
  fml  = lemp ~ 0, 
  tvar = year,        
  gvar = first.treat, 
  data = mpdta,   
  xvar = lpop_quartile,
  vcov = ~countyreal  
)

> colnames(mod_dat)
 [1] "year"     "countyreal"  "lpop"   "lemp"     "first.treat"      "treat"           
 [7] "lpop_quartile"   ".Dtreat"   ".Dtreated_cohort" "lpop_quartile_dm" NA   NA

> head(mod_dat)
  year countyreal     lpop     lemp first.treat treat lpop_quartile .Dtreat .Dtreated_cohort lpop_quartile_dm         NA         NA
1 2003       8001 5.896761 8.461469        2007     1             4   FALSE                1       -0.2356021 -0.3089005  0.7277487
2 2004       8001 5.896761 8.336870        2007     1             4   FALSE                1       -0.2356021 -0.3089005  0.7277487
3 2005       8001 5.896761 8.340217        2007     1             4   FALSE                1       -0.2356021 -0.3089005  0.7277487
4 2006       8001 5.896761 8.378161        2007     1             4   FALSE                1       -0.2356021 -0.3089005  0.7277487
5 2007       8001 5.896761 8.487352        2007     1             4    TRUE                1       -0.2356021 -0.3089005  0.7277487
6 2003       8019 2.232377 4.997212        2007     1             1   FALSE                1       -0.2356021 -0.3089005 -0.2722513

The two NA columns appear to be other levels of lpop_quartile after demeaning. Perhaps I am misunderstanding the process here; I wonder would you be able to confirm if this is the expected behaviour of the function?

Also, I notice that the process for demeaning the `xvar' is relative to time rather than the usual group demeaning that appears in the case of covariates in Wooldridge (2021):

etwfe/R/etwfe.R

Lines 334 to 341 in 55a032a

    
           if (!is.null(xvar)) { 
        
             data$.Dtreated_cohort = ifelse(data[[gvar]] != gref & !is.na(data[[gvar]]), 1, 0) # generate a treatment-dummy 
        
             dm_fml = stats::reformulate(c(tvar), response = xvar) 
        
             ctrls_dm_df = fixest::demean(dm_fml, data = data, weights = data$.Dtreated_cohort, as.matrix = FALSE) # weights: only use the treated cohorts (units) to demean 
        
             ctrls_dm_df = stats::setNames(ctrls_dm_df, paste0(xvar, "_dm")) # give a name 
        
             data = cbind(data, ctrls_dm_df) 
        
           }

I wonder if you could also confirm that this is as expected? Could you provide some intuition as to why a different approach is required here if this is the case?

Many thanks again for providing such an excellent package, and I very much appreciate your help on this.

Anticipation

Dear Grant,

I hope you are doing fine.
I wondered whether you think it would be worth incorporating anticipation effects for the estimator (this would be kind of similar to what Callaway and Sant'Anna have).
It would be generally possible, and Wooldridge writes about that in the paper.

I don't know how this could be implemented for the ETWFE and would have to think about it, but do you think it would be worth digging?

Best, Frederic

Throws error with more than one covariate

First off, thank you for creating the package.

At the moment, more than one covariate leads to an error. The problem is caused when creating the crtls_dm variable in line 109 in etwfe.R, the string has to be split to get the names right, e.g.,
ctrls_dm = paste0(ctrls[[1]], " ", "")), "_dm")

I have not had time to check whether this can be fixed here or would lead to downstream problems. Will try to take a look tomorrow.

marginaleffects slow in large data sets

Dear Grant,

The marginaleffects::marginaleffects part in emfx becomes very slow in large datasets (> 500k), especially when the number of periods increases. At some point, it does not run in a sensible time anymore (running time explodes exponentially).

Currently, etwfe runs marginaleffects over the entire dat dataset. But that is actually not necessary.
The results will be (almost) equivalent if you first collapse the data set for all period-cohort combinations and take weighted means!

I added in my fork the part below that collapses the data. With some simulated data (1 mio. obs, 10 periods), adding this block improves running time by a factor of 4 (100 sec. vs 400 sec.) and this only increases exponentially with larger datasets.

What do you think about this? Greetings !

# define formulas (needs an "if-else" with / without xvar)
form_count = stats::as.formula(paste(".", " ~", gvar, "+", tvar, "+", xvar))
form_data = stats::as.formula(paste(".", " ~", gvar, "+", tvar, "+", xvar, "+ .Dtreat"))

# calculate weights
dat_weights = aggregate(form_count, data = subset(dat, .Dtreat == 1), FUN = length)[c(gvar, tvar, xvar, ".Dtreat")]
names(dat_weights)[names(dat_weights) == ".Dtreat"] = "n"

# collapse the data 
dat = aggregate(form_data, data = subset(dat, .Dtreat == 1), FUN = mean, na.rm = TRUE)

# merge the weights onto the collapsed data
dat = merge(dat, dat_weights, all.x = T)

# [...]
mfx = marginaleffects::marginaleffects(
  object,
  newdata = dat, # the collapsed data
  wts = "n", # the cohort-period cell size as weights
  variables = ".Dtreat", 
  by = c(by_var, xvar),
  ...

Including weights?

Is it possible to include weights in the etwfe function?

Interaction with continuous variable

Hey Grant,

Hey @vincentarelbundock (maybe you can help me too because this is in the end a question on marginaleffects rather than etwfe)

I'm using the xvar argument so far to calculate effects for different groups of a categorical variable. This works great.

As an example, I can calculate a treatment effect for education = 1, 2, or 3 and I get three treatment effects for each of these groups from emfx. This would be something like this:

Term                 Contrast Estimate Std. Error      z       Pr(>|z|)  2.5 % 97.5 %  .Dtreat. educ
 .Dtreat mean(TRUE) - mean(FALSE)   1.1529   0.010874 106.02 < 2.22e-16 1.1315 1.1742    TRUE   1
 .Dtreat mean(TRUE) - mean(FALSE)   2.1092   0.010739 196.41 < 2.22e-16 2.0882 2.1303    TRUE   2
 .Dtreat mean(TRUE) - mean(FALSE)   3.1726   0.009533 332.82 < 2.22e-16 3.1540 3.1913    TRUE   3

However, for a continous variable, I guess most researchers want to report something different for a linear model: Simply an intercept and the interaction. Assuming I want to interact the treatment with distance:

etwfe(fml = y ~ 0, tvar = T,  gvar = G, xvar = distance, data = dat)

Then etwfe gives

GLM estimation, Dep. Var.: y
Observations: 1,000 
Fixed-effects: G: 19,  T: 3
Standard-errors: Clustered (G) 
                          Estimate Std. Error    t value   Pr(>|t|)    
.Dtreat:G::1:T::2:distance  0.014189   0.010905   1.301083 2.0964e-01    
.Dtreat:G::1:T::3:distance -0.178339   0.011416 -15.622212 6.5175e-12 ***
.Dtreat:G::2:T::2:distance -0.061419   0.010905  -5.631918 2.4157e-05 ***
.Dtreat:G::2:T::3:distance  0.419388   0.011416  36.737625  < 2.2e-16 ***
.Dtreat:G::3:T::3:distance -0.140821   0.011416 -12.335705 3.2324e-10 ***
...

which is good so far and emfx reports the marginal effects for different values of distance.

    Term                 Contrast .Dtreat ldist_air Estimate Std. Error        z Pr(>|z|)    S  2.5 % 97.5 %
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      8.88  -0.0372       6.59 -0.00564    0.996  0.0 -12.96  12.88
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      6.69 132.9161      24.82  5.35418   <0.001 23.5  84.26 181.57
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      8.24   1.7932      11.79  0.15216    0.879  0.2 -21.31  24.89
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      9.45  -7.0808       6.12 -1.15642    0.248  2.0 -19.08   4.92
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      9.06  -2.2928       6.43 -0.35667    0.721  0.5 -14.89  10.31
--- 79 rows omitted. See ?print.marginaleffects --- 
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      9.10  -2.8199       6.39 -0.44118    0.659  0.6 -15.35   9.71
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      8.25   8.4341       7.29  1.15744    0.247  2.0  -5.85  22.72
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      9.82 120.9891      16.57  7.30238   <0.001 41.7  88.52 153.46
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      8.34   6.9434       5.00  1.38910    0.165  2.6  -2.85  16.74
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE      9.25  -9.4350       9.73 -0.97017    0.332  1.6 -28.50   9.63

But it's a linear model and I would like to see from emfx simply an Intercept and Slope Coefficient. Sadly, I am not an experts on the marginaleffects package. Do you know how I could achieve that? (And if you think this would be useful I could also try to implement and push it).

All the best, Frederic

Add ivar argument

For unbalanced panels, but also just as an additional option.

Error: Class of the `.Dtreat` variable is class is not supported.

Hi Grant,

Thanks for making this package. I would like to use this package on my Ubuntu workstation, but it gives me the following error.

I have tried reinstalling R and this package, but the same error occurs. Would there be an easy way to fix it?

Thanks again for the package, and I would very much appreciate your help on this issue. Many thanks in advance!

gref could be smarter

If min(gvar) < min (tvar), as with the mpdta dataset in the README, that's a "never treated" value (in the sense of being able to recover a treatment effect).
Conversely, the current set will also fail without a bespoke gref value unless max(gvar) > max(tvar). This is the case even when using the "notyet" treated control, which clearly isn't right.

`marginaleffects` 0.10.0 is on its way to CRAN

Confidence interval of aggregate effects off when using fe = 'none'

Sorry to be a pain and submit another issue, but I wanted to keep track of it.
I just realized when replying to the other issue that the confidence interval of the aggregate effects is weirdly off when using fe = 'none'.

Based on the example mpdta:

mod = 
  etwfe(
   fml  = lemp ~ lpop, # outcome ~ controls
   tvar = year,        # time variable
   gvar = first.treat, # group variable
   data = mpdta,       # dataset
   vcov = ~countyreal, # vcov adjustment (here: clustered)
   fe = "none"
   )

emfx(mod, type = "simple")
     Term    Contrast .Dtreat   Effect Std. Error z value   Pr(>|z|) 2.5 % 97.5 %
1 .Dtreat mean(dY/dX)       1 -0.05063     0.0125  -4.051 5.0988e-05 8.418  8.694


emfx(mod, type = "event")
     Term    Contrast event   Effect Std. Error z value   Pr(>|z|) 2.5 % 97.5 %
1 .Dtreat mean(dY/dX)     0 -0.03321    0.01337  -2.485 0.01296119 8.418  8.694
2 .Dtreat mean(dY/dX)     1 -0.05735    0.01715  -3.344 0.00082628 5.871  6.147
3 .Dtreat mean(dY/dX)     2 -0.13787    0.03079  -4.478 7.5344e-06 5.163  5.462
4 .Dtreat mean(dY/dX)     3 -0.10954    0.03232  -3.390 0.00069966 5.264  5.513

     Term    Contrast first.treat   Effect Std. Error z value   Pr(>|z|) 2.5 % 97.5 %
1 .Dtreat mean(dY/dX)        2004 -0.08763    0.02305  -3.802 0.00014351 5.283  5.557
2 .Dtreat mean(dY/dX)        2006 -0.02128    0.01859  -1.145 0.25240119 5.895  6.162
3 .Dtreat mean(dY/dX)        2007 -0.04595    0.01797  -2.557 0.01055532 8.418  8.694

I'll try to also take a look at this over the weekend.

Error: The function supplied to the `transform_pre` argument must accept two numeric vectors of predicted probabilities of length 0, and return a single numeric value or a numeric vector of length 0, with no missing value.

Issue

Much thanks for the package! I was in search of a tool to implement the nonlinear DiD approach described by Wooldridge (2022) and this is exactly what I was was hoping to find.

Before applying etwfe() and emfx()to my data, I tried to run the example code provided by the README and vignette. I was unable to replicate any emfx() results due to the following error (and additional warning):

Error: The function supplied to the `transform_pre` argument must accept two numeric vectors of predicted probabilities of length 0, and return a single numeric value or a numeric vector
  of length 0, with no missing value.
In addition: Warning message:
In formals(fun) : argument is not a function

I'm not sure if this is a potential general issue with the recent marginaleffects update or if it is specific to me. I have placed the entirety of my code and output below in addition to a traceback of the error and my session information.

Thanks in advance for any help!

Code

# Load library:
library(etwfe)

# Load data:
data("mpdta", package = "did")

# Inspect data:
head(mpdta)

# Estimate model:
mod =
  etwfe(
    fml  = lemp ~ lpop, # outcome ~ controls
    tvar = year,        # time variable
    gvar = first.treat, # group variable
    data = mpdta,       # dataset
    vcov = ~countyreal  # vcov adjustment (here: clustered)
  )

# Inspect model object:
mod

# Get aggregated ATT:
emfx(mod)

Output

> # Load library:
> library(etwfe)
> 
> # Load data:
> data("mpdta", package = "did")
> 
> # Inspect data:
> head(mpdta)
    year countyreal     lpop     lemp first.treat treat
866 2003       8001 5.896761 8.461469        2007     1
841 2004       8001 5.896761 8.336870        2007     1
842 2005       8001 5.896761 8.340217        2007     1
819 2006       8001 5.896761 8.378161        2007     1
827 2007       8001 5.896761 8.487352        2007     1
937 2003       8019 2.232377 4.997212        2007     1
> 
> # Estimate model:
> mod =
+   etwfe(
+     fml  = lemp ~ lpop, # outcome ~ controls
+     tvar = year,        # time variable
+     gvar = first.treat, # group variable
+     data = mpdta,       # dataset
+     vcov = ~countyreal  # vcov adjustment (here: clustered)
+   )
> 
> # Inspect model object:
> mod
OLS estimation, Dep. Var.: lemp
Observations: 2,500 
Fixed-effects: first.treat: 4,  year: 5
Varying slopes: lpop (first.treat: 4),  lpop (year: 5)
Standard-errors: Clustered (countyreal) 
                                              Estimate Std. Error   t value   Pr(>|t|)    
.Dtreat:first.treat::2004:year::2004         -0.021248   0.021728 -0.977890 3.2860e-01    
.Dtreat:first.treat::2004:year::2005         -0.081850   0.027375 -2.989963 2.9279e-03 ** 
.Dtreat:first.treat::2004:year::2006         -0.137870   0.030795 -4.477097 9.3851e-06 ***
.Dtreat:first.treat::2004:year::2007         -0.109539   0.032322 -3.389024 7.5694e-04 ***
.Dtreat:first.treat::2006:year::2006          0.002537   0.018883  0.134344 8.9318e-01    
.Dtreat:first.treat::2006:year::2007         -0.045093   0.021987 -2.050907 4.0798e-02 *  
.Dtreat:first.treat::2007:year::2007         -0.045955   0.017975 -2.556568 1.0866e-02 *  
.Dtreat:first.treat::2004:year::2004:lpop_dm  0.004628   0.017584  0.263184 7.9252e-01    
... 6 coefficients remaining (display them with summary() or use argument n)
... 10 variables were removed because of collinearity (.Dtreat:first.treat::2006:year::2004, .Dtreat:first.treat::2006:year::2005 and 8 others [full set in $collin.var])
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.537131     Adj. R2: 0.87167 
                 Within R2: 8.449e-4
> 
> # Get aggregated ATT:
> emfx(mod)
Error: The function supplied to the `transform_pre` argument must accept two numeric vectors of predicted probabilities of length 0, and return a single numeric value or a numeric vector
  of length 0, with no missing value.
In addition: Warning message:
In formals(fun) : argument is not a function

Traceback

Code:

# Error traceback:
traceback()

Output:

> # Error traceback:
> traceback()
11: stop(format_message(string = string, ..., line_length = line_length, 
        indent = indent), call. = call.)
10: format_alert(..., type = "error")
9: insight::format_error(msg)
8: safefun(hi = predicted_hi, lo = predicted_lo, y = predicted, 
       n = .N, term = term, cross = cross, wts = marginaleffects_wts_internal, 
       tmp_idx = tmp_idx)
7: `[.data.table`(out, , `:=`("estimate", safefun(hi = predicted_hi, 
       lo = predicted_lo, y = predicted, n = .N, term = term, cross = cross, 
       wts = marginaleffects_wts_internal, tmp_idx = tmp_idx)), 
       by = idx)
6: out[, `:=`("estimate", safefun(hi = predicted_hi, lo = predicted_lo, 
       y = predicted, n = .N, term = term, cross = cross, wts = marginaleffects_wts_internal, 
       tmp_idx = tmp_idx)), by = idx]
5: get_contrasts(structure(list(nobs = 2500L, nobs_origin = 2500L, 
       fml = lemp ~ .Dtreat:i(first.treat, i.year, ref = 0, ref2 = 2003)/lpop_dm, 
       call = fixest::feols(fml = Fml, data = data, vcov = ..1, 
           notes = FALSE), call_env = <environment>, method = "feols", 
       method_type = "feols", fml_all = list(linear = lemp ~ .Dtreat:i(first.treat, 
           i.year, ref = 0, ref2 = 2003)/lpop_dm, fixef = ~first.treat + 
           first.treat[[lpop]] + year + year[[lpop]]), fml_no_xpd = lemp ~ 
           .Dtreat:i(first.treat, i.year, ref = 0, ref2 = 2003)/lpop_dm | 
               first.treat[lpop] + year[lpop], fixef.tol = 1e-06, 
       fixef.iter = 10000, nparams = 31, fixef_vars = c("first.treat", 
       "year"), fixef_terms = c("first.treat", "first.treat[[lpop]]", 
       "year", "year[[lpop]]"), slope_flag = c(1L, 1L), slope_flag_reordered = c(1L, 
       1L), slope_variables_reordered = list(lpop = c(5.8967609333053, 
       5.8967609333053, 5.8967609333053, 5.8967609333053, 5.8967609333053, 
       2.23237719795055, 2.23237719795055, 2.23237719795055, 2.23237719795055, 
       2.23237719795055, 1.29828248379668, 1.29828248379668, 1.29828248379668, 
       1.29828248379668, 1.29828248379668, 3.32625829499766, 3.32625829499766, 
       3.32625829499766, 3.32625829499766, 3.32625829499766, 6.24790553432335, 
       6.24790553432335, 6.24790553432335, 6.24790553432335, 6.24790553432335, 
       2.08081559723298, 2.08081559723298, 2.08081559723298, 2.08081559723298, 
    ...
4: do.call("get_contrasts", args)
3: comparisons(model, newdata = newdata, variables = variables, 
       vcov = vcov, conf_level = conf_level, type = type, wts = wts, 
       hypothesis = hypothesis, df = df, by = by, eps = eps, transform_pre = slope, 
       cross = FALSE, internal_call = TRUE, ...)
2: marginaleffects::slopes(object, newdata = dat, wts = "N", variables = ".Dtreat", 
       by = by_var, ...)
1: emfx(mod)

Session Information

Code:

# Session information:
sessionInfo()

Output:

> # Session information:
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] etwfe_0.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3           lattice_0.20-41        fixest_0.10.4          zoo_1.8-8              marginaleffects_0.10.0 grid_4.0.4             nlme_3.1-152           backports_1.2.1       
 [9] dreamerr_1.2.3         data.table_1.14.2      checkmate_2.0.0        generics_0.1.0         sandwich_3.0-2         Formula_1.2-4          tools_4.0.4            numDeriv_2016.8-1.1   
[17] compiler_4.0.4         insight_0.19.0

fe (shortcut?) argument

Allow users to easily specify which of the following three equivalent models they want:

# varying slopes (current default)
feols(
  lemp ~ .Dtreat : i(first_treat, i.year, ref = 0, ref2 = 2003) / lpop_dm |
    first_treat[lpop] + year[lpop],
  data = mpdta, vcov = ~countyreal
  )

# fes only
feols(
  lemp ~ .Dtreat : i(first_treat, i.year, ref = 0, ref2 = 2003) / lpop_dm  +
    lpop +                           # changed
    i(first_treat, lpop, ref = 0) +  # changed
    i(year, lpop, ref = 2003) |      # changed
    first_treat + year,              # changed
  data = mpdta, vcov = ~countyreal
  )

# no fes at all
feols(
  lemp ~ .Dtreat : i(first_treat, i.year, ref = 0, ref2 = 2003) / lpop_dm  +
    lpop +                                          # changed
    i(first_treat, lpop, ref = 0) +                 # changed
    i(year, lpop, ref = 2003) +                     # changed
    i(first_treat, ref = 0) + i(year, ref = 2003),  # changed
  data = mpdta, vcov = ~countyreal
  )

Complication: Need to double check multiple controls...

emfx() not aggregating when the data already includes a variable named 'group'

Hi @grantmcdermott,
just noticed a possible bug?
If the data being used already has a variable named "group" then the emfx() function does not seem to aggregate the effects.
In the example below, I first run the same example that you have on your website, so emfx() works as expected, and then modify the data adding a variable named group. Note that this new variable is not used in the etwfe() function, and yet emfx() does not aggregate the results.
Cheers,
Mario

data("mpdta", package = "did")
head(mpdta)
mod =
  etwfe::etwfe(
    fml  = lemp ~ lpop, # outcome ~ controls
    tvar = year,        # time variable
    gvar = first.treat, # group variable
    data = mpdta,       # dataset
    vcov = ~countyreal  # vcov adjustment (here: clustered)
  )
etwfe::emfx(mod)

    _Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
 .Dtreat mean(TRUE) - mean(FALSE)    TRUE  -0.0506     0.0125 -4.05   <0.001 14.3 -0.0751 -0.0261_

mpdta$group <- mpdta$countyreal
mod2 =
  etwfe::etwfe(
    fml  = lemp ~ lpop, # outcome ~ controls
    tvar = year,        # time variable
    gvar = first.treat, # group variable
    data = mpdta,       # dataset
    vcov = ~countyreal  # vcov adjustment (here: clustered)
  )
etwfe::emfx(mod2)

 _Group    Term                 Contrast .Dtreat Estimate Std. Error       z Pr(>|z|)   S   2.5 %    97.5 %
  8001 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.09431     0.0330 -2.8579  0.00426 7.9 -0.1590 -0.029634
  8019 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.02163     0.0334 -0.6473  0.51746 1.0 -0.0871  0.043870
  8023 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.00310     0.0474 -0.0654  0.94785 0.1 -0.0961  0.089886
  8029 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.04333     0.0193 -2.2501  0.02444 5.4 -0.0811 -0.005587
  8041 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.10128     0.0382 -2.6539  0.00796 7.0 -0.1761 -0.026483
--- 181 rows omitted. See ?print.marginaleffects --- 
 55103 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.05469     0.0224 -2.4459  0.01445 6.1 -0.0985 -0.010865
 55109 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.00621     0.0205 -0.3027  0.76208 0.4 -0.0464  0.033973
 55123 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.03744     0.0191 -1.9587  0.05015 4.3 -0.0749  0.000024
 55131 .Dtreat mean(TRUE) - mean(FALSE)    TRUE  0.01769     0.0269  0.6585  0.51024 1.0 -0.0350  0.070355
 55137 .Dtreat mean(TRUE) - mean(FALSE)    TRUE -0.04484     0.0202 -2.2177  0.02658 5.2 -0.0845 -0.005210
Columns: group, term, contrast, .Dtreat, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo_

Are controls required?

So thrilled for this and exactly what I was looking for yesterday! I'm wondering whether controls are required, as I can't seem to get your example to run without specifying controls (I'm starting very simply with my own analysis). All is well with your example:

> library(etwfe)
> data("mpdta", package = "did")
> mod = 
+     etwfe(
+         fml  = lemp ~ lpop, # outcome ~ controls
+         tvar = year,        # time variable
+         gvar = first.treat, gref=0, # group variable (with bespoke ref. level)
+         data = mpdta,       # dataset
+         vcov = ~countyreal  # vcov adjustment (here: clustered)
+     )
> mod
OLS estimation, Dep. Var.: lemp
Observations: 2,500 
Fixed-effects: first.treat: 4,  year: 5
Varying slopes: lpop (first.treat: 4),  lpop (year: 5)
Standard-errors: Clustered (countyreal) 
                                              Estimate Std. Error   t value   Pr(>|t|)    
.Dtreat:first.treat::2004:year::2004         -0.021248   0.021728 -0.977890 3.2860e-01    
.Dtreat:first.treat::2004:year::2005         -0.081850   0.027375 -2.989963 2.9279e-03 ** 
.Dtreat:first.treat::2004:year::2006         -0.137870   0.030795 -4.477097 9.3851e-06 ***
.Dtreat:first.treat::2004:year::2007         -0.109539   0.032322 -3.389024 7.5694e-04 ***
.Dtreat:first.treat::2006:year::2006          0.002537   0.018883  0.134344 8.9318e-01    
.Dtreat:first.treat::2006:year::2007         -0.045093   0.021987 -2.050907 4.0798e-02 *  
.Dtreat:first.treat::2007:year::2007         -0.045955   0.017975 -2.556568 1.0866e-02 *  
.Dtreat:first.treat::2004:year::2004:lpop_dm  0.004628   0.017584  0.263184 7.9252e-01    
... 6 coefficients remaining (display them with summary() or use argument n)
... 10 variables were removed because of collinearity (.Dtreat:first.treat::2006:year::2004, .Dtreat:first.treat::2006:year::2005 and 8 others [full set in $collin.var])
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.537131     Adj. R2: 0.87167 
                 Within R2: 8.449e-4

If I drop the controls altogether, it throws an error:

> mod =   
  etwfe(  
         fml  = lemp, # outcome ~ controls  
         tvar = year,        # time variable  
         gvar = first.treat, gref=0, # group variable (with bespoke ref. level)  
         data = mpdta,       # dataset  
         vcov = ~countyreal  # vcov adjustment (here: clustered)  
     )  
Error in etwfe(fml = lemp, tvar = year, gvar = first.treat, gref = 0,  : 
  object 'lemp' not found

I also tried a few other iterations, e.g., fml = lemp ~ or fml = lemp ~ 1 but these also generate errors.

I know this is all in development, and am really thankful for the resource. Thanks again.

Interacted Treatment Effect

Dear Grant,

Thanks for your amazing work. Wooldridge's estimator allows interacting a binary treatment dummy with a time-constant covariate (say, gender - male / female). Then, you get the effect separately for all groups.
I tried to implement the estimator including this feature on my own, but it's buggy and occasionally the standard errors explode (Too small groups etc.)

Are you planning to include this interaction-option in your package, or did you even already implement it?

Thanks and best, Frederic

Error: Error in variable_classes[[v$name]] : subscript out of bounds

When I follow the vignette and run the function emfx, I get the following error: Error in variable_classes[[v$name]] : subscript out of bounds

Support NSE?

Something like the following, adapted from subset, internally. Would add a few microseconds in overhead, but it handles both expressions and strings.

nse_func = function(vars, data) {
  nl = as.list(seq_along(data))
  names(nl) = names(data)
  vars = eval(substitute(vars), nl, parent.frame())
  if (is.numeric(vars)) vars = names(data)[vars]
  # Simple return object to demonstrate
  head(data[, vars, drop = FALSE], 2)
}

# Handles both strings and expressions 
nse_func("mpg", mtcars)
#>               mpg
#> Mazda RX4      21
#> Mazda RX4 Wag  21
nse_func(mpg, mtcars)
#>               mpg
#> Mazda RX4      21
#> Mazda RX4 Wag  21

# Vectors are fine too (although not expected for etwfe)
nse_func(c("mpg", "cyl"), mtcars)
#>               mpg cyl
#> Mazda RX4      21   6
#> Mazda RX4 Wag  21   6
nse_func(c(mpg, cyl), mtcars)
#>               mpg cyl
#> Mazda RX4      21   6
#> Mazda RX4 Wag  21   6
nse_func(c(mpg:cyl, hp), mtcars)
#>               mpg cyl  hp
#> Mazda RX4      21   6 110
#> Mazda RX4 Wag  21   6 110
nse_func(names(mtcars), mtcars)
#>               mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

^{Created on 2022-10-15 with reprex v2.0.2}

Double check Poisson mfx

C.f. With the jwdid Poisson model at the very bottom of this post.

The main results are exactly the same.

library(etwfe)

data("mpdta", package = "did")
mpdta$emp = exp(mpdta$lemp)

# poisson model
pmod = etwfe(emp ~ lpop, tvar=year, gvar=first.treat, gref=0, data=mpdta, 
             vcov=~countyreal, family = "poisson", fe = "none")
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and eight others have been removed because of collinearity (see $collin.var).

fixest::etable(pmod, digits = 7, signif.code = NA)
#>                                                                        pmod
#> Dependent Var.:                                                         emp
#>                                                                            
#> Constant                                               2.484455 (0.0774941)
#> lpop                                                   1.039625 (0.0168640)
#> lpop x first.treat = 2004                            -0.0132716 (0.0404687)
#> lpop x first.treat = 2006                            -0.0811734 (0.0330063)
#> lpop x first.treat = 2007                             0.0162899 (0.0250988)
#> lpop x year = 2004                                   -0.0066118 (0.0045127)
#> lpop x year = 2005                                   -0.0077402 (0.0059153)
#> lpop x year = 2006                                   -0.0090772 (0.0065360)
#> lpop x year = 2007                                   -0.0035862 (0.0054845)
#> first.treat = 2004                                    0.2361076 (0.2079180)
#> first.treat = 2006                                    0.5927050 (0.1770658)
#> first.treat = 2007                                   -0.1440501 (0.1223770)
#> year = 2004                                          -0.0105493 (0.0209058)
#> year = 2005                                           0.0112893 (0.0270397)
#> year = 2006                                           0.0453898 (0.0291563)
#> year = 2007                                           0.0542584 (0.0280050)
#> .Dtreat x first.treat = 2004 x year = 2004           -0.0309557 (0.0176208)
#> .Dtreat x first.treat = 2004 x year = 2005           -0.0662240 (0.0255833)
#> .Dtreat x first.treat = 2004 x year = 2006           -0.1329713 (0.0237253)
#> .Dtreat x first.treat = 2004 x year = 2007           -0.1170228 (0.0225006)
#> .Dtreat x first.treat = 2006 x year = 2006           -0.0090494 (0.0244213)
#> .Dtreat x first.treat = 2006 x year = 2007           -0.0681486 (0.0251113)
#> .Dtreat x first.treat = 2007 x year = 2007           -0.0399916 (0.0168249)
#> .Dtreat x lpop_dm x first.treat = 2004 x year = 2004  0.0118798 (0.0071285)
#> .Dtreat x lpop_dm x first.treat = 2004 x year = 2005  0.0209935 (0.0119250)
#> .Dtreat x lpop_dm x first.treat = 2004 x year = 2006  0.0409103 (0.0095714)
#> .Dtreat x lpop_dm x first.treat = 2004 x year = 2007  0.0250348 (0.0112873)
#> .Dtreat x lpop_dm x first.treat = 2006 x year = 2006  0.0375837 (0.0136183)
#> .Dtreat x lpop_dm x first.treat = 2006 x year = 2007  0.0453804 (0.0164515)
#> .Dtreat x lpop_dm x first.treat = 2007 x year = 2007 -0.0109675 (0.0072256)
#> ________________________________________             ______________________
#> S.E.: Clustered                                              by: countyreal
#> Observations                                                          2,500
#> Squared Cor.                                                        0.95698
#> Pseudo R2                                                           0.95262
#> BIC                                                               287,872.5

But the marginal effects are different.

emfx(pmod, type = "event")
#>      Term    Contrast event  Effect Std. Error  z value   Pr(>|z|)   2.5 %  97.5 %
#> 1 .Dtreat mean(dY/dX)     0 -22.849      16.17 -1.41348 0.15751423  -54.53   8.834
#> 2 .Dtreat mean(dY/dX)     1   3.654      41.81  0.08739 0.93035790  -78.30  85.607
#> 3 .Dtreat mean(dY/dX)     2 -71.521      22.06 -3.24264 0.00118430 -114.75 -28.291
#> 4 .Dtreat mean(dY/dX)     3 -97.798      25.45 -3.84321 0.00012144 -147.67 -47.923
#> 
#> Model type:  etwfe 
#> Prediction type:  response

@vincentarelbundock Do you have any ideas? (Sorry for tagging and running, but wanted to get this down quickly before I have to head out!)

Lee and Wooldridge

In case you missed it: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4516518

Inference using fixest's aggregate command

Fixest's aggregate command may be an efficient alternative for inference in cases when marginaleffects takes a long time. Consider the MWE below:

library(etwfe)

data("mpdta", package="did")

mod = etwfe(
    fml = lemp ~ lpop,
    tvar  = year,
    gvar = first.treat,
    data = mpdta,
    vcov = ~countyreal
)

emfx(mod)

aggregate(mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))

emfx returns an estimate of -0.05062703 with std. error of 0.01249979, and aggregate returns the same exact values in this case.

CRAN release

FYI, marginaleffects 0.8.1 is on its way to CRAN.

emfx could be smarter about xvar

We can just grab the "xvar" variable from the model attributes (and this should probably be the default behaviour if xvar was specified in the preceding etwfe call).

emfx() error when using offset in Poisson

Hi Scott, thanks for the great package.
I am using etwfe in a Poisson regression with an exposure variable.
etwfe() works but emfx() returns an error message.
Example using mpdta data and the lpop variable as the exposure.

data("mpdta", package = "did")
mpdta$emp = exp(mpdta$lemp)

etwfe(
  emp ~ 0, tvar = year, gvar = first.treat, data = mpdta, vcov = ~countyreal,
  family = "poisson", offset = ~log(lpop)
) |>
  emfx()

Error: Unable to compute predicted values with this model. You can try to supply a different dataset to the `newdata` argument. If
  this does not work, you can file a report on the Github Issue Tracker:
  https://github.com/vincentarelbundock/marginaleffects/issues

It looks like a problem with the marginaleffects package. Let me know if there is a simple workaround or if I am mistaken.

Error when trying to use emfx function. "Models of class "etwfe" are not supported."

Hi Grant, thanks a lot for the package!

I succesfully run my model with etwfe(), but when i try to aggregate it by the emfx command, i get the following error:

> emfx(teste, type = "calendar")
Error: Models of class "etwfe" are not supported. Supported model classes include:
  
  afex_aov, amest, betareg, bglmerMod, bigglm, biglm, blmerMod, bracl, brglmFit, brmsfit, brnb, clm, clogit, coxph, crch,
  fixest, flac, flic, gam, Gam, gamlss, geeglm, glimML, glm, glmerMod, glmmPQL, glmmTMB, glmrob, glmx, gls, hetprob, hurdle,
  hxlr, iv_robust, ivpml, ivreg, lm, lm_robust, lme, lmerMod, lmerModLmerTest, lmrob, lmRob, loess, logistf, lrm, mblogit,
  mclogit, MCMCglmm, mhurdle, mira, mlogit, multinom, negbin, nls, ols, oohbchoice, orm, phyloglm, phylolm, plm, polr,
  Rchoice, rlmerMod, rq, scam, selection, speedglm, speedlm, stanreg, svyolr, tobit, tobit1, truncreg, zeroinfl
  
  New modeling packages can usually be supported by `marginaleffects` if they include a working `predict()` method. If you
  believe that this is the case, please file a feature request on Github:
  https://github.com/vincentarelbundock/marginaleffects/issues

Allow time-varying covariates

(Note to self: See email correspondence with @fpretis.)

gvar variable question

Dear Professor
Greetings, I've been reading your etwfe code but I'm having a little problem with the gvar variable, my treatment is cumulative percentage km2 of protected areas (percentacumestricta) and outcome (RLI) these variables are continuous and I have data from 2000 to 2014 by country (ID) -year (t).

treated=1 if percentacumestricta >17

I created gvar variable:
bysort ID: egen min_year=min(year) if treated==1
bysort ID: egen First_treated=max(min_year)
replace First_treated=0 if First_treated==.
tab year First_treated

However I couldn´t understand how you create your gvar, Could you explain me that please?

I really appreciate your comments
Thanks for your time
All the best

Question about simple ATT using never treated control group

Hi @grantmcdermott ,

First of all, thank you for providing this valuable package. It has been instrumental in helping me understand the implementation details of the Extended Two-Way Fixed Effects method developed by Wooldridge.

I do, however, have a question regarding the computation of the simple ATT using a never treated control group. In my experiments, I have noticed significant differences between the results obtained from the etwfe function and those obtained from the Callaway-Sanntanna estimator, as well as when I manually run the etwfe procedure.

After executing the etwfe function, I computed the simple ATT using the marginaleffects::slopes() function and obtained an estimate of -0.0404, whereas the estimate I obtained using emfx() was -0.00166.

I must be overlooking some important detail. To provide further context, I have attached the details of my experiments below.

Thank you in advance!

library(did)
library(etwfe)
library(broom)
library(stringr)
library(marginaleffects)
library(fixest)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

options(knitr.kable.NA = "-")

# Data -------------------------------------------------------------------------
mpdta <- did::mpdta %>%
  mutate(
    .Dtreat = treat * (year >= first.treat),
    time_to_event = ifelse(first.treat != 0, year - first.treat, -1000)
  )

head(mpdta)
#>     year countyreal     lpop     lemp first.treat treat .Dtreat time_to_event
#> 866 2003       8001 5.896761 8.461469        2007     1       0            -4
#> 841 2004       8001 5.896761 8.336870        2007     1       0            -3
#> 842 2005       8001 5.896761 8.340217        2007     1       0            -2
#> 819 2006       8001 5.896761 8.378161        2007     1       0            -1
#> 827 2007       8001 5.896761 8.487352        2007     1       1             0
#> 937 2003       8019 2.232377 4.997212        2007     1       0            -4

# Callaway-Sant'Anna - Never Treated Control -----------------------------------

cs_never <- att_gt(
  yname = "lemp",
  tname = "year",
  idname = "countyreal",
  gname = "first.treat",
  data = mpdta,
  control_group = "nevertreated"
)

cs_never_simple <- aggte(cs_never, type = "simple") %>%
  tidy() %>%
  select(type, cs_never = estimate)

cs_never_dynamic <- aggte(cs_never, type = "dynamic") %>%
  tidy() %>%
  select(type, event = event.time, cs_never = estimate)

cs_never <- tidy(cs_never) %>%
  mutate(type = "group_time") %>%
  select(type, group, time, cs_never = estimate) %>%
  mutate(
    group = as.character(group),
    time = as.character(time)
  )

cs_never_res <- bind_rows(cs_never, cs_never_simple, cs_never_dynamic)

# Callaway-Sant'Anna - Not Yet Treated Control ---------------------------------

cs_notyet <- att_gt(
  yname = "lemp",
  tname = "year",
  idname = "countyreal",
  gname = "first.treat",
  data = mpdta,
  control_group = "notyettreated"
)

cs_notyet_simple <- aggte(cs_notyet, type = "simple") %>%
  tidy() %>%
  select(type, cs_notyet = estimate)

cs_notyet_dynamic <- aggte(cs_notyet, type = "dynamic") %>%
  tidy() %>%
  select(type, event = event.time, cs_notyet = estimate)

cs_notyet <- tidy(cs_notyet) %>%
  mutate(type = "group_time") %>%
  select(type, group, time, cs_notyet = estimate) %>%
  mutate(
    group = as.character(group),
    time = as.character(time)
  )

cs_notyet_res <- bind_rows(cs_notyet, cs_notyet_simple, cs_notyet_dynamic)


# Extended TWFE - Never Treated Control ----------------------------------------

etwfe_never <- etwfe(
  fml = lemp ~ 1,
  tvar = year,
  gvar = first.treat,
  data = mpdta,
  ivar = countyreal,
  cgroup = "never"
)

# compute simple ATT manually
slopes(
  model = etwfe_never,
  newdata = mpdta %>% filter(.Dtreat == 1),
  variable = ".Dtreat",
  by = ".Dtreat"
) %>%
  print()
#> 
#>     Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)
#>  .Dtreat mean(TRUE) - mean(FALSE)       1  -0.0404     0.0196 -2.06   0.0396
#>    S   2.5 %   97.5 %
#>  4.7 -0.0789 -0.00193
#> 
#> Columns: term, contrast, .Dtreat, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

etwfe_never_simple <- emfx(etwfe_never, "simple") %>%
  tidy() %>%
  mutate(type = "simple") %>%
  select(type, etwfe_never = estimate)

etwfe_never_dynamic <- emfx(etwfe_never, "event") %>%
  tidy() %>%
  mutate(type = "dynamic") %>%
  select(type, event, etwfe_never = estimate) %>%
  filter(event < 2000)

etwfe_never <- tidy(etwfe_never) %>%
  mutate(
    type = "group_time",
    group = str_extract(term, "(?<=::)[0-9]{4}(?=:year)"),
    time = str_extract(term, "(?<=year::)[0-9]{4}")
  ) %>%
  select(type, group, time, etwfe_never = estimate)

etwfe_never_res <- bind_rows(
  etwfe_never, etwfe_never_simple, etwfe_never_dynamic
)

# Extended TWFE - Not Yet Treated Control --------------------------------------

etwfe_notyet <- etwfe(
  fml = lemp ~ 1,
  tvar = year,
  gvar = first.treat,
  data = mpdta,
  ivar = countyreal,
  cgroup = "notyet"
)

etwfe_notyet_simple <- emfx(etwfe_notyet, "simple") %>%
  tidy() %>%
  mutate(type = "simple") %>%
  select(type, etwfe_notyet = estimate)

etwfe_notyet_dynamic <- emfx(etwfe_notyet, "event") %>%
  tidy() %>%
  mutate(type = "dynamic") %>%
  select(type, event, etwfe_notyet = estimate) %>%
  filter(event < 2000)

etwfe_notyet <- tidy(etwfe_notyet) %>%
  mutate(
    type = "group_time",
    group = str_extract(term, "(?<=::)[0-9]{4}(?=:year)"),
    time = str_extract(term, "(?<=year::)[0-9]{4}")
  ) %>%
  select(type, group, time, etwfe_notyet = estimate)

etwfe_notyet_res <- bind_rows(
  etwfe_notyet, etwfe_notyet_simple, etwfe_notyet_dynamic
)

# Extended TWFE Manual - Never Treated Control ---------------------------------

manual_never <- feols(
  lemp ~
    treat:i(first.treat, i.time_to_event, ref = 0, ref2 = -1) |
      first.treat + year,
  data = mpdta,
  cluster = ~countyreal
)

manual_never_simple <- slopes(
  model = manual_never,
  newdata = mpdta %>% filter(time_to_event >= 0),
  variables = "treat",
  by = "treat",
) %>%
  tidy() %>%
  mutate(type = "simple") %>%
  select(type, manual_never = estimate)

manual_never_dynamic <- slopes(
  model = manual_never,
  newdata = mpdta %>% filter(treat == 1),
  variables = "treat",
  by = "time_to_event",
) %>%
  tidy() %>%
  mutate(type = "dynamic") %>%
  select(type, event = time_to_event, manual_never = estimate)

manual_never <- tidy(manual_never) %>%
  mutate(
    type = "group_time",
    group = str_extract(term, "(?<=::)[0-9]{4}(?=:time)"),
    time = str_extract(term, "(?<=time_to_event::)[-0-9]{1,2}"),
    time = as.numeric(group) + as.numeric(time),
    time = as.character(time)
  ) %>%
  select(type, group, time, manual_never = estimate)

manual_never_res <- bind_rows(
  manual_never, manual_never_simple, manual_never_dynamic
)

# Extended TWFE Manual - Not Yet Treated Control -------------------------------

manual_notyet <- feols(
  lemp ~
    .Dtreat:i(first.treat, i.time_to_event, ref = 0, ref2 = -1) |
      first.treat + year,
  data = mpdta,
  cluster = ~countyreal
)
#> The variables '.Dtreat:first.treat::2006:time_to_event::-3', '.Dtreat:first.treat::2006:time_to_event::-2' and three others have been removed because of collinearity (see $collin.var).

manual_notyet_simple <- slopes(
  model = manual_notyet,
  newdata = mpdta %>% filter(.Dtreat == 1),
  variables = ".Dtreat",
  by = ".Dtreat",
) %>%
  tidy() %>%
  mutate(type = "simple") %>%
  select(type, manual_notyet = estimate)

manual_notyet_dynamic <- slopes(
  model = manual_notyet,
  newdata = mpdta %>% filter(.Dtreat == 1),
  variables = ".Dtreat",
  by = "time_to_event",
) %>%
  tidy() %>%
  mutate(type = "dynamic") %>%
  select(type, event = time_to_event, manual_notyet = estimate)

manual_notyet <- tidy(manual_notyet) %>%
  mutate(
    type = "group_time",
    group = str_extract(term, "(?<=::)[0-9]{4}(?=:time)"),
    time = str_extract(term, "(?<=time_to_event::)[-0-9]{1,2}"),
    time = as.numeric(group) + as.numeric(time),
    time = as.character(time)
  ) %>%
  select(type, group, time, manual_notyet = estimate)

manual_notyet_res <- bind_rows(
  manual_notyet, manual_notyet_simple, manual_notyet_dynamic
)

# Combine ----------------------------------------------------------------------

results <- full_join(
  x = cs_never_res,
  y = cs_notyet_res,
  by = join_by(type, group, time, event)
) %>%
  full_join(
    y = etwfe_never_res,
    by = join_by(type, group, time, event)
  ) %>%
  full_join(
    y = etwfe_notyet_res,
    by = join_by(type, group, time, event)
  ) %>%
  full_join(
    y = manual_never_res,
    by = join_by(type, group, time, event)
  ) %>%
  full_join(
    y = manual_notyet_res,
    by = join_by(type, group, time, event)
  )

# Simple ATT -------------------------------------------------------------------

results %>%
  filter(type == "simple") %>%
  select(cs_never:manual_notyet, -event) %>%
  knitr::kable(digits = 4, format = "markdown")

cs_never	cs_notyet	etwfe_never	etwfe_notyet	manual_never	manual_notyet
-0.04	-0.0398	-0.0017	-0.0477	-0.04	-0.0477

# Dynamic ATT ------------------------------------------------------------------

results %>%
  filter(type == "dynamic") %>%
  select(event, cs_never:manual_notyet) %>%
  arrange(event) %>%
  knitr::kable(digits = 4, format = "markdown")

event	cs_never	cs_notyet	etwfe_never	etwfe_notyet	manual_never	manual_notyet
-4	-	-	0.0000	-	0.0033	-
-3	0.0305	0.0298	0.0234	-	0.0250	-
-2	-0.0006	-0.0024	0.0228	-	0.0245	-
-1	-0.0245	-0.0243	-0.0015	-	0.0000	-
0	-0.0199	-0.0189	-0.0214	-0.0311	-0.0199	-0.0311
1	-0.0510	-0.0536	-0.0484	-0.0522	-0.0510	-0.0522
2	-0.1373	-0.1363	-0.1373	-0.1361	-0.1373	-0.1361
3	-0.1008	-0.1008	-0.1008	-0.1047	-0.1008	-0.1047

# Group-Time ATT ---------------------------------------------------------------

results %>%
  filter(type == "group_time") %>%
  select(group, time, cs_never:manual_notyet, -event) %>%
  knitr::kable(digits = 4, format = "markdown")

group	time	cs_never	cs_notyet	etwfe_never	etwfe_notyet	manual_never	manual_notyet
2004	2004	-0.0105	-0.0194	-0.0105	-0.0194	-0.0105	-0.0194
2004	2005	-0.0704	-0.0783	-0.0704	-0.0783	-0.0704	-0.0783
2004	2006	-0.1373	-0.1363	-0.1373	-0.1361	-0.1373	-0.1361
2004	2007	-0.1008	-0.1008	-0.1008	-0.1047	-0.1008	-0.1047
2006	2004	0.0065	-0.0026	0.0065	-	0.0028	-
2006	2005	-0.0028	-0.0019	0.0038	-	-	-
2006	2006	-0.0046	0.0047	-0.0008	0.0025	-0.0046	0.0025
2006	2007	-0.0412	-0.0412	-0.0375	-0.0392	-0.0412	-0.0392
2007	2004	0.0305	0.0298	0.0305	-	0.0338	-
2007	2005	-0.0027	-0.0024	0.0278	-	0.0311	-
2007	2006	-0.0311	-0.0311	-0.0033	-	-	-
2007	2007	-0.0261	-0.0261	-0.0294	-0.0431	-0.0261	-0.0431
2006	2003	-	-	-	-	-0.0038	-
2007	2003	-	-	-	-	0.0033	-

^{Created on 2023-06-21 with reprex v2.0.2}

.Dtreat should be logical instead of integer

... so that it automatically invokes marginaleffects::comparisons under the hood for emfx. (i.e. evaluates mean(.Dtreat==1 - mean(.Dtreat==0)instead ofmean(dYdX)`).

This won't make a difference for linear models, but will slightly change the results for non-linear families. So it's a bug fix.

Ironically, this was the default in the first iterations of the package. But I changed it whilst troubleshooting the jwdid difference in #4.

Can I wild cluster bootstrap this? =)

Hi Grant,

Thanks for sharing this package - it looks really cool! Naturally, I am wondering if I can wild (cluster) bootstrap this. =)

After taking a very brief (really very, very brief) glance at both code and the Wooldridge paper, my very topline understanding of the estimation procedure is that etwfe implements a linear regression / glm with a set of interacted covariates and a few estimated properties (e.g. equation 5.7 in the Wooldridge paper attached below, where $\bar{x}_{1}$ is an estimated mean and therefore has a sampling variance for which one has to correct inference for).

After estimating the 'full' model, one then computes marginal effects, e.g. for the effect on a specific cohort, you compute the marginal effect for a given cohort. If we assume that there are $C$ cohorts, for linear models, inference for all $C$ 'cohort effects' should then boil down to C linear multi-parameter tests of the form $R_{c}'\beta_{c} = r$ vs $R_{c}'\beta_{c} \neq r$, where $R_{c}$ is a vector of dimension $k_{c}$, with $k_{c}$ denoting all regression parameters that depend on cohort $c$?

Here's a quick formulaic example to (hopefully) clarify:

We estimate a linear model

$y = \beta_0 + \beta_{1} X_{1} + \beta_{2} X_{1} X_{2} + \epsilon_{i}$

via OLS.

We then compute the average marginal effect at the mean for $X_{1}$ as

$E[\frac{\partial E(y|X)}{\partial X_{1}}] = \beta_{1} + \beta_{2} \bar{X}_{2}.$

Inference for the marginal effect then follows a standard $R\beta = r$ schema, where $R = [1, \bar{X}_{2}]$, $\beta = [\beta_1, \beta_2]$ and $r = 0$.

Do I get this right? If inference can be squeezed into the $R\beta = r$ schema, it will be fairly straightforward to make fwildclusterboot work with etwfe - but only under the assumption that $\bar{x}_{1}$ is a noiseless estimate of $x_1$.

Best, Alex

	if (!is.null(xvar)) {
	data$.Dtreated_cohort = ifelse(data[[gvar]] != gref & !is.na(data[[gvar]]), 1, 0) # generate a treatment-dummy

	dm_fml = stats::reformulate(c(tvar), response = xvar)
	ctrls_dm_df = fixest::demean(dm_fml, data = data, weights = data$.Dtreated_cohort, as.matrix = FALSE) # weights: only use the treated cohorts (units) to demean
	ctrls_dm_df = stats::setNames(ctrls_dm_df, paste0(xvar, "_dm")) # give a name
	data = cbind(data, ctrls_dm_df)
	}

grantmcdermott / etwfe Goto Github PK

etwfe's People

Contributors

Stargazers

Watchers

Forkers

etwfe's Issues

Issue

Code

Output

Traceback

Session Information

Recommend Projects

Recommend Topics

Recommend Org