Coder Social home page Coder Social logo

fixest's People

Contributors

abinashbunty avatar eddelbuettel avatar grantmcdermott avatar khwilson avatar kylebutts avatar lrberge avatar maelastruc avatar rvlenth avatar sebkrantz avatar tappek avatar vincentarelbundock avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

fixest's Issues

order in etable() does not work with dict

Thank you for writing this fantastic package. This is great contribution to empirical economist community.

While I was working with etable(), I found that the order option in etable() does not work when dict option is specified. Could you look into this please?

library(fixest)

model <- feols(mpg ~ cyl + disp, data= mtcars)

etable(model, order = c("disp", "cyl"))

# order does not work with dict, whether it uses variable name or label.
etable(model, order = c("disp", "cyl"), dict = c(disp = "$DISP$"))
etable(model, order = c("$DISP$", "cyl"), dict = c(disp = "$DISP$"))

est_slopes model missing from exporting vignette

I'm not sure if I'm being very dense since I don't know how the document would have successfully knit without it, but I don't see the definition of the est_slopes model in the exporting table vignette. As written it doesn't run for me since it can't find est_slopes. I assume it was there at some point and is now missing, since the table displays 5 models and only 4 are estimated. However, I didn't see it show up in the commit history.

est_noFE = feols(Ozone ~ Solar.R + Wind + Temp, airquality)
est_1FE = feols(Ozone ~ Solar.R + Wind + Temp | Day, airquality)
est_2FE = feols(Ozone ~ Solar.R + Wind + Temp | Day + Month , airquality)
est_poly = feols(Ozone ~ Solar.R + Wind + poly(Temp, 3) | Day + Month, airquality)
```
To export the results to Latex, we use `etable`. We first set the dictionnary renaming the variables, then we export the results with clustered standard-errors at the Day level:
```{r, eval = FALSE}
# Dictionary => set only once per session
setFixest_dict(c(Ozone = "Ozone (ppb)", Solar.R = "Solar Radiation (Langleys)",
Wind = "Wind Speed (mph)", Temp = "Temperature"))
etable(est_noFE, est_1FE, est_2FE, est_poly, est_slopes, cluster = "Day", file = file,
group = list("Temperature (cubic)" = "poly"), notes = "Estimation of 5 models.")

lag argument in l function does not take a scalar.

Thank you for giving us such a fantastic package!

I think I found a minor bug in l function: it does not take a number (rather than a vector) for its lag argument.

library(fixest)
data("base_did")
pdat = panel(base_did, ~id+period)

The following does not work.

feols(y ~ l(x1, 2), pdat)

But this works.

feols(y ~ l(x1, 2:2), pdat)

Fatal error with FeNmlm function

Thanks for a really useful package.

One error I've encountered is FeNmlm crashing R with a fatal error. I can reproduce this behavior with the example below. Thanks in advance for any help.

library(dplyr)
library(magrittr)
library(fixest)

n_id = 800
n_year = 8
b_true = 1
a_true = 0.5
set.seed(1001)
test_df = data.frame(id = rep(1:n_id, n_year),
year = rep(1:n_year, each = n_id)) %>%
group_by(id) %>%
mutate(mu = rnorm(1, 0, 1)**2) %>% # generate id fixed effect
ungroup %>%
mutate(X = rnorm(n_yearn_id, 0, 1)**2, # random X
Z = rnorm(n_year
n_id, 0, 1)**2, # random Z
Y = rpois(n_yearn_id, mu(1 + b_true*X)^a_true)*Z) # outcome Y

feNmlm(Y ~ log(Z) + log(1 + X) | id,
data = test_df,
verbose = 1) # works perfectly

feNmlm(Y ~ log(Z) | id,
data = test_df,
NL.fml = ~ a*log(1 + X),
NL.start = list(a = 0.5),
verbose = 2) # reports that R encountered a fatal error and restarts`

Request: predict.fixest retains NAs

I noticed that fixest.predict drops observations if newdata contains NAs. Could It instead have NA values for the output vector?

Demo:

new_data <- mtcars
new_data[1, ] <- NA
library(fixest)
x <- feols(wt ~ cyl, mtcars)
pred <- predict(x, newdata=new_data)
nrow(new_data)
#> [1] 32
length(pred)
#> [1] 31

Created on 2019-11-22 by the reprex package (v0.3.0)

fixef when varying slope is integer gives error: REAL() can only be applied to a 'numeric', not a 'integer'

Merci Laurent pour ce super paquet!!

It seems fixef() has a trouble accommodating integer variables when using varying slopes, i.e. if x3 is integer in species[x3]?

See:

library(fixest)

base_vs = iris 
names(base_vs) = c(paste0("x", 1:4), "species")
base_vs$x5 <- as.integer(round(base_vs$x3))

out_double <- feols(x1 ~ x2 | species[x3], base_vs)
out_integer <- feols(x1 ~ x2 | species[x5], base_vs)

out <- fixef(out_double)
fixef(out_integer)
#> Error in cpp_demean(y = S, X_raw = 0, r_weights = 0, iterMax = 1000L, : REAL() can only be applied to a 'numeric', not a 'integer'

Created on 2020-07-19 by the reprex package (v0.3.0)

[Request] Allowing custom leads and lags in interact()

Problem

The interact function is very handy for differences in differences setups and allows to quickly plot the estimated coefficients with coefplot. However, by default the function interacts every values of the fe parameter. This is problematic when one wants to have only some leads and lags.

Example

The basic diff in diff setup presented in the vignette is

data(base_did)
feols(y ~ x1 + treat::period(5) | id + period, base_did)

However, one often wants to set custom number of leads and lags. For instance one would want to have only two pre-treatment coefficients (ie 2 leads) and two post-treatment coefficients (ie 2 lags).

What I tried

I tried to create a third dummy (hereby called window) that is equal to 1 if the observation is treated AND in the focus window such that the interaction between the window and period gives the right set of dummies. However it creates interaction even when window is equal to 0 and I end up with collinearity error.

Code to reproduce :

library(dplyr)
df <- mutate(base_did, window = ifelse(period > 2 & period < 8 & treat, 1, 0))
est <- feols(y ~ window::period | id + period, df)
#> ...
#> "Presence of collinearity, covariance not defined. Use function collinearity() to pinpoint the problems."
collinearity(est)
# [1] "Variables 'window:period::1', 'window:period::2', 'window:period::8', 'window:period::9' and 'window:period::10' are constant, thus collinear with the fixed-effects."

My hacky solution

The solution is simply to filter out the null dummies when I interact window and period.

library(dplyr)
data("base_did")
df <- base_did %>%
  mutate(window = ifelse(period > 2 & period < 8 & treat, 1, 0)) %>%
  mutate(long_term = ifelse(period >= 8 & treat, 1, 0))
coefs <- df %$% interact(window, period) %>% {.[,colSums(.) !=0]}
colnames(coefs) <- gsub("\\:", "_", colnames(coefs))
dummies <- colnames(coefs) %>%
  paste(collapse = "+")
df <- cbind(df, coefs)
frmla <- as.formula(paste("y ~ x1+", dummies, " + long_term | id + period"))
feols(frmla, df)

result:

OLS estimation, Dep. Var.: y
Observations: 1,080 
Fixed-effects: id: 108,  period: 10
Standard-errors: Clustered (id) 
                  Estimate Std. Error   z value  Pr(>|z|)    
x1                0.975347   0.046274 21.078000 < 2.2e-16 ***
window_period__3  1.049600   0.935761  1.121700  0.262288    
window_period__4 -0.473162   0.969770 -0.487912  0.625724    
window_period__5  1.324300   0.947332  1.397900  0.162461    
window_period__6  2.108000   0.874269  2.411200  0.016088 *  
window_period__7  4.921800   0.790574  6.225600  7.17e-10 ***
long_term         6.373900   0.696438  9.152200 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-likelihood: -2,988.30   Adj. R2: 0.48591 
                          R2-Within: 0.38541 

Proposed interface

  • Adding explicit lead and lag options. (prefered solution)
    Maybe lead and lag are too economics-centered and using preand post is clearer.
feols(y ~ treat::period(ref=5, lead=2, lag=2) | id + period, df)

# Non numeric variable
all_months = c("aug", "sept", "oct", "nov", "dec", "jan",
                "feb", "mar", "apr", "may", "jun", "jul")
base_did$period_month = all_months[base_did$period]
feols(y ~ x1 + i(treat, period_month, "oct", lead=c("aug", "sep"), lag=c("nov", "dec")) | id+period, base_inter)
  • Adding a range parameter to the interact function.
feols(y ~ treat::period(range=3:7) | id + period, df)
  • Auto guess and filtering of null interactions
feols(y ~ window::period | id + period, df)

Final words

I can try implementing the function with a little help from the developer.

undefined reference error

Hi,

thanks for setting up the FIXEST package, it's really a huge time saver and intuitive to use!

I just wanted to update to the latest github version using devtools::install_github(), unfortunately I get the following errors:

RcppExports.o:RcppExports.cpp:(.text+0x1b26): undefined reference to `get_nb_threads()'

RcppExports.o:RcppExports.cpp:(.text+0x1d49): undefined reference to `setup_fork_presence()'

RcppExports.o:RcppExports.cpp:(.text+0x1e89): undefined reference to `is_in_fork()'
...
ERROR: compilation failed for package 'fixest'

I suspect the errors may be related to the recent changes in "RcppExports.R"?

I tried to update to the latest version on two different devices one running on windows and the other on linux, resulting in the same error.

Any plan to offer a python package

This is not an issue. But I am wondering do you have any plan to offer a python package for fixest. I feel that most econometric estimation packages are provided in R.

But when it comes to large companies that manage big data analytics, support for python is much greater than that for R.

Just a thought.

Partial Match Warning

When you run the following in R:

options("warnPartialMatchDollar" = TRUE)

And the you use the feglm() function in this package to fit a model which you assign to, say, m1.

Then when you call:

summary(m1)

you get the following warning message:

Warning message:
partial match of 'score' to 'scores'

[Question]: Possible to have different data & clusterings in same etable?

Howdy,

I have two specifications, A and B, that use different identification strategies on slightly different cuts of data. For A I want to cluster on C_a and for B I want to cluster on C_b. Is there a way using etable and esttex that can have specifications {A,B} side-by-side with my desired clustering?

Currently, I do the following example (or esttex) and combine later.

Example:
etable( feols(yvar ~ xvar | FE1A + FE2A), df[df$spec_a==1,]) ,cluster = "C_a", digits = 3)
etable( feols(yvar ~ xvar | FE1B + FE2B), df[df$spec_b==1,]) ,cluster = "C_b", digits = 3)

Looking for (?):
etable( list(
feols(yvar ~ xvar | FE1A + FE2A), df[df$spec_a==1,]),
feols(yvar ~ xvar | FE1B + FE2B), df[df$spec_b==1,])
),
cluster = c("C_a", "C_b"), digits = 3)

Thanks for all help!
Best,
Luke

Bootstrap standard errors

Hi,

First of all, thanks a lot for your incredibly useful package. I have a minor issue with it that I would like to report. Any help is much appreciated.

I am using the function feglm and I need bootstrap standard errors (BSE). I have programmed up the bootstrap routine myself. My problem is that I don't manage to display the BSE with etable(). Below my code:

n.Boot <- 1000
frm <- y ~ x | i + j
out <- feglm(frm, data = df, notes = FALSE)
B.se <- Run_bootstrap(n.Boot,frm) # My bootstrap routine

where df contains my data. I have then tried to substitute my own BSE with those provided by feglm as follows:

out[["se"]] <- B.se
out[["coeftable"]][["Std. Error"]] <- B.se

and then do

etable(out)

However, this is not producing the desired result. I would like to ask you if there is any way to address my point (maybe another function for displaying results besides etable or stargazer?)

I thank you in advance for your help.

Below my system and package info:

package: fixest 0.4.0 2020-03-29 [1] CRAN (R 3.6.3)
version: R version 3.6.3 (2020-02-29)
os : Windows 10 x64
system: x86_64, mingw32
ui : RStudio

`update.fixest` has issues when another object has the same name as input data

I noticed that update.fixest has issues when the data used to estimate is not in scope. If there are no other objects with the same name, you get a helpful error message, but if there are, it's more confusing.

Here's a demo where df stops referring to the trade data and starts referring to the stats::df function, causing trouble.

library(fixest)
fn <- function() {
  df <- trade
  feglm(Euros ~ log(dist_km)|Origin+Destination+Product+Year, data=df)
}

x <- fn()
r2(x)
#> Error in if (n_old != n_new) {: argument is of length zero

Created on 2019-11-22 by the reprex package (v0.3.0)

If I hadn't called the data df but rather trade2, I would have gotten the error:

Error in update.fixest(x, . ~ 1 | ., glm.tol = 0.01, fixef.tol = 0.001,  : 
  To apply 'update.fixest', we fetch the original database in the parent.frame -- but it doesn't seem to be there anymore (btw it was trade2).

The issue is here:

fixest/R/MiscFuns.R

Lines 6537 to 6542 in 9abf86b

try(data <- eval(object$call$data, parent.frame(nframes)), silent = TRUE)
if(is.null(data)){
dataName = object$call$data
stop("To apply 'update.fixest', we fetch the original database in the parent.frame -- but it doesn't seem to be there anymore (btw it was ", deparse_long(dataName), ").")
}

This would also be a problem if people had multiple datasets with the same name, and were using function boundaries to keep them straight. One solution is to have the model object hold onto a copy of the data, which felm and lm do, but that can be its own headache for memory use.

Get no. of clusters (G)

Hi Laurent,

One nice thing that the reghdfe package in Stata does is display the number of clusters. For example, here's what you get if you run the model from your "standard errors" vignette. (Note that I'm clustering twoways here just to demonstrate a slightly more complicated case.)

reghdfe

So we actually see quite a bit of information about how things were clustered and the associated DoF adjustments (including the bottom panel, which is pretty detailed). For the purposes of this issue, though I'm mostly interested in the line that tells us:

(Std. Err. adjusted for **4** clusters in id time).

Sticking with the formal definition, I guess we would refer to the number 4 as min(G).

Is it possible to retrieve this same information with fixest? I know that we can get quite a bit of information with, say, attributes(vcov(est)), but I can't see how to get this G number. (Apologies if I missed it!)

FWIW, I'm mostly asking because I frequently see this information presented at the bottom of regression tables in econ journals... and even more frequently get asked "how big my effective sample is" during presentations. It would be very helpful have it available in the return model object... or even added to etable().

Thanks, as ever.

Add errors/CI to predict.fixest?

Hi, this package is great. I'm wondering if it would be possible to add standard errors/confidence intervals to the output of predict.fixest? I realize that might be quite a bit of work, though.

Related issue in lfe repo. This would also add to fixest's incorporation into the margins package, per this issue.

Request: Estimate Probit models and ability to change default standard errors.

The fixest package is great and has improved my workflow and saved me a lot of computation time. But I was wondering if there is any way to estimate a probit binary response model? At the moment I am using the R package bife. But it would be nice to be able to do this using fixest, as I can do with logit models.

Furthermore, I was wondering if it was possible to change the default standard errors when estimating a model. For example

set.seed(12)
Bond = runif(100, 0, 0.55)
Day = runif(100, 0, 1.45)
Intervention = round(Day)
Type = round(Bond)
Agency = round(Bond + Day)

Data <- data.table(Agency, Bond, Day, Intervention, Type)

feols(Agency ~ Bond + Day | Intervention + Type , Data)

Gives the following output

OLS estimation, Dep. Var.: Agency
Observations: 100 
Fixed-effects: Intervention: 2,  Type: 2
Standard-errors: Clustered (Intervention) 
     Estimate Std. Error t value Pr(>|t|)    
Bond 1.339600   0.324546  4.1276 0.000079 ***
Day  0.956392   0.269115  3.5538 0.000593 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-likelihood: -13.21   Adj. R2: 0.65434 
                       R2-Within: 0.52428 

Where the standard errors are clustered at the intervention level . However, with this is not what I necessarily want. In large datasets it also would seem to have a large effect on the computation time relative to the usual standard errors. Could you provide a way of overriding this behavior so I can calculate only the usual standard errors?

Error with coefplot

Hi,
I am trying to plot a model with coefplot and I get this error:

Error in coefplot_prms(object = object, ..., sd = sd, ci_low = ci_low, :
Internal error regarding the lengths of vectors of coefficients. Could you report to the author of the fixest package?

Do you have any idea what is going on?

Weights with Fixest

When specifying weights with an feols model are these weights similar to aweights in Stata where each observation is treated as a group mean?

I looked at the estimation code here and it looks like that is the case, but I wanted to double check.

Data used for the estimation

The package is great. Thanks for sharing.
I have one question about the data that is used for the estimation. Suppose I have 140K data, and during the estimation it removed a lot of data due to NA or other reasons. It ended up with around 40K data. Is there a way I can know what this 40K data is?

probitFE and logitFE implementation

Hello,

Thanks for the great package; the speed gains and well-documented vignettes have been a boon.
I was wondering if there are any plans to / there be any interest in an implementation of the Ferndandez-Val - Weidner split-sample/debiasing approach to two-way FEs in probit and logit models in the package.

Clustered SE considerations in feols

The details of clustering and degrees-of-freedom corrections are perennially issues, and challenging for all the issues Sergio pointed out in this thread.

When I run @grantmcdermott's example from that same discussion, feols gives the same results as lfe::felm or Stata's cgmreg, but different than Stata's reghdfe or Grant's proposed felm(..., cmethod="reghdfe").
I'm not sure you want to do anything differently, but it's helpful to document the differences across packages.

Ref: DeclareDesign/estimatr#321, sgaure/lfe#19, FixedEffects/FixedEffectModels.jl#49

[Request] Force Warning for Non-convergence

Howdy,

FEOLS appears not to default issue a warning about non-convergence. I only realized this was an issue when I used verbose and adjusted the iteration limit. Even this does not directly tell me it did not converge, but I can see in the output that it hit its iteration limit.

I believe making non-convergence more salient would help most users.

Thank you for your efforts,
Best,
Luke

Add FixedEffectModels.jl to benchmark

I just came across this package yesterday. Really impressed with both the range of models covered and the implementation speed.

On the subject of speed, I was wondering if you could add some benchmark comparisons for the FixedEffectModels.jl package in Julia. To the best of my knowledge, this is currently the cutting edge for HDFE implementations... at least as far as linear models go. My own quick benchmarks suggest that fixest::feols compares very favourably for medium sized data sets.

In case you're not familiar with Julia code, here's a quick and dirty example for running one iteration of the largest model in your benchmark simulation. I'll cc the package owner @matthieugomez, who probably has better ideas for benchmarking and may also want to add fixest to his own benchmark comparisons.

### PRELIMINARIES ###
## Download files here: https://drive.google.com/drive/folders/1-1M_vLGduByk5P3qHl7AMBBl85RzHpv7?usp=sharing
## Then run `Data generation.R`
### END PRELIMINARIES ###

cd("/path/to/downloaded/files")

using RData

base_all = load("DATA/base_all_simulations_rect.RData", convert=true)
base_all = base_all["base_all"]

using FixedEffectModels

## Limit to the very last replication data frame (1,000,000 x 5)
base = base_all[40]

## Single thread
@time reg(base, @formula(log(y+1)~X1 + fe(dum_1) + fe(dum_2) + fe(dum_3)))

## Multi-threaded
@time reg(base, @formula(log(y+1)~X1 + fe(dum_1) + fe(dum_2) + fe(dum_3)), method = :lsmr_threads)

## GPU (requires working CUDA installation)
base.X1 = convert(Vector{Float32}, base[!, :X1]) ## Optional for performance boost
@time reg(base, @formula(log(y+1)~X1 + fe(dum_1) + fe(dum_2) + fe(dum_3)), method = :lsmr_gpu)

esttex "drop" argument too aggressive

Subtitle: Ignore variables with dictionary alias

Hi Laurent,

I've been trying to export a subset of my regression results where (a) I'm only really interested in some interaction terms, and (b) I further wish to rename these interaction terms with a dictionary alias. My problem is that in trying to drop the parent terms using the "drop" argument of esttex(), I invariably end of excluding the interactions variables of interest too... even when I have assigned them a dictionary alias.

An example may help to better illustrate:

library(fixest)

aq = airquality
aq$Month = factor(aq$Month)

mod = feols(Ozone ~ Month / Wind, data = aq)
#> NOTE: 37 observations removed because of NA values (Breakup: LHS: 37, RHS: 0).

myDict = c("Month5:Wind"="May * Wind", "Month6:Wind"="Jun * Wind", 
           "Month7:Wind"="Jul * Wind", "Month8:Wind"="Aug * Wind",
           "Month9:Wind"="Sep * Wind")

## Keep everything, including parent terms (but rename interaction terms)
esttex(mod, dict = myDict)
#> \begin{table}[htbp]\centering
#> \caption{no title}
#> \begin{tabular}{lc}
#>  & \tabularnewline
#> \hline
#> \hline
#> Dependent Variable:&Ozone\\
#> Model:&(1)\\
#> \hline
#> \emph{Variables}\tabularnewline
#> (Intercept)&50.748$^{***}$\\
#>   &(15.748)\\
#> Month6&-41.793\\
#>   &(31.148)\\
#> Month7&68.296$^{***}$\\
#>   &(20.995)\\
#> Month8&82.211$^{***}$\\
#>   &(20.314)\\
#> Month9&23.439\\
#>   &(20.663)\\
#> May * Wind&-2.3681$^{*}$\\
#>   &(1.3162)\\
#> Jun * Wind&1.6825\\
#>   &(2.1141)\\
#> Jul * Wind&-7.0314$^{***}$\\
#>   &(1.5399)\\
#> Aug * Wind&-8.5224$^{***}$\\
#>   &(1.4015)\\
#> Sep * Wind&-4.2418$^{***}$\\
#>   &(1.2575)\\
#> \hline
#> \emph{Fit statistics}&  \\
#> Observations& 116\\
#> R$^2$ & 0.54732\\
#> Adjusted R$^2$ & 0.50889\\
#> \hline
#> \hline
#> \multicolumn{2}{l}{\emph{Normal standard-errors in parentheses. Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
#> \end{tabular}
#> \end{table}

## Try to drop only parent terms, but ends up excluding everything (including
## interactions renamed with a dictionary alias) 
esttex(mod, dict = myDict, drop = paste0("Month", 5:9))
#> \begin{table}[htbp]\centering
#> \caption{no title}
#> \begin{tabular}{lc}
#>  & \tabularnewline
#> \hline
#> \hline
#> Dependent Variable:&Ozone\\
#> Model:&(1)\\
#> \hline
#> \emph{Variables}\tabularnewline
#> (Intercept)&50.748$^{***}$\\
#>   &(15.748)\\
#> \hline
#> \emph{Fit statistics}&  \\
#> Observations& 116\\
#> R$^2$ & 0.54732\\
#> Adjusted R$^2$ & 0.50889\\
#> \hline
#> \hline
#> \multicolumn{2}{l}{\emph{Normal standard-errors in parentheses. Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
#> \end{tabular}
#> \end{table}

Created on 2019-12-29 by the reprex package (v0.3.0)

I know that in the vignette you write: "Note that the actions performed by the arguments drop or order are performed before the renaming takes place with the argument dict."

However, I'm wondering if it is possible to switch things around so that drop respects the assigned aliases? Failing that, is there a way to adjust drop's regexp so that it isn't so aggressive (such that it would allow me to keep the child interaction terms in the above example)?

PS โ€” I have a second question about the dict argument, but I'll file that as a separate issue shortly.

Escaping special characters with etable.

Hi and thanks for such an amazing package. There seems to be an issue with escaping special characters in the etable function when the TEX option is set to true.

#Load the libraries
library(data.table)
library(fixest)

#Set the seed
set.seed(1)
 
X <- rep(c(1,1,2,3), 10) + rnorm(10*4,0,1)
a = rep(c("b","b","a", "c"), 10)
Y = rep(c(3,3,2,1.1), 10) + 0.5*X + rnorm(10*4,0,0.01)

#Create the data
DT = data.table(
  ID_var = a,
  Y,
  X
)

#Some regressions
Reg_1 <- feols(Y~X, DT)
Reg_2 <- feols(Y~X|ID_var, DT)

#Export the regressions
etable(Reg_1, Reg_2, tex = TRUE, fixef_sizes = T, se = "cluster", cluster = "ID_var")

I receive the following output.

\begin{table}[htbp]\centering
\caption{no title}
\begin{tabular}{lcc}
 & & \tabularnewline
\hline
\hline
Dependent Variable:&\multicolumn{2}{c}{Y}\\
Model:&(1)&(2)\\
\hline
\emph{Variables}\tabularnewline
(Intercept)&3.1110$^{***}$&  \\
  &(0.2996)&  \\
X&0.0468&0.5024$^{***}$\\
  &(0.0975)&(0.0023)\\
\hline
\emph{Fixed-Effects}&  & \\
ID\_var&No&Yes\\
\hline
\emph{Fit statistics}&  & \\
Observations& 40&40\\
# ID\_var&--&3\\
R$^2$ & 0.01063&0.99975\\
Within R$^2$ & --&0.99958\\
\hline
\hline
\multicolumn{3}{l}{\emph{One-way (ID_var) standard-errors in parentheses.}}\\
\multicolumn{3}{l}{\emph{Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
\end{tabular}
\end{table}

Notice that ID_var in the clustered error variable in the fourth last line of the table should be escaped as ID\_var. More importantly the # in the fit statistics section of the table, under the Observations variable, also not escaped. This bug persists even if one supplies a file instead of setting tex = TRUE.

Thanks again.

Inconsistent treatment of NA in fixed effects

Having NA in fixed effects normally produces a warning and drops the NA observations, but this does not happen when the fixed effects are specified with ^. Here is an example:

library(fixest)
data(base_did)

base_did <- as_tibble(base_did) %>%
    mutate(
        # create a category with some NAs
        icat = ifelse(id < 20, 0, ifelse(id < 30, NA, 1)),
        # create another "fixed effect"
        period2 = period %% 2
    )

table(base_did %>% select(icat, period2), useNA = 'always')
#         period2
#  icat      0    1  <NA>
#    0      95   95     0
#    1     395  395     0
#    <NA>   50   50     0

mdl <- feols(y ~ treat:post | id + period, data = base_did)
summary(mdl)
mdl$nobs  # 1,080

# the following produces a nice warning due to missing values of icat
# and drops the 100 rows with missing data
mdl <- feols(y ~ treat:post | id + period + icat, data = base_did)
# NOTE: 100 observations removed because of NA values (Breakup: LHS: 0, RHS: 0, Fixed-effects: 100).
summary(mdl)
mdl$nobs  # 980

# this surprised me; I expected the 100 observations with NAs to be dropped again 
mdl <- feols(y ~ treat:post | id + period + icat^period2, data = base_did)
summary(mdl)
mdl$nobs  # 1,080 again

I'm running version 0.6.0 with R 4.0.1 on Ubuntu 20.04.

demean error message

Hi! Thanks for the great package! I'm using demean function for my analysis and I think it works fine but throws some random error message that changes every time (or sometimes not even showing up). Practically, when I implement this within Rmarkdown it stops the compiling so as a get-around I save it to a separate file using R script and load the data into the rmd but obviously this is not the most efficient way. Here's a reproducible example, which is a subset of my actual data.

test <- structure(list(ln_dmg = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8.59600720786391, 
0, 9.71410729937478, 0, 0, 12.1270302478998, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10.2449188370322, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 7.60692111895402, 0, 0, 0, 0, 0, 0, 0, 0, 
0), comm_id = c("060001", "060001", "060001", "060001", "060001", 
"060001", "060001", "060001", "060001", "060001", "060001", "060001", 
"060001", "060001", "060001", "060001", "060001", "060001", "060001", 
"060001", "060001", "060001", "060001", "060001", "060001", "060002", 
"060002", "060002", "060002", "060002", "060002", "060002", "060002", 
"060002", "060002", "060002", "060002", "060002", "060002", "060002", 
"060002", "060002", "060002", "060002", "060002", "060002", "060002", 
"060002", "060002", "060002", "060003", "060003", "060003", "060003", 
"060003", "060003", "060003", "060003", "060003", "060003", "060003", 
"060003", "060003", "060003", "060003", "060003", "060003", "060003", 
"060003", "060003", "060003", "060003", "060003", "060003", "060003", 
"060004", "060004", "060004", "060004", "060004", "060004", "060004", 
"060004", "060004", "060004", "060004", "060004", "060004", "060004", 
"060004", "060004", "060004", "060004", "060004", "060004", "060004", 
"060004", "060004", "060004", "060004"), trans_year = c(1983, 
1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 
1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 
2006, 2007, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 
1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 
2003, 2004, 2005, 2006, 2007, 1983, 1984, 1985, 1986, 1987, 1988, 
1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 
2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 1983, 1984, 1985, 
1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 
1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007
), event_id = c(9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 
9999, 9999, 949, 9999, 968, 9999, 9999, 611, 9999, 9999, 9999, 
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 611, 
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 
9999, 9999, 611, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 
9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 
949, 9999, 9999, 9999, 9999, 611, 9999, 9999, 9999, 9999, 9999, 
9999, 9999, 9999, 9999)), row.names = c(NA, -100L), class = "data.frame")
Y_demean = fixest::demean(X = test[, "ln_dmg"],
                          fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then -9
Error in (function (env, objName, computeSize = TRUE)  : 
  R_Reprotect: only 2 protected items, can't reprotect index -3

As mentioned earlier, the error message changes each time I run this code. Here are a few of them. As you can see, sometimes it runs without any error message. Oh and importantly, they produce the same results though.

> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+                           fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then -8
Error in (function (env, objName, computeSize = TRUE)  : 
  R_Reprotect: only 3 protected items, can't reprotect index -2
> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+                           fe = test[, c("comm_id", "trans_year", "event_id")])
> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+                           fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then 5
> Y_demean = fixest::demean(X = test[, "ln_dmg"],
+                           fe = test[, c("comm_id", "trans_year", "event_id")])
Warning: stack imbalance in '=', 2 then 1

I would have ignored this if it wasn't for rmd so this is not as important issue for others (esp who don't use rmd) but thought that it wouldn't hurt to report. Please let me know any other info is needed. Thanks!!

fixest standard error adjustment

Love this package. The performance is an absolute game changer. However I noticed some large differences between lfe and fixest when clustering. Hereโ€™s some sample code I wrote generated from a thread examining how lfe was too conservative relative to Stata, which is now 'fixed'. sgaure/lfe#1

I wonder if it's worth aligning the standard errors here.

Here's sample code that replicates the issue


# Generate 500 FE levels, with 5 obs each, nested within 50 clusters
n_clusters <- 50
n_obs_per_fe_level <- 5
n_fe_levels <- 500
n_obs <- n_obs_per_fe_level * n_fe_levels
n_fe_per_cluster <- n_fe_levels / n_clusters
n_obs_per_cluster <- n_obs / n_clusters

# Make 500 FE levels
fe <- sort(rep(1:n_fe_levels, length.out = n_obs))
# Make clusters that nest FE levels
cl <- ceiling(fe / n_fe_per_cluster)


draw_one_cluster <- function(N) {
  # Each cluster has its own mean and se for epsilon.
  cluster_eps_mean <- rnorm(1)
  cluster_eps_sd <- 10 * runif(1)
  rnorm(N, mean=cluster_eps_mean, sd = cluster_eps_sd)
}
eps <- unlist(
  replicate(n_clusters, draw_one_cluster(n_obs_per_cluster), simplify = FALSE)
)

x <- rnorm(n_obs, 0, 10)
y <- 0.02 * x + fe / 100 + eps

DF <- data.frame(cl, y, x, fe)

# haven::write_dta(DF, "~/test_data.dta")

mod = lfe::felm(y ~ x | fe | 0 | cl, data=DF)
summary(mod)

mod = fixest::feols(y ~ x | fe , data=DF)
summary(mod,cluster=DF$cl)

double free or corruption (out) [1] 10004 abort (core dumped) on `feglm` with `poisson`

Thanks for the wonderful package.

I am getting an unpredictable core dump when trying to run lots of poisson feglm regressions.

Basically I run lots of felm from lfe regressions and feglm regressions with the poisson distribution, using high dimensional fixed effects.

Sometimes my scrips runs all the way through, but after a few successful runs I get

double free or corruption (out)
[1]    10004 abort (core dumped)  radian

(I'm running R using radian in the terminal).

An example of a regression is this:

  target_poisson = feglm(prob ~ as.factor(counter_month):ratio | factor(stay_counter_month_dummy) + factor(home_counter_month_dummy), data = panel_work)

The data is a balanced panel of 10 months, where each observation in the panel is a "home" and a "stay" location with 275 locations. So the data set is 756,000 observations in total. As you can see, the fixed effects for stay location by month dummies and the home location by month dummies.

I think there are a lot of observations with no variation across the panel and perfectly collinear FEs. There are also a lot of NaNs so clearly there are a lot of sources of error for this regression.

I think I'm doing a good job making sure I'm putting things in functions and not allocating extra memory. But it's hard to debug to see if that is the real issue. See this thread on discourse.

Can I email you the data set as an MWE?

Here is the full session info:

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] R.utils_2.9.2     R.oo_1.23.0       R.methodsS3_1.8.0 lfe_2.8-5.1       Matrix_1.2-18    
 [6] fixest_0.6.0      data.table_1.12.8 lubridate_1.7.9   forcats_0.5.0     stringr_1.4.0    
[11] dplyr_1.0.0       purrr_0.3.4       readr_1.3.1       tidyr_1.1.0       tibble_3.0.3     
[16] ggplot2_3.3.2     tidyverse_1.3.0  

loaded via a namespace (and not attached):
 [1] zoo_1.8-8           tidyselect_1.1.0    haven_2.3.1         lattice_0.20-41     colorspace_1.4-1   
 [6] vctrs_0.3.2         generics_0.0.2      blob_1.2.1          rlang_0.4.7         pillar_1.4.6       
[11] glue_1.4.1          withr_2.2.0         DBI_1.1.0           dbplyr_1.4.4        modelr_0.1.8       
[16] readxl_1.3.1        lifecycle_0.2.0     munsell_0.5.0       gtable_0.3.0        cellranger_1.1.0   
[21] rvest_0.3.5         parallel_3.6.2      fansi_0.4.1         broom_0.7.0         Rcpp_1.0.5         
[26] xtable_1.8-4        scales_1.1.1        backports_1.1.8     jsonlite_1.7.0      fs_1.4.2           
[31] hms_0.5.3           stringi_1.4.6       numDeriv_2016.8-1.1 grid_3.6.2          cli_2.0.2          
[36] tools_3.6.2         sandwich_2.5-1      magrittr_1.5        Formula_1.2-3       crayon_1.3.4       
[41] pkgconfig_2.0.3     ellipsis_0.3.1      dreamerr_1.2.0      xml2_1.3.2          reprex_0.3.0       
[46] assertthat_0.2.1    httr_1.4.2          rstudioapi_0.11     R6_2.4.1            nlme_3.1-148       
[51] compiler_3.6.2     

Different results with xtreg from Stata?

When I tried including one fix effect, I found that feols and xtreg from Stata gave totally different results (the sign of the coefficient is opposite). Can you explain why? Are they using different optimization methods?

Feature request: broom methods for fixest models

Thanks for the great package!

It would be convenient (for me, hopefully for others) if fixest had tidy.fixest, glance.fixest, and augment.fixest methods. The broom docs provide recommendations for adding these methods: https://broom.tidyverse.org/articles/adding-tidiers.html

library(broom)
library(fixest)

tidy(lm(mpg ~ wt, data=mtcars))
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    37.3      1.88      19.9  8.24e-19
#> 2 wt             -5.34     0.559     -9.56 1.29e-10
tidy(feols(mpg ~ wt, data=mtcars))
#> Error: No tidy method for objects of class fixest

Created on 2019-11-20 by the reprex package (v0.3.0)

What do you think?
If anyone else wants to tackle this, please feel free. If not, I can submit a PR sometime in the next few weeks.

Output from esttex function doesn't create a working table.

I noticed that there is a bug with the esttex function when there are multiple clustered standard errors. The table it creates does not compile without modification in Latex.

rm(list = ls(all.names = TRUE))
gc()

#Load the libraries
library(ggplot2)
library(tsibble)#yearweek function.
library(data.table)
library(zoo)
library(Hmisc) #For weighted medians functions.
library(fixest)#For the fixed effects regressions
setFixest_nthreads(6)#Set the number of threads to use

#Load in the data.
Reg_data <- fread(paste0(data_analysis_path, "Trade_cost_reg.csv"))
Reg_data[, `:=`(TRD_EXCTN_DT = as.IDate(TRD_EXCTN_DT), Year_month = as.yearmon(Year_month), Year_week = yearweek(Year_week))]

#Want to set the dictionary for exporting tables nicely to latex
setFixest_dict(c( "CUSIP_ID & Year_month & Actual_trade_party" = "Bond + Period (Monthly) + Intermediary", CUSIP_ID = "Bond", "Year_month" = "Monthly time", "Year_week" = "Weekly time", "log(ENTRD_VOL_QT)" = "log(Volume)", "Actual_trade_party" = "Intermediary",
                 "Contra_S" = "Selling party", "Contra_B" = "Buying party", "Investment_gradeYes" = "Investment grade", "log(Contra_trade_volume)" = "log(Mean contra volume)",
                  "Last_month_portion_trds" = "Relationship measure", "Last_month_portion_trds_S" = "Relationship measure_S", "Last_month_portion_trds_B" = "Relationship measure_B", "Trade_costs" = "Trade costs", "CUSIP_ID^Year_month" = "Bond x period (monthly)"))

Bond_times_mon_fe <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT)  + log(Contra_trade_volume) | CUSIP_ID^Year_month, data = Reg_data)

esttex(Bond_times_mon_fe, se = "threeway", cluster = c("CUSIP_ID", "Year_month",  "Actual_trade_party"))

The output I get from this is

\begin{table}[htbp]\centering
\caption{no title}
\begin{tabular}{lc}
 & \tabularnewline
\hline
\hline
Dependent Variable:&Trade costs\\
Model:&(1)\\
\hline
\emph{Variables}\tabularnewline
Relationship measure&-0.2777$^{**}$\\
  &(0.1136)\\
log(Volume)&-0.0157$^{**}$\\
  &(0.0069)\\
log(Mean contra volume)&0.0023\\
  &(0.0031)\\
\hline
\emph{Fixed-Effects}&  \\
Bond x period (monthly)&Yes\\
\hline
\emph{Fit statistics}&  \\
Observations& 6,933,998\\
R$^2$ & 0.40135\\
Within R$^2$ & 0.00819\\
\hline
\hline
\multicolumn{2}{l}{\emph{Three-way (Bond & Monthly time & Intermediary) standard-errors in parentheses.}}\\
\multicolumn{2}{l}{\emph{Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
\end{tabular}
\end{table}

The issue is the

\multicolumn{2}{l}{\emph{Three-way (Bond & Monthly time & Intermediary) standard-errors in parentheses.}}\\

Here Latex uses the & as a column separator. Ideally, I would like to change this to a + or \& but haven't been able to find a way to do this, other than changing the output by hand.

Furthermore, if I don't specify the "CUSIP_ID^Year_month" = "Bond x period (monthly)" in the dictionary the default table will contain CUSIP_ID^Year_month. This must be modified as ^ is math symbol in Latex. It is preferable that the default table the function creates complies without error. Could you make is so that this is changed to x automatically?

R version: 3.6.2

Thanks.

Possible to reduce size of fixest models?

Thanks for the great package.

Is there a way to specify which objects to return from a fixest model?

For example can I choose to return the fitted values, sum of the fixed effects or matrix of the scores or not?

I am running a feols model with about 40 million observations that looks like this

est_did_emp_500_col1 = feols(log(employment) ~ small_business_500:post | small_business_500 + post, data=dfr_emp_ind_sample)

I am finding that the returned model objects take up a lot of space.

object.size(est_did_emp_500_col1)
1962264560 bytes

I looked at the size of each component:

lapply(est_did_emp_500_col1,object.size)

$nobs
56 bytes

$fml
1176 bytes

$call
1736 bytes

$method
112 bytes

$fml_full
1736 bytes

$nparams
56 bytes

$fixef_vars
200 bytes

$fixef_id
392379704 bytes

$fixef_sizes
368 bytes

$obsRemoved
358296 bytes

$iterations
56 bytes

$coefficients
304 bytes

$residuals
392378704 bytes

$multicol
56 bytes

$sumFE
392378704 bytes

$fitted.values
392378704 bytes

$scores
392378872 bytes

$hessian
224 bytes

$sigma2
56 bytes

$cov.unscaled
672 bytes

$coeftable
1456 bytes

$se
536 bytes

$sq.cor
56 bytes

$ssr_null
56 bytes

$ll_null
56 bytes

$ssr_fe_only
56 bytes

$ll_fe_only
56 bytes

Are there other good ways to reduce the size of these models?

Thanks.

`tidy` and `glance` methods

"tidiers" are a set of methods defined by the generics and broom packages to extract information from model objects in a standardized fashion.

https://broom.tidymodels.org/

A tidy.fixest method would return a data.frame with one row per term, and columns with standard names: term, estimate, std.error, statistic, p.value, conf.low, conf.high

A glance.method method would return a data.frame with a single row, and columns for the most important goodness-of-fit statistics and model characteristics.

There are two main benefits of supplying tidiers in a package:

  1. Users do not have to learn the idiosyncrasies of the internal structure of each kind of model object they use. This allows analysts to interact with a bunch of models programmatically in a predicable fashion.
  2. External packages which rely on tidy and glance, such as my modelsummary package, would support fixest automatically.

As a stopgap measure, I have implemented tidy and a glance methods inside modelsummary: https://github.com/vincentarelbundock/modelsummary/blob/master/R/tidiers.R#L106

Ideally, these methods would be ported over to fixest itself, as is recommended by the broom and tidymodels conventions.

Would you be interested in a Pull Request to add these methods to your package?

This would allow stuff like this:

remotes::install_github('vincentarelbundock/modelsummary')

library(fixest)
library(modelsummary)

url <- 'https://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv'
dat <- read.csv(url)

mod <- list()
mod[[1]] <- lm(wage ~ emp + capital, dat)
mod[[2]] <- feols(wage ~ emp + capital | firm, dat)
mod[[3]] <- feols(wage ~ emp + capital | year, dat)
mod[[4]] <- feols(wage ~ emp + capital | firm + year, dat)

msummary(mod, title = "tidy and glance functions allow easy programmatic manipulation of fixest objects. In turn, this allows packages like modelsummary to support fixest out-of-the-box.")

Which produces an HTML or a LaTeX table like this one:

latextable

Include mean of dependent variable as a row in "etable"

Hi Laurent,

In my regression tables, I'd like to have a row at the bottom indicating the mean of the dependent variable (along with the currently supported number of observations and r squared). Is there a way etable could support this feature?

fixef doesn't handle integer variables on varying slopes

Hi (bonjour),

fixef returns an error when the varying slope variable is an integer. See the code below :
##your example
base_vs = iris
names(base_vs) = c(paste0("x", 1:4), "species")
est_vs = feols(x1 ~ x2 | species[x3], base_vs)
fixef(est_vs)
##the example which leads to an error
base_vs$x4=as.integer(base_vs$x3)
est_vs = feols(x1 ~ x2 | species[x4], base_vs)
fixef(est_vs)
##one way to solve it
base_vs$x5=as.numeric(as.integer(base_vs$x3))
est_vs = feols(x1 ~ x2 | species[x5], base_vs)
fixef(est_vs)

Bonne journรฉe,

Q: reference for adjusted within R2

Likely not an issue, rather a question: Do you happen to know a reference for the adjusted R2 of the within model? I have this line in mind:
res[i] = 1 - cpp_ssq(resid(x)) / ssr_fe_only * (n - nb_fe) / (n - nb_fe - df_k)

Stata seems to calculate the adj. R2 in the within case as per the "standard" formula, see this post where someone reworked the Stata formula used: Stata forum

For the one-way fixed effect on Grunfeld data (200 obs) I get with fixest:

data("Grunfeld", package = "plm")
fe <- feols(inv ~ value + capital | firm, data = Grunfeld)
r2(fe)
# war2 = 0.7642763

With package lfe I get a different adjusted R2:

summary(felm(inv ~ value + capital | firm, data = Grunfeld))
# Adjusted R-squared: 0.7531104211

felm seems to use the usual formula for the adjusted R2, without any adjustment in the numerator for the (missing) intercept or absorbed effects.

Marginal effects and standard errors for probit/logit models

Hi, I like the package. Is there a way to get the marginal effects for the probit/logit models and their standard errors? I couldn't find it.

Can you add functions to more easily recover the marginal effects? I'd suggest either 1) at the mean for each variable or 2) calculate the marginal effect for every observation and then take the mean. And for the standard errors I guess use the delta method? Thanks!

Another ^ bug, now with predict.fixest

predict.fixest returns a warning and preditcs all NAs when fixed-effects/cluster_vars are used with a "^" (observed for a feglm):

Annotation

I tinkered with line 5627 of MiscFuns.R as parse did not handle the "^" by doing a gsub of ^ to , and wrapping it in "paste(...,sep='_')" to make the fixed effects "match" those produced by model fit. This caused cluster_current_unik to correctly return an empty set on line 5628. But there's obviously more to this problem as various other parses involving "^" happen down the line.

I'm rambling (hopefully helpfully). But for now, the only workaround I can think of is to create a new factor variable with paste0(factor1,"",factor2,""...).

Really enjoying this package!

Guidelines for bug submission

First thanks for the interest in the package and for wanting to submit a bug report. This feedback is invaluable since it significantly improves software quality, benefiting to all users.

Here's a few guidelines before submitting a bug:

  1. Bugs are corrected at almost every release. So ensure you have the latest version of the software running. Currently it's Version .

  2. Every time a bug is corrected, it's reported in the NEWS file. Please have a look at it to see whether it isn't already fixed in the development version.

  3. I know it's some work, but please provide a minimal working example in your report with reproducible code and data. If your bug only happens for some large data set and you're not able to make it appear for data of smaller size, you can DM me the data. A minimal working example ensures that:

    1. you really have encountered a bug and it's not a user-side mistake.
    2. it can be quickly solved.

There are many very good examples of bug reports in past issues (see #4, #10, or #17 to name a few).

compatibility with stargazer / texreg

Here is some texreg code (written by my PhD student). Not sure 100% compatible going forward, but works for me.

 

extract.feols <- function(model, include.nobs = TRUE, include.rsquared = TRUE,
                            include.adjrs = TRUE, include.fstatistic = FALSE, ...) {
    
    s <- summary(model, ...)
    nam <- rownames(s$coeftable)
    co <- s$coeftable[, 1]
    se <- s$coeftable[, 2]
    pval <- s$coeftable[, 4]
    
    gof <- numeric()
    gof.names <- character()
    gof.decimal <- logical()
    
    if (include.nobs == TRUE) {
      gof <- c(gof, s$nobs)
      gof.names <- c(gof.names, "Num.\ obs.")
      gof.decimal <- c(gof.decimal, FALSE)
    }
    if (include.rsquared == TRUE) {
      gof <- c(gof, fixest::r2(s, "r2"), fixest::r2(s, "wr2"))
      gof.names <- c(gof.names, "R$^2$ (full model)", "R$^2$ (proj model)")
      gof.decimal <- c(gof.decimal, TRUE, TRUE)
    }
    if (include.adjrs == TRUE) {
      gof <- c(gof, fixest::r2(s, "ar2"), fixest::r2(s, "war2"))
      gof.names <- c(gof.names, "Adj.\ R$^2$ (full model)",
                     "Adj.\ R$^2$ (proj model)")
      gof.decimal <- c(gof.decimal, TRUE, TRUE)
    }
    
    #     TODO: fixest does not have builtin F-stat. Will have to calculate ourselves
    #     
    #     if (include.fstatistic == TRUE) {
    #         gof <- c(gof, s$F.fstat[1], s$F.fstat[4],
    #             s$P.fstat[length(s$P.fstat) - 1], s$P.fstat[1])
    #         gof.names <- c(gof.names, "F statistic (full model)",
    #             "F (full model): p-value", "F statistic (proj model)",
    #             "F (proj model): p-value")
    #         gof.decimal <- c(gof.decimal, TRUE, TRUE, TRUE, TRUE)
    #     }
    
    tr <- texreg::createTexreg(
      coef.names = nam,
      coef = co,
      se = se,
      pvalues = pval,
      gof.names = gof.names,
      gof = gof,
      gof.decimal = gof.decimal
    )
    return(tr)
  }

``

a bug (I think) in deciding which terms to omit in absorbed fixed effect regressions

Hello,

Thanks for writing such a fantastic package! My coauthor and I use it everyday!

We would like to report a bug (I think). The following two regressions differ only in that "factor(year)" term is included as a regressor or absorbed. They should generate the same results, but unfortunately they don't.

Std. Errors suggest that the first one ("t0") generates correct outcomes. To confirm this conjecture, I ran the same regression using "reghdfe" package in STATA and got the identical results to "t0".

It seems that the bug is in deciding which variables to drop. The results below show that specification "t1" has one more term than "t0". STATA also drops the term just like "t0".

We also noted that even "t0" specification generates incorrect result without adding fixef.tol argument ("fixef.tol = 10000*.Machine$double.eps"). We wonder whether it may be better to raise the default precision level.

We attach the data we used. Could you please look into this when you have time?

Thank you!

library(fixest)

df <- readRDS("data.rds")

t0 <- feols((log(depsum_cz) - log(l(depsum_cz))) ~ cb:ntr_gap:factor(year) + ntr_gap:factor(year) + cb:factor(year) + factor(year) | bank_id + cz90, data=panel(df, panel.id = c('bank_id_cz', 'year')), fixef.tol = 10000*.Machine$double.eps)

t1 <- feols((log(depsum_cz) - log(l(depsum_cz))) ~ cb:ntr_gap:factor(year) + ntr_gap:factor(year) + cb:factor(year) | factor(year) + bank_id + cz90, data=panel(df, panel.id = c('bank_id_cz', 'year')), fixef.tol = 10000*.Machine$double.eps)

summary(t0, cluster="bank_id")
#> OLS estimation, Dep. Var.: (log(depsum_cz) - log(l(depsum_cz, 1)))
#> Observations: 60,816 
#> Fixed-effects: bank_id: 10481,  cz90: 731
#> Standard-errors: Clustered (bank_id) 
#>                              Estimate Std. Error   t value Pr(>|t|) 
#> factor(year)1998:cb          0.044212   0.045045  0.981521 0.326341 
#> factor(year)1999:cb         -0.069912   0.063748 -1.096700 0.272777 
#> factor(year)2000:cb          0.006705   0.051751  0.129565 0.896911 
#> factor(year)2001:cb          0.024838   0.034328  0.723532 0.469356 
summary(t1, cluster="bank_id")
#> OLS estimation, Dep. Var.: (log(depsum_cz) - log(l(depsum_cz, 1)))
#> Observations: 60,816 
#> Fixed-effects: factor(year): 5,  bank_id: 10481,  cz90: 731
#> Standard-errors: Clustered (bank_id) 
#>                              Estimate Std. Error   t value Pr(>|t|)    
#> factor(year)1998:cb         -0.024328  14.778000 -0.001646 0.998687 
#> factor(year)1999:cb         -0.138486  14.795000 -0.009360 0.992532 
#> factor(year)2000:cb         -0.061817  14.764000 -0.004187 0.996659 
#> factor(year)2001:cb         -0.043704  14.774000 -0.002958 0.997640 
#> factor(year)2002:cb         -0.068527  14.769000 -0.004640 0.996298 

Created on 2020-10-12 by the reprex package (v0.3.0)

An option to have t-stats below the coefficients instead of sd in esttex regression outputs

Hi Laurent,

This package has made everything (e.g. clustering, generating latex codes) much easier and faster! Thanks for making it.

I was wondering, if the function esttex has an option that allows me to report t-stats (or z-stats) in the parentheses below the coefficients, instead of sd/se, in the regression output.

And, if not, is there an easier way to customize the latex output so that it contains the t-stats , than manually copying the t-stats and replacing the standard deviations with them in the latex output?

I ask because some journals prefer t-statistics to standard errors. Thanks!

Get esttex to respect common dict aliases

As promised, a second esttex issue...

Is it possible for esttex output to respect common (i.e. unique) dictionary aliases?

I often want to output similar variables on the same row in a regression table. For instance, to make them immediately comparable even if they are named differently in my actual R model calls. The way this works for other regression outputting packages in R (e.g. huxtable or stargazer) is to give them the same alias. However, that strategy does not appear to work here.

Consider the following example, where I want to compare full vs partial marginal effects, using different model syntax. The coefficient on Wind in mod_2 is actually the same as Month5:Wind in mod_1. Yet, even if I explicitly give it the same alias, the two coefficients are still printed on separate rows.

library(fixest)

aq = airquality
aq$Month = factor(aq$Month)

## Full marginal effects
mod_1 = feols(Ozone ~ Month / Wind, data = aq)
#> NOTE: 37 observations removed because of NA values (Breakup: LHS: 37, RHS: 0).

## Partial marginal effects
mod_2 = feols(Ozone ~ Month * Wind, data = aq)
#> NOTE: 37 observations removed because of NA values (Breakup: LHS: 37, RHS: 0).

## "Wind" in mod_2 is actual "Month5:Wind"...
myDict = c("Wind" = "Month5:Wind")

## Doesn't work as expected: Get separate "Month5:Wind" rows instead of one.
esttex(
  mod_1, mod_2, titles  = c("Full MEs", "Partial MEs"),
  dict = myDict,
  drop = c("Int", paste0("Month", 6:9)) ## Optional: Just to highlight rows of interest 
)
#> \begin{table}[htbp]\centering
#> \caption{no title}
#> \begin{tabular}{lcc}
#>  & & \tabularnewline
#> \hline
#> \hline
#> Dependent Variable:&\multicolumn{2}{c}{Ozone}\\
#> Model:&(1)&(2)\\
#> \hline
#> \emph{Variables}\tabularnewline
#> Month5:Wind&-2.3681$^{*}$&  \\
#>   &(1.3162)&  \\
#> Month5:Wind&  &-2.3681$^{*}$\\
#>   &  &(1.3162)\\
#> \hline
#> \emph{Fit statistics}&  & \\
#> Observations& 116&116\\
#> R$^2$ & 0.54732&0.54732\\
#> Adjusted R$^2$ & 0.50889&0.50889\\
#> \hline
#> \hline
#> \multicolumn{3}{l}{\emph{Normal standard-errors in parentheses. Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
#> \end{tabular}
#> \end{table}

Created on 2019-12-29 by the reprex package (v0.3.0)

Minor bug: `^` causes trouble in `terms.formula`

I noticed that using ^ can cause issues with formula parsing. Here's the case I tripped over, with stats::terms.formula called indirectly by fixest::r2. I think this is a bug, though let me know if I'm misinterpreting.

library(fixest)

gravity_results <- feglm(
  Euros ~ log(dist_km) | Origin ^ Destination + Product ^ Year, 
  data=trade
)
r2(gravity_results)
#> Error in terms.formula(tmp, simplify = TRUE): invalid power in formula

Created on 2019-11-22 by the reprex package (v0.3.0)

esttex function not displaying title or fixef_sizes

Hi I have noticed another (new?) bug with the esttex function. Namely that it no longer displays the title or fixed_sizes when set to true. I have cannot in fact get it to display this for any combinations that I have tried. The following is an example that I ran using R 3.6.2 and fixest 0.3.0.

#This will clear all objects includes hidden objects
rm(list = ls(all.names = TRUE))
gc()

#Load the libraries
library(ggplot2)
library(tsibble)#yearweek function.
library(data.table)
library(zoo)
library(Hmisc) #For weighted medians functions.
library(fixest)#For the fixed effects regressions
setFixest_nthreads(6)#Set the number of threads to use

#Load back in the data.
Reg_data <- fread(paste0(data_analysis_path, "Trade_cost_reg.csv"))
Reg_data[, `:=`(TRD_EXCTN_DT = as.IDate(TRD_EXCTN_DT), Year_month = as.yearmon(Year_month), Year_week = yearweek(Year_week))]

#Want to set the dictionary for exporting tables nicely to latex
setFixest_dict(c("CUSIP_ID" = "Bond", "Year_month" = "Time (monthly)", "Year_week" = "Time (weekly)", "log(ENTRD_VOL_QT)" = "log(Volume)", "Actual_trade_party" = "Dealer",
                 "Contra_S" = "Selling party", "Contra_B" = "Buying party", "Investment_gradeYes" = "Investment grade", "log(Contra_trade_volume)" = "log(Mean contra volume)",
                  "Last_month_portion_trds" = "Relationship measure", "Last_month_portion_trds_S" = "Relationship measure_S", "Last_month_portion_trds_B" = "Relationship measure_B", "Trade_costs" = "Trade costs",
                 "CUSIP_ID^Year_month" = "Bond x Time (monthly)", "CUSIP_ID^Year_week" = "Bond x Time (weekly)", "Actual_trade_party^CUSIP_ID" = "Dealer x Bond",
                 "Actual_trade_party^Year_month" = "Dealer x Time (monthly)", "Actual_trade_party^Year_week" = "Dealer x Time (weekly)", "Actual_trade_party^Year_month^CUSIP_ID" = 
                 "Dealer x Time (monthly) x Bond", "Actual_trade_party^Year_week^CUSIP_ID" = "Dealer x Time (weekly) x Bond", "Contra_S^Year_month" = "Contra_S x Time (monthly)",
                 "Contra_B^Year_month" = "Contra_B x Time (monthly)", "Contra_S^Year_month^CUSIP_ID" = "Contra_S x Time (monthly) x Bond", "Contra_B^Year_month^CUSIP_ID" = "Contra_B x Time (monthly) x Bond",
                 "Contra_S^Year_week^CUSIP_ID" = "Seller x Time (weekly) x Bond", "Contra_B^Year_week^CUSIP_ID" = "Buyer party x Time (weekly) x Bond", "Contra_S^CUSIP_ID^Actual_trade_party^Year_month" =
                 "Seller x Bond x Dealer x Time (monthly)", "Contra_B^CUSIP_ID^Actual_trade_party^Year_month" = "Buyer x Bond x Dealer x Time (monthly)", "Contra_B^CUSIP_ID^Actual_trade_party^Year_week" = 
                 "Buyer x Bond x Dealer x Time (weekly)", "Contra_S^CUSIP_ID^Actual_trade_party^Year_week" = "Seller x Bond x Dealer x Time (weekly)", "log(Contra_trade_volume_B)" = "log(Buyer volume)",
                 "log(Contra_trade_volume_S)" = "log(Seller volume)",  "CUSIP_ID & Year_month" = "Bond + Time (monthly)"))


#Pre-crisis regressions.
Pre_crisis_data <- Reg_data[Year_month >= "Jan 2006"][Year_month <= "Jun 2007"]

#Here we just do bond*dealer*Year_month
dealer_times_bond_times_month_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) + log(Contra_trade_volume) | Actual_trade_party^Year_month^CUSIP_ID, data = Pre_crisis_data)
deal_times_bond_times_week_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) +log(Contra_trade_volume) | 
                                           Actual_trade_party^Year_week^CUSIP_ID, data = Pre_crisis_data)
Contra_S_times_bond_times_dealer_times_month_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) + log(Contra_trade_volume)| Contra_S^CUSIP_ID^Actual_trade_party^Year_month, data = Pre_crisis_data)
Contra_B_times_bond_times_dealer_times_month_fe_Pre_crisis <- feols(Trade_costs ~ Last_month_portion_trds + log(ENTRD_VOL_QT) + log(Contra_trade_volume)| Contra_B^CUSIP_ID^Actual_trade_party^Year_month, data = Pre_crisis_data)

esttex(dealer_times_bond_times_month_fe_Pre_crisis, deal_times_bond_times_week_fe_Pre_crisis, Contra_S_times_bond_times_dealer_times_month_fe_Pre_crisis, 
Contra_B_times_bond_times_dealer_times_month_fe_Pre_crisis, 
se = "twoway", cluster = c("CUSIP_ID", "Year_month"), 
title = "A regression of inter-dealer matched trade costs on our relationship measure for the pre-crisis period as defined in the text. 
Variables with B or S at the end indicates that the variable is for the buying or selling counterparty only. 
Investment grade is an indicator variable that equals one if the bond is investment grade and zero if not. 
Contra volume is the total volume of the contra parties from all trades they undertake in the same calendar month.", 
fixef_sizes = TRUE )

The following output is returned to the console

\begin{table}[htbp]\centering
\caption{no title}
\begin{tabular}{lcccc}
 & & & & \tabularnewline
\hline
\hline
Dependent Variable:&\multicolumn{4}{c}{Trade costs}\\
Model:&(1)&(2)&(3)&(4)\\
\hline
\emph{Variables}\tabularnewline
Relationship measure&-0.2605$^{***}$&-0.2110$^{*}$&-0.4044$^{**}$&-0.6253$^{***}$\\
  &(0.0810)&(0.1098)&(0.2001)&(0.1113)\\
log(Volume)&-0.0392$^{***}$&-0.0282$^{***}$&-0.0322$^{***}$&-0.0224$^{***}$\\
  &(0.0044)&(0.0046)&(0.0057)&(0.0049)\\
log(Mean contra volume)&0.0056$^{***}$&0.0050$^{**}$&0.0081$^{*}$&0.0138$^{***}$\\
  &(0.0016)&(0.0022)&(0.0044)&(0.0044)\\
\hline
\emph{Fixed-Effects}&  & & & \\
Dealer x Time (monthly) x Bond&Yes&No&No&No\\
Dealer x Time (weekly) x Bond&No&Yes&No&No\\
Seller x Bond x Dealer x Time (monthly)&No&No&Yes&No\\
Buyer x Bond x Dealer x Time (monthly)&No&No&No&Yes\\
\hline
\emph{Fit statistics}&  & & & \\
Observations& 397,891&397,891&397,891&397,891\\
R$^2$ & 0.91507&0.96558&0.96396&0.97514\\
Within R$^2$ & 0.03447&0.02522&0.02639&0.02169\\
\hline
\hline
\multicolumn{5}{l}{\emph{Two-way (Bond \& Time (monthly)) standard-errors in parentheses.}}\\
\multicolumn{5}{l}{\emph{Signif Codes: ***: 0.01, **: 0.05, *: 0.1}}\\
\end{tabular}
\end{table}

Notice that the table does not contain the title or the number of "individuals" per fixed-effect dimension.

r2 function when only fixed effects are used

Minor issue.

The r2 function returns an error when applied to a regression that only has fixed effects but no other covariates. I haven't checked this for all the objects returned, but adjusted r-squared is returned in the summary function in that case, so I assume it could also be returned when the r2 function is called.

library(fixest)
library(magrittr)

# create data
df <- data.frame(x = runif(1000))
df$id <- sample(20, 1000, TRUE)
df$y <- runif(1000) + 0.2 * df$x + 0.05 * df$id

# show that r2 works well in the normal case
fixest::feols(y ~ x | id, data = df) %>% 
  summary()
#> OLS estimation, Dep. Var.: y
#> Observations: 1,000 
#> Fixed-effects: id: 20
#> Standard-errors: Clustered (id) 
#>   Estimate Std. Error t value  Pr(>|t|)    
#> x 0.204396   0.020233  10.102 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-likelihood: -151.67   Adj. R2: 0.485 
#>                         R2-Within: 0.04161

fixest::feols(y ~ x | id, data = df) %>% 
  fixest::r2()
#>     sq.cor         r2        ar2        pr2       apr2        wr2       war2 
#> 0.49531459 0.49531459 0.48500437 0.69270986 0.65218985 0.04161231 0.04063337 
#>       wpr2      wapr2 
#> 0.12289423 0.11711136

# show that there is an adjusted r-squared measure even for cases where only fixed-effects are used
fixest::feols(y ~ 1 | id, data = df) %>% 
  summary()
#> OLS estimation, Dep. Var.: y
#> Observations: 1,000 
#> Fixed-effects: id: 20
#> Log-likelihood: -172.92   Adj. R2: 0.46319

# error when calling r2 on this estimation
fixest::feols(y ~ 1 | id, data = df) %>% 
  fixest::r2()
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 0, 1000

Created on 2020-07-06 by the reprex package (v0.3.0)

Thank you for the great work in the package.

Issue when formula is of the form Z ~ X + Y + (X==1)

Hi Mr Berge !

Loving your package, thanks for releasing it into the wild.

There are times when it useful to do a dummy encoding in the formula itself, such as I have outlined in the title of this post.

(This approach works with lm(), glm(), model.matrix(), etc)

When I do that with feglm(), we get this error message:

ClassType + FaceAmountBand + :
Error in parse(text = x, keep.source = FALSE) :
:1:115: unexpected '=='
1: lhs ~ X + Y + X == 1
^
This error was unforeseen by the author of the function feglm. If you think your call to the function is legitimate, could you report?

Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed

Hey Mr Berge !

Found another little bug. Below is a minimal reproduceable example.

Peace
Jordi

+++++++

library(data.table)
library(fixest)

tmp <- data.table(x1 = 1:100,x2 = runif(100)*100)
tmp[, y := x1 + x2]

model <- feglm(y ~ x1 + x2, data = tmp) # works fine
model <- feglm(y ~ bs(x1) + bs(x2), data = tmp) # does not work

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.