The pyfixest from s3alfisc

HC3 and CRV3 do not seem to match

investigate and fix
tests are already added

Mimic fixest vcov defaults

drop requirement to set vcov argument in .feols()
for no specified vcov without fixed effects: iid inference
for no specified vcov with fixed effects: CRV1 clustered at the first fixed effects level

See here.

Singleton fixed effects

Add message when data is dropped due to singleton fixed effects
add tests with singleton fixed effects

Link to original issue

Bug in i() method

For some reason, signs of integers are swapped, i.e. for event studies with time-to-treatment (ttt) -27, ...., 0, ..., 21, the estimated coefs are labeled as -21, ..., 0, ..., 27.

Different Design Matrix Combinations

For options, see here:

sw()
csw()
fsplit

Custom fixed effects demeaning function

Write a custom demean() function that allows for NA values in either X or Y
further, allow for weights (for WLS)
implement dropping of singletons
note that there are a few errors in the current version of demean() (i.e. the handling of missings + multiple fixed effects)

Reproduce applied econometrics textbook examples in notebooks

Motivation:

uncovers potentially needed functionality
makes sure that pyfixest actually runs for "real world" problems and not simply my synthetic test datasets

Textbooks with code:

the effect (data in the causaldata package)
the mixtape (data also included in the causaldata package)
Data Analysis Business & Econ (github repo)

p-value and some t-stat calculations differ from R fixest

Some more toying around, and I'm getting different p-value calculations from the original fixest, even in cases where estimates and t-stats match:

With clustering, the t-stats match but p-values are different:

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'Year'}))
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: inspection_score 
# 
#            Estimate  Std. Error    t value  Pr(>|t|)
# Intercept 93.627262    0.321288 291.412047  0.000000
#   Weekend  2.096548    0.862544   2.430656  0.015072

library(fixest)
library(causaldata)

data(restaurant_inspections)

feols(inspection_score ~ Weekend , data = restaurant_inspections, vcov = ~Year)

# OLS estimation, Dep. Var.: inspection_score
# Observations: 27,178 
# Standard-errors: Clustered (Year) 
# Estimate Std. Error   t value  Pr(>|t|)    
# (Intercept) 93.62726   0.321288 291.41205 < 2.2e-16 ***
#   WeekendTRUE  2.09655   0.862544   2.43066  0.029106 *  
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 6.25408   Adj. R2: 8.241e-4

With IID and HC1, the t-stats and p-values differ by very small amounts

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'iid')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036224 55497.059216  0.000000
# Weekend   -0.862863    0.412097    -2.093833  0.036275

feols(Year~Weekend, data = restaurant_inspections)
OLS estimation, Dep. Var.: Year
Observations: 27,178 
Standard-errors: IID 
               Estimate Std. Error     t value  Pr(>|t|)    
(Intercept) 2010.343815   0.036226 55495.01719 < 2.2e-16 ***
WeekendTRUE   -0.862863   0.412112    -2.09376  0.036291 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 5.94874   Adj. R2: 1.245e-4

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'HC1')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036181 55562.889684  0.000000
# Weekend   -0.862863    0.471032    -1.831856  0.066973

feols(Year~Weekend, data = restaurant_inspections, vcov = 'hc1')
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: Heteroskedasticity-robust 
# Estimate Std. Error     t value  Pr(>|t|)    
# (Intercept) 2010.343815   0.036182 55561.86747 < 2.2e-16 ***
#   WeekendTRUE   -0.862863   0.471041    -1.83182  0.066989 .

Improve performance

See here for faster solving of the least squares problem.
faster demeaning algo via numba / c++
clean up code, drop redundancies, etc
for a given column in the design matrix, only project out fe's when not already done (for multiple estimation)

Issues with specifying vcov

Currently, vcov is a required positional argument in feols. Note this also means you can't specify the vcov using the .vcov method without also specifying it in the feols call as a positional argument. Suggest matching the defaults from the R package of vcov = 'iid' with no fixed effects specified, or clustered at the level of the first fixed effect if fixed effects are specified., or overwriting that with whatever was set in .vcov.
help(fixest.feols) suggests using dict("CRV1":"clustervar") to specify a cluster variable but this is improper syntax.
In my example data I'm having trouble setting any sort of clustering variable alongside a set of fixed effects (pip install causaldata; note there are no missing values in this data):

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest as pf
import numpy as np

res = restaurant_inspections.load_pandas().data
fixest = pf.Fixest(data = res)

fixest.feols('inspection_score ~ Weekend | business_name', vcov = 'iid')
# runs fine

fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

# maybe it prefers numbers?
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'Year'}))
# IndexError: index 27134 is out of bounds for axis 0 with size 27076

# Or integers?
res['random_category'] = np.random.randint(0, 10, res.shape[0])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'random_category'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# or categoricals?
res['categorical'] = pd.Categorical(res['random_category'])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'categorical'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# Works fine without the FEs
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'categorical'}))

# although still not for business_name
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


# Maybe it can't handle single-row FEs alongside clusters?
res['counts'] = (res
		   .groupby('business_name')['business_name']
		   .transform('size'))
fixest = pf.Fixest(data = res.query('counts > 1'))
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# ValueError: matrix should have the same number of rows as fixed effect IDs.

Tests

compare against fixest
internal benchmarks

IV support

Required Steps:

allow three-part formulas
implement IV estimation & inference. Start with just-identified models.
update summary() and tidy() methods. For tidy(), simply add index "stage1" and "stage2". Split summary() into two parts. Check out what fixest does.

Nice to have's (for now):

common IV diagnostics
AR confidence intervals

In Practice:

implement IV estimator (Z'X)^(-1) Z'Y.
always create X and Z. For OLS, set X = Z
pass everything through, done (at least for models with only one endog variable)
after this is implemented, allow for over identified models. This requires implementation of the 2SLS estimator + updates to inference procedures

Implement CRV3 for arbitrary fixed effects

Should be fairly straightforward: for no clustering, run MNW's summclust algo, else do what sandwich does. E.g. do along these lines:

                if self.has_fixef == False:
                    # inverse hessian precomputed?
                    tXX = np.transpose(self.X) @ self.X
                    tXy = np.transpose(self.X) @ self.Y

                    # compute leave-one-out regression coefficients (aka clusterjacks')
                    for ixg, g in enumerate(clusters):

                        Xg = self.X[np.equal(ixg, group)]
                        Yg = self.Y[np.equal(ixg, group)]
                        tXgXg = np.transpose(Xg) @ Xg
                        # jackknife regression coefficient
                        beta_jack[ixg,:] = (
                            np.linalg.pinv(tXX - tXgXg) @ (tXy - np.transpose(Xg) @ Yg)
                        ).flatten()

                else:

                    for ixg, g in enumerate(clusters):
                        data = self.data[np.equal(ixg, group)]
                        model = Fixest(data)
                        model.feols(self.formula, vcov = "iid")
                        beta_jack[ixg,:] = model.beta_hat

Handling of Categorical Variables

Build around formulaic's C(..., levels = ...) option.
Need to pass levels information to formula.
Rename levels to "ref" so that users can only provide a reference str instead of a full list.
Is there a need for a dedicated i() option to interact with a categorical variable? The key advantage of the i() method seems custom tooling around it, e.g. iplot. But that could be handled with plotting method with decent string regex for coef names (i.e. for variables included in i()). Maybe not 100% error prone?

Pass NAs to vcov() to use for clustered standard errors

pass NAs to cluster variable
raise exception when there is an additional missing variable in the clustering variable (i.e. when the clustering variable is not part of the model)

Bug: formula with more than one regressor and fixed effects

E.g.

fit = feols(fml = 'Y ~ X1 + X2 | X3', vcov = {'CRV1':'group_id'}, data = data)
# invalid value encountered in sqrt
#  np.sqrt(np.diagonal(self.vcov[x]))

Multicollinearity?

Bug in multiple estimations with fixed effects

E.g.

feols('y1 + y2 ~ x1 | species ', data = base, vcov = "hetero")

leads to

IndexError: index 1 is out of bounds for axis 1 with size 1

Match fixest standard errors exactly

Status quo: minor deviations due to small sample adjustments.

Required: implementation of ssc() function with all fixed effect options.

Add support for two-way clustering

Simply apply vcov() cluster three times: vcov_a + vcov_b - vcov_ab.
For ssc, see here.

ssc() function does not work properly

i.e. adj = False does nothing

Bug for i() formula without fixed effects

I.e. for fml = "Y ~ i(X2, X1)". No problems for formulas with fixed effects.

vcov() method does not update summary info

Example:

import pyfixest as pf
import numpy as np
from pyfixest.utils import get_data

fixest = pf.Fixest(data = data)
fixest.feols("Y~X1 | csw0(X2, X3)", vcov = {'CRV1':'id'})
fixest.summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.220649 30.130262   0.00000
#        X1 -0.141200    0.211081 -0.668937   0.50369
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274    0.210556 -0.675707  0.499383
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.096317    0.204801 -0.470296  0.638247
fixest.vcov({'CRV3':'group_id'}).summary()
>>> fixest.vcov({'CRV3':'group_id'}).summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.229614 28.953831  0.000000
#        X1 -0.141200    0.207516 -0.680428  0.502745
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274     0.20774 -0.684867   0.49999
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error  t value  Pr(>|t|)
# X1 -0.096317    0.206282 -0.46692  0.644768
#

replace patsy dependency byformulaic

Support for datetime variables

Was inspired to try using a datetime in a regression, given that statsmodels handles these incorrectly. Currently these do not appear to be supported in pyfixest

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

res['DT'] = [pd.to_datetime(str(y)+'-01-01 00:00:00') for y in res['Year']]

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ DT', vcov = 'iid')
fixest.summary()
# UFuncTypeError: ufunc 'multiply' cannot use operands with types dtype('int32') and dtype('<M8[ns]')

Note that this also occurs without the 00:00:00, or if using a datetime.date() instead of a Pandas datetime. Compare to R:

library(fixest)
library(causaldata)

data(restaurant_inspections)

restaurant_inspections$Time = lubridate::ymd_hms(paste0(restaurant_inspections$Year, '-01-01 00:00:00'))

feols(Year~Time, data = restaurant_inspections)
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: IID 
# Estimate   Std. Error  t value  Pr(>|t|)    
# (Intercept) 1.970001e+03 3.323279e-05 59278828 < 2.2e-16 ***
#   Time        3.170000e-08 2.580000e-14  1226887 < 2.2e-16 ***
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 7.994e-4   Adj. R2: 1

That's probably enough tinkering for me today! Sorry to load up the Issue page.

Unit Tests

test feols front end against fixest::feols
test parsing of formula syntax
add continuous integration
add even more tests

Additional methods

tidy()
summary()
etable()
iplot() (after implementing i())

Multiple Estimations and Formula Syntax

Make sure to drop singleton fixed effects from the clustering matrix

Link to original issue: link

KeyError: "['X1:X2'] not in index". Happens because variable "X1:X2" is not contained in the input data. Likely requires even more reshuffling of the formula parsing - demeaning - model matrix pipeline.