pachadotdev / eflm

See pacha.dev/capybara for a much better GLM implementation. Efficient Fitting of Linear and Generalized Linear Models by using just base R.

Home Page: https://pacha.dev/capybara

License: Other

R 98.32% Stata 1.68%

r lm glm broom sandwich

eflm's Introduction

See pacha.dev/capybara for a much better GLM implementation.

Efficient Fitting of Linear Models

Scope

The eflm package reduces the design matrix from N × P to P × P to shorten fitting time, and provides functions that are drop-in replacements for glm and lm, such as:

# just prepend an 'e' to glm
eglm(mpg ~ wt, data = mtcars)

The best computational performance is obtained when R is linked against OpenBLAS, Intel MKL or another optimized BLAS library. This implementation aims to be compatible with the ‘broom’ and ‘sandwich’ packages for summary statistics and clustering by providing S3 methods.
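The reduction from N × P to P × P can be illustrated with the cross products X'X and X'y. The following is a minimal base R sketch for the linear case only. It makes no claim about eflm's internals; it simply solves the normal equations and compares the result against lm:

```r
# Illustrative sketch, not eflm's actual implementation.
X <- model.matrix(mpg ~ wt, data = mtcars) # N x P design matrix (32 x 2)
y <- mtcars$mpg

XtX <- crossprod(X)    # P x P matrix, t(X) %*% X
Xty <- crossprod(X, y) # P x 1 matrix, t(X) %*% y

# Solve the P x P system instead of working with all N rows
beta <- drop(solve(XtX, Xty))

# Same coefficients as lm(), up to numerical tolerance
all.equal(unname(beta), unname(coef(lm(mpg ~ wt, data = mtcars))))
```

Both crossprod calls are matrix products that R delegates to BLAS, which is where an optimized BLAS library helps.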

This package takes ideas from the glm2, speedglm, fastglm and fixest packages, but the implementations here keep the functions and outputs as close as possible to those of the stats package. This makes the functions provided here compatible with packages such as sandwich for robust estimation, even if that means giving up some of the speed gains.

The greatest strength of this package is testing. With more than 1600 (and counting) tests, we try to do exactly the same as lm/glm, even in edge cases, but faster.

The ultimate aim of the project is to produce a package that:

Does exactly the same as lm and glm in less time
Is as numerically stable as lm and glm
Depends only on base R, with no Rcpp or other calls
Uses R’s internal C code such as the Cdqrls function that the stats package uses for model fitting
Can be used in Shiny dashboards and other contexts where you need fast model fitting
Is useful for memory-intensive models
Allows model fitting in cases demanding more memory than free RAM (PENDING)

Installation

You can install the released version of eflm from CRAN with:

install.packages("eflm")

And the development version with:

remotes::install_github("pachadotdev/eflm")

Progress list

Stats compatibility

cooks.distance

Sandwich compatibility

Broom compatibility

augment
tidy
glance

Lmtest compatibility

resettest

Benchmarking

The dataset for this benchmark was taken from Yotov et al. (2016) and consists of a 28,152 × 8 data frame with 6 numeric and 2 categorical columns of the form:

Year (t)	Trade (X)	DIST	Exp Year (π)	Imp Year (χ)
1986	27.8	12045	ARG1986	AUS1986
1986	3.56	11751	ARG1986	AUT1986
1986	96.1	11305	ARG1986	BEL1986

This data can be found in the tradepolicy package.

The variables are:

year: time of export/import flow
trade: bilateral trade
log_dist: log of distance
cntg: contiguity (0/1)
lang: common language (0/1)
clny: colonial relation (0/1)
exp_year/imp_year: exporter/importer time fixed effects

For benchmarking I’ll fit a PPML model, as it’s a computationally expensive model.

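library(dplyr)
library(eflm)

ch1_application1 <- tradepolicy::agtpa_applications %>%
  select(exporter, importer, pair_id, year, trade, dist, cntg, lang, clny) %>%
  filter(year %in% seq(1986, 2006, 4)) %>%
  # fixed effects such as "ARG1986" (assumed construction, matching the table above)
  mutate(exp_year = paste0(exporter, year), imp_year = paste0(importer, year))

formula <- trade ~ log(dist) + cntg + lang + clny + exp_year + imp_year
eglm(formula, family = quasipoisson, data = ch1_application1)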

To compare glm, the proposed eglm and Stata’s ppml, I ran a test with 500 repetitions locally and report the median of the realizations as the fitting time. The plots report the fitting times and used memory from regressions on cumulative subsets of the data for 1986, …, 2006 (e.g. regress for 1986, then 1986 and 1990, …, then 1986 to 2006). This gives the following fitting times and memory allocation as a function of the design matrix dimensions:
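The timing procedure described above can be sketched in base R. This is only an illustration under assumptions: median_fit_time is a hypothetical helper (not part of eflm), the repetition count is lowered from 500 to 5, the data is simulated instead of the trade data, and stats::glm stands in for the functions being compared.

```r
# Hypothetical helper: median elapsed time of a fitting function
median_fit_time <- function(fit_fun, reps = 5) {
  times <- vapply(
    seq_len(reps),
    function(i) system.time(fit_fun())[["elapsed"]],
    numeric(1)
  )
  median(times)
}

# Simulated stand-in for the trade data (200 rows per year)
set.seed(1)
years <- seq(1986, 2006, 4)
d <- data.frame(year = rep(years, each = 200), x = rnorm(200 * length(years)))
d$y <- rpois(nrow(d), lambda = exp(0.5 + 0.3 * d$x))

# Cumulative subsets: 1986, then 1986 and 1990, ..., then 1986 to 2006
timings <- vapply(seq_along(years), function(i) {
  subset_d <- d[d$year %in% years[seq_len(i)], ]
  median_fit_time(function() glm(y ~ x, family = quasipoisson, data = subset_d))
}, numeric(1))

data.frame(last_year = years, rows = 200 * seq_along(years), median_seconds = timings)
```

Reporting the median instead of the mean keeps a single slow run (for example, one interrupted by garbage collection) from distorting the result.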

Yotov et al. (2016) features both partial and general equilibrium models. Some partial equilibrium models are particularly slow to fit because of the allocated memory and the number of fixed effects, such as the Regional Trade Agreements (RTAs) model.

In the next table, TG means ‘Traditional Gravity’ (e.g. vanilla PPML), DP means ‘Distance Puzzle’ and GB stands for ‘Globalization’, which are refinements of the simple PPML model and include dummy variables such as specific country pair fixed effects and lagged RTAs.

Model	Rows in design matrix	Cols in design matrix
TG, PPML	28152	831
DP, FE	28566	905
RTAs, GB	28482	3175

The results for the RTAs model show that the speedups scale to larger design matrices: the relative time reduction stays similar even as the required memory increases.

Model	GLM Time (s)	EGLM Time (s)	Time Gain (%)
DP, FE	111.0	9.08	91.82%
RTAs, GB	1824.0	161.40	91.15%
TG, PPML	108.6	9.06	91.66%

It is important to mention that the increase in memory results in reduced object size for the stored model.

Model	GLM Size (MB)	EGLM Size (MB)	Memory Savings (%)
DP, FE	231.04	37.26	83.87%
RTAs, GB	824.89	263.36	68.07%
TG, PPML	210.88	34.69	83.55%
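The percentage columns in the two tables above are relative reductions with respect to glm, that is 100 × (1 − eglm / glm). A short sketch that recomputes them from the reported times and sizes (the data frame and the reduction_pct helper are written here for illustration only):

```r
bench <- data.frame(
  model     = c("DP, FE", "RTAs, GB", "TG, PPML"),
  glm_time  = c(111.0, 1824.0, 108.6),   # seconds
  eglm_time = c(9.08, 161.40, 9.06),     # seconds
  glm_size  = c(231.04, 824.89, 210.88), # MB
  eglm_size = c(37.26, 263.36, 34.69)    # MB
)

# Relative reduction with respect to the glm baseline, in percent
reduction_pct <- function(baseline, new) round(100 * (1 - new / baseline), 2)

bench$time_gain_pct <- reduction_pct(bench$glm_time, bench$eglm_time)
bench$memory_savings_pct <- reduction_pct(bench$glm_size, bench$eglm_size)
bench
```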

To conclude my benchmarks, I fitted the PPML model again on DigitalOcean droplets, leading to consistent times across scaled hardware. The results can be seen in the next plot:

Edge cases

An elementary example that breaks eflm even with QR decomposition can be found in Golub et al. (2013), which consists of passing an ill-conditioned matrix:

Model	(Intercept)	x₁	x₂
REG 1	1.98	2.98	1.02
REG 2	1.98	4.00	NA
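As background on why ill-conditioned matrices are hard for any method that works with a P × P cross product: in the 2-norm, the condition number of X'X is the square of the condition number of X. A minimal sketch with made-up, nearly collinear data (not the Golub and Van Loan example itself):

```r
# Made-up data: x2 is almost a copy of x1
x1 <- seq(0.1, 1, by = 0.1)
x2 <- x1 + 1e-4 * rep(c(1, -1), times = 5)
X <- cbind(1, x1, x2)

kappa_X <- kappa(X, exact = TRUE)              # 2-norm condition number of X
kappa_XtX <- kappa(crossprod(X), exact = TRUE) # condition number of X'X

# The last two entries agree up to floating point error
c(kappa_X = kappa_X, kappa_X_squared = kappa_X^2, kappa_XtX = kappa_XtX)
```

Squaring the condition number means that a matrix which is only moderately ill-conditioned as N × P can already exhaust double precision once it is reduced to P × P.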

References

Golub, Gene H, and Charles F Van Loan. 2013. Matrix Computations. Vol. 3. JHU press.

Yotov, Yoto V, Roberta Piermartini, José-Antonio Monteiro, and Mario Larch. 2016. An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model. World Trade Organization Geneva.

eflm's Issues

Benchmark figures in readme.md not showing

Hi,

The figures that were supposed to demonstrate the performance of the package are missing.

(P.S. https://pacha.dev/eflm is dead, too.)

Patrick 0.1.0 will have backwards incompatible changes

Hi Pacha!

Thanks for using patrick for creating parameterized tests.

I am going to start the process of releasing a backwards incompatible change in the package.

In the past, the undocumented test_name parameter could be used in cases data frames and as an argument for naming tests.
I am moving this to a documented argument in with_parameters_test_that(). The argument is also getting the name .test_name in order to distinguish it from test cases passed by a user

In version 0.1.0, patrick will throw a warning about this change and rename input as appropriate. In the future, this warning will be dropped. Addressing it requires changing your use of test_name to .test_name.

Apologies for any inconvenience that this causes. Please let me know how else I can help.

Best wishes,
Michael

offset() terms not recognised?

Hello

Ran this test case on a large dataset on RStudio Server. The stats::glm() equivalent works fine; however, the code below fails.

We always run glm() with offsets so this is pretty mission critical.

This is the most simple model I tried, a more complex one also failed with the same error

Error in offset(log(expectednum_rate2)) :
object 'expectednum_rate2' not found

Model1e <- eglm(
  actualnum ~ offset(log(expectednum_rate2)) + offset(log(exposednum)),
  family = poisson(link = "log"),
  data = US_grp_dta
)