Regression model building and forecasting in R


greybox's Introduction

greybox


The greybox package contains functions for model building, currently done via model selection and combination based on information criteria. The resulting model can then be used for analysis and forecasting.

hex-sticker of the greybox package for R

There are several groups of functions in the package.

Regression model functions

  1. alm - Augmented Linear (regression) Model, which implements likelihood estimation of parameters for Normal, Laplace, Asymmetric Laplace, Logistic, Student's t, S, Generalised Normal, Folded Normal, Log Normal, Box-Cox Normal, Logit Normal, Inverse Gaussian, Gamma, Poisson, Negative Binomial, Cumulative Logistic and Cumulative Normal distributions. In a sense this is similar to the glm() function, but with a different set of distributions and a focus on forecasting.
  2. sm - Scale Model, which constructs a regression for the scale parameter of a distribution (e.g. for the variance in the normal distribution). Works like a method applied to an already existing model (lm / alm).
  3. stepwise - implements stepwise selection for the location model, based on information criteria and partial correlations.
  4. lmCombine - combines regression models constructed on the provided data, based on IC weights, and returns a combined alm object.
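The functions above can be chained into a short workflow. A minimal sketch, assuming the built-in mtcars data and the documented alm() / stepwise() interfaces:

```r
library(greybox)

# Likelihood-based regression with the Log-Normal distribution
almModel <- alm(mpg ~ wt + hp, data = mtcars, distribution = "dlnorm")
summary(almModel)

# Stepwise selection based on information criteria;
# the first column of the data is treated as the response
stepModel <- stepwise(mtcars, ic = "AICc")
```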

Exogenous variables transformation functions

  1. xregExpander - function produces lags and leads of the provided data.
  2. xregTransformer - function produces non-linear transformations of the provided data (logs, inverse etc).
  3. xregMultiplier - function produces cross-products of the variables in the matrix. Could be useful when exploring interaction effects of dummy variables.
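A quick sketch of the expander, assuming the documented convention that negative values in lags produce lags and positive ones produce leads:

```r
library(greybox)

x <- rnorm(100, 100, 10)
# Lags 1 and 2 plus lead 1 of the provided variable
xregExpanded <- xregExpander(x, lags = c(-2, -1, 1))
head(xregExpanded)
```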

The data analysis functions

  1. cramer - calculates Cramer's V for two categorical variables and tests the significance of the association.
  2. mcor - returns the coefficient of multiple correlation between variables. This is useful when measuring association between categorical and numerical variables.
  3. association (aka assoc()) - returns a matrix of measures of association, choosing between cramer(), mcor() and cor() depending on the types of the variables.
  4. determination (and the method determ()) - returns the vector of coefficients of determination (R^2) for the provided data. This is useful for diagnosing multicollinearity.
  5. tableplot - plots the graph for two categorical variables.
  6. spread - plots a matrix of scatter / boxplot / tableplot diagrams, depending on the types of the provided variables.
  7. graphmaker - plots the original series, the fitted values and the forecasts.
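A sketch of the analysis functions on the built-in mtcars data, assuming the documented assoc() and determination() interfaces:

```r
library(greybox)

# assoc() chooses cramer(), mcor() or cor() per pair of variables
assocMatrix <- assoc(mtcars)

# Coefficients of determination of each variable regressed on the others;
# values close to 1 indicate multicollinearity
determination(mtcars[, -1])
```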

Models evaluation functions

  1. ro - rolling origin evaluation (see the vignette).
  2. rmcb - Regression for Multiple Comparison with the Best. This is a simplified version of the nemenyi / MCB test, relying on regression on ranks of methods.
  3. measures - the error measures for the provided forecasts. Includes MPE, MAPE, MASE, sMAE, sMSE, RelMAE, RelRMSE, MIS, sMIS, RelMIS, pinball and others.
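A hedged sketch of the error measures, assuming the measures(holdout, forecast, actual) argument order from the package documentation; the flat forecast below is purely for illustration:

```r
library(greybox)

y <- rnorm(110, 100, 5)
train <- y[1:100]
test <- y[101:110]
fc <- rep(mean(train), 10)   # flat benchmark forecast, illustration only

# Holdout values, point forecasts and in-sample actuals
measures(test, fc, train)
```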

Distribution functions:

  1. qlaplace, dlaplace, rlaplace, plaplace - functions for Laplace distribution.
  2. qalaplace, dalaplace, ralaplace, palaplace - functions for Asymmetric Laplace distribution.
  3. qs, ds, rs, ps - functions for S distribution.
  4. qgnorm, dgnorm, rgnorm, pgnorm - functions for Generalised normal distribution.
  5. qfnorm, dfnorm, rfnorm, pfnorm - functions for folded normal distribution.
  6. qtplnorm, dtplnorm, rtplnorm, ptplnorm - functions for three parameter log normal distribution.
  7. qbcnorm, dbcnorm, rbcnorm, pbcnorm - functions for Box-Cox normal distribution (discussed in Box & Cox, 1964).
  8. qlogitnorm, dlogitnorm, rlogitnorm, plogitnorm - functions for Logit-normal distribution.
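These follow the standard R d/p/q/r convention. A sketch for the Laplace distribution, assuming the documented mu and scale parameters:

```r
library(greybox)

dlaplace(0, mu = 0, scale = 1)          # density
plaplace(1, mu = 0, scale = 1)          # CDF
qlaplace(0.975, mu = 0, scale = 1)      # quantile
x <- rlaplace(1000, mu = 0, scale = 1)  # random generation
```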

Methods for the introduced and some existing classes:

  1. temporaldummy - the method that creates a matrix of dummy variables for an object based on the selected frequency. e.g. this can create week of year based on the provided zoo object.
  2. outlierdummy - the method that creates a matrix of dummy variables based on the residuals of an object, selected confidence level and type of residuals.
  3. pointLik - point likelihood method for the time series models.
  4. pAIC, pAICc, pBIC, pBICc - respective point values for the information criteria, based on pointLik.
  5. coefbootstrap - the method that returns bootstrapped coefficients of the model. Useful for the calculation of covariance matrix and confidence intervals for parameters.
  6. summary - returns summary of the regression (either selected or combined).
  7. vcov - covariance matrix for combined models. This is an approximation; the exact matrix is quite messy and not yet available.
  8. confint - confidence intervals for combined models.
  9. predict, forecast - point and interval forecasts for the response variable. forecast method relies on the parameter h (the forecast horizon), while predict is focused on the newdata. See vignettes for the details.
  10. nparam - returns the number of estimated parameters in the model (including location, scale, shift).
  11. nvariate - returns the number of dimensions of the response variable.
  12. actuals - returns the response variable from the model.
  13. plot - plots several graphs for the analysis of the residuals (see documentation for more details).
  14. AICc - AICc for regression with normally distributed residuals.
  15. BICc - BICc for regression with normally distributed residuals.
  16. is.greybox, is.alm etc. - functions to check if the object was generated by respective functions.
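A sketch contrasting the two forecasting methods on an alm model, assuming the documented predict() / forecast() interfaces (the mtcars split is arbitrary, for illustration):

```r
library(greybox)

almModel <- alm(mpg ~ wt + hp, data = mtcars[1:30, ])

# predict() is driven by newdata and supports interval types
predict(almModel, newdata = mtcars[31:32, ],
        interval = "prediction", level = 0.95)

# forecast() relies on the horizon h
forecast(almModel, newdata = mtcars[31:32, ], h = 2)
```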

Experimental functions:

  1. lmDynamic - linear regression with time varying parameters based on pAIC.

Installation

The stable version of the package is available on CRAN, so you can install it by running:

install.packages("greybox")

The development version is available on GitHub and can be installed via the remotes package. First make sure that you have remotes:

if (!require("remotes")){install.packages("remotes")}

and after that run:

remotes::install_github("config-i1/greybox")

greybox's People

Contributors: config-i1, yforecasting

greybox's Issues

mcor: x works only if numeric?

The mcor documentation example works: mcor(mtcars$am, mtcars$mpg). However, when x is a factor, as in the example provided here, mcor(mtcarsData$am, mtcarsData$mpg) fails with this error:

Error in terms.formula(object, data = data) : '.' in formula and no 'data' argument

Produce forecasts based on combiner for fat regressions

Currently it is not possible because predict.lm says:

Error in qr.R(qr.lm(object))[p1, p1] : subscript out of bounds

This is probably because the number of parameters in the combined model is higher than the number of observations.

Introduce heteroscedasticity model

This can be a new parameter in alm(), determining the formula for the scale, e.g. sigma^2_j = a_0 + a_1 x_j + a_2 x_j^2:

scaleFormula = ~x+x^2

This is relatively easy to implement and it can be estimated together with the parameters for the location (that are already in place) via the maximisation of likelihood. This probably does not make sense for the other losses.
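The sm() method listed earlier addresses this use case; a sketch assuming its documented formula interface (scaleFormula itself is only the proposal above):

```r
library(greybox)

locationModel <- alm(mpg ~ wt, data = mtcars)
# Regression for the scale parameter of the distribution
scaleModel <- sm(locationModel, formula = ~ wt + I(wt^2))
```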

Implement pAIC / pBIC etc in lmCombine

This should be done in the following steps:

  1. Fit the models, extract point values,
  2. Do a test (nemenyi / rmc / Tukey) on point values to select the pool of models,
  3. Combine the models in the pool using either Dirichlet or Ranks.

Re: prediction inputs

Hi,
I really love this package: very accurate and easy to use. This is not a bug; it is more an undesirable change in behaviour between version 0.22 and the current package. I used to be able to feed a single row of features into the forecast/predict function and get a prediction. Now the prediction function seems to require more than one row of data. If I put in a single row, I get this error:

Error in matrixOfxreg %*% ourVcov %*% t(matrixOfxreg) : non-conformable arguments

So, when rolling a model forward incrementally, I have to use a test set of 2 rows and then take the last prediction only. Everything still works, but it's slightly inconvenient, and it may be confusing to new users.

Thanks, Gavin.

forecast function for alm

Make this closer to the original one, with h parameter, but smarter:

  1. If h is specified, do forecasts for that horizon;
  2. If h is greater than the number of rows in newdata, produce forecasts from data + newdata and then forecast the response;
  3. If h is less than the number of rows in newdata, cut newdata;
  4. If h is not specified, use all newdata.

predict() function in this case will need newdata and won't have h.

Compatibility with data.table and tbl

Need to check all the functions for the compatibility with "data.table" and "tbl" classes, so that they are used most efficiently and the functions still work.

This relates to: cramer, mcor, determ, assoc, tableplot, spread, alm, stepwise and lmCombine.

At the moment, the analytical functions just change the class to data.frame if they detect "data.table" or "tbl".

alm() with ARMA errors

This means that we need a simultaneous estimation of the whole thing in ALM.

After that the stuff can be introduced in stepwise, lmCombine etc.

Accept factor as response variable

Make it possible to specify factor in the response. This implies switching to numeric and encoding the variable. In cases of "plogis", "pnorm" we should have logit / probit models.

temporaldummy method implementation

Implement the following options for POSIXct:

  1. type="hour", of="week" - currently it assumes that the data is hourly;
  2. type="minute", of="week" - currently it assumes that the data is in minutes;
  3. type="minute", of="day" - currently it assumes that the data is in minutes;
  4. type="halfhour";
  5. type="second", of!="minute" - only minutes are available for "of" at the moment.

Also, implement methods for the classes:

  1. xts
  2. tsibble
  3. timeDate / timeSeries

alm() with the data in differences

Create an option of estimating the model of the type: y[t] = y[t-1] + a0 + a1 x[t] + e, which should be equivalent to: diff(y) = a0 + a1 x[t] + e, but estimated in terms of y instead of differences.

This will be possible to do after solving the issue #29
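Until then, the equivalent model in differences can be sketched in base R with lm(); the series below is simulated purely for illustration:

```r
set.seed(41)
y <- cumsum(rnorm(100, 1, 1))   # simulated series with a unit root
x <- rnorm(100)

# Estimate diff(y) = a0 + a1 x[t] + e
df <- data.frame(dy = diff(y), x = x[-1])
dModel <- lm(dy ~ x, data = df)

# One-step forecast of y, reconstructed from the last observed level
yNext <- tail(y, 1) + predict(dModel, newdata = data.frame(x = 0))
```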

residuals in alm

The residuals extracted from alm() need to correspond to the used distribution:

  1. dnorm: e = y - f;
  2. dlnorm: e = log(y) - log(f);
  3. ...

Speed up alm function

alm() is slow for several reasons:

  1. vcov is done using hessian function from numDeriv. But there seems to be no other way to do that for a general likelihood,
  2. Matrix multiplication in R can be sometimes slow (especially on large samples with big data),
  3. Inverting matrix in the initial calculation of parameters is done using solve() function and potentially can also be improved.

While (2) and (3) are doable, they won't fix (1). Not sure what to do with it...

ro() + forecast + model type FFF

Would it make sense to be able to retrieve the actual final model as a character string from the final fit, when using type "FFF" via forecast and ro()?

Rename spread function

There is a conflict with a spread() from tidyr. The options for new names:

  • spreadplot()
  • explore()
  • scatter()
  • ...

lmCombine() error when training data gets to a certain size, even with bruteForce = FALSE

As per subject; example shown below.

Using greybox_0.4.1.

> dim(data)
[1] 1474   27

> head(data)
          y       x1       x2       x3       x4       x5       x6       x7       x8       x9      x10      x11      x12      x13      x14      x15      x16      x17      x18
1 1.1342020 1031.898 25.33311 169.4480 106.6904 26.70383 2715.493 10.43662 1280.008 84.91353 604.4532 176.3360 192.9604 108.7097 1046.748 34.57609 150.1286 105.3070 37.58345
2 1.0490129 1028.735 29.79364 159.7876 104.9702 25.10849 2712.597 10.97205 1238.919 85.18453 604.2857 170.3515 186.3013 108.7397 1031.898 25.33311 169.4480 106.6904 26.70383
3 1.1446238 1044.398 27.95415 150.3061 104.4126 31.20778 2693.927 10.83171 1204.779 85.84771 592.2333 157.8295 174.7717 107.6142 1028.735 29.79364 159.7876 104.9702 25.10849
4 0.9843351 1038.720 30.00345 145.5703 105.0761 28.00348 2675.840 10.90912 1218.317 87.26625 597.8358 156.0761 171.6895 110.0136 1044.398 27.95415 150.3061 104.4126 31.20778
5 0.9213088 1043.121 30.22741 148.4236 105.0712 29.02609 2656.345 10.97350 1212.080 88.17706 597.5838 159.8943 170.6431 110.4755 1038.720 30.00345 145.5703 105.0761 28.00348
6 0.8701861 1052.820 23.28180 160.5087 104.3836 23.47458 2704.388 10.16553 1287.438 87.92422 595.7473 169.7416 181.6210 106.4576 1043.121 30.22741 148.4236 105.0712 29.02609
       x19      x20      x21      x22      x23      x24      x25      x26
1 2704.123 10.81376 1222.988 79.57888 604.1358 164.6718 186.6565 105.6083
2 2715.493 10.43662 1280.008 84.91353 604.4532 176.3360 192.9604 108.7097
3 2712.597 10.97205 1238.919 85.18453 604.2857 170.3515 186.3013 108.7397
4 2693.927 10.83171 1204.779 85.84771 592.2333 157.8295 174.7717 107.6142
5 2675.840 10.90912 1218.317 87.26625 597.8358 156.0761 171.6895 110.0136
6 2656.345 10.97350 1212.080 88.17706 597.5838 159.8943 170.6431 110.4755

> lmCombine(data[1:50, ], bruteForce = FALSE)
Call:
lmCombine(data = data[1:50, ], bruteForce = FALSE, formula = y ~ 
    .)

Coefficients:
  (Intercept)            x2           x12            x6           x16            x1 
 1.3146239493  0.0225953692  0.0037014722 -0.0004804042  0.0012738700 -0.0004843076 

> lmCombine(data[1:1500, ], bruteForce = FALSE)
Error in cbind(y, model.matrix(cl$formula, data = data[rowsSelected, ])[,  : 
  number of rows of matrices must match (see arg 2)

> lmCombine(data, bruteForce = FALSE)
Error in logLik(ourModel) : object 'ourModel' not found

Make determination work with factors

Currently this is not supported at all... Only dummies are available.
No need to make multinomial or whatever regression, the basic one will do the trick.

Measures of association and advanced plot functions

Create functions that for the provided matrix / data.frame:

  1. Return a matrix of measures association, calculated correctly for each variable in the data (e.g. Pearson's correlations for metric variables and Phi for nominal etc),
  2. Return a mixture of scatter plots and boxplots depending on the types of variables (e.g. scatter for metric scales and boxplots for the metrics vs categorical).

This should help in the analysis of data.

Minor issue in auto.gum()

Hi!

I get a code break when trying to fit a model with auto.gum and type='select'.
The system then tries to fit a multiplicative model and fails with a code break.
I saw this behaviour before in the forecast package, when trying a seasonal model where there is no seasonality.

The problem is that I'm running multiple fits with various inputs and attributes in a loop.
It would be convenient if it switched to the additive model automatically when it cannot fit a multiplicative one.

I can handle this with an explicit exception... so not really a bug...

Checking model with a type="a".
Starting preliminary loop: 1 out of 1. Done.
Searching for appropriate lags: We found them!
Searching for appropriate orders: Orders found.
Checking model with a type="m".
Starting preliminary loop: Error in costfunc(matvt, matF, matw, yInSample, vecg, h, lagsModel, Etype, :
Mat::operator(): index out of bounds

  1. ├─global::bAlgos(i, f, "prediction") ~/R/daScript.R:51786:16
  2. │ └─global::standAloneAlgos(...) ~/R/daScript.R:51399:16
  3. │ └─smooth::auto.gum(...) ~/R/daScript.R:23035:16
  4. │ └─smooth::gum(...)
  5. │ └─smooth:::CreatorGUM(silentText = silentText)
  6. │ └─nloptr::nloptr(...)
  7. ├─(function (x) ...
  8. │ └─smooth:::eval_f(x, ...)
  9. │ └─smooth:::costfunc(...)
  10. ├─base::stop(...)
  11. └─(function () ...
  12. └─lobstr::cst() ~/R/daScript.R:9:29
    No traceback available

Pinball LOWESS

Create a function producing LOWESS smoothing for specific quantiles.
This might help with regression diagnostics, showing how the distribution of residuals changes.

CCF with assoc()

Produce cross-correlation function with measures of association instead of correlation coefficients.

Distributions for ALM

  1. Normal distribution,
  2. F-distribution,
  3. A weird one with the ratio of folded normal distributions.
