Regression model building and forecasting in R


greybox's Introduction

greybox


The greybox package contains functions for model building, currently done via model selection and combination based on information criteria. The resulting model can then be used for analysis and forecasting.

hex-sticker of the greybox package for R

There are several groups of functions in the package.

Regression model functions

  1. alm - Augmented Linear (regression) Model, which implements likelihood estimation of parameters for Normal, Laplace, Asymmetric Laplace, Logistic, Student's t, S, Generalised Normal, Folded Normal, Log Normal, Box-Cox Normal, Logit Normal, Inverse Gaussian, Gamma, Poisson, Negative Binomial, Cumulative Logistic and Cumulative Normal distributions. In a sense this is similar to the glm() function, but with a different set of distributions and a focus on forecasting.
  2. sm - Scale Model, which constructs a regression for the scale parameter of a distribution (e.g. for the variance in the normal distribution). Works like a method applied to an already existing model (lm / alm).
  3. stepwise - implements stepwise selection for the location model, based on information criteria and partial correlations.
  4. lmCombine - combines regression models constructed on the provided data, based on IC weights, and returns a combined alm object.
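The functions above can be chained into a short workflow. A minimal sketch, assuming the built-in mtcars data and the documented alm() / stepwise() interfaces:

```r
library(greybox)

# Likelihood-based regression with the Log-Normal distribution
almModel <- alm(mpg ~ wt + hp, data = mtcars, distribution = "dlnorm")
summary(almModel)

# Stepwise selection based on information criteria;
# the first column of the data is treated as the response
stepModel <- stepwise(mtcars, ic = "AICc")
```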

Exogenous variables transformation functions

  1. xregExpander - function produces lags and leads of the provided data.
  2. xregTransformer - function produces non-linear transformations of the provided data (logs, inverse etc).
  3. xregMultiplier - function produces cross-products of the variables in the matrix. Could be useful when exploring interaction effects of dummy variables.
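A quick sketch of the expander, assuming the documented convention that negative values in lags produce lags and positive ones produce leads:

```r
library(greybox)

x <- rnorm(100, 100, 10)
# Lags 1 and 2 plus lead 1 of the provided variable
xregExpanded <- xregExpander(x, lags = c(-2, -1, 1))
head(xregExpanded)
```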

The data analysis functions

  1. cramer - calculates Cramer's V for two categorical variables and tests the significance of the association.
  2. mcor - returns the coefficient of multiple correlation between variables. This is useful when measuring association between categorical and numerical variables.
  3. association (aka assoc()) - returns a matrix of measures of association, choosing between cramer(), mcor() and cor() depending on the types of the variables.
  4. determination (and the method determ()) - returns the vector of coefficients of determination (R^2) for the provided data. This is useful for diagnosing multicollinearity.
  5. tableplot - plots the graph for two categorical variables.
  6. spread - plots a matrix of scatter / boxplot / tableplot diagrams, depending on the types of the provided variables.
  7. graphmaker - plots the original series, the fitted values and the forecasts.
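A sketch of the analysis functions on the built-in mtcars data, assuming the documented assoc() and determination() interfaces:

```r
library(greybox)

# assoc() chooses cramer(), mcor() or cor() per pair of variables
assocMatrix <- assoc(mtcars)

# Coefficients of determination of each variable regressed on the others;
# values close to 1 indicate multicollinearity
determination(mtcars[, -1])
```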

Models evaluation functions

  1. ro - rolling origin evaluation (see the vignette).
  2. rmcb - Regression for Multiple Comparison with the Best. This is a simplified version of the nemenyi / MCB test, relying on regression on ranks of methods.
  3. measures - the error measures for the provided forecasts. Includes MPE, MAPE, MASE, sMAE, sMSE, RelMAE, RelRMSE, MIS, sMIS, RelMIS, pinball and others.
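A hedged sketch of the error measures, assuming the measures(holdout, forecast, actual) argument order from the package documentation; the flat forecast below is purely for illustration:

```r
library(greybox)

y <- rnorm(110, 100, 5)
train <- y[1:100]
test <- y[101:110]
fc <- rep(mean(train), 10)   # flat benchmark forecast, illustration only

# Holdout values, point forecasts and in-sample actuals
measures(test, fc, train)
```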

Distribution functions:

  1. qlaplace, dlaplace, rlaplace, plaplace - functions for Laplace distribution.
  2. qalaplace, dalaplace, ralaplace, palaplace - functions for Asymmetric Laplace distribution.
  3. qs, ds, rs, ps - functions for S distribution.
  4. qgnorm, dgnorm, rgnorm, pgnorm - functions for Generalised normal distribution.
  5. qfnorm, dfnorm, rfnorm, pfnorm - functions for folded normal distribution.
  6. qtplnorm, dtplnorm, rtplnorm, ptplnorm - functions for three parameter log normal distribution.
  7. qbcnorm, dbcnorm, rbcnorm, pbcnorm - functions for Box-Cox normal distribution (discussed in Box & Cox, 1964).
  8. qlogitnorm, dlogitnorm, rlogitnorm, plogitnorm - functions for Logit-normal distribution.
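These follow the standard R d/p/q/r convention. A sketch for the Laplace distribution, assuming the documented mu and scale parameters:

```r
library(greybox)

dlaplace(0, mu = 0, scale = 1)          # density
plaplace(1, mu = 0, scale = 1)          # CDF
qlaplace(0.975, mu = 0, scale = 1)      # quantile
x <- rlaplace(1000, mu = 0, scale = 1)  # random generation
```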

Methods for the introduced and some existing classes:

  1. temporaldummy - the method that creates a matrix of dummy variables for an object based on the selected frequency. e.g. this can create week of year based on the provided zoo object.
  2. outlierdummy - the method that creates a matrix of dummy variables based on the residuals of an object, selected confidence level and type of residuals.
  3. pointLik - point likelihood method for the time series models.
  4. pAIC, pAICc, pBIC, pBICc - respective point values for the information criteria, based on pointLik.
  5. coefbootstrap - the method that returns bootstrapped coefficients of the model. Useful for the calculation of covariance matrix and confidence intervals for parameters.
  6. summary - returns summary of the regression (either selected or combined).
  7. vcov - covariance matrix for combined models. This is an approximation; the exact matrix is quite messy and not yet available.
  8. confint - confidence intervals for combined models.
  9. predict, forecast - point and interval forecasts for the response variable. forecast method relies on the parameter h (the forecast horizon), while predict is focused on the newdata. See vignettes for the details.
  10. nparam - returns the number of estimated parameters in the model (including location, scale, shift).
  11. nvariate - returns the number of dimensions of the response variable.
  12. actuals - returns the response variable from the model.
  13. plot - plots several graphs for the analysis of the residuals (see documentation for more details).
  14. AICc - AICc for regression with normally distributed residuals.
  15. BICc - BICc for regression with normally distributed residuals.
  16. is.greybox, is.alm etc. - functions to check if the object was generated by respective functions.
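A sketch contrasting the two forecasting methods on an alm model, assuming the documented predict() / forecast() interfaces (the mtcars split is arbitrary, for illustration):

```r
library(greybox)

almModel <- alm(mpg ~ wt + hp, data = mtcars[1:30, ])

# predict() is driven by newdata and supports interval types
predict(almModel, newdata = mtcars[31:32, ],
        interval = "prediction", level = 0.95)

# forecast() relies on the horizon h
forecast(almModel, newdata = mtcars[31:32, ], h = 2)
```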

Experimental functions:

  1. lmDynamic - linear regression with time varying parameters based on pAIC.

Installation

The stable version of the package is available on CRAN, so you can install it by running:

install.packages("greybox")

The development version is available on GitHub and can be installed via the remotes package. First make sure that you have remotes:

if (!require("remotes")){install.packages("remotes")}

and after that run:

remotes::install_github("config-i1/greybox")

greybox's People

Contributors: config-i1, yforecasting

greybox's Issues

mcor: x works only if numeric?

The mcor documentation example works: mcor(mtcars$am, mtcars$mpg). However, when x is a factor, as in the example provided here, mcor(mtcarsData$am, mtcarsData$mpg) fails with this error:

Error in terms.formula(object, data = data) : '.' in formula and no 'data' argument

Produce forecasts based on combiner for fat regressions

Currently it is not possible because predict.lm says:

Error in qr.R(qr.lm(object))[p1, p1] : subscript out of bounds

This is probably because the number of parameters in the combined model is higher than the number of observations.

Introduce heteroscedasticity model

This can be a new parameter in alm(), determining the formula for the scale, e.g. sigma^2_j = a_0 + a_1 x_j + a_2 x_j^2:

scaleFormula = ~x+x^2

This is relatively easy to implement and it can be estimated together with the parameters for the location (that are already in place) via the maximisation of likelihood. This probably does not make sense for the other losses.
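The sm() method listed earlier addresses this use case; a sketch assuming its documented formula interface (scaleFormula itself is only the proposal above):

```r
library(greybox)

locationModel <- alm(mpg ~ wt, data = mtcars)
# Regression for the scale parameter of the distribution
scaleModel <- sm(locationModel, formula = ~ wt + I(wt^2))
```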

Implement pAIC / pBIC etc in lmCombine

This should be done in the following steps:

  1. Fit the models, extract point values,
  2. Do a test (nemenyi / rmc / Tukey) on point values to select the pool of models,
  3. Combine the models in the pool using either Dirichlet or Ranks.

Re: prediction inputs

Hi,
I really love this package: very accurate and easy to use. This is not a bug; it is more an undesirable change in behaviour between version 0.22 and the current package. I used to be able to feed a single row of features into the forecast/predict function and get a prediction. Now the prediction function seems to require more than one row of data. If I put in a single row, I get this error:

Error in matrixOfxreg %*% ourVcov %*% t(matrixOfxreg) : non-conformable arguments

So, when rolling a model forward incrementally, I have to use a test set of 2 rows and then take the last prediction only. Everything still works, but it's slightly inconvenient, and it may be confusing to new users.

Thanks, Gavin.

forecast function for alm

Make this closer to the original one, with h parameter, but smarter:

  1. If h is specified, do forecasts for that horizon;
  2. If h is greater than the number of rows in newdata, produce forecasts from data + newdata and then forecast the response;
  3. If h is less than the number of rows in newdata, cut newdata;
  4. If h is not specified, use all newdata.

predict() function in this case will need newdata and won't have h.

Compatibility with data.table and tbl

Need to check all the functions for the compatibility with "data.table" and "tbl" classes, so that they are used most efficiently and the functions still work.

This relates to: cramer, mcor, determ, assoc, tableplot, spread, alm, stepwise and lmCombine.

At the moment, the analytical functions just change the class to data.frame if they detect "data.table" or "tbl".

alm() with ARMA errors

This means that we need a simultaneous estimation of the whole thing in ALM.

After that the stuff can be introduced in stepwise, lmCombine etc.

Accept factor as response variable

Make it possible to specify factor in the response. This implies switching to numeric and encoding the variable. In cases of "plogis", "pnorm" we should have logit / probit models.

temporaldummy method implementation

Implement the following options for POSIXct:

  1. type="hour", of="week" - currently it assumes that the data is hourly;
  2. type="minute", of="week" - currently it assumes that the data is in minutes;
  3. type="minute", of="day" - currently it assumes that the data is in minutes;
  4. type="halfhour";
  5. type="second", of!="minute" - only minutes are available for "of" at the moment.

Also, implement methods for the classes:

  1. xts
  2. tsibble
  3. timeDate / timeSeries

alm() with the data in differences

Create an option of estimating the model of the type: y[t] = y[t-1] + a0 + a1 x[t] + e, which should be equivalent to: diff(y) = a0 + a1 x[t] + e, but estimated in terms of y instead of differences.

This will be possible to do after solving the issue #29
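Until then, the equivalent model in differences can be sketched in base R with lm(); the series below is simulated purely for illustration:

```r
set.seed(41)
y <- cumsum(rnorm(100, 1, 1))   # simulated series with a unit root
x <- rnorm(100)

# Estimate diff(y) = a0 + a1 x[t] + e
df <- data.frame(dy = diff(y), x = x[-1])
dModel <- lm(dy ~ x, data = df)

# One-step forecast of y, reconstructed from the last observed level
yNext <- tail(y, 1) + predict(dModel, newdata = data.frame(x = 0))
```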

residuals in alm

The residuals extracted from alm() need to correspond to the used distribution:

  1. dnorm: e = y - f;
  2. dlnorm: e = log(y) - log(f);
  3. ...

Speed up alm function

alm() is slow for several reasons:

  1. vcov is done using hessian function from numDeriv. But there seems to be no other way to do that for a general likelihood,
  2. Matrix multiplication in R can be sometimes slow (especially on large samples with big data),
  3. Inverting matrix in the initial calculation of parameters is done using solve() function and potentially can also be improved.

While (2) and (3) are doable, they won't fix (1). Not sure what to do with it...

ro() + forecast + model type FFF

Would it make sense to be able to retrieve the actual final model as a character string from the final fit, when using type "FFF" via forecast and ro()?

Rename spread function

There is a conflict with a spread() from tidyr. The options for new names:

  • spreadplot()
  • explore()
  • scatter()
  • ...

lmCombine() error when training data gets to a certain size, even with bruteForce = FALSE

As per subject; example shown below.

Using greybox_0.4.1.

> dim(data)
[1] 1474   27

> head(data)
          y       x1       x2       x3       x4       x5       x6       x7       x8       x9      x10      x11      x12      x13      x14      x15      x16      x17      x18
1 1.1342020 1031.898 25.33311 169.4480 106.6904 26.70383 2715.493 10.43662 1280.008 84.91353 604.4532 176.3360 192.9604 108.7097 1046.748 34.57609 150.1286 105.3070 37.58345
2 1.0490129 1028.735 29.79364 159.7876 104.9702 25.10849 2712.597 10.97205 1238.919 85.18453 604.2857 170.3515 186.3013 108.7397 1031.898 25.33311 169.4480 106.6904 26.70383
3 1.1446238 1044.398 27.95415 150.3061 104.4126 31.20778 2693.927 10.83171 1204.779 85.84771 592.2333 157.8295 174.7717 107.6142 1028.735 29.79364 159.7876 104.9702 25.10849
4 0.9843351 1038.720 30.00345 145.5703 105.0761 28.00348 2675.840 10.90912 1218.317 87.26625 597.8358 156.0761 171.6895 110.0136 1044.398 27.95415 150.3061 104.4126 31.20778
5 0.9213088 1043.121 30.22741 148.4236 105.0712 29.02609 2656.345 10.97350 1212.080 88.17706 597.5838 159.8943 170.6431 110.4755 1038.720 30.00345 145.5703 105.0761 28.00348
6 0.8701861 1052.820 23.28180 160.5087 104.3836 23.47458 2704.388 10.16553 1287.438 87.92422 595.7473 169.7416 181.6210 106.4576 1043.121 30.22741 148.4236 105.0712 29.02609
       x19      x20      x21      x22      x23      x24      x25      x26
1 2704.123 10.81376 1222.988 79.57888 604.1358 164.6718 186.6565 105.6083
2 2715.493 10.43662 1280.008 84.91353 604.4532 176.3360 192.9604 108.7097
3 2712.597 10.97205 1238.919 85.18453 604.2857 170.3515 186.3013 108.7397
4 2693.927 10.83171 1204.779 85.84771 592.2333 157.8295 174.7717 107.6142
5 2675.840 10.90912 1218.317 87.26625 597.8358 156.0761 171.6895 110.0136
6 2656.345 10.97350 1212.080 88.17706 597.5838 159.8943 170.6431 110.4755

> lmCombine(data[1:50, ], bruteForce = FALSE)
Call:
lmCombine(data = data[1:50, ], bruteForce = FALSE, formula = y ~ 
    .)

Coefficients:
  (Intercept)            x2           x12            x6           x16            x1 
 1.3146239493  0.0225953692  0.0037014722 -0.0004804042  0.0012738700 -0.0004843076 

> lmCombine(data[1:1500, ], bruteForce = FALSE)
Error in cbind(y, model.matrix(cl$formula, data = data[rowsSelected, ])[,  : 
  number of rows of matrices must match (see arg 2)

> lmCombine(data, bruteForce = FALSE)
Error in logLik(ourModel) : object 'ourModel' not found

Make determination work with factors

Currently this is not supported at all... Only dummies are available.
No need to make multinomial or whatever regression, the basic one will do the trick.

Measures of association and advanced plot functions

Create functions that for the provided matrix / data.frame:

  1. Return a matrix of measures association, calculated correctly for each variable in the data (e.g. Pearson's correlations for metric variables and Phi for nominal etc),
  2. Return a mixture of scatter plots and boxplots depending on the types of variables (e.g. scatter for metric scales and boxplots for the metrics vs categorical).

This should help in the analysis of data.

Minor issue in auto.gum()

Hi!

I get a code break when trying to fit a model with auto.gum and type='select'.
The system then tries to fit a multiplicative model and fails with a code break.
I saw this behaviour before in the forecast package, when trying a seasonal model where there is no seasonality.

The problem is that I'm running multiple fits with various inputs and attributes in a loop.
It would be convenient if it switched to the additive model automatically when it cannot fit a multiplicative one.

I can handle this with an explicit exception... so not really a bug...

Checking model with a type="a".
Starting preliminary loop: 1 out of 1. Done.
Searching for appropriate lags: We found them!
Searching for appropriate orders: Orders found.
Checking model with a type="m".
Starting preliminary loop: Error in costfunc(matvt, matF, matw, yInSample, vecg, h, lagsModel, Etype, :
Mat::operator(): index out of bounds

  1. ├─global::bAlgos(i, f, "prediction") ~/R/daScript.R:51786:16
  2. │ └─global::standAloneAlgos(...) ~/R/daScript.R:51399:16
  3. │ └─smooth::auto.gum(...) ~/R/daScript.R:23035:16
  4. │ └─smooth::gum(...)
  5. │ └─smooth:::CreatorGUM(silentText = silentText)
  6. │ └─nloptr::nloptr(...)
  7. ├─(function (x) ...
  8. │ └─smooth:::eval_f(x, ...)
  9. │ └─smooth:::costfunc(...)
  10. ├─base::stop(...)
  11. └─(function () ...
  12. └─lobstr::cst() ~/R/daScript.R:9:29
    No traceback available

Pinball LOWESS

Create a function producing LOWESS smoothing for specific quantiles.
This might help with regression diagnostics, showing how the distribution of residuals changes.

CCF with assoc()

Produce cross-correlation function with measures of association instead of correlation coefficients.

Distributions for ALM

  1. Normal distribution,
  2. F-distribution,
  3. A weird one with the ratio of folded normal distributions.
