
agtboost's Introduction


aGTBoost

Adaptive and automatic gradient tree boosting computations

aGTBoost is a lightning-fast gradient tree boosting library designed to avoid manual tuning and cross-validation by taking an information-theoretic approach. This makes the algorithm adaptive to the dataset at hand; it is completely automatic, with minimal risk of overfitting. Consequently, speed-ups relative to state-of-the-art implementations can be in the thousands, while the mathematical and technical knowledge required of the user is kept to a minimum.

Note: Currently for academic purposes: implementing and testing new innovations w.r.t. information-theoretic choices of GTB complexity. See the to-do research list below.

Installation

R: Finally on CRAN! Install the stable version with

install.packages("agtboost")

or install the development version from GitHub

devtools::install_github("Blunde1/agtboost/R-package")

Users experiencing errors after warnings during installation may be helped by running the following command prior to installation:

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")

Example code and documentation

agtboost essentially has two functions: a train function gbt.train and a predict function predict. The code below shows how to train an aGTBoost model using a design matrix x and a response vector y; write ?gbt.train in the console for detailed documentation.

library(agtboost)

# -- Load data --
data(caravan.train, package = "agtboost")
data(caravan.test, package = "agtboost")
train <- caravan.train
test <- caravan.test

# -- Model building --
mod <- gbt.train(train$y, train$x, loss_function = "logloss", verbose=10)

# -- Predictions --
prob <- predict(mod, test$x) # Score after logistic transformation: Probabilities

agtboost also contains functions for model inspection and validation.

  • Feature importance: gbt.importance generates a typical feature-importance plot. Techniques like inserting noise features are redundant, since the computations are done w.r.t. approximate generalization (test) loss.
  • Convergence: gbt.convergence computes the loss over the path of boosting iterations; check visually for convergence of the test loss (a minimal call is sketched after the code below).
  • Model validation: gbt.ksval transforms observations to standard uniformly distributed random variables if the model is correctly specified. It performs a formal Kolmogorov-Smirnov test and plots the transformed observations for visual inspection.
# -- Feature importance --
gbt.importance(feature_names=colnames(caravan.train$x), object=mod)

# -- Model validation --
gbt.ksval(object=mod, y=caravan.test$y, x=caravan.test$x)
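
The convergence check can be run in the same way (a minimal sketch; see ?gbt.convergence for the exact interface and return value):

# -- Convergence check --
conv <- gbt.convergence(object=mod, y=caravan.test$y, x=caravan.test$x)
plot(conv, type="l", xlab="Boosting iteration", ylab="Loss")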

The functions gbt.ksval and gbt.importance also produce diagnostic plots (not reproduced here).

Furthermore, an aGTBoost model is (see example code)

Dependencies

Scheduled updates

  • Adaptive and automatic deterministic frequentist gradient tree boosting.
  • Information criterion for fast histogram algorithm (non-exact search) (Fall 2020, planned)
  • Adaptive L2-penalized gradient tree boosting. (Fall 2020, planned)
  • Automatic stochastic gradient tree boosting. (Fall 2020/Spring 2021, planned)

Hopeful updates

  • Optimal stochastic gradient tree boosting.

References

Contribute

Any help on the following subjects is especially welcome:

  • Utilizing sparsity (possibly Eigen sparsity).
  • Parallelization (CPU and/or GPU).
  • Distribution (Python, Java, Scala, ...).
  • Good ideas and coding best practices in general.

Please note that the priority is to work on and push the above-mentioned scheduled updates. Patience is a virtue. :)

agtboost's People

Contributors

blunde1, lukasappelhans, sondreus


agtboost's Issues

Obtain `xgboost` and `lightgbm` parameters from trained `agtboost` model

The most important parameters of other tree-boosting libraries should be possible to obtain from an agtboost ensemble.
Possible parameters include

  • Number of boosting iterations
  • Tree depth
  • Minimum reduction in node to split
  • Minimum Hessian weights
  • Minimum observations in node

Usefulness
Implementations such as xgboost are very robust, but time-consuming to tune.
It might be desirable to have an xgboost model in production, with its parameters derived from a trained agtboost model; a rough sketch is given below.
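
A possible shape of this (the accessor get_params_from_agtboost is hypothetical and does not exist in the package; the xgboost calls follow its standard R interface, and mod, train$x and train$y are taken from the example above):

# hypothetical: extract tuning parameters from a trained agtboost model
params_agtb <- get_params_from_agtboost(mod)  # e.g. list(nrounds = ..., max_depth = ...)

library(xgboost)
dtrain <- xgb.DMatrix(train$x, label = train$y)
bst <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = params_agtb$max_depth),
                 data = dtrain,
                 nrounds = params_agtb$nrounds)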

Make agtboost model object entirely numerical

The agtboost model class contains the type of loss-function as an std::string type attribute. This is problematic (for memory and garbage collection reasons) when working with a model object from the R side. See e.g. the following comment #55 (comment).

The model could and should consist of entirely numerical attributes.

Infinite boosting rounds with "logloss" loss function

I was trying to predict a binary variable, specifying loss_function = "logloss", but instead of stopping early, the boosting rounds continue even after improvement stops. I expected training on the iris dataset to be very quick (as with loss_function = "mse"), but it just keeps going. The same issue occurs with other loss functions (a similar issue with "poisson"), except "mse", which performs as expected. The example can be reproduced as below.

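A minimal reproduction along the lines described (a sketch; the binary encoding of iris$Species is an assumption):

library(agtboost)
data(iris)
x <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species == "setosa")  # binary target
mod <- gbt.train(y, x, loss_function = "logloss", verbose = 1)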

Stripping debugging information does not fly with CRAN

The CRAN policy contains

- Packages should not attempt to disable compiler diagnostics, nor to
remove other diagnostic information such as symbols in shared objects.

yet packages

MatchIt agtboost briskaR kgrams matrixprofiler prospectr resemble
strucchangeRcpp

attempt to do so.

Do correct before 2021-11-26 to safely retain your package on CRAN, and
as this will require manual check, do allow time for the CRAN
submissions team to do so.

Suggestions:

Handle categorical features internally

Currently agtboost only supports a purely numerical design matrix.
Consider letting agtboost handle categorical features (factor variables in R) internally; a workaround sketch using one-hot encoding is given after the lists below.

Benefits

  • Easier interface for users
  • Could in principle support R formula syntax of the type y ~ x1 + x2 + ...
  • Possibly more accurate feature importance

Downsides

  • More code and added complexity
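
Until such support exists, a minimal workaround (a sketch; df is an assumed data frame with factor columns and y the response) is to expand factors to dummy columns before calling gbt.train:

# one-hot encode factor columns with base R, then train as usual
x <- model.matrix(~ . - 1, data = df)  # expand factors to dummy columns, drop the intercept
mod <- gbt.train(y, x, loss_function = "mse")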

Influence-adjusted derivatives-calculations

Secret manuscript will arrive on arxiv first

In principle:

  • derivatives are influenced by their own response (self-influence)
  • employ the asymptotic correspondence between n-fold CV, influence adjustment and TIC

Refactor codebase

Overall, codebase, especially on the C++ side, needs cleanup in terms of

  • Commented code
  • Style-guide
  • Verbosity

A major revision would include removing the different variants of count regression except for ordinary Poisson with log-link.

Let predict return g^{-1}(f(x))

Currently, only loss_function=count::auto returns E[y|x] by default.
It uses the following code:

if (type %in% c("", "response")) {
    # predict mean
    res <- object$predict(newdata)
    res <- exp(res)
} else if (type == "link_response") {
    # predict response on log (link) level
    res <- object$predict(newdata)
}

Implement this for the ordinary loss functions in R/gbt.pred; a sketch of the mapping is given below.
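
A minimal sketch of such a mapping (the helper inv_link is hypothetical and not part of the package; how the loss_function name is retrieved from the model object is left open):

# map each loss function to its inverse link g^{-1} (sketch)
inv_link <- function(loss_function) {
    switch(loss_function,
           "mse"     = identity,
           "logloss" = plogis,  # logistic transformation
           "poisson" = exp,     # log link
           identity)
}

# usage: apply the inverse link to the raw ensemble prediction
res <- inv_link(loss_function)(object$predict(newdata))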

Let users define custom loss-functions

See the xgboost implementation; a sketch of that interface is given after the suggestions below.
This would be easy if gradient calculations were done on the R side.
But it is harder, since agtboost's gradient calculations happen on the C++ side.

Suggestions

  • Take a look at whether issue #30 gives a hint towards a major refactoring of the code.
  • Should the Ensemble class be removed / become an R-style class rather than a C++-side class?
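
For reference, a sketch of the xgboost-style interface that could be mimicked (this is xgboost's API, not agtboost's; x and y are assumed to exist):

library(xgboost)

# custom log-loss objective: the user supplies gradient and Hessian of the loss
logreg_obj <- function(preds, dtrain) {
    labels <- getinfo(dtrain, "label")
    p <- 1 / (1 + exp(-preds))
    list(grad = p - labels, hess = p * (1 - p))
}

dtrain <- xgb.DMatrix(as.matrix(x), label = y)
bst <- xgb.train(params = list(max_depth = 2, eta = 0.1),
                 data = dtrain, nrounds = 10, obj = logreg_obj)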

Optimal initial prediction

When the offset is nonzero, boosting from the average of y (which is optimal with zero offset) or from zero is sub-optimal.
Thus: provide an optimal initial prediction when the offset is nonzero.
An optimization algorithm starting from the average is natural when no closed-form result is available.
Explore whether g and h may be used for an internal optimizer: it would be able to solve convex problems; a sketch is given below.
In this case, it is natural to create an "initial_boosting_prediction.h" module.
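
A minimal sketch of such an optimizer for the Poisson case with a nonzero offset, using the usual first and second derivatives g and h (illustrative only, not the package's internals):

# Newton iterations for the constant initial prediction f0 under Poisson loss with offset
optimal_initial_poisson <- function(y, offset, iters = 20) {
    f0 <- log(mean(y) + 1e-12)        # start from (the log of) the average
    for (i in seq_len(iters)) {
        mu <- exp(f0 + offset)        # predicted mean per observation
        g  <- sum(mu - y)             # sum of first derivatives of the negative log-likelihood
        h  <- sum(mu)                 # sum of second derivatives
        f0 <- f0 - g / h              # Newton step
    }
    f0
}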

NaN predictions

The following code:

library(agtboost)
data(caravan.train, package = "agtboost")
train <- caravan.train
mod <- gbt.train(train$y, train$x, loss_function="logloss", verbose=10, nrounds=500, learning_rate=0.1)

gives the following output:

it: 1  |  n-leaves: 167  |  tr loss: 0.1703  |  gen loss: 0.1686
it: 10  |  n-leaves: 80  |  tr loss: 0.1073  |  gen loss: 0.1041
it: 20  |  n-leaves: 6  |  tr loss: 0.05831  |  gen loss: 0.0536
it: 30  |  n-leaves: 432  |  tr loss: 0.0302  |  gen loss: 0.02553
it: 40  |  n-leaves: 411  |  tr loss: 0.01981  |  gen loss: 0.01515
it: 50  |  n-leaves: 433  |  tr loss: 0.01582  |  gen loss: 0.01117
it: 60  |  n-leaves: 30  |  tr loss: 0.01389  |  gen loss: 0.009244
it: 70  |  n-leaves: 434  |  tr loss: 0.01263  |  gen loss: 0.007985
it: 80  |  n-leaves: 4  |  tr loss: 0.01221  |  gen loss: 0.007566
it: 90  |  n-leaves: 2  |  tr loss: 0.01183  |  gen loss: 0.007186
it: 100  |  n-leaves: 217  |  tr loss: 0.01156  |  gen loss: 0.006912
it: 110  |  n-leaves: 473  |  tr loss: 0.01138  |  gen loss: 0.00674
it: 120  |  n-leaves: 496  |  tr loss: 0.01129  |  gen loss: 0.006649
it: 130  |  n-leaves: 252  |  tr loss: 0.01125  |  gen loss: 0.006602
it: 140  |  n-leaves: 475  |  tr loss: 0.01122  |  gen loss: 0.006576
it: 150  |  n-leaves: 485  |  tr loss: 0.01121  |  gen loss: 0.006563
it: 160  |  n-leaves: 276  |  tr loss: 0.0112  |  gen loss: 0.006554
it: 170  |  n-leaves: 276  |  tr loss: 0.01119  |  gen loss: 0.00655
it: 180  |  n-leaves: 276  |  tr loss: 0.01119  |  gen loss: 0.006547
it: 190  |  n-leaves: 276  |  tr loss: 0.01119  |  gen loss: 0.006546
it: 200  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006545
it: 210  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006545
it: 220  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 230  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 240  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 250  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 260  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 270  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 280  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 290  |  n-leaves: 282  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 300  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 310  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 320  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 330  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 340  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 350  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 360  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 370  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 380  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 390  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 400  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 410  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 420  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 430  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 440  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 450  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 460  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 470  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 480  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 490  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 500  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan

Calling predict on the resulting model then outputs NaN predictions.

Fast histogram algorithm

The information criterion poses no problems with regard to implementing a fast histogram-based algorithm.
This should replace the n log(n) sorting in the split search with something linear in n; a binning sketch is given below.
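
A minimal sketch of the histogram idea for a single feature (illustrative only; names are not from the codebase): bin the feature, accumulate gradient and Hessian sums per bin, then scan the bin boundaries with the usual second-order gain formula:

histogram_split <- function(x, g, h, n_bins = 256, lambda = 0) {
    breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE))
    bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
    Gb <- tapply(g, bins, sum)                 # gradient sum per bin
    Hb <- tapply(h, bins, sum)                 # Hessian sum per bin
    GL <- cumsum(Gb); HL <- cumsum(Hb)
    G <- sum(Gb); H <- sum(Hb)
    # gain of splitting after each bin boundary (cannot split after the last bin)
    gain <- GL^2 / (HL + lambda) + (G - GL)^2 / (H - HL + lambda) - G^2 / (H + lambda)
    gain <- gain[-length(gain)]
    which.max(gain)                            # index of the best bin boundary
}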

Problems with scalability on size of design-matrix

agtboost does not scale well with large n and m.
In principle, it should scale as well as e.g. xgboost with its algorithm set to exact.
There also seem to be problems with memory consumption for large datasets.

Goal
Make agtboost scale well in n and m.

Suspicion
I suspect this is due to the re-declarations/initializations of the u_store and u vectors and the cir calculations in split_information().

  • Find a smarter way to do these calculations. See if this fixes scalability issue.

Low-frequency count data produces root-trees and does not converge

The following example with low-frequency data does not converge.
Here, E[Y] = 0.1% given that the exposure is 1 (the average exposure E[Z] is 0.5 for the training data, but 1.0 for the test data).
Training seems to break at iteration 824.

# Import library
library(agtboost)

# Reproducible
set.seed(123)

# Approximately similar n, with low-frequency events
# x: feature, EX = 1
# z: exposure
# y: response
n <- 100000
average_frequency <- 0.001
xtr <- as.matrix(runif(n, 0, 2)) # EX = 1
ztr <- runif(n)
ytr <- rpois(n, ztr * average_frequency * xtr)
xte <- as.matrix(runif(n, 0, 2))
yte <- rpois(n, average_frequency * xte) # test for z=1

# Plot data, ensure it looks appropriate
par(mfrow=c(1,2))
plot_sample <- sample(n, 10000)
plot(xte[plot_sample,], yte[plot_sample])
plot((xtr*ztr)[plot_sample,], ytr[plot_sample])
par(mfrow=c(1,1))

# Train agtb-model
agtb <- gbt.train(ytr, xtr, loss_function = "poisson", offset=log(ztr), verbose=1)

# Compare null-model and agtb-model on test-data
-sum(dpois(yte, mean(ytr), log=TRUE))
# agtb model
-sum(dpois(yte, predict(agtb, xte), log=TRUE))

# Inspect predictions
plot(xte[plot_sample,], yte[plot_sample])
points(xte[plot_sample], predict(agtb, xte)[plot_sample], col=2)

Monotone constraints

Both xgboost and lightgbm implement monotone constraints.
This is important to practitioners.
An introduction to a possible implementation is given here; a usage sketch of the xgboost interface is shown below.
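
For reference, a sketch of how monotone constraints are requested through xgboost's R interface (shown only as the pattern to mirror; x and y are assumed to exist, and the exact parameter format may vary between versions):

library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(x), label = y)
params <- list(objective = "reg:squarederror",
               # +1: increasing in feature 1, -1: decreasing in feature 2, 0: unconstrained
               monotone_constraints = "(1,-1,0)")
bst <- xgb.train(params = params, data = dtrain, nrounds = 50)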

Save gbtorch model object

The model object cannot be saved by R's default save() function as .RData.
A custom save function must be created, say gbt.save_gbt_model(), that saves the model as a binary file.
A complementary gbt.load_gbt_model() must also be created.

Memory leaks in gbt.load()

I think I have detected a memory leak in gbt.load(). It could be that I'm failing to understand how R handles memory (and the imperfections of that process), but maybe adding a manual garbage collection in the C++ subroutine could be the way to solve this?

Minimal replication:

library(agtboost)

x <- runif(50000, 0, 10)
y <- rnorm(50000, x, 1)

mod <- gbt.train(y, as.matrix(x))
gbt.save(mod, 'gbt_model.gbt')

Memory usage (by the R process, not the objects in the R workspace) increases on my system by roughly 1 MB every time the line below is run:

mod <- gbt.load('gbt_model.gbt')

agtboost tests

Use testthat; a starting-point sketch is given after the list below.
As a minimum, write tests for:

  • gbt.save() and gbt.load()
  • gbt.train()
  • predict()
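
A minimal starting point (a sketch; expectations and tolerances would need refinement):

library(agtboost)
library(testthat)

test_that("gbt.train, predict, gbt.save and gbt.load work together", {
    x <- as.matrix(runif(500, 0, 4))
    y <- rnorm(500, x, 1)
    mod <- gbt.train(y, x)
    pred <- predict(mod, x)
    expect_equal(length(pred), nrow(x))

    # a save/load roundtrip should preserve predictions
    tmp <- tempfile(fileext = ".gbt")
    gbt.save(mod, tmp)
    mod2 <- gbt.load(tmp)
    expect_equal(predict(mod2, x), pred)
})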

Add `offset` possibility to `gbt.train` and `predict`

This is important for many applications, but perhaps especially for Poisson regression, where observations have been observed over different time lengths / fractions of the total time; a usage sketch follows the list.

  • Let pred = g^{-1}(F(x) + offset).
  • The Poisson-regression case would then be handled by letting the offset equal the log time-length.
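
A sketch of the intended usage, following the call that appears in the low-frequency count-data issue above (y, x and the exposure vector are assumed to exist):

# Poisson regression where each observation has its own exposure (observation time)
mod <- gbt.train(y, x, loss_function = "poisson", offset = log(exposure))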
