
agtboost's Introduction


aGTBoost

Adaptive and automatic gradient tree boosting computations

aGTBoost is a lightning-fast gradient tree boosting library designed to avoid manual tuning and cross-validation by taking an information-theoretic approach. This makes the algorithm adaptive to the dataset at hand; it is completely automatic, with minimal risk of overfitting. Consequently, speed-ups relative to state-of-the-art implementations can be in the thousands, while the mathematical and technical knowledge required of the user is kept to a minimum.

Note: Currently for academic purposes: implementing and testing new innovations w.r.t. information-theoretic choices of GTB complexity. See the to-do research list below.

Installation

R: Finally on CRAN! Install the stable version with

install.packages("agtboost")

or install the development version from GitHub

devtools::install_github("Blunde1/agtboost/R-package")

Users experiencing errors after warnings during installation may be helped by running the following command prior to installation:

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")

Example code and documentation

agtboost essentially has two functions: a train function gbt.train and a predict function predict. The code below shows how to train an aGTBoost model using a design matrix x and a response vector y; write ?gbt.train in the console for detailed documentation.

library(agtboost)

# -- Load data --
data(caravan.train, package = "agtboost")
data(caravan.test, package = "agtboost")
train <- caravan.train
test <- caravan.test

# -- Model building --
mod <- gbt.train(train$y, train$x, loss_function = "logloss", verbose=10)

# -- Predictions --
prob <- predict(mod, test$x) # Score after logistic transformation: Probabilities

agtboost also contains functions for model inspection and validation.

  • Feature importance: gbt.importance generates a typical feature-importance plot. Techniques like inserting noise features are redundant, since the computations are done w.r.t. approximate generalization (test) loss.
  • Convergence: gbt.convergence computes the loss over the path of boosting iterations; check visually for convergence of the test loss (a minimal call is sketched after the code below).
  • Model validation: gbt.ksval transforms observations to standard uniformly distributed random variables if the model is correctly specified. It performs a formal Kolmogorov-Smirnov test and plots the transformed observations for visual inspection.
# -- Feature importance --
gbt.importance(feature_names=colnames(caravan.train$x), object=mod)

# -- Model validation --
gbt.ksval(object=mod, y=caravan.test$y, x=caravan.test$x)
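
The convergence check can be run in the same way (a minimal sketch; see ?gbt.convergence for the exact interface and return value):

# -- Convergence check --
conv <- gbt.convergence(object=mod, y=caravan.test$y, x=caravan.test$x)
plot(conv, type="l", xlab="Boosting iteration", ylab="Loss")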

The functions gbt.ksval and gbt.importance also produce diagnostic plots (not reproduced here).

Furthermore, an aGTBoost model is (see example code)

Dependencies

Scheduled updates

  • Adaptive and automatic deterministic frequentist gradient tree boosting.
  • Information criterion for fast histogram algorithm (non-exact search) (Fall 2020, planned)
  • Adaptive L2-penalized gradient tree boosting. (Fall 2020, planned)
  • Automatic stochastic gradient tree boosting. (Fall 2020/Spring 2021, planned)

Hopeful updates

  • Optimal stochastic gradient tree boosting.

References

Contribute

Any help on the following subjects is especially welcome:

  • Utilizing sparsity (possibly Eigen sparsity).
  • Parallelization (CPU and/or GPU).
  • Distribution (Python, Java, Scala, ...).
  • Good ideas and coding best practices in general.

Please note that the priority is to work on and push the above-mentioned scheduled updates. Patience is a virtue. :)

agtboost's People

Contributors

blunde1, lukasappelhans, sondreus


agtboost's Issues

Obtain `xgboost` and `lightgbm` parameters from trained `agtboost` model

The most important parameters of other tree-boosting libraries should be possible to obtain from an agtboost ensemble.
Possible parameters include

  • Number of boosting iterations
  • Tree depth
  • Minimum reduction in node to split
  • Minimum Hessian weights
  • Minimum observations in node

Usefulness
Implementations such as xgboost are very robust, but time-consuming to tune.
It might be desirable to have an xgboost model in production, with its parameters derived from a trained agtboost model; a rough sketch is given below.
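
A possible shape of this (the accessor get_params_from_agtboost is hypothetical and does not exist in the package; the xgboost calls follow its standard R interface, and mod, train$x and train$y are taken from the example above):

# hypothetical: extract tuning parameters from a trained agtboost model
params_agtb <- get_params_from_agtboost(mod)  # e.g. list(nrounds = ..., max_depth = ...)

library(xgboost)
dtrain <- xgb.DMatrix(train$x, label = train$y)
bst <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = params_agtb$max_depth),
                 data = dtrain,
                 nrounds = params_agtb$nrounds)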

Make agtboost model object entirely numerical

The agtboost model class contains the type of loss-function as an std::string type attribute. This is problematic (for memory and garbage collection reasons) when working with a model object from the R side. See e.g. the following comment #55 (comment).

The model could and should consist of entirely numerical attributes.

Infinite boosting rounds with "logloss" loss function

I was trying to predict a binary variable, specifying loss_function = "logloss", but instead of stopping early, the boosting rounds continue even after improvement stops. I expected training on the iris dataset to be very quick (as with loss_function = "mse"), but it just keeps going. The same issue occurs with other loss functions (a similar issue with "poisson"), except "mse", which performs as expected. The example can be reproduced as below.

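A minimal reproduction along the lines described (a sketch; the binary encoding of iris$Species is an assumption):

library(agtboost)
data(iris)
x <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species == "setosa")  # binary target
mod <- gbt.train(y, x, loss_function = "logloss", verbose = 1)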

Stripping debugging information does not fly with CRAN

The CRAN policy contains

- Packages should not attempt to disable compiler diagnostics, nor to
remove other diagnostic information such as symbols in shared objects.

yet packages

MatchIt agtboost briskaR kgrams matrixprofiler prospectr resemble
strucchangeRcpp

attempt to do so.

Do correct before 2021-11-26 to safely retain your package on CRAN, and
as this will require manual check, do allow time for the CRAN
submissions team to do so.

Suggestions:

Handle categorical features internally

Currently agtboost only supports a purely numerical design matrix.
Consider letting agtboost handle categorical features (factor variables in R) internally; a workaround sketch using one-hot encoding is given after the lists below.

Benefits

  • Easier interface for users
  • Could in principle support R formula syntax of the type y ~ x1 + x2 + ...
  • Possibly more accurate feature importance

Downsides

  • More code and added complexity
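
Until such support exists, a minimal workaround (a sketch; df is an assumed data frame with factor columns and y the response) is to expand factors to dummy columns before calling gbt.train:

# one-hot encode factor columns with base R, then train as usual
x <- model.matrix(~ . - 1, data = df)  # expand factors to dummy columns, drop the intercept
mod <- gbt.train(y, x, loss_function = "mse")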

Influence-adjusted derivatives-calculations

Secret manuscript will arrive on arxiv first

In principle:

  • derivatives are influenced by their own response (self-influence)
  • employ the asymptotic correspondence between n-fold CV, influence adjustment and TIC

Refactor codebase

Overall, codebase, especially on the C++ side, needs cleanup in terms of

  • Commented code
  • Style-guide
  • Verbosity

A major revision would include removing the different variants of count regression except for ordinary Poisson with log-link.

Let predict return g^{-1}(f(x))

Currently, only loss_function=count::auto returns E[y|x] by default.
It uses the following code:

if (type %in% c("", "response")) {
    # predict mean
    res <- object$predict(newdata)
    res <- exp(res)
} else if (type == "link_response") {
    # predict response on log (link) level
    res <- object$predict(newdata)
}

Implement this for the ordinary loss functions in R/gbt.pred; a sketch of the mapping is given below.
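
A minimal sketch of such a mapping (the helper inv_link is hypothetical and not part of the package; how the loss_function name is retrieved from the model object is left open):

# map each loss function to its inverse link g^{-1} (sketch)
inv_link <- function(loss_function) {
    switch(loss_function,
           "mse"     = identity,
           "logloss" = plogis,  # logistic transformation
           "poisson" = exp,     # log link
           identity)
}

# usage: apply the inverse link to the raw ensemble prediction
res <- inv_link(loss_function)(object$predict(newdata))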

Let users define custom loss-functions

See the xgboost implementation; a sketch of that interface is given after the suggestions below.
This would be easy if gradient calculations were done on the R side.
But it is harder, since agtboost's gradient calculations happen on the C++ side.

Suggestions

  • Take a look at whether issue #30 gives a hint towards a major refactoring of the code.
  • Should the Ensemble class be removed / become an R-style class rather than a C++-side class?
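
For reference, a sketch of the xgboost-style interface that could be mimicked (this is xgboost's API, not agtboost's; x and y are assumed to exist):

library(xgboost)

# custom log-loss objective: the user supplies gradient and Hessian of the loss
logreg_obj <- function(preds, dtrain) {
    labels <- getinfo(dtrain, "label")
    p <- 1 / (1 + exp(-preds))
    list(grad = p - labels, hess = p * (1 - p))
}

dtrain <- xgb.DMatrix(as.matrix(x), label = y)
bst <- xgb.train(params = list(max_depth = 2, eta = 0.1),
                 data = dtrain, nrounds = 10, obj = logreg_obj)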

Optimal initial prediction

When the offset is nonzero, boosting from the average of y (which is optimal with zero offset) or from zero is sub-optimal.
Thus: provide an optimal initial prediction when the offset is nonzero.
An optimization algorithm starting from the average is natural when no closed-form result is available.
Explore whether g and h may be used for an internal optimizer: it would be able to solve convex problems; a sketch is given below.
In this case, it is natural to create an "initial_boosting_prediction.h" module.
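
A minimal sketch of such an optimizer for the Poisson case with a nonzero offset, using the usual first and second derivatives g and h (illustrative only, not the package's internals):

# Newton iterations for the constant initial prediction f0 under Poisson loss with offset
optimal_initial_poisson <- function(y, offset, iters = 20) {
    f0 <- log(mean(y) + 1e-12)        # start from (the log of) the average
    for (i in seq_len(iters)) {
        mu <- exp(f0 + offset)        # predicted mean per observation
        g  <- sum(mu - y)             # sum of first derivatives of the negative log-likelihood
        h  <- sum(mu)                 # sum of second derivatives
        f0 <- f0 - g / h              # Newton step
    }
    f0
}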

NaN predictions

The following code:

library(agtboost)
data(caravan.train, package = "agtboost")
train <- caravan.train
mod <- gbt.train(train$y, train$x, loss_function="logloss", verbose=10, nrounds=500, learning_rate=0.1)

gives the following output:

it: 1  |  n-leaves: 167  |  tr loss: 0.1703  |  gen loss: 0.1686
it: 10  |  n-leaves: 80  |  tr loss: 0.1073  |  gen loss: 0.1041
it: 20  |  n-leaves: 6  |  tr loss: 0.05831  |  gen loss: 0.0536
it: 30  |  n-leaves: 432  |  tr loss: 0.0302  |  gen loss: 0.02553
it: 40  |  n-leaves: 411  |  tr loss: 0.01981  |  gen loss: 0.01515
it: 50  |  n-leaves: 433  |  tr loss: 0.01582  |  gen loss: 0.01117
it: 60  |  n-leaves: 30  |  tr loss: 0.01389  |  gen loss: 0.009244
it: 70  |  n-leaves: 434  |  tr loss: 0.01263  |  gen loss: 0.007985
it: 80  |  n-leaves: 4  |  tr loss: 0.01221  |  gen loss: 0.007566
it: 90  |  n-leaves: 2  |  tr loss: 0.01183  |  gen loss: 0.007186
it: 100  |  n-leaves: 217  |  tr loss: 0.01156  |  gen loss: 0.006912
it: 110  |  n-leaves: 473  |  tr loss: 0.01138  |  gen loss: 0.00674
it: 120  |  n-leaves: 496  |  tr loss: 0.01129  |  gen loss: 0.006649
it: 130  |  n-leaves: 252  |  tr loss: 0.01125  |  gen loss: 0.006602
it: 140  |  n-leaves: 475  |  tr loss: 0.01122  |  gen loss: 0.006576
it: 150  |  n-leaves: 485  |  tr loss: 0.01121  |  gen loss: 0.006563
it: 160  |  n-leaves: 276  |  tr loss: 0.0112  |  gen loss: 0.006554
it: 170  |  n-leaves: 276  |  tr loss: 0.01119  |  gen loss: 0.00655
it: 180  |  n-leaves: 276  |  tr loss: 0.01119  |  gen loss: 0.006547
it: 190  |  n-leaves: 276  |  tr loss: 0.01119  |  gen loss: 0.006546
it: 200  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006545
it: 210  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006545
it: 220  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 230  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 240  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 250  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 260  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 270  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 280  |  n-leaves: 281  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 290  |  n-leaves: 282  |  tr loss: 0.01119  |  gen loss: 0.006544
it: 300  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 310  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 320  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 330  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 340  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 350  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 360  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 370  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 380  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 390  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 400  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 410  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 420  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 430  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 440  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 450  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 460  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 470  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 480  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 490  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan
it: 500  |  n-leaves: 1  |  tr loss: -nan  |  gen loss: -nan

Calling predict on the resulting model then outputs NaN predictions.

Fast histogram algorithm

The information criterion poses no problems with regard to implementing a fast histogram-based algorithm.
This should replace the n log(n) sorting in the split search with something linear in n; a binning sketch is given below.
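
A minimal sketch of the histogram idea for a single feature (illustrative only; names are not from the codebase): bin the feature, accumulate gradient and Hessian sums per bin, then scan the bin boundaries with the usual second-order gain formula:

histogram_split <- function(x, g, h, n_bins = 256, lambda = 0) {
    breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE))
    bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
    Gb <- tapply(g, bins, sum)                 # gradient sum per bin
    Hb <- tapply(h, bins, sum)                 # Hessian sum per bin
    GL <- cumsum(Gb); HL <- cumsum(Hb)
    G <- sum(Gb); H <- sum(Hb)
    # gain of splitting after each bin boundary (cannot split after the last bin)
    gain <- GL^2 / (HL + lambda) + (G - GL)^2 / (H - HL + lambda) - G^2 / (H + lambda)
    gain <- gain[-length(gain)]
    which.max(gain)                            # index of the best bin boundary
}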

Problems with scalability on size of design-matrix

agtboost does not scale well with large n and m.
In principle, it should scale as well as e.g. xgboost with its algorithm set to exact.
There also seem to be problems with memory consumption for large datasets.

Goal
Make agtboost scale well in n and m.

Suspicion
I suspect this is due to the re-declarations/initializations of the u_store and u vectors and the cir calculations in split_information().

  • Find a smarter way to do these calculations. See if this fixes scalability issue.

Low-frequency count data produces root-trees and does not converge

The following example with low-frequency data does not converge.
Here, E[Y] = 0.1% given that the exposure is 1 (the average exposure E[Z] is 0.5 for the training data, but 1.0 for the test data).
Training seems to break at iteration 824.

# Import library
library(agtboost)

# Reproducible
set.seed(123)

# Approximately similar n, with low-frequency events
# x: feature, EX = 1
# z: exposure
# y: response
n <- 100000
average_frequency <- 0.001
xtr <- as.matrix(runif(n, 0, 2)) # EX = 1
ztr <- runif(n)
ytr <- rpois(n, ztr * average_frequency * xtr)
xte <- as.matrix(runif(n, 0, 2))
yte <- rpois(n, average_frequency * xte) # test for z=1

# Plot data, ensure it looks appropriate
par(mfrow=c(1,2))
plot_sample <- sample(n, 10000)
plot(xte[plot_sample,], yte[plot_sample])
plot((xtr*ztr)[plot_sample,], ytr[plot_sample])
par(mfrow=c(1,1))

# Train agtb-model
agtb <- gbt.train(ytr, xtr, loss_function = "poisson", offset=log(ztr), verbose=1)

# Compare null-model and agtb-model on test-data
-sum(dpois(yte, mean(ytr), log=TRUE))
# agtb model
-sum(dpois(yte, predict(agtb, xte), log=TRUE))

# Inspect predictions
plot(xte[plot_sample,], yte[plot_sample])
points(xte[plot_sample], predict(agtb, xte)[plot_sample], col=2)

Monotone constraints

Both xgboost and lightgbm implement monotone constraints.
This is important to practitioners.
An introduction to a possible implementation is given here; a usage sketch of the xgboost interface is shown below.
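
For reference, a sketch of how monotone constraints are requested through xgboost's R interface (shown only as the pattern to mirror; x and y are assumed to exist, and the exact parameter format may vary between versions):

library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(x), label = y)
params <- list(objective = "reg:squarederror",
               # +1: increasing in feature 1, -1: decreasing in feature 2, 0: unconstrained
               monotone_constraints = "(1,-1,0)")
bst <- xgb.train(params = params, data = dtrain, nrounds = 50)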

Save gbtorch model object

The model object cannot be saved by R's default save() function as .RData.
A custom save function must be created, say gbt.save_gbt_model(), that saves the model as a binary file.
A complementary gbt.load_gbt_model() must also be created.

Memory leaks in gbt.load()

I think I have detected a memory leak in gbt.load(). It could be that I'm failing to understand how R handles memory (and the imperfections of that process), but maybe adding a manual garbage collection in the C++ subroutine could be the way to solve this?

Minimal replication:

library(agtboost)

x <- runif(50000, 0, 10)
y <- rnorm(50000, x, 1)

mod <- gbt.train(y, as.matrix(x))
gbt.save(mod, 'gbt_model.gbt')

Memory usage (by the R process, not the objects in the R workspace) increases on my system by roughly 1 MB every time the line below is run:

mod <- gbt.load('gbt_model.gbt')

agtboost tests

Use testthat; a starting-point sketch is given after the list below.
As a minimum, write tests for:

  • gbt.save() and gbt.load()
  • gbt.train()
  • predict()
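
A minimal starting point (a sketch; expectations and tolerances would need refinement):

library(agtboost)
library(testthat)

test_that("gbt.train, predict, gbt.save and gbt.load work together", {
    x <- as.matrix(runif(500, 0, 4))
    y <- rnorm(500, x, 1)
    mod <- gbt.train(y, x)
    pred <- predict(mod, x)
    expect_equal(length(pred), nrow(x))

    # a save/load roundtrip should preserve predictions
    tmp <- tempfile(fileext = ".gbt")
    gbt.save(mod, tmp)
    mod2 <- gbt.load(tmp)
    expect_equal(predict(mod2, x), pred)
})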

Add `offset` possibility to `gbt.train` and `predict`

This is important for many applications, but perhaps especially for Poisson regression, where observations have been observed over different time lengths / fractions of the total time; a usage sketch follows the list.

  • Let pred = g^{-1}(F(x) + offset).
  • The Poisson-regression case would then be handled by letting the offset equal the log time-length.
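
A sketch of the intended usage, following the call that appears in the low-frequency count-data issue above (y, x and the exposure vector are assumed to exist):

# Poisson regression where each observation has its own exposure (observation time)
mod <- gbt.train(y, x, loss_function = "poisson", offset = log(exposure))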
