szilard / gbm-tune

Tuning GBMs (hyperparameter tuning) and impact on out-of-sample predictions

R 0.17% HTML 99.83%
gbm gradient-boosting-machine hyperparameter-optimization machine-learning overfitting

gbm-tune's Issues

Parallel trainings for faster training

If you have enough RAM, it is always better to run several trainings in parallel, each with a small number of training threads, than to parallelize each individual training across many threads.

Example on a 20-core/40-thread server (approximate theoretical times):

| Mode | Threads (train) | Threads (parallel) | Time for 1 model | Time for 40 models (passes) |
|---|---|---|---|---|
| Demo Single | 1 | 1 | 500 | 20000 (40) |
| Demo Dual | 2 | 1 | 300 | 12000 (40) |
| Demo Multi | 20 | 1 | 100 | 4000 (40) |
| Demo Multi + Hyperthreading | 40 | 1 | 70 | 2600 (40) |
| Parallel Single | 1 | 20 (RAM Single x20) | 500 | 1000 (2) |
| Parallel Single + Hyperthreading | 1 | 40 (RAM Single x40) | 500 | 700 (1) |
| Parallel Dual | 2 | 10 (RAM Dual x10) | 300 | 1200 (4) |
| Parallel Dual + Hyperthreading | 2 | 20 (RAM Dual x20) | 300 | 840 (1) |
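
The rows without hyperthreading follow from simple arithmetic: the number of passes is the number of models divided by the number of concurrent trainings, rounded up, and the total is the number of passes times the per-model time. A minimal Python sketch of that calculation, assuming every model takes the same time and ignoring all overhead (the function and argument names are illustrative, not from the repository):

```python
import math

def total_time(time_per_model, n_models, parallel_jobs):
    """Return (passes, total time) when parallel_jobs trainings run at once."""
    # Each pass trains up to parallel_jobs models concurrently.
    passes = math.ceil(n_models / parallel_jobs)
    return passes, passes * time_per_model

print(total_time(500, 40, 1))   # Demo Single     -> (40, 20000)
print(total_time(500, 40, 20))  # Parallel Single -> (2, 1000)
print(total_time(300, 40, 10))  # Parallel Dual   -> (4, 1200)
```

The hyperthreading rows do not fit this formula directly, because the per-model time changes when two trainings share a physical core.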

With hyperthreading, timings decrease by about 30% in theory; in practice the gain is about 15-25% because of overhead. The parallel versions only incur the overhead of merging the results together (and of copying the data, if not forking), which is nearly non-existent. Use a parallel lapply rather than a parallel for loop to remove most of that overhead.

This approach also lets you avoid the negative efficiency issue you may be running into.
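
As a rough illustration of the pattern (many single-threaded trainings mapped over a parameter grid by a pool of workers), here is a Python sketch. Everything in it is an assumption for illustration: `train_one` is a hypothetical stand-in that returns a dummy score instead of fitting a model, and the grid values are made up. A thread pool is used only to keep the sketch self-contained; a process pool, or a parallel lapply in R as suggested above, would be the closer analogue for real training runs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_one(params):
    """Hypothetical stand-in for one single-threaded GBM training run."""
    depth, rate = params
    # A real version would fit a model with its own thread count set to 1
    # and return an out-of-sample metric; this returns a dummy value.
    return {"max_depth": depth, "learn_rate": rate, "score": depth * rate}

def train_grid(grid, n_workers):
    """Map train_one over the grid with n_workers concurrent workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves input order, so results line up with the grid.
        return list(pool.map(train_one, grid))

# Made-up hyperparameter grid: 4 depths x 3 learning rates = 12 runs.
grid = list(product([2, 4, 6, 8], [0.01, 0.03, 0.1]))
results = train_grid(grid, n_workers=4)
print(len(results))
```

The map-style call mirrors the advice to prefer a parallel lapply: each worker receives one parameter set and returns one result, so the only shared step is collecting the list at the end.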

Do you ensure a large difference between splits?

Important question:
could you clarify whether you use a special algorithm to generate the many test data / train data splits, so that the splits differ substantially from one another?

Some splits may differ only negligibly. For example:

samples: 1 2 3 4 5 6 7 8 9 10 11 12

good split
test data: 1 2 3 4 5 6, train data: 7 8 9 10 11 12
test data: 1 2 3 7 8 9, train data: 4 5 6 10 11 12
minimum difference between all sets is 3

bad split
test data: 1 2 3 4 5 6, train data: 7 8 9 10 11 12
test data: 1 2 3 4 5 7, train data: 6 8 9 10 11 12
minimum difference between all sets is 1
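
The "minimum difference" in the example above can be computed as the smallest number of samples by which any two test sets differ. A small Python sketch of that check, assuming all test sets have the same size (the function name is illustrative and not part of the repository):

```python
from itertools import combinations

def min_split_difference(test_sets):
    """Smallest number of samples present in one test set but not another."""
    # With equal-sized sets, len(a - b) equals len(b - a), so one direction
    # of the set difference is enough for each pair.
    return min(len(set(a) - set(b)) for a, b in combinations(test_sets, 2))

good = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 7, 8, 9]]
bad = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 7]]

print(min_split_difference(good))  # 3
print(min_split_difference(bad))   # 1
```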
