szilard / gbm-tune

Tuning GBMs (hyperparameter tuning) and impact on out-of-sample predictions

R 0.17% HTML 99.83%

gbm gradient-boosting-machine hyperparameter-optimization machine-learning overfitting

gbm-tune's Issues

Parallel trainings for faster training

If you have enough RAM, it is always better to run several trainings in parallel, each with a small number of training threads, than to parallelize each individual training.

Example on a 20-core/40-thread server (approximate theoretical times):

Mode	Threads per training	Parallel trainings	Time for 1 model	Time for 40 models (passes)
Demo Single	1	1	500	20000 (40)
Demo Dual	2	1	300	12000 (40)
Demo Multi	20	1	100	4000 (40)
Demo Multi + Hyperthreading	40	1	70	2800 (40)
Parallel Single	1	20 (RAM Single x20)	500	1000 (2)
Parallel Single + Hyperthreading	1	40 (RAM Single x40)	500	700 (1)
Parallel Dual	2	10 (RAM Dual x10)	300	1200 (4)
Parallel Dual + Hyperthreading	2	20 (RAM Dual x20)	300	840 (2)
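
The arithmetic behind the non-hyperthreaded rows can be sketched as follows: the total time is the number of passes (models divided by parallel trainings, rounded up) times the time for one model. The function name `total_time` and its arguments are illustrative only. The `ht_factor` of 0.7 applies the theoretical 30% hyperthreading gain to the matching non-hyperthreaded row, which is an assumption about how the hyperthreaded rows were derived.

```r
# Hypothetical helper: estimated wall time for training n_models models.
# n_parallel     = number of trainings running at the same time
# time_per_model = time for one training at the chosen thread count
# ht_factor      = 1 without hyperthreading; 0.7 models the theoretical
#                  30% gain applied to the non-hyperthreaded estimate
total_time <- function(n_models, n_parallel, time_per_model, ht_factor = 1) {
  passes <- ceiling(n_models / n_parallel)
  passes * time_per_model * ht_factor
}

total_time(40, 1, 500)    # Demo Single:     40 passes x 500
total_time(40, 20, 500)   # Parallel Single:  2 passes x 500
total_time(40, 10, 300)   # Parallel Dual:    4 passes x 300
total_time(40, 10, 300, ht_factor = 0.7)  # Parallel Dual + Hyperthreading
```

With these inputs the four calls give 20000, 1000, 1200 and 840, matching the corresponding rows of the table.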

With hyperthreading, timings drop by about 30% in theory; in practice the gain is about 15-25% because of overhead. The parallel versions only pay the overhead of merging the results (and of copying the data, if not forking), which is nearly non-existent. Use a parallel lapply rather than a parallel for loop to remove most of that overhead.

Running the trainings in parallel also lets you avoid the negative efficiency issue you may have.
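
A minimal sketch of the parallel lapply approach, using `mclapply` from the base `parallel` package, could look like this. The hyperparameter grid, the `train_one` function and the placeholder score are all hypothetical stand-ins so that the example runs without any GBM library; a real version would call the GBM trainer with its thread count set to 1 inside `train_one`.

```r
library(parallel)

# Hypothetical hyperparameter grid (names and values are illustrative only)
grid <- expand.grid(max_depth = c(2, 4, 6, 8), eta = c(0.01, 0.03, 0.1))

# Stand-in for one single-threaded training run. The score is a placeholder,
# not a real model metric.
train_one <- function(i) {
  p <- grid[i, ]
  set.seed(i)
  score <- 1 / (1 + p$max_depth * p$eta) + runif(1, 0, 0.01)
  data.frame(max_depth = p$max_depth, eta = p$eta, score = score)
}

# mclapply forks the R process, so workers share the parent's data
# copy-on-write. Forking is not available on Windows, hence the fallback.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# One list element per grid row; merging is a single rbind at the end.
results <- mclapply(seq_len(nrow(grid)), train_one, mc.cores = n_workers)
results <- do.call(rbind, results)
```

On a machine like the one in the example, `n_workers` would be set closer to the number of cores, as long as each worker's copy of the model and data fits in RAM.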

Could you share a link to the data?

As usual, great code and ideas, but the code reads the data from a local path:
d0_train <- fread(paste0("/var/data/airline/",yr-1,".csv"))
Could you please share a link to where the data can be downloaded?

Do you make sure the splits differ enough from each other?

Important question:
Could you clarify whether you use a special algorithm for generating the many train/test splits, so that the splits differ substantially from each other? Some splits may differ only negligibly. For example, for

samples 1 2 3 4 5 6 7 8 9 10 11 12

Good split:
test data 1 2 3 4 5 6 train data 7 8 9 10 11 12
test data 1 2 3 7 8 9 train data 4 5 6 10 11 12
The minimum difference between the sets is 3.

Bad split:
test data 1 2 3 4 5 6 train data 7 8 9 10 11 12
test data 1 2 3 4 5 7 train data 6 8 9 10 11 12
The minimum difference between the sets is 1.
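
One way to make this "difference between splits" measurable is sketched below: for every pair of splits, count how many test samples of one split are missing from the other, and take the minimum over all pairs. The function name `min_split_difference` is hypothetical, and the sketch assumes at least two splits whose test sets all have the same size.

```r
# Hypothetical helper: smallest number of differing test samples between
# any two splits. test_sets is a list of integer vectors of sample ids.
min_split_difference <- function(test_sets) {
  pairs <- combn(length(test_sets), 2)  # all pairs of split indices
  diffs <- apply(pairs, 2, function(ix) {
    length(setdiff(test_sets[[ix[1]]], test_sets[[ix[2]]]))
  })
  min(diffs)
}

good <- list(1:6, c(1, 2, 3, 7, 8, 9))
bad  <- list(1:6, c(1, 2, 3, 4, 5, 7))

min_split_difference(good)  # 3
min_split_difference(bad)   # 1
```

A split generator could reject any candidate set of splits whose minimum difference falls below a chosen threshold.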
