The factory fresh option of using case weights in drawing the bootstrap sample is very

"case.weights" take very long (ranger issue, closed)

imbs-hl commented on July 24, 2024

"case.weights" take very long

Comments (7)

mnwright commented on July 24, 2024 1

I just released a version (0.4.2) based on the new toolchain. As reported, the problem is solved there. In addition, multithreading is finally working! This version can also be installed on the current R version by using the binary, see https://github.com/imbs-hl/ranger/releases.

I hope it's solved with R-3.3.0!

mnwright commented on July 24, 2024

No, this is not expected. I can reproduce the issue on Windows but not on Mac or Linux. I will check the code for Windows-specific problems.

mnwright commented on July 24, 2024

The problem seems to be std::discrete_distribution<> with gcc 4.6.3. I tried with the new 4.9.3 toolchain and R-devel and it was fast.

Any idea how to solve this instead of waiting for a newer gcc?
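One possible way around the slow std::discrete_distribution<>, sketched here under assumptions and not taken from the ranger source, is to do the weighted draw by hand: build the cumulative sums of the weights once, then map each uniform draw to an index with a binary search. The function name draw_weighted_with_replacement and its signature are hypothetical, and the weights are assumed to be non-negative with a positive sum.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of ranger): draw `num_draws` indices with
// replacement, each index chosen with probability proportional to its weight.
std::vector<std::size_t> draw_weighted_with_replacement(
    const std::vector<double>& weights, std::size_t num_draws,
    std::mt19937_64& rng) {
  if (weights.empty()) {
    throw std::invalid_argument("weights must not be empty");
  }
  for (double w : weights) {
    if (w < 0.0) throw std::invalid_argument("weights must be non-negative");
  }

  // cumulative[i] = weights[0] + ... + weights[i]; computed once per sample.
  std::vector<double> cumulative(weights.size());
  std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
  const double total = cumulative.back();
  if (!(total > 0.0)) {
    throw std::invalid_argument("weights must sum to a positive value");
  }

  std::uniform_real_distribution<double> unif(0.0, total);
  std::vector<std::size_t> result;
  result.reserve(num_draws);
  for (std::size_t i = 0; i < num_draws; ++i) {
    // Redraw in the rare case that rounding returns the upper bound itself.
    double u = unif(rng);
    while (u >= total) u = unif(rng);
    // First index whose cumulative weight exceeds u. An index with zero
    // weight has the same cumulative value as its predecessor, so it is
    // never selected.
    auto it = std::upper_bound(cumulative.begin(), cumulative.end(), u);
    result.push_back(static_cast<std::size_t>(it - cumulative.begin()));
  }
  return result;
}
```

Each draw costs one uniform random number plus an O(log n) search, and the approach needs nothing beyond the standard library. Note that it consumes the random number stream differently from std::discrete_distribution<>, so results for a fixed seed would not match the existing implementation.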

khotilov commented on July 24, 2024

Using boost::random::discrete_distribution as a replacement helps:
before:

> system.time(fit.1 <- ranger(y ~ x)) 
   user  system elapsed 
   9.27    0.13    9.41 
> system.time(fit.3 <- ranger(y ~ x, case.weights = rep(1, times = n))) 
   user  system elapsed 
  93.02    0.07   93.19

after:

> system.time(fit.1 <- ranger(y ~ x)) 
   user  system elapsed 
   8.76    0.16    8.96 
> system.time(fit.3 <- ranger(y ~ x, case.weights = rep(1, times = n))) 
   user  system elapsed 
   8.98    0.09    9.09
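To check whether the standard library's std::discrete_distribution is slow on a given toolchain, independently of ranger, a small timing harness along these lines might help. This is only a sketch: the struct DrawTiming and the function time_discrete_draws are hypothetical names, equal weights are used to mirror the rep(1, times = n) case above, and how ranger itself constructs and calls the distribution is not assumed here.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical result type: elapsed wall-clock time plus a checksum of the
// drawn indices, so the compiler cannot optimize the sampling loop away.
struct DrawTiming {
  double seconds;
  unsigned long long checksum;
};

// Hypothetical harness: time the construction of a std::discrete_distribution
// over `num_cases` equal weights followed by `num_draws` draws from it.
DrawTiming time_discrete_draws(std::size_t num_cases, std::size_t num_draws,
                               unsigned seed) {
  std::vector<double> weights(num_cases, 1.0);
  std::mt19937_64 rng(seed);

  const auto start = std::chrono::steady_clock::now();
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  unsigned long long checksum = 0;
  for (std::size_t i = 0; i < num_draws; ++i) {
    checksum += dist(rng);
  }
  const auto stop = std::chrono::steady_clock::now();

  DrawTiming timing;
  timing.seconds = std::chrono::duration<double>(stop - start).count();
  timing.checksum = checksum;
  return timing;
}
```

Calling, for example, time_discrete_draws(100000, 100000, 1) once per compiler and comparing the seconds field gives a rough picture of the per-toolchain cost; the absolute numbers depend on hardware and optimization flags.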

mnwright commented on July 24, 2024

Thanks! However, I'm reluctant to merge it into master because of the Boost dependency... ;)

khotilov commented on July 24, 2024

That is a simple temporary solution while waiting for a newer gcc. I didn't do extensive testing, but a quick check showed very similar model predictions (see below). That should make it at least feasible for me to run some prototyping with ranger on my Windows laptop, as I frequently need to use weights. And the real dependency is only for the Windows R version, which is already a neglected child with no multithreading :)

# build using the original std::discrete_distribution
set.seed(111)
fit_std <- ranger(y ~ x, case.weights = rep(1, times = n), write.forest = TRUE)
pr_std <- predict(fit_std, data.frame(x = x))

# build using boost::random::discrete_distribution
set.seed(111)
fit_boost <- ranger(y ~ x, case.weights = rep(1, times = n), write.forest = TRUE)
pr_boost <- predict(fit_boost, data.frame(x = x))

cor(pr_std$predictions, pr_boost$predictions)
[1] 0.9979446

gcc's <random> was based on Boost, but some over-engineering resulted in overhead and worse speed; I've seen a few discussions about that in the past. It wasn't just discrete_distribution: some other distributions were several times slower too. Maybe things have improved significantly in the latest releases (I didn't really follow), but I personally had more trust in boost::random.

It's your choice in the end. I'm just telling you what I know. I'm glad I noticed this discussion, since my initial observations didn't agree with the claims of ranger being very fast, so I didn't even try it on a Linux server.

mayer79 commented on July 24, 2024

This is brilliant, thank you very much for these investigations. Even on the current R version, the issue seems to be fixed with ranger 0.4.2. Wow!
