"case.weights" takes very long (ranger, closed)

imbs-hl commented on July 24, 2024
"case.weights" takes very long

from ranger.

Comments (7)

mnwright commented on July 24, 2024

I just released a version (0.4.2) based on the new toolchain. As reported, the problem is solved there. In addition, multithreading is finally working! This version can also be installed on the current R version by using the binary, see https://github.com/imbs-hl/ranger/releases.

I hope it's solved with R-3.3.0!

mnwright commented on July 24, 2024

No, this is not as expected. I can reproduce the issue on Windows but not on Mac or Linux. I will check the code for Windows-specific problems.

mnwright commented on July 24, 2024

The problem seems to be std::discrete_distribution<> with gcc 4.6.3. I tried with the new 4.9.3 toolchain and R-devel and it was fast.

Any idea how to solve this instead of waiting for a newer gcc?
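
For context, here is a minimal sketch of the kind of weighted draw under discussion, assuming case weights are turned into a with-replacement sample of case indices through std::discrete_distribution. The function name sample_weighted is hypothetical and this is not taken from ranger's source; it only illustrates the standard-library call that the comment above identifies as slow with gcc 4.6.3.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not ranger's actual code): draw `num_draws` case
// indices with replacement, where index i is drawn with probability
// proportional to weights[i].
std::vector<std::size_t> sample_weighted(const std::vector<double>& weights,
                                         std::size_t num_draws,
                                         std::mt19937_64& rng) {
  assert(!weights.empty());
  // The distribution normalises the weights internally.
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  std::vector<std::size_t> draws;
  draws.reserve(num_draws);
  for (std::size_t i = 0; i < num_draws; ++i) {
    draws.push_back(dist(rng));  // one weighted draw per iteration
  }
  return draws;
}
```

With equal weights, as in case.weights = rep(1, times = n), every index is equally likely, so any extra run time comes from the sampling routine itself rather than from the weights.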

khotilov commented on July 24, 2024

Using boost::random::discrete_distribution as a replacement helps:
before:

> system.time(fit.1 <- ranger(y ~ x)) 
   user  system elapsed 
   9.27    0.13    9.41 
> system.time(fit.3 <- ranger(y ~ x, case.weights = rep(1, times = n))) 
   user  system elapsed 
  93.02    0.07   93.19 

after:

> system.time(fit.1 <- ranger(y ~ x)) 
   user  system elapsed 
   8.76    0.16    8.96 
> system.time(fit.3 <- ranger(y ~ x, case.weights = rep(1, times = n))) 
   user  system elapsed 
   8.98    0.09    9.09 

mnwright commented on July 24, 2024

Thanks! However, I'm reluctant to merge it into master because of the Boost dependency... ;)
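
One way to avoid both the slow std::discrete_distribution and a Boost dependency would be inverse-CDF sampling over cumulative weights, using only the standard library. This was not proposed or merged in this thread; the sketch below is an assumption about what such a workaround could look like, and the name sample_weighted_cdf is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical Boost-free alternative: weighted sampling with replacement
// via cumulative sums and binary search (inverse-CDF method).
std::vector<std::size_t> sample_weighted_cdf(const std::vector<double>& weights,
                                             std::size_t num_draws,
                                             std::mt19937_64& rng) {
  assert(!weights.empty());
  // cumulative[i] = weights[0] + ... + weights[i]
  std::vector<double> cumulative(weights.size());
  std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
  const double total = cumulative.back();
  assert(total > 0.0);

  std::uniform_real_distribution<double> unif(0.0, total);
  std::vector<std::size_t> draws;
  draws.reserve(num_draws);
  for (std::size_t i = 0; i < num_draws; ++i) {
    const double u = unif(rng);
    // First index whose cumulative weight is strictly greater than u.
    const auto it = std::upper_bound(cumulative.begin(), cumulative.end(), u);
    std::size_t idx = static_cast<std::size_t>(it - cumulative.begin());
    // Guard against u == total, which rounding can produce.
    idx = std::min(idx, weights.size() - 1);
    draws.push_back(idx);
  }
  return draws;
}
```

Each draw costs one uniform random number plus a binary search, so the cost per draw is O(log n) in the number of cases. The draws will not match those of std::discrete_distribution for the same seed, so fitted models would be similar but not identical.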

khotilov commented on July 24, 2024

That is a simple temporary solution while waiting for a newer gcc. I didn't do extensive testing, but a quick check showed very similar model performance (see below). That should at least make it feasible for me to do some prototyping with ranger on my Windows laptop, as I frequently need to use weights. And the real dependency is only for the Windows R version, which is already a neglected child with no multithreading :)

# ranger built with the original std::discrete_distribution
set.seed(111)
fit_std <- ranger(y ~ x, case.weights = rep(1, times = n), write.forest = TRUE)
pr_std <- predict(fit_std, data.frame(x = x))

# ranger rebuilt with boost::random::discrete_distribution
set.seed(111)
fit_boost <- ranger(y ~ x, case.weights = rep(1, times = n), write.forest = TRUE)
pr_boost <- predict(fit_boost, data.frame(x = x))

cor(pr_std$predictions, pr_boost$predictions)
[1] 0.9979446

gcc's <random> was based on Boost, but some over-engineering resulted in overhead and worse speed; I've seen a few discussions about that in the past. It wasn't just discrete_distribution: some other distributions were several times slower too. Maybe things have improved significantly in the latest releases (I didn't really follow), but I personally had more trust in boost::random.
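
To check whether a given toolchain shows this kind of slowdown, a small timing harness can help. The following is a minimal sketch under stated assumptions: the names DrawTiming and time_discrete_draws are hypothetical, equal weights stand in for rep(1, times = n), and wall-clock timing of a single run is only a rough indicator.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Result of one timing run (hypothetical helper type).
struct DrawTiming {
  double seconds;        // wall-clock time spent drawing
  std::size_t checksum;  // sum of drawn indices, so the loop is not optimised away
};

// Time `num_draws` draws from std::discrete_distribution over
// `num_cases` equally weighted cases.
DrawTiming time_discrete_draws(std::size_t num_cases, std::size_t num_draws,
                               unsigned seed = 111) {
  std::vector<double> weights(num_cases, 1.0);
  std::mt19937_64 rng(seed);
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());

  const auto start = std::chrono::steady_clock::now();
  std::size_t checksum = 0;
  for (std::size_t i = 0; i < num_draws; ++i) {
    checksum += dist(rng);
  }
  const auto stop = std::chrono::steady_clock::now();

  return {std::chrono::duration<double>(stop - start).count(), checksum};
}
```

Comparing the seconds field across compilers or standard-library versions, for the same num_cases and num_draws, gives a rough idea of whether the distribution itself is the bottleneck.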

It's your choice in the end. I'm just telling you what I know. I'm glad I noticed this discussion, since my initial observations didn't agree with the claims of ranger being very fast, so I didn't even try it on a Linux server.

mayer79 commented on July 24, 2024

This is brilliant, thank you very much for these investigations. Even on the current R version, the issue seems to be fixed with ranger 0.4.2. Wow!
