Comments (7)

ogrisel commented on June 11, 2024

In splitting.py, the left/right_indices_buffer will use up 8GB for 10^9 rows. If that causes swapping, the performance benefit of multithreading (which requires these buffers) is most probably not worth it. Would it be an option to disable this?

I don't see how we could have multithreading at that level anymore. Are you suggesting disabling thread-based parallelism for the split_indices operation? Maybe that could be an option. @NicolasHug might know better how LightGBM handles this part of the code.
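
For reference, the 8GB figure quoted above is consistent with two scratch buffers holding one 4-byte index per row. The 4-byte width (uint32 indices) is an assumption here, picked because it reproduces the number; the helper name is made up for illustration.

```python
# Back-of-the-envelope estimate of the scratch buffer size.
# Assumption: each sample index takes 4 bytes (e.g. uint32).
BYTES_PER_INDEX = 4
N_BUFFERS = 2  # left_indices_buffer and right_indices_buffer


def buffer_bytes(n_samples, bytes_per_index=BYTES_PER_INDEX, n_buffers=N_BUFFERS):
    """Total size in bytes of the scratch buffers for n_samples rows."""
    return n_samples * bytes_per_index * n_buffers


total = buffer_bytes(10**9)
print(f"{total / 1e9:.1f} GB")  # 8.0 GB
```

Under the same assumption, dropping one of the two buffers would halve this to 4 GB.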

from pygbm.

NicolasHug commented on June 11, 2024

Would it be an option to disable this?

Technically yes... I suppose we could use a single-threaded quick-sort like partitioning scheme.
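
A rough sketch of what such a quick-sort-like scheme could look like: two cursors walk towards each other and swap misplaced entries, so no extra buffer is needed. This is not pygbm code. It uses a plain Python list instead of a NumPy array, and goes_left is a hypothetical predicate standing in for the "binned value <= threshold" test.

```python
def split_indices_single_thread(sample_indices, goes_left):
    """Partition sample_indices in place and return the number of left samples.

    After the call, sample_indices[:n_left] holds the samples for which
    goes_left(idx) is True and sample_indices[n_left:] holds the rest.
    The relative order inside each side is not preserved.
    """
    i, j = 0, len(sample_indices) - 1
    while True:
        # Skip entries that are already on the correct side.
        while i <= j and goes_left(sample_indices[i]):
            i += 1
        while i <= j and not goes_left(sample_indices[j]):
            j -= 1
        if i >= j:
            return i
        # sample_indices[i] belongs right and sample_indices[j] belongs left.
        sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
        i += 1
        j -= 1


# Toy data: binned feature values and a split threshold (both made up).
binned_feature = [3, 1, 4, 1, 5, 9, 2, 6]
threshold = 3
indices = list(range(len(binned_feature)))
n_left = split_indices_single_thread(
    indices, lambda idx: binned_feature[idx] <= threshold
)
print(n_left, sorted(indices[:n_left]), sorted(indices[n_left:]))
```

The trade-off is that the result is not stable, which may or may not matter for the rest of the grower.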

Is there also a method that could work without the buffer?

I don't think so, or at least not with the current strategy. Those arrays are used so that sample indices don't overwrite each other.
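
To illustrate why the buffers matter, here is a sequential simulation of a chunked two-buffer split. It is a sketch of the general technique, not the actual splitting.py code: the loop over t stands in for threads, plain lists stand in for arrays, and goes_left and n_threads are hypothetical names. Each simulated thread writes only into its own slice of the buffers, and sample_indices is rewritten only once every chunk has been read.

```python
def split_indices_two_buffers(sample_indices, goes_left, n_threads=4):
    """Stable chunked partition using two scratch buffers; returns n_left."""
    n = len(sample_indices)
    left_buffer = [0] * n
    right_buffer = [0] * n
    # Chunk t covers sample_indices[bounds[t]:bounds[t + 1]].
    bounds = [n * t // n_threads for t in range(n_threads + 1)]
    left_counts = [0] * n_threads
    right_counts = [0] * n_threads

    # Phase 1 (would run in parallel): each chunk fills its own buffer slice.
    for t in range(n_threads):
        start, stop = bounds[t], bounds[t + 1]
        n_l = n_r = 0
        for idx in sample_indices[start:stop]:
            if goes_left(idx):
                left_buffer[start + n_l] = idx
                n_l += 1
            else:
                right_buffer[start + n_r] = idx
                n_r += 1
        left_counts[t], right_counts[t] = n_l, n_r

    # Phase 2: copy each chunk's results to its final position. The
    # destination of one chunk can overlap the source region of another,
    # which is why phase 1 does not write into sample_indices directly.
    n_left = sum(left_counts)
    left_pos, right_pos = 0, n_left
    for t in range(n_threads):
        start = bounds[t]
        n_l, n_r = left_counts[t], right_counts[t]
        sample_indices[left_pos:left_pos + n_l] = left_buffer[start:start + n_l]
        sample_indices[right_pos:right_pos + n_r] = right_buffer[start:start + n_r]
        left_pos += n_l
        right_pos += n_r
    return n_left
```

With 8 samples and 3 chunks, for instance, the second chunk's right-going indices land in positions that belong to the third chunk's input, so writing them in place during phase 1 would clobber unread entries.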

NicolasHug commented on June 11, 2024

@NicolasHug might know better how LightGBM handles this part of the code

I haven't checked again but I don't think they have an option to disable parallel splitting.

@maartenbreddels could you check if you have the same issue on LightGBM? Note that they reuse allocated data, as we plan to do in #81, so we need to take this into account.

maartenbreddels commented on June 11, 2024

There are some parallel in-place partitioning algorithms: http://www.lsi.upc.edu/~lfrias/research/parpar/wea08.pdf
They don't appear super trivial, not something I'd do in one evening.

But I think one of the buffers could be avoided, which would already save a bit. Would you be interested in a PR that does either a single-threaded split, or uses one buffer, or both? I can't promise I can do it, and if the one-buffer PR would make the code less readable, and that would be a reason not to merge, I won't bother.

could you check if you have the same issue on LightGBM?

I cannot use LightGBM as it is now: the current implementation makes at least 2 memory copies. My vaex-ml hack avoids 1 copy, but the memory usage is still excessive.

My plan is to see what is possible with pygbm (much easier to understand and to edit), and possibly see how the changes can be translated to LightGBM.

NicolasHug commented on June 11, 2024

I think a single-threaded version would be welcome and should not be too complicated to add.

I'd be curious to know how to avoid using one of the two arrays though!

maartenbreddels commented on June 11, 2024

I think a single-threaded version would be welcome and should not be too complicated to add.

I'll open a PR for that. Any guidelines for how this should be configurable?

I'd be curious to know how to avoid using one of the two arrays though!

I thought of using sample_indices for the 'left' indices, and a scratchpad/buffer for the 'right' indices. Basically, sample_indices takes over the role of left_indices_buffer. That should work, right?
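
A minimal single-threaded sketch of that idea, with a plain list in place of the array and a hypothetical goes_left predicate. In a sequential pass the write position for left indices can never get ahead of the read position, so nothing unread is overwritten. Whether the same holds once several threads work on separate chunks is not shown here.

```python
def split_indices_one_buffer(sample_indices, goes_left):
    """Stable partition with a single scratch buffer; returns n_left."""
    right_buffer = [0] * len(sample_indices)
    n_left = n_right = 0
    for read_pos in range(len(sample_indices)):
        idx = sample_indices[read_pos]
        if goes_left(idx):
            # Safe: n_left <= read_pos, so this slot has already been read.
            sample_indices[n_left] = idx
            n_left += 1
        else:
            right_buffer[n_right] = idx
            n_right += 1
    # Append the right-going indices after the compacted left block.
    sample_indices[n_left:] = right_buffer[:n_right]
    return n_left


binned_feature = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up toy data
indices = list(range(len(binned_feature)))
n_left = split_indices_one_buffer(indices, lambda idx: binned_feature[idx] <= 3)
print(n_left, indices)  # 4 [0, 1, 3, 6, 2, 4, 5, 7]
```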

NicolasHug commented on June 11, 2024

If you can make sure that no entry in sample_indices gets overwritten before it's written into the other buffer then I guess so. But there's the "if" ^^

any guidelines for how this should be configurable?

Let's keep it simple for now: you can try passing a parameter, e.g. parallel_splitting, from BaseGradientBoosting all the way down to SplittingContext.__init__, and make split_indices dispatch to either split_indices_parallel or split_indices_single_thread. We can work out the details later.
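
Roughly, the wiring could look like the sketch below. The classes are stripped-down, hypothetical stand-ins that only borrow the names mentioned above; make_splitting_context is invented for the example, and the two split functions return a label instead of partitioning so that the dispatch is visible.

```python
class SplittingContext:
    """Hypothetical stand-in that only stores the flag."""

    def __init__(self, parallel_splitting=True):
        self.parallel_splitting = parallel_splitting


def split_indices_parallel(context, sample_indices):
    # Stand-in for the existing buffer-based, multi-threaded code path.
    return "parallel"


def split_indices_single_thread(context, sample_indices):
    # Stand-in for a new buffer-free, single-threaded code path.
    return "single_thread"


def split_indices(context, sample_indices):
    """Dispatch on the flag stored in the context."""
    if context.parallel_splitting:
        return split_indices_parallel(context, sample_indices)
    return split_indices_single_thread(context, sample_indices)


class BaseGradientBoosting:
    """Hypothetical stand-in for the estimator base class."""

    def __init__(self, parallel_splitting=True):
        self.parallel_splitting = parallel_splitting

    def make_splitting_context(self):
        # In a real implementation the flag would likely travel through
        # intermediate layers; it is handed over directly here.
        return SplittingContext(parallel_splitting=self.parallel_splitting)


est = BaseGradientBoosting(parallel_splitting=False)
print(split_indices(est.make_splitting_context(), [0, 1, 2]))  # single_thread
```

Defaulting the flag to True would keep the current behaviour for users who don't set it.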
