
Communication with Cupy about gpufit (open, 14 comments)

BigBSB commented on June 11, 2024
Communication with Cupy

from gpufit.

Comments (14)

SBresler commented on June 11, 2024

I have a version now which does the following:

  1. sets up the method in the DLL corresponding to the constrained CUDA interface function
  2. checks that the input CuPy arrays are C-contiguous
  3. gets the pointer for the CuPy arrays
  4. sends that pointer over to Gpufit.
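A minimal sketch of steps 2–4, assuming the standard pointer attributes (CuPy's `arr.data.ptr`; NumPy's `arr.ctypes.data` is used here only as a CPU stand-in so the logic can run without a GPU):

```python
import numpy as np

def raw_pointer(arr):
    """Return the raw data pointer of a C-contiguous array.

    For a CuPy device array this is arr.data.ptr; for a NumPy host
    array (a CPU stand-in for testing) it is arr.ctypes.data.
    """
    if not arr.flags.c_contiguous:
        raise ValueError("array must be C-contiguous; use ascontiguousarray first")
    if hasattr(arr, "data") and hasattr(arr.data, "ptr"):  # CuPy device array
        return arr.data.ptr
    return arr.ctypes.data  # NumPy host array

# The pointer would then be handed to the DLL entry point via ctypes,
# e.g. (hypothetical setup, not the actual pyGpufit code):
#   lib = ctypes.CDLL("Gpufit.dll")
#   lib.gpufit_constrained_cuda_interface(..., ctypes.c_void_p(raw_pointer(data)), ...)
```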

So I think this is a lot closer to what I want. I will do a PR at some point for this.

An idea I have been toying with is to expose all of the functions via pybind11 rather than ctypes; it seems to be the tool of choice for a lot of people.

This would give you access to pytest for unit testing in gpufit. I think the Python interface is by far the most important aspect of this for any sort of widespread adoption.


jkfindeisen commented on June 11, 2024

The location of the input/output data can only be specified in the C++ interface, not in the C interface, so I guess that will not work. It could work, though, with some changes to the code.


BigBSB commented on June 11, 2024

CuPy has an option to run custom CUDA kernels; would this feature be useful for getting it working?
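For reference, CuPy exposes this through `cupy.RawKernel`, which compiles a CUDA C source string at runtime. A sketch of what a small preprocessing kernel could look like (the kernel and its CPU reference below are illustrative, not from Gpufit):

```python
import numpy as np

# CUDA C source that cupy.RawKernel would compile and launch on device
# data, e.g.: kernel = cupy.RawKernel(SOURCE, "subtract_baseline")
SOURCE = r"""
extern "C" __global__
void subtract_baseline(const float* data, float* out, float baseline, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = data[i] - baseline;
}
"""

def subtract_baseline_ref(data, baseline):
    """CPU reference of the same operation, runnable without a GPU."""
    return (data - baseline).astype(np.float32)
```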


SBresler commented on June 11, 2024

Hey, I ended up figuring this out. Is there a way to update the master branch?

Basically this lets you do all of your data processing on the GPU, and then immediately push it to GPUfit to do the fits without having to transfer the data back to the CPU.

The next thing I want to figure out is how to JIT-compile fit models, so that you can construct a fit model based on various parameters and then run it through Gpufit. I don't know if this is possible.


superchromix commented on June 11, 2024

Sounds great. The best way to do this is to fork the repository, include your changes, and submit a pull request.


SBresler commented on June 11, 2024

Ok, well, the CuPy interfacing works.

I actually don't think I care about the JIT stuff; it looks like y'all tried that and there were speed hits.

How about fits that involve complex numbers? Would that be a difficult addition?


casparvitch commented on June 11, 2024

@SBresler I don't see any changes on your fork. Can you share how you implemented the CuPy interfacing? I (and I imagine others) would find this very interesting/useful! Cheers.


SBresler commented on June 11, 2024


SBresler commented on June 11, 2024

Alright, I have come back to this.

What I did was cast a pointer to the CuPy object in the Python interface.

This allowed me to put in a cupy object as an argument to the fit function call.

Looking at the traces through nvtx, there is still a lot of copying happening during that block.

I was looking at JAXFit, which I could more realistically modify since I have a lot more Python knowledge than C++ at this point, but to me that program is more focused on extremely large, complex fits, whereas Gpufit is all about doing a ton of small fits at once.

This might be personal bias because it's exactly my use case, but my feeling is that if you have a ton of small datasets like this that you want to fit, the most obvious improvement for Gpufit at the moment is to allow access to data that is already in global memory on the device.

At the moment I am doing ~3 GB/s transfers to the GPU for FFTs and then some reduction operations, and it's working relatively well, but the bottlenecks are always transfer times.
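Because CuPy mirrors the NumPy API, an FFT-plus-reduction step like the one described above can be written once and run on either side of the transfer. This is an illustrative stand-in (the record layout and magnitude reduction are assumptions, not the actual pipeline):

```python
import numpy as np  # on a GPU, `import cupy as np` is a near drop-in

def preprocess(raw, n_fits, fit_size):
    """FFT each record, reduce to spectral magnitudes, and return
    C-contiguous float32 traces ready to hand to a fit call."""
    spectra = np.fft.rfft(raw.reshape(n_fits, -1), axis=1)
    mags = np.abs(spectra)[:, :fit_size].astype(np.float32)
    return np.ascontiguousarray(mags)
```

With CuPy, the output of this function stays in device global memory, so its pointer could be passed straight to the CUDA interface without a round trip through host RAM.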

I just thought of this: maybe it's easier to go the other way and put all of my preprocessing into Gpufit instead.

I am streaming a LOT of data through a digitizer at the moment (3 GB/s) and have gotten fits continuously for about 10 seconds. I am fairly certain that eliminating one or both of these copies blows the problem apart: RDMA for the digitizer takes away one transfer, accessing global memory for the Gpufit calls takes away two transfers, and my reduction is about a factor of 2.5.


SBresler commented on June 11, 2024

Another thought:

What if you want to use RDMA to get the data to the GPU faster, bypassing the whole process of reading the data into CPU RAM over the PCIe bus (from a hard drive or otherwise), pinning the address, transferring, et cetera?

If the data had to pass through the CPU first, that would mean you fundamentally cannot use Gpufit and RDMA in the same application.


SBresler commented on June 11, 2024

Another idea: add a preprocessing section that lets you supply your own kernel to run on the data before the fit.

This could work as a stopgap.
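Such a stopgap hook might look like this on the Python side (`fit_fn` stands in for the actual Gpufit call; the wrapper name and signature are hypothetical):

```python
def fit_with_preprocessing(fit_fn, data, preprocess=None, **fit_kwargs):
    """Run an optional user-supplied preprocessing step on the (device)
    data before handing it to the fit call."""
    if preprocess is not None:
        data = preprocess(data)
    return fit_fn(data, **fit_kwargs)
```

For example, `fit_with_preprocessing(gpufit_call, data, preprocess=my_kernel_launcher)` would run a custom kernel on the device data and then fit, with no intermediate host copy.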


superchromix commented on June 11, 2024

Hi. Fitting data that is already stored in the GPU memory is already implemented in Gpufit. The docs are here: https://gpufit.readthedocs.io/en/latest/gpufit_api.html#gpufit-cuda-interface .

As you found out, when working with Python, you need to obtain a pointer to a GPU memory location to use the gpufit_cuda_interface call. Gpufit knows nothing about python or numpy arrays, etc.

The pre-processing you're talking about could be implemented as a separate routine. You can do anything you want with the data stored on the GPU before and after calling Gpufit. Gpufit is simply meant to handle the fit step.

Finally, we tried real-time compilation of fit model functions, and this caused major performance bottlenecks. It would clearly be a great feature to have. This topic may be revisited in the future.


SBresler commented on June 11, 2024

Wow, this is why you have to be persistent and keep asking!

So either this is new, or I was just going off information in other posts that wasn't entirely accurate. I don't see a way to look at old versions of the docs, but it would be interesting to find out.

Thanks so much for the information. I can work with this. It was blowing my mind that this wasn't a feature, and it totally is.


SBresler commented on June 11, 2024

> Hi. Fitting data that is already stored in the GPU memory is already implemented in Gpufit. The docs are here: https://gpufit.readthedocs.io/en/latest/gpufit_api.html#gpufit-cuda-interface .
>
> As you found out, when working with Python, you need to obtain a pointer to a GPU memory location to use the gpufit_cuda_interface call. Gpufit knows nothing about python or numpy arrays, etc.
>
> The pre-processing you're talking about could be implemented as a separate routine. You can do anything you want with the data stored on the GPU before and after calling Gpufit. Gpufit is simply meant to handle the fit step.
>
> Finally, we tried real-time compilation of fit model functions, and this caused major performance bottlenecks. It would clearly be a great feature to have. This topic may be revisited in the future.

Interesting.

When you say "major performance bottlenecks", are you talking about more than an order of magnitude speed decrease?

I think scientists are generally hungry for faster fitting routines, and almost anything beats the speed of LMfit.

