Comments (11)

mmuckley commented on September 21, 2024

Hello @headmeister, you're correct that there is no planning stage. For the FFT step, torchkbnufft delegates everything to the PyTorch FFT functions.

Could you let me know what operating systems your two machines use and what version of PyTorch you have? It's been a very long time since I worked on the threading backend, but I do remember observing fairly different characteristics on Linux, macOS, and Windows.

For what it's worth, we use multi-threading largely because some of the subroutines that torchkbnufft calls do not have efficient multithreading for the specific problems we have with NUFFT. For these cases we manually chop up the trajectory ourselves and do thread management over the chopped-up trajectory. The rules for the distribution were tuned for a 2D radial trajectory, and testing has been on 8-core to 40-core systems.
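
That chop-and-fork pattern can be sketched with a plain thread pool (an illustrative assumption; the library itself uses TorchScript fork/wait rather than this helper):

```python
# Minimal sketch of manual work distribution over a chopped-up
# trajectory. `process_chunks` and its chunking rule are illustrative,
# not torchkbnufft's actual code.
from concurrent.futures import ThreadPoolExecutor

def process_chunks(spokes, work_fn, num_threads=8):
    # chop the spokes into roughly equal chunks, one per thread
    chunk = max(1, len(spokes) // num_threads)
    chunks = [spokes[i:i + chunk] for i in range(0, len(spokes), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(work_fn, chunks))
    # stitch the per-chunk results back together in order
    return [item for part in results for item in part]
```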

from torchkbnufft.

headmeister commented on September 21, 2024

Working setup was:
Ubuntu 20.04.03 LTS
PyTorch 1.10.2, tried with both currently supported versions of CUDA and without it (all worked)
512 GB RAM
AMD Epyc 24 core
RTX 2080 Ti

Non-functional setup was:
Ubuntu 20.04.03 LTS
PyTorch 1.10.2, tried with both currently supported versions of CUDA and without it (neither worked)
1 TB RAM
AMD Epyc 2x 64 Core
2x Nvidia A100

I also tried a Windows machine with an 8-core AMD Ryzen CPU, and there were no issues there either.

What was common to all of them, however, was that when processing a set with multiple trajectories across the batch dimension, the benefit of using a GPU was basically zero; it is CPU bound for some reason. When working the way your performance check does, that is, using a single trajectory for multiple input k-spaces, the GPU acceleration was very noticeable.

mmuckley commented on September 21, 2024

Okay, so to summarize, you have:

100 time points
37 spokes per time point

How many coils? And which version of torchkbnufft? And I see above you said this is 2D.

headmeister commented on September 21, 2024

Yes, 2D acquisition.
4 coils (it's a preclinical Bruker machine).
The acquired data size is 128x4x37x100 (pts x coils x spokes x time points), sampled with a golden-angle 2D radial sequence.

The torchkbnufft version is the newest from pip, which should be 1.3.0.
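
For concreteness, that layout amounts to the following shape bookkeeping (the batched reshape at the end is an assumption about how such data would typically be fed to a batched NUFFT, not the poster's exact code):

```python
# Acquisition layout described above: readout points x coils x spokes
# x time frames, golden-angle 2D radial.
npts, ncoil, nspokes, nframes = 128, 4, 37, 100

# For a batched NUFFT over time frames, the k-space data would
# typically be reshaped so each frame is one batch element
# (assumed convention, not the poster's exact code):
kdata_shape = (nframes, ncoil, npts * nspokes)
```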

mmuckley commented on September 21, 2024

Hello @headmeister, my understanding is we have two issues: (1) the density compensation error and (2) the slow batched NUFFTs.

For (1), this is an obscure error that I haven't encountered before. Have you tried reducing the number of available threads by setting something like OMP_NUM_THREADS=8?
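
The environment variable has to be in place before PyTorch starts its OpenMP runtime, so from within Python the sketch would be (an assumption about setup, rather than setting it in the shell):

```python
# Cap the OpenMP thread pool. This must happen before `import torch`,
# because the OpenMP runtime reads OMP_NUM_THREADS once at startup.
import os

os.environ["OMP_NUM_THREADS"] = "8"
```

From the shell, `OMP_NUM_THREADS=8 python script.py` has the same effect, and `torch.set_num_threads(8)` is the runtime equivalent.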

For (2), I think a problem might be that you have many tiny problems, even tinier than we normally expect for dynamic imaging. The threads might not be getting enough work. On CPU I do not observe any performance difference across thread counts, possibly because the overhead of creating and destroying threads is similar to the computation work itself. For GPU, I can actually get a 60% speedup by using 8 threads instead of 40 by setting OMP_NUM_THREADS=8.

All of my tests were on the current main version of torchkbnufft on Linux.

Let me know if any of these help you.

headmeister commented on September 21, 2024

Hello,
(1) I tried reducing the number of threads OMP uses, and nothing changed regarding the presence of the error. On the other hand, I did update the GPU drivers in the meantime, and the error changed to:

/torchkbnufft/_nufft/interp.py", line 533, in calc_coef_and_indices_fork_over_batches

    # collect the results
    results = [torch.jit.wait(future) for future in futures]
               ~~~~~~~~~~~~~~ <--- HERE
    coef = torch.cat([result[0] for result in results])
    arr_ind = torch.cat([result[1] for result in results])

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA driver error: invalid device context

Again, this error applies to the DCF computation only, but I think it might be the same issue as before, only reported differently.

(2) Setting the number of threads did not have much impact on our system. But this may be because the individual cores are generally weaker when there are many of them: while fewer threads may lower the thread creation/destruction overhead, the cores are too weak to compute fast enough.
I also redid my algorithm using the standard FFT (Cartesian data as input), and the GPU usage rises to almost 100% while the CPU usage drops to basically zero. The acceleration over CPU is around a factor of 10 for the whole algorithm, which is in line with what is usually reported.
As I see it, isn't there somewhere during the NUFFT computation some handover between the CPU and GPU, even though the data is on the GPU? This might take quite some time, especially when it is performed for each thread independently.
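
A crude wall-clock timer is enough to reproduce this kind of comparison (a hypothetical helper, not from the thread; for CUDA work the timed function must block until the device finishes, e.g. by calling torch.cuda.synchronize() itself):

```python
import time

def time_it(fn, repeats=10):
    # average wall-clock seconds per call; for GPU work, `fn` must
    # synchronize the device before returning (assumed here)
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats
```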

mmuckley commented on September 21, 2024

The package shouldn't change the device of the tensors at all after creation - it should use the device of the tensors that you pass in. New tensors should be created on the target device. The only CPU-GPU communication is sending computation instructions to the GPU. You can see the logic for this in the interp function: https://github.com/mmuckley/torchkbnufft/blob/main/torchkbnufft/_nufft/interp.py. You could try dropping some print statements in there to see if any Tensor types are mismatched.
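
A device-consistency check along those lines could look like this sketch (a hypothetical helper, not part of torchkbnufft; SimpleNamespace objects stand in for real tensors):

```python
from types import SimpleNamespace  # stand-in for real tensors in this sketch

def check_same_device(tensors):
    # tensors: mapping of name -> object with a `.device` attribute
    devices = {name: str(t.device) for name, t in tensors.items()}
    if len(set(devices.values())) > 1:
        raise RuntimeError(f"device mismatch: {devices}")
    return devices

# demo with stand-in objects carrying a `.device` attribute
same = check_same_device({
    "image": SimpleNamespace(device="cuda:0"),
    "ktraj": SimpleNamespace(device="cuda:0"),
})
```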

For my 40-core system the cores are also a little slow. It is also a 2-socket system if I recall correctly. In terms of hardware the primary difference would be AMD - I don't have an AMD system to test on.

wouterzwerink commented on September 21, 2024

Hello, (1) I tried reducing the number of threads the OMP uses and nothing changed regarding the presence of the error. On the other hand I did update the GPU drivers in the meantime, and the error changed to:

/torchkbnufft/_nufft/interp.py", line 533, in calc_coef_and_indices_fork_over_batches

    # collect the results
    results = [torch.jit.wait(future) for future in futures]
               ~~~~~~~~~~~~~~ <--- HERE
    coef = torch.cat([result[0] for result in results])
    arr_ind = torch.cat([result[1] for result in results])

RuntimeError: The following operation failed in the TorchScript interpreter. Traceback of TorchScript (most recent call last): RuntimeError: CUDA driver error: invalid device context

I get this same error during backward passes over a batched NUFFT, but in table_interp_adjoint.
The error disappears when using:

torch._C._jit_set_profiling_mode(False)

I do not quite understand why, but maybe this helps with finding the bug.

mlaves commented on September 21, 2024

@wouterzwerink I have the same error for batched inputs with varying sizes in torchkbnufft.KbNufft.

mmuckley commented on September 21, 2024

@wouterzwerink @mlaves please open a separate issue; that error is not related to thread allocation.

mlaves commented on September 21, 2024

@mmuckley Thanks, will do.
