Hello @headmeister, you're correct that there is no planning stage. For the FFT step, torchkbnufft delegates all of that to the PyTorch FFT functions.
Could you let me know which operating systems your two machines use and what version of PyTorch you have? It's been a very long time since I worked on the threading backend, but I do remember observing fairly different characteristics on Linux, macOS, and Windows.
For what it's worth, we use multi-threading largely because some of the subroutines that torchkbnufft calls do not have efficient multithreading for the specific problems we have with NUFFT. For these cases we manually chop up the trajectory ourselves and do thread management over the chopped-up trajectory. The distribution rules were tuned for a 2D radial trajectory, and testing has been on 8-core to 40-core systems.
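The chop-and-fork pattern described here can be sketched with plain Python threads. This is only an illustration of the idea, assuming nothing about torchkbnufft's actual internals (which fork over tensors via TorchScript); the function and variable names are hypothetical.

```python
# Illustrative sketch: split a trajectory into chunks, process each chunk
# on its own thread, then concatenate the partial results in order.
from concurrent.futures import ThreadPoolExecutor

def interp_chunk(chunk):
    # stand-in for the per-chunk interpolation subroutine
    return [2 * x for x in chunk]

def forked_interp(traj, num_threads=4):
    # chop the trajectory into roughly equal pieces, one per thread
    size = max(1, len(traj) // num_threads)
    chunks = [traj[i:i + size] for i in range(0, len(traj), size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(interp_chunk, chunks)
    # stitch the per-chunk results back together in order
    return [y for part in results for y in part]

print(forked_interp(list(range(8)), num_threads=4))  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

The key property is that the concatenated result is identical to processing the whole trajectory at once; only the work distribution changes.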
The working setup was:
- Ubuntu 20.04.3 LTS
- PyTorch 1.10.2, tried with both currently supported versions of CUDA and without it (all worked)
- 512 GB RAM
- AMD Epyc, 24 cores
- RTX 2080 Ti
The non-functional setup was:
- Ubuntu 20.04.3 LTS
- PyTorch 1.10.2, tried with both currently supported versions of CUDA and without it (neither worked)
- 1 TB RAM
- 2x AMD Epyc, 64 cores each
- 2x NVIDIA A100
I also tried a Windows machine with an 8-core AMD Ryzen CPU and saw no issues there either.
What was common to all of them, however, was that when processing a set with multiple trajectories across the batch dimension, the benefit of using a GPU was basically zero; it is CPU-bound for some reason. When working the way your performance check does, that is, using a single trajectory for multiple input k-spaces, the GPU acceleration was very noticeable.
Okay, so to summarize, you have:
- 100 time points
- 37 spokes per time point

How many coils? And which version of torchkbnufft? And I see above you said this is 2D.
from torchkbnufft.
Yes, 2D acquisition, 4 coils (it's a preclinical Bruker machine).
The acquired data size is 128x4x37x100 (pts x coils x spokes x time points), sampled with a golden-angle 2D radial sequence.
The torchkbnufft version is the newest from pip, which should be 1.3.0.
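For scale, a quick back-of-the-envelope on the dimensions quoted above (my arithmetic, not from the thread) shows how small each per-frame NUFFT is, which is relevant to how much work each thread gets:

```python
# Dimensions as reported: pts x coils x spokes x time points
npts, ncoils, nspokes, nframes = 128, 4, 37, 100

# k-space samples per coil in a single time frame
samples_per_frame = npts * nspokes
print(samples_per_frame)  # → 4736

# total samples across all coils and frames
total = npts * ncoils * nspokes * nframes
print(total)  # → 1894400
```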
Hello @headmeister, my understanding is we have two issues: the density compensation error and the slow batched NUFFTs.
For (1), this is an obscure error that I haven't encountered before. Have you tried reducing the number of available threads by setting something like OMP_NUM_THREADS=8?
For (2), I think the problem might be that you have many tiny problems, even tinier than we normally expect for dynamic imaging. The threads might not be getting enough work. I do not observe any difference in performance for any thread count when running on CPU, possibly because the overhead of creating and destroying threads is similar to the computation work. For GPU, I can actually get a 60% speedup by using 8 threads instead of 40 by setting OMP_NUM_THREADS=8.
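One way to try the thread-count experiment is to cap the OpenMP pool via the environment before Python starts; any process launched from that shell inherits the setting. The reconstruction script name below is a placeholder for your own entry point.

```shell
# Limit OpenMP worker threads for this shell and all child processes
export OMP_NUM_THREADS=8
echo "$OMP_NUM_THREADS"   # prints 8

# then launch the reconstruction as usual, e.g.
# python my_recon_script.py   (hypothetical script name)
```

Setting the variable before the interpreter starts matters, since thread pools are typically sized at library initialization.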
All of my tests were on the current main
version of torchkbnufft
on Linux.
Let me know if any of these help you.
Hello,
(1) I tried reducing the number of threads OMP uses and nothing changed regarding the presence of the error. On the other hand, I did update the GPU drivers in the meantime, and the error changed to:
```
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
/torchkbnufft/_nufft/interp.py", line 533, in calc_coef_and_indices_fork_over_batches
    # collect the results
    results = [torch.jit.wait(future) for future in futures]
              ~~~~~~~~~~~~~~ <--- HERE
    coef = torch.cat([result[0] for result in results])
    arr_ind = torch.cat([result[1] for result in results])
RuntimeError: CUDA driver error: invalid device context
```
Again, this error applies to the DCF computation only, but I think it might be the same issue as before, only reported differently.
(2) Setting the number of threads did not have much impact on our system. But this may be because the individual cores are generally weaker when there are many of them. So while a lower thread count may reduce the thread creation/destruction overhead, the cores are too weak to compute fast enough.
I also redid my algorithm using the standard FFT (Cartesian data as input), and the GPU usage rises to almost 100% while the CPU usage drops to basically zero. The acceleration over CPU is around a factor of 10 for the whole algorithm, which is in line with what is usually reported.
As I see it, isn't there some handover between the CPU and GPU somewhere during the NUFFT computation, even though the data is on the GPU? This might take quite some time, especially when it is performed for each thread independently.
The package shouldn't change the device of the tensors at all after creation - it should use the device of the tensors that you pass in. New tensors should be created on the target device. The only CPU-GPU communication is sending computation instructions to the GPU. You can see the logic for this in the interp function: https://github.com/mmuckley/torchkbnufft/blob/main/torchkbnufft/_nufft/interp.py. You could try dropping some print statements in there to see if any Tensor types are mismatched.
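A minimal sketch of the kind of device check those print statements would perform. `FakeTensor` is a stand-in so the example runs without a GPU; the same helper would work on real `torch.Tensor` objects, since it only reads the `.device` attribute. The helper name is hypothetical, not part of torchkbnufft.

```python
# Hypothetical debugging helper: verify that all tensors entering the
# interpolation step share one device, to catch a stray CPU tensor.
class FakeTensor:
    """Stand-in for torch.Tensor; only exposes .device."""
    def __init__(self, device):
        self.device = device

def check_same_device(**tensors):
    # map each argument name to its device, then require a single device
    devices = {name: t.device for name, t in tensors.items()}
    if len(set(devices.values())) > 1:
        raise RuntimeError(f"device mismatch: {devices}")
    return devices

# everything on the GPU: passes and returns the name -> device mapping
print(check_same_device(image=FakeTensor("cuda:0"), ktraj=FakeTensor("cuda:0")))
```

Dropping a call like this (or plain prints of `t.device`) at the entry of the interp functions would quickly confirm whether any input silently lives on the CPU.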
For my 40-core system the cores are also a little slow. It is also a 2-socket system if I recall correctly. In terms of hardware the primary difference would be AMD - I don't have an AMD system to test on.
> (1) I tried reducing the number of threads OMP uses and nothing changed regarding the presence of the error. On the other hand I did update the GPU drivers in the meantime, and the error changed to: `RuntimeError: CUDA driver error: invalid device context` in `calc_coef_and_indices_fork_over_batches`.
I get this same error during backward passes over a batched NUFFT, but in table_interp_adjoint.
The error disappears when using:
```python
torch._C._jit_set_profiling_mode(False)
```
I do not quite understand why, but maybe this helps with finding the bug.
from torchkbnufft.
@wouterzwerink I have the same error for batched inputs with varying sizes in torchkbnufft.KbNufft.
@wouterzwerink @mlaves please open a separate issue - that error is not related to thread allocation.
@mmuckley Thanks, will do.