Comments (8)
This is a very important enhancement: the input image is always real, so we should take advantage of the faster R2C transforms at some point.
This would reduce not only the computing time but also the memory usage.
Switching to R2C transforms will change the coordinate mapping in Fourier space, but it is probably not too difficult to rework the algorithm to account for the change in symmetry.
I am undecided whether this should be done for version 1.0, given the possibly large amount of time needed to check it properly.
from galario.
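To illustrate the change in coordinate mapping mentioned above, here is a minimal sketch, not galario's actual code. It assumes a row-major layout in which the r2c transform stores only the first ny/2+1 columns of the spectrum, as in the FFTW and cuFFT conventions; the remaining elements follow from the Hermitian symmetry of the transform of a real image. The names `HalfIndex` and `half_spectrum_index` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Location of a full-spectrum element inside the half-spectrum (r2c) array.
 * Hypothetical helper type, for illustration only. */
typedef struct {
    size_t row;     /* row index in the half-spectrum array */
    size_t col;     /* column index, always <= ny/2 */
    int conjugate;  /* 1 if the stored value must be complex-conjugated */
} HalfIndex;

/* Map element (i, j) of the full nx-by-ny spectrum of a real image onto the
 * nx-by-(ny/2+1) array produced by an r2c transform. Elements in the missing
 * half follow from F[i][j] = conj(F[(nx-i)%nx][(ny-j)%ny]). */
HalfIndex half_spectrum_index(size_t i, size_t j, size_t nx, size_t ny)
{
    assert(i < nx && j < ny);
    HalfIndex h;
    if (j <= ny / 2) {
        /* Stored directly by the r2c transform. */
        h.row = i;
        h.col = j;
        h.conjugate = 0;
    } else {
        /* Not stored: use the mirrored element and conjugate it. */
        h.row = (nx - i) % nx;
        h.col = ny - j;
        h.conjugate = 1;
    }
    return h;
}
```

For an 8x8 image, for example, element (3, 6) of the full spectrum would be read from element (5, 2) of the half spectrum and conjugated.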
I agree it will be quite some work. But if the next round of CPU profiling tomorrow shows that the O(n^2) operations like shift still dominate, we have to seriously consider doing it asap. Once we submit the paper, our enthusiasm to take on such changes will fade away.
On the FFTW and cuFFT data formats for r2c transforms:
http://fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format
http://docs.nvidia.com/cuda/cufft/#data-layout
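As a rough sketch of what the linked layout rules imply for buffer sizes: for an n0-by-n1 real array, the r2c output holds n0*(n1/2+1) complex elements, and for an in-place transform FFTW requires each real row to be padded to 2*(n1/2+1) elements. The helper names below are hypothetical and not part of FFTW or cuFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex elements produced by an r2c transform of an
 * n0-by-n1 real array (the last dimension is cut to n1/2+1). */
size_t r2c_complex_count(size_t n0, size_t n1)
{
    assert(n0 > 0 && n1 > 0);
    return n0 * (n1 / 2 + 1);
}

/* Row length, in real elements, that an in-place r2c transform needs:
 * each row is padded so it can hold n1/2+1 complex numbers. */
size_t r2c_inplace_row_stride(size_t n1)
{
    assert(n1 > 0);
    return 2 * (n1 / 2 + 1);
}

/* Total number of real elements to allocate for an in-place r2c transform. */
size_t r2c_inplace_real_count(size_t n0, size_t n1)
{
    return n0 * r2c_inplace_row_stride(n1);
}
```

With in-place storage, real element (i, j) sits at offset `i * r2c_inplace_row_stride(n1) + j`, so any loop over the image has to skip the padding at the end of each row.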
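Reserve the memory for the Fourier transform on the CPU with the FFTW allocation functions. From the MPI example:
/* local_n0 and local_0_start are ptrdiff_t outputs describing the rows owned by this process */
alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);
...
fftw_free(data);  /* memory from fftw_alloc_complex must be released with fftw_free */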
Here are some additional performance hints for the GPU, taken from http://docs.nvidia.com/cuda/cufft/index.html#accuracy-and-performance. I'm surprised that a plan needs as much temporary space as the image size. But this may explain why the in-place transform is not faster: it has to copy all the elements back from the temporary space.
For real to complex
- Ensure problem size of x dimension is a multiple of 4.
- Use out-of-place mode.
Memory usage
Execution of a transform of a particular size and type may take several stages of processing. When a plan for the transform is generated, cuFFT derives the internal steps that need to be taken. These steps may include multiple kernel launches, memory copies, and so on. In addition, all the intermediate buffer allocations (on CPU/GPU memory) take place during planning. These buffers are released when the plan is destroyed. In the worst case, the cuFFT Library allocates space for 8*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements (where batch denotes the number of transforms that will be executed in parallel, rank is the number of dimensions of the input data (see Multidimensional Transforms) and n[] is the array of transform dimensions) for single and double-precision transforms respectively. Depending on the configuration of the plan, less memory may be used. In some specific cases, the temporary space allocations can be as low as 1*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements. This temporary space is allocated separately for each individual plan when it is created (i.e., temporary space is not shared between the plans).
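To put numbers on the bounds quoted above, here is a small sketch that evaluates factor*batch*n[0]*..*n[rank-1] elements in bytes, with factor 8 for the worst case and 1 for the best case. The function `cufft_temp_bytes` is a hypothetical helper, not a cuFFT call, and the element sizes (8 bytes for cufftComplex, 16 bytes for cufftDoubleComplex) are passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Temporary space, in bytes, for a plan that allocates
 * factor * batch * n[0] * .. * n[rank-1] elements of elem_size bytes each.
 * Hypothetical helper: factor is 8 for the documented worst case and
 * 1 for the documented best case. */
size_t cufft_temp_bytes(size_t factor, size_t batch,
                        const size_t *n, size_t rank, size_t elem_size)
{
    assert(n != NULL && rank > 0);
    size_t total = factor * batch * elem_size;
    for (size_t d = 0; d < rank; ++d) {
        total *= n[d];  /* multiply in each transform dimension */
    }
    return total;
}
```

For a single 1024x1024 single-precision transform this gives 64 MiB in the worst case and 8 MiB in the best case, the latter being the same size as a complex image of those dimensions.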
Fixed by #61
To make this quantitative, we compare the c2c (6fe6eaa) and r2c (43d34bd) transforms. As anticipated, the FFT computation time drops by a factor of ~2. From the attached nvprof output files, the most important numbers, in ms, are: the memory transfer (~32 in either case), the first shift (2.6 vs 2), the FFT (4 vs 2), and the second shift (2.6 vs 1.4).
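The arithmetic behind these timings can be sketched as follows, using only the figures quoted above; the helper names are hypothetical. Summing the three kernels gives roughly 9.2 ms for c2c against 5.4 ms for r2c (about 1.7x), but once the ~32 ms memory transfer is included the end-to-end gain is only about 1.1x.

```c
#include <assert.h>

/* Ratio of c2c to r2c time for one stage; a value above 1 means r2c is faster. */
double speedup(double c2c_ms, double r2c_ms)
{
    assert(c2c_ms > 0.0 && r2c_ms > 0.0);
    return c2c_ms / r2c_ms;
}

/* End-to-end time in ms: memory transfer + first shift + FFT + second shift. */
double total_ms(double transfer, double shift1, double fft, double shift2)
{
    return transfer + shift1 + fft + shift2;
}
```

For example, `speedup(total_ms(32.0, 2.6, 4.0, 2.6), total_ms(32.0, 2.0, 2.0, 1.4))` evaluates the end-to-end ratio, while passing 0.0 as the transfer time isolates the kernels.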
speed_benchmark_c2c_7cdbfac.txt
speed_benchmark_r2c_43d34bd3.txt