Right now we ignore the fact that the input is real. This would reduce the output size

Exploit redundancy in real -> complex FFTW (closed)

mtazzari commented on August 28, 2024

Exploit redundancy in real -> complex FFTW

from galario.

Comments (8)

mtazzari commented on August 28, 2024

This is a very important enhancement: the input image will always be real, so we should definitely take advantage of the faster R2C transforms at some point.
This would reduce not only the computing time but also the memory usage.

Using R2C transforms will change the coordinate mapping in Fourier space, but it is probably not too difficult to rework the algorithm to account for the change in symmetry.
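To illustrate the change in coordinate mapping, here is a minimal sketch, assuming a row-major n0 x n1 real image whose r2c output keeps only columns 0..n1/2 (the layout described in the FFTW and cuFFT documentation). It relies on the Hermitian symmetry Y[i][j] = conj(Y[(n0-i) % n0][(n1-j) % n1]) of a real-input DFT. The names `r2c_location`, `r2c_locate` and `r2c_flat_index` are hypothetical and are not taken from galario.

```c
#include <assert.h>
#include <stddef.h>

/* Where a full-spectrum coefficient lives inside the half-spectrum
 * (r2c) output. Hypothetical helper type, not part of galario. */
typedef struct {
    size_t row;     /* row in the r2c output, 0 <= row < n0            */
    size_t col;     /* column in the r2c output, 0 <= col <= n1/2      */
    int conjugate;  /* 1 if the stored value must be complex-conjugated */
} r2c_location;

/* Map the full-spectrum index (i, j) of an n0 x n1 transform to the
 * element of the r2c output that holds it (possibly as a conjugate). */
r2c_location r2c_locate(size_t n0, size_t n1, size_t i, size_t j)
{
    assert(n0 > 0 && n1 > 0 && i < n0 && j < n1);
    r2c_location loc;
    if (j <= n1 / 2) {
        /* Stored directly in the non-redundant half. */
        loc.row = i;
        loc.col = j;
        loc.conjugate = 0;
    } else {
        /* Redundant half: use Y[i][j] = conj(Y[(n0-i)%n0][n1-j]). */
        loc.row = (n0 - i) % n0;
        loc.col = n1 - j;
        loc.conjugate = 1;
    }
    return loc;
}

/* Flat row-major offset into the r2c output, whose rows hold
 * n1/2 + 1 complex elements each. */
size_t r2c_flat_index(size_t n1, r2c_location loc)
{
    return loc.row * (n1 / 2 + 1) + loc.col;
}
```

For a 4 x 4 image, the coefficient at (1, 3) is not stored; it is the conjugate of the stored element at (3, 1), which sits at flat offset 10.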

I am undecided whether this should be done for version 1.0, given the possibly large amount of time needed to check it properly.


fredRos commented on August 28, 2024

I agree it will be quite some work. But if the next round of CPU profiling tomorrow shows that the O(n^2) operations like shift still dominate, we have to seriously consider doing it asap. Once we submit the paper, our enthusiasm to take on such changes will fade away.


fredRos commented on August 28, 2024

On the FFTW format for r2c transforms
http://fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format
http://docs.nvidia.com/cuda/cufft/#data-layout
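As a rough sketch of what these layouts mean for memory, the helpers below compute array sizes for an n0 x n1 transform, assuming the half-spectrum convention from the two pages above: the r2c output keeps n1/2 + 1 complex elements per row, and an in-place transform pads each real input row to 2 * (n1/2 + 1) values. The function names are hypothetical and are not part of FFTW, cuFFT or galario.

```c
#include <stddef.h>

/* Complex elements in the output of a full c2c transform. */
size_t c2c_output_elems(size_t n0, size_t n1)
{
    return n0 * n1;
}

/* Complex elements in the output of an r2c transform: only the
 * non-redundant half of the last dimension is stored. */
size_t r2c_output_elems(size_t n0, size_t n1)
{
    return n0 * (n1 / 2 + 1);
}

/* Real values per row needed for an in-place r2c transform: the input
 * rows are padded so the complex output fits in the same buffer. */
size_t r2c_inplace_row_reals(size_t n1)
{
    return 2 * (n1 / 2 + 1);
}
```

For a 1024 x 1024 image this gives 1048576 complex output elements for c2c against 525312 for r2c, which is where the roughly twofold memory saving comes from.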


fredRos commented on August 28, 2024

Reserve the memory for the Fourier transform on the cpu with FFTW functions. From the MPI example,

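/* local_n0 and local_0_start are ptrdiff_t outputs describing this rank's slab */
alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);
...
fftw_free(data);  /* release with fftw_free, not free */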


fredRos commented on August 28, 2024

Here are some additional performance hints for the GPU, taken from http://docs.nvidia.com/cuda/cufft/index.html#accuracy-and-performance. I'm surprised that a plan needs as much temporary space as the image size. This may explain why the in-place transform is not faster: it has to copy all the elements back from the temporary space.

For real to complex

Ensure problem size of x dimension is a multiple of 4.
Use out-of-place mode.

Memory usage

Execution of a transform of a particular size and type may take several stages of processing. When a plan for the transform is generated, cuFFT derives the internal steps that need to be taken. These steps may include multiple kernel launches, memory copies, and so on. In addition, all the intermediate buffer allocations (on CPU/GPU memory) take place during planning. These buffers are released when the plan is destroyed. In the worst case, the cuFFT Library allocates space for 8*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements (where batch denotes the number of transforms that will be executed in parallel, rank is the number of dimensions of the input data (see Multidimensional Transforms) and n[] is the array of transform dimensions) for single and double-precision transforms respectively. Depending on the configuration of the plan, less memory may be used. In some specific cases, the temporary space allocations can be as low as 1*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements. This temporary space is allocated separately for each individual plan when it is created (i.e., temporary space is not shared between the plans).
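To put numbers on the quoted bounds, here is a small sketch that evaluates factor*batch*n[0]*..*n[rank-1] for a given plan shape. The helper `temp_space_elems` and the two byte-size constants are hypothetical names for illustration, not cuFFT API; the constants assume that cufftComplex holds two floats (8 bytes) and cufftDoubleComplex two doubles (16 bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed element sizes: two floats and two doubles respectively. */
enum { SINGLE_COMPLEX_BYTES = 8, DOUBLE_COMPLEX_BYTES = 16 };

/* Number of complex elements of temporary space for a plan:
 * factor * batch * n[0] * .. * n[rank-1].
 * factor = 8 is the quoted worst case, factor = 1 the quoted best case. */
size_t temp_space_elems(size_t factor, size_t batch,
                        const size_t *n, size_t rank)
{
    assert(n != NULL && rank > 0);
    size_t elems = factor * batch;
    for (size_t d = 0; d < rank; ++d) {
        elems *= n[d];
    }
    return elems;
}
```

For a single 1024 x 1024 double-precision transform, the worst case is 8388608 elements, or 128 MiB at 16 bytes each, while the best case of 1048576 elements (16 MiB) is exactly one complex image.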


fredRos commented on August 28, 2024

Fixed by #61


fredRos commented on August 28, 2024

To make this quantitative, we compare the c2c (6fe6eaa) and r2c (43d34bd) transforms. As anticipated, the FFT computation time drops by a factor of ~2. From the attached nvprof output files, the most important timings in ms are the memory transfer (~32 in either case), the first shift (2.6 vs 2), the FFT (4 vs 2), and the second shift (2.6 vs 1.4).
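The arithmetic behind these numbers can be sketched as follows. The struct and function names are hypothetical, the timings are the approximate values quoted above, and only the four listed stages are included, so the totals are not full run times.

```c
/* Per-stage timings in milliseconds, as quoted from the nvprof output. */
typedef struct {
    double transfer_ms;  /* host <-> device memory transfer */
    double shift1_ms;    /* first shift                     */
    double fft_ms;       /* the FFT itself                  */
    double shift2_ms;    /* second shift                    */
} stage_times;

static const stage_times C2C_TIMES = {32.0, 2.6, 4.0, 2.6};
static const stage_times R2C_TIMES = {32.0, 2.0, 2.0, 1.4};

/* Sum of the three compute stages, excluding the memory transfer. */
double compute_ms(stage_times t)
{
    return t.shift1_ms + t.fft_ms + t.shift2_ms;
}

/* Sum of all four listed stages. */
double total_ms(stage_times t)
{
    return t.transfer_ms + compute_ms(t);
}
```

The compute stages drop from about 9.2 ms to 5.4 ms (a factor of ~1.7), but with the ~32 ms transfer included the listed stages only go from about 41.2 ms to 37.4 ms (a factor of ~1.1), so the memory transfer dominates in both cases.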


fredRos commented on August 28, 2024

speed_benchmark_c2c_7cdbfac.txt
speed_benchmark_r2c_43d34bd3.txt

