mtazzari / galario Goto Github PK

View Code? Open in Web Editor NEW

32.0 32.0 15.0 4.03 MB

Gpu Accelerated Library for Analysing Radio Interferometer Observations

Home Page: https://mtazzari.github.io/galario/

License: GNU Lesser General Public License v3.0

CMake 17.56% Python 26.96% C++ 32.74% Shell 1.23% Cython 21.51%

astronomy fourier-analysis gpu gpu-accelerated-library interferometry radio radio-astronomy

galario's People

Contributors

Stargazers

Watchers

Forkers

fredros equant knut0815 lucadimascolo discsim hujun777 smaret yohannfaure sbrownson bjnorfolk psheehan elnjensen mfkiwl dsharsono annajfehr

galario's Issues

Interpolate

In preparation for the R2C implementation.

Now interpolate works for square matrices. Make it work also for nx, nx/2+1 (or generic nx, ny) matrix sizes.
Add equations from Numerical Recipes in the docstring.
Write explicitly that we need the floor function and why.

Exploit redundancy in real -> complex FFTW

Right now we ignore the fact that the input is real. This would reduce the output size by a factor of 2 and should reduce the effort of all other operations on the array accordingly. The logic, for example for interpolation, would get messier but for the sake of speed, we can try it. The output format that FFTW uses is documented here

http://fftw.org/fftw3_doc/Multi_002dDimensional-DFTs-of-Real-Data.html#Multi_002dDimensional-DFTs-of-Real-Data

and cufftw is compatible with that

http://docs.nvidia.com/cuda/cufft/#data-layout

Create image on the GPU from a radial profile

Start with a radial and flux values at the grid points. Just transfer this to the GPU and create the image there.

Reason: for large images, 90% of time consumed by memory transfer

sketch of an implementation

void galario_sample_radial(int ngrid, const dreal* radial_grid, const dreal* radial_fluxes,
                                       int nx, dreal dxy, dreal dRA, dreal dDec, dreal duv, int nd, 
                                       const dreal* u, const dreal* v, dcomplex* fint);
// returns padded array
dcomplex* galario_sweep(int ngrid, const dreal* radial_grid, const dreal* radial_fluxes,
                                       int nx, dreal dxy);

Continuous Integration

As soon as the tests do not rely anymore on pyvfit (almost there..., only two tests left), we should consider adding CI via TravisCI.

Here the languages supported by Travis: https://docs.travis-ci.com/user/languages/
And here how to install the dependencies (e.g. FFTW): https://docs.travis-ci.com/user/installing-dependencies/

function definitions w/o cuda

#ifdef __CUDACC__
__host__ __device__ inline void apply_phase_sampled_core
#else
inline void apply_phase_sampled_core
#endif

should be

#ifdef __CUDACC__
__host__ __device__ 
#endif
inline void apply_phase_sampled_core

as it's easier to read, don't repeat yourself!

Add Documentation

Add docs. A tentative list:

only import test fails on c2pap

$ python -c "import galario"
Segmentation fault

but other tests pass. Some linker error?

Mac OS peculiarities: nvcc/clang/gcc

Long story short: we need to be able to manually switch on/off the CUDA compilation.

Issue:
I am trying to compile galario on a Mac but there is a circular pitfall due to the fact that on a Mac nvcc supports clang (and not gcc/g++), but the default version of clang installed on the latest Mac OS Sierra is not supported..
Basically this prevents galario from being compiled (even the CPU version) on a Mac.

Quick and dirty temporary solution: comment out line 16 in galario/CMakeLists.txt:

###
# the optional and required packages
###
find_package(CUDA) --> # find_package(CUDA)

This avoids CUDA compilation, but we should fix this programmatically. Two solutions:

add manual switch for CUDA compilation
detect if host is a Mac, then do not compile for CUDA.

Code Normalization

Make the code uniform in terms of:

nomenclature (e.g. variable names)
function interfaces (e.g. 1st place pass image, then arrays, then dRA, dDec,..)
comment style, both for function docstrings and line-comments (comments are above, below, with highligh lines ////////// //// or *********** ********).

Decide licence

We have GPL now, perhaps we use MIT or BSD. In any case, we should add the usual licence comments in the header files and python wrappers

Use nthreads

kernels should get nthreads from nthreads(). Test this. Wrap it in cython. Give it a better name perhaps

Input data is real, not complex

Now the input image is assumed to be complex: (from libcommon.pyx)
def sample(dcomplex[:,::1] data, dRA, dDec, du, dreal[::1] u, dreal[::1] v)

We assumed this so that the input data already allocated the space needed for FFT operations. However, the image in real space is always real, so at the moment we need to cast the image to complex type before using galario:
sample(ref_complex.astype('complex128'), dRA, dDec, du, u, v)

We could save the casting and transfer time if we copy a real 2d array, rather than complex.
Can we easily implement this? E.g. by:

libcommon.pyx: changing to def sample(dreal[:,::1] data, dRA, dDec, du, dreal[::1] u, dreal[::1] v)
galario.cpp:

// nx: size of data
dcomplex *data_d;
size_t nbytes = sizeof(dcomplex)*nx*nx;
dreal *data_d_real;
size_t nbytes = sizeof(dreal)*nx*nx;
CCheck(cudaMalloc((void**)&data_d, nbytes));
CCheck(cudaMemcpy(data_d_real, data, nbytes, cudaMemcpyHostToDevice));
kernel real to complex (see stackoverflow question)

kernel unsigned int to float: https://stackoverflow.com/questions/9153861/typecasting-in-cuda-and-cublas

This change would be needed since there is no real sense to have a complex input image.

As a future step: we could switch from using complex to complex to real to complex FFTs, but we did not do that because it changes the mapping and the size of the matrices.

GPU pinning

Use environment variable with list of GPU indices to use. Then rank % len(indices) is the position of the GPU index to use.

Check: some batch systems may support GPU pinning natively. The above still useful for desktop machines with one slow GPU and one high-performance GPU.

speedtest fails with python3

unit tests

make them independent of pyvfit, test single, double, cuda, cpu vs. one reference

Optimize memory creation and access on GPU

From http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz4mPy74O00:

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays as it makes sure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements. The following code sample allocates a width x height 2D array of floating-point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

py doc string

Want it available in terminal and html output. See pypmc how to do that

Non-square images

We should allow rectangular images, all our algorithms can support this. I think the only constraint is that the number of rows is even for the shift. Any other constraints @mtazzari ?

OpenMP on mac

pain to support open mp because of clang. python built with clang in anaconda, so openmp runtime missing. Let's just warn the user that we don't support openmp on the mac

C_sample

return the values at interpolated points, wrap in python, unit test. all info for test already present

Rename C again

Rename C_* -> galario_* for C functions

read the docs

build documentation on readthedocs.org

install header

test with a simple programm that compiling and linking the C library works

Cuda unified memory

This could simplify our code a bit but hides the data transfers that are now explicit. I keep it here for reference but don't intend to do anything about it now

But it would allow us to oversubscribe GPU memory

From https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/#comment-2358

The Benefits of Unified Memory on Pascal and Later GPUs

Starting with the Pascal GPU architecture, Unified Memory functionality is significantly improved with 49-bit virtual addressing and on-demand page migration. 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration engine allows GPU threads to fault on non-resident memory accesses so the system can migrate pages on demand from anywhere in the system to the GPU’s memory for efficient processing.

In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It “just works” without any modifications to the application, whether running on one GPU or multiple GPUs.
Great for
Also, Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs. This is useful in writing efficient multi-GPU cooperative algorithms.

Demand paging can be particularly beneficial to applications that access data with a sparse pattern.
but we don't have a sparse pattern
In some applications, it’s not known ahead of time which specific memory addresses a particular processor will access. Without hardware page faulting, applications can only pre-load whole arrays, or suffer the cost of high-latency off-device accesses (also known as “Zero Copy”). But page faulting means that only the pages the kernel accesses need to be migrated.

GPU optimization/profiling/timing

I have made some profiling on my desktop calx174 (GTX 1060) with :

nvprof --export-profile filename.prof python python/speed_benchmark_dp.py --gpu

IMPORTANT: in speed_benchmark_dp.py I commented out the first acc_lib.sample(), so that this profiling contains only acc_lib.chi2().

A few notes, regarding the screenshot. Here I considered matrix size 4096, nsamples 1e6, 64 nthreads

below the graphical timeline, you can see interesting numbers for each kernel.
one very big and good news: with nthreads=64, real_to_complex and shift_d disappear! We are hugely dominated by the MemcpyHtoD and by the cuFFT_C2C (to which the several dpRadix come from). I double checked this: leaving everything the same, and just setting nthreads=32, real_to_complex and shift_d come up again as ~50%. I think we are in good position, since we want to go large with nthreads and there is no reason to use 32.
The Memcpy commands are synchronous: with async Memcpy (and/or streams, not sure if they imply each other) that we could save up to 1.3ms*5 = 6.5ms hiding the Memcpy of the 1d arrays (size ~8MB) under the 2d Memcpy of the image.
If I sum up the "Duration" column of all the kernels, I get ~65ms, which is similar to the 60ms measured from timings (see my comment in #52). The processes listed there do not contain the big cudaFree slots.
and the cudaFree slots? I am getting convinced that the cudaFree that we launch are not causing the big cudaFree blocks of ~300ms in the timeline. To test this, I commented out all the cudaFree in galario.cpp and I see no difference in the timeline nor in the timings. I thus believe that the 0.6ms that the manual timing gives is accurate and when we call chi2() without nvprof we don't get these overheads.
cublasCreateHandle: I think it causes the MemcpyH2D at the very end, it is called in reduce_chi2_d just before diff_weighted_d. See a snapshot here with a zoom on the very final part (time >~1.6s in previous snapshot):

Since this MemcpyH2D is synchronous, it could introduce delay due to latency since it causes a Malloc after interpolated_d and before diff_weighted_d. Perhaps, we could move this create handle at the beginning of sample_d, after the initial Memcpys and pass the handle to reduce_chi2_d, then destroy the handle in sample_d. Here the docs:
cublasCreate

Because cublasCreate allocates some internal resources and the release of those resources by calling cublasDestroy will implicitly call cublasDeviceSynchronize, it is recommended to minimize the number of cublasCreate/cublasDestroy occurences. For multi-threaded applications that use the same device from different threads, the recommended programming model is to create one CUBLAS handle per thread and use that CUBLAS handle for the entire life of the thread.

These are results. Next comment: proposals where to go from here, prioritization of the optimizations.

Test CPU implementation by default

Now gpu tested if available

cmake error Sphinx: _static dir not found

CMake Error at docs/CMakeLists.txt:26 (file):
file COPY cannot find "/Users/mtazzari/repos/galario/docs/_static".

Better tests

We should re-organize the tests in test_galario.py:

reduce repetition, e.g. create a setup() function that creates reference image, sampling points etc. that is called once at the beginning;
choose uniform parametrization: many tests have similar but not identical parametrizations. Try to make this uniform, in many cases it should be possible
- parametrize such that all the tests are repeated for different matrix sizes, nsamples, etc.
add informative printouts: when a test fail, the default output of numpy.testing.assertequal is not very useful. Try to add essential yet informative printout to help debugging.

Some of these proposals might require using objects e.g. to store the static output of setup().

rename c functions

C_acc_* -> `galario_

chi2 -> chi2_reduction
do_everything is a bad name, just call it chi2

add a function to return sampled points: stop do_everyting before chi2 reduction

type C functions

void C_fft2d(int nx, dcomplex* data);
void C_fft2d(int nx, void* data)
{
    C_ff2d(nx, static_cast<dcomplex*>(data);
}

HAVE_CUDA

lower case?
a function?

Document it, for example in snippets

Loss of precision

I decided to look into the loss of precision again. Remember I told you about the summation algorithms I looked at this week. I compared single and double precision on CPU and GPU to the python reference implementation. Quite possibly we had those numbers before but I just did it again. here I printed the first 5 element of the interpolated values output by acc_lib.sample

CPU

[  -7.32114697  117.65641022    0.5806728     3.80655694   23.42615891] # py single
[  -6.9562273   111.83618164    0.72458202    3.68281412   24.37336922] # galario single
[  -7.32083263  117.65531322    0.58068566    3.80660492   23.42551657] # py double
[  -7.32083263  117.65531324    0.58068566    3.80660492   23.42551656] # galario double

GPU

[  -7.32114697  117.65641022    0.5806728     3.80655694   23.42615891] # py single
[  -6.95623207  111.83618164    0.72458595    3.68281269   24.37336731] # galario single
[  -7.32083263  117.65531322    0.58068566    3.80660492   23.42551657] # py double
[  -7.32083263  117.65531324    0.58068566    3.80660492   23.42551656] # galario double

It is clear that there is a big loss in precision for galario single, the other three implementations agree better. That loss is the same for GPU or CPU. So there must be a problem in galario with the algorithm itself.

To find out where the precision is lost, I just checked after every operation in test_loss() in test_galario.py.

The output of fft2 and cuFFT and FFTW were only within 1e-3 relative precision for single, and this is about the first operation on the image, so I think this is where the root of the problem is. I know that pyfftw does a lot of comparisons again numpy FFT, so they should agree. But they seem to differ for results close to zero. If there are terms of varying signs and magnitudes, this can happen with naive summation. I don't know what's used in np.fftw but I know that FFTW uses pairwise summation which should do OK except in pathological cases that I hope/guess are not present in our test case.

I updated many unit tests to use both rtol and atol. This showed that the relative precision is OK unless numbers are close to zero, where atol comes into play.

What totally puzzles me is that if I do all steps individually, the CPU/GPU results agree with pyvfit up until after the interpolation but if I call the GPU version that does everything combined, the results are at much lower precision. I played around and found that mismatch ratio is 55 % for the phases in par1 but setting the phases to zero or multiplying by 10 gave mismatches close to zero. Does that mean the phase is not correctly taken into account in on of the implementations? Or do we just see come loss of precision for certain values of the phase? hmmm

Overwriting function args

I went through the code and applied const in the function interfaces whenever I thought it was useful. The only problem is that we overwrite fint in reduce_chi2. Of course if you just care about chi2, that's fine. But I think to be clean we should not use it as a buffer and rather create our own buffer to store the weighted differences in. Won't be noticeably slow I predict

documentation system for cmake

doxygen or sphinx? Does doxygen handle doc string in .pyx?

Add a target to cmake, add sample doc for one C++ function and one python function.

Memory leaks?

Check all C/CUDA code against memory leakages.

Check that all the mallocs are effectively freed.

GPU and CPU tests

py.test -rxs pyvfit/tests.py works
py.test -rxs --gpu pyvfit/tests.py used to work, but the new version of py.test seems not to allow options passing anymore. Look into that.

Memory leak

There is a memory leak in chi2() due to the fact that many cuda free have been commented out.
Should be easy to fix: just uncomment them.

Manually time memory allocation

nvprof suggests that cudafree totally dominates the computation on P100. And cudafree appears where we don't see it in the code. Let's insert timing by hand to verify like shown here
https://devtalk.nvidia.com/default/topic/529363/how-to-measure-total-time-for-cpu-and-gpu/

Apply phase

Evaluate whether to apply phase after visibility sampling

data type names

in pr #36, I found this TODO in the header

//# TODO consistent notation dreal -> real_t, dcomplex -> complex_t
//# to avoid confusion with host vs. device

Should we rename? I kind of don't want to, so I will just remove the comment from the code

cuda complex

Why do we use cufft complex number in the header and cuComplex in the implementation?

#ifdef __CUDACC__
    #include <cufft.h>
        typedef cufftDoubleComplex dcomplex;

#ifdef __CUDACC__
    #include <cuda_runtime_api.h>
    #include <cuda.h>
    #include <cuComplex.h>

python optional

Getting python to work may require some fiddling. I can imagine users just want the C/C++ library and don't care to make cmake happy, so we should make python build optional

https://github.com/mtazzari/galario/blob/master/python/CMakeLists.txt#L6

shouldn't give a fatal error

threadblock

From the GTC presentation

GPU Performance Analysis and Optimization
Paulius Micikevicius
Developer Technology, NVIDIA

Threadblock size choice:
– Start with 128-256 threads per block
• Adjust up/down by what best matches your function
• Example: stencil codes prefer larger blocks to minimize halos
– Multiple of warp size (32 threads)
– If occupancy is critical to performance:
• Check that block size isn’t precluding occupancy allowed by register and
SMEM resources
Grid size:
– 1,000 or more threadblocks
• 10s of waves of threadblocks: no need to think about tail effect
• Makes your code ready for several generations of future GPUs

So the minimum is 32, and our default should be larger. Let's see its effect once we have a stable configuration

Raise Exception if CUDA is raised

It would be nice to have a Python Exception raised in the case CUDA raises an error, e.g. CUFFT Error: Unable to create plan

Single and double precision

Compile the C code with both precisions, make both available in python

Transfer observations

The primary intended usage of galario is to call it inside a fit where the observed data never change. We could save some time by copying this over to the GPU only once, then reuse it. Before implementing that, I should profile and estimate whether it helps at all. Perhaps the transfer of the model image takes much longer and that has to be transferred every time if the image is generated on the CPU

fftw link

prefer openmp over pthreads libraries. We use openMP ourselves, so for uniform control we'd want openmp. Fall back to pthreads if openmp not available

C compatibility

In our typedef, we seem to require C++ by using std::complex<>. I need to check by building a simple C program that uses galario.

 typedef float dreal;

    #ifdef __CUDACC__
        typedef cufftComplex dcomplex;
    #else
        #include <complex>
        typedef std::complex<float> dcomplex;
    #endif

Of course it's ok to use C++ in the implementation but perhaps we can use the C99 complex type and be compatible with both languages in the header

Optimize rotix core

The rotix_core function can be highly simplified by noting that:

    indu[i] = index + (u[i] - umin - index*du)/du =
            = index + (u[i] - umin)/du - index = (u[i] - umin)/du

And the same applies for indv. I think we can even delete the rotix kernels and put it directly in interpolate_core. Check if the indu and indv arrays are needed only in the interpolate. If yes, then it's a go for deleting rotix.

#ifdef __CUDACC__
__host__ __device__ inline void rotix_core
#else
inline void rotix_core
#endif
        (int const i, int const nx, dreal const umin, dreal const du, dreal const* const u, dreal const* const v, dreal* const __restrict__ indu, dreal*  const __restrict__ indv) {
    // u
    int index = floor((u[i] - umin) / du);
    dreal center_ind = umin + index * du;
    indu[i] = index + (u[i] - center_ind) / du;

    // v
    index = floor((v[i] - umin) / du);
    center_ind = umin + index * du;
    indv[i] = index + (v[i] - center_ind) / du;
}

Profiling results on CPU

Not really a problem but posting the results here for reference

Test problem

    nsamples = 512
    real_type = 'float64'
    complex_type = 'complex128'

    wle_m = par1.get('wle_m', 0.003)
    x0_arcsec = par1.get('x0_arcsec', 0.4)
    y0_arcsec = par1.get('y0_arcsec', 10.)

    # generate the samples
    maxuv_generator = 3.e3
    udat, vdat = create_sampling_points(nsamples, maxuv_generator, dtype=real_type)
    x, _, w = generate_random_vis(nsamples, real_type)

    # compute the matrix size and maxuv
    size, minuv, maxuv = matrix_size(udat, vdat, force_nx=4096)
    print("size:{0}, minuv:{1}, maxuv:{2}".format(size, minuv, maxuv))
    uv = pixel_coordinates(maxuv, size).astype(real_type)

    # create model complex image (it happens to have 0 imaginary part)
    reference_image = create_reference_image(size=size, kernel='gaussian', dtype=complex_type)
    ref_complex = reference_image.astype(complex_type)

    chi2_cuda = g_double.chi2(ref_complex, x0_arcsec, y0_arcsec,
                             maxuv/size/wle_m, udat/wle_m, vdat/wle_m, x.real.copy(), x.imag.copy(), w

tools

use Intel amplifier_xe to identify basic hotspots on

OMP_NUM_THREADS=k python/py.test.sh -s python_package/tests/test_galario.py -k profile

serial

So major hotspot is applying the phase. Understandable becauce FFT scales O(n log(n)) and phase is O(n^2). The default version works in the cartesian plane because that's what FFTW expects. It seemed like applying a phase in polar coordinates is easier but it turns out it takes longer due to the conversion, so we should stick with what we have

    // cartesian
    dcomplex const phase = dcomplex{dreal(cos(angle)), dreal(sin(angle))};
    data[idx] = CMPLXMUL(data[idx], phase);

    // polar
    dreal const magn = std::abs(data[idx]);
    dreal const oldangle = std::arg(data[idx]);
    data[idx] = std::polar(magn, oldangle + angle);
#endif

comparison: polar takes longer!