mtazzari / galario Goto Github PK
View Code? Open in Web Editor NEWGpu Accelerated Library for Analysing Radio Interferometer Observations
Home Page: https://mtazzari.github.io/galario/
License: GNU Lesser General Public License v3.0
Gpu Accelerated Library for Analysing Radio Interferometer Observations
Home Page: https://mtazzari.github.io/galario/
License: GNU Lesser General Public License v3.0
In preparation for the R2C implementation.
floor
function and why.Right now we ignore the fact that the input is real. This would reduce the output size by a factor of 2 and should reduce the effort of all other operations on the array accordingly. The logic, for example for interpolation, would get messier but for the sake of speed, we can try it. The output format that FFTW uses is documented here
and cufftw is compatible with that
Start with a radial and flux values at the grid points. Just transfer this to the GPU and create the image there.
Reason: for large images, 90% of time consumed by memory transfer
sketch of an implementation
void galario_sample_radial(int ngrid, const dreal* radial_grid, const dreal* radial_fluxes,
int nx, dreal dxy, dreal dRA, dreal dDec, dreal duv, int nd,
const dreal* u, const dreal* v, dcomplex* fint);
// returns padded array
dcomplex* galario_sweep(int ngrid, const dreal* radial_grid, const dreal* radial_fluxes,
int nx, dreal dxy);
As soon as the tests do not rely anymore on pyvfit (almost there..., only two tests left), we should consider adding CI via TravisCI.
Here the languages supported by Travis: https://docs.travis-ci.com/user/languages/
And here how to install the dependencies (e.g. FFTW): https://docs.travis-ci.com/user/installing-dependencies/
#ifdef __CUDACC__
__host__ __device__ inline void apply_phase_sampled_core
#else
inline void apply_phase_sampled_core
#endif
should be
#ifdef __CUDACC__
__host__ __device__
#endif
inline void apply_phase_sampled_core
as it's easier to read, don't repeat yourself!
Add docs. A tentative list:
$ python -c "import galario"
Segmentation fault
but other tests pass. Some linker error?
Long story short: we need to be able to manually switch on/off the CUDA compilation.
Issue:
I am trying to compile galario on a Mac but there is a circular pitfall due to the fact that on a Mac nvcc
supports clang
(and not gcc
/g++
), but the default version of clang
installed on the latest Mac OS Sierra is not supported..
Basically this prevents galario from being compiled (even the CPU version) on a Mac.
Quick and dirty temporary solution: comment out line 16 in galario/CMakeLists.txt:
###
# the optional and required packages
###
find_package(CUDA) --> # find_package(CUDA)
This avoids CUDA compilation, but we should fix this programmatically. Two solutions:
Make the code uniform in terms of:
We have GPL now, perhaps we use MIT or BSD. In any case, we should add the usual licence comments in the header files and python wrappers
kernels should get nthreads from nthreads()
. Test this. Wrap it in cython. Give it a better name perhaps
Now the input image is assumed to be complex: (from libcommon.pyx)
def sample(dcomplex[:,::1] data, dRA, dDec, du, dreal[::1] u, dreal[::1] v)
We assumed this so that the input data already allocated the space needed for FFT operations. However, the image in real space is always real, so at the moment we need to cast the image to complex type before using galario:
sample(ref_complex.astype('complex128'), dRA, dDec, du, u, v)
We could save the casting and transfer time if we copy a real 2d array, rather than complex.
Can we easily implement this? E.g. by:
def sample(dreal[:,::1] data, dRA, dDec, du, dreal[::1] u, dreal[::1] v)
// nx: size of data
dcomplex *data_d;
size_t nbytes = sizeof(dcomplex)*nx*nx;
dreal *data_d_real;
size_t nbytes = sizeof(dreal)*nx*nx;
CCheck(cudaMalloc((void**)&data_d, nbytes));
CCheck(cudaMemcpy(data_d_real, data, nbytes, cudaMemcpyHostToDevice));
kernel real to complex (see stackoverflow question)
kernel unsigned int to float: https://stackoverflow.com/questions/9153861/typecasting-in-cuda-and-cublas
This change would be needed since there is no real sense to have a complex input image.
As a future step: we could switch from using complex to complex
to real to complex
FFTs, but we did not do that because it changes the mapping and the size of the matrices.
Use environment variable with list of GPU indices to use. Then rank % len(indices)
is the position of the GPU index to use.
Check: some batch systems may support GPU pinning natively. The above still useful for desktop machines with one slow GPU and one high-performance GPU.
make them independent of pyvfit, test single, double, cuda, cpu vs. one reference
From http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz4mPy74O00:
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays as it makes sure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements. The following code sample allocates a width x height 2D array of floating-point values and shows how to loop over the array elements in device code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr,
size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
Want it available in terminal and html output. See pypmc how to do that
We should allow rectangular images, all our algorithms can support this. I think the only constraint is that the number of rows is even for the shift
. Any other constraints @mtazzari ?
pain to support open mp because of clang. python built with clang in anaconda, so openmp runtime missing. Let's just warn the user that we don't support openmp on the mac
return the values at interpolated points, wrap in python, unit test. all info for test already present
Rename C_*
-> galario_*
for C functions
build documentation on readthedocs.org
test with a simple programm that compiling and linking the C library works
This could simplify our code a bit but hides the data transfers that are now explicit. I keep it here for reference but don't intend to do anything about it now
But it would allow us to oversubscribe GPU memory
From https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/#comment-2358
The Benefits of Unified Memory on Pascal and Later GPUs
Starting with the Pascal GPU architecture, Unified Memory functionality is significantly improved with 49-bit virtual addressing and on-demand page migration. 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration engine allows GPU threads to fault on non-resident memory accesses so the system can migrate pages on demand from anywhere in the system to the GPU’s memory for efficient processing.
In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It “just works” without any modifications to the application, whether running on one GPU or multiple GPUs.
Great for
Also, Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs. This is useful in writing efficient multi-GPU cooperative algorithms.Demand paging can be particularly beneficial to applications that access data with a sparse pattern.
but we don't have a sparse pattern
In some applications, it’s not known ahead of time which specific memory addresses a particular processor will access. Without hardware page faulting, applications can only pre-load whole arrays, or suffer the cost of high-latency off-device accesses (also known as “Zero Copy”). But page faulting means that only the pages the kernel accesses need to be migrated.
I have made some profiling on my desktop calx174 (GTX 1060) with :
nvprof --export-profile filename.prof python python/speed_benchmark_dp.py --gpu
IMPORTANT: in speed_benchmark_dp.py
I commented out the first acc_lib.sample()
, so that this profiling contains only acc_lib.chi2()
.
A few notes, regarding the screenshot. Here I considered matrix size 4096, nsamples 1e6, 64 nthreads
real_to_complex
and shift_d
disappear! We are hugely dominated by the MemcpyHtoD
and by the cuFFT_C2C
(to which the several dpRadix
come from). I double checked this: leaving everything the same, and just setting nthreads=32
, real_to_complex
and shift_d
come up again as ~50%. I think we are in good position, since we want to go large with nthreads
and there is no reason to use 32.Memcpy
commands are synchronous: with async Memcpy (and/or streams, not sure if they imply each other) that we could save up to 1.3ms*5 = 6.5ms hiding the Memcpy
of the 1d arrays (size ~8MB) under the 2d Memcpy
of the image.cudaFree
slots.cudaFree
slots? I am getting convinced that the cudaFree
that we launch are not causing the big cudaFree
blocks of ~300ms in the timeline. To test this, I commented out all the cudaFree
in galario.cpp
and I see no difference in the timeline nor in the timings. I thus believe that the 0.6ms that the manual timing gives is accurate and when we call chi2()
without nvprof
we don't get these overheads.cublasCreateHandle
: I think it causes the MemcpyH2D at the very end, it is called in reduce_chi2_d
just before diff_weighted_d
. See a snapshot here with a zoom on the very final part (time >~1.6s in previous snapshot):Since this MemcpyH2D is synchronous, it could introduce delay due to latency since it causes a Malloc after interpolated_d and before diff_weighted_d. Perhaps, we could move this create handle at the beginning of sample_d
, after the initial Memcpys and pass the handle to reduce_chi2_d
, then destroy the handle in sample_d
. Here the docs:
cublasCreate
Because cublasCreate allocates some internal resources and the release of those resources by calling cublasDestroy will implicitly call cublasDeviceSynchronize, it is recommended to minimize the number of cublasCreate/cublasDestroy occurences. For multi-threaded applications that use the same device from different threads, the recommended programming model is to create one CUBLAS handle per thread and use that CUBLAS handle for the entire life of the thread.
These are results. Next comment: proposals where to go from here, prioritization of the optimizations.
Now gpu tested if available
CMake Error at docs/CMakeLists.txt:26 (file):
file COPY cannot find "/Users/mtazzari/repos/galario/docs/_static".
We should re-organize the tests in test_galario.py
:
reduce repetition, e.g. create a setup()
function that creates reference image, sampling points etc. that is called once at the beginning;
choose uniform parametrization: many tests have similar but not identical parametrizations. Try to make this uniform, in many cases it should be possible
add informative printouts: when a test fail, the default output of numpy.testing.assertequal
is not very useful. Try to add essential yet informative printout to help debugging.
Some of these proposals might require using objects e.g. to store the static output of setup()
.
C_acc_*
-> `galario_
chi2
-> chi2_reduction
do_everything
is a bad name, just call it chi2
add a function to return sampled points: stop do_everyting before chi2 reduction
void C_fft2d(int nx, dcomplex* data);
void C_fft2d(int nx, void* data)
{
C_ff2d(nx, static_cast<dcomplex*>(data);
}
Document it, for example in snippets
I decided to look into the loss of precision again. Remember I told you about the summation algorithms I looked at this week. I compared single and double precision on CPU and GPU to the python reference implementation. Quite possibly we had those numbers before but I just did it again. here I printed the first 5 element of the interpolated values output by acc_lib.sample
[ -7.32114697 117.65641022 0.5806728 3.80655694 23.42615891] # py single
[ -6.9562273 111.83618164 0.72458202 3.68281412 24.37336922] # galario single
[ -7.32083263 117.65531322 0.58068566 3.80660492 23.42551657] # py double
[ -7.32083263 117.65531324 0.58068566 3.80660492 23.42551656] # galario double
[ -7.32114697 117.65641022 0.5806728 3.80655694 23.42615891] # py single
[ -6.95623207 111.83618164 0.72458595 3.68281269 24.37336731] # galario single
[ -7.32083263 117.65531322 0.58068566 3.80660492 23.42551657] # py double
[ -7.32083263 117.65531324 0.58068566 3.80660492 23.42551656] # galario double
It is clear that there is a big loss in precision for galario single, the other three implementations agree better. That loss is the same for GPU or CPU. So there must be a problem in galario with the algorithm itself.
To find out where the precision is lost, I just checked after every operation in test_loss()
in test_galario.py.
The output of fft2 and cuFFT and FFTW were only within 1e-3 relative precision for single, and this is about the first operation on the image, so I think this is where the root of the problem is. I know that pyfftw does a lot of comparisons again numpy FFT, so they should agree. But they seem to differ for results close to zero. If there are terms of varying signs and magnitudes, this can happen with naive summation. I don't know what's used in np.fftw but I know that FFTW uses pairwise summation which should do OK except in pathological cases that I hope/guess are not present in our test case.
I updated many unit tests to use both rtol and atol. This showed that the relative precision is OK unless numbers are close to zero, where atol
comes into play.
What totally puzzles me is that if I do all steps individually, the CPU/GPU results agree with pyvfit up until after the interpolation but if I call the GPU version that does everything combined, the results are at much lower precision. I played around and found that mismatch ratio is 55 % for the phases in par1
but setting the phases to zero or multiplying by 10 gave mismatches close to zero. Does that mean the phase is not correctly taken into account in on of the implementations? Or do we just see come loss of precision for certain values of the phase? hmmm
I went through the code and applied const
in the function interfaces whenever I thought it was useful. The only problem is that we overwrite fint
in reduce_chi2
. Of course if you just care about chi2, that's fine. But I think to be clean we should not use it as a buffer and rather create our own buffer to store the weighted differences in. Won't be noticeably slow I predict
doxygen or sphinx? Does doxygen handle doc string in .pyx?
Add a target to cmake, add sample doc for one C++ function and one python function.
Check all C/CUDA code against memory leakages.
Check that all the mallocs are effectively freed.
py.test -rxs pyvfit/tests.py works
py.test -rxs --gpu pyvfit/tests.py used to work, but the new version of py.test seems not to allow options passing anymore. Look into that.
There is a memory leak in chi2() due to the fact that many cuda free have been commented out.
Should be easy to fix: just uncomment them.
nvprof suggests that cudafree totally dominates the computation on P100. And cudafree
appears where we don't see it in the code. Let's insert timing by hand to verify like shown here
https://devtalk.nvidia.com/default/topic/529363/how-to-measure-total-time-for-cpu-and-gpu/
Evaluate whether to apply phase after visibility sampling
in pr #36, I found this TODO in the header
//# TODO consistent notation dreal -> real_t, dcomplex -> complex_t
//# to avoid confusion with host vs. device
Should we rename? I kind of don't want to, so I will just remove the comment from the code
Why do we use cufft complex number in the header and cuComplex in the implementation?
#ifdef __CUDACC__
#include <cufft.h>
typedef cufftDoubleComplex dcomplex;
#ifdef __CUDACC__
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <cuComplex.h>
Getting python to work may require some fiddling. I can imagine users just want the C/C++ library and don't care to make cmake happy, so we should make python build optional
https://github.com/mtazzari/galario/blob/master/python/CMakeLists.txt#L6
shouldn't give a fatal error
From the GTC presentation
GPU Performance Analysis and Optimization
Paulius Micikevicius
Developer Technology, NVIDIA
Threadblock size choice:
– Start with 128-256 threads per block
• Adjust up/down by what best matches your function
• Example: stencil codes prefer larger blocks to minimize halos
– Multiple of warp size (32 threads)
– If occupancy is critical to performance:
• Check that block size isn’t precluding occupancy allowed by register and
SMEM resources
Grid size:
– 1,000 or more threadblocks
• 10s of waves of threadblocks: no need to think about tail effect
• Makes your code ready for several generations of future GPUs
So the minimum is 32, and our default should be larger. Let's see its effect once we have a stable configuration
It would be nice to have a Python Exception raised in the case CUDA raises an error, e.g. CUFFT Error: Unable to create plan
Compile the C code with both precisions, make both available in python
The primary intended usage of galario is to call it inside a fit where the observed data never change. We could save some time by copying this over to the GPU only once, then reuse it. Before implementing that, I should profile and estimate whether it helps at all. Perhaps the transfer of the model image takes much longer and that has to be transferred every time if the image is generated on the CPU
prefer openmp over pthreads libraries. We use openMP ourselves, so for uniform control we'd want openmp. Fall back to pthreads if openmp not available
In our typedef, we seem to require C++ by using std::complex<>
. I need to check by building a simple C program that uses galario.
typedef float dreal;
#ifdef __CUDACC__
typedef cufftComplex dcomplex;
#else
#include <complex>
typedef std::complex<float> dcomplex;
#endif
Of course it's ok to use C++ in the implementation but perhaps we can use the C99 complex type and be compatible with both languages in the header
The rotix_core function can be highly simplified by noting that:
indu[i] = index + (u[i] - umin - index*du)/du =
= index + (u[i] - umin)/du - index = (u[i] - umin)/du
And the same applies for indv. I think we can even delete the rotix kernels and put it directly in interpolate_core
. Check if the indu and indv arrays are needed only in the interpolate. If yes, then it's a go for deleting rotix.
#ifdef __CUDACC__
__host__ __device__ inline void rotix_core
#else
inline void rotix_core
#endif
(int const i, int const nx, dreal const umin, dreal const du, dreal const* const u, dreal const* const v, dreal* const __restrict__ indu, dreal* const __restrict__ indv) {
// u
int index = floor((u[i] - umin) / du);
dreal center_ind = umin + index * du;
indu[i] = index + (u[i] - center_ind) / du;
// v
index = floor((v[i] - umin) / du);
center_ind = umin + index * du;
indv[i] = index + (v[i] - center_ind) / du;
}
Not really a problem but posting the results here for reference
nsamples = 512
real_type = 'float64'
complex_type = 'complex128'
wle_m = par1.get('wle_m', 0.003)
x0_arcsec = par1.get('x0_arcsec', 0.4)
y0_arcsec = par1.get('y0_arcsec', 10.)
# generate the samples
maxuv_generator = 3.e3
udat, vdat = create_sampling_points(nsamples, maxuv_generator, dtype=real_type)
x, _, w = generate_random_vis(nsamples, real_type)
# compute the matrix size and maxuv
size, minuv, maxuv = matrix_size(udat, vdat, force_nx=4096)
print("size:{0}, minuv:{1}, maxuv:{2}".format(size, minuv, maxuv))
uv = pixel_coordinates(maxuv, size).astype(real_type)
# create model complex image (it happens to have 0 imaginary part)
reference_image = create_reference_image(size=size, kernel='gaussian', dtype=complex_type)
ref_complex = reference_image.astype(complex_type)
chi2_cuda = g_double.chi2(ref_complex, x0_arcsec, y0_arcsec,
maxuv/size/wle_m, udat/wle_m, vdat/wle_m, x.real.copy(), x.imag.copy(), w
use Intel amplifier_xe to identify basic hotspots on
OMP_NUM_THREADS=k python/py.test.sh -s python_package/tests/test_galario.py -k profile
So major hotspot is applying the phase. Understandable becauce FFT scales O(n log(n)) and phase is O(n^2). The default version works in the cartesian plane because that's what FFTW expects. It seemed like applying a phase in polar coordinates is easier but it turns out it takes longer due to the conversion, so we should stick with what we have
// cartesian
dcomplex const phase = dcomplex{dreal(cos(angle)), dreal(sin(angle))};
data[idx] = CMPLXMUL(data[idx], phase);
// polar
dreal const magn = std::abs(data[idx]);
dreal const oldangle = std::arg(data[idx]);
data[idx] = std::polar(magn, oldangle + angle);
#endif
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.