sarrvesh / cuffs
A GPU-accelerated Rotation Measure Synthesis code
License: GNU General Public License v2.0
The helper script to create 3D cubes does not allow for 2D input fits files (i.e. NAXIS=2). There are some subtle things in the script that need to be adjusted to make this possible. I have created a modified version of the script and can upload it as a separate branch if you like, Sarrvesh?
The user should be able to compile the RM synthesis code in either float or double mode. Gaming GPUs seem to slow down on double-precision math operations, while scientific GPUs work quite well with double data types, so the precision should be selectable at compile time.
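One possible way to do this is a compile-time typedef. The sketch below is only an illustration: the flag name CUFFS_USE_DOUBLE, the type name real_t, and the helper sum_real are assumptions and do not come from the cuFFS source.

```c
#include <stddef.h>

/* Hypothetical build flag (not part of cuFFS): compile with
 * -DCUFFS_USE_DOUBLE to get double precision; the default is float. */
#ifdef CUFFS_USE_DOUBLE
typedef double real_t;
#else
typedef float real_t;
#endif

/* Illustrative helper written once against real_t, so the same source
 * builds in either precision. */
real_t sum_real(const real_t *values, size_t n) {
    real_t total = (real_t)0;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}
```

With this layout, all pixel buffers and kernels would be declared in terms of real_t, and the build system would only need to pass or omit the flag.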
For a simple pair of Q and U cubes, run RM Synthesis and compare the results with the rmsynthesis code on dop254.
Looks like a number of functions terminate execution using exit(). This needs to be avoided. All functions must return control to main(), and main() should decide whether the code needs to terminate or not.
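A minimal sketch of the pattern, assuming a status-code convention: the enum rm_status and the functions check_naxis and load_cube are hypothetical names used for illustration, not functions from the cuFFS source.

```c
#include <stdio.h>

/* Hypothetical status codes; the names are illustrative only. */
typedef enum { RM_OK = 0, RM_ERR_BAD_NAXIS = 1 } rm_status;

/* Validate the number of axes read from a FITS header. On failure the
 * function reports the problem and returns a status; it never calls exit(). */
rm_status check_naxis(int naxis) {
    if (naxis != 3) {
        fprintf(stderr, "ERROR: expected NAXIS=3, got NAXIS=%d\n", naxis);
        return RM_ERR_BAD_NAXIS;
    }
    return RM_OK;
}

/* An intermediate function passes the status upward unchanged. */
rm_status load_cube(int naxis) {
    rm_status status = check_naxis(naxis);
    if (status != RM_OK) {
        return status;
    }
    /* ... reading the cube would happen here ... */
    return RM_OK;
}
```

main() would then inspect the returned status, release any resources it owns, and choose the process exit code itself.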
Detect all GPU devices and print their information before doing anything in rmsynthesis.c
Multiple (dis)similar GPUs can be connected to a single host. The code should be able to detect multiple GPUs and distribute threads among all suitable devices.
Cubes that are rotated with fitsrotate lose their WCS (and all other supplementary) information after being rotated.
To have equal number of threads across all blocks, the number of threads launched can be greater than the number of \phi planes. To avoid out of bounds memory access, each thread should check if its index is greater than the number of \phi planes. If it is greater, that thread should terminate gracefully without executing anything.
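The guard can be sketched on the CPU in plain C. This is an emulation under stated assumptions, not the cuFFS kernel: the function names are hypothetical, and in a real CUDA kernel the index would be computed as blockIdx.x * blockDim.x + threadIdx.x. Because indices are zero-based, the test is idx >= n_phi.

```c
#include <stddef.h>

/* Number of blocks needed so that blocks * threads_per_block >= n_phi. */
size_t blocks_needed(size_t n_phi, size_t threads_per_block) {
    return (n_phi + threads_per_block - 1) / threads_per_block;
}

/* CPU stand-in for the kernel body, called once per (block, thread) pair. */
static void thread_body(size_t block, size_t thread,
                        size_t threads_per_block, size_t n_phi, int *out) {
    size_t idx = block * threads_per_block + thread;
    if (idx >= n_phi) {
        return; /* surplus thread: exit without touching memory */
    }
    out[idx] += 1; /* placeholder for the per-phi-plane work */
}

/* Emulate a launch and return the total number of threads started. */
size_t emulate_launch(size_t n_phi, size_t threads_per_block, int *out) {
    size_t blocks = blocks_needed(n_phi, threads_per_block);
    for (size_t b = 0; b < blocks; b++) {
        for (size_t t = 0; t < threads_per_block; t++) {
            thread_body(b, t, threads_per_block, n_phi, out);
        }
    }
    return blocks * threads_per_block;
}
```

For example, 10 \phi planes with 4 threads per block need 3 blocks, so 12 threads start and the last 2 return immediately.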
Once the bare-bones of the code is done, write an installation script with autoconf. See
Make sure that the code is compatible with different CUDA versions.
Variables size and nElements are being reused multiple times in function doRMSynthesis(). This is not good coding practice and could potentially lead to disasters... Also, need to clean up the code in this function.
Implement Faraday synthesis as described in Bell & Ensslin (2012).
I've tried using the program both on MWA and ASKAP data, and in both cases the pixel values in the output cubes contain only zeros. The program seems to complete successfully, and the cubes are valid FITS files, but they don't contain any useful data. Unsure whether this is due to the build that I managed to do on galaxy, or the code itself. Did you ever see something like this on other systems?
Not sure if this is really required... come back to this at a later stage.
The code should be able to read in a fits mask and run rmsynthesis only on the selected pixels.
While operating in single device mode, select device based on its global memory size. At the moment, the code chooses the first device in the list. (Also see issue #9 )
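A minimal sketch of the selection step, assuming the device properties have already been gathered into an array. The struct device_info and the function select_largest_device are hypothetical; the actual layout of deviceInfoList in cuFFS may differ. In CUDA the memory size would typically come from the totalGlobalMem field filled in by cudaGetDeviceProperties.

```c
#include <stddef.h>

/* Hypothetical per-device record, for illustration only. */
struct device_info {
    int id;                  /* device ordinal */
    size_t global_mem_bytes; /* total global memory on the device */
};

/* Return the index of the device with the most global memory,
 * or -1 if the list is empty. Ties keep the earlier device. */
int select_largest_device(const struct device_info *list, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (best < 0 ||
            list[i].global_mem_bytes > list[best].global_mem_bytes) {
            best = i;
        }
    }
    return best;
}
```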
Write better documentation for all functions. Also document the parameter list and return values for all functions.
In the current version, a single block with nPhi threads is launched by default. This is not necessarily the best option. Ideally, one should decide based on the number of registers available per MP. A good understanding of the GPU hardware is needed to solve this problem.
GPUOcelot (https://code.google.com/archive/p/gpuocelot/) is a tool that allows one to run CUDA code on x86 CPUs.
Instead of just writing the RMSF to disk, plot RMSF with gnuplot.
Larger test case failing with message
"ERROR: Insufficient memory on device! Try reducing nPhi"
After a bit of inspection it appears that the global memory might be incorrectly identified. The size of the output Q/U cubes seems to be reported as exactly the same size as the global memory.
Remove unwanted printf statements that were coupled with some fits_report_error.
For surveys with large bandwidth, we have to take the spectral index information into account before applying RM synthesis.
Hey @sarrvesh!
I'm not sure how much time you've got for cuFFS things these days. But it'd be great if we could get cuFFS onto conda-forge for easy installation. Right now the installation seems to be the trickiest bit, so having an easy one-line option would be great.
Unfortunately, I have no experience with adding a package to conda-forge. But, a quick look at the docs would seem to suggest it isn't too much work
Hi @sarrvesh ! Thank you for making cuFFS.
I tried to run cuFFS on 600 * 2 GB fits files, but makeFitsCube.py couldn't stack the rotated cube requested by cuFFS, because (I think) the HPC RAM available (200GB) is smaller than the cube size (1TB).
So I have a few questions:
Is there an efficient way to stack a 1TB rotated cube? (I wrote a few scripts myself, but they would take over a week to generate that cube...)
Can cuFFS run on a 1TB cube with 200GB nodes? If not, my workaround would be making smaller sub-cubes.
Thanks!
At the moment, if multiple GPUs are detected, the structure array deviceInfoList is retained until the very end. This is not needed; it can be free'd after the best device is selected.
In the current version of the code, each input channel is processed separately and each \phi axis is assigned to a gpu kernel. Due to the way GPU memory is accessed, this might not be the most efficient way to do this. Another approach is to process each line of sight separately. I am not sure if this is faster but we should give it a try.
The code needs to do better input verification (like NAXIS!=3).
At the moment Q(phi) and U(phi) are computed separately. This requires the input frequency channels to be moved to the device twice. If the device memory is big enough to accommodate both Q and U output cubes at the same time, the code should compute Q and U in a single GPU call. This can speed up the code a bit.
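For reference, writing out the standard RM synthesis sum shows why one pass is enough: both outputs use the same input channels and the same phase angle. The notation below (weights w_j, normalisation K, reference wavelength \lambda_0) follows the usual formulation and is an assumption about notation, not taken from the cuFFS source.

```latex
% Complex polarization per channel: P_j = Q_j + i U_j
F(\phi) = K \sum_{j=1}^{N} w_j \, P_j \, e^{-2 i \phi (\lambda_j^2 - \lambda_0^2)},
\qquad K = \Big( \sum_{j=1}^{N} w_j \Big)^{-1}

% With \theta_j = 2 \phi (\lambda_j^2 - \lambda_0^2):
Q(\phi) = K \sum_{j=1}^{N} w_j \left( Q_j \cos\theta_j + U_j \sin\theta_j \right)

U(\phi) = K \sum_{j=1}^{N} w_j \left( U_j \cos\theta_j - Q_j \sin\theta_j \right)
```

Each thread can therefore evaluate \cos\theta_j and \sin\theta_j once per channel and update both accumulators from the same Q_j and U_j values.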
Treat all pixel values as float. Double precision is not required at least for now.
Since RM Synthesis works on one input image channel at a time, we can reduce the code's memory footprint by not reading in the entire Q and U cubes into memory.
If the number of output phi planes is large enough, the output Q, U and P cubes might not fit in the host memory. What should the code do in such cases?
If the input array contains a NaN, the output voxel will be filled with NaNs. It would be useful to handle NaNs in the input.
Currently, the output RMSF has the same width as the specified Faraday depth range. For CLEAN, an RMSF spanning twice the FD range to be CLEANed is required.
Also, implementations such as RM-Tools "correctly deals with isolated clumps of NaN-flagged voxels within the data-cube (unlikely in interferometric cubes, but possible in single-dish cubes)". Specifically, if a NaN is detected, the channel is set to 0 and the weight is also set to zero. The ability to specify a weight per channel would be beneficial. Or, even better would be the output of an RMSF per pixel that correctly accounts for varying flagging per voxel.
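The zero-value, zero-weight treatment described above can be sketched for a single line of sight as follows. The function mask_nan_channels and its per-channel weight array are hypothetical and only illustrate the idea; this is not how RM-Tools or cuFFS implement it.

```c
#include <math.h>
#include <stddef.h>

/* For one line of sight: wherever Q or U is NaN in a channel, set both
 * values to 0 and set that channel's weight to 0, so the channel drops
 * out of the weighted sum. Returns the number of channels left unflagged. */
size_t mask_nan_channels(float *q, float *u, float *weight, size_t n_chan) {
    size_t n_valid = 0;
    for (size_t j = 0; j < n_chan; j++) {
        if (isnan(q[j]) || isnan(u[j])) {
            q[j] = 0.0f;
            u[j] = 0.0f;
            weight[j] = 0.0f;
        } else {
            n_valid++;
        }
    }
    return n_valid;
}
```

Because the surviving weights differ from pixel to pixel, the normalisation and the RMSF would then also have to be computed per pixel, which is the extension suggested above.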
Publish the RM Synthesis (version 1.0 milestone) in Astronomy & Computing.
The outputs produced by rmsynthesis and rmsynthesis_cpu do not retain all fits header keywords. Only specific wcs keywords are retained because those have been hard coded.
Brentjens' cpu code is single-threaded. It would be nice to have a multi-threaded cpu version of cuFFS.
This should be the last step before version release.
Should be last issue to fix in this milestone.
I don't really see why the two different structures should be kept separate. Almost all functions rely on both these structures.
fitsrotate uses two memory maps to read in the fits cube, rotate, and write out the rotated cube. If we read the input cube channel-by-channel, the code can work with a single memory map.
The current readme is seriously out-of-date. Update with
At the moment, users have to rotate the input cubes using external programs like miriad. cuFFS should have a built-in executable that can rotate and derotate FITS cubes.
Should be last issue to fix in this milestone.
The CPU code does not produce the same output as the GPU version of the code. mean(CPU/GPU) ~ 4000 with a large scatter.
While working with large fits images/cubes, the python script makeFitsCube.py could be slow. Will a C implementation be faster?