The cuffs from sarrvesh

Estimate execution time

Helper script should allow 2D input FITS files

The helper script that creates 3D cubes does not accept 2D input FITS files (i.e. NAXIS=2). A few subtle things in the script need to be adjusted to make this possible. I have created a modified version of the script and can upload it as a separate branch if you like, Sarrvesh?

Allow the user to compile in float/double mode

The user should be able to compile the RM synthesis code in either float or double mode. Gaming GPUs seem to slow down on double-precision math operations, while scientific GPUs work quite well with double data types, so the precision should be selectable at compile time.

Science test case

For simple Q and U cubes, run RM Synthesis and compare the results with the rmsynthesis code on dop254.

Need better error handling

It looks like a number of functions terminate execution using exit(). This needs to be avoided. All functions must return control to main(), and it is main() that decides whether the code needs to terminate.
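The pattern can be sketched as follows. This is a Python illustration rather than the project's C, and `load_cube`, `run`, and the status codes are hypothetical names, not functions from the code base. Each function reports a status to its caller, and only `main()` turns that status into a process exit code.

```python
import sys

# Hypothetical status codes; the real code may use different values.
SUCCESS = 0
FAILURE = 1


def load_cube(path):
    """Return (status, data) instead of terminating the process on error."""
    if not path.endswith(".fits"):
        return FAILURE, None
    # Placeholder for the actual read; no file is opened in this sketch.
    return SUCCESS, [0.0, 0.0, 0.0]


def run(path):
    """Propagate the status of each step back to the caller."""
    status, _cube = load_cube(path)
    if status != SUCCESS:
        return status
    return SUCCESS


def main(argv):
    """Only main() decides how a failure is reported and what is returned."""
    status = run(argv[1] if len(argv) > 1 else "")
    if status != SUCCESS:
        print("ERROR: could not process input", file=sys.stderr)
    return status
```

A script would end with `sys.exit(main(sys.argv))`, so that the only exit point sits at the top level.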

Print device information

Detect all GPU devices and print their information before doing anything in rmsynthesis.c

Support multiple GPUs

Multiple (dis)similar GPUs can be connected to a single host. The code should be able to detect multiple GPUs and distribute threads among all suitable devices.

fitsrotate doesn't carry header information

Cubes that are rotated with fitsrotate lose their WCS (and all other supplementary) information after being rotated.

Check out of bounds in each thread

To have an equal number of threads in every block, the number of threads launched can be greater than the number of \phi planes. To avoid out-of-bounds memory access, each thread should check whether its (zero-based) index is greater than or equal to the number of \phi planes. If it is, that thread should return immediately without executing anything.
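A host-side emulation of the guard, written in Python for illustration only (the real check would live inside the CUDA kernel). The function names and the placeholder computation are hypothetical.

```python
def launch_config(n_phi, threads_per_block):
    """Number of blocks needed so that every phi plane gets a thread."""
    return (n_phi + threads_per_block - 1) // threads_per_block


def kernel(block_idx, thread_idx, threads_per_block, n_phi, out):
    """Emulates one GPU thread; returns early when out of bounds."""
    idx = block_idx * threads_per_block + thread_idx
    if idx >= n_phi:  # zero-based: valid indices are 0 .. n_phi - 1
        return
    out[idx] = idx  # placeholder for the per-plane computation


def run_grid(n_phi, threads_per_block):
    """Serially emulate a full grid launch on the host."""
    out = [None] * n_phi
    n_blocks = launch_config(n_phi, threads_per_block)
    for b in range(n_blocks):
        for t in range(threads_per_block):
            kernel(b, t, threads_per_block, n_phi, out)
    return n_blocks, out
```

With 100 \phi planes and 32 threads per block, four blocks (128 threads) are launched and the last 28 threads return without touching memory. Without the guard, the emulated write to `out[100]` would raise an `IndexError`; on a GPU it would be an out-of-bounds access.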

Build with autoconf or cmake

Once the bare bones of the code are done, write an installation script with autoconf. See

Compile and test against different CUDA versions

Make sure that the code is compatible with different CUDA versions.

Avoid variable reuse in doRMSynthesis()

Variables size and nElements are being reused multiple times in function doRMSynthesis(). This is not good coding practice and could potentially lead to disasters... Also, need to clean up the code in this function.

Implement Faraday synthesis

Implement Faraday synthesis as described in Bell & Ensslin (2012).

Output cubes contain only zeros (on galaxy)

I've tried using the program both on MWA and ASKAP data, and in both cases the pixel values in the output cubes contain only zeros. The program seems to complete successfully, and the cubes are valid FITS files, but they don't contain any useful data. Unsure whether this is due to the build that I managed to do on galaxy, or the code itself. Did you ever see something like this on other systems?

Use FFT instead of DFT for RM Synthesis

Not sure if this is really required... come back to this at a later stage.

Support image masks

The code should be able to read in a FITS mask and run rmsynthesis only on the selected pixels.
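A minimal sketch of the pixel selection, assuming the mask is a 2D image in which any non-zero value marks a pixel to process. That convention, and the name `masked_pixels`, are assumptions made for illustration.

```python
def masked_pixels(mask):
    """Yield (y, x) for every pixel whose mask value is non-zero."""
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                yield y, x
```

The RM synthesis loop would then iterate over `masked_pixels(mask)` instead of over every pixel in the image plane.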

Select GPU based on global memory size

While operating in single-device mode, select the device based on its global memory size. At the moment, the code chooses the first device in the list. (Also see issue #9.)
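The selection rule itself is small. The sketch below uses a hypothetical list of dictionaries in place of the device list; in the CUDA runtime the corresponding value would be the `totalGlobalMem` field of `cudaDeviceProp`.

```python
def select_device(devices):
    """Return the index of the device with the most global memory.

    Ties are resolved in favour of the first such device in the list.
    """
    if not devices:
        raise ValueError("no devices found")
    return max(range(len(devices)), key=lambda i: devices[i]["global_mem"])


# Made-up device list for illustration.
devices = [
    {"name": "GPU A", "global_mem": 2 * 1024**3},
    {"name": "GPU B", "global_mem": 8 * 1024**3},
    {"name": "GPU C", "global_mem": 4 * 1024**3},
]
best = select_device(devices)
```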

Better documentation

Write better documentation for all functions. Also document the parameter list and return values for all functions.

Optimize thread and block size

In the current version, a single block with nPhi threads is launched by default. This is not necessarily the best option. Ideally, one should decide based on the number of registers available per multiprocessor (MP). A good understanding of the GPU hardware is needed to solve this problem.
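As a starting point only, the block size could be rounded to a whole number of warps and capped at the device limit. The sketch below does not model register pressure, which is the harder part of this issue. The default of 1024 threads per block is an assumption; real code should read `maxThreadsPerBlock` from `cudaDeviceProp`. The function name is hypothetical.

```python
WARP_SIZE = 32  # warp size on current NVIDIA GPUs


def choose_launch(n_phi, max_threads_per_block=1024):
    """Pick a block size (multiple of the warp size) and a block count.

    Returns (threads_per_block, n_blocks, idle_threads).
    """
    if n_phi < 1:
        raise ValueError("n_phi must be at least 1")
    # Round n_phi up to a whole number of warps, then cap at the device limit.
    warps = (n_phi + WARP_SIZE - 1) // WARP_SIZE
    threads_per_block = min(warps * WARP_SIZE, max_threads_per_block)
    n_blocks = (n_phi + threads_per_block - 1) // threads_per_block
    idle = n_blocks * threads_per_block - n_phi
    return threads_per_block, n_blocks, idle
```

For nPhi = 3000 this gives three blocks of 1024 threads with 72 idle threads, whereas a single block of 3000 threads would exceed the assumed per-block limit.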

Support gpuocelot compatible compilation

GPUOcelot (https://code.google.com/archive/p/gpuocelot/) is a tool that allows one to run CUDA code on x86 CPUs.

Plot rmsf with gnuplot

Instead of just writing the RMSF to disk, plot RMSF with gnuplot.

Insufficient memory on device

A larger test case fails with the message:
"ERROR: Insufficient memory on device! Try reducing nPhi"

After a bit of inspection it appears that the global memory might be incorrectly identified. The size of the output Q/U cubes seems to be reported as exactly the same size as the global memory.
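One way to cross-check the reported numbers is to compute the expected output size by hand and compare it with the device memory. The sketch assumes that both output cubes (Q and U) must be resident on the device at once and that values are 4-byte floats; both are assumptions about the code, and the function names are hypothetical. On the CUDA side, free and total memory can be queried with `cudaMemGetInfo`.

```python
def output_bytes(n_phi, nx, ny, n_cubes=2, bytes_per_value=4):
    """Bytes needed to hold n_cubes output cubes of n_phi x ny x nx values."""
    return n_phi * nx * ny * n_cubes * bytes_per_value


def fits_on_device(n_phi, nx, ny, free_bytes, n_cubes=2, bytes_per_value=4):
    """True if the output cubes fit in the given amount of device memory."""
    return output_bytes(n_phi, nx, ny, n_cubes, bytes_per_value) <= free_bytes


def max_n_phi(nx, ny, free_bytes, n_cubes=2, bytes_per_value=4):
    """Largest number of phi planes that fits in the given device memory."""
    return free_bytes // (nx * ny * n_cubes * bytes_per_value)
```

If the computed size and the size printed by the program disagree, or the printed size always equals the global memory, that would point to the memory bookkeeping rather than to the test case.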

Remove unwanted code

Remove unwanted printf statements that were coupled with some fits_report_error.

Account for spectral dependence within the bandwidth

For surveys with large bandwidth, we have to take the spectral index information into account before applying RM synthesis.

Add to conda-forge

Hey @sarrvesh!

I'm not sure how much time you've got for cuFFS things these days. But it'd be great if we could get cuFFS onto conda-forge for easy installation. Right now the installation seems to be the trickiest bit, so having an easy one-line option would be great.

Unfortunately, I have no experience with adding a package to conda-forge. But a quick look at the docs suggests it isn't too much work.

Running on large files

Hi @sarrvesh ! Thank you for making cuFFS.

I tried to run cuFFS on 600 * 2 GB fits files, but makeFitsCube.py couldn't stack the rotated cube requested by cuFFS, because (I think) the HPC RAM available (200GB) is smaller than the cube size (1TB).

So I have a few questions:

Is there an efficient way to stack a 1TB rotated cube? (I wrote a few scripts myself, but they would take over a week to generate that cube...)
Can cuFFS run on a 1TB cube with 200GB nodes? If not, my workaround would be making smaller sub-cubes.

Thanks!
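On the sub-cube workaround: RM synthesis treats each line of sight independently, so the image plane can in principle be cut into strips that are processed separately and tiled back together afterwards. The sketch below only plans the strips; it assumes 4-byte pixel values and a hypothetical `safety` factor that reserves part of the RAM for everything else. It is not part of makeFitsCube.py.

```python
def plan_subcubes(n_chan, nx, ny, ram_bytes, bytes_per_value=4, safety=0.5):
    """Split the y axis into row ranges whose sub-cubes fit in RAM.

    Returns a list of (y_start, y_stop) pairs with y_stop exclusive.
    """
    budget = int(ram_bytes * safety)
    bytes_per_row = n_chan * nx * bytes_per_value
    rows = budget // bytes_per_row
    if rows < 1:
        raise ValueError("a single row of the cube does not fit in the budget")
    rows = min(rows, ny)
    return [(y0, min(y0 + rows, ny)) for y0 in range(0, ny, rows)]
```

Each returned range corresponds to one sub-cube containing all channels for that strip of rows.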

Implement RM Clean

Avoid for loops inside gpu kernels

Free deviceInfoList after selecting the best gpu

At the moment, if multiple GPUs are detected, the structure array deviceInfoList is retained until the very end. This is not needed; it can be free'd after the best device is selected.

Process each LoS separately

In the current version of the code, each input channel is processed separately and each \phi axis is assigned to a gpu kernel. Due to the way GPU memory is accessed, this might not be the most efficient way to do this. Another approach is to process each line of sight separately. I am not sure if this is faster but we should give it a try.

Better input verification

The code needs to do better input verification (like NAXIS!=3).

Optimize QU computation

At the moment, Q(phi) and U(phi) are computed separately. This requires the input frequency channels to be moved to the device twice. If the device memory is big enough to accommodate both the Q and U output cubes at the same time, the code should compute Q and U in a single GPU call. This can speed up the code a bit.
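Both outputs come from the same complex product, (Q + iU) exp(-2i phi (lambda^2 - lambda_0^2)), so they can share one pass over the channels. The following is a plain-Python sketch of the direct-sum form for a single line of sight, assuming equal channel weights and the mean lambda^2 as the reference lambda_0^2. It is not the project's kernel, and the function name is hypothetical.

```python
import math


def rm_synthesis_pixel(q, u, lambda2, phis):
    """Direct (DFT) RM synthesis for one line of sight.

    q, u     : Stokes Q and U per channel
    lambda2  : wavelength squared per channel (m^2)
    phis     : Faraday depths to evaluate (rad/m^2)
    Returns two lists, Q(phi) and U(phi), computed in the same loop.
    """
    n = len(lambda2)
    lambda2_0 = sum(lambda2) / n  # reference lambda^2 (equal weights assumed)
    q_phi, u_phi = [], []
    for phi in phis:
        q_sum = 0.0
        u_sum = 0.0
        for qj, uj, l2 in zip(q, u, lambda2):
            theta = 2.0 * phi * (l2 - lambda2_0)
            c, s = math.cos(theta), math.sin(theta)
            # (Q + iU) * exp(-i theta), split into real and imaginary parts
            q_sum += qj * c + uj * s
            u_sum += uj * c - qj * s
        q_phi.append(q_sum / n)
        u_phi.append(u_sum / n)
    return q_phi, u_phi
```

Each channel's Q and U values are read once per \phi and feed both accumulators, which is the property that would let a combined GPU call transfer the input channels only once.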

Come up with a better name than RMSynth_GPU.

Change data type

Treat all pixel values as float. Double precision is not required at least for now.

Reduce memory footprint

Since RM Synthesis works on one input image channel at a time, we can reduce the code's memory footprint by not reading in the entire Q and U cubes into memory.

Output larger than device or host memory

If the number of output phi planes is large enough, the output Q, U and P cubes might not fit in the host memory. What should the code do in such cases?

Sums are not NaN safe

If the input array contains a NaN, the output voxel will be filled with NaNs. It would be useful for the code to handle NaNs in the input.
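One possible behaviour is to give NaN samples zero weight, so that they drop out of both the sum and the normalisation. This is a sketch of that idea in Python with a hypothetical function name, not a description of what the code currently does.

```python
import math


def nan_safe_weighted_mean(values, weights=None):
    """Weighted mean that gives NaN samples zero weight."""
    if weights is None:
        weights = [1.0] * len(values)
    total = 0.0
    weight_sum = 0.0
    for v, w in zip(values, weights):
        if math.isnan(v):
            continue  # equivalent to setting the value and its weight to zero
        total += w * v
        weight_sum += w
    if weight_sum == 0.0:
        return math.nan  # every channel was flagged
    return total / weight_sum
```

A voxel becomes NaN only when all of its input channels are NaN. A single flagged channel no longer poisons the whole sum.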

RMSF -- Not wide enough for CLEAN & Doesn't adjust to account for flagging

Currently, the output RMSF has the same width as the specified Faraday depth (FD) range. CLEAN requires an RMSF spanning twice the FD range to be CLEANed.

Also, implementations such as RM-Tools "correctly deals with isolated clumps of NaN-flagged voxels within the data-cube (unlikely in interferometric cubes, but possible in single-dish cubes)". Specifically, if a NaN is detected, the channel is set to 0 and the weight is also set to zero. The ability to specify a weight per channel would be beneficial. Or, even better would be the output of an RMSF per pixel that correctly accounts for varying flagging per voxel.
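Both requests can be illustrated with a weighted RMSF sampled over twice the Faraday depth range. The sketch assumes the usual direct-sum definition with a weighted-mean reference lambda^2; the function name and its arguments are hypothetical. A flagged channel is represented by a weight of zero.

```python
import cmath


def rmsf(lambda2, weights, phi_max, d_phi):
    """RMSF sampled on [-2*phi_max, +2*phi_max] with per-channel weights."""
    k = sum(weights)
    if k == 0:
        raise ValueError("all channels have zero weight")
    # Weighted-mean lambda^2 used as the reference wavelength squared.
    lambda2_0 = sum(w * l2 for w, l2 in zip(weights, lambda2)) / k
    n_side = int(round(2.0 * phi_max / d_phi))
    phis = [i * d_phi for i in range(-n_side, n_side + 1)]
    values = [
        sum(w * cmath.exp(-2j * phi * (l2 - lambda2_0))
            for w, l2 in zip(weights, lambda2)) / k
        for phi in phis
    ]
    return phis, values
```

Calling this once per pixel, with weights set to zero wherever that pixel's channel is NaN, would give the per-pixel RMSF mentioned above, at the cost of one RMSF evaluation per line of sight.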

Journal publication

Publish the RM Synthesis (version 1.0 milestone) in Astronomy & Computing.

Header keywords are not retained in the output cubes

The outputs produced by rmsynthesis and rmsynthesis_cpu do not retain all FITS header keywords. Only specific WCS keywords are retained because those have been hard-coded.

Implement a multi-threaded CPU version of rmsynthesis

Brentjens' CPU code is single-threaded. It would be nice to have a multi-threaded CPU version of cuFFS.

Detect memory leaks with valgrind (version 1)

This should be the last step before version release.

Detect memory leaks with valgrind (version 2)

This should be the last issue to fix in this milestone.

Merge optionsList and parList structures

I don't really see why these two structures should be kept separate. Almost all functions rely on both of them.

Reduce the memory footprint of fitsrotate

fitsrotate uses two memory maps to read in the fits cube, rotate, and write out the rotated cube. If we read the input cube channel-by-channel, the code can work with a single memory map.

Update readme

The current readme is seriously out-of-date. Update with

Installation instructions
Update dependencies?
Structure of the code
Add citation instructions: Link to the A&C publication and ascl.
Information about work in progress?
How to contribute?
