realfastvla / rfpipe
Fast radio interferometric transient search pipeline
Home Page: https://realfastvla.github.io/rfpipe
License: BSD 3-Clause "New" or "Revised" License
Running the scan with the default clustering parameters generates too many candidates, even when there is just one transient in the dataset (found when running the pipeline on 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut), which implies that the clustering is not being done efficiently. Increasing min_cluster_size when setting the state resolves the issue, though.
Selecting channels from 19A-331_TEST_1hr_012.58458.946658078705.4.1 gives this error:
ValueError: operands could not be broadcast together with shapes (1315,300,186,2) (300,192,2)
Must be something to do with how spw boundaries align with channel selection.
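A minimal numpy illustration of the mismatch above, with hypothetical array roles (the data has had 186 channels selected, while some per-channel array such as flags or gains still carries all 192 channels):

```python
import numpy as np

data = np.zeros((1315, 300, 186, 2))   # (ints, baselines, chans, pols), 186 chans selected
perchan = np.zeros((300, 192, 2))      # (baselines, chans, pols), all 192 chans

try:
    data * perchan
except ValueError as e:
    print(e)  # operands could not be broadcast together with shapes ...

# Applying the same channel selection to both arrays restores broadcasting:
chan_sel = np.arange(186)              # hypothetical selection derived from spw edges
_ = data * perchan[:, chan_sel, :]
```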
The time index of a candidate is scaled by dt. Sometimes candidates are found over a range of dt, and calculating an index to compare against dt=1 leads to errors of order dt (e.g., dt=8 means multiplying the integration by 8 and comparing to candidates with dt=1). Clustering can miss candidates with large dt due to this misalignment in time.
One possibility is to round indexes by factors of 2 or more (perhaps up to the max dt) so that more candidates share the same time index during clustering, as in the sketch below. This should only be done to help clustering group these events; we don't want to save rounded indexes.
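A hedged sketch of the rounding idea (names are illustrative, not rfpipe API): scale each candidate's integration to dt=1 units, then snap it to a common grid so candidates found at different dt land on the same time index for clustering.

```python
def cluster_time_index(integration, dt, max_dt=8):
    """Return a rounded dt=1-equivalent index, used only during clustering."""
    i0 = integration * dt            # candidate index in dt=1 units
    return (i0 // max_dt) * max_dt   # rounded copy; the saved index stays exact
```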
When getting counting data from vys, the candidates have high (false) SNRs that are often identical for many different trials. Clustering may find one cluster for the dozens of candidates, but the max SNR chosen for the cluster may not be the maximum printed to the screen.
rtpipe flagging is currently imported by rfpipe. We need to bring it into rfpipe and write functions that use the numba.guvectorize approach for parallelism; a sketch follows.
We also need to test for overflagging (a.k.a. the FRB olympics).
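A hedged sketch of the guvectorize pattern (a simple per-spectrum amplitude clip, not the rtpipe algorithm): the core dimension is the channel axis, and numba broadcasts the kernel in parallel over all leading dimensions (integrations, baselines, pols).

```python
import numpy as np
from numba import guvectorize

@guvectorize(["void(complex64[:], float64, complex64[:])"],
             "(n),()->(n)", target="parallel")
def clip_flag(spectrum, sigma, out):
    amps = np.abs(spectrum)
    med = np.median(amps)
    std = amps.std()
    for i in range(spectrum.shape[0]):
        # zero channels whose amplitude is an outlier
        out[i] = spectrum[i] if amps[i] < med + sigma * std else 0

# usage: flagged = clip_flag(data, 5.0), with channels on the last axis of data
```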
Barak has his code at https://pypi.org/project/kalman_detector. Use it!
Getting inconsistent spw frequencies when parsing with metadata.sdm_metadata for SDM 18A-199.sb34989985.eb35001577.58155.0041900694.
```
In [13]: meta['spworder']
Out[13]:
[('AC-0', 1988.0),
 ('AC-1', 2116.0),
 ('AC-2', 2244.0),
 ('AC-3', 2372.0),
 ('AC-4', 2500.0),
 ('AC-5', 2628.0),
 ('AC-6', 2756.0),
 ('AC-7', 2884.0),
 ('BD-0', 2988.0),
 ('BD-1', 3116.0),
 ('BD-2', 3244.0),
 ('BD-3', 3372.0),
 ('BD-4', 3500.0),
 ('BD-5', 3628.0),
 ('BD-6', 3756.0),
 ('BD-7', 3884.0)]
```
versus
```
In [15]: meta['spw_reffreq']
Out[15]:
[4488000000.0,
 4616000000.0,
 4744000000.0,
 4872000000.0,
 5000000000.0,
 5128000000.0,
 5256000000.0,
 5384000000.0,
 5488000000.0,
 5616000000.0,
 5744000000.0,
 5872000000.0,
 6000000000.0,
 6128000000.0,
 6256000000.0,
 6384000000.0,
 1988000000.0,
 2116000000.0,
 2244000000.0,
 2372000000.0,
 2500000000.0,
 2628000000.0,
 2756000000.0,
 2884000000.0,
 2988000000.0,
 3116000000.0,
 3244000000.0,
 3372000000.0,
 3500000000.0,
 3628000000.0,
 3756000000.0,
 3884000000.0]
```
spworder iterates over the sdmpy scanobj.bdf.spw object, while spw_reffreq iterates over the SpectralWindow.xml rows.
Currently save_cands is a boolean that defines whether to save both the candcollection and the canddata for all detections. It is not possible to save these two products separately.
If save_cands were a string that defined a product name (or "all" or True), it would allow finer control over which products are produced; see the sketch below.
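A hedged sketch of the proposed semantics (not current rfpipe behavior): True or "all" keeps today's behavior, while a product name saves just that product.

```python
def products_to_save(save_cands):
    if save_cands in (True, "all"):
        return {"candcollection", "canddata"}
    elif isinstance(save_cands, str):
        return {save_cands}        # e.g., "candcollection" only
    return set()                   # False/None: save nothing
```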
The sum over baselines could be dedispersed and searched just as individual baselines are. That would reproduce the search at lower sensitivity, but in a form less likely to suffer from issues with super-intense bursts.
Alternatively, the incoherent baseline sum could serve as an anticoincidence signal, since it is likely more susceptible to RFI.
Would be useful to see how any kind of GPU kernel behaves when scheduled with dask distributed.
The summary plots are generated with huge values; sensible default scaling should be applied to make them more legible.
Dedispersion typically reduces the number of integrations that can be searched with nominal sensitivity (i.e., all channels). It may still be scientifically interesting to search sub-optimal integrations. For example, some bursts may have structure only in the top of the band.
However, it may be preferable to have a formalized procedure for defining more matched filters. Subband bursts could be found if they can be parameterized.
Need to integrate Luis's work with the real-time system.
We should also look at clustering with different dimensionality, as discussed at https://hackpad.com/Optimizing-Distance-Metrics-for-Clustering-LZoQbnUMd9J.
Running a search with rfgpu and downsampling by 2 will find candidates at integration/2 and integration/4.
Bad data can trigger candidates in every image. This is slow and could lead to memory problems.
Look for ways to avoid being saturated by candidates.
There is likely future interest in machine classification of candidates. The png is currently the only way the image and spectra are saved, but pngs are not great input to a machine classifier. Saving the CandData object may make it easier to add ML in the future.
Ideally we'd save CandData and generate plots from it, but it is not clear how to do that. More likely, we need to save CandData, CandCollection, and pngs, and associate them offline.
@caseyjlaw commented on Wed Sep 21 2016
key questions:
Running pipeline_scan with the following state on the 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut data:

```python
st = rfpipe.state.State(sdmfile=dataloc, sdmscan=7, inprefs={'dtarr': [1, 2, 4], 'npix_max': 512, 'gainfile': gainloc, 'saveplots': True, 'savecands': True, 'maxdm': 1000, 'applyonlineflags': False, 'fftmode': 'cuda', 'flaglist': []})
cc = rfpipe.pipeline.pipeline_scan(st)
```

gives the following error (the error doesn't show up if clustercands is set in the state):
```
2018-10-28 16:27:17,175 - rfpipe.candidates - INFO - Calculating features for 1 candidate.
2018-10-28 16:27:17,181 - rfpipe.candidates - INFO - Saving CandData to /hyrule/data/users/kshitij/rfgpu/cands_16A-459_TEST_1hr_000.57633.66130137732.scan7.cut.7.1.pkl.

UnboundLocalError                         Traceback (most recent call last)
in
      1 start = time.time()
----> 2 cc = rfpipe.pipeline.pipeline_scan(st)
      3 end = time.time()
      4 print(end-start)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in pipeline_scan(st, segments, cfile, vys_timeout)
     26     for segment in segments:
     27         candcollection += pipeline_seg(st, segment, cfile=cfile,
---> 28                                        vys_timeout=vys_timeout)
     29
     30     return candcollection

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in pipeline_seg(st, segment, cfile, vys_timeout)
     37
     38     data = source.read_segment(st, segment, timeout=vys_timeout, cfile=cfile)
---> 39     candcollection = prep_and_search(st, segment, data)
     40
     41     return candcollection

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in prep_and_search(st, segment, data)
     49
     50     if st.prefs.fftmode == "cuda":
---> 51         candcollection = search.dedisperse_search_cuda(st, segment, data)
     52     elif st.prefs.fftmode == "fftw":
     53         candcollection = search.dedisperse_search_fftw(st, segment, data)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/search.py in dedisperse_search_cuda(st, segment, data, devicenum)
    211     if st.prefs.savecands or st.prefs.saveplots:
    212         # triggers optional plotting and saving
--> 213         cc = reproduce_candcollection(cc, data)
    214
    215         candidates.save_cands(st, candcollection=cc)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/search.py in reproduce_candcollection(cc, data, wisdom)
    434             cd = rfpipe.reproduce.pipeline_imdata(st, candloc, data_corr,
    435                                                   cpuonly=True, **kwargs)
--> 436             cc1 += candidates.save_and_plot(cd)
    437
    438     # TODO: validate that reproduced features match input features?

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/candidates.py in save_and_plot(canddatalist)
    347             save_cands(st, canddata=canddata)
    348         if st.prefs.saveplots:
--> 349             candplot(canddata, cluster=(clusters[i], clustersizes[i]), snrs=snrs)
    350
    351     return candcollection

UnboundLocalError: local variable 'clusters' referenced before assignment
```
Currently the data_prep function takes the state and data and then applies calibration and flagging in place. This uses only one thread, which slows it considerably.
Calibration has been made functional, so it can now be scheduled separately, perhaps with separate threads working on separate subbands; see the sketch below.
Flagging is still dependent on rtpipe, single-threaded, and quite slow.
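A hedged sketch of per-subband threading for the now-functional calibration. The in-place multiply is a stand-in for the real calibration step; numpy releases the GIL during the multiply, so threads can overlap on disjoint channel slices.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def calibrate_threaded(data, gains, spw_slices, nthreads=4):
    """Apply gains in place, one thread per subband channel slice."""
    def worker(sl):
        data[..., sl, :] *= gains[..., sl, :]
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(worker, spw_slices))
    return data

# e.g., spw_slices = [slice(0, 64), slice(64, 128), ...] derived from spw edges
```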
The realtime system could apply telcal solutions that differ from offline processing. rfpipe needs to return the solution time (an mjd array of length 1); most of that is done.
This value should be attached to the candcollection. It also needs to be passed (by the realfast controller) to the SDM builder to tell it which solutions to attach to the SDM.
There is some indication that small (suboptimal) grids produce very different results.
We also need to check that cuda results can be reproduced by fftw.
We sometimes find that RFI can trigger many detections that do not cluster well. They tend to be found with many different (l, m, DM, dt).
Could we use the number of clusters in a segment as a trigger for rejecting RFI? For example, we could set a parameter in the preferences (e.g., max_clusters) that is tested after clustering, as sketched below. If more than that many clusters are found, reject all candidates in the segment. Alternatively, that condition could simply trigger the generation of a single candidate plot, rather than all plots in the segment.
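A hedged sketch of the proposed max_clusters check (illustrative names, not rfpipe API):

```python
def filter_segment(candidates, labels, max_clusters=20):
    """Reject the whole segment if clustering finds too many clusters."""
    nclusters = len(set(labels) - {-1})   # -1 marks unclustered points in hdbscan
    if nclusters > max_clusters:
        return []                         # likely RFI: drop all candidates
    return candidates
```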
We all forget the preference names sometimes. Fuzzy string comparison can be used to suggest the correct name.
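A sketch using the standard library's difflib to make that suggestion; the preference list here is illustrative, drawn from names used elsewhere on this page.

```python
import difflib

known_prefs = ["dtarr", "npix_max", "savecands", "saveplots", "maxdm",
               "applyonlineflags", "fftmode", "flaglist", "clustercands"]

def suggest_pref(name):
    """Return the closest known preference name, or None if nothing is close."""
    match = difflib.get_close_matches(name, known_prefs, n=1, cutoff=0.6)
    return match[0] if match else None

print(suggest_pref("savecand"))  # -> "savecands"
```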
Currently, running a search optionally produces a pkl file with noise properties, and realfast needs to look on disk to find these files and ingest them.
It would be smarter to have the function return the values to realfast directly. That would require rfpipe to manage them itself in standalone mode.
vysmaw_reader is not catching all integrations.
The number of missing vysmaw_reader spectra is not reduced by waiting a few integrations longer, so the losses probably occur earlier than the end-of-segment window.
The pipeline is generating and clustering candidates with SNR less than the threshold (persistent in all cases tested: an injected transient in simulated data, an FRB in real data, and real data with an injected transient).
This could be an issue with the calculation of SNR on GPU vs CPU, since it was encountered using rfgpu.
An example of the above:
```python
t0 = time.Time.now().mjd
meta = rfpipe.metadata.mock_metadata(t0, t0+10/(24*3600), 27, 4, 32*4, 4, 5e3, datasource='sim', antconfig='D')
st = rfpipe.state.State(inmeta=meta, inprefs={'dtarr': [1, 2], 'npix_max': 512, 'savecands': True, 'saveplots': True, 'maxdm': 500, 'applyonlineflags': False, 'flaglist': [], 'clustercands': True, 'fftmode': 'cuda'})
data = rfpipe.source.read_segment(st, 0)
mock = rfpipe.util.make_transient_params(st, snr=80, data=data, ntr=3)
st = rfpipe.state.State(inmeta=meta, inprefs={'dtarr': [1, 2, 4], 'npix_max': 512, 'savecands': True, 'saveplots': True, 'maxdm': 500, 'applyonlineflags': False, 'flaglist': [], 'clustercands': (4, 2), 'simulated_transient': mock, 'fftmode': 'cuda'})
cc = rfpipe.pipeline.pipeline_scan(st)
```
The mocks used were:

```
[(0, 971, 77.4, 0.005, 0.18785936515144086, -0.0016023057774304935, 0.0),
 (0, 21, 129.0, 0.02, 0.09468444540203508, 0.0025197170973120567, 0.0),
 (0, 1859, 336.05, 0.005, 0.1892371962894011, 0.0, -0.0029736458931088282)]
```
Also, for the 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut data, the max SNR reported by the clustering differed from the one on the plots; relevant logger output is attached.
As can be seen below, for the cluster with ~1300 candidates, clustering reports a max SNR of 31.578176498413086, but while reproducing the candidate it was reported to be 22.6.
vysmaw_reader chooses indexes in time, baseline, channel, and pol. The logging statement about twice-written fields is sometimes triggered:
```
data value alread set at 12 191 192 1
data value alread set at 11 191 192 3
data value alread set at 10 191 192 3
data value alread set at 9 191 192 3
data value alread set at 8 191 192 3
data value alread set at 7 191 192 3
data value alread set at 6 191 192 3
data value alread set at 5 191 192 3
data value alread set at 4 191 192 3
data value alread set at 3 191 192 3
data value alread set at 2 191 192 3
data value alread set at 1 191 192 3
data value alread set at 0 191 192 3
```
It is not clear what the pattern is yet, but it may be tied to "missing" spectra (#26).
Currently we are defining the state from rtpipe. We need to move that into rfpipe.
The overall goal is to allow the state to have methods for calculating useful parameters on the fly (e.g., the data delay depends on the freqs, with an input argument of dm); a sketch follows.
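A hedged sketch of such a state method (illustrative names): per-channel dispersion delay in integrations for a given dm, using the standard cold-plasma dispersion constant (about 4.149 ms delay for DM=1 pc/cm^3 at 1 GHz).

```python
import numpy as np

def dmshift(dm, freqs_ghz, inttime_s):
    """Delay (in integrations) of each channel relative to the highest frequency."""
    delay_s = 4.149e-3 * dm * (freqs_ghz**-2 - freqs_ghz.max()**-2)
    return np.round(delay_s / inttime_s).astype(int)

# e.g., dmshift(100., np.linspace(1.988, 3.884, 256), 0.005)
```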
Currently, rtpipe uses a brute force dedispersion algorithm. We should move to a tree-like algorithm that actually transforms from (f,t) to (dm,t) space.
The development branch has functions to weight channels and baselines according to an estimate of their variance (sketched below). Testing on ideal data shows that this improves detection significance when a few channels are noisier than the rest. However, FRB 121102 data shows worse SNRs when using channel weighting; baseline weighting seems to show a small systematic improvement in FRB 121102 tests.
It seems that the variance estimates are not reliable on real data, so these weighting schemes are turned off by default. Full testing (and perhaps tuning of the variance calculation) is needed.
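A hedged sketch of inverse-variance channel weighting (illustrative, not the development-branch code): estimate per-channel variance over time, baseline, and pol, then downweight noisy channels.

```python
import numpy as np

def channel_weights(data):
    """data: (ints, baselines, chans, pols) complex visibilities."""
    var = data.real.var(axis=(0, 1, 3))     # per-channel variance estimate
    w = np.where(var > 0, 1.0 / var, 0.0)   # inverse variance; 0 for dead channels
    return w / w.mean()                     # normalize to a mean weight of 1
```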
Tests do not cover parsing of candidate output pickle.
It seems that very strong RFI is often spatially incoherent. These events can trigger a huge mess of candidates, but are easily rejected by eye. Perhaps an anti-coincidence filter could help remove them sooner.
The idea would be to sum over baselines and take the amplitude to generate an incoherent, wide-field data stream (see the sketch below), then dedisperse and search it for transients. Any detections are likely (but not guaranteed!) to be dominated by RFI.
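A hedged sketch of the incoherent anticoincidence stream (illustrative shapes):

```python
import numpy as np

def incoherent_stream(data):
    """data: (ints, baselines, chans, pols) -> (ints, chans) amplitude stream."""
    return np.abs(data.sum(axis=1)).mean(axis=2)
```

This (t, f) stream could then be dedispersed and searched like a single-pixel time series; its detections would flag likely RFI in the coherent search.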
Check out https://cupy.chainer.org. It has a good mapping of numpy to cuda functions and flexible kernel design syntax, too.
A good plan could be to move some of the prep functions in, such that data are put on the GPU during the data_prep call; a sketch follows.
First, we need to investigate how many moves in and out of GPU memory would be required. If we use both cupy and rfgpu, would data need to come out of the GPU and then go back in again?
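A hedged sketch of GPU-resident prep (illustrative, not rfpipe code): copy the data to the device once, then calibrate and mean-subtract without returning to host memory. Interop with rfgpu's device buffers remains the open question.

```python
import cupy as cp

def prep_on_gpu(data_host, gains_host):
    data = cp.asarray(data_host)       # single host-to-device copy
    data *= cp.asarray(gains_host)     # calibration on the GPU
    data -= data.mean(axis=0)          # mean subtraction, still on device
    return data                        # leave resident for the search step
```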
The segment size optimization logging appears during both candidate file creation and state definition, which can lead to confusion in the realfast log about what is actually happening.
@caseyjlaw commented on Wed Sep 21 2016
Would be nice to get some rudimentary version of rfpipe (using CPUs via dask distributed scheduler) on CBE for NRAO review.
The scansummary bokeh plot currently assumes that scans are long. Need to reevaluate in OTF mode, where subscans are short and scans are extended on the sky.
The CPU-based imaging algorithm manages memory use with a concept of "chunks". There is no process for merging candidates between chunks, so a single segment may have many candidates if it has many chunks. Ideally, they would be merged so that there is only one (maximum SNR) candidate with a png per segment.
When providing mock metadata with 20 antennas, some tests fail. The test_seg.py logging shows that the state method for finding a peak pixel is finding weirdness in the input image. Something is probably wrong in the structure of the data passed to this function:

```
state.py 646 WARNING More than one peak pixel ([ 0 0 0 ... 31 31 31], [ 0 1 2 ... 29 30 31]). Using first.
state.py 646 WARNING More than one peak pixel ([ 0 0 0 ... 31 31 31], [ 0 1 2 ... 29 30 31]). Using first.
```
Mean time subtraction is often biased by simulated transients, which makes testing confusing. It can also produce negative candidates.
It would be best to find the mean robustly (e.g., via MAD) and subtract that; see the sketch below.
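A hedged sketch of a MAD-clipped mean subtraction (illustrative, not the rfpipe implementation): mask points beyond a few MADs of the median, then take the mean of what survives, so a bright transient does not bias the subtraction.

```python
import numpy as np

def subtract_robust_mean(data, nmad=5):
    """Subtract a MAD-clipped mean over the time (first) axis of a real array."""
    med = np.median(data, axis=0)
    mad = 1.4826 * np.median(np.abs(data - med), axis=0)  # MAD scaled to sigma
    clipped = np.where(np.abs(data - med) < nmad * mad, data, np.nan)
    return data - np.nanmean(clipped, axis=0)
```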
casa measures is not thread safe; the uvw calculation crashes when multithreaded.
We should probably just reimplement it in Python anyway.
Potential inspiration:
Creating segments currently assumes they must lie on integration boundaries and are all the same size. That obligates us to drop some integrations at the end of a scan (up to the size of a segment).
Can we modify this scheme to search the final integrations as a partial segment?
Currently we have a fixed DM grid that is searched for each resampling scale. This is clearly suboptimal, since larger values of dt require fewer DM trials.
It is not clear how to use dm indices if we change the dm grid for larger dt. One possibility is to take a subset of the dm grid that is approximately correct for each larger dt, e.g.:
```python
dmarr = [0, 10, 20, 30, 40, 50, 60]

# dt = 1: search the full dm grid
for dmind in range(len(dmarr)):
    search(dmind, dt=1)    # placeholder for the per-trial search

# dt = 2: reuse a coarser subset of indices into the same dmarr
dminds = [0, 2, 4, 6]  # or whatever
for dmind in dminds:
    search(dmind, dt=2)
```