realfastvla / rfpipe
Fast radio interferometric transient search pipeline
Home Page: https://realfastvla.github.io/rfpipe
License: BSD 3-Clause "New" or "Revised" License
Running the scan with the default clustering parameters generates too many candidates, even when there is just one transient in the dataset (found when running the pipeline on 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut), which implies that the clustering is not being done efficiently. Increasing min_cluster_size when setting the state resolves the issue, though.
Selecting channels from 19A-331_TEST_1hr_012.58458.946658078705.4.1 gives this error:
ValueError: operands could not be broadcast together with shapes (1315,300,186,2) (300,192,2)
Must be something to do with how spw boundaries align with channel selection.
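A minimal numpy illustration of the mismatch above, with hypothetical array roles (the data has had 186 channels selected, while some per-channel array such as flags or gains still carries all 192 channels):

```python
import numpy as np

data = np.zeros((1315, 300, 186, 2))   # (ints, baselines, chans, pols), 186 chans selected
perchan = np.zeros((300, 192, 2))      # (baselines, chans, pols), all 192 chans

try:
    data * perchan
except ValueError as e:
    print(e)  # operands could not be broadcast together with shapes ...

# Applying the same channel selection to both arrays restores broadcasting:
chan_sel = np.arange(186)              # hypothetical selection derived from spw edges
_ = data * perchan[:, chan_sel, :]
```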
The time index of a candidate is scaled by dt. Sometimes candidates are found over a range of dt, and calculating an index to compare against dt=1 leads to errors of order dt (e.g., dt=8 means multiplying the integration by 8 and comparing to candidates with dt=1). Clustering can miss candidates with large dt due to this misalignment in time.
One possibility is to round indexes by factors of 2 or more (perhaps up to the max dt) so that more candidates share the same time index during clustering, as in the sketch below. This should only be done to help clustering group these events; we don't want to save rounded indexes.
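A hedged sketch of the rounding idea (names are illustrative, not rfpipe API): scale each candidate's integration to dt=1 units, then snap it to a common grid so candidates found at different dt land on the same time index for clustering.

```python
def cluster_time_index(integration, dt, max_dt=8):
    """Return a rounded dt=1-equivalent index, used only during clustering."""
    i0 = integration * dt            # candidate index in dt=1 units
    return (i0 // max_dt) * max_dt   # rounded copy; the saved index stays exact
```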
When getting counting data from vys, the candidates have high (false) SNRs that are often identical for many different trials. Clustering may find one cluster for the dozens of candidates, but the max SNR chosen for the cluster may not be the maximum printed to the screen.
rtpipe flagging is currently imported by rfpipe. We need to bring it into rfpipe and write functions that use the numba.guvectorize approach for parallelism; a sketch follows.
We also need to test for overflagging (a.k.a. the FRB olympics).
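A hedged sketch of the guvectorize pattern (a simple per-spectrum amplitude clip, not the rtpipe algorithm): the core dimension is the channel axis, and numba broadcasts the kernel in parallel over all leading dimensions (integrations, baselines, pols).

```python
import numpy as np
from numba import guvectorize

@guvectorize(["void(complex64[:], float64, complex64[:])"],
             "(n),()->(n)", target="parallel")
def clip_flag(spectrum, sigma, out):
    amps = np.abs(spectrum)
    med = np.median(amps)
    std = amps.std()
    for i in range(spectrum.shape[0]):
        # zero channels whose amplitude is an outlier
        out[i] = spectrum[i] if amps[i] < med + sigma * std else 0

# usage: flagged = clip_flag(data, 5.0), with channels on the last axis of data
```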
Barak has his code at https://pypi.org/project/kalman_detector. Use it!
Getting inconsistent spw frequencies when parsing with metadata.sdm_metadata for SDM 18A-199.sb34989985.eb35001577.58155.0041900694.
```
In [13]: meta['spworder']
Out[13]:
[('AC-0', 1988.0),
 ('AC-1', 2116.0),
 ('AC-2', 2244.0),
 ('AC-3', 2372.0),
 ('AC-4', 2500.0),
 ('AC-5', 2628.0),
 ('AC-6', 2756.0),
 ('AC-7', 2884.0),
 ('BD-0', 2988.0),
 ('BD-1', 3116.0),
 ('BD-2', 3244.0),
 ('BD-3', 3372.0),
 ('BD-4', 3500.0),
 ('BD-5', 3628.0),
 ('BD-6', 3756.0),
 ('BD-7', 3884.0)]
```
versus
```
In [15]: meta['spw_reffreq']
Out[15]:
[4488000000.0,
 4616000000.0,
 4744000000.0,
 4872000000.0,
 5000000000.0,
 5128000000.0,
 5256000000.0,
 5384000000.0,
 5488000000.0,
 5616000000.0,
 5744000000.0,
 5872000000.0,
 6000000000.0,
 6128000000.0,
 6256000000.0,
 6384000000.0,
 1988000000.0,
 2116000000.0,
 2244000000.0,
 2372000000.0,
 2500000000.0,
 2628000000.0,
 2756000000.0,
 2884000000.0,
 2988000000.0,
 3116000000.0,
 3244000000.0,
 3372000000.0,
 3500000000.0,
 3628000000.0,
 3756000000.0,
 3884000000.0]
```
spworder iterates over the sdmpy scanobj.bdf.spw object, while spw_reffreq iterates over the SpectralWindow.xml rows.
Currently save_cands is a boolean that defines whether to save both the candcollection and the canddata for all detections. It is not possible to save these two products separately.
If save_cands were a string that defined a product name (or "all" or True), it would allow finer control over which products are produced; see the sketch below.
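A hedged sketch of the proposed semantics (not current rfpipe behavior): True or "all" keeps today's behavior, while a product name saves just that product.

```python
def products_to_save(save_cands):
    if save_cands in (True, "all"):
        return {"candcollection", "canddata"}
    elif isinstance(save_cands, str):
        return {save_cands}        # e.g., "candcollection" only
    return set()                   # False/None: save nothing
```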
The sum over baselines could be dedispersed and searched just as individual baselines are. That would reproduce the search at lower sensitivity, but in a form less likely to suffer from issues with super-intense bursts.
Alternatively, the incoherent baseline sum could serve as an anticoincidence signal, since it is likely more susceptible to RFI.
Would be useful to see how any kind of GPU kernel behaves when scheduled with dask distributed.
The summary plots are generated with huge values; sensible default scaling should be applied to make them more legible.
Dedispersion typically reduces the number of integrations that can be searched with nominal sensitivity (i.e., all channels). It may still be scientifically interesting to search sub-optimal integrations. For example, some bursts may have structure only in the top of the band.
However, it may be preferable to have a formalized procedure for defining more matched filters. Subband bursts could be found if they can be parameterized.
Need to integrate Luis's work with the real-time system.
We should also look at clustering with different dimensionality, as discussed at https://hackpad.com/Optimizing-Distance-Metrics-for-Clustering-LZoQbnUMd9J.
Running a search with rfgpu and downsampling by 2 will find candidates at integration/2 and integration/4.
Bad data can trigger candidates in every image. This is slow and could lead to memory problems.
Look for ways to avoid being saturated by candidates.
There is likely future interest in machine classification of candidates. The png is currently the only way the image and spectra are saved, but pngs are not great input to a machine classifier. Saving the CandData object may make it easier to add ML in the future.
Ideally we'd save CandData and generate plots from it, but it is not clear how to do that. More likely, we need to save CandData, CandCollection, and pngs, and associate them offline.
@caseyjlaw commented on Wed Sep 21 2016
key questions:
Running pipeline_scan with the following state on the 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut data:

```python
st = rfpipe.state.State(sdmfile=dataloc, sdmscan=7, inprefs={'dtarr': [1, 2, 4], 'npix_max': 512, 'gainfile': gainloc, 'saveplots': True, 'savecands': True, 'maxdm': 1000, 'applyonlineflags': False, 'fftmode': 'cuda', 'flaglist': []})
cc = rfpipe.pipeline.pipeline_scan(st)
```

gives the following error (the error doesn't show up if clustercands is set in the state):
```
2018-10-28 16:27:17,175 - rfpipe.candidates - INFO - Calculating features for 1 candidate.
2018-10-28 16:27:17,181 - rfpipe.candidates - INFO - Saving CandData to /hyrule/data/users/kshitij/rfgpu/cands_16A-459_TEST_1hr_000.57633.66130137732.scan7.cut.7.1.pkl.

UnboundLocalError                         Traceback (most recent call last)
in
      1 start = time.time()
----> 2 cc = rfpipe.pipeline.pipeline_scan(st)
      3 end = time.time()
      4 print(end-start)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in pipeline_scan(st, segments, cfile, vys_timeout)
     26     for segment in segments:
     27         candcollection += pipeline_seg(st, segment, cfile=cfile,
---> 28                                        vys_timeout=vys_timeout)
     29
     30     return candcollection

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in pipeline_seg(st, segment, cfile, vys_timeout)
     37
     38     data = source.read_segment(st, segment, timeout=vys_timeout, cfile=cfile)
---> 39     candcollection = prep_and_search(st, segment, data)
     40
     41     return candcollection

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in prep_and_search(st, segment, data)
     49
     50     if st.prefs.fftmode == "cuda":
---> 51         candcollection = search.dedisperse_search_cuda(st, segment, data)
     52     elif st.prefs.fftmode == "fftw":
     53         candcollection = search.dedisperse_search_fftw(st, segment, data)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/search.py in dedisperse_search_cuda(st, segment, data, devicenum)
    211     if st.prefs.savecands or st.prefs.saveplots:
    212         # triggers optional plotting and saving
--> 213         cc = reproduce_candcollection(cc, data)
    214
    215         candidates.save_cands(st, candcollection=cc)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/search.py in reproduce_candcollection(cc, data, wisdom)
    434             cd = rfpipe.reproduce.pipeline_imdata(st, candloc, data_corr,
    435                                                   cpuonly=True, **kwargs)
--> 436             cc1 += candidates.save_and_plot(cd)
    437
    438     # TODO: validate that reproduced features match input features?

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/candidates.py in save_and_plot(canddatalist)
    347             save_cands(st, canddata=canddata)
    348         if st.prefs.saveplots:
--> 349             candplot(canddata, cluster=(clusters[i], clustersizes[i]), snrs=snrs)
    350
    351     return candcollection

UnboundLocalError: local variable 'clusters' referenced before assignment
```
Currently the data_prep function takes the state and data and then applies calibration and flagging in place. This uses only one thread, which slows it considerably.
Calibration has been made functional, so it can now be scheduled separately, perhaps with separate threads working on separate subbands; see the sketch below.
Flagging is still dependent on rtpipe, single-threaded, and quite slow.
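A hedged sketch of per-subband threading for the now-functional calibration. The in-place multiply is a stand-in for the real calibration step; numpy releases the GIL during the multiply, so threads can overlap on disjoint channel slices.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def calibrate_threaded(data, gains, spw_slices, nthreads=4):
    """Apply gains in place, one thread per subband channel slice."""
    def worker(sl):
        data[..., sl, :] *= gains[..., sl, :]
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(worker, spw_slices))
    return data

# e.g., spw_slices = [slice(0, 64), slice(64, 128), ...] derived from spw edges
```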
The realtime system could apply telcal solutions that differ from offline processing. rfpipe needs to return the solution time (an mjd array of length 1); most of that is done.
This value should be attached to the candcollection. It also needs to be passed (by the realfast controller) to the SDM builder to tell it which solutions to attach to the SDM.
There is some indication that small (suboptimal) grids produce very different results.
We also need to check that cuda results can be reproduced by fftw.
We sometimes find that RFI can trigger many detections that do not cluster well. They tend to be found with many different (l, m, DM, dt).
Could we use the number of clusters in a segment as a trigger for rejecting RFI? For example, we could set a parameter in the preferences (e.g., max_clusters) that is tested after clustering, as sketched below. If more than that many clusters are found, reject all candidates in the segment. Alternatively, that condition could simply trigger the generation of a single candidate plot, rather than all plots in the segment.
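A hedged sketch of the proposed max_clusters check (illustrative names, not rfpipe API):

```python
def filter_segment(candidates, labels, max_clusters=20):
    """Reject the whole segment if clustering finds too many clusters."""
    nclusters = len(set(labels) - {-1})   # -1 marks unclustered points in hdbscan
    if nclusters > max_clusters:
        return []                         # likely RFI: drop all candidates
    return candidates
```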
We all forget the preference names sometimes. Fuzzy string comparison can be used to suggest the correct name.
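A sketch using the standard library's difflib to make that suggestion; the preference list here is illustrative, drawn from names used elsewhere on this page.

```python
import difflib

known_prefs = ["dtarr", "npix_max", "savecands", "saveplots", "maxdm",
               "applyonlineflags", "fftmode", "flaglist", "clustercands"]

def suggest_pref(name):
    """Return the closest known preference name, or None if nothing is close."""
    match = difflib.get_close_matches(name, known_prefs, n=1, cutoff=0.6)
    return match[0] if match else None

print(suggest_pref("savecand"))  # -> "savecands"
```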
Currently, running a search optionally produces a pkl file with noise properties, and realfast needs to look on disk to find these files and ingest them.
It would be smarter to have the function return the values to realfast directly. That would require rfpipe to manage them itself in standalone mode.
vysmaw_reader is not catching all integrations.
The number of missing vysmaw_reader spectra is not reduced by waiting a few integrations longer, so the losses probably occur earlier than the end-of-segment window.
The pipeline is generating and clustering candidates with SNR less than the threshold (persistent in all cases tested: an injected transient in simulated data, an FRB in real data, and real data with an injected transient).
This could be an issue with the calculation of SNR on GPU vs CPU, since it was encountered using rfgpu.
An example of the above:
```python
t0 = time.Time.now().mjd
meta = rfpipe.metadata.mock_metadata(t0, t0+10/(24*3600), 27, 4, 32*4, 4, 5e3, datasource='sim', antconfig='D')
st = rfpipe.state.State(inmeta=meta, inprefs={'dtarr': [1, 2], 'npix_max': 512, 'savecands': True, 'saveplots': True, 'maxdm': 500, 'applyonlineflags': False, 'flaglist': [], 'clustercands': True, 'fftmode': 'cuda'})
data = rfpipe.source.read_segment(st, 0)
mock = rfpipe.util.make_transient_params(st, snr=80, data=data, ntr=3)
st = rfpipe.state.State(inmeta=meta, inprefs={'dtarr': [1, 2, 4], 'npix_max': 512, 'savecands': True, 'saveplots': True, 'maxdm': 500, 'applyonlineflags': False, 'flaglist': [], 'clustercands': (4, 2), 'simulated_transient': mock, 'fftmode': 'cuda'})
cc = rfpipe.pipeline.pipeline_scan(st)
```
The mocks used were:

```
[(0, 971, 77.4, 0.005, 0.18785936515144086, -0.0016023057774304935, 0.0),
 (0, 21, 129.0, 0.02, 0.09468444540203508, 0.0025197170973120567, 0.0),
 (0, 1859, 336.05, 0.005, 0.1892371962894011, 0.0, -0.0029736458931088282)]
```
Also, for the 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut data, the max SNR reported by the clustering differed from the one on the plots; relevant logger output is attached.
As can be seen below, for the cluster with ~1300 candidates, clustering reports a max SNR of 31.578176498413086, but while reproducing the candidate it was reported to be 22.6.
vysmaw_reader chooses indexes in time, baseline, channel, and pol. The logging statement about twice-written fields is sometimes triggered:
```
data value alread set at 12 191 192 1
data value alread set at 11 191 192 3
data value alread set at 10 191 192 3
data value alread set at 9 191 192 3
data value alread set at 8 191 192 3
data value alread set at 7 191 192 3
data value alread set at 6 191 192 3
data value alread set at 5 191 192 3
data value alread set at 4 191 192 3
data value alread set at 3 191 192 3
data value alread set at 2 191 192 3
data value alread set at 1 191 192 3
data value alread set at 0 191 192 3
```
It is not clear what the pattern is yet, but it may be tied to "missing" spectra (#26).
Currently we are defining the state from rtpipe. We need to move that into rfpipe.
The overall goal is to allow the state to have methods for calculating useful parameters on the fly (e.g., the data delay depends on the freqs, with an input argument of dm); a sketch follows.
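A hedged sketch of such a state method (illustrative names): per-channel dispersion delay in integrations for a given dm, using the standard cold-plasma dispersion constant (about 4.149 ms delay for DM=1 pc/cm^3 at 1 GHz).

```python
import numpy as np

def dmshift(dm, freqs_ghz, inttime_s):
    """Delay (in integrations) of each channel relative to the highest frequency."""
    delay_s = 4.149e-3 * dm * (freqs_ghz**-2 - freqs_ghz.max()**-2)
    return np.round(delay_s / inttime_s).astype(int)

# e.g., dmshift(100., np.linspace(1.988, 3.884, 256), 0.005)
```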
Currently, rtpipe uses a brute force dedispersion algorithm. We should move to a tree-like algorithm that actually transforms from (f,t) to (dm,t) space.
The development branch has functions to weight channels and baselines according to an estimate of their variance (sketched below). Testing on ideal data shows that this improves detection significance when a few channels are noisier than the rest. However, FRB 121102 data shows worse SNRs when using channel weighting; baseline weighting seems to show a small systematic improvement in FRB 121102 tests.
It seems that the variance estimates are not reliable on real data, so these weighting schemes are turned off by default. Full testing (and perhaps tuning of the variance calculation) is needed.
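A hedged sketch of inverse-variance channel weighting (illustrative, not the development-branch code): estimate per-channel variance over time, baseline, and pol, then downweight noisy channels.

```python
import numpy as np

def channel_weights(data):
    """data: (ints, baselines, chans, pols) complex visibilities."""
    var = data.real.var(axis=(0, 1, 3))     # per-channel variance estimate
    w = np.where(var > 0, 1.0 / var, 0.0)   # inverse variance; 0 for dead channels
    return w / w.mean()                     # normalize to a mean weight of 1
```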
Tests do not cover parsing of candidate output pickle.
It seems that very strong RFI is often spatially incoherent. These events can trigger a huge mess of candidates, but are easily rejected by eye. Perhaps an anti-coincidence filter could help remove them sooner.
The idea would be to sum over baselines and take the amplitude to generate an incoherent, wide-field data stream (see the sketch below), then dedisperse and search it for transients. Any detections are likely (but not guaranteed!) to be dominated by RFI.
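A hedged sketch of the incoherent anticoincidence stream (illustrative shapes):

```python
import numpy as np

def incoherent_stream(data):
    """data: (ints, baselines, chans, pols) -> (ints, chans) amplitude stream."""
    return np.abs(data.sum(axis=1)).mean(axis=2)
```

This (t, f) stream could then be dedispersed and searched like a single-pixel time series; its detections would flag likely RFI in the coherent search.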
Check out https://cupy.chainer.org. It has a good mapping of numpy to cuda functions and flexible kernel design syntax, too.
A good plan could be to move some of the prep functions in, such that data are put on the GPU during the data_prep call; a sketch follows.
First, we need to investigate how many moves in and out of GPU memory would be required. If we use both cupy and rfgpu, would data need to come out of the GPU and then go back in again?
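A hedged sketch of GPU-resident prep (illustrative, not rfpipe code): copy the data to the device once, then calibrate and mean-subtract without returning to host memory. Interop with rfgpu's device buffers remains the open question.

```python
import cupy as cp

def prep_on_gpu(data_host, gains_host):
    data = cp.asarray(data_host)       # single host-to-device copy
    data *= cp.asarray(gains_host)     # calibration on the GPU
    data -= data.mean(axis=0)          # mean subtraction, still on device
    return data                        # leave resident for the search step
```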
The segment size optimization logging appears during both candidate file creation and state definition, which can lead to confusion in the realfast log about what is actually happening.
@caseyjlaw commented on Wed Sep 21 2016
Would be nice to get some rudimentary version of rfpipe (using CPUs via dask distributed scheduler) on CBE for NRAO review.
The scansummary bokeh plot currently assumes that scans are long. Need to reevaluate in OTF mode, where subscans are short and scans are extended on the sky.
The CPU-based imaging algorithm manages memory use with a concept of "chunks". There is no process for merging candidates between chunks, so a single segment may have many candidates if it has many chunks. Ideally, they would be merged so that there is only one (maximum SNR) candidate with a png per segment.
When providing mock metadata with 20 antennas, some tests fail. The test_seg.py logging shows that the state method for finding a peak pixel is finding weirdness in the input image. Something is probably wrong in the structure of the data passed to this function:

```
state.py 646 WARNING More than one peak pixel ([ 0 0 0 ... 31 31 31], [ 0 1 2 ... 29 30 31]). Using first.
state.py 646 WARNING More than one peak pixel ([ 0 0 0 ... 31 31 31], [ 0 1 2 ... 29 30 31]). Using first.
```
Mean time subtraction is often biased by simulated transients, which makes testing confusing. It can also produce negative candidates.
It would be best to find the mean robustly (e.g., via MAD) and subtract that; see the sketch below.
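A hedged sketch of a MAD-clipped mean subtraction (illustrative, not the rfpipe implementation): mask points beyond a few MADs of the median, then take the mean of what survives, so a bright transient does not bias the subtraction.

```python
import numpy as np

def subtract_robust_mean(data, nmad=5):
    """Subtract a MAD-clipped mean over the time (first) axis of a real array."""
    med = np.median(data, axis=0)
    mad = 1.4826 * np.median(np.abs(data - med), axis=0)  # MAD scaled to sigma
    clipped = np.where(np.abs(data - med) < nmad * mad, data, np.nan)
    return data - np.nanmean(clipped, axis=0)
```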
casa measures is not thread safe; the uvw calculation crashes when multithreaded.
We should probably just reimplement it in Python anyway.
Potential inspiration:
Creating segments currently assumes they must lie on integration boundaries and are all the same size. That obligates us to drop some integrations at the end of a scan (up to the size of a segment).
Can we modify this scheme to search the final integrations as a partial segment?
Currently we have a fixed DM grid that is searched for each resampling scale. This is clearly suboptimal, since larger values of dt require fewer DM trials.
It is not clear how to use dm indices if we change the dm grid for larger dt. One possibility is to take a subset of the dm grid that is approximately correct for each larger dt, e.g.:
```python
dmarr = [0, 10, 20, 30, 40, 50, 60]

# dt = 1: search the full dm grid
for dmind in range(len(dmarr)):
    search(dmind, dt=1)    # placeholder for the per-trial search

# dt = 2: reuse a coarser subset of indices into the same dmarr
dminds = [0, 2, 4, 6]  # or whatever
for dmind in dminds:
    search(dmind, dt=2)
```