realfastvla / rfpipe

Fast radio interferometric transient search pipeline

Home Page: https://realfastvla.github.io/rfpipe

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
radio-astronomy python pipeline transients interferometry

rfpipe's People

Contributors

astrolobo, caseyjlaw, demorest, kshitijaggarwal


rfpipe's Issues

Too many clusters generated when using default parameters

Running a scan with the default clustering parameters generates too many candidates, even when there is just one transient in the dataset (found when running the pipeline on 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut), which implies that the clustering is not being done efficiently. Increasing min_cluster_size when setting the state resolves the issue, though.

selecting channels not working with calibration

Selecting channels from 19A-331_TEST_1hr_012.58458.946658078705.4.1 gives this error:

ValueError: operands could not be broadcast together with shapes (1315,300,186,2) (300,192,2) .

This must be related to how the spw boundaries align with the channel selection.
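The shape mismatch can be reproduced with a toy numpy example; the sizes below are reduced stand-ins (186 selected channels vs. a 192-channel calibration table), not the actual rfpipe arrays:

```python
import numpy as np

# Toy stand-in shapes: data has 186 selected channels but the calibration
# table still spans all 192 channels, so the channel axes cannot broadcast.
data = np.zeros((4, 3, 186, 2))   # (time, baseline, channel, pol)
gains = np.ones((3, 192, 2))      # (baseline, channel, pol)

try:
    data * gains
except ValueError as err:
    print(err)                    # operands could not be broadcast together ...

# Applying the same channel selection to the calibration table fixes it:
chans = np.arange(186)            # hypothetical channel selection
calibrated = data * gains[:, chans, :]
print(calibrated.shape)
```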

round indexes to improve clustering

The time index of a candidate is scaled by dt. Candidates are sometimes found over a range of dt values, and calculating an index to compare against dt=1 introduces errors on the order of dt (e.g., dt=8 means multiplying the integration by 8 and comparing to candidates with dt=1). Clustering can miss candidates with large dt due to this misalignment in time.
One possibility is to round indexes by factors of 2 or more (perhaps up to max dt) such that more candidates have the same time index during clustering. This should only be done to help clustering group these events; we don't want to save rounded indexes.
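The rounding idea can be sketched in a few lines; the function name and grid choice are illustrative:

```python
import numpy as np

# Sketch: quantize candidate integration indices (at the dt=1 scale) to a
# common grid, up to the maximum dt, so the same event found at different
# dt values shares a time index during clustering. The rounded values are
# used only for clustering; the original indexes are what get saved.
def round_for_clustering(integrations, grid=8):
    return (np.asarray(integrations) // grid) * grid

# The same event found at dt=1, 2, and 8 lands on slightly different indexes:
ints = np.array([464, 466, 471])
print(round_for_clustering(ints))   # all map to 464
```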

Move flagging into rfpipe and test suite

rtpipe flagging is currently imported by rfpipe. We need to bring it into rfpipe and write functions that use the numba.guvectorize approach for parallelism.
We also need to test for overflagging (a.k.a. the FRB olympics).

spworder incorrect in some cases

Getting inconsistent spw frequencies when parsing with metadata.sdm_metadata for SDM 18A-199.sb34989985.eb35001577.58155.0041900694.

In [13]: meta['spworder']
Out[13]:
[('AC-0', 1988.0),
('AC-1', 2116.0),
('AC-2', 2244.0),
('AC-3', 2372.0),
('AC-4', 2500.0),
('AC-5', 2628.0),
('AC-6', 2756.0),
('AC-7', 2884.0),
('BD-0', 2988.0),
('BD-1', 3116.0),
('BD-2', 3244.0),
('BD-3', 3372.0),
('BD-4', 3500.0),
('BD-5', 3628.0),
('BD-6', 3756.0),
('BD-7', 3884.0)]

versus
In [15]: meta['spw_reffreq']
Out[15]:
[4488000000.0,
4616000000.0,
4744000000.0,
4872000000.0,
5000000000.0,
5128000000.0,
5256000000.0,
5384000000.0,
5488000000.0,
5616000000.0,
5744000000.0,
5872000000.0,
6000000000.0,
6128000000.0,
6256000000.0,
6384000000.0,
1988000000.0,
2116000000.0,
2244000000.0,
2372000000.0,
2500000000.0,
2628000000.0,
2756000000.0,
2884000000.0,
2988000000.0,
3116000000.0,
3244000000.0,
3372000000.0,
3500000000.0,
3628000000.0,
3756000000.0,
3884000000.0]

spworder iterates over the sdmpy scanobj.bdf.spw object, while spw_reffreq iterates over the SpectralWindow.xml rows. Note that the two orderings disagree: spworder starts at 1988 MHz, while spw_reffreq lists the 4488-6384 MHz windows first.

rfpipe save_cands should be a string naming the product to be saved?

Currently save_cands is a boolean that defines whether to save the candcollection and canddata for all detections; it is not possible to separate these two products.
If save_cands were a string naming the product (or "all" or True), it would allow finer control over which products are saved.
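A sketch of how such a string-valued preference could be interpreted (the helper name and product names here are hypothetical, not current rfpipe API); True or "all" preserves current behavior:

```python
# Hypothetical interpretation of a string-or-bool save_cands preference.
def products_to_save(save_cands):
    valid = {"candcollection", "canddata"}
    if save_cands in (True, "all"):
        return valid                # current behavior: save everything
    if save_cands in valid:
        return {save_cands}         # finer control: save one product
    return set()                    # False or unrecognized: save nothing

print(products_to_save("canddata"))   # save only the CandData product
```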

sum baselines for either RFI detection or low-sensitivity search

The sum of baselines could be dedispersed and searched just as individual baselines are. That would reproduce the search at lower sensitivity, but it would be less likely to suffer from issues with extremely intense bursts.

Alternatively, the incoherent baseline data could serve as an anticoincidence signal, since it might be more susceptible to RFI.

search dedispersed integrations with partial data?

Dedispersion typically reduces the number of integrations that can be searched with nominal sensitivity (i.e., all channels). It may still be scientifically interesting to search sub-optimal integrations. For example, some bursts may have structure only in the top of the band.

However, it may be preferable to have a formalized procedure for defining more matched filters. Subband bursts could be found if they can be parameterized.

manage many candidates in a segment

Bad data can trigger candidates in every image. This is slow and could lead to memory problems.
We should look for ways to avoid being saturated by candidates.

Saving CandData object

There is likely to be future interest in machine classification of candidates. The png is currently the only way the image and spectra are saved, but pngs are not great input to a machine classifier. Saving the CandData object may make it easier to add ML in the future.
Ideally we would save CandData and generate plots from it, but it is not clear how to do that. More likely, we need to save CandData, CandCollection, and pngs, and associate them offline.

UnboundLocalError when running with default clustering parameters, and saveplots = True

Running the pipeline_scan with the following state on 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut data:

st = rfpipe.state.State(sdmfile=dataloc, sdmscan=7, inprefs={'dtarr': [1,2,4], 'npix_max': 512, 'gainfile': gainloc, 'saveplots': True, 'savecands': True, 'maxdm': 1000, 'applyonlineflags': False, 'fftmode': 'cuda', 'flaglist': []})

cc = rfpipe.pipeline.pipeline_scan(st)

gives the following error (this error doesn't show up if clustercands is set in the state):

2018-10-28 16:27:17,175 - rfpipe.candidates - INFO - Calculating features for 1 candidate.
2018-10-28 16:27:17,181 - rfpipe.candidates - INFO - Saving CandData to /hyrule/data/users/kshitij/rfgpu/cands_16A-459_TEST_1hr_000.57633.66130137732.scan7.cut.7.1.pkl.


UnboundLocalError Traceback (most recent call last)
<ipython-input> in <module>
1 start = time.time()
----> 2 cc = rfpipe.pipeline.pipeline_scan(st)
3 end = time.time()
4 print(end-start)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in pipeline_scan(st, segments, cfile, vys_timeout)
26 for segment in segments:
27 candcollection += pipeline_seg(st, segment, cfile=cfile,
---> 28 vys_timeout=vys_timeout)
29
30 return candcollection

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in pipeline_seg(st, segment, cfile, vys_timeout)
37
38 data = source.read_segment(st, segment, timeout=vys_timeout, cfile=cfile)
---> 39 candcollection = prep_and_search(st, segment, data)
40
41 return candcollection

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/pipeline.py in prep_and_search(st, segment, data)
49
50 if st.prefs.fftmode == "cuda":
---> 51 candcollection = search.dedisperse_search_cuda(st, segment, data)
52 elif st.prefs.fftmode == "fftw":
53 candcollection = search.dedisperse_search_fftw(st, segment, data)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/search.py in dedisperse_search_cuda(st, segment, data, devicenum)
211 if st.prefs.savecands or st.prefs.saveplots:
212 # triggers optional plotting and saving
--> 213 cc = reproduce_candcollection(cc, data)
214
215 candidates.save_cands(st, candcollection=cc)

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/search.py in reproduce_candcollection(cc, data, wisdom)
434 cd = rfpipe.reproduce.pipeline_imdata(st, candloc, data_corr,
435 cpuonly=True, **kwargs)
--> 436 cc1 += candidates.save_and_plot(cd)
437
438 # TODO: validate that reproduced features match input features?

~/anaconda3/envs/rfgpu/lib/python3.6/site-packages/rfpipe-1.1.5-py3.6.egg/rfpipe/candidates.py in save_and_plot(canddatalist)
347 save_cands(st, canddata=canddata)
348 if st.prefs.saveplots:
--> 349 candplot(canddata, cluster=(clusters[i], clustersizes[i]), snrs=snrs)
350
351 return candcollection

UnboundLocalError: local variable 'clusters' referenced before assignment

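The traceback points at a variable assigned only inside a conditional. A minimal illustration of the failure mode and a defensive fix (not the actual rfpipe code), assuming clusters/clustersizes are only computed when clustering runs:

```python
# Minimal stand-in for save_and_plot: clusters is assigned only when
# clustering runs, but referenced unconditionally when plotting.
def save_and_plot(canddatalist, clustercands=False, saveplots=True):
    if clustercands:
        clusters = [0] * len(canddatalist)       # stand-in for real labels
        clustersizes = [len(canddatalist)] * len(canddatalist)
    else:
        # Without these defaults, referencing clusters below raises
        # UnboundLocalError whenever clustering is off.
        clusters = [None] * len(canddatalist)
        clustersizes = [None] * len(canddatalist)

    plotted = []
    for i, cd in enumerate(canddatalist):
        if saveplots:
            plotted.append((cd, clusters[i], clustersizes[i]))
    return plotted

print(save_and_plot(["cand0"], clustercands=False))
```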

multithreaded data_prep

Currently the data_prep function takes the state and data and applies calibration and flagging in place. This uses only one thread, which slows it considerably.
Calibration has been made functional, so it can now be scheduled separately, perhaps with separate threads working on separate subbands.
Flagging is also still dependent on rtpipe, single-threaded, and quite slow.
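A sketch of per-subband parallelism with a thread pool (the calibration function and the subband split are illustrative; threads can help here because large numpy operations release the GIL):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def apply_cal(subband):
    return subband * 2.0   # stand-in for calibrating one subband

def data_prep_parallel(data, nsubband=4, nthreads=4):
    # Split on the channel axis and calibrate subbands in separate threads.
    subbands = np.array_split(data, nsubband, axis=2)
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        calibrated = list(ex.map(apply_cal, subbands))
    return np.concatenate(calibrated, axis=2)

data = np.ones((10, 351, 256, 2))   # (time, baseline, channel, pol)
out = data_prep_parallel(data)
print(out.shape)
```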

manage gainfile solution time

The realtime system could apply telcal solutions that differ from offline processing. We need rfpipe to return the solution time (an mjd array of length 1). Most of that is done.
This value should be attached to the candcollection. It also needs to be passed (by realfast controller) to the sdm builder to tell it which solutions to attach to SDM.

Use number of clusters as an RFI detector

We sometimes find that RFI triggers many detections that do not cluster well. They tend to be found with many different (l, m, DM, dt) values.
Could we use the number of clusters in a segment as a trigger for rejecting RFI? For example, we could set a parameter in the preferences (e.g., max_clusters) that is tested after clustering. If more than that many clusters are found, reject all candidates in the segment. Alternatively, one could use it to trigger the generation of a single candidate plot rather than all plots in the segment.
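A sketch of the proposed test (the max_clusters parameter and return convention are hypothetical; label -1 is treated as unclustered noise, following the hdbscan convention):

```python
# Hypothetical post-clustering RFI test: too many clusters in a segment
# means the segment is probably dominated by RFI.
def reject_if_rfi(cluster_labels, max_clusters=10):
    nclusters = len(set(label for label in cluster_labels if label != -1))
    if nclusters > max_clusters:
        return []              # reject all candidates in the segment
    return cluster_labels

print(reject_if_rfi([0, 0, 1, 1, 2]))          # few clusters: kept
print(reject_if_rfi(list(range(50))))          # many clusters: rejected
```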

rfpipe save_noise should return values rather than save?

Currently, running a search will optionally produce a pkl file with noise properties. realfast needs to look on disk to find these files and then ingest them.
It would be smarter to have the function return the values to realfast. That would require rfpipe to manage saving itself in standalone mode.

Very low SNR (<7) candidates being clustered and plotted

The pipeline is generating and clustering candidates with SNR below the threshold (seen in all cases: an injected transient on simulated data, an FRB in real data, and real data with an injected transient).
It could be an issue with the SNR calculation on GPU vs CPU, since this was encountered using rfgpu.
An example of the above:
t0 = time.Time.now().mjd
meta = rfpipe.metadata.mock_metadata(t0, t0+10/(24*3600), 27, 4, 32*4, 4, 5e3, datasource='sim', antconfig='D')
st = rfpipe.state.State(inmeta=meta, inprefs={'dtarr': [1, 2], 'npix_max': 512, 'savecands': True, 'saveplots': True, 'maxdm': 500, 'applyonlineflags': False, 'flaglist': [], 'clustercands': True, 'fftmode': 'cuda'})
data = rfpipe.source.read_segment(st, 0)
mock = rfpipe.util.make_transient_params(st, snr=80, data=data, ntr=3)
st = rfpipe.state.State(inmeta=meta, inprefs={'dtarr': [1, 2, 4], 'npix_max': 512, 'savecands': True, 'saveplots': True, 'maxdm': 500, 'applyonlineflags': False, 'flaglist': [], 'clustercands': (4, 2), 'simulated_transient': mock, 'fftmode': 'cuda'})
cc = rfpipe.pipeline.pipeline_scan(st)

the mocks used were:
[(0, 971, 77.4, 0.005, 0.18785936515144086, -0.0016023057774304935, 0.0), (0, 21, 129.0, 0.02, 0.09468444540203508, 0.0025197170973120567, 0.0), (0, 1859, 336.05, 0.005, 0.1892371962894011, 0.0, -0.0029736458931088282)]

[attached candidate plot: cands_test_58419 88112078935 1 1_seg0-i466-dm3-dt1]

Also, for the data 16A-459_TEST_1hr_000.57633.66130137732.scan7.cut, the max SNR reported by clustering differed from the one on the plots. Relevant logger output is attached.
As can be seen below, for the cluster with ~1300 candidates, clustering reports a max SNR of 31.578176498413086, but when reproducing the candidate the SNR was reported as 22.6.

[screenshot attached: logger output, 2018-10-28]

RRAT observations of 2012 giving error on using onlineflags

As can be seen in the attached screenshot, when running rfpipe pipeline_scan on RRAT observations taken in 2012, rfpipe throws errors indicating that online flags may not be present for these observations. The issue is resolved when 'applyonlineflags' is set to False.

[screenshot attached: error output, 2018-08-24]

vysmaw app overwriting some data values

vysmaw_reader chooses indexes in time, baseline, channel, and pol. The logging statement about twice-written fields is sometimes triggered:

data value alread set at 12 191 192 1
data value alread set at 11 191 192 3
data value alread set at 10 191 192 3
data value alread set at 9 191 192 3
data value alread set at 8 191 192 3
data value alread set at 7 191 192 3
data value alread set at 6 191 192 3
data value alread set at 5 191 192 3
data value alread set at 4 191 192 3
data value alread set at 3 191 192 3
data value alread set at 2 191 192 3
data value alread set at 1 191 192 3
data value alread set at 0 191 192 3

Not clear what the pattern is yet, but it may be tied to "missing" spectra (#26).

move to DM transform

Currently, rtpipe uses a brute-force dedispersion algorithm. We should move to a tree-like algorithm that actually transforms from (f, t) to (DM, t) space.

complete baseline and channel weighting implementation

The development branch has functions to weight channels and baselines according to an estimate of their variance. Testing on ideal data shows that it improves detection significance when a few channels are noisier than the rest. However, FRB 121102 data shows worse SNRs when using channel weighting, while baseline weighting seems to show a small systematic improvement in FRB 121102 tests.
It seems that the variance estimates are not reliable on real data, so these weighting schemes are turned off by default. Full testing (and perhaps tuning of the variance calculation) is needed.
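For reference, inverse-variance channel weighting can be sketched as follows (illustrative, not the development-branch implementation):

```python
import numpy as np

# Down-weight channels in proportion to their estimated variance so a few
# noisy channels do not dominate the search statistic.
def channel_weights(data):
    var = data.var(axis=(0, 1, 3))    # variance per channel
    w = 1.0 / var
    return w / w.max()                # normalize so the best channel has weight 1

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10, 8, 2))   # (time, baseline, channel, pol)
data[:, :, 3, :] *= 5.0                   # make channel 3 noisy
w = channel_weights(data)
print(w.argmin())                         # the noisy channel gets the smallest weight
```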

Could the incoherent sum of baselines be used as anti-coincidence filter?

Very strong RFI often seems to be spatially incoherent. These events can trigger a huge mess of candidates, but are easily rejected by eye. Perhaps an anti-coincidence filter could help remove them sooner.
The idea would be to sum over baselines and take the amplitude to generate an incoherent, wide-field data stream, then dedisperse it and search for transients. Any detections there are likely (but not guaranteed!) to be dominated by RFI.
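The incoherent stream can be sketched as a sum over the baseline axis followed by an amplitude (the axis ordering here is an assumption):

```python
import numpy as np

def incoherent_stream(data):
    # data: complex visibilities of shape (time, baseline, channel, pol).
    summed = data.sum(axis=1)           # incoherent sum over baselines
    return np.abs(summed).mean(axis=2)  # amplitude, averaged over pol -> (time, channel)

data = np.ones((16, 351, 64, 2), dtype=np.complex64)
stream = incoherent_stream(data)
print(stream.shape)
# stream is now a wide-field (time, channel) series that can be
# dedispersed and searched as an anti-coincidence channel.
```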

Move some functions to GPU

Check out https://cupy.chainer.org. It has a good mapping of numpy onto CUDA functions, plus flexible kernel design syntax.
A good plan could be to move some of the prep functions over so that data are put on the GPU during the data_prep call.
First, we need to investigate how many moves in and out of memory would be required. If we use both cupy and rfgpu, would data need to come off the GPU and then back on again?
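A minimal sketch of keeping data resident on the GPU across prep steps with cupy (assumes cupy is installed; the prep step shown is a stand-in, not an rfpipe function):

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None   # fall back to CPU if cupy is unavailable

def prep_on_gpu(data_cpu):
    if cp is None:
        return data_cpu - data_cpu.mean(axis=0)   # CPU fallback
    data = cp.asarray(data_cpu)                   # one host->device copy
    data = data - data.mean(axis=0)               # stand-in prep step, runs on GPU
    return cp.asnumpy(data)                       # one device->host copy

out = prep_on_gpu(np.ones((8, 4)))
print(out.sum())
```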

scansummary plot not useful in OTF mode

The scansummary bokeh plot currently assumes that scans are long. Need to reevaluate in OTF mode, where subscans are short and scans are extended on the sky.

search_thresh_fftw image peak calc not right when nant<27

When providing mock metadata with 20 antennas, some tests fail. The test_seg.py logging shows that the state method for finding a peak pixel is finding weirdness in the input image. Something is probably wrong in the structure of the data passed to this function.

state.py                   646 WARNING  More than one peak pixel ([ 0  0  0 ... 31 31 31], [ 0  1  2 ... 29 30 31]). Using first.
state.py                   646 WARNING  More than one peak pixel ([ 0  0  0 ... 31 31 31], [ 0  1  2 ... 29 30 31]). Using first.

outlier resistant mean time subtraction

Mean time subtraction is often biased by simulated transients, which makes testing confusing. It can also produce negative candidates.
It would be best to estimate the mean robustly via the MAD and subtract that.
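A sketch of a MAD-based robust time subtraction (illustrative, not the rfpipe implementation): estimate the per-channel level with the median, flag points beyond a MAD-based threshold (e.g., bright simulated transients), and subtract the mean of what remains.

```python
import numpy as np

def robust_time_subtract(data, nsigma=5.0):
    med = np.median(data, axis=0, keepdims=True)
    sigma = 1.4826 * np.median(np.abs(data - med), axis=0, keepdims=True)  # MAD -> sigma
    keep = np.abs(data - med) < nsigma * sigma          # mask outliers
    mean = np.where(keep, data, 0.0).sum(axis=0) / np.maximum(keep.sum(axis=0), 1)
    return data - mean

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))     # (time, channel)
data[50, 3] += 100.0                 # bright simulated transient
out = robust_time_subtract(data)
print(out.shape)
# The transient no longer biases the subtracted mean, so it survives
# in the output instead of dragging its channel negative.
```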

segmentation drops integrations from end

Creating segments currently assumes they must fall on integration boundaries and all be the same size. That forces us to drop some integrations at the end of a scan (up to one segment's worth).
Can we modify this scheme to search the end integrations as a partial segment?
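One possible scheme, sketched below, keeps a final partial segment instead of dropping the trailing integrations (illustrative, not the current state.py logic):

```python
# Segment boundaries that allow a shorter final segment.
def segment_bounds(nints, segment_size):
    bounds = []
    start = 0
    while start < nints:
        bounds.append((start, min(start + segment_size, nints)))
        start += segment_size
    return bounds

print(segment_bounds(1000, 300))   # last segment is partial: (900, 1000)
```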

change number of DM trials as function of dt

Currently we search a fixed DM grid at each resampling scale. This is clearly suboptimal, since larger values of dt require fewer DM trials.
It is not clear how to use DM indices if we change the DM grid for larger dt. One possibility is to take a subset of the dmarr indices that is approximately correct for larger dt, so:

dmarr = [0, 10, 20, 30, 40, 50, 60]
dt = 1
dminds = range(len(dmarr))      # full DM grid at dt=1
for dmind in dminds:
    ...search at (dmarr[dmind], dt)...
dt = 2
dminds = [0, 2, 4, 6]           # coarser subset, or whatever
for dmind in dminds:
    ...search at (dmarr[dmind], dt)...
