e-merlin / emerlin_casa_pipeline Goto Github PK

This is CASA eMERLIN pipeline to calibrate data from the e-MERLIN array. Please fork the repository before making any changes and read the Coding Practices page in the wiki. Please add issues with the pipeline in the issues tab.

License: GNU General Public License v3.0

Python 99.79% CSS 0.21%

radio-astronomy

Description

The e-MERLIN CASA Pipeline (eMCP) is a python pipeline working on top of CASA to process and calibrate interferometric data from the e-MERLIN array. Access to data information, statistics and assessment plots on calibration tables and visibilities can be accessed by the pipeline weblog, which is updated in real time as the pipeline job progresses. The output is calibrated data and preliminary lookup images of the relevant fields. It can calibrate mixed mode data that includes narrow-band high spectral resolution spectral windows for spectral lines, and also special observing modes as pseudo-wideband observations. Currently no polarization calibration is performed.

Dependencies

CASA v5.5+ (see https://casa.nrao.edu/)
aoflagger v2.9+ (see https://sourceforge.net/projects/aoflagger/), only needed to calibrate L-band data.

Download

If you have git installed, you can get the pipeline using:
git clone https://github.com/e-merlin/eMERLIN_CASA_pipeline.git

If you don't have git, you can download and unzip the files from here.

To install aoflagger check out either A. Offringa's websites:

aoflagger: https://sourceforge.net/projects/aoflagger/
wsclean: https://sourceforge.net/projects/wsclean/ (not required for running the pipeline)

or (recommended) use the handy anaconda scripts to instantly install dependcies within the conda environment. To do this follow the instructions in this repo.: https://github.com/jradcliffe5/radio_conda_recipes

Quick start

If you have received calibrated data from the observatory and you want to refine the calibration, you can:

[Optionally] Modify default_params.json or add manual flags to manual_avg.flags with your desired values.
Run:

casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r calibration

Usage

Normal pipeline execution. When you have in your working directory the file inputs.ini and you have extracted the pipeline:

casa -c /path/to/pipeline/eMERLIN_CASA_pipeline.py

To run the parallelized version using MPI in CASA you can use:

mpicasa -n <num_cores> casa -c /path/to/pipeline/eMERLIN_CASA_pipeline.py

Optional arguments

Names in capital need to be set by the user:

  -h, --help                     show this help message and exit
  
  
  -i INPUTS_FILE
  --inputs INPUTS_FILE
                                 Inputs file to use. Default is inputs.ini
                                 
                                 
  -r          RUN_STEPS [RUN_STEPS ...]
  --run-steps RUN_STEPS [RUN_STEPS ...]
                                 Whitespace separated list of steps to run. Apart from
                                 individual steps, it also accepts "all",
                                 "pre_processing" and "calibration"
                                 
                                 
  -s           SKIP_STEPS [SKIP_STEPS ...]
  --skip-steps SKIP_STEPS [SKIP_STEPS ...]
                                 Whispace separated list of steps to skip
                                 
                                 
  -l
  --list-steps                   Show list of available steps and exit

You can get the list of available steps with:

casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -l

pre_processing
    run_importfits
    flag_aoflagger
    flag_apriori
    flag_manual
    average
    plot_data
    save_flags
    
calibration
    restore_flags
    flag_manual_avg
    init_models
    bandpass
    initial_gaincal
    fluxscale
    bandpass_final
    gaincal_final
    applycal_all
    flag_target
    plot_corrected
    first_images
    split_fields

Selection options are any combination of: a list of any individual step names, pre_processing, calibration or all

Examples of step selection

You need to specify which steps of the pipeline to run. Some example on how to choose steps:

Run all the calibration steps (ideal for observatory-processed data for which you want to tweak the calibration parameters). Includes all calibrations steps (see list above):

casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r calibration

Run all pipeline steps (you will need the raw FITS-IDI files for the initial step):

casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r all

Run only the pre-processing steps (usually executed by the observatory. Otherwise you need the raw FITS-IDI files):

casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r pre_processing

Any combination of the steps above, for example:

casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r plot_corrected first_images split_fields

Run all calibration steps except plot_corrected:

casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r calibration -s plot_corrected

Running the pipeline interactively from CASA

To execute the pipeline from a running CASA instance you need to write in the CASA shell:

run_in_casa = True
pipeline_path = '/path/to/pipeline_path/'   # You need to define this variable explicitly
execfile(pipeline_path + 'eMERLIN_CASA_pipeline.py')
eMCP = run_pipeline(run_steps=['calibration'])

Function run_pipeline parameters and defaults are: run_pipeline(inputs_file='./inputs.ini', run_steps=[], skip_steps=[]). Variables run_steps and skip_steps are python lists of steps as explained above.

Additional information

FAQ

How do I open the weblog?

The weblog consist of a series of html files. From the working directory you can open the file ./weblog/index.html with your preferred web browser.

How do I know what has been executed?

You can visit the tab Pipeline info in the weblog, where you will find which steps were executed. You will also find a link to the Pipeline log, the CASA log and two files with all the parameters used during the data processing.

I want to re-run the pipeline to improve the calibration, what do I change?

There are two main blocks: pre-processing and calibration. Most probably you will only need to repeat the calibration part. Recommended course of action:

Identify changes you want to include in the data reduction, like changing calibration parameters or adding manual flags.
Add or edit file manual_avg.flags with your flag commands (follow the CASA syntax).
Edit the file inputs.ini if you need to change the sources used or they intend.
Edit the file default_params.json changing any parameter the pipeline is using, if needed.
Run the calibration block of the pipeline with the command:

casa -c ./eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r calibration

Which flag files does the pipeline accept and what is the right syntax?

There are four different flag files accepted by the pipeline:

Flag file	Used by step	Notes
observatory.flags	flag_apriori	Created by the observatory with antenna slewing or other major faults. Please donot edit it yourself.
manual.flags	flag_manual	This is meant to flag the unaveraged data set during the pre-processing stage
manual_avg.flags	flag_manual_avg	This is meant to flag the averaged data set during the calibration stage
manual_narrow.flags	flag_manual_avg	Use this to add flag commands for narrow-band spectral line data set

For the syntax needed for CASA follow Basic Syntax Rules in the CASA docs flagdata (end of the section). The main rules are:

Use only ONE white space to separate the parameters (no commas). Each key should only appear once on a given command line/string.
There is an implicit mode for each command, with the default being 'manual' if not given.
Comment lines can start with '#' and will be ignored. The parser used in flagdata will check each parameter name and type and exit with an error if the parameter is not a valid flagdata parameter or of a wrong type.

Example for e-MERLIN:

mode='manual' field='1331+305' antenna='' timerange='10:00:00~10:11:30'
mode='manual' field='' antenna='' timerange='' spw='0:0~30'
mode='manual' field='' antenna='Mk2' timerange='09:05:00~16:27:00'
mode='manual' field='1258-2219' antenna='' timerange='12:57:01~12:59:59'
mode='quack' field='1258-2219,1309-2322' quackinterval=24.

Example from the CASA docs:

scan='1~3' mode='manual'
# this line will be ignored
spw='9' mode='tfcrop' correlation='ABS_XX,YY' ntime=51.0
mode='extend' extendpols=True
scan='1~3,10~12' mode='quack' quackinterval=1.0

How do I fill the source names in inputs.ini if I don't know which fields were observed?

By default you should have all the information from the observatory. But if you only have the FITS-IDI and don't know the source names, you can run the first pipeline step alone casa -c eMERLIN_CASA_pipeline/eMERLIN_CASA_pipeline.py -r run_importfits. When the execution is finished, open the weblog and go to the tab Observation Summary where you will find the fields included in the MS and the listobs file with all the scans.

As a general rule, an observation will have 1331+3030 (3C286) as flux scale calibrator, 1407+2827 (OQ208) as bandpass calibrator and 0319+4130 (3C84) as bright ptcal calibrator. To distinguish between target and phasecal, you should look for alternating scans, and the target is usually the one with longer scans.

emerlin_casa_pipeline's People

Contributors

Stargazers

Watchers

Forkers

jradcliffe5 celeritas rainsworth drawilliams apt-astro jmoldon kelvinwandia miguelcarcamov chris-j-skipper lucatelli

emerlin_casa_pipeline's Issues

Import QA scripts from Parseltongue version + flagging diagnostics

Javier action item

Check for mixed mode data sets

Easy to implement as can look at spw sizes in the spw table.

Default rfigui strategy if per-source strategies not defined

Input file

GUI interface

Add remembrance function for previous input
Add read history tab to work with inbase
Add message box to ensure that the run parameters are ok!

Write the pipeline!

AOFlagger should use a bright-source strategy for 0319+415

Currently it uses default_faint, but it should be using a strategy similar to the one for 1331+305.

Just need to add a 0319+415.rfis as a default strategy and use that one for field name 0319+415.

Adding version numbers

It would be nice to introduce version numbers to the pipeline, through the releases functionality in Github. In this way, users can report version along any problems which simplifies bugfixing.

It would be nice to have the version in the log. Without release versions, one way to do this is to read the latest commit ID from the repo. This however only works if the users checked out the pipeline with git, i.e. not if downloading as zipfile. And if the user has git installed. If these requirements are satisfied, then it is possible to write it to log using the following command. (Now assuming the the script was run using the casa -c and not using execfile within casa)

import subprocess
repoversion = subprocess.check_output(['git', '--git-dir=' + pipeline_path + '/.git', 'rev-parse', 'HEAD'])
logger.info('Starting pipeline version ' + repoversion)

The optimal way to handle version numbers should probably be to use releases instead, and hence I'm only putting this here as a note while we're investigating the releases route.

Add failsafe to check for Cm axis offset

Cm axis offset is corrected for data AFTER 16 February 2015. Need to add check in measurement set and apply correction if check fails.

Test data sets

We can write here which types of data sets we want to have as a reference to check the pipeline and to add new features. Jack has added in the wiki links to two data sets observed in mixed mode. We also need smaller data files to quickly check that the pipeline works. I will now link other fits files in the wiki (C, L, K band, just a few visibilities, a single source, etc). They come from the data exporter, so no modification have been added, but still they are < 20 Mb each.

List of possible files to test different but typical situations:

running_test: Very small dataset to check that pipeline works.
standard_C: Continuum C band observation (Target + phase ref + 3C286 + OQ208 + 3C84)
no_amp_C: Continuum C band observation, 3C286 missing for one/several or all antennas (Target + phase ref + OQ208 + 3C84)
no_bp_C: Continuum C band observation, 3C286 missing for one/several or all antennas (Target + phase ref + 3C286 + 3C84)
mixed_C: Mixed mode (continuum + spectral lines)
partial_C: Continuum C band with one or more antennas failing (present but garbage data) for part or the whole observation.

And the same for L, and K bands.

Feel free to add more situations or rename them.

Fix antenna table

Axis offsets need to be included @ the importfitsIDI function, see DARA script.
Add check for correct antenna diameters and mounting type.

Nb: cannot use station numbering to check as importfitsIDI removes the blank antenna numbers.

Write steps already conducted to history file of measurements set

Need to write steps conducted in the pipeline into the measurement set. Add unique identifier to be able to pick up when selecting in GUI

logging

Produce log using the standard module logging

Change dfluxpy.py to take into account u,v,w projection

Setting MS order with setToCasaOrder?

Short version: don't use setToCasaOrder, it makes things even slower!

We discussed some time ago if we should apply setToCasaOrder just after importfitsidi in order to make the rest of the steps faster.

I have conducted some calibration starting from fits files and running

importfitsidi 
setToCasaOrder
fixvis
flag autocorrelations
Hanning smoothing
aoflagger + a-priori flags
split
etc...

The test was conducted on richards (12 CPU @ 3.50GHz, 64 GB of RAM). The data set if 184 GB in the original fits format.

Times to run different steps with setToCasaOrder:

Times to run different steps with NO setToCasaOrder:

Conclusions:

setToCasaOrder takes a long time. About 3h spent on setting the order of a 184 GB file.
Most of the following steps with unaveraged data take the same ammount of time.
Aoflagger seems to take much longer on ordered data. Making it more than 1h slower.
The split step (including averaging to 128chan/spw and 2sec) is also slower when applying setOrder.
The rest of the steps after averaging does not make a difference.
I see the same behaviour on unaveraged data.

I will update the code to take out that option. Maybe leave if it there as an option.

Pipeline needs correct fluxscaling!

Incorporate dfluxpy.py

Pipeline continues despite no aoflagger

2017-10-12 13:32:03 | INFO | Running AOFLagger for field 1407+284 (0) using strategy pipeline/aoflagger_strategies/default/1407+284.rfis
sh: aoflagger: command not found

But then the pipeline continues with the rest of things. I think that an exception should be raised in case some part of the desired tasklist cannot be run, and the script should halt at that point.

Compatibility with CASA 5.0

For CASA 4.* we need the lines:

from casa import table as tb
from casa import ms

Without them, it cannot find the tb tasks, for example.

However CASA 5.0 crashes if those lines are there, but works fine if not.

We may implement a try/except chek of the CASA version, but I think it is better if we can find a common way to load all the CASA tasks uniformly, if possible. Also, we should get rid of:

from tasks import *
from casa import *

We should check any changes with that at least in CASA 4.7 and CASA 5.0.

Add casa logging to pipeline log file

Currently the casa output is not appended to the pipeline logfile. This can be fixed in two ways (as described at the page https://casa.nrao.edu/casadocs/casa-5.1.0/usingcasa/casa-logger)

Run CASA with "casa --logfile otherfile.log", but this makes this the responsibility of the user and it's not nice. Instead one should:
modify the pipeline to make CASA use the pipeline logfile. Quick hardcoded fix is to include the line
casalog.setlogfile('eMCP.log') in the file eMERLIN_CASA_pipeline.py after initialising the logger, i.e. at line 30 somewhere around the current command logger.addHandler(consoleHandler).

Amplitude scale

Amplitude scale
-- Fluxscale
-- dfluxpy
-- setjy
-- Plots

Pipeline log messages should have UT timestamps

CASA timestamp its log messages with UT system time. However, the Pipeline log errors are stamped with local time. See the attached log below, where I ran the pipeline on a Swedish computer. The sequence of timestamps in the log appear weird, jumping back and forth between 08... and 10... when the casa/pipeline outputs log messages. The pipeline should, for easy comparison, stamp the log messages with UT time stamps.

2017-10-12 08:18:43 WARN FITSKeywordUtil::getKeywords (file ../../fits/FITS/FITSKeywordUtil.cc, line 491) Ignoring field 'tmatx11' because its type does not match already created field tmatx. Continuing.
2017-10-12 10:19:43 | INFO | End importfitsIDI
2017-10-12 10:19:43 | INFO | Start UVFIX
2017-10-12 10:50:40 | INFO | End UVFIX
2017-10-12 10:50:42 | INFO | Start flagdata_autocorr
2017-10-12 08:51:07 WARN FlagDataHandler::preLoadColumn (file ../../flagging/Flagging/FlagDataHandler.cc, line 964) PROCESSOR sub-table is empty. Assuming CORRELATOR type.
2017-10-12 10:53:39 | INFO | End flagdata_autocorr

separationaxis auto/baseline tests

We need to test and check the differences in output when using aoflagger 2.9 on individual sources in these situations:

MMS generated with ms2mms and separationaxis='auto'
MMS generated with ms2mms and separationaxis='baseline'
MMS generated with ms2mms_fields and separationaxis='auto'
MMS generated with ms2mms_fields and separationaxis='baseline'
MS

Retreive all MS information

Basically repeating what is done in EVLA_pipe_msinfo.py from the EVLA pipeline. Read everything with a function at the beginning and store efficiently, maybe with a dictionary. We need, among other possible things:

listobs
channels, num_spw, spw_bandwidth, reference_frequencies, band
tst_delay_spw (range of good channels without IF borders)
field_ids, field_positions, field_names, field_scans
scan_summary, integration_times, quack_scan_string
nameAntenna, numAntenna
Plot antennas (plotants)
Plot Elevation vs time
Plot Azimuth vs time

Exclude lines from being flagged

Allow the user to define range of frequencies in an external file so they are not flagged.

Failsafe to check if external (non-CASA) packages are available

Need a check for executables of non-CASA packages exist in their system. If checks fail then GUI MUST NOT show these options

Tkinter gui for MS selection

Optional pre-averaging delay correction

As mentioned:
Sometimes (a minority of data sets, but not rare) there are very high delays, usually to Cm, for some scans, typically on the phase-ref/target, such that the delay calibration has to be done before channel averaging. Please can we have an optional step near the start to run gaincal 'K' on all calibrators? Ideally just before split, but I do not know whether it should be done before or after hanning/pre-flagging for data with bad rfi (at C-band, no pre-flagging etc. is OK). Then before split applycal should be run to apply the delay for all calibrators to themselves, and for the phase-ref to the target, and split needs to use Corrected.

In terms of diagnostics to decide if this is needed, averaging central ~100 chans of raw data and plotting the data (~30-sec averages to 'see' phase ref) amp v. time may show a drop in amps, usually to Cm and often different for RR and LL. You need to note scan numbers where any jumps come on the phase-ref as the intermediate target scan will have an undefined delay and need flagging. Then plot phase v. channel, phase-ref scan averages in time, central few tens chans, to check that high delay is the problem. Thanks.

Add a user-defined flag command file

Conflict with casa tasks when loading our libraries

There is a problem with casa accessing its tasks when we try to load our functions with:
sys.path.insert(0,'./CASA_eMERLIN_pipeline') # add github path at runtime
I have seen it with the casa task virtualconcat and I suspect is the same problem Jack had with importfitsidi. When the sys.path is not modified virtualconcat works normally, but when it is modified casa doesn't even recognize the task.

As a temporary solution I have written the task ms2mms_fields, which uses virtualconcat, directly in the pipeline instead on eMERLIN_CASA_functions.py, and then it works.

The solution to this is correctly importing eMERLIN_CASA_functions.py as a submodule instead of a different module linked with sys.path. I still don't know how to do that but it is possible (and recommended) for sure.

Weights

We need to test how the Wt are affecting the calibration. We have the option to initialize the weights after importing using statwt.

We need to check in particular data sets with the Lo included.

Investigate use of AOFlagger 2.7 and before

-field option is unavailable in aoflagger v2.7 and earlier so should we split and concat the data or force users to upgrade to v2.8+

Documentation

Hi, thanks very much for this. As an experienced user but not a coder, I would like to start defining requirements for documentation. I think that initially we should aim at people with a good knowledge of interferometry and CASA to enable them to test the pipeline. I would find useful

an overview of what to do to get started in a bit more detail
- minimum requirements e.g. if you don't have/want aoflagger is that OK as long as you don't ask for it?
summary or flow-chart of what each step does (e.g. where is fixvis done, averaging...)
control of averaging especially for mixed mode

Also, you might want to consider using 'steps' as in the ALMA manual QA2 (I can send an example if you want) rather than the AIPS 1-0 system for running each module...

cheers
Anita

Pipeline should check if file/directory exists before removing it

The pipeline tries to remove files and directories before running a few tasks, e.g. importfitsidi, hanning etc. This is attempted even if the directory in question does not exist. Hence the user sees message of the type
2017-10-12 10:53:40 | INFO | Start hanning
rm: cannot remove 'CY4222_C_20171001_hanning.ms': No such file or directory
The pipeline should check if a file exists before trying to remove it. This can be done e.g. by using the python module os.path.exists(path).

Delay and Preliminary BandPass calibration

Delay and Preliminary BandPass calibration:
-- Calibrate delays of calibrators
-- Calibrate phases of BP calibrator
-- Calibrate A&P of BP calibrator
-- Initial BP table
-- Additional flags on BP calibrator
-- Plots

Standard test to check pipeline is correct before release.

Hanning smoothing for mms

Hanning smoothing can also be executed on mms files using mstransform. I use a different order for the initial steps:

partition (to generate mms)
mstransform (Hanning smoothing)
aoflagger

All of those need to be on individual sources (if AOflagger is v2.7) or a combined measurement set (if AOFlagger is v2.9)

e-MERLIN models for calibrators

Need to get the eMERLIN models (CASA compatible) to be included with the pipeline. @amsrichards do you have these? I only have 3C286 @ C-band which was used for DARA.

Method to check/track previous calibration

Add aoflagger version check

see #23 comments

Standardised test. Benchmarking AOflagger versions <2.8 & 2.9+ regimes

I'll make some plots of time vs. memory with the 2.7 & 2.9 versions (i.e. forced mms shape compared to optimised). I guess I can do the same for no. of cores with different available RAM vs time.

I guess in the end we would want to test this on a 24hr run i.e. a ~300GB data set. I don't have that time on this computer atm but I will do the tests on a 106GB data set I have here.

Preliminary calibration

Preliminary calibration
-- Calibrate phases of calibrators
-- Calibrate A&P of calibrators
-- Additional flags
-- Plots

Final calibration

Final calibration
-- BP with spectral index information
-- Phase calibration
-- A&P calibration
-- Applycal
-- Plots

Flags management

Use flagmanager to track what flags are being applied within CASA and and give version names to them.

Minor bugs in fluxscale and bandpass_1_sp steps

The fluxscale step should also mention sety, as this is used to set the newly found flux densities of all calibrators (except the original flux standard) as models in the MS.

Since the fluxscale step does not always use all antennas, the _fluxscaled table is not useful for all antennas and in bandpass_1_sp the original allcal_ap.G1 should be used. This is presumably normalised and will correct the time-variable element of the amps whilst the model with spectral index is used for the flux scale (please check that this does work as I think!). (The later gaincal steps re-derive a properly scaled amp gain table for all cals).

Discuss calibration steps

Javier and I have the basis of the pipeline almost ready. We now need to get everyone's help in setting out best calibration steps in order to make the pipeline as good as it can be. Thanks to Javier, it should now work on any machine (minus flagging if you dont have aoflagger)

I propose that we have a telecon/skype soon in order to discuss how we are going to proceed, map out a plan for the calibration steps + how we are going to split the workload.