open-gamma-ray-astro / gamma-astro-data-formats

Data formats for gamma-ray astronomy

Home Page: https://gamma-astro-data-formats.readthedocs.io

License: Creative Commons Attribution 4.0 International

Python 68.70% Makefile 15.84% Batchfile 15.46%
astronomy gamma-ray data-format specification gamma-ray-astronomy

gamma-astro-data-formats's Introduction

Data formats for gamma-ray astronomy

Data formats for gamma-ray astronomy is a community-driven initiative to define a common and open high-level data format for gamma-ray instruments.

Stable versions

Stable versions of the spec are created via git tags and are shown as releases here: https://github.com/open-gamma-ray-astro/gamma-astro-data-formats/releases

HTML and PDF versions of stable releases are available via the version selector in the lower left on Read the Docs (for example, version 0.3).

Archived versions of the standard are available on Zenodo.

Building the documents locally

To build this document locally, clone this repository and install poetry, the tool used for dependency management:

$ python3 -m pip install [--user] poetry

Use --user if you are using a system Python installation; leave it out if you are already in a virtual environment or conda environment.

Install the dependencies for building this document:

$ poetry install

Make the html:

$ poetry run make html SPHINXOPTS="-W --keep-going -n --color -j auto"

These options enable additional warnings to make sure everything builds correctly, and run the build on multiple cores.

Start the python http server to get a preview in your browser:

$ python3 -m http.server -d build/html

You should then be able to browse http://localhost:8000 and see the document.

References

The following paper describes the context of this initiative and its evolution:

gamma-astro-data-formats's People

Contributors

adonath, cdeil, cosimonigro, jknodlseder, joleroi, lauraolivera, lmohrmann, maxnoe, peger089, tarekhc, woodmd, zblz


gamma-astro-data-formats's Issues

Fix point-like ``RAD_MAX`` energy binning

Hi all, @cdeil @cboisson @jknodlseder @bkhelifi @contrera

I'm kind of ashamed that I never realized this before... theta² cuts are applied as a function of reconstructed energy, while the ENERGY axis in all IRFs is true energy. Therefore adding a RAD_MAX column to the point-like IRFs is not enough to define energy-dependent cuts. We would also need to define the ERECO binning somewhere (as all IRFs only contain the true-energy binning definition), which definitely complicates the format.

The point-like IRFs section needs to be re-edited. I can think of several possibilities:

  1. Completely remove it and only support a constant energy/FoV theta cut (not very realistic, but obviously the simplest).
  2. Remove the energy dependency of RAD_MAX (only changing with the FoV). This is enough for HESS, for example, but not very efficient at low energies.
  3. Add keywords to define an ERECO binning (for cuts, bins will never overlap, so we can follow a classic linear/log binning with 3 keywords); the RAD_MAX column would then only need the dimension of the FoV coordinates (1 or 2).
  4. Add additional ERECO_LO, ERECO_HI columns. This would follow a standard closer to what we have for the rest of the axes, but would be terribly misleading (as the IRFs would not be a function of that axis).
  5. A separate HDU (with an identical BINTABLE format) containing RAD_MAX as a function of ERECO and the FoV axes. Point-like IRFs could contain a keyword pointing to that HDU. Something like EBOUNDS maybe? (A rough sketch follows below.)
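To make option 5 concrete, here is a purely hypothetical sketch (column names, shapes and the HDU name are illustrative only, not proposal text) of what such a separate RAD_MAX BINTABLE HDU could look like, built with astropy:

import numpy as np
from astropy.io import fits

# Hypothetical binning: reconstructed-energy and FoV-offset bin edges
n_ereco, n_theta = 20, 6
ereco = np.logspace(-2, 2, n_ereco + 1)          # TeV
theta = np.linspace(0.0, 3.0, n_theta + 1)       # deg
rad_max = 0.1 * np.ones((1, n_theta, n_ereco))   # deg, one table row holding the full grid

hdu = fits.BinTableHDU.from_columns([
    fits.Column(name="ERECO_LO", format=f"{n_ereco}E", unit="TeV", array=ereco[:-1][np.newaxis, :]),
    fits.Column(name="ERECO_HI", format=f"{n_ereco}E", unit="TeV", array=ereco[1:][np.newaxis, :]),
    fits.Column(name="THETA_LO", format=f"{n_theta}E", unit="deg", array=theta[:-1][np.newaxis, :]),
    fits.Column(name="THETA_HI", format=f"{n_theta}E", unit="deg", array=theta[1:][np.newaxis, :]),
    fits.Column(name="RAD_MAX", format=f"{n_theta * n_ereco}E", unit="deg",
                dim=f"({n_ereco},{n_theta})", array=rad_max),
])
hdu.name = "RAD_MAX"  # hypothetical EXTNAME; point-like IRFs would reference it via a header keyword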

Can anyone think of a better solution?

The first step should probably be 2 (which is better than what we had, and easy to implement), but I think we should also support an energy-dependent cut (it is what we are using now for IRF production, and with the huge CTA energy range it makes sense). Between solutions 3, 4 and 5, I probably prefer 5 or 3, although it feels too non-standard...

Opinions?

Triple-Gauss PSF format

I've added a formula for the triple-Gauss PSF:
http://gamma-astro-data-formats.readthedocs.org/en/latest/irfs/psf/psf_3gauss/index.html
as well as some background info on radial PSFs:
http://gamma-astro-data-formats.readthedocs.org/en/latest/irfs/psf/index.html#point-spread-function
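For reference, a minimal sketch of the dP/dr^2 to dP/dOmega conversion and the normalisation of a sum of Gaussians in the small-angle approximation (notation illustrative, not necessarily identical to the spec page):

.. math::
    \frac{dP}{d\Omega} = \frac{1}{\pi}\,\frac{dP}{d(r^2)}
    \quad\text{since}\quad
    d\Omega \simeq 2 \pi r \, dr = \pi \, d(r^2),

.. math::
    \frac{dP}{d\Omega}(r) = \sum_{i=1}^{3} A_i \, e^{-r^2 / (2\sigma_i^2)}
    \quad\Rightarrow\quad
    \int_0^\infty 2 \pi r \, \frac{dP}{d\Omega}(r) \, dr = \sum_{i=1}^{3} 2 \pi A_i \sigma_i^2 = 1 .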

I think some discussion / checks are needed:

  • Is the formula correct? I took it from the HAP code that fits the PSF and added the 1/pi factor to go from dP/dr^2 (which is what HAP fits) to dP/d\Omega (which is what Fermi and OGIP use as PSF value).
  • Is there a more sane / stable parametrisation for multi-Gauss (look at what PSFEx does) that we want to adopt?
  • Can the presentation / explanations be improved?
  • @mimayer - Does the PA exporter use the same parametrisation?
    ==> there is no 3-gauss exporter in PA.
  • @JouvinLea - Do you use multi-Gauss in HAP-FR exporters? Same parametrisation?
    ===> This is work in progress, HAP-FR can put whatever we decide here.
  • I'd like to check the Gammapy code
  • @mimayer - Is gammalib.GCTAPsf2D consistent with this definition?

Defining a common XML format for models in Fermi/LAT Science Tools and GammaLib

So far, the GammaLib software implements the XML format that was defined for the Fermi/LAT Science Tools (ST) for model definition. GammaLib has several additional spatial model components, and the XML format has been extended to also cover these cases. More spatial models are now also being added to the Fermi/LAT Science Tools, so to keep compatibility between GammaLib and the Fermi/LAT Science Tools it is important to agree now on a common naming convention. This could also be the moment to change some of the initial naming conventions so that model components have a more coherent set of names. It is proposed that the Fermi/LAT Science Tools and GammaLib implement proxies to ease the transition from the old to the new format, but ultimately only the new format should be used for model definition.

The discussion on this started on the Fermi/LAT Confluence and in an e-mail thread; here is just a summary of what has been proposed so far.

The proposal is to change the type attributes of the spatialModel elements in the XML file (see an example XML file below). The following types are proposed:

  • PointSource as replacement of SkyDirFunction in ST and GammaLib
  • RadialDisk as replacement for SpatialDisk in ST and DiskFunction in GammaLib
  • RadialGaussian as replacement for SpatialGaussian in ST and GaussFunction in GammaLib
  • RadialShell as replacement for ShellFunction in GammaLib
  • EllipticalDisk stays as is in GammaLib
  • EllipticalGaussian as replacement for EllipticalGauss in GammaLib
  • EllipticalShell if needed in the future
  • DiffuseIsotropic as replacement for ConstantValue in ST and GammaLib
  • DiffuseMap as replacement for SpatialMap in ST and GammaLib
  • DiffuseMapCube as replacement for MapCubeFunction in ST and GammaLib

In addition, the type attributes of the source elements in the XML file are proposed to be defined as follows:

  • PointSource for point sources (as is)
  • ExtendedSource for all radial and elliptical models (is DiffuseSource in ST)
  • DiffuseSource for all diffuse models (as is)

Below, for illustration of the type attributes and the source and spatialModel elements, is an example in the current XML file format:

  <source name="Crab" type="PointSource">
    <spectrum type="PowerLaw">
       <parameter name="Prefactor" scale="1e-16" value="5.7"  min="1e-07" max="1000.0" free="1"/>
       <parameter name="Index"     scale="-1"    value="2.48" min="0.0"   max="+5.0"   free="1"/>
       <parameter name="Scale"     scale="1e6"   value="0.3"  min="0.01"  max="1000.0" free="0"/>
    </spectrum>
    <spatialModel type="SkyDirFunction">
      <parameter name="RA"  scale="1.0" value="83.6331" min="-360" max="360" free="0"/>
      <parameter name="DEC" scale="1.0" value="22.0145" min="-90"  max="90"  free="0"/>
    </spatialModel>
  </source>

Where to put livetime info in IACT DL3?

@contrera @jknodlseder @cboisson @kosack, all - Where should we put livetime info in the IACT DL3 model / formats?

At the moment LIVETIME (and ONTIME and DEADC) are required keywords in the EVENTS header (see here).

Is that a good idea?

Other options might be:

  • As a column in the GTI table, i.e. have a livetime per GTI interval (a sketch of this option follows below)
  • As a column in the POINTING table, i.e. introduce START, STOP intervals there and have LIVETIME associated with those intervals. This is what Fermi-LAT does in their SPACECRAFT tables, see here and discussion here (where @jknodlseder and @kosack said they don't like this option).
  • As a column in a to-be-defined (not existing yet) "IRF validity time interval" table
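A purely illustrative sketch of the first option (column names hypothetical), a LIVETIME column in the GTI table:

from astropy.table import Table

# One livetime value per good time interval; times in seconds (values invented)
gti = Table({
    "START": [0.0, 1900.0],
    "STOP": [1800.0, 3600.0],
    "LIVETIME": [1764.0, 1666.0],
})
total_livetime = gti["LIVETIME"].sum()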

I think having it as an EVENTS header keyword is maybe sub-optimal, since (I think) it prevents merging of event lists, and it might also not be the natural place if time information (like GTI or a "response time interval") lives elsewhere.

Livetime info is central to any science tools analysis, so this is something we should get right in the DL3 data model.

Let's use this issue to gather some ideas and then discuss it tomorrow in the telcon.

FITS policy and notes

I propose we add more information on the gamma-astro-data-formats specs and FITS in the general notes section.

The main question, I think, is which of our formats:

  • Require FITS, i.e. only FITS serialisation is supported
  • Shall work with FITS
  • Don't have to work with FITS

At the moment we have cases like EVENTS where the spec page is written in a very FITS-specific way, even though storing an event list e.g. in ECSV or HDF5 would be easy. I don't like this and would prefer to just specify the names and semantics of keys and columns, and not require the FITS serialisation format.

There are other cases, like SED, where we explicitly give sample files in FITS, ECSV and HDF5 format. For those, I think the main point of the "general info section on FITS" would be to explain some oddities, e.g. short and cryptic header keys, that originate from FITS limitations and traditions. For example, we should explain that formats that support FITS should always use uppercase header keys of at most 8 characters, because that is required by FITS (see #65 (comment) and the following discussion).
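As a purely illustrative sketch (keys, values and file names are hypothetical), here is the same table metadata written once to FITS and once to ECSV; only the FITS serialisation imposes the uppercase 8-character key constraint:

from astropy.table import Table

t = Table({"ENERGY": [1.0, 2.0], "FLUX": [1e-12, 5e-13]})
t.meta["SED_TYPE"] = "dnde"   # uppercase, 8 characters -> becomes a FITS header keyword
t.write("sed.fits", overwrite=True)
t.write("sed.ecsv", format="ascii.ecsv", overwrite=True)  # same metadata, no key-length limit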

Probably we can't get agreement on this via Github discussion.
But people can voice their use cases and preferences for the main question put above now, and then we try to settle this question at the next f2f meeting.

@woodmd @cboisson @kosack @jknodlseder

HEALPix Formats

I have posted a first draft of a proposal for HEALPix FITS formats to this page. This thread is intended to collect comments and feedback on this proposal.

The current draft spec was based mostly on the conventions we currently use in the Fermi Science Tools. The HEALPix-specific ST functionality is not yet widely used so we are also open to updating the ST implementation to reflect whatever we agree on for this specification.

SED and Binned Likelihood Formats

This is an open thread to collect comments and feedback on the proposed formats for Binned Likelihood Profiles. Since these formats are effectively just SEDs with extra information in each bin, it would probably be a good idea to try to share conventions with the Flux points format. @zblz Did you have any thoughts on this?

Here are a few open questions:

  • Do we want a fixed 95% CL for the UL columns? I saw that the naima format specifies the confidence level with a header keyword so maybe we should do something similar.
  • Do we need a column to specify the bin center? The current format only specifies the lower and upper edges (E_MIN and E_MAX). Differential quantities are assumed to be evaluated at the geometric center of the bin, but we could add a bin center column (E_CTR) that would allow the evaluation point to be explicitly defined (see the short sketch after this list).
  • Which columns do we want to make optional? The current specification calls for five different unit conversion factors (DFDE, E2DFDE, FLUX, EFLUX, NPRED). I mainly did this for convenience, so that one doesn't need to know anything about the underlying spectral model to convert between units. Of these, E2DFDE could clearly be optional (or maybe dropped entirely), since it can be trivially computed from DFDE. NPRED provides additional information that can't be rederived from the other four unit columns, but since it may not always be available it should probably also be optional.
  • Do we want a column for exposure? This can be inferred from FLUX and NPRED but having it as an independent column could be useful in some circumstances.
  • Do we need a column that describes the model for the spectral distribution within each bin? For instance when making LAT SEDs we typically substitute the source with an index=2.0 power-law. For many applications this doesn't really matter (e.g. when using narrow bins) but when using wide bins or when the effect of energy dispersion is significant the assumed shape within the bin can have a big influence on the measured flux. One option would be a column for the logarithmic slope of the bin model evaluated at the bin center.
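For the bin-centre question above, a minimal sketch (E_MIN / E_MAX as in the draft spec; the E_CTR column itself is only a proposal):

import numpy as np

e_min = np.array([0.1, 0.3, 1.0])   # TeV, lower bin edges
e_max = np.array([0.3, 1.0, 3.0])   # TeV, upper bin edges
e_ctr = np.sqrt(e_min * e_max)      # geometric bin centre, where differential quantities are evaluated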

Change safe energy header keywords

We are currently using the LO_THRES and HI_THRES keywords for the safe energy thresholds in the ARF and AEFF formats:
http://gamma-astro-data-formats.readthedocs.io/en/latest/search.html?q=LO_THRES&check_keywords=yes&area=default

I saw that in this OGIP spec, LO_THRES means something different:
http://heasarc.gsfc.nasa.gov/docs/heasarc/caldb/docs/memos/cal_gen_92_002/cal_gen_92_002.html#tth_sEc4
In the RMF it's

LO_THRES - minimum probability threshold used to construct the matrix (matrix elements below this value are considered to be zero and are not stored)

Given that it's a cryptic term anyway, I'd suggest we change to something else, e.g. EMINSAFE, EMAXSAFE, or really anything else that doesn't conflict with the existing usage and ideally suggests an energy range from the name.

@jknodlseder @cboisson @registerrier @joleroi - Thoughts?

Consistent column names within IRFs

Hi all,

I have a couple of questions regarding the energy naming we use for the IRFs.

In issue #35 we discussed true/reconstructed energy and how to label each of them. We decided to use "ERECO_" for reconstructed energy and "ENERG_" for true energy. This is inconsistent with the event list naming convention, but we decided we could live with that (as the same convention was used in other past experiments).

In the current specs:

  • Energy dispersion: "ENERG_" is true energy, so it is OK (although this should be stated explicitly). Also, we should remove one of the "MATRIX" descriptions currently appearing.
  • PSF: Same as before. We should explicitly state that it is true energy.
  • Background: Here is where I have my main doubts. In IACTs, we generally use "ERECO" to describe the energy of the cosmics, assuming they are gammas. Shouldn't we then use "ERECO"? Or should we use "ENERG" and add a note explaining that this is not exactly correct?

I can fix these issues, but I wanted to discuss them first. What do you think would make more sense for the background?

Cheers!

Define sign of energy bias

While reviewing gammapy/gammapy#1138 by @joleroi I noticed that in Gammapy we're currently defining energy bias with the opposite sign to how @GernotMaier defined it in the "CTA IRF" draft document here.

In Gammapy we currently define energy bias as E_reco / E_true - 1, i.e. bias > 0 if E_reco > E_true. This is also how I would have defined it, and I think everyone uses that definition in HESS.
One publicly visible document I found with Google is https://pos.sissa.it/236/826/pdf (see energy bias curve in Figure 3, right).

In the "CTA IRF" document here by @GernotMaier from June 2017, I see this definition:

Ebias: Energy bias (mean of (1 − E_reco / E_true )-distribution)

i.e. it's the opposite sign from what we have, bias < 0 for E_reco > E_true.
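For clarity, the two sign conventions quoted above, side by side:

.. math::
    \mathrm{bias}_{\mathrm{Gammapy/HESS}} = \frac{E_{\mathrm{reco}}}{E_{\mathrm{true}}} - 1
    \qquad\text{vs.}\qquad
    \mathrm{bias}_{\mathrm{CTA\ draft}} = \left\langle 1 - \frac{E_{\mathrm{reco}}}{E_{\mathrm{true}}} \right\rangle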

@GernotMaier - Is this a typo or do you really use that definition in CTA? Is this definition / sign still open for debate at https://forge.in2p3.fr/boards/236/topics/1830 or do you consider it final for CTA e.g. because there's too much history in CTA MC or even publications using that definition?

Do you know what VERITAS / MAGIC use?

For Fermi-LAT I found https://fermi.gsfc.nasa.gov/ssc/data/analysis/documentation/Cicerone/Cicerone_LAT_IRFs/IRF_E_dispersion.html with Google, which does define bias parameters, but it's too late for me to decipher the math and figure out which sign they use. Maybe @woodmd you could comment?

I couldn't find a definition for other instruments or OGIP.

It's not something that we have to define here (because energy bias is currently not used in the EDISP formats we define), but it is still a very common check / performance plot to show the energy bias curve for a given EDISP, so I think there would be value in documenting it here. In the end, the only thing that makes sense is to follow the CTA definition here, but since it's internal we can't reference it from Gammapy.

GTI required or optional with EVENTS?

Should GTI (Good time interval) extensions be required or optional?
Are they required to be in the same file as the EVENTS extension?

@mimayer - What's the status in ctools? Are there any concrete plans to change something?
http://gamma-astro-data-formats.readthedocs.org/en/latest/events/index.html#disclaimer-about-names-of-hdu-and-file-structure

Does someone have time to look at the EVENTS and GTI specs from OGIP?
Do existing FTOOLs handling EVENTS work if no GTI is present (for operations where it's not needed)?
This is probably something where we should just follow the existing practices, no?

Add note that responses should be smooth and well-sampled

We should add a note to the current IACT DL3 specs that responses are expected to be smooth and well-sampled, i.e. the burden of e.g. smoothing responses and choosing response sampling is on the DL3 data producer, not the science tools consuming the files.

This was discussed in detail at the April f2f meeting and I think it isn't controversial, so it "just" needs someone to make it clear in the current spec. But the question came up in another issue (see #67 (comment)) in a discussion with @kosack, so I'm splitting it out into this separate reminder issue.

General schemes to handle parameter ranges where the responses are bad (like mask or weight arrays) haven't been investigated, but might ultimately be needed. If you'd like to discuss this, I'd suggest splitting it out into yet another separate (very long-term, large-effort) issue.

PSF format descriptions

We need a description of the psf_king and psf_3gauss PSF formats.

@peger089 - I'm trying to get these specs in shape for the HESS FITS data telcon on Wednesday. Do you have time to implement this today or tomorrow, or should I do it?

Comparison with astromodels

The format I am designing for astromodels to specify functions is very similar to what you are trying to achieve, so maybe we can try to converge. This is what I have:

powerlaw:

    description : 

        A simple power-law with normalization expressed as 
        a logarithm

    formula :

        expression : 10**logK * (x / piv)**index

        latex :

            \frac{dN}{dx} = 10^{logK}~\frac{x}{piv}^{index}

    parameters:

        logK :

            desc : Logarithm of normalization
            initial value : 0
            min : -40
            max : 40

        piv :

            desc : Pivot energy
            initial value : 1
            fix : yes

        index :

            desc : Photon index
            initial value : -2
            min : -10
            max : 10

powerlaw_flux:

    description:

        A simple power-law with the integral between two extremes as 
        normalization parameter

    formula:

        expression : 10**(logK) * (index + 1) * (x / piv)**index / 
                     ( xmin**(index+1) - xmax**(index+1) )

        latex :

            \frac{dN}{dx} = \frac{index + 1}{xmin^{index+1} - xmax^{index+1}}
                            10^{logK}~\frac{x}{piv}^{index} 

    parameters:

        logK :

            desc : Logarithm of normalization
            initial value : 0
            min : -40
            max : 40

        piv :

            desc : Pivot energy
            initial value : 1
            fix : yes

        index :

            desc : Photon index
            initial value : -2
            min : -10
            max : 10

        xmin :

            desc : Lower bound of integration
            initial value : 1

        xmax :

            desc : Upper bound of integration
            initial value : 10

This can easily be read from python as:

import yaml

with open("astromodels/functions/powerlaw.yaml", "r") as f:
    d = yaml.safe_load(f)  # safe_load avoids executing arbitrary YAML tags

print(d['powerlaw']['description'])
'A simple power-law with normalization expressed as a logarithm'

print(d['powerlaw']['parameters']['index'])
{'max': 10, 'initial value': -2, 'desc': 'Photon index', 'min': -10}

Maybe we can compare our respective attempts and merge the good parts of both into one format.

1D formats for IRFs (ARF, RMF), counts vector (PHA)

I am using the OGIP format for 1D spectral analysis in Gammapy. It has, however, some slight modifications, so it would be useful to have a page here that is basically a diff against the OGIP standard pages:

https://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/docs/spectra/ogip_92_007/node5.html
http://heasarc.gsfc.nasa.gov/docs/heasarc/caldb/docs/memos/cal_gen_92_002/cal_gen_92_002.html#tth_sEc4
http://heasarc.gsfc.nasa.gov/docs/heasarc/caldb/docs/memos/cal_gen_92_002/cal_gen_92_002.html#tth_sEc3.1

Question: Where should this page live?

Proposal: Split at top level - i.e. IACT IRFs -> 2d IRFs. Add page: 1D IRFs

Do you agree?

Add optional columns for MC generated events

The parameters of MC generated events should be added as optional columns, to store the true energy, direction and particle ID of each event (required to calculate IRFs from event lists).

Split out GTI and TELARRAY in separate pages

Currently GTI and TELARRAY are listed as sub-sections of the event list spec:
http://gamma-astro-data-formats.readthedocs.org/en/latest/events/index.html

They should be split out into separate pages, because they are just as relevant for IRFs as for EVENTS.

For GTIs I'm not sure if we need to have our own page or if we can just link to
https://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/docs/rates/ogip_93_003/ogip_93_003.html#tth_sEc6.3
where we mention GTIs, or put it in the glossary.

Anyone wants to work on this or should I?

Spectral model for gammapy fit result

Hi,

I would like to store the result of the Gammapy region-based 1D spectral analysis to disk. I thought it might be a good idea to use the format proposed here. The problem is that I don't really understand it 😁

@zblz @woodmd : What should a YAML output file look like in order to comply with your proposed format? Where do I put the actual values of the parameters?

Versioning and referencing of our specs

@mimayer @cboisson, @joleroi, all - How should we version the specs here and obtain a stable reference URL?

The IRF specs are making good progress, I think we can finish this soon.
(days for a first version, then weeks to have some discussion with others)

@joleroi suggested that we do something similar to the OGIP specs, where a HDU_DOC keyword is used to refer to the specification:
https://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/docs/ofwg_recomm/r8.html

I'm a bit hesitant to use the https://github.com/gammapy/gamma-astro-data-formats or http://gamma-astro-data-formats.readthedocs.org/ in HDU_DOC in FITS files, because the reference might not be stable enough (e.g. if we decide to move the repo or HTML page hosting for some reason).

Thus my suggestion would be that in early January we tag the repo with a 1.0 version and put it up on https://zenodo.org/ . Then we have a DOI, which is the most stable reference one can get as far as I know.

@zblz @giacomov @woodmd - We'll have to discuss whether it makes sense to keep the astro model specs (and other things I'd like specified, like input / output to sensitivity computations) in the same repo, or whether we split things out into separate repos where development / versioning can occur at a pace that's appropriate for each spec.

We don't have to decide these things now, I suggest we take some time until January to see how the different specs evolve and keep them all here for now.
But I wanted to start the discussion early and hear what you all think about versioning and referencing of our specs.

Master index file

@cdeil: Let's use this issue to further discuss the master index file. Could you provide a JSON example file and update the specs accordingly, so that I can implement the handling in ctools?

Clarify SED upper limits

@woodmd - It's not clear to me yet how to store upper limits in the SED format.

The spec (see here) doesn't explicitly say what values to fill for upper limits and the example files don't contain upper limits.

In Gammapy we currently don't use an UL column, but instead a NaN value in the down error column to store an UL (see here).
I think (but didn't double-check) that the Fermi-LAT catalogs also don't have an UL column, but declare ULs by a NULL value in the lower error. This is just to mention examples; I'm not saying this has to influence what we specify here.
@zblz proposed an extra bool ul flag column to say which point is which, instead of "magic values" (see here).

At this point I don't have an opinion on which UL point serialisation option is best. Is there an advantage (e.g. simpler storage or processing) to not introducing an UL column, but re-using the lower / upper error columns?
If we keep the UL column, should we require some special values like NaN or NULL in the other "flux" columns, or should their content be undefined / irrelevant for ULs?

Are there use cases for "lower limits", i.e. should we treat "upper limits" in a special way or just have a system that works for symmetric confidence intervals?

@adonath or @joleroi or @registerrier @giacomov - Do you have a preference here?

Maybe this?

flux flux_err flux_ul
val val nan # not an UL
nan nan val # is an UL

Or this, to still allow storing the maximum likelihood estimate, even for UL?

flux flux_err flux_ul
val val nan # not an UL
val val val # is an UL
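Purely as an illustration of the first sketch (file and column names hypothetical), a consumer could identify ULs like this:

import numpy as np
from astropy.table import Table

points = Table.read("flux_points.ecsv", format="ascii.ecsv")  # hypothetical file
is_ul = np.isnan(points["flux"]) & ~np.isnan(points["flux_ul"])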

@woodmd - Do you have time to make a pull request for this soon?
Or should I make it?
Please say your preference in any case.

(We need this for gamma-cat, so if possible, I'd like to decide this, implement it in the spec, and update the examples to show UL cases in the next few days.)

Change energy dispersion format

This is a suggestion for a change I got via email from Michael Punch and @kosack, after presenting the FITS data formats for H.E.S.S. at the collaboration meeting last week.

There we were using EDISP in the current format (see here), with the y-axis range going up to y = e_reco / e_true = 2, which clipped the distribution.

To avoid this, they suggested instead using y = log(e_reco), which is what the OGIP RMF does, and also using that for the DL3 responses (where there are extra axes, e.g. for field-of-view offset, and no sparse-matrix zero-suppression scheme).
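To summarise the two axis conventions mentioned above:

.. math::
    \mu = E_{\mathrm{reco}} / E_{\mathrm{true}}
    \quad \text{(current format axis, clipped at } \mu = 2 \text{ in the HESS export)}
    \qquad\text{vs.}\qquad
    y = \log(E_{\mathrm{reco}}) \quad \text{(OGIP RMF style)}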

Energy true/estimated binning in "aeff_2d"

Hi all,

Aren't we currently "imposing" that the effective area vs. true energy has the same binning as the effective area vs. estimated energy? Isn't that completely misleading?

I would propose to add another pair of columns defining the true energy binning (even if it is identical).

Edit: the energy column is always labelled as "Energy axis". I would propose to always specify which energy we are referring to, especially in the case of IRFs.

Event classes and tagging variables (as hadronness)

Proposal: keep the format general enough so we can compute the flux of a source (or even a diffuse flux) either from the number of events surviving a cut in "hadronness" or from a fit to the hadronness distribution. I argue that the latter method should perform better at high contamination (low separation power), because in principle all events are kept and the full information in the differences between the hadronness PDFs is used. I also think this may be a way forward in cases where our separation power is small, like electron analyses.

Use case: we want to be able to compute the flux of a source using both methods

Case 1: Using cuts (a minimal numeric sketch is given below):

  1. Define a set of event classes corresponding to different cuts in hadronness
  2. Compute the effective area for each set of cuts Aeff(cut)
  3. Compute the background for each set of cuts from data N_bckg(cut)
  4. Count the number of events passing the cuts N_evts(cut)
  5. Compute the signal events N_sig(cut) = N_evts(cut)-N_bckg(cut)
  6. Flux(cut) = (N_sig(cut))/(Aeff(cut)*time)
    Required: Class definitions, class for each event, Aeff(Cut), Backg(cut)

Case 2: Fitting the distributions:

  1. Keep the hadronness of each event
  2. Compute the pdf for the signal hadronness_pdf(sig)
  3. Compute the pdf for the background hadronness_pdf(bckg)
  4. Fit the hadronness distribution of the sample to
    N_sig*hadronness_pdf(sig) + N_bckg*hadronness_pdf(bckg)
  5. Flux = N_sig/(Aeff*time)
    Required: hadronness for each event, hadronness_pdf(evts), hadronness_pdf(bckg)

Note: A soft hadronness cut will be needed in any case to reduce the DL3 size
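A minimal numeric sketch of Case 1 above (all values invented):

# Cut-based flux estimate, following steps 4-6 of Case 1
n_evts = 1250.0        # events passing the hadronness cut for this event class
n_bckg = 1000.0        # background estimate for the same cut
aeff = 1.0e9           # cm^2, effective area for this cut
livetime = 1800.0      # s

n_sig = n_evts - n_bckg
flux = n_sig / (aeff * livetime)   # cm^-2 s^-1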

Observation index table

@mimayer - Let's use this issue to discuss details about the obs index table.

I've changed the formatting a bit and also the names or required status of some of the columns:
http://gamma-astro-data-formats.readthedocs.org/en/latest/data_storage/obs_index/index.html

Any objections / further suggestions so far?

Should we make it a requirement that ALT_PNT is computed at the observation mid-time, and not in some other way? I think that would be good, because people will use this for observation selection, and it would be confusing if the observation selection differs between HAP and PA because we fill mean ALT / AZ / ZEN differently.

@mimayer - Could you please take care of the TODO for ZTRGRATE?

TODO: define what “zenithed averaged mean” means or remove this column.

Use spherical trigonometry in PSF formulae

The formula

.. math::
    \int_{0}^{\infty} 2 \pi r \, \frac{dP}{d\Omega}(r) \, dr = 1,
    \quad\text{where}\quad
    \frac{dP}{dr} = 2 \pi r \, \frac{dP}{d\Omega}

on the PSF page is only true for small angles; the correct formula is

.. math::
    \int_{0}^{\infty} 2 \pi \sin(r) \, \frac{dP}{d\Omega}(r) \, dr = 1,
    \quad\text{where}\quad
    \frac{dP}{dr} = 2 \pi \sin(r) \, \frac{dP}{d\Omega}

The description above also uses the small-angle approximation. The exact formulae should be given, since for some instruments the maximum PSF radius can become quite large and the difference will be perceptible.

Move to open-gamma-ray-astro org and split this repo?

I'd like to move this repo from https://github.com/gammapy/gamma-astro-data-formats to
the https://github.com/open-gamma-ray-astro org I created just now, which matches the name of the https://lists.nasa.gov/mailman/listinfo/open-gamma-ray-astro mailing list.

The whole point is to collaborate on data formats that are adopted by as many instruments and codes as possible, so it's not really appropriate to have those specs under the gammapy org.

Another question I have is whether we should stick with the one-repo approach for now and let people write pages for the different things they'd like to spec (including e.g. model and fit result serialisation formats), or whether we should split the "IACT DL3" part out into a separate repo, because that is where most of the activity is and the goal is to work towards a document, and separate repos could scale better over the coming years if this gets traction.
The con of splitting is the extra overhead of several repos / readthedocs pages, but that's manageable, so I'm slightly +1 to split off IACT DL3 into a separate repo / document now.

@jknodlseder @cboisson @mimayer, @joleroi - Thoughts?

IACT DL3 FITS compression information section

I think it would be nice to add an explanatory section with information (or maybe even "best practices") on IACT DL3 FITS data compression. All the HDUs (EVENTS, IRFs) are BINTABLEs and it's mostly float data (with some int data). The question is whether it should be compressed (and if yes, how) or not.

@joleroi and I are currently thinking about this while preparing the "HESS public test data release 1" with IACT DL3 FITS files in the format described in this spec.

So far in HESS we are running external gzip on FITS files to produce .fits.gz files, but I'm not sure if that was a well-motivated choice, i.e. how large compression ratios are if only reasonable columns are exported (e.g. no zero-filled optional columns in EVENTS).
@joleroi or @mimayer - If you could produce some numbers on compression ratios you achieve for HESS DL3 data (ideally split per HDU type, i.e. EVENTS, AEFF, ... and for HDUs that only contain the required columns, not zero-filled optional columns) with external gzip and bzip2 and paste them here to help the discussion, that would be great.
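A quick way to produce such numbers would be something like the following (file name hypothetical):

import gzip
import os
import shutil

path = "events.fits"
with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

ratio = os.path.getsize(path + ".gz") / os.path.getsize(path)
print(f"compressed size is {ratio:.0%} of the original")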

The main drawback of external compression is that the whole file needs to be read and uncompressed to get at the header. For us in HESS this isn't a problem at the moment, because we use the index files to select data and have one HDU per file. But now, based on feedback from the Meudon IACT DL3 meeting, we are starting to move to all HDUs per run (EVENTS, IRFs) in one file. So already for us, if we compress, things start to get less efficient if someone e.g. only wants to access the EVENTS HDU (all IRF HDUs will be read from disk and uncompressed as well).

It looks like Fermi distributes their event and IRF data as uncompressed FITS BINTABLEs (see here and here). @woodmd - Maybe in Fermi compression was considered in some small study and found not to be beneficial?

At what compression factor should we start to consider compressing before distributing the data? 30%? A factor of 2?

Should someone here be in favour of compressing, the question is whether internal or external compression is better. Is internal compression, i.e. the tiled table compression convention, an option?

@taldcroft - Is it correct that this convention is not supported by astropy.io.fits?
http://astropy.readthedocs.io/en/latest/io/fits/index.html#working-with-compressed-files
If that is true, it's probably not in widespread use or supported by many tools, and it's off the table?

@joleroi @mimayer @kosack @cboisson @jknodlseder - thoughts?

Question: Must ENERG_LO/ENERG_HI binning be identical for different IRFs?

Hi all,

One simple question: Must the energy binning of (for instance) the effective area be identical to that of the migration matrix by definition? Are they assumed/checked to be identical within ctools/gammapy? @cdeil @jknodlseder

If the answer is yes, we should add some clarification to the specs (I could do it). But I want to understand how consistent all the IRF axes (ENERGY, THETA, etc.) need to be.

Write flux point spec

@zblz - Would you be willing to move your flux point spec from http://naima.readthedocs.org/en/latest/dataformat.html
to this repo?

In Gammapy we have a few places where we deal with flux point tables (e.g. here or here or here), and I'd like us to clean up our act / use one class in Gammapy to handle flux points throughout.

I'm not sure this needs much discussion, but if you have time for some discussion / iteration, maybe you can make a pull request instead of committing to master directly?

Add 3D effective area to the specs

Hi all,

I have two main motivations to add an aeff_3d format to the specs, and also support for it within the science tools (3D: ENERGY, DETX and DETY):

  • In MAGIC, as we only have two telescopes, the acceptance is not radially symmetric.
  • If we want to reach the extremely low systematic uncertainty levels of CTA, we definitely need to add the zenith angle dependence within the FoV. For example, if the whole array is pointing 50 deg away from zenith, and the whole FoV covers ~9 deg, then there will be a clear difference in effective area between the top and bottom edges of the field of view (even if the offset is equal).

It shouldn't be a lot of effort to include this in either the specs or the science tools. What do you think, would it make sense? I could work on it 2-3 weeks from now.

Cheers!

How to handle energy dependent deadtime/livetime?

The deadtime (or livetime) for CTA is in principle energy dependent (as different telescopes and cameras will have different deadtimes). The question is how to handle that. Fold the deadtime into the IRF? Or have an explicitly energy-dependent deadtime? This is even worse if there are different camera types on the same site, where events seen by one camera type will have a different deadtime than events seen by other camera types.

Development of common tools for data format validation

Hi all,

As we discussed in the telcon, we should define a couple of methods for DL3 data format validation. Apart from the use cases that the format will need to cover, we should have methods to estimate the systematic uncertainty we are adding by using certain approximations.

Also, the current generation of IACTs will dedicate some effort to validate their own DL3 products, so there is a nice synergy on this task from which we could profit.

I guess an obvious example is to fit the measured spectrum of a known source (the Crab Nebula for MAGIC) under different assumptions (adding or removing some IRF dependencies, for example).

In the case of CTA, we could generate event lists from several IRFs with small differences (for example, zenith angle variation expected in ~30') and measure the effect on the calculated spectrum using an averaged IRF.

@cdeil Were similar validations done for HESS? Or just comparison between standard analysis tools and DL3 analysis products?

Support for multiple responses (specifically point-like AEFF and EDISP)

The current spec has some support for multiple responses via the HDU_CLASS key in the index files.
See http://gamma-astro-data-formats.readthedocs.io/en/latest/data_storage/hdu_index/index.html#hdu-type-and-hdu-class

Specifically for AEFF we do support two versions via the keyword HDUCLAS2={'EFF_AREA', 'EFF_AREA_RECO'}
http://gamma-astro-data-formats.readthedocs.io/en/latest/irfs/effective_area/index.html

This is not yet supported in the HDU index file format; there's just one key HDUCLASS='aeff_2d' defined, so it's not possible to access both AEFF versions via the current index file.

In HESS, we now additionally have the use case of wanting to define and access AEFF and EDISP derived from point-like responses (with a RAD_MAX cut applied) or from full-enclosure responses.
Using point-like responses is very common (in HESS it's used e.g. for all AGN analyses), and we'd like to start using them for FITS analyses.
Note that there is a difference (demonstrated in HESS by @bkhelifi): EDISP is worse for events in the PSF tail, so it's not possible to recover the correct point-like EDISP from the full-enclosure responses at a high accuracy level, which is another argument that we should support point-like responses in the DL3 spec.

So we would like to extend the IACT DL3 format spec to support point-like IRFs.

Our main questions are:

  • Which AEFF and EDISP header keys should we use?
  • How should we support this via the HDU index file?

For the two different AEFF we started to use HDUCLAS2={'EFF_AREA', 'EFF_AREA_RECO'}
So we could use HDUCLAS3={'point-like', 'full-enclosure'} and add the RAD_MAX header key that's already in AEFF to EDISP also.
And then add HDUCLAS2 and HDUCLAS3 columns to the index file spec.
This would satisfy the current use cases for HESS.
It's not very pretty though, because point-like/full-enclosure is not really at a lower or higher level in the hierarchy compared to whether AEFF and EDISP were combined or not. It's just different ways to do responses / analysis.

Another option would be to change the HDUCLASS valid keys to aeff2d_point_reco, aeff2d_fullenclosure_reco, aeff2d_point_true, aeff_fullenclosure_true instead of adding new columns.

Another option would be to replace the HDUCLASS column with a TAGS column that holds a string of key-value fields, like HDU_CLASS="point=yes,energy=reco", which is super easy to parse, and then scientists or science tools can do selections based on that.
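For illustration, parsing such a tag string in Python really is a one-liner:

tags = dict(kv.split("=") for kv in "point=yes,energy=reco".split(","))
# -> {'point': 'yes', 'energy': 'reco'}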

Or directly go to a hierarchical index format instead of sticking with the table.
IMO that would be the best mid-term solution (next year) to develop such a new, better format.
But short-term, my suggestion would be to either add columns or string cases to the existing index file format.

@kosack @mimayer @TarekHC @bkhelifi @registerrier @joleroi @jknodlseder - Thoughts?

(if no-one comments on this, I will probably go ahead and make a pull request adding extra optional columns to the HDU index file next week)

Add time binning and lightcurve spec

Remove all '-' from the yaml for spectral models

The currently proposed YAML has a lot of extra '-', which makes reading a little more cumbersome than it could be. Let's look, for example, at this section:

- PowerLaw:
    - type: SpectralModel
    - reference:
        - type: float
        - description: Value at which normalization is defined.
        - ucd: [em.wl, em.energ, em.freq]
    - index:
        - type: float
        - description: Spectral index
        - ucd: spect.index

To access the "PowerLaw" element we need to do:

with open("powerlaw.yaml") as f:

    d = yaml.load(f)

powerlaw = d[0]['PowerLaw']

Even more confusingly, to access just the 'type' attribute of the PowerLaw we need to do:

print(d[0]['PowerLaw'][0]['type'])
SpectralModel

I believe a much cleaner way would be to remove all lists of one element (i.e., removing all "-"s) and keep only the key->value structure. For example:

PowerLaw:
    type: SpectralModel
    reference:
        type: float
        description: Value at which normalization is defined.
        ucd: [em.wl, em.energ, em.freq]
    index:
        type: float
        description: Spectral index
        ucd: spect.index

This would be read as:

with open("powerlaw.yaml") as f:

    d = yaml.load(f)

powerlaw = d['PowerLaw']
print(d['PowerLaw']['type'])
SpectralModel

and so on.

Change OBS_ID keyword from integer to string

The event list extension has an OBS_ID keyword that is currently defined as an integer. This comes from the original event list format specification from Karl. It appears, however, that OGIP has in the meantime switched to a string; see for example https://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/docs/general/ogip_94_001/ogip_94_001.html which says:

OBSERVATION ID: Sequence numbers are commonly used by missions to identify 'uniquely' a particular dataset. The definition of the sequence number is not unique across mission and typically depend upon the different way a dataset is identified within that mission. Although sequence number suggests a numerical value only, past mission had given the sequence number as a mixture of numerical and character values. To avoid a proliferation of keywords to store the sequence number value, the OGIP recommend to use OBS_ID. The keywords value is a string, to allow back compatability with the already archived mission. The OBS_ID keyword value can be used by database software.

So I would propose to change the OBS_ID keyword from integer to string.

Background model specs

@JouvinLea @mimayer - I've started to write the background model spec, can you please have a look:
http://gamma-astro-data-formats.readthedocs.org/en/latest/irfs/background/index.html

  • For the bkg_2d model, I prefer an axis that contains THETA values in deg, and not THETA_SQUARE values in deg^2, because a linear axis is simpler than one with a SQUARE or SQRT scale, and the OGIP aeff and PSF formats also always use THETA and RAD, and not square. We can / will still choose to create background models with a nice binning, probably equal-solid-angle bins. OK?
    ==> Conclusion: use THETA
  • @mimayer Does Gammalib / ctools need the column to be called "bgd" at the moment? I'd prefer we use "bkg" everywhere, and not sometimes "bgd". Should be easy to teach the Gammalib class to look for "bgd" or "bkg" for now? OK to have "bkg"?
    ==> Conclusion: use "bkg"
  • Any opinions on the energy unit of the bkg column, i.e. "MeV^-1" or "TeV^-1"? I'd prefer TeV, since I think we use that everywhere else as far as I know. But I guess using MeV here has some good justification as well, i.e. it is what Fermi uses and what Gammalib uses at the moment. (In Gammapy I think we read and use the units, i.e. it doesn't matter to us, but I'm not even sure.)
    ==> Conclusion: use MeV so that it still works for Gammalib
  • Do we want a 1D format to store acceptance curves (just THETA as an axis, for an integral energy band)? It's super common in HESS and we probably want to store / exchange / compare them with the HESS software, i.e. it's good to have a format, no?
    ==> Conclusion: for now: no, can be added later.

Any other comments / questions / suggestions for the spec?

Add example data files

We should add an example data file for every format.

I propose we use run 23037 for HESS with the HAP-HD exporters and std config.
An alternative would be to add a function that generates a file in the appropriate format with simulated values, using Gammapy or Gammalib or just Python scripts from scratch.

Thoughts?


EDIT: we decided in the discussion below to write Python scripts to generate example files.

Here's the list of example files we want:
