hupo-psi / mzspeclib

mzSpecLib: A standard format to exchange/distribute spectral libraries

Home Page: http://www.psidev.info/mzSpecLib

License: Apache License 2.0

Python 99.82% Makefile 0.18%
hupo-psi standards file-format proteomics mass-spectrometry spectral-library

mzspeclib's People

Contributors

bittremieux · edeutsch · henryhlam · jshofstahl · mobiusklein · mwang87 · ralfg · sneumann · uly55e5 · ypriverol

mzspeclib's Issues

Implementation detail: What does finding `library spectrum index` mean for a library reader?

We have three explicit unique identifiers for a library spectrum: key, name, and index.

  • The key is supposed to be a stable numeric identifier, akin to a "primary key" in a database, where the cardinality of the key is historical rather than positional. In theory, should you re-order a library, the key doesn't change.
  • The name is a theoretically human-readable name, chosen by its creator, that describes the spectrum; when not stipulated by the source format (e.g. MSP), it is essentially free text. It should be unique too, leaving it up to the creator to make it meaningful to a human.
  • The index is supposed to be an externally "unstable" identifier for a spectrum within the library, specifying an ordinal number starting from 0. As read, should you re-order a library, the index does change.

When reading a library, the parser "knows" how many spectra preceded the spectrum it is currently parsing, and so it can automatically "fill in" the index attribute and the authors of a library needn't include it. However, we have explicitly written that it may be included in the output:

Optionally, a library spectrum index (MS:1003062) MAY be included to refer to the ordered position of the spectrum within the library, starting with 0 for the first spectrum. A library spectrum may have its index changed as the library evolves, and therefore SHOULD only be used internally by the library management software (e.g. for random access retrieval). To refer to a library spectrum unambiguously from outside (e.g. using a Universal Spectrum Identifier), the library spectrum key MUST be used.

Should that mean that if a parser reads an index attribute, it's obligated to store it and round-trip it, while also generating its own internal index separately? Under one reading of the spec, this is undefined behavior. Another, more restrained reading might suggest that the value that index refers to is never actually taken from the source file verbatim but is inferred, and so any read value should be ignored because it constitutes information external to the layout of the library itself.

[Term]
id: MS:1003062
name: library spectrum index
def: "Integer index value that indicates the spectrum's ordered position within a spectral library. By custom, index counters should begin with 0." [PSI:PI]
is_a: MS:1003234 ! library spectrum attribute
relationship: has_value_type xsd:integer ! The allowed value-type for this CV term

Suppose we parse this:

<Spectrum=3>
MS:1003062|library spectrum index=0
...
<Spectrum=1>
MS:1003062|library spectrum index=1
...
<Spectrum=4>
MS:1003062|library spectrum index=1000

The first two entries' index attributes match their true coordinates in the sequence of library spectra, but the third spectrum's index is totally different (2 vs. 1000). What should the parser do? I'd argue that it is context-dependent.

If I were writing a spectrum viewing application, I'd include the index in the text rendering of the spectrum so that the user knows where in the file an entry is. If I then parsed that text back into the program, I'd probably want to respect that value, because the object might just be passed around to be shown elsewhere (e.g. sent to a web app to be rendered again, or relayed via federated PROXI requests), where that index information is just as salient as it would be locally. I'd treat the index as something to display, but ignore it for the purposes of actually looking that spectrum up again.

However, were I writing a library manipulation tool that's not so interactive, I'd probably say "any buffer of one or more spectra constitutes a single library", want that library to be internally consistent, and ignore the input index value. After all, if the user wants to split a library, transform the parts differently, and re-merge them, the spectra will just get re-indexed anyway, especially if the user merges two separate libraries rather than slices of the same library.

Can we say explicitly which reading is more accurate, or that both usages are acceptable and it is up to the implementer to choose which way to go?
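For concreteness, the two readings could be sketched as follows; the class and function names here are hypothetical, not taken from any implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Spectrum:
    key: int
    attributes: dict = field(default_factory=dict)
    index: Optional[int] = None           # canonical ordinal, assigned on read
    declared_index: Optional[int] = None  # index attribute found in the file, if any

def assign_indices(spectra, respect_declared=False):
    """Fill in the ordinal index for each spectrum as it is read.

    With respect_declared=False (the 'restrained' reading), any index
    attribute in the source is kept only for display and the parser's own
    counter wins; with respect_declared=True (the 'viewer' reading), a
    declared index is surfaced as-is.
    """
    for ordinal, spectrum in enumerate(spectra):
        declared = spectrum.attributes.get("MS:1003062|library spectrum index")
        spectrum.declared_index = int(declared) if declared is not None else None
        if respect_declared and spectrum.declared_index is not None:
            spectrum.index = spectrum.declared_index
        else:
            spectrum.index = ordinal
    return spectra
```

With the example above, the restrained reading yields indices 0, 1, 2 regardless of what the file declares, while the viewer reading surfaces 0, 1, 1000.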

New Repository for Spectral Library Format

@edeutsch

I have created the repository for the spectral library format, including:

  • examples : It will contain all the examples of the file formats and the legacy formats examples such as (MSP, splib, etc. )
  • legacy-formats: Legacy formats are the specifications of the previous legacy formats such as MSP, splib, etc.
  • specification: Specification contains all the information about the Spectral Library Specification.

Regards
Yasset

Decision on new or existing format

Based on the outcomes of #4, #5, #7, #9, it should be checked whether any of the existing
https://github.com/HUPO-PSI/SpectralLibraryFormat/tree/master/legacy-formats
can already encode the required information, or how much (or how little)
would be required to extend an existing format. Also, one of the areas where PSI excels
is creating controlled vocabularies, harmonising adoption, and providing validators,
so that libraries and software can claim they support the PSI SpectralLibraryFormat.

HDF representation

It was already decided to allow multiple representations (txt, json, csv, hdf...) of the new spectral library format, based on a common framework (required (meta)data, controlled vocabulary...). In this issue thread, we can discuss the best way to represent the spectral library format in HDF.

As a reference, the current TXT format looks like this:

MS:1008014|spectrum index=500
MS:1008013|spectrum name=AAAVDPTPAAPAR/2_0
MS:1008010|molecular mass=1208.6510
MS:1008015|spectrum aggregation type=MS:1008017|consensus spectrum
[1]MS:1008030|number of enzymatic termini=2
[1]MS:1001045|cleavage agent name=MS:1001251|Trypsin
MS:1001471|peptide modification details=0
...
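A minimal, illustrative sketch of parsing one attribute line of this text format; the regular expression and field names are my own, not taken from the reference implementation:

```python
import re

# One attribute line: an optional [group] prefix, an accession|name pair,
# and a value that may itself be a term (accession|name).
ATTR_LINE = re.compile(
    r"^(?:\[(?P<group>\d+)\])?"        # optional cv_param_group, e.g. [1]
    r"(?P<accession>[A-Z]+:\d+)\|"     # term accession, e.g. MS:1001045
    r"(?P<name>[^=]+)="                # term name
    r"(?P<value>.*)$"                  # value (free text or accession|name)
)

def parse_attribute(line):
    m = ATTR_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not an attribute line: {line!r}")
    d = m.groupdict()
    # split a term-valued attribute into value_accession + value
    if re.match(r"^[A-Z]+:\d+\|", d["value"]):
        d["value_accession"], d["value"] = d["value"].split("|", 1)
    else:
        d["value_accession"] = None
    return d
```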

And JSON (for one metadata item) would take the following shape:

    {
      "accession": "MS:1001045",
      "cv_param_group": "1",
      "name": "cleavage agent name",
      "value": "Trypsin",
      "value_accession": "MS:1001251"
    },

Discussion spun off from issue #12:

@bittremieux:

I think the design decisions can be quite different based on the data format, i.e. between text-based (CSV, TSV) and binary (HDF5).

Personally, with spectral libraries increasing in size, I'm strongly in favor of HDF5. HDF5 has built-in compression, resulting in much smaller files. Also, it's much easier and efficient to slice and dice the data.
With these big, repository-scale spectral libraries I think it's quite important to focus on IO and computational efficiency. Just the time required to read a multi-GB spectral library is non-negligible, making up a considerable part of the search time.

Taking that into consideration, the compact option seems considerably superior to me. You could go even further and just have two arrays per spectrum (m/z and intensity), which fits the HDF5 data model perfectly. This minimizes the query time (2 lookups to retrieve a spectrum versus 2 * k per spectrum with k peaks), and HDF5 was developed to store (binary) arrays.
Also, keep in mind that HDF5 performance can degrade quite significantly if millions of keys are used because the internal B-tree index can become quite unbalanced, leading to significant overhead during querying. Keeping the number of keys limited might thus be essential to achieve acceptable performance.

Related to your final question, the main consideration here is what the goal of this version of the format is. If it's readability, then CSV is obviously superior. But I don't care about readability here; as you mention in #11, there's already the text version (and to a lesser extent the JSON version). Instead, when going for HDF5, performance should be the main goal. And that means using HDF5 the way it was intended: storing values in arrays instead of storing each individual value separately. Make it as compact as possible, and make spectrum reading efficient by storing the peaks in compact arrays.
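A minimal sketch of the compact two-arrays-per-spectrum layout, assuming h5py; the group and dataset names are illustrative, not part of any spec:

```python
import numpy as np
import h5py

def write_compact(path, spectra):
    """Write each spectrum as just two compressed arrays (m/z, intensity).

    spectra: mapping of spectrum key -> (mz_array, intensity_array).
    """
    with h5py.File(path, "w") as f:
        for key, (mz, intensity) in spectra.items():
            grp = f.create_group(f"spectra/{key}")
            grp.create_dataset("mz", data=np.asarray(mz, dtype="f8"),
                               compression="gzip")
            grp.create_dataset("intensity", data=np.asarray(intensity, dtype="f4"),
                               compression="gzip")

def read_spectrum(path, key):
    # exactly two dataset lookups per spectrum
    with h5py.File(path, "r") as f:
        grp = f[f"spectra/{key}"]
        return grp["mz"][:], grp["intensity"][:]
```

Note that this still creates one group per spectrum; given the B-tree concern above, a repository-scale library might instead concatenate all peaks into two global arrays plus an offsets index to keep the key count low.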

@RalfG:

Thanks for the reply! I started looking into HDF and there's a lot more to it than I initially thought. The nested key system and per-group metadata would definitely be very useful for the spectral library format. This means that an optimal HDF representation would look quite different from this general tabular format. Since we want the specification to allow multiple representations (txt, json, csv, hdf...), I propose that we keep this discussion about general tabular formats (such as CSV/TSV) and move the discussion on the HDF format to a new issue.

Tabular format: compact version of peak level table?

Following the current JSON format, the tabular format (HDF, CSV, TSV...) would have four tables, one for each data level (library, spectrum, peak and peak interpretation) with the following columns: cv_param_group, accession, name, value_accession, and value (and some additional grouping columns):


Library level

cv_param_group accession name value_accession value
  MS:xxxxxxx format version   0.1
  MS:xxxxxxx title   library_001
  MS:xxxxxxx description   spectral library 001
...

Spectrum level

spectrum_index ion_group cv_param_group accession name value_accession value
1 1   MS:xxxxxxx index   1
1 1   MS:xxxxxxx title   peptide1
1 1   MS:xxxxxxx is decoy spectrum   FALSE
1 1 1 MS:xxxxxxx calibrated retention index   xx
1 1 1 UO:0000000 unit UO:0000031 minute
...

Peak level

spectrum_index peak_index cv_param_group accession name value_accession value
1 1   MS:xxxxxxx m/z 725.123   
1 1   MS:xxxxxxx theoretical m/z 725.1244   
1 1   MS:xxxxxxx intensity 2138.325   
...

Peak interpretation level

spectrum_index peak_index peak_interpretation_index cv_param_group accession name value_accession value
1 1 1     peptidoform ion series type   y
1 1 1     peptidoform ion series start ordinal   1
1 1 1     product ion series charge state   1
...

This works perfectly fine for the library, spectrum and peak interpretation levels (where there are a lot of possible attributes per entry), but for the peak level, it might be better to have a compact form:

Peak level (compact)

spectrum_index peak_index product ion m/z product ion intensity
1 1 138.0661469 190.7953186
1 2 219.1087494 29.48472786
1 3 305.0644836 1067.439087
...    

This could be extended with a few optional columns.

To keep everything well standardized and machine readable, I would add an additional table, Peak level columns, defining the columns used in the Peak level (compact) table; it could also contain info about the units used (if applicable). E.g.:

Peak level columns additional table

column_index accession name unit_accession unit_name
0   spectrum_index    
1   peak_index    
2 MS:1001225 product ion m/z MS:1000040 m/z
3 MS:1001226 product ion intensity MS:1000132 percent of base peak

To summarize:

  • The peak level would get very verbose if we followed the same fields as the JSON specification.
  • The solution would be a compact form, together with a small table specifying the columns.

Questions:

  • Does everyone agree with having a compact form for the peak level?
  • We could also deviate completely from the JSON spec and use a compact form at all levels. This would make the file more compact in general and more database-like. A drawback is that, at every level, the number of columns could then vary between libraries, which would make parsing the metadata somewhat harder. We would also have to deal with the value/value_accession duality, which we do not have at the peak level, where all values are just numbers. What does everyone think about the "full-on compact form" idea?
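As a sketch, the compact peak table and its companion column-definition table could be emitted as TSV like this (the column choices mirror the example above; the function and constant names are illustrative):

```python
import csv
import io

# Hypothetical column definitions for the compact peak table:
# (column_index, accession, name, unit_accession, unit_name)
PEAK_COLUMNS = [
    (0, None, "spectrum_index", None, None),
    (1, None, "peak_index", None, None),
    (2, "MS:1001225", "product ion m/z", "MS:1000040", "m/z"),
    (3, "MS:1001226", "product ion intensity", "MS:1000132", "percent of base peak"),
]

def write_peak_tables(peaks):
    """Return (peak_tsv, columns_tsv) as strings.

    peaks: list of (spectrum_index, peak_index, mz, intensity) tuples.
    """
    peak_buf, col_buf = io.StringIO(), io.StringIO()
    w = csv.writer(peak_buf, delimiter="\t", lineterminator="\n")
    w.writerow([c[2] for c in PEAK_COLUMNS])   # header row from column names
    w.writerows(peaks)
    w = csv.writer(col_buf, delimiter="\t", lineterminator="\n")
    w.writerow(["column_index", "accession", "name", "unit_accession", "unit_name"])
    for row in PEAK_COLUMNS:
        w.writerow(["" if v is None else v for v in row])
    return peak_buf.getvalue(), col_buf.getvalue()
```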

Peak interpretation format

As part of the new PSI spectral library format, it will be possible to annotate the interpretations of individual peaks, as is already done in NIST, SpectraST, and PeptideAtlas libraries. However, there have been several different styles of interpretation in the past (even from a single provider), and therefore this document describes a single common peak interpretation format for peptides that is recommended for all peptide libraries and related applications for which peak interpretations are desirable.

This format, as currently described, is designed for unbranched peptides with simple PTMs and for fragmentation methods commonly used in proteomics such as CID, HCD and ETD. Although there are some provisions for annotating small molecules (e.g., contaminants in a predominantly peptide spectrum), as well as unusual fragments, it is expected that for other major classes of analytes (metabolites, glycans, glycopeptides, cross-linked peptides...), alternative peak interpretation formats should be defined.

See working document for ongoing discussion.

[Pitch] Apache avro serialization

Hi y'all!

I started a (VERY EARLY) prototype that implements serialization to Apache Avro.
I think it would be a good alternative to JSON, with more efficient disk usage.

https://github.com/jspaezp/avrospeclib

I am still implementing the schema using pydantic and deriving the Avro
schema from it.

Some disk usage metrics on a reasonably large speclib I have

    # ~ 50MB  binary speclib file from diann
    #  552M   tmp/speclib_out.tsv
    #  448M   tmp/speclib_out.mzlib.json # using mzspeclib
    #  148M   tests/data/test.mzlib.avro

Read-write speeds

avro write: 4.832904
avro read: 6.133625
json write: 6.304285
json read: 4.992042
pydantic validation: 19.415933 # Not needed for avro because schema is on-write.
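For reference, a hypothetical Avro record schema for one library spectrum might take a shape like the following; the field names are loosely modeled on the JSON representation and are not an agreed schema:

```json
{
  "type": "record",
  "name": "LibrarySpectrum",
  "fields": [
    {"name": "key", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "attributes", "type": {"type": "array", "items": {
      "type": "record", "name": "Attribute", "fields": [
        {"name": "accession", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "value", "type": "string"},
        {"name": "value_accession", "type": ["null", "string"], "default": null},
        {"name": "cv_param_group", "type": ["null", "int"], "default": null}
      ]}}},
    {"name": "mzs", "type": {"type": "array", "items": "double"}},
    {"name": "intensities", "type": {"type": "array", "items": "float"}}
  ]
}
```

Storing the peaks as plain arrays of doubles/floats, rather than as per-peak attribute records, is what gives the binary encoding most of its size advantage over the JSON form.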

let me know if there is any interest in adopting it!
best!

Features Not Yet Implemented

This issue is a running list of features that are not yet implemented in the Python implementation or in the repository in general:

  • Clusters - This feature was added late in the game and is only now being specified. We should also hand-craft some examples to work with.
  • Attribute set in groups - This feature should be straightforward to implement.
  • sptxt backend attribute parsing - This requires some familiarity with the sptxt format. The syntax is identical to msp, but uses different names for some things. Documentation is missing from http://tools.proteomecenter.org/wiki/index.php?title=Software:SpectraST
  • Tabular file parsing - Something to read CSV or TSV files from search tools like Spectronaut or DIA-NN.
  • Library-level JSON Schema - The drafted PR needs to be re-worked to cover the year's worth of changes since it was first written.
  • GNPS MGF, MassBank TXT, other MGF?

There are always more read-only backends to implement or improve:

  • The BiblioSpec backend could get considerably better with a more complete and up to date example file.
  • A generic mzIdentML + mzML/MGF read-only backend would let us convert ID experiments directly, but this is "hard" to do well.
  • More attribute handlers for parsing msp comments from the wild will help cover more use-cases.

New extension for the file format

As we discussed today (PSI 2018 meeting), one of the starting points would be to change the extension of the format. I recommend starting to add possible names here, and people can vote in the comments using +1 or the voting icon.

For whatever proposal you make, please first check that the file extension does not already exist, e.g. https://fileinfo.com/extension/msp

Update?

Hi,

I just wanted to follow up on this - is this going anywhere?

Refine and finalize metadata and CV terms

The current MSP and other spectral library formats only capture the metadata around each entry in the library (clusters, consensus spectra, peptides, small molecules), but not the way the spectral library has been generated. We need to define a general metadata section at the beginning of the library to capture this information. Similar to mzTab, I think it would be great to have something like the following.

The MTD prefix tells readers that this is a metadata field. The second column is the key of the metadata attribute and the third is its value.

The following fields can be reused from mzTab:

MTD   mzL-version	1.0.0      
MTD   title  Spectral Library Human from Peptide Atlas 
MTD   id     PXL00000001 
MTD   description Some description that can be used for example in the web about the library
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
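Such an MTD block is simple to parse; a purely illustrative sketch (the format itself is still under discussion):

```python
def parse_mtd_block(text):
    """Parse lines of the form 'MTD <key> <value>' into a dict mapping
    key -> list of values (keys such as 'instrument' may repeat)."""
    metadata = {}
    for line in text.splitlines():
        parts = line.split(None, 2)   # split on runs of whitespace, max 3 fields
        if len(parts) == 3 and parts[0] == "MTD":
            _, key, value = parts
            metadata.setdefault(key, []).append(value)
    return metadata
```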

Can we add to this issue all the fields we think are interesting or important to trace?

Naming replicate spectra in consensus libraries

This question has to do with how to mark up a consensus spectrum to link it back to its replicate spectra in their raw files when those raw files aren't on ProteomeXchange.

Files

Spectrum Library:
https://chemdata.nist.gov/download/peptide_library/libraries/skin_hair/IARPA3_best_tissue_add_info.msp.zip

Metadata File:
https://chemdata.nist.gov/download/peptide_library/libraries/skin_hair/IARPA3_all.out.zip

Metadata is a sparse table mapping consensus spectra to their contributing replicates:

Peptide	Charge	Modification	Scans	Raw file	Folder	Tissue
AAAIAYGLDK	2	0	73415	"am_03_rg_t100_nlumos_2021-02-19_350-1600_100_nm_hcd30_360min_tryp_pos.raw"	"hair_rg_guan"	Hair
AAAPGPCPPPPPPP	2	1(6,C,CAM)	33291;33463	"hf1_18_rg_l1_2021-08-06_380-2000_120_hcd30_255min_sp3_lysctryp_i_pos.raw"	"6donors_sp3"	Hair
			30890;32043	"hf3_17_rg_l1_2021-08-06_380-2000_120_hcd30_255min_sp3_lysctryp_i_pos.raw"	"6donors_sp3"	Hair
AAAQWVR	2	0	41910;42017	"xxx_2019_0215_rj_74_strapskin.raw"	"method_development"	Skin
			7991	"20190429_009_llnl_pr_014_dda_2ug.raw"	"osu_dda"	Skin
			9486	"20190503_037_llnl_pr_29_dda_2ug.raw"	"osu_dda"	Skin
			10724	"20190508_002_llnl_pr_08_dda_2ug.raw"	"osu_dda"	Skin
			9504	"20190508_008_llnl_pr_24_dda_2ug.raw"	"osu_dda"	Skin
			10287	"20190508_044_llnl_pr_06_dda_2ug.raw"	"osu_dda"	Skin
			9556	"20190508_047_llnl_pr_13_dda_2ug.raw"	"osu_dda"	Skin
			10231	"20190508_050_llnl_pr_16_dda_2ug.raw"	"osu_dda"	Skin
			10186	"20190508_058_llnl_pr_21_dda_2ug.raw"	"osu_dda"	Skin
			9342	"20190508_061_llnl_pr_22_dda_2ug.raw"	"osu_dda"	Skin
			9280	"20190508_064_llnl_pr_23_dda_2ug.raw"	"osu_dda"	Skin
			10586	"20190508_067_llnl_pr_015_dda_2ug.raw"	"osu_dda"	Skin

This maps AAAQWVR/2 to many spectra across multiple raw files. The appropriate way to express this (as far as I can tell) is to use either contributing replicate spectrum keys or contributing replicate spectrum USIs. The second option makes sense since those contributing spectra aren't in the library itself. However, this project didn't publish its data on ProteomeXchange, so I cannot construct a "real" USI for it.

Options

Fake USI

This looks "okay" and preserves the available information, but feels wrong because it sets up an expectation that this URI resolves to something. If there were a way to canonically express that this is a "local" or "private" dataset in the accession field, that would make this less misleading.

<Spectrum=...>
...
MS:1003065|spectrum aggregation type=MS:1003067|consensus spectrum
MS:1003299|contributing replicate spectrum USI=mzspec:_IARPA3:20190508_067_llnl_pr_015_dda_2ug:scan:10586
MS:1003299|contributing replicate spectrum USI=mzspec:_IARPA3:20190503_037_llnl_pr_29_dda_2ug:9486
...
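A small helper for building such placeholder USIs from the metadata table rows might look like this; the underscore-prefixed collection is the ad-hoc convention from the example above, not part of the USI specification:

```python
def replicate_usi(collection, run_name, scan):
    """Build an mzspec USI for a contributing replicate spectrum.

    collection: dataset accession; an underscore prefix (e.g. '_IARPA3')
    is used here to flag a local/private dataset, which is NOT a
    standardized convention.
    """
    run = run_name.removesuffix(".raw")  # USIs reference the run, not the file
    return f"mzspec:{collection}:{run}:scan:{scan}"
```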

Attribute Groups

This looks better as written, though it isn't preferable since A) it's more verbose, B) it is not universally portable, and C) it conflicts with the usage of scan number and constituent spectrum file for individual spectra when they are not grouped.

<Spectrum=...>
...
MS:1003065|spectrum aggregation type=MS:1003067|consensus spectrum
[1]MS:1003203|constituent spectrum file=20190508_067_llnl_pr_015_dda_2ug.raw
[1]MS:1003057|scan number=10586
[2]MS:1003203|constituent spectrum file=20190503_037_llnl_pr_29_dda_2ug.raw
[2]MS:1003057|scan number=9486
...

MSP Spectral Libraries To Convert

setup tool (EDIT: PR merged, safe to use master):

    git clone https://github.com/HUPO-PSI/mzSpecLib.git
    pip install ./mzSpecLib/implementations/python/

conversion tool:

    mzspeclib convert -f text $MSP $MZLB.TXT

spectral libraries

Add SMILES to mzPAF docs and reference implementation

Per discussion on 12/2/22, we agreed to add SMILES support to the mzPAF format:

  • Update the specification document
    • Note that the total charge will still be derived from the charge slot in the peak annotation, but localized charge may be specified in the SMILES string.
  • Update the regular expression(s)
  • Update the grammar and diagrams
