hupo-psi / mzspeclib

mzSpecLib: A standard format to exchange/distribute spectral libraries

Home Page: http://www.psidev.info/mzSpecLib

License: Apache License 2.0

Python 99.82% Makefile 0.18%
hupo-psi standards file-format proteomics mass-spectrometry spectral-library

mzspeclib's People

Contributors

bittremieux · edeutsch · henryhlam · jshofstahl · mobiusklein · mwang87 · ralfg · sneumann · uly55e5 · ypriverol

mzspeclib's Issues

Implementation detail: What does finding `library spectrum index` mean for a library reader?

We have three explicit unique identifiers for a library spectrum: key, name, and index.

  • The key is supposed to be a stable numeric identifier, akin to a "primary key" in a database, where the cardinality of the key is historical rather than positional. In theory, should you re-order a library, the key doesn't change.
  • The name is a theoretically human-readable name, chosen by its creator, that describes the spectrum; when not stipulated by the source format (e.g. MSP), it is essentially free text. It should be unique too, leaving it up to the creator to make it meaningful to a human.
  • The index is supposed to be an externally "unstable" identifier for a spectrum within the library, specifying an ordinal number starting from 0. As read, should you re-order a library, the index does change.

When reading a library, the parser "knows" how many spectra preceded the spectrum it is currently parsing, and so it can automatically "fill in" the index attribute and the authors of a library needn't include it. However, we have explicitly written that it may be included in the output:

Optionally, a library spectrum index (MS:1003062) MAY be included to refer to the ordered position of the spectrum within the library, starting with 0 for the first spectrum. A library spectrum may have its index changed as the library evolves, and therefore SHOULD only be used internally by the library management software (e.g. for random access retrieval). To refer to a library spectrum unambiguously from outside (e.g. using a Universal Spectrum Identifier), the library spectrum key MUST be used.

Should that mean that if a parser reads an index attribute, it's obligated to store it and round-trip it, while also generating its own internal index separately? Under one reading of the spec, this is undefined behavior. Another, more restrained reading might suggest that the value that index refers to is never actually taken from the source file verbatim but is inferred, and so any read value should be ignored because it constitutes information external to the layout of the library itself.

[Term]
id: MS:1003062
name: library spectrum index
def: "Integer index value that indicates the spectrum's ordered position within a spectral library. By custom, index counters should begin with 0." [PSI:PI]
is_a: MS:1003234 ! library spectrum attribute
relationship: has_value_type xsd:integer ! The allowed value-type for this CV term

Suppose we parse this:

<Spectrum=3>
MS:1003062|library spectrum index=0
...
<Spectrum=1>
MS:1003062|library spectrum index=1
...
<Spectrum=4>
MS:1003062|library spectrum index=1000

The first two entries' index attributes match their true coordinates in the sequence of library spectra, but the third spectrum's index is totally different (2 vs. 1000). What should the parser do? I'd argue that it is context-dependent.

If I were writing a spectrum viewing application, I'd include the index in the text rendering of the spectrum so that the user knows where in the file an entry is. If I then parsed that text back into the program, I'd probably want to respect that value, because the object might just be passed around to be shown elsewhere (e.g. sent to a web app to be rendered again, or relayed via federated PROXI requests), where that index information is just as salient as it would be locally. I'd treat the index as something to display, but ignore it for the purposes of actually looking that spectrum up again.

However, were I writing a library manipulation tool that's not so interactive, I'd probably say "any buffer of one or more spectra constitutes a single library", want that library to be internally consistent, and ignore the input index value. After all, if the user wants to split a library, transform the parts differently, and re-merge them, the spectra will just get re-indexed anyway, especially if the user merges two separate libraries rather than slices of the same library.

Can we say explicitly which reading is more accurate, or that both usages are acceptable and it is up to the implementer to choose which way to go?
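For concreteness, the two readings could be sketched as follows; the class and function names here are hypothetical, not taken from any implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Spectrum:
    key: int
    attributes: dict = field(default_factory=dict)
    index: Optional[int] = None           # canonical ordinal, assigned on read
    declared_index: Optional[int] = None  # index attribute found in the file, if any

def assign_indices(spectra, respect_declared=False):
    """Fill in the ordinal index for each spectrum as it is read.

    With respect_declared=False (the 'restrained' reading), any index
    attribute in the source is kept only for display and the parser's own
    counter wins; with respect_declared=True (the 'viewer' reading), a
    declared index is surfaced as-is.
    """
    for ordinal, spectrum in enumerate(spectra):
        declared = spectrum.attributes.get("MS:1003062|library spectrum index")
        spectrum.declared_index = int(declared) if declared is not None else None
        if respect_declared and spectrum.declared_index is not None:
            spectrum.index = spectrum.declared_index
        else:
            spectrum.index = ordinal
    return spectra
```

With the example above, the restrained reading yields indices 0, 1, 2 regardless of what the file declares, while the viewer reading surfaces 0, 1, 1000.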

New Repository for Spectral Library Format

@edeutsch

I have created the repository for the spectral library format, including:

  • examples : It will contain all the examples of the file formats and the legacy formats examples such as (MSP, splib, etc. )
  • legacy-formats: Legacy formats are the specifications of the previous legacy formats such as MSP, splib, etc.
  • specification: Specification contains all the information about the Spectral Library Specification.

Regards
Yasset

Decision on new or existing format

Based on the outcomes of #4, #5, #7, #9, it should be checked whether any of the existing
https://github.com/HUPO-PSI/SpectralLibraryFormat/tree/master/legacy-formats
can already encode the required information, or how much (or how little)
would be required to extend an existing format. Also, one of the areas where PSI excels
is creating controlled vocabularies, harmonising adoption, and providing validators,
so that libraries and software can claim they support the PSI SpectralLibraryFormat.

HDF representation

It was already decided to allow multiple representations (txt, json, csv, hdf...) of the new spectral library format, based on a common framework (required (meta)data, controlled vocabulary...). In this issue thread, we can discuss the best way to represent the spectral library format in HDF.

As a reference, the current TXT format looks like this:

MS:1008014|spectrum index=500
MS:1008013|spectrum name=AAAVDPTPAAPAR/2_0
MS:1008010|molecular mass=1208.6510
MS:1008015|spectrum aggregation type=MS:1008017|consensus spectrum
[1]MS:1008030|number of enzymatic termini=2
[1]MS:1001045|cleavage agent name=MS:1001251|Trypsin
MS:1001471|peptide modification details=0
...
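A minimal, illustrative sketch of parsing one attribute line of this text format; the regular expression and field names are my own, not taken from the reference implementation:

```python
import re

# One attribute line: an optional [group] prefix, an accession|name pair,
# and a value that may itself be a term (accession|name).
ATTR_LINE = re.compile(
    r"^(?:\[(?P<group>\d+)\])?"        # optional cv_param_group, e.g. [1]
    r"(?P<accession>[A-Z]+:\d+)\|"     # term accession, e.g. MS:1001045
    r"(?P<name>[^=]+)="                # term name
    r"(?P<value>.*)$"                  # value (free text or accession|name)
)

def parse_attribute(line):
    m = ATTR_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not an attribute line: {line!r}")
    d = m.groupdict()
    # split a term-valued attribute into value_accession + value
    if re.match(r"^[A-Z]+:\d+\|", d["value"]):
        d["value_accession"], d["value"] = d["value"].split("|", 1)
    else:
        d["value_accession"] = None
    return d
```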

And JSON (for one metadata item) would take the following shape:

    {
      "accession": "MS:1001045",
      "cv_param_group": "1",
      "name": "cleavage agent name",
      "value": "Trypsin",
      "value_accession": "MS:1001251"
    },

Discussion spun off from issue #12:

@bittremieux:

I think the design decisions can be quite different based on the data format, i.e. between text-based (CSV, TSV) and binary (HDF5).

Personally, with spectral libraries increasing in size, I'm strongly in favor of HDF5. HDF5 has built-in compression, resulting in much smaller files. Also, it's much easier and efficient to slice and dice the data.
With these big, repository-scale spectral libraries I think it's quite important to focus on IO and computational efficiency. Just the time required to read a multi-GB spectral library is non-negligible, making up a considerable part of the search time.

Taking that into consideration, the compact option seems considerably superior to me. You could go even further and just have two arrays per spectrum (m/z and intensity), which fits the HDF5 data model perfectly. This minimizes the query time (2 lookups to retrieve a spectrum versus 2 * k per spectrum with k peaks), and HDF5 was developed to store (binary) arrays.
Also, keep in mind that HDF5 performance can degrade quite significantly if millions of keys are used because the internal B-tree index can become quite unbalanced, leading to significant overhead during querying. Keeping the number of keys limited might thus be essential to achieve acceptable performance.

Related to your final question, the main consideration here is what the goal of this version of the format is. If it's readability, then CSV is obviously superior. But I don't care about readability here; as you mention in #11, there's already the text version (and to a lesser extent the JSON version). Instead, when going for HDF5, performance should be the main goal. And that means using HDF5 the way it was intended: storing values in arrays instead of storing each individual value separately. Make it as compact as possible, and make spectrum reading efficient by storing the peaks in compact arrays.
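A minimal sketch of the compact two-arrays-per-spectrum layout, assuming h5py; the group and dataset names are illustrative, not part of any spec:

```python
import numpy as np
import h5py

def write_compact(path, spectra):
    """Write each spectrum as just two compressed arrays (m/z, intensity).

    spectra: mapping of spectrum key -> (mz_array, intensity_array).
    """
    with h5py.File(path, "w") as f:
        for key, (mz, intensity) in spectra.items():
            grp = f.create_group(f"spectra/{key}")
            grp.create_dataset("mz", data=np.asarray(mz, dtype="f8"),
                               compression="gzip")
            grp.create_dataset("intensity", data=np.asarray(intensity, dtype="f4"),
                               compression="gzip")

def read_spectrum(path, key):
    # exactly two dataset lookups per spectrum
    with h5py.File(path, "r") as f:
        grp = f[f"spectra/{key}"]
        return grp["mz"][:], grp["intensity"][:]
```

Note that this still creates one group per spectrum; given the B-tree concern above, a repository-scale library might instead concatenate all peaks into two global arrays plus an offsets index to keep the key count low.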

@RalfG:

Thanks for the reply! I started looking into HDF and there's a lot more to it than I initially thought. The nested key system and per-group metadata would definitely be very useful for the spectral library format. This means that an optimal HDF representation would look quite different from this general tabular format. Since we want the specification to allow multiple representations (txt, json, csv, hdf...), I propose that we keep this discussion about general tabular formats (such as CSV/TSV) and move the discussion on the HDF format to a new issue.

Tabular format: compact version of peak level table?

Following the current JSON format, the tabular format (HDF, CSV, TSV...) would have four tables, one for each data level (library, spectrum, peak and peak interpretation) with the following columns: cv_param_group, accession, name, value_accession, and value (and some additional grouping columns):


Library level

cv_param_group accession name value_accession value
  MS:xxxxxxx format version   0.1
  MS:xxxxxxx title   library_001
  MS:xxxxxxx description   spectral library 001
...

Spectrum level

spectrum_index ion_group cv_param_group accession name value_accession value
1 1   MS:xxxxxxx index   1
1 1   MS:xxxxxxx title   peptide1
1 1   MS:xxxxxxx is decoy spectrum   FALSE
1 1 1 MS:xxxxxxx calibrated retention index   xx
1 1 1 UO:0000000 unit UO:0000031 minute
...

Peak level

spectrum_index peak_index cv_param_group accession name value_accession value
1 1   MS:xxxxxxx m/z 725.123   
1 1   MS:xxxxxxx theoretical m/z 725.1244   
1 1   MS:xxxxxxx intensity 2138.325   
...

Peak interpretation level

spectrum_index peak_index peak_interpretation_index cv_param_group accession name value_accession value
1 1 1     peptidoform ion series type   y
1 1 1     peptidoform ion series start ordinal   1
1 1 1     product ion series charge state   1
...

This works perfectly fine for the library, spectrum and peak interpretation levels (where there are a lot of possible attributes per entry), but for the peak level, it might be better to have a compact form:

Peak level (compact)

spectrum_index peak_index product ion m/z product ion intensity
1 1 138.0661469 190.7953186
1 2 219.1087494 29.48472786
1 3 305.0644836 1067.439087
...    

This could be extended with a few optional columns.

To keep everything well standardized and machine readable, I would add an additional table, Peak level columns, defining the columns used in the Peak level (compact) table; it could also contain info about the units used (if applicable). E.g.:

Peak level columns additional table

column_index accession name unit_accession unit_name
0   spectrum_index    
1   peak_index    
2 MS:1001225 product ion m/z MS:1000040 m/z
3 MS:1001226 product ion intensity MS:1000132 percent of base peak

To summarize:

  • The peak level would get very verbose if we followed the same fields as the JSON specification.
  • The solution would be a compact form, together with a small table specifying the columns.

Questions:

  • Does everyone agree with having a compact form for the peak level?
  • We could also deviate completely from the JSON spec and use a compact form at all levels. This would make the file more compact in general and more database-like. A drawback is that, at every level, the number of columns could then vary between libraries, which would make parsing the metadata somewhat harder. We would also have to deal with the value/value_accession duality, which we do not have at the peak level, where all values are just numbers. What does everyone think about the "full-on compact form" idea?
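As a sketch, the compact peak table and its companion column-definition table could be emitted as TSV like this (the column choices mirror the example above; the function and constant names are illustrative):

```python
import csv
import io

# Hypothetical column definitions for the compact peak table:
# (column_index, accession, name, unit_accession, unit_name)
PEAK_COLUMNS = [
    (0, None, "spectrum_index", None, None),
    (1, None, "peak_index", None, None),
    (2, "MS:1001225", "product ion m/z", "MS:1000040", "m/z"),
    (3, "MS:1001226", "product ion intensity", "MS:1000132", "percent of base peak"),
]

def write_peak_tables(peaks):
    """Return (peak_tsv, columns_tsv) as strings.

    peaks: list of (spectrum_index, peak_index, mz, intensity) tuples.
    """
    peak_buf, col_buf = io.StringIO(), io.StringIO()
    w = csv.writer(peak_buf, delimiter="\t", lineterminator="\n")
    w.writerow([c[2] for c in PEAK_COLUMNS])   # header row from column names
    w.writerows(peaks)
    w = csv.writer(col_buf, delimiter="\t", lineterminator="\n")
    w.writerow(["column_index", "accession", "name", "unit_accession", "unit_name"])
    for row in PEAK_COLUMNS:
        w.writerow(["" if v is None else v for v in row])
    return peak_buf.getvalue(), col_buf.getvalue()
```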

Peak interpretation format

As part of the new PSI spectral library format, it will be possible to annotate the interpretations of individual peaks, as is already done in NIST, SpectraST, and PeptideAtlas libraries. However, there have been several different styles of interpretation in the past (even from a single provider), and therefore this document describes a single common peak interpretation format for peptides that is recommended for all peptide libraries and related applications for which peak interpretations are desirable.

This format, as currently described, is designed for unbranched peptides with simple PTMs and for fragmentation methods commonly used in proteomics such as CID, HCD and ETD. Although there are some provisions for annotating small molecules (e.g., contaminants in a predominantly peptide spectrum), as well as unusual fragments, it is expected that for other major classes of analytes (metabolites, glycans, glycopeptides, cross-linked peptides...), alternative peak interpretation formats should be defined.

See working document for ongoing discussion.

[Pitch] Apache avro serialization

Hi y'all!

I started a (VERY EARLY) prototype that implements serialization to Apache Avro.
I think it would be a good alternative to JSON, with more efficient disk usage.

https://github.com/jspaezp/avrospeclib

I am still implementing the schema using pydantic and deriving the Avro
schema from it.

Some disk usage metrics on a reasonably large speclib I have

    # ~ 50MB  binary speclib file from diann
    #  552M   tmp/speclib_out.tsv
    #  448M   tmp/speclib_out.mzlib.json # using mzspeclib
    #  148M   tests/data/test.mzlib.avro

Read-write speeds

avro write: 4.832904
avro read: 6.133625
json write: 6.304285
json read: 4.992042
pydantic validation: 19.415933 # Not needed for avro because schema is on-write.
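For reference, a hypothetical Avro record schema for one library spectrum might take a shape like the following; the field names are loosely modeled on the JSON representation and are not an agreed schema:

```json
{
  "type": "record",
  "name": "LibrarySpectrum",
  "fields": [
    {"name": "key", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "attributes", "type": {"type": "array", "items": {
      "type": "record", "name": "Attribute", "fields": [
        {"name": "accession", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "value", "type": "string"},
        {"name": "value_accession", "type": ["null", "string"], "default": null},
        {"name": "cv_param_group", "type": ["null", "int"], "default": null}
      ]}}},
    {"name": "mzs", "type": {"type": "array", "items": "double"}},
    {"name": "intensities", "type": {"type": "array", "items": "float"}}
  ]
}
```

Storing the peaks as plain arrays of doubles/floats, rather than as per-peak attribute records, is what gives the binary encoding most of its size advantage over the JSON form.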

let me know if there is any interest in adopting it!
best!

Features Not Yet Implemented

This issue is a running list of features that are not yet implemented in the Python implementation or in the repository in general:

  • Clusters - This feature was added late in the game and is only now being specified. We should also hand-craft some examples to work with.
  • Attribute set in groups - This feature should be straightforward to implement.
  • sptxt backend attribute parsing - This requires some familiarity with the sptxt format. The syntax is identical to msp, but uses different names for some things. Documentation is missing from http://tools.proteomecenter.org/wiki/index.php?title=Software:SpectraST
  • Tabular file parsing - Something to read CSV or TSV files from search tools like Spectronaut or DIA-NN.
  • Library-level JSON Schema - The drafted PR needs to be re-worked to cover the year's worth of changes since it was first written.
  • GNPS MGF, MassBank TXT, other MGF?

There are always more read-only backends to implement or improve:

  • The BiblioSpec backend could get considerably better with a more complete and up to date example file.
  • A generic mzIdentML + mzML/MGF read-only backend would let us convert ID experiments directly, but this is "hard" to do well.
  • More attribute handlers for parsing msp comments from the wild will help cover more use-cases.

New extension for the file format

As we discussed today (PSI 2018 meeting), one of the starting points would be to change the extension of the format. I recommend starting to add possible names here, and people can vote in the comments using +1 or the voting icon.

For whatever proposal you make, please first check that the file extension does not already exist, e.g. https://fileinfo.com/extension/msp

Update?

Hi,

I just wanted to follow up on this - is this going anywhere?

Refine and finalize metadata and CV terms

The current MSP and other spectral library formats only capture the metadata around each entry in the library (clusters, consensus spectra, peptides, small molecules), but not the way the spectral library has been generated. We need to define a general metadata section at the beginning of the library to capture this information. Similar to mzTab, I think it would be great to have something like the following.

The MTD prefix tells readers that this is a metadata field. The second column is the key of the metadata attribute and the third is its value.

The following fields can be reused from mzTab:

MTD   mzL-version	1.0.0      
MTD   title  Spectral Library Human from Peptide Atlas 
MTD   id     PXL00000001 
MTD   description Some description that can be used for example in the web about the library
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
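Such an MTD block is simple to parse; a purely illustrative sketch (the format itself is still under discussion):

```python
def parse_mtd_block(text):
    """Parse lines of the form 'MTD <key> <value>' into a dict mapping
    key -> list of values (keys such as 'instrument' may repeat)."""
    metadata = {}
    for line in text.splitlines():
        parts = line.split(None, 2)   # split on runs of whitespace, max 3 fields
        if len(parts) == 3 and parts[0] == "MTD":
            _, key, value = parts
            metadata.setdefault(key, []).append(value)
    return metadata
```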

Can we add to this issue all the fields we think are interesting or important to trace?

Naming replicate spectra in consensus libraries

This question has to do with how to mark up a consensus spectrum to link it back to its replicate spectra in their raw files when those raw files aren't on ProteomeXchange.

Files

Spectrum Library:
https://chemdata.nist.gov/download/peptide_library/libraries/skin_hair/IARPA3_best_tissue_add_info.msp.zip

Metadata File:
https://chemdata.nist.gov/download/peptide_library/libraries/skin_hair/IARPA3_all.out.zip

Metadata is a sparse table mapping consensus spectra to their contributing replicates:

Peptide	Charge	Modification	Scans	Raw file	Folder	Tissue
AAAIAYGLDK	2	0	73415	"am_03_rg_t100_nlumos_2021-02-19_350-1600_100_nm_hcd30_360min_tryp_pos.raw"	"hair_rg_guan"	Hair
AAAPGPCPPPPPPP	2	1(6,C,CAM)	33291;33463	"hf1_18_rg_l1_2021-08-06_380-2000_120_hcd30_255min_sp3_lysctryp_i_pos.raw"	"6donors_sp3"	Hair
			30890;32043	"hf3_17_rg_l1_2021-08-06_380-2000_120_hcd30_255min_sp3_lysctryp_i_pos.raw"	"6donors_sp3"	Hair
AAAQWVR	2	0	41910;42017	"xxx_2019_0215_rj_74_strapskin.raw"	"method_development"	Skin
			7991	"20190429_009_llnl_pr_014_dda_2ug.raw"	"osu_dda"	Skin
			9486	"20190503_037_llnl_pr_29_dda_2ug.raw"	"osu_dda"	Skin
			10724	"20190508_002_llnl_pr_08_dda_2ug.raw"	"osu_dda"	Skin
			9504	"20190508_008_llnl_pr_24_dda_2ug.raw"	"osu_dda"	Skin
			10287	"20190508_044_llnl_pr_06_dda_2ug.raw"	"osu_dda"	Skin
			9556	"20190508_047_llnl_pr_13_dda_2ug.raw"	"osu_dda"	Skin
			10231	"20190508_050_llnl_pr_16_dda_2ug.raw"	"osu_dda"	Skin
			10186	"20190508_058_llnl_pr_21_dda_2ug.raw"	"osu_dda"	Skin
			9342	"20190508_061_llnl_pr_22_dda_2ug.raw"	"osu_dda"	Skin
			9280	"20190508_064_llnl_pr_23_dda_2ug.raw"	"osu_dda"	Skin
			10586	"20190508_067_llnl_pr_015_dda_2ug.raw"	"osu_dda"	Skin

This maps AAAQWVR/2 to many spectra across multiple raw files. The appropriate way to express this (as far as I can tell) is to use either contributing replicate spectrum keys or contributing replicate spectrum USIs. The second option makes sense since those contributing spectra aren't in the library itself. However, this project didn't publish its data on ProteomeXchange, so I cannot construct a "real" USI for it.

Options

Fake USI

This looks "okay" and preserves the available information, but feels wrong because it sets up an expectation that this URI resolves to something. If there were a way to canonically express that this is a "local" or "private" dataset in the accession field, that would make this less misleading.

<Spectrum=...>
...
MS:1003065|spectrum aggregation type=MS:1003067|consensus spectrum
MS:1003299|contributing replicate spectrum USI=mzspec:_IARPA3:20190508_067_llnl_pr_015_dda_2ug:scan:10586
MS:1003299|contributing replicate spectrum USI=mzspec:_IARPA3:20190503_037_llnl_pr_29_dda_2ug:9486
...
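A small helper for building such placeholder USIs from the metadata table rows might look like this; the underscore-prefixed collection is the ad-hoc convention from the example above, not part of the USI specification:

```python
def replicate_usi(collection, run_name, scan):
    """Build an mzspec USI for a contributing replicate spectrum.

    collection: dataset accession; an underscore prefix (e.g. '_IARPA3')
    is used here to flag a local/private dataset, which is NOT a
    standardized convention.
    """
    run = run_name.removesuffix(".raw")  # USIs reference the run, not the file
    return f"mzspec:{collection}:{run}:scan:{scan}"
```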

Attribute Groups

This looks better as written, though it isn't preferable since A) it's more verbose, B) it is not universally portable, and C) it conflicts with the usage of scan number and constituent spectrum file for individual spectra when they are not grouped.

<Spectrum=...>
...
MS:1003065|spectrum aggregation type=MS:1003067|consensus spectrum
[1]MS:1003203|constituent spectrum file=20190508_067_llnl_pr_015_dda_2ug.raw
[1]MS:1003057|scan number=10586
[2]MS:1003203|constituent spectrum file=20190503_037_llnl_pr_29_dda_2ug.raw
[2]MS:1003057|scan number=9486
...

MSP Spectral Libraries To Convert

setup tool (EDIT: PR merged, safe to use master):

    git clone https://github.com/HUPO-PSI/mzSpecLib.git
    pip install ./mzSpecLib/implementations/python/

conversion tool:

    mzspeclib convert -f text $MSP $MZLB.TXT

spectral libraries

Add SMILES to mzPAF docs and reference implementation

Per discussion on 12/2/22, we agreed to add SMILES support to the mzPAF format:

  • Update the specification document
    • Note that the total charge will still be derived from the charge slot in the peak annotation, but localized charge may be specified in the SMILES string.
  • Update the regular expression(s)
  • Update the grammar and diagrams
