hupo-psi / mzspeclib
mzSpecLib: A standard format to exchange/distribute spectral libraries
Home Page: http://www.psidev.info/mzSpecLib
License: Apache License 2.0
Are the ID values for `Analyte`s supposed to be globally unique, or only unique within the `Spectrum` they are associated with?
We have three explicit unique identifiers for a library spectrum: `key`, `name`, and `index`.

`key` is supposed to be a stable numeric identifier, akin to a "primary key" in a database, where the cardinality of the key is historical rather than positional. In theory, should you re-order a library, the `key` doesn't change.

`name` is a theoretically human-readable name that describes the spectrum, chosen by its creator, which means that it is essentially free text when not stipulated by the source format (e.g. MSP). It should be unique too, leaving it up to the creator to figure out how to make it clear to a human.

`index` is supposed to be an externally "unstable" identifier for a spectrum within the library, specifying an ordinal number starting from 0. As read, should you re-order a library, the `index` does change.

When reading a library, the parser "knows" how many spectra preceded the spectrum it is currently parsing, so it can automatically "fill in" the `index` attribute, and the authors of a library needn't include it. However, we have explicitly written that it may be included in the output:
Optionally, a library spectrum index (MS:1003062) MAY be included to refer to the ordered position of the spectrum within the library, starting with 0 for the first spectrum. A library spectrum may have its index changed as the library evolves, and therefore SHOULD only be used internally by the library management software (e.g. for random access retrieval). To refer to a library spectrum unambiguously from outside (e.g. using a Universal Spectrum Identifier), the library spectrum key MUST be used.
Should that mean that if a parser reads an `index` attribute, it is obligated to store it and round-trip it, while also generating its own internal `index` separately? This is undefined behavior, according to one reading of the spec. Another, more restrained reading might suggest that the value `index` refers to is never actually taken from the source file verbatim but inferred, and so any read value should be ignored, because it constitutes information external to the layout of the library itself.
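Under that second reading, a parser keeps its own counter and discards whatever `MS:1003062` value appears in the file. A minimal sketch of that behavior (hypothetical code, not the mzspeclib implementation):

```python
# Hypothetical sketch: infer each spectrum's index from parse order,
# ignoring any "library spectrum index" attribute read from the file.
INDEX_ACCESSION = "MS:1003062"

def assign_indices(spectra):
    """Yield (index, spectrum) pairs, overwriting any declared index."""
    for position, spectrum in enumerate(spectra):
        # Any declared value (even a disagreeing one like 1000) is
        # discarded: under this reading the index is always inferred
        # from parse order, never taken from the file verbatim.
        spectrum[INDEX_ACCESSION] = position
        yield position, spectrum

# Mirror the document's example: declared indices 0, 1, 1000.
spectra = [{INDEX_ACCESSION: 0}, {INDEX_ACCESSION: 1}, {INDEX_ACCESSION: 1000}]
result = list(assign_indices(spectra))
# result[2] now carries index 2, not the declared 1000.
```

The other reading (store and round-trip the read value alongside an internal index) would instead keep both numbers on the spectrum object.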
[Term]
id: MS:1003062
name: library spectrum index
def: "Integer index value that indicates the spectrum's ordered position within a spectral library. By custom, index counters should begin with 0." [PSI:PI]
is_a: MS:1003234 ! library spectrum attribute
relationship: has_value_type xsd:integer ! The allowed value-type for this CV term
Suppose we parse this:
<Spectrum=3>
MS:1003062|library spectrum index=0
...
<Spectrum=1>
MS:1003062|library spectrum index=1
...
<Spectrum=4>
MS:1003062|library spectrum index=1000
The first two entries' `index` attributes match their true coordinates in the sequence of library spectra, but the third spectrum's `index` is totally different (2 vs. 1000). What should the parser do? I'd argue that it is context-dependent.
If I were writing a spectrum viewing application, I'd include the `index` in the text rendering of the spectrum so that the user knows where in the file an entry is. If I then parsed that text back into the program, I'd probably want to respect that value, because I wouldn't know whether the object was just being passed around to be shown elsewhere (e.g. sent to a web app to be rendered again, or passed around via federated PROXI requests), where that index information is just as salient as if it were presented locally. I'd treat that index value as something to display, but ignore it for the purposes of actually looking that single spectrum up again.
However, were I writing a less interactive library manipulation tool, I'd probably say "any buffer of one or more spectra constitutes a single library," want that library to be internally consistent, and ignore the input `index` value. After all, if the user wants to split a library, transform the parts differently, and re-merge them, the spectra will just get re-indexed anyway; especially so if the user merges two separate libraries rather than slices of the same library.
Can we say explicitly which reading is more accurate, or that both usages are acceptable and it is up to the implementer to choose which way to go?
We have one request in mzTab to encode the information of the peptide identified from the spectral library into optional columns. The user wants to encode:
I can't find this information in the current proposal. Can you point me to the document that contains it?
We are currently editing a document in google docs with the current terms in MSP and the corresponding proposals in the SpectralLibraryFormat.
Document here: https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit?usp=sharing
I have created the repository for the spectral library format, including:
Regards
Yasset
Based on the outcomes of #4, #5, #7, and #9, it should be checked whether any of the existing https://github.com/HUPO-PSI/SpectralLibraryFormat/tree/master/legacy-formats can already encode the required information, or how much (or how little) would be required to extend an existing format. Plus, one of the areas where PSI excels is creating controlled vocabularies, harmonising adoption, and providing validators, so that libraries and software can claim they support the PSI SpectralLibraryFormat.
It was already decided to allow multiple representations (txt, json, csv, hdf...) of the new spectral library format, based on a common framework (required (meta)data, controlled vocabulary...). In this issue thread, we can discuss the best way to represent the spectral library format in HDF.
As a reference, the current TXT format looks like this:
MS:1008014|spectrum index=500
MS:1008013|spectrum name=AAAVDPTPAAPAR/2_0
MS:1008010|molecular mass=1208.6510
MS:1008015|spectrum aggregation type=MS:1008017|consensus spectrum
[1]MS:1008030|number of enzymatic termini=2
[1]MS:1001045|cleavage agent name=MS:1001251|Trypsin
MS:1001471|peptide modification details=0
...
And JSON (for one metadata item) would take the following shape:
{
    "accession": "MS:1001045",
    "cv_param_group": "1",
    "name": "cleavage agent name",
    "value": "Trypsin",
    "value_accession": "MS:1001251"
},
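To make the correspondence between the two representations concrete, here is a hypothetical helper that maps one TXT attribute line onto the JSON-style dict shown above (a sketch handling only the simple cases in these examples, not part of any official implementation):

```python
import re

# Hypothetical sketch: convert one TXT attribute line, e.g.
#   "[1]MS:1001045|cleavage agent name=MS:1001251|Trypsin"
# into the JSON shape shown above.
LINE = re.compile(
    r"^(?:\[(?P<group>\d+)\])?(?P<accession>\S+?)\|(?P<name>[^=]+)=(?P<value>.*)$"
)

def txt_attribute_to_json(line):
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unparseable attribute line: {line!r}")
    record = {"accession": m["accession"], "name": m["name"]}
    if m["group"]:
        record["cv_param_group"] = m["group"]
    value = m["value"]
    if "|" in value and value.split("|", 1)[0].count(":") == 1:
        # The value is itself a CV term, e.g. "MS:1001251|Trypsin".
        record["value_accession"], record["value"] = value.split("|", 1)
    else:
        record["value"] = value
    return record
```

For example, `txt_attribute_to_json("[1]MS:1001045|cleavage agent name=MS:1001251|Trypsin")` yields exactly the dict in the JSON fragment above.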
Discussion spun off from issue #12:
I think the design decisions can be quite different based on the data format, i.e. between text-based (CSV, TSV) and binary (HDF5).
Personally, with spectral libraries increasing in size, I'm strongly in favor of HDF5. HDF5 has built-in compression, resulting in much smaller files. Also, it's much easier and efficient to slice and dice the data.
With these big, repository-scale spectral libraries, I think it's quite important to focus on IO and computational efficiency. Just the time required to read a multi-GB spectral library is non-negligible, making up a considerable part of the search time.

Taking that into consideration, the compact option seems considerably superior to me. You could go even further and just have two arrays per spectrum (m/z and intensity), which fits the HDF5 data model perfectly. This minimizes the query time (2 lookups to retrieve a spectrum versus 2 * k per spectrum with k peaks), and HDF5 was developed to store (binary) arrays.

Also, keep in mind that HDF5 performance can degrade quite significantly if millions of keys are used, because the internal B-tree index can become quite unbalanced, leading to significant overhead during querying. Keeping the number of keys limited might thus be essential to achieve acceptable performance.

Related to your final question, the main consideration here is what the goal of this version of the format is. If it is readability, then CSV is obviously superior. But I don't care about readability here; as you mention in #11, there's already the text version (and, to a lesser extent, the JSON version). Instead, when going for HDF5, performance should be the main goal. And that means using HDF5 the way it was intended and storing values in arrays rather than storing each individual value separately. Make it as compact as possible, and make spectrum reading efficient by storing the peaks in compact arrays.
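To make the "two arrays per spectrum" idea concrete, here is a rough h5py sketch of that layout. The group and dataset names are invented for illustration and are not part of any proposed specification:

```python
import h5py
import numpy as np

# Hypothetical layout: one group per spectrum holding exactly two
# compressed datasets (m/z and intensity), so a spectrum is retrieved
# with two lookups regardless of its number of peaks.
mz = np.array([138.0661, 219.1087, 305.0645])
intensity = np.array([190.80, 29.48, 1067.44])

with h5py.File("library.h5", "w") as f:
    group = f.create_group("spectra/0")          # keyed by spectrum index
    group.create_dataset("mz", data=mz, compression="gzip")
    group.create_dataset("intensity", data=intensity, compression="gzip")
    group.attrs["name"] = "AAAVDPTPAAPAR/2_0"    # small metadata as attributes

with h5py.File("library.h5", "r") as f:
    peaks = f["spectra/0"]
    assert np.allclose(peaks["mz"][:], mz)       # two lookups: group, dataset
```

Note that this still creates one group per spectrum; per the B-tree caveat above, a single pair of library-wide arrays with an offsets index would reduce the key count further.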
Thanks for the reply! I started looking into HDF and there's a lot more to it than I initially thought. The nested key system and per-group metadata would definitely be very useful for the spectral library format. This means that an optimal HDF representation would look pretty different from this general tabular format. Since we want the specification to allow multiple representations (txt, json, csv, hdf...), I propose that we keep this discussion about general tabular formats (such as csv/tsv) and move the discussion of the HDF format to a new issue.
Here:
We will capture the metadata around each spectrum in the library.
See slide with schematic.
Following the current JSON format, the tabular format (HDF, CSV, TSV...) would have four tables, one for each data level (`library`, `spectrum`, `peak`, and `peak interpretation`) with the following columns: `cv_param_group`, `accession`, `name`, `value_accession`, and `value` (plus some additional grouping columns):
Library level

| cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|
| | MS:xxxxxxx | format version | | 0.1 |
| | MS:xxxxxxx | title | | library_001 |
| | MS:xxxxxxx | description | | spectral library 001 |
| ... | | | | |
Spectrum level

| spectrum_index | ion_group | cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|---|---|
| 1 | 1 | | MS:xxxxxxx | index | | 1 |
| 1 | 1 | | MS:xxxxxxx | title | | peptide1 |
| 1 | 1 | | MS:xxxxxxx | is decoy spectrum | | FALSE |
| 1 | 1 | 1 | MS:xxxxxxx | calibrated retention index | | xx |
| 1 | 1 | 1 | UO:0000000 | unit | UO:0000031 | minute |
| ... | | | | | | |
Peak level

| spectrum_index | peak_index | cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|---|---|
| 1 | 1 | | MS:xxxxxxx | m/z | | 725.123 |
| 1 | 1 | | MS:xxxxxxx | theoretical m/z | | 725.1244 |
| 1 | 1 | | MS:xxxxxxx | intensity | | 2138.325 |
| ... | | | | | | |
Peak interpretation level

| spectrum_index | peak_index | peak_interpretation_index | cv_param_group | accession | name | value_accession | value |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | | | peptidoform ion series type | | y |
| 1 | 1 | 1 | | | peptidoform ion series start ordinal | | 1 |
| 1 | 1 | 1 | | | product ion series charge state | | 1 |
| ... | | | | | | | |
This works perfectly fine for the `library`, `spectrum`, and `peak interpretation` levels (where there are a lot of possible attributes per entry), but for the `peak` level it might be better to have a compact form:
Peak level (compact)

| spectrum_index | peak_index | product ion m/z | product ion intensity |
|---|---|---|---|
| 1 | 1 | 138.0661469 | 190.7953186 |
| 1 | 2 | 219.1087494 | 29.48472786 |
| 1 | 3 | 305.0644836 | 1067.439087 |
| ... | | | |
This could be extended with a few optional columns. To keep everything well standardized and machine readable, I would add an additional table, `Peak level columns`, defining the columns used in the `Peak level (compact)` table, which could also contain info about the units used (if applicable). E.g.:
Peak level columns additional table

| column_index | accession | name | unit_accession | unit_name |
|---|---|---|---|---|
| 0 | | spectrum_index | | |
| 1 | | peak_index | | |
| 2 | MS:1001225 | product ion m/z | MS:1000040 | m/z |
| 3 | MS:1001226 | product ion intensity | MS:1000132 | percent of base peak |
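As a sketch of how a writer might emit the compact peak table together with its column-definition table, here is a hypothetical helper using CSV for illustration (the column names and CV accessions are taken from the tables above; the function itself is invented):

```python
import csv
import io

# Hypothetical sketch: write the compact peak table plus a companion
# column-definition table that makes the columns machine-readable.
columns = [
    (0, "", "spectrum_index", "", ""),
    (1, "", "peak_index", "", ""),
    (2, "MS:1001225", "product ion m/z", "MS:1000040", "m/z"),
    (3, "MS:1001226", "product ion intensity", "MS:1000132", "percent of base peak"),
]
peaks = [
    (1, 1, 138.0661469, 190.7953186),
    (1, 2, 219.1087494, 29.48472786),
    (1, 3, 305.0644836, 1067.439087),
]

def write_tables(column_defs, peak_rows):
    """Return (columns_csv, peaks_csv) as strings."""
    col_buf, peak_buf = io.StringIO(), io.StringIO()
    col_writer = csv.writer(col_buf)
    col_writer.writerow(["column_index", "accession", "name", "unit_accession", "unit_name"])
    col_writer.writerows(column_defs)
    peak_writer = csv.writer(peak_buf)
    # Derive the peak-table header from the column definitions, so the
    # two tables cannot drift apart.
    peak_writer.writerow([c[2] for c in column_defs])
    peak_writer.writerows(peak_rows)
    return col_buf.getvalue(), peak_buf.getvalue()

cols_csv, peaks_csv = write_tables(columns, peaks)
```

Deriving the peak-table header from the definition table is the point of the exercise: optional extra columns only need a new row in `Peak level columns`.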
To summarize:

Questions: how should we handle the `peak` level? The verbose form carries the `value` / `value_accession` duality, which we do not have at the `peak` level, as all values there are just numbers. What does everyone think about the "full-on compact form" idea?

As part of the new PSI spectral library format, it will be possible to annotate the interpretations of individual peaks, as is already done in NIST, SpectraST, and PeptideAtlas libraries. However, there have been several different styles of interpretation in the past (even from a single provider), and therefore this document describes a single common peak interpretation format for peptides that is recommended for all peptide libraries and related applications for which peak interpretations are desirable.
This format, as currently described, is designed for unbranched peptides with simple PTMs and for fragmentation methods commonly used in proteomics such as CID, HCD and ETD. Although there are some provisions for annotating small molecules (e.g., contaminants in a predominantly peptide spectrum), as well as unusual fragments, it is expected that for other major classes of analytes (metabolites, glycans, glycopeptides, cross-linked peptides...), alternative peak interpretation formats should be defined.
See working document for ongoing discussion.
Hi y'all!
I started a (VERY EARLY PROTOTYPE) that implements serialization to Apache Avro. I think it would be a good alternative to JSON, with more efficient disk usage.
https://github.com/jspaezp/avrospeclib
I am still implementing the schema using pydantic and deriving the Avro schema from it.
Some disk usage metrics on a reasonably large speclib I have
# ~ 50MB binary speclib file from diann
# 552M tmp/speclib_out.tsv
# 448M tmp/speclib_out.mzlib.json # using mzspeclib
# 148M tests/data/test.mzlib.avro
Read-write speeds
avro write: 4.832904
avro read: 6.133625
json write: 6.304285
json read: 4.992042
pydantic validation: 19.415933 # Not needed for avro because schema is on-write.
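For readers unfamiliar with Avro: a record schema for a spectrum entry might look roughly like the fragment below. This is a purely hypothetical sketch, not the actual schema used in avrospeclib; the field names are invented for illustration.

```json
{
  "type": "record",
  "name": "LibrarySpectrum",
  "fields": [
    {"name": "key", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "precursor_mz", "type": "double"},
    {"name": "mz", "type": {"type": "array", "items": "double"}},
    {"name": "intensity", "type": {"type": "array", "items": "float"}}
  ]
}
```

Because the schema is embedded once in the file and values are packed as typed binary arrays, field names are not repeated per record, which is where much of the size win over JSON comes from.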
let me know if there is any interest in adopting it!
best!
This issue is a running list of features that are not yet implemented in the Python implementation or in the repository in general:

- `sptxt` backend attribute parsing. This requires some familiarity with the `sptxt` format. The syntax is identical to `msp`, but uses different names for some things. Documentation is missing from http://tools.proteomecenter.org/wiki/index.php?title=Software:SpectraST
- There are always more read-only backends to implement or improve: `msp` comments from the wild will help cover more use-cases.

As we discussed today (PSI 2018 meeting), one of the starting points would be to change the extension of the format as proposed. I recommend starting to add possible names here, and people can vote on each comment using +1 or the voting icon.
Whatever proposal you make, please first check this site to ensure the file extension does not already exist: https://fileinfo.com/extension/msp
Hi,
I just wanted to follow up on this - is this going anywhere?
The current msp and other spectral library formats only capture the metadata around each entry in the library (cluster, consensus spectra, peptides, small molecules), but not the way the spectral library has been generated. We need to define a general metadata section at the beginning of the library to capture this metadata. Similar to mzTab, I think it would be great to have something like:
The MTD prefix tells readers that this is a metadata line. The second column is the key of the metadata attribute, and the third is the value of the metadata field.
The following fields can be reused from mzTab:
MTD mzL-version 1.0.0
MTD title Spectral Library Human from Peptide Atlas
MTD id PXL00000001
MTD description Some description that can be used for example in the web about the library
MTD instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD instrument [MS, MS:1000008, Velos Orbitrap,]
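A reader for such a section is easy to sketch (hypothetical code, assuming whitespace-separated `MTD <key> <value>` lines as in the example above):

```python
# Hypothetical sketch: parse "MTD <key> <value>" metadata lines into a dict.
# Repeated keys (e.g. multiple instruments) accumulate into a list.
def parse_mtd(lines):
    metadata = {}
    for line in lines:
        if not line.startswith("MTD"):
            continue
        # Split into prefix, key, and the remainder as the value.
        _, key, value = line.split(None, 2)
        metadata.setdefault(key, []).append(value)
    return metadata

header = [
    "MTD mzL-version 1.0.0",
    "MTD title Spectral Library Human from Peptide Atlas",
    "MTD instrument [MS, MS:1000703, LTQ Orbitrap,]",
    "MTD instrument [MS, MS:1000008, Velos Orbitrap,]",
]
meta = parse_mtd(header)
```

The accumulate-into-a-list behavior matters because, as in the example, fields like `instrument` can legitimately appear more than once.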
Can we add to this issue all the fields we think are interesting or important to trace?
This question has to do with how to mark up a consensus spectrum to link it back to its replicate spectra in their raw files when those raw files aren't on ProteomeXchange.
Spectrum Library:
https://chemdata.nist.gov/download/peptide_library/libraries/skin_hair/IARPA3_best_tissue_add_info.msp.zip
Metadata File:
https://chemdata.nist.gov/download/peptide_library/libraries/skin_hair/IARPA3_all.out.zip
Metadata is a sparse table mapping consensus spectra to their contributing replicates:
Peptide Charge Modification Scans Raw file Folder Tissue
AAAIAYGLDK 2 0 73415 "am_03_rg_t100_nlumos_2021-02-19_350-1600_100_nm_hcd30_360min_tryp_pos.raw" "hair_rg_guan" Hair
AAAPGPCPPPPPPP 2 1(6,C,CAM) 33291;33463 "hf1_18_rg_l1_2021-08-06_380-2000_120_hcd30_255min_sp3_lysctryp_i_pos.raw" "6donors_sp3" Hair
30890;32043 "hf3_17_rg_l1_2021-08-06_380-2000_120_hcd30_255min_sp3_lysctryp_i_pos.raw" "6donors_sp3" Hair
AAAQWVR 2 0 41910;42017 "xxx_2019_0215_rj_74_strapskin.raw" "method_development" Skin
7991 "20190429_009_llnl_pr_014_dda_2ug.raw" "osu_dda" Skin
9486 "20190503_037_llnl_pr_29_dda_2ug.raw" "osu_dda" Skin
10724 "20190508_002_llnl_pr_08_dda_2ug.raw" "osu_dda" Skin
9504 "20190508_008_llnl_pr_24_dda_2ug.raw" "osu_dda" Skin
10287 "20190508_044_llnl_pr_06_dda_2ug.raw" "osu_dda" Skin
9556 "20190508_047_llnl_pr_13_dda_2ug.raw" "osu_dda" Skin
10231 "20190508_050_llnl_pr_16_dda_2ug.raw" "osu_dda" Skin
10186 "20190508_058_llnl_pr_21_dda_2ug.raw" "osu_dda" Skin
9342 "20190508_061_llnl_pr_22_dda_2ug.raw" "osu_dda" Skin
9280 "20190508_064_llnl_pr_23_dda_2ug.raw" "osu_dda" Skin
10586 "20190508_067_llnl_pr_015_dda_2ug.raw" "osu_dda" Skin
Mapping `AAAQWVR/2` to many, many spectra across multiple raw files. The appropriate way to express this (so far as I can tell) is to use either `contributing replicate spectrum keys` or `contributing replicate spectrum USI`. The second option makes sense, since those contributing spectra aren't in the library itself. However, this project didn't publish its data on ProteomeXchange, so I cannot construct a "real" USI for it.

The following looks "okay" and preserves the available information, but feels wrong because it sets up an expectation that this URI resolves to something. If there were a way to canonically express in the accession field that this is a "local" or "private" dataset, it would be less misleading.
<Spectrum=...>
...
MS:1003065|spectrum aggregation type=MS:1003067|consensus spectrum
MS:1003299|contributing replicate spectrum USI=mzspec:_IARPA3:20190508_067_llnl_pr_015_dda_2ug:scan:10586
MS:1003299|contributing replicate spectrum USI=mzspec:_IARPA3:20190503_037_llnl_pr_29_dda_2ug:scan:9486
...
The following looks better as written, though it isn't preferable since A) it's more verbose, B) it is not universally portable, and C) it conflicts with the usage of `scan number` and `constituent spectrum file` as used for individual spectra when they are not grouped.
<Spectrum=...>
...
MS:1003065|spectrum aggregation type=MS:1003067|consensus spectrum
[1]MS:1003203|constituent spectrum file=20190508_067_llnl_pr_015_dda_2ug.raw
[1]MS:1003057|scan number=10586
[2]MS:1003203|constituent spectrum file=20190503_037_llnl_pr_29_dda_2ug.raw
[2]MS:1003057|scan number=9486
...
EDIT: PR merged, safe to use master
git clone https://github.com/HUPO-PSI/mzSpecLib.git
pip install ./mzSpecLib/implementations/python/
mzspeclib convert -f text $MSP $MZLB.TXT
Per discussion on 12/2/22, we agreed to add SMILES support to the mzPAF format.
@sneumann @jshofstahl can you confirm that you have seen this type of annotation also:
If not, I will remove the example.