hupo-psi / proforma

HUPO-PSI Standardized peptidoform notation

mass-spectrometry proteomics standards peptidoform

proforma's Introduction

ProForma (Proteoform and Peptidoform Notation)

Protein and peptide sequences are usually represented as a string of amino acids in the well-known one-letter code endorsed by IUPAC. However, there is still no clear consensus on how to represent ‘proteoforms’ and ‘peptidoforms’, meaning all possible variations of a protein/peptide sequence, including modifications, both artefactual and post-translational (PTMs). There are multiple ways of encoding mass modifications, and extended discussion has taken place to reach a consensus. A standard notation for proteoforms and peptidoforms is therefore required by the community, so that it can be embedded in many relevant PSI (and potentially other) file formats.

The PSI has developed a format called PEFF (PSI Extended FASTA Format) that can be used to represent proteoforms. Additionally, the Consortium for Top Down Proteomics (CTDP) developed a notation called ProForma v1, which also aims to represent proteoforms.

This format specification represents the consensus for the standard representation of proteoforms and peptidoforms. This notation aims to support the main proteomics approaches, including bottom-up (focused on peptides/peptidoforms) and top-down (focused on proteins/proteoforms) approaches.

Use cases supported (with examples)

The ProForma notation is a string of characters that linearly represents one or more peptidoform/proteoform primary structures, with the possibility of linking peptidic chains together. It is not meant to represent secondary or tertiary structures.

Canonical IUPAC amino acids

EMEVEESPEK

PTMs using common ontologies or controlled vocabularies (e.g. Unimod, PSI-MOD, and RESID)

EM[Oxidation]EVEES[UNIMOD:21]PEK
EM[L-methionine sulfoxide]EVEES[MOD:00046]PEK
EM[R:L-methionine (R)-sulfoxide]EVEES[RESID:AA0037]PEK

Cross-linkers using the XL-MOD ontology

EMEVTK[XLMOD:02001#XL1]SESPEK[#XL1]
EVTSEKC[L-cystine (cross-link)#XL1]LEMSC[#XL1]EFD

Glycans using the Glycan Naming Ontology (GNO)

YPVLN[GNO:G62765YT]VTMPN[GNO:G02815KT]NSNGKFDK

Arbitrary mass shifts and unknown mass gaps

EM[+15.9949]EVEES[-79.9663]PEK
RTAAX[+367.0537]WT

Elemental formulas and Glycan compositions

SEQUEN[Formula:C12H20O2]CE
SEQUEN[Glycan:HexNAc1Hex2]CE

Terminal and Labile Modifications

[iTRAQ4plex]-EMEVNESPEK-[Methyl]
{Glycan:Hex}EMEVNESPEK

Ambiguity of modification position (completely unlocalised, n possible sites, or a range of sites)

[Phospho]?EMEVTSESPEK
EMEVT[#g1]S[#g1]ES[Phospho#g1]PEK
PROT(EOSFORMS)[+19.0523]ISK

Global modifications (e.g. isotopic labeling or fixed protein modifications)

<13C>ATPEILTVNSIGQLK
<[S-carboxamidomethyl-L-cysteine]@C>ATPEILTCNSIGCLK

Additional user-supplied information and multi-valued tags

ELV[info:AnyString]IS
ELV[+11.9784|info:suspected frobinylation]IS
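To make the tag syntax in the examples above concrete, here is a minimal Python sketch that splits a simple ProForma string into residues and their tags, splits multi-valued tags on `|`, and sums any numeric mass shifts. It is an illustration only, not a conforming parser: the names `tokenize` and `total_mass_shift` are made up for this example, and it assumes flat square-bracket tags attached directly to residues (no terminal, labile, global, ranged, or nested-bracket constructs).

```python
import re

# One residue letter followed by zero or more flat [..] tags.
# Assumption: tags contain no nested square brackets.
TOKEN = re.compile(r"([A-Z])((?:\[[^\[\]]*\])*)")
TAG = re.compile(r"\[([^\[\]]*)\]")
MASS_SHIFT = re.compile(r"^[+-]\d+(?:\.\d+)?$")


def tokenize(sequence):
    """Return a list of (residue, [tag values]) pairs for a simple ProForma string."""
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        residue, raw_tags = match.group(1), match.group(2)
        # A multi-valued tag such as [+11.9784|info:...] is split on '|'.
        values = [part for tag in TAG.findall(raw_tags) for part in tag.split("|")]
        tokens.append((residue, values))
        pos = match.end()
    return tokens


def total_mass_shift(tokens):
    """Sum every tag value that looks like an explicit mass shift, e.g. +15.9949."""
    return sum(
        float(value)
        for _, values in tokens
        for value in values
        if MASS_SHIFT.match(value)
    )


if __name__ == "__main__":
    print(tokenize("ELV[+11.9784|info:suspected frobinylation]IS"))
    print(total_mass_shift(tokenize("EM[+15.9949]EVEES[-79.9663]PEK")))
```

Names such as `Oxidation` or `UNIMOD:21` are returned as plain strings; resolving them against Unimod, PSI-MOD, or RESID is outside the scope of this sketch.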

proforma's Issues

Specification clarification

Since #6 needs an update to the specification from my side, I would like some things to be stated a bit more clearly in the text of the specification. None of these require changes to the format itself, only to the text.

- The order of the different kinds of pre-sequence modifications. This is already given in a comment on an open issue as: <GLOBAL_MOD>[UNKNOWN_POS]?{LABILE_MOD}[N_TERM]-PEPTIDE-[C_TERM]
- There are two different peptide sequence dividers: // for cross-linked peptides (4.2.3.2) and \\ for branched peptides (4.2.4). This is only apparent from the example in the branched section. At a minimum, I think it needs an explicit mention in the branched section. As a side note, I would be interested to hear the reasoning for why two different notations are needed and why they cannot be used interchangeably. This matters because of human error: it is easy to misremember and use the 'wrong' one.
- Chimeric spectra are quite underspecified with regard to how they tie in with the rest of the specification.
  - It is unspecified whether global and/or ambiguous modifications on one peptidoform also apply to the other peptidoforms. I assumed each is only valid on its own peptidoform, which is logical in the MS context, where the precursor mass is generally defined. So this example would be invalid: [oxidation#g1]?A[#g1]+B[#g1]
  - It is unspecified how cross-linked peptides work in chimeric spectra. I assume no one will actually have a problem with this at present, but if DIA continues to be adopted by more MS subfields it might come up at some point. My assumption is that + has the lowest precedence, meaning that A[#XL1]//B[#XL1]+C[#XL1]//D[#XL1] is a valid expression in the current specification and denotes a chimeric spectrum containing the peptidoform A linked to B and the peptidoform C linked to D. If there is consensus on this point, it would be nice to state it in the specification.
- In section 4.1 (page 7) on amino acids, the link for the definition of the ambiguous amino acids points to section 7.5.3; this should be section 7.4.3.
- In section 4.2 (page 8), the link to XLMOD points to its old location in mzIdentML.
- In section 4.6.2 (fixed protein modifications), it is not defined whether the amino acids are allowed to be lowercase (might also be of interest for the discussion in #6). (Answered in section 4: the whole specification is capitalisation insensitive.)
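The precedence assumption in the chimeric-spectra point can be sketched in Python as follows. This encodes only the reading proposed in this issue (split on `+` first, then on `//`), not anything the specification confirms. The helper names `split_top_level` and `parse_chimeric` are hypothetical, and separators inside any kind of bracket are ignored so that mass shifts such as `[+15.9949]` are not mistaken for the chimeric divider.

```python
OPENING = "[({<"
CLOSING = "])}>"


def split_top_level(text, separator):
    """Split text on separator, ignoring occurrences inside brackets."""
    parts = []
    depth = 0
    start = 0
    i = 0
    while i < len(text):
        char = text[i]
        if char in OPENING:
            depth += 1
        elif char in CLOSING:
            depth -= 1
        elif depth == 0 and text.startswith(separator, i):
            parts.append(text[start:i])
            i += len(separator)
            start = i
            continue
        i += 1
    parts.append(text[start:])
    return parts


def parse_chimeric(text):
    """Assumed precedence: '+' binds loosest, '//' groups cross-linked chains."""
    return [split_top_level(part, "//") for part in split_top_level(text, "+")]


if __name__ == "__main__":
    print(parse_chimeric("A[#XL1]//B[#XL1]+C[#XL1]//D[#XL1]"))
```

Under this reading the example yields two independent groups, each with its own pair of cross-linked chains, which also fits the assumption that a label such as `#XL1` is scoped to a single peptidoform group.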

Specify ionic species further

Ion charges can be represented in the current version of ProForma (section 7.1 - MS Extensions)

EMEVEESPEK/3[+2Na+,+H+]
EMEVEESPEK/-1[+e-]

Given examples in the specification
This does not specify the notation for the ionic species. It does, however, show that fairly complex options are valid, such as the removal of an OH-. It does not state how more highly charged ionic species are written, for example highly charged metal ions such as Fe[III].

EMEVEESPEK/7[+2Fe+3,+H+]
EMEVEESPEK/1[-OH-]
EMEVEESPEK/1[+N1H3+]

Potential uses for a higher complexity notation. Note on the last example: the 3+ here indicates that there are 3 H and that the charge of the whole species is +1

To me it seems logical to specify this as using the full modification Formula: notation (with e allowed as well) followed by the total number of charges for that species. But that implies that this field can have paired square brackets [], positive and negative numbers internally, and this might introduce some visual ambiguity on what the final number is doing.

Additionally, there is one example that does not use a sign on the number of ions, while other examples use only a sign. Formalising the notation seems warranted to me.

EMEVEESPEK/-2[2I-]

Given example in the specification using only a number as the number of times an ionic species is present

\/([+-]?\d+)(\[((?:(?:[+-]?\d+)|(?:[+-]))((?:\[\d+[A-Z][A-Za-z]?\d+)|(?:[A-Z][A-Za-z]?\d*))+([+-]\d*),)*((?:(?:[+-]?\d+)|(?:[+-]))((?:\[\d+[A-Z][A-Za-z]?\d+)|(?:[A-Z][A-Za-z]?\d*))+([+-]\d*))\])?

Here is a beast of a regular expression for how this format could look (does not check for use of valid elements)
Here it is in regex101 with some example matches

"/" <number> ("[" ( <number_or_sign> <formula> <number_or_sign> ",")+ "]")?

This is the same in a somewhat more readable BNF-like notation

For a bit of background: I came upon this when implementing my own parser for ProForma. I have no serious need or use for any of the complexity here, but my code internally allows any chemical formula to be specified as an ionic species, so I was looking into this section to find out how to export the internal peptides back to fully valid ProForma.
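As a quick check, the regular expression proposed above can be run against the charge suffixes quoted in this issue. The sketch below uses Python's `re` module (an assumption; the issue itself only links to regex101) and copies the pattern verbatim, split across lines purely for readability. The helper name `matches_proposal` is made up for this example.

```python
import re

# Pattern copied from the issue text; adjacent raw strings are concatenated.
CHARGE_PATTERN = re.compile(
    r"\/([+-]?\d+)"
    r"(\["
    r"((?:(?:[+-]?\d+)|(?:[+-]))((?:\[\d+[A-Z][A-Za-z]?\d+)|(?:[A-Z][A-Za-z]?\d*))+([+-]\d*),)*"
    r"((?:(?:[+-]?\d+)|(?:[+-]))((?:\[\d+[A-Z][A-Za-z]?\d+)|(?:[A-Z][A-Za-z]?\d*))+([+-]\d*))"
    r"\])?"
)

# Charge suffixes taken from the examples quoted in this issue.
EXAMPLES = [
    "/3[+2Na+,+H+]",
    "/-1[+e-]",
    "/7[+2Fe+3,+H+]",
    "/1[-OH-]",
    "/1[+N1H3+]",
    "/-2[2I-]",
]


def matches_proposal(suffix):
    """True if the whole charge suffix is accepted by the proposed pattern."""
    return CHARGE_PATTERN.fullmatch(suffix) is not None


if __name__ == "__main__":
    for suffix in EXAMPLES:
        print(f"{suffix:18} {matches_proposal(suffix)}")
```

As written, both element branches of the pattern require an uppercase first letter, so the electron example `/-1[+e-]` is not accepted as a full match while the other five are. If `e` is meant to be allowed, the pattern would need an extra alternative for it.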

Minor inconsistencies in spec

There are some minor inaccuracies in some of the examples in the specification draft 12:

page 8: EM[R: Methionine sulfone]EVEES[O-phospho-L-serine]PEK -> This term doesn't appear in RESID. Note the leading space, but even without that the name is incorrect. Probably it should be L-methionine sulfone (RESID:AA0251)?
page 9: EM[UNIMOD:15]EVEES[UNIMOD:56]PEK -> accession UNIMOD:15 does not exist. In case consistency with the previous examples is desired, UNIMOD:35 corresponds to Oxidation. Same for the invalid example with U:15 just underneath.
page 11: EVTSEKC[half-cystine]LEMSC[half-cystine]EFD -> half-cystine should be half cystine (no hyphen).
page 14: The mass of HexS is specified with only three decimals, whereas other masses in that list have four decimals. It's also not rounded correctly. Instead use 242.0096 as the mass with four decimals.

Potential error in monosaccharide list

According to the ProForma monosaccharide list, the chemical formula of Neu5Ac is H₁₇C₁₁N₁O₈. But according to most other sources the formula is H₁₇C₁₁N₁O₉ (note the additional oxygen) (PubChem, glyco mass calc, Wikipedia).

I assume this is just an error in the list, or is there something else going on?

(For ease of checking I reference @mobiusklein, as he made the list)

Request specification clarification on sequence truncations

I would like to see a paragraph in the specification indicating how proteoform sequence truncations are to be specified. N-terminal truncations may be biological, as in the removal of the initial Met (perhaps with PTM), the cleavage of a signal peptide, or the action of a viral protease. The truncations may instead be related to sample treatment, such as a rare cutter like CNBr for middle-down proteomics, or to a "hot" ion source. I believe ProForma should specify how a proteoform sequence compares to the sequence described by the accession, for example by indicating the position of the first and last amino acids in the accession's sequence. Are amino acids preceding and succeeding the proteoform sequence expected to be included?

Consider how to support non-standard amino acids

A user asked me how to specify some custom amino acids:
Ahx (amino hexanoic acid): C6H13NO2 residue mass = 113.08406
lysyl biotin (aka biocytin): C16H28N4O4S residue mass = 354.17256

We found workarounds for this one, but this seems like a more generic issue that we will face, especially with synthetic peptides.

Ideas?
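A note on the numbers above: the formulas as quoted are those of the free molecules, while the masses are residue masses, that is, the free molecule minus one water lost on peptide-bond formation. The short Python sketch below reproduces both values from the formulas. The monoisotopic element masses are standard values typed in by hand, and the function names are hypothetical.

```python
import re

# Monoisotopic masses of the most abundant isotopes (standard values).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}


def formula_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C6H13NO2' (no isotopes, no brackets)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def residue_mass(free_formula):
    """Residue mass = free amino acid minus one water."""
    return formula_mass(free_formula) - formula_mass("H2O")


if __name__ == "__main__":
    print(f"Ahx residue:      {residue_mass('C6H13NO2'):.5f}")
    print(f"biocytin residue: {residue_mass('C16H28N4O4S'):.5f}")
```

The corresponding residue compositions are therefore C6H11NO for Ahx and C16H26N4O3S for biocytin, which is the form a formula-based tag would need if it is to carry the residue mass.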

Explicit support for global terminal modifications

Fixed modifications, such as carbamidomethylation of C, can be written as a global modification (section 4.6.2). For instance:

<[Carbamidomethyl]@C>ATPEILTCNSIGCLK

However, it is not explicitly stated whether global terminal modifications are supported, and if so, which "target tags" should be used. I would use this in the case of isobaric labeling modifications. For instance:

<[TMT6plex]@K,N-term>ATPEILTCNSIGCLK

Which would be equivalent to:

[TMT6plex]-ATPEILTCNSIGCLK[TMT6plex]

This would require a definition of the tags to be used for terminal modifications, for example N-term and C-term.
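The equivalence proposed above can be sketched as a small expansion routine in Python. This illustrates the request, not the specification: the `N-term` and `C-term` target tags are exactly the hypothetical tags suggested in this issue, the function name `expand_global_mod` is made up, and only a single global modification on an otherwise unmodified sequence is handled.

```python
import re

# <[Mod]@targets>SEQUENCE, with targets as a comma-separated list.
GLOBAL_MOD = re.compile(r"^<\[(?P<mod>[^\]]+)\]@(?P<targets>[^>]+)>(?P<seq>[A-Z]+)$")


def expand_global_mod(proforma):
    """Rewrite one global fixed modification as explicit per-site modifications."""
    match = GLOBAL_MOD.match(proforma)
    if match is None:
        raise ValueError("expected the form <[Mod]@targets>SEQUENCE")
    mod = match["mod"]
    targets = [target.strip() for target in match["targets"].split(",")]
    residues = {target for target in targets if len(target) == 1}

    # Attach the modification to every matching residue.
    expanded = "".join(
        f"{aa}[{mod}]" if aa in residues else aa for aa in match["seq"]
    )
    # Hypothetical terminal target tags, as proposed in this issue.
    if "N-term" in targets:
        expanded = f"[{mod}]-{expanded}"
    if "C-term" in targets:
        expanded = f"{expanded}-[{mod}]"
    return expanded


if __name__ == "__main__":
    print(expand_global_mod("<[TMT6plex]@K,N-term>ATPEILTCNSIGCLK"))
    print(expand_global_mod("<[Carbamidomethyl]@C>ATPEILTCNSIGCLK"))
```

For the TMT example this produces the explicit form given above, with the label on the N-terminus and on the single lysine.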