
proforma's Introduction

ProForma (Proteoform and Peptidoform Notation)

Protein and peptide sequences are usually represented as a string of amino acids, using the well-known one-letter code endorsed by IUPAC. However, there is still no clear consensus on how to represent ‘proteoforms’ and ‘peptidoforms’, meaning all possible variations of a protein or peptide sequence, including modifications, both artefactual and post-translational (PTMs). There are multiple ways of encoding mass modifications, and extended discussion has taken place to reach a consensus. The community therefore needs a standard notation for proteoforms and peptidoforms that can be embedded in the relevant PSI (and potentially other) file formats.

The PSI has developed a format called PEFF (PSI Extended FASTA Format) that can be used to represent proteoforms. Additionally, the Consortium for Top Down Proteomics (CTDP) developed a notation format called ProForma v1, which also aims to represent proteoforms.

This format specification represents the consensus for the standard representation of proteoforms and peptidoforms. This notation aims to support the main proteomics approaches, including bottom-up (focused on peptides/peptidoforms) and top-down (focused on proteins/proteoforms) approaches.

Use cases supported (with examples)

The ProForma notation is a string of characters that linearly represents one or more peptidoform/proteoform primary structures, with the possibility of linking peptidic chains together. It is not meant to represent secondary or tertiary structures.

  • EMEVEESPEK
PTMs using common ontologies or controlled vocabularies (e.g. Unimod, PSI-MOD, and RESID)
  • EM[Oxidation]EVEES[UNIMOD:21]PEK
  • EM[L-methionine sulfoxide]EVEES[MOD:00046]PEK
  • EM[R:L-methionine (R)-sulfoxide]EVEES[RESID:AA0037]PEK
Cross-linkers using the XL-MOD ontology
  • EMEVTK[XLMOD:02001#XL1]SESPEK[#XL1]
  • EVTSEKC[L-cystine (cross-link)#XL1]LEMSC[#XL1]EFD
Glycans using the GNO (Glycan Naming Ontology) ontology
  • YPVLN[GNO:G62765YT]VTMPN[GNO:G02815KT]NSNGKFDK
Arbitrary mass shifts and unknown mass gaps
  • EM[+15.9949]EVEES[-79.9663]PEK
  • RTAAX[+367.0537]WT
Elemental formulas and Glycan compositions
  • SEQUEN[Formula:C12H20O2]CE
  • SEQUEN[Glycan:HexNAc1Hex2]CE
Terminal and Labile Modifications
  • [iTRAQ4plex]-EMEVNESPEK-[Methyl]
  • {Glycan:Hex}EMEVNESPEK
Ambiguity of modification position (completely unlocalised, n possible sites, or a range of sites)
  • [Phospho]?EMEVTSESPEK
  • EMEVT[#g1]S[#g1]ES[Phospho#g1]PEK
  • PROT(EOSFORMS)[+19.0523]ISK
Global modifications (e.g. isotopic labeling or fixed protein modifications)
  • <13C>ATPEILTVNSIGQLK
  • <[S-carboxamidomethyl-L-cysteine]@C>ATPEILTCNSIGCLK
Additional user-supplied information and multi-valued tags
  • ELV[info:AnyString]IS
  • ELV[+11.9784|info:suspected frobinylation]IS
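
To illustrate how the bracketed tags in the examples above attach to the preceding residue, here is a minimal sketch in Python. It is not a ProForma parser: it only handles inline `[...]` tags on residues (including multi-valued tags separated by `|`) and ignores terminal, labile, global, ranged and cross-link notations. The function name `split_residues` is hypothetical and not part of any ProForma library.

```python
def split_residues(proforma: str) -> list[tuple[str, list[str]]]:
    """Split a simple ProForma string into (residue, [tags]) pairs.

    Sketch only: supports plain residues followed by optional [..] tags.
    """
    result: list[tuple[str, list[str]]] = []
    i = 0
    while i < len(proforma):
        ch = proforma[i]
        if ch == "[":
            # Find the matching closing bracket, allowing nested brackets
            # such as the isotope part of a Formula: tag.
            depth = 0
            j = i
            while j < len(proforma):
                if proforma[j] == "[":
                    depth += 1
                elif proforma[j] == "]":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if depth != 0:
                raise ValueError("unbalanced '[' at position %d" % i)
            if not result:
                raise ValueError("tag found before any residue")
            # A multi-valued tag uses '|' to separate its values.
            result[-1][1].extend(proforma[i + 1:j].split("|"))
            i = j + 1
        elif ch.isalpha():
            result.append((ch, []))
            i += 1
        else:
            raise ValueError("unsupported character %r at position %d" % (ch, i))
    return result


if __name__ == "__main__":
    for residue, tags in split_residues("EM[Oxidation]EVEES[UNIMOD:21]PEK"):
        print(residue, tags)
```

Anything outside this narrow subset (for example `[iTRAQ4plex]-...` or `<13C>...`) raises a `ValueError` rather than being silently misread.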

proforma's People

Contributors

javizca, mobiusklein, ralfg, rfellers


proforma's Issues

Specification clarification

Since #6 needs an update to the specification from my side, I would like a few things to be specified more clearly in the text of the specification. None of these require changes to the format itself, only to the text.

  1. The order of the different kinds of pre-sequence modifications. This is already specified in a comment on an open issue as follows: <GLOBAL_MOD>[UNKNOWN_POS]?{LABILE_MOD}[N_TERM]-PEPTIDE-[C_TERM]
  2. There are two different peptide sequence dividers: // for cross-linked peptides (4.2.3.2) and \\ for branched peptides (4.2.4). This is only apparent from the example in the branched section, so at a minimum it needs an explicit mention there. As a side note, I would be interested to hear the reasoning for why two different notations are needed and why they cannot be used interchangeably. This matters because of human error: it is easy to misremember and use the 'wrong' one.
  3. Chimeric spectra are underspecified with regard to how they tie in with the rest of the specification.
    • It is unspecified whether global and/or ambiguous modifications on one peptidoform also apply to the other peptidoforms. I assumed that each of these is only valid within its own peptidoform, which is logical in the MS context, where the precursor mass is generally defined. So this example would be invalid: [oxidation#g1]?A[#g1]+B[#g1]
    • It is unspecified how cross-linked peptides work in chimeric spectra. I assume nobody has a problem with this at present, but it may come up if DIA continues to spread to more MS subfields. My assumption is that + has the lowest precedence, meaning that A[#XL1]//B[#XL1]+C[#XL1]//D[#XL1] is a valid expression in the current specification and denotes a chimeric spectrum containing the peptidoform A linked to B and the peptidoform C linked to D. If there is consensus on this point, it would be nice to state it in the specification.
  4. Section 4.1 (page 7), on the amino acids, links to section 7.5.3 for the definition of the ambiguous amino acids; this should be section 7.4.3.
  5. In section 4.2 page 8 the link to XLMOD links to its old location in mzIdentML.
  6. In section 4.6.2 fixed protein modifications it is not defined if the amino acids are allowed to be lowercase (might also be of interest for the discussion in #6). (Answered in section 4: the whole specification is capitalisation insensitive)
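
The precedence assumed in point 3 can be sketched as follows. This is an illustration of the issue author's assumption (`+` binds loosest, then `//`), not behaviour confirmed by the specification. The helper names `split_top_level` and `split_chimeric` are hypothetical. Because `+` also occurs inside tags such as `[+15.9949]`, the split has to ignore anything inside brackets.

```python
def split_top_level(text: str, sep: str) -> list[str]:
    """Split text on sep, ignoring separators nested inside [], {}, <> or ()."""
    openers, closers = "[{<(", "]}>)"
    parts: list[str] = []
    depth = 0
    start = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in openers:
            depth += 1
        elif ch in closers:
            depth -= 1
        elif depth == 0 and text.startswith(sep, i):
            parts.append(text[start:i])
            i += len(sep)
            start = i
            continue
        i += 1
    parts.append(text[start:])
    return parts


def split_chimeric(text: str) -> list[list[str]]:
    """Assumed precedence: '+' separates peptidoforms, '//' separates linked chains."""
    return [split_top_level(part, "//") for part in split_top_level(text, "+")]


if __name__ == "__main__":
    print(split_chimeric("A[#XL1]//B[#XL1]+C[#XL1]//D[#XL1]"))
```

Under this reading the example from point 3 yields two peptidoforms, each made of two linked chains. Whether the `#XL1` label may be reused across the two peptidoforms is exactly the kind of question the specification would need to settle.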

Specify ionic species further

Ion charges can be represented in the current version of ProForma (section 7.1 - MS Extensions)

EMEVEESPEK/3[+2Na+,+H+]
EMEVEESPEK/-1[+e-]

Given examples in the specification
The specification does not define a notation for the ionic species. It does state, however, that fairly complex options are valid, such as the removal of an OH-. It does not state how more highly charged ionic species are written, for example highly charged metal ions like Fe[III].

EMEVEESPEK/7[+2Fe+3,+H+]
EMEVEESPEK/1[-OH-]
EMEVEESPEK/1[+N1H3+]

Potential uses for a higher complexity notation. Note on the last example: the 3+ here indicates that there are 3 H and that the charge of the whole species is +1

To me it seems logical to specify this as using the full modification Formula: notation (with e allowed as well) followed by the total number of charges for that species. But that implies that this field can have paired square brackets [], positive and negative numbers internally, and this might introduce some visual ambiguity on what the final number is doing.

Additionally, one example does not use a sign on the number of ions, while other examples use only a sign. Formalising the notation seems warranted to me.

EMEVEESPEK/-2[2I-]

Given example in the specification using only a number as the number of times an ionic species is present

\/([+-]?\d+)(\[((?:(?:[+-]?\d+)|(?:[+-]))((?:\[\d+[A-Z][A-Za-z]?\d+)|(?:[A-Z][A-Za-z]?\d*))+([+-]\d*),)*((?:(?:[+-]?\d+)|(?:[+-]))((?:\[\d+[A-Z][A-Za-z]?\d+)|(?:[A-Z][A-Za-z]?\d*))+([+-]\d*))\])?

Here is a beast of a regular expression for how this format could look (does not check for use of valid elements)
Here it is in regex101 with some example matches

"/" <number> ("[" <number_or_sign> <formula> <number_or_sign> ("," <number_or_sign> <formula> <number_or_sign>)* "]")?

This is the same in a somewhat more readable BNF-like notation

For a bit of background: I came upon this when implementing my own parser for ProForma. I have no serious need or use for any of the complexity here, but my code internally allows any chemical formula to be specified as an ionic species, so I was looking into this section to see how to export the internal peptides back to fully valid ProForma.
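
As a more readable companion to the regular expression, the proposed notation can be sketched as a small Python parser. This follows the proposal in this issue (count or bare sign, then a formula or `e`, then a charge sign with optional magnitude), not a notation defined by the specification. It deliberately leaves out isotope brackets inside formulas, and the name `parse_charge` is hypothetical.

```python
import re

# "/<charge>" optionally followed by "[<species>,<species>,...]" at the end.
SUFFIX = re.compile(r"/([+-]?\d+)(?:\[(.*)\])?$")
# Count or bare sign, formula (element symbols with optional counts, or "e"),
# then the charge sign of one species with an optional magnitude.
SPECIES = re.compile(r"([+-]?\d+|[+-])((?:[A-Z][a-z]?\d*)+|e)([+-])(\d*)$")


def parse_charge(proforma: str) -> tuple[int, list[tuple[int, str, int]]]:
    """Return (total charge, [(count, formula, charge per species), ...])."""
    match = SUFFIX.search(proforma)
    if match is None:
        raise ValueError("no charge suffix found")
    total = int(match.group(1))
    species = []
    if match.group(2) is not None:
        for item in match.group(2).split(","):
            parsed = SPECIES.match(item)
            if parsed is None:
                raise ValueError("cannot parse ionic species %r" % item)
            count_text, formula, sign, magnitude = parsed.groups()
            # A bare sign means one species: "+" is +1, "-" is -1.
            count = int(count_text + "1") if count_text in ("+", "-") else int(count_text)
            charge = int(sign + (magnitude or "1"))
            species.append((count, formula, charge))
    return total, species


if __name__ == "__main__":
    print(parse_charge("EMEVEESPEK/7[+2Fe+3,+H+]"))
```

With this reading, the sum of count times charge over all species equals the total charge for each example in this issue, which a stricter parser could use as a consistency check. The visual ambiguity mentioned above remains: a trailing digit directly after an element is read as part of the formula, so `+N1H3+` is three H with a charge of +1.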

Minor inconsistencies in spec

There are some minor inaccuracies in some of the examples in the specification draft 12:

  • page 8: EM[R: Methionine sulfone]EVEES[O-phospho-L-serine]PEK -> This term doesn't appear in RESID. Note the leading space, but even without that the name is incorrect. Probably it should be L-methionine sulfone (RESID:AA0251)?
  • page 9: EM[UNIMOD:15]EVEES[UNIMOD:56]PEK -> accession UNIMOD:15 does not exist. In case consistency with the previous examples is desired, UNIMOD:35 corresponds to Oxidation. Same for the invalid example with U:15 just underneath.
  • page 11: EVTSEKC[half-cystine]LEMSC[half-cystine]EFD -> half-cystine should be half cystine (no hyphen).
  • page 14: The mass of HexS is specified with only three decimals, whereas other masses in that list have four decimals. It's also not rounded correctly. Instead use 242.0096 as the mass with four decimals.

More conceptual questions:

  • Q: page 14: Parsing glycan compositions is somewhat non-trivial because some labels overlap. It would be easier if spaces between monosaccharides are used (split on space) or cardinality is always specified (split on [a-zA-Z]+\d+). Maybe this can be a bit more strongly recommended in section 4.2.8?
    A: Parsing is possible without enforcing spaces or cardinality by checking for only defined monosaccharides rather than any string.

  • Q: page 18: I'm a bit confused how parsers should interpret that global modifications are isotopes? The examples (13C, 15N, D) don't seem to be specified using a controlled vocabulary, whereas this is the case throughout the rest of the document. Is it that when no @ is used in the global modification part, as specified in section 4.6.2, it should always be considered an isotope instead?
    A: Yes, I currently interpret global modifications of the form INT* LETTER+ SIGNED_INT* as an isotope and global modifications of the form "[" mod "]@" (AA ",")* AA as global amino acid modifications (so square brackets and "@" sign).

  • Q: page 19: How should multiple global modifications on different amino acids be specified? I guess the following example, with a comma separating the global modifications within the angular brackets, would be in line with the spec, but this is not explicitly detailed: <[Carbamidomethyl]@C,[Oxidation]@M>MTPEILTCNSIGCLK.
    A: Multiple global modifications are each specified in their own block between angled brackets.
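
The answer to the first question (parse glycan compositions by checking only for defined monosaccharides) can be sketched as below. The list of labels is an assumption based on my reading of section 4.2.8 and should be checked against the specification. The function name `parse_glycan` is hypothetical. Overlapping labels are handled by trying the longest label first and backtracking when the rest of the string cannot be parsed.

```python
# Assumed monosaccharide labels; verify against the specification before use.
MONOSACCHARIDES = ["Hex", "HexNAc", "HexS", "HexP", "HexNAcS",
                   "dHex", "NeuAc", "NeuGc", "Pen", "Fuc"]
# Longest first, so "HexNAcS" is tried before "HexNAc" and "Hex".
_LONGEST_FIRST = sorted(MONOSACCHARIDES, key=len, reverse=True)


def parse_glycan(composition: str) -> dict[str, int]:
    """Parse a composition such as 'HexNAc1Hex2' into {'HexNAc': 1, 'Hex': 2}."""
    text = composition.strip()
    if not text:
        raise ValueError("empty glycan composition")

    def walk(pos: int):
        # Returns a list of (label, count) covering text[pos:], or None.
        if pos == len(text):
            return []
        for name in _LONGEST_FIRST:
            if not text.startswith(name, pos):
                continue
            end = pos + len(name)
            digits_end = end
            while digits_end < len(text) and text[digits_end].isdigit():
                digits_end += 1
            count = int(text[end:digits_end]) if digits_end > end else 1
            nxt = digits_end
            while nxt < len(text) and text[nxt].isspace():
                nxt += 1  # optional space between monosaccharides
            rest = walk(nxt)
            if rest is not None:
                return [(name, count)] + rest
        return None  # no label fits here; the caller tries a shorter label

    tokens = walk(0)
    if tokens is None:
        raise ValueError("cannot parse glycan composition %r" % composition)
    counts: dict[str, int] = {}
    for name, count in tokens:
        counts[name] = counts.get(name, 0) + count
    return counts


if __name__ == "__main__":
    print(parse_glycan("HexNAc1Hex2"))
```

The backtracking matters for inputs without cardinalities: `HexPen` first matches `HexP`, fails on the remaining `en`, and is then read as `Hex` followed by `Pen`.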

Request specification clarification on sequence truncations

I would like to see a paragraph in the specification indicating how proteoform sequence truncations are to be specified. N-terminal truncations may be biological, as in the removal of the initial Met (perhaps with PTM), the cleavage of a signal peptide, or the action of a viral protease. The truncations may instead be related to sample treatment, such as a rare cutter like CNBr for middle-down proteomics, or be due to a "hot" ion source. I believe ProForma should specify how a proteoform sequence compares to the sequence described by the accession, such as by indicating the position of the first and last amino acids in the accession's sequence. Are amino acids preceding and succeeding the proteoform sequence expected to be included?

Consider how to support non-standard amino acids

A user asked me how to specify some custom amino acids:
Ahx (amino hexanoic acid): C6H13NO2 residue mass = 113.08406
lysyl biotin (aka biocytin): C16H28N4O4S residue mass = 354.17256

We found workarounds for this one, but this seems like a more generic issue that we will face, especially with synthetic peptides.

Ideas?
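
For reference, the two masses quoted above can be reproduced with a short calculation. The formulas given appear to be those of the free amino acids, while the quoted values are residue masses, that is, the free molecule minus one water. The sketch below assumes monoisotopic element masses rounded to seven decimals, and the helper names `formula_mass` and `residue_mass` are hypothetical.

```python
import re

# Monoisotopic masses in unified atomic mass units, for the elements used here.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720710,
}


def formula_mass(formula: str) -> float:
    """Monoisotopic mass of a simple formula such as 'C6H13NO2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def residue_mass(free_formula: str) -> float:
    """Mass of the residue in a peptide chain: free amino acid minus H2O."""
    return formula_mass(free_formula) - formula_mass("H2O")


if __name__ == "__main__":
    print("Ahx       %.5f" % residue_mass("C6H13NO2"))
    print("biocytin  %.5f" % residue_mass("C16H28N4O4S"))
```

Both results agree with the values in the user's question to within rounding. How such a residue should then be written in ProForma is the open question of this issue and is not answered by the sketch.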

Explicit support for global terminal modifications

Fixed modifications, such as carbamidomethylation of C, can be written as a global modification (section 4.6.2). For instance:

<[Carbamidomethyl]@C>ATPEILTCNSIGCLK

However, it is not explicitly stated whether global terminal modifications are supported, and if so, which "target tags" should be used. I would use this in the case of isobaric labeling modifications. For instance:

<[TMT6plex]@K,N-term>ATPEILTCNSIGCLK

Which would be equivalent to:

[TMT6plex]-ATPEILTCNSIGCLK[TMT6plex]

This would require a definition of the tags to be used for terminal modifications, for example N-term and C-term.
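
The proposed equivalence can be made concrete with a small sketch that expands global modification blocks into explicit per-residue and terminal tags. The target tags `N-term` and `C-term` are the hypothetical ones suggested in this issue and are not defined by the specification. The function name `expand_global_mods` is also hypothetical. The sketch only handles `<[mod]@targets>` blocks in front of a bare sequence. Isotope blocks such as `<13C>` and sequences that already carry tags are rejected.

```python
import re

# One global block of the form <[mod]@target,target,...>
GLOBAL_BLOCK = re.compile(r"<\[([^\]]+)\]@([^>]+)>")


def expand_global_mods(proforma: str) -> str:
    """Rewrite global amino acid modifications as explicit tags (sketch only)."""
    rules = GLOBAL_BLOCK.findall(proforma)
    sequence = GLOBAL_BLOCK.sub("", proforma)
    if not sequence.isalpha():
        raise ValueError("only bare sequences are supported in this sketch")

    n_term = ""
    c_term = ""
    per_residue: dict[str, str] = {}
    for mod, targets in rules:
        for target in targets.split(","):
            if target == "N-term":      # hypothetical target tag
                n_term = "[%s]-" % mod
            elif target == "C-term":    # hypothetical target tag
                c_term = "-[%s]" % mod
            else:
                # A later rule for the same residue replaces an earlier one.
                per_residue[target] = mod

    body = "".join(
        aa + ("[%s]" % per_residue[aa] if aa in per_residue else "")
        for aa in sequence
    )
    return n_term + body + c_term


if __name__ == "__main__":
    print(expand_global_mods("<[TMT6plex]@K,N-term>ATPEILTCNSIGCLK"))
```

For the TMT example this produces the explicit form shown above, and several blocks in a row, each in its own angled brackets, are applied independently.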
