peff,hupo-psi

add Proteoform key for protein sequence to CV and spec

TODO: add Proteoform key for protein sequence ; allowed format:

\Proteoform=(identifier¦positions¦indexes to variants and PTMs¦proteoform description)

example : \Proteoform=(P12345-pf1¦1-200,210-320¦1-4,8¦phosphorylated, active form of the enzyme)
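A minimal sketch of how a reader might split one such item into its four fields, assuming the broken-bar separator shown in the example and assuming that positions and indexes are comma-separated numbers or `start-end` ranges. The function names `parse_ranges` and `parse_proteoform` are hypothetical, not part of any spec or existing validator.

```python
def parse_ranges(text):
    """Turn '1-200,210-320' or '1-4,8' into a list of (start, end) tuples."""
    ranges = []
    for part in text.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
        else:
            start = end = part  # a single number is treated as a one-element range
        ranges.append((int(start), int(end)))
    return ranges


def parse_proteoform(item, sep="¦"):
    """Split one '(id¦positions¦indexes¦description)' item (hypothetical helper)."""
    inner = item.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    # maxsplit=3 keeps any further separators inside the free-text description
    identifier, positions, indexes, description = inner.split(sep, 3)
    return {
        "identifier": identifier,
        "positions": parse_ranges(positions),
        "indexes": parse_ranges(indexes),
        "description": description,
    }


example = "(P12345-pf1¦1-200,210-320¦1-4,8¦phosphorylated, active form of the enzyme)"
proteoform = parse_proteoform(example)
```

The sketch does no escaping and does not check that the indexes point at existing variant or PTM entries; both would need to follow whatever the final spec text says.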

Uniprot Example

@edeutsch do you want me to generate a minimal UniProt example?

How to encode special characters in a list?

There will be a need to encode special characters in the data that will sometimes interfere with parsing. We should address this in the spec.

Proposed general rules:

A backslash anywhere in the description line MUST be escaped with the backslash character.
Open and close parentheses or square brackets in the data MUST be escaped with the backslash character.
A pipe character ( | ) in pipe-separated-value (PSV) fields MUST be escaped with the backslash character, but only MAY be escaped in ordinary fields.

In a scalar value:
Wrong: \GName=EPB\41
Correct: \GName=EPB\\41
Wrong: \Comment=I like \crazy characters
Correct: \Comment=I like \\crazy characters
Wrong: \Comment=I like parentheses like this ()
Correct: \Comment=I like parentheses like this \(\)

In a list:
Wrong: \PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five))
Correct: \PName=(Nucleolar protein \\NOP5)(Nucleolar protein 5\(five\))
Wrong: \PName=(EPB\41)(PAR)53)(PAR[53])
Correct: \PName=(EPB\\41)(PAR\)53)(PAR\[53\])
Correct: \PName=(sp|O75530|EED_HUMAN) okay because \PName is not a PSV field
Correct: \PName=(sp\|O75530\|EED_HUMAN)
Wrong: \VariantSimple=(1|I)(21|K|dbSNP|COS[]MIC) if the optionalTag is "dbSNP|COS[]MIC"
Correct: \VariantSimple=(1|I)(21|K|dbSNP\|COS\[\]MIC)
Correct: \VariantSimple=(1|I)(21|K|[dbSNP][COS\[\]MIC])

Icky, but it must be dealt with. XML neatly avoids all these problems.

What do you think?
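To make the proposed rules concrete, here is a rough sketch of a writer-side escaper and a reader-side list splitter that honours backslash escapes. It assumes the three proposed rules as stated (backslash, parentheses and square brackets always escaped; pipe escaped only in PSV fields). The names `escape_value` and `split_list` are hypothetical.

```python
ALWAYS_ESCAPED = set("\\()[]")


def escape_value(text, psv=False):
    """Backslash-escape special characters; '|' only when the field is PSV."""
    special = ALWAYS_ESCAPED | ({"|"} if psv else set())
    return "".join("\\" + ch if ch in special else ch for ch in text)


def split_list(value):
    """Split '(a)(b)' into items, removing one level of backslash escaping."""
    items, current, inside, i = [], [], False, 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            current.append(value[i + 1])  # keep the escaped character literally
            i += 2
            continue
        if ch == "(" and not inside:
            inside, current = True, []
        elif ch == ")" and inside:
            inside = False
            items.append("".join(current))
        else:
            current.append(ch)
        i += 1
    return items


names = ["EPB\\41", "PAR)53", "PAR[53]"]
encoded = "".join("(" + escape_value(name) + ")" for name in names)
decoded = split_list(encoded)  # round-trips back to the original names
```

Because every special character in the data is escaped, the reader never has to guess whether a parenthesis is a delimiter. This sketch is deliberately lenient about unescaped characters; a validator would presumably flag them instead.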

Add a term dbSourceId in the CV

TODO: Add a term dbSourceId in the CV (def: protein entry identifier in the source database)

Question: do we need to add the Uniprot FT keys as CV terms of the PEFF CV?

Question: do we need to add the UniProt FT keys as CV terms of the PEFF CV? (Use case: protein API, longer term.) Alternatively, we could use customKeys, but that would be a shame.

What about duplicate keys?

Assess what happens, and what should happen, with duplicate keys (e.g. two \VariantSimple in the same record). Is that a validation error, or should the values just be concatenated?
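Whichever policy is chosen, a validator first has to detect the duplicates. A minimal sketch, assuming keys appear as a backslash followed by word characters and an equals sign; the function name and the accession in the sample line are hypothetical, and a real implementation would also have to skip escaped backslashes inside values.

```python
import re
from collections import Counter

# A key looks like \Name= ; this simple pattern ignores escaping inside values.
KEY_PATTERN = re.compile(r"\\(\w+)=")


def duplicate_keys(description_line):
    """Return the sorted names of keys that occur more than once in one record."""
    counts = Counter(KEY_PATTERN.findall(description_line))
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical record with \VariantSimple given twice
line = r">nxp:NX_EXAMPLE \VariantSimple=(1|I) \PName=(x) \VariantSimple=(21|K)"
dups = duplicate_keys(line)
```

The caller can then either report `dups` as errors or merge the repeated values, depending on what the spec ends up saying.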

Add the multiple optionalTag square bracket formalism to the spec

Decision from Heidelberg:
Spec doc: square brackets [ ] are to be used to separate the elements of a list in OptionalTag only. Brackets are also recommended for a list of one element.
Example: \VariantSimple=(1¦H¦[1000Genomes][dbSNP])
Example: \VariantSimple=(1¦H¦[1000Genomes])

Add this properly to the spec doc and create examples and test with validator
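As a starting point for validator tests, a small sketch of reading the OptionalTag field under this formalism. Treating a bare, unbracketed value as a single tag is an assumption made here for leniency, not something decided above; the function name is hypothetical.

```python
import re

# One bracketed element: [ ... ] with no nested brackets
TAG_PATTERN = re.compile(r"\[([^\[\]]*)\]")


def parse_optional_tags(field):
    """Return the list of tags in an OptionalTag field."""
    tags = TAG_PATTERN.findall(field)
    if tags:
        return tags
    # Assumption: a bare value without brackets counts as one tag
    return [field] if field else []


two_tags = parse_optional_tags("[1000Genomes][dbSNP]")
one_tag = parse_optional_tags("[1000Genomes]")
```

Escaped brackets inside a tag are not handled here and would need the escaping rules to be settled first.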

PEFF should have an official validator

A reviewer points out that PEFF should have an official validator. The online one at PeptideAtlas is not mentioned in the spec.

TODO: Formalize the one at PeptideAtlas a bit more and reference it in the spec doc.
Or does someone else volunteer to write a validator?

Allow spaces between list items?

Current spec says: There MUST NOT be spaces between items

We have a comment on page 8 of the spec doc regarding list items. The Tiny Valid example has the following, i.e. with spaces: \PName=(Nucleolar protein NOP5) (Nucleolar protein 5) (NOP58)

Apparently my validator is tolerant of spaces between them.

Shall we fix the examples and the validator to NOT allow these?
Or should we adjust the specification to permit spaces here and require parsers to tolerate them?

What do you think?
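The two options differ by very little in code. A sketch with two regular expressions, one strict and one tolerant of whitespace between items; both ignore escaping and nested parentheses, and the names are hypothetical.

```python
import re

# Strict: items must be directly adjacent, as the current spec text requires
STRICT_LIST = re.compile(r"(\([^()]*\))+")
# Tolerant: optional whitespace is accepted between items
TOLERANT_LIST = re.compile(r"\([^()]*\)(\s*\([^()]*\))*")


def is_valid_list(value, allow_spaces=False):
    """Check a whole list value against the strict or the tolerant pattern."""
    pattern = TOLERANT_LIST if allow_spaces else STRICT_LIST
    return pattern.fullmatch(value) is not None


spaced = "(Nucleolar protein NOP5) (Nucleolar protein 5) (NOP58)"
strict_ok = is_valid_list(spaced)                       # rejected under the current spec
tolerant_ok = is_valid_list(spaced, allow_spaces=True)  # accepted if spaces are allowed
```

A middle path would be for the validator to accept the spaced form but emit a warning.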

add initiator methionine?

We do not currently have a term for initiator methionine. Is this an oversight or intentional?

Do we want to add this term and be specific about initiator methionines in the PEFF? I would guess yes, but what do you think?

Example in neXtProt:
https://www.nextprot.org/entry/P60484/sequence

accession number of "Processed" header key in sequence entries

Hello,
It seems that the PEFF CV was incorporated into the PSI MS CV.
Originally, in the PEFF 1.0 draft, the accession numbers of the Processed key were PEFF:1027 and PEFF:1028; one is for the signal sequence and the other for the mature protein.
However, I cannot find the corresponding items in the PSI MS CV.
Are they the terms [Term] id: PEFF:0001021 and [Term] id: PEFF:0001022, whose names are Signal and Transit?
Thanks.

Add a Comment term for sequence entry header level

TODO: Add a Comment term for sequence entry header level (format in instance doc: \Comment= free text , can have a list of comments : \Comment=(comment 1)(comment 2) )

update spec, update CV, update examples

Flesh out OptionalTag abbreviations

Decision in Heidelberg:
OptionalTag: if values are repeated throughout the PEFF document, abbreviations may be defined in the Database Header section.
Example:
OptionalTagDef=(1000Genomes¦A)(Ensembl¦B)
...

UP:P12345 \VariantSimple=(1¦H¦[A][B])(100¦Q¦[B])

Flesh this out, add to spec doc, create examples, verify that validator handles it
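A rough sketch of how a reader could expand the abbreviations, assuming the definition order shown in the example (full value first, abbreviation second) and the broken-bar separator. The function names are hypothetical, and escaping is ignored.

```python
import re


def parse_tag_defs(value, sep="¦"):
    """Turn '(1000Genomes¦A)(Ensembl¦B)' into {'A': '1000Genomes', 'B': 'Ensembl'}."""
    defs = {}
    for item in re.findall(r"\(([^()]*)\)", value):
        full, abbreviation = item.split(sep, 1)
        defs[abbreviation] = full
    return defs


def expand_tags(tags, defs):
    """Replace known abbreviations; leave unknown tags untouched."""
    return [defs.get(tag, tag) for tag in tags]


defs = parse_tag_defs("(1000Genomes¦A)(Ensembl¦B)")
expanded = expand_tags(["A", "B"], defs)
```

One open point this exposes: a tag that happens to equal an abbreviation (a source literally named `A`) would be expanded too, so the spec may want to say how such clashes are avoided.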

Create tiny example file that exercises all the weird escaped characters we could have

Once we resolve Issue #7, we need to create two diabolical files: one with all conceivable escaping applied correctly, and one with all conceivable weird characters applied incorrectly, for the validator to catch.

Address MIAPE and PEFF

On page 5 of the spec doc we say:

It is expected that the common sequence database format will be used to capture requirements specified in MIAPE MSI. However, the format does not enforce MIAPE compliance itself and MAY be valid and useful without being fully MIAPE compliant.

Reviewer says:

In other formats we had a mapping table between MIAPE field and XML element (http://www.psidev.info/mzidentml-conformance-miape). Is that possible here? Maybe instead of a table just one to half a dozen sentences are enough.

How shall we address this?

Reviewers request a diagram to explain annotation identifiers

on pg 14: Need a better description and maybe diagram of annotation identifier section

propose neXtProt to change the annotation of disulfide bridges

TODO: propose to neXtProt that they change the annotation of disulfide bridges from \ModRes=(2¦¦Disulfide) to \ModResPsi=(2¦MOD:00798¦half cystine)

Allow empty lines?

The current spec disallows empty lines between entries. Is that too harsh? It's pretty easy for parsers to ignore them.

Reviewer wrote:
That is more strict than FASTA, where newlines are ignored until the next header token “>”. Is that really necessary? Of course, existing FASTA parsers will probably crash however, because of the description block

Shall we recant and allow empty lines, with parsers required to ignore them?

What do you think?
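To illustrate how little it costs a parser to ignore empty lines, here is a minimal sketch of an entry reader. It assumes entries start with `>` and that file header lines start with `#`; the generator name `iter_entries` is hypothetical.

```python
def iter_entries(lines):
    """Yield (description_line, sequence) pairs, skipping blank lines."""
    header, sequence = None, []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' header-block lines are ignored
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(sequence)
            header, sequence = line[1:], []
        elif header is not None:
            sequence.append(line)
    if header is not None:
        yield header, "".join(sequence)


sample = [">a:1 \\PName=(x)", "MKT", "", "LLV", "", ">a:2", "AAA"]
entries = list(iter_entries(sample))
```

The blank-line check is a single condition, which supports the view that tolerance is cheap on the reading side.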

Document the avoidance of customKeys

customKey can be used to create custom keywords for a PEFF document.
TODO: Document the following rule: if there is a key available for a bit of information in the CV, this CV term must be used.
Consider then adding a new term in the CV before using customKeys

Google Doc of PEFF_SpecDoc_1.0.draft33

Edit access: https://docs.google.com/document/d/1OI3x5WfYlFzDYJrCib7_V8uBZU1jBPO6jYK4OgW3smI/edit?usp=sharing

Quick Question of Variants

In the header of some entries, a VariantComplex has no amino acid or * character.
What does that mean?
Example: VariantComplex=(298|298| ) in NX_W5XKT8-1

Thank you!

Regards,

Heeyoun Hwang
KBSI, Rep. of Korea

Remove the \DbUniqueId=nnn

TODO: Remove the \DbUniqueId=nnn from the example files (it's redundant with the >prefix:DbUniqueId identifier)

This affects all example files, and neXtProt.
Presumably we'll allow it as a deprecated form in the validator instead of treating it as a real error.
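A sketch of how the validator might treat the deprecated key. Reporting a warning when the value matches the identifier follows the suggestion above; reporting an error when it disagrees is an extra assumption made here. The function name and the example accession usage are illustrative only.

```python
import re

DB_UNIQUE_ID = re.compile(r"\\DbUniqueId=(\S+)")


def check_db_unique_id(header):
    """Return a list of (level, message) findings for one description line."""
    identifier = header.lstrip(">").split(None, 1)[0]
    _prefix, _, accession = identifier.partition(":")
    match = DB_UNIQUE_ID.search(header)
    if match is None:
        return []
    if match.group(1) == accession:
        return [("warning", "\\DbUniqueId is deprecated; it repeats the identifier")]
    # Assumption: a value that contradicts the identifier is a real error
    return [("error", "\\DbUniqueId disagrees with the >prefix:DbUniqueId identifier")]


findings = check_db_unique_id(r">nxp:NX_P60484 \DbUniqueId=NX_P60484 \PName=(x)")
```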

Write a formal response to the Steering Group review

Once all or most of the previous issues are resolved, write the formal response to the Steering Group review.

Finish and fix all documentation related to signal peptides

Signal peptides currently appear in two different styles:
\Processed=(1@40¦PEFF:0001021¦Signal)
\Signal=(1¦40)

In Heidelberg we decided to use the first one:
\Processed=(1@40¦PEFF:0001021¦Signal)

All examples, documentation, and the CV need to be unified. We need to make sure the neXtProt and UniProt exports use the decided style.

How do we encode parentheses in items in list?

This is related to Issue #7
Reviewer asks on page 8 of spec: How do we encode parentheses in items in list?

Based on the proposal in Issue #7, they MUST be escaped with a backslash.

In principle, a clever stack-based parser should be able to handle this:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five))

but then these diabolical examples would be very hard to deal with, although possible:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five)
or:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5)five)
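For reference, a depth-counting sketch of such a parser (a counter standing in for the stack, since only one bracket type is tracked). It accepts the balanced case and rejects both diabolical ones; it does not try to recover from them. The function name is hypothetical and backslashes are passed through untouched.

```python
def split_balanced(value):
    """Split a list value into items, allowing balanced nested parentheses."""
    items, depth, start = [], 0, None
    for i, ch in enumerate(value):
        if ch == "(":
            if depth == 0:
                start = i + 1  # first character of a new top-level item
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError(f"unmatched ')' at offset {i}")
            depth -= 1
            if depth == 0:
                items.append(value[start:i])
    if depth != 0:
        raise ValueError("unclosed '('")
    return items


balanced = split_balanced(r"(Nucleolar protein \NOP5)(Nucleolar protein 5(five))")
```

The limitation is visible in the second diabolical case: by the time the stray `)` is seen, the item before it has already been closed, so the parser can report the problem but cannot know what the writer meant. Mandatory escaping removes that ambiguity.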

We need a very clear policy and then a response to the reviewer.
Note that backslashes are not faithfully rendered in a GitHub issue response, so we need to move this discussion to a Google doc.

allow GeneralComment in the database header block

TODO: Change the spec doc: allow # GeneralComment in the database header block

Ambiguity in "regular expression for PEFF description line"

I'm attempting to implement a stricter PEFF parser in Python, but after consulting the controlled vocabulary, I don't see how to type-check annotations that are defined by the regex "regular expression for PEFF description line":

[Term]
id: PEFF:1002001
name: regular expression for PEFF description line
def: "([0-9]+|[0-9]+|[a-zA-Z0-9]*)." [PSI:PEFF]
is_a: MS:1002479 ! regular expression

With syntax highlighting, the regex is:

/([0-9]+|[0-9]+|[a-zA-Z0-9]*)/

First, the expression translated into words seems partially redundant: "One or more digits between 0 and 9 OR One or more digits between 0 and 9 OR Zero or more alphanumeric characters". The first two alternatives are identical, which seems odd. The reduced regex would be

/([0-9]+|[a-zA-Z0-9]*)/

This reads as "One or more digits between 0 and 9 OR Zero or more alphanumeric characters". This seems to suggest that implicitly each element of a | separated tuple will be interpreted separately, and that the indices of the tuple are not governed by the CV. This information is described in the format specification's text.

Is this interpretation consistent with the intentions of the authors?
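The redundancy can be checked quickly with Python's `re` module. This sketch compares the CV pattern and the reduced pattern on a handful of sample strings chosen here for illustration; it says nothing about what the authors intended.

```python
import re

ORIGINAL = re.compile(r"([0-9]+|[0-9]+|[a-zA-Z0-9]*)")
REDUCED = re.compile(r"([0-9]+|[a-zA-Z0-9]*)")


def accepts(pattern, text):
    """True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


samples = ["", "298", "K", "dbSNP", "1|I", "half cystine"]
results = {text: (accepts(ORIGINAL, text), accepts(REDUCED, text)) for text in samples}
same_on_all = all(original == reduced for original, reduced in results.values())
```

Two observations follow. Since every string of digits is also a string of alphanumeric characters, the pattern accepts exactly what `[a-zA-Z0-9]*` accepts. And because an unescaped `|` is alternation, neither pattern matches a whole tuple such as `1|I`; if a literal pipe separator was intended, it would have to be written as `\|`, though that is only a guess at the intent.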

Address PEFF and mzIdentML and mzTab

On page 5, we say:

For searches performed using a PEFF file, the downstream result in mzIdentML will need to encode a reference to the PEFF file used.

Reviewer says:

Although actually an action point for mzIdentML: have you already checked, whether referencing a PEFF file as database is encodable? Or have you already added a CV term? If Yes, that could be a comment here. (same for mzTab)

How shall we address this?
