hupo-psi / peff
Repository for the PSI Extended FASTA Format
License: Apache License 2.0
TODO: add Proteoform key for protein sequence ; allowed format:
\Proteoform=(identifier¦positions¦indexes to variants and PTMs¦proteoform description)
example : \Proteoform=(P12345-pf1¦1-200,210-320¦1-4,8¦phosphorylated, active form of the enzyme)
@edeutsch do you want me to generate a Uniprot minimal example?
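As a rough illustration of the proposed layout, here is a minimal Python sketch that parses one \Proteoform item. The names `Proteoform`, `parse_proteoform`, and `_expand` are hypothetical, and the reading of the third field as ranges of annotation indexes (so that 1-4,8 means 1, 2, 3, 4, 8) is an assumption based only on the example above. No escaping rules are applied.

```python
from __future__ import annotations

from dataclasses import dataclass

SEP = "\u00a6"  # the broken-bar separator used in the proposal


@dataclass
class Proteoform:
    identifier: str
    ranges: list[tuple[int, int]]   # sequence position ranges, 1-based inclusive
    annotation_indexes: list[int]   # indexes into variants/PTMs (assumed meaning)
    description: str


def _expand(spec: str) -> list[int]:
    """Expand '1-4,8' into [1, 2, 3, 4, 8]."""
    out: list[int] = []
    for part in (p.strip() for p in spec.split(",")):
        if not part:
            continue
        lo, dash, hi = part.partition("-")
        out.extend(range(int(lo), int(hi) + 1) if dash else [int(lo)])
    return out


def parse_proteoform(item: str) -> Proteoform:
    """Parse one '(identifier|positions|indexes|description)' item."""
    body = item.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("item must be wrapped in parentheses")
    # Split at most three times so the free-text description stays whole.
    fields = body[1:-1].split(SEP, 3)
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    ident, positions, indexes, desc = fields
    ranges = []
    for part in positions.split(","):
        lo, _, hi = part.strip().partition("-")
        ranges.append((int(lo), int(hi or lo)))
    return Proteoform(ident, ranges, _expand(indexes), desc)
```

Splitting with a maximum of three separators keeps commas and any later separators inside the description intact, which seems the least surprising behaviour for a free-text field.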
There will be a need to encode special characters in the data that will sometimes interfere with parsing. We should address this in the spec.
Proposed general rules:
In a scalar value:
Wrong: \GName=EPB\41
Correct: \GName=EPB\41
Wrong: \Comment=I like \crazy characters
Correct: \Comment=I like \crazy characters
Wrong: \Comment=I like parentheses like this ()
Correct: \Comment=I like parentheses like this ()
In a list:
Wrong: \PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five))
Correct: \PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five))
Wrong: \PName=(EPB\41)(PAR)53)(PAR[53])
Correct: \PName=(EPB\41)(PAR)53)(PAR[53])
Correct: \PName=(sp|O75530|EED_HUMAN) okay because \PName is not a PSV field
Correct: \PName=(sp|O75530|EED_HUMAN)
Wrong: \VariantSimple=(1|I)(21|K|dbSNP|COS[]MIC) if the optionalTag is "dbSNP|COS[]MIC"
Correct: \VariantSimple=(1|I)(21|K|dbSNP|COS[]MIC)
Correct: \VariantSimple=(1|I)(21|K|[dbSNP][COS[]MIC])
icky. But it must be dealt with. XML neatly avoids all these problems.
What do you think?
TODO: Add a term dbSourceId in the CV (def: protein entry identifier in the source database)
Question: do we need to add the UniProt FT keys as CV terms of the PEFF CV (use case: protein API, longer term)? Alternatively, we could use customKeys, but that would be a shame.
Assess what happens/should happen with duplicate keys (e.g. two \VariantSimple in the same record). Is that a validation error? or just concatenate?
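To make the two options concrete, here is a small Python sketch of how a parser might either reject or concatenate duplicate keys in one description line. The function name `collect_keys` and the `on_duplicate` parameter are hypothetical, and the key pattern (a backslash, a letter, then alphanumerics, then `=`) is an assumption rather than something taken from the spec.

```python
import re

# Assumed key syntax: backslash, a letter, more alphanumerics, then '='.
_KEY = re.compile(r"\\([A-Za-z][A-Za-z0-9]*)=")


def collect_keys(description: str, on_duplicate: str = "error") -> dict:
    """Map each key to its raw value; handle repeats per `on_duplicate`.

    on_duplicate="error"       -> raise ValueError on a repeated key
    on_duplicate="concatenate" -> append the later value to the earlier one
    """
    matches = list(_KEY.finditer(description))
    result = {}
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(description)
        key = m.group(1)
        value = description[m.end():end].strip()
        if key in result:
            if on_duplicate == "error":
                raise ValueError("duplicate key: \\" + key)
            result[key] += value
        else:
            result[key] = value
    return result
```

With the concatenate policy, two \VariantSimple lists in one record simply become one longer list, which is cheap for parsers but hides what may be an export bug; the error policy makes the validator stricter than the readers.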
Decision from Heidelberg:
spec doc: OptionalTag : [ ] is to be used to separate elements of a list in OptionalTag only. [ ] is also recommended for a list of one element
Example: \VariantSimple=(1¦H¦[1000Genomes][dbSNP])
Example: \VariantSimple=(1¦H¦[1000Genomes])
Add this properly to the spec doc and create examples and test with validator
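A minimal Python sketch of how the bracketed OptionalTag list could be read is shown here. The function name `parse_optional_tag` is hypothetical, and the sketch assumes that a bare value without brackets is still accepted as a one-element list, since brackets are only recommended in that case. Nested or escaped brackets are not handled.

```python
import re

_TAG = re.compile(r"\[([^\[\]]*)\]")


def parse_optional_tag(field: str) -> list:
    """Split an OptionalTag field such as '[1000Genomes][dbSNP]' into elements."""
    field = field.strip()
    if not field:
        return []
    if "[" not in field and "]" not in field:
        # Brackets are only recommended for a single element, so accept a bare value.
        return [field]
    tags = _TAG.findall(field)
    # Every character must belong to a bracketed element, otherwise reject.
    if "".join("[%s]" % t for t in tags) != field:
        raise ValueError("malformed OptionalTag: %r" % field)
    return tags
```

For the two examples above, this would yield a two-element list and a one-element list respectively; a validator test file could pair those with deliberately malformed fields such as an unclosed bracket.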
Reviewer points out that PEFF should have an official validator. On-line one at PeptideAtlas is not mentioned in spec.
TODO: Formalize the one at PeptideAtlas a bit more and reference it in the spec doc.
Or does someone else volunteer to write a validator?
Current spec says: There MUST NOT be spaces between items
A comment on page 8 of the spec doc, regarding list items, points out that the Tiny Valid example has the following: \PName=(Nucleolar protein NOP5) (Nucleolar protein 5) (NOP58), i.e. with spaces.
Apparently my validator is tolerant of spaces between them.
Shall we fix the examples and the validator to NOT allow these?
Or should we adjust the specification to permit spaces here and require parsers to tolerate them?
What do you think?
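Either choice is easy to implement. The sketch below (Python, with a hypothetical `split_list` function and `strict` flag) shows one list splitter that can enforce the current MUST NOT or tolerate whitespace between items. It assumes items contain no nested or escaped parentheses.

```python
import re

_ITEM = re.compile(r"\(([^()]*)\)")


def split_list(value: str, strict: bool = True) -> list:
    """Split '(a)(b)(c)' into ['a', 'b', 'c'].

    strict=True  -> whitespace between items is an error (current spec wording)
    strict=False -> whitespace between items is silently skipped
    """
    items = []
    pos = 0
    for m in _ITEM.finditer(value):
        gap = value[pos:m.start()]
        if gap.strip():
            raise ValueError("unexpected text between items: %r" % gap)
        if gap and strict and items:
            raise ValueError("whitespace between items at offset %d" % pos)
        items.append(m.group(1))
        pos = m.end()
    if value[pos:].strip():
        raise ValueError("unexpected trailing text: %r" % value[pos:])
    return items
```

If the specification stays strict, the tolerant mode could still be useful in a validator to report the problem as a warning while continuing to parse the rest of the entry.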
We do not currently have a term for initiator methionine. Is this an oversight or intentional?
Do we want to add this term and be specific about initiator methionines in the PEFF? I would guess yes, but what do you think?
Example in neXtProt:
https://www.nextprot.org/entry/P60484/sequence
Hello,
It seems that the PEFF CV was incorporated into PSI MS CV.
Originally, in the PEFF 1.0 draft, the accession numbers for the Processed key were PEFF:1027 and PEFF:1028, one for the signal sequence and the other for the mature protein.
However, I cannot find the corresponding items in the PSI MS CV.
Are they the terms [Term] id: PEFF:0001021 and [Term] id: PEFF:0001022, whose names are Signal and Transit?
Thanks.
TODO: Add a Comment term for the sequence entry header level (format in instance doc: \Comment=free text; a list of comments is also possible: \Comment=(comment 1)(comment 2))
update spec, update CV, update examples
Decision in Heidelberg:
OptionalTag : if values are to be repeated along the PEFF document, it is allowed to define abbreviations in the Database Header section
Example:
OptionalTagDef=(1000Genomes¦A)(Ensembl¦B)
...
UP:P12345 \VariantSimple=(1¦H¦[A][B])(100¦Q¦[B])
Flesh this out, add to spec doc, create examples, verify that validator handles it
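A possible shape for the abbreviation lookup is sketched below in Python. The function names `parse_tag_defs` and `expand_optional_tag` are hypothetical. The sketch assumes each definition is written as (full name¦abbreviation), as the example suggests, and that tags without a definition are passed through unchanged.

```python
import re

SEP = "\u00a6"  # broken-bar field separator
_ITEM = re.compile(r"\(([^()]*)\)")
_TAG = re.compile(r"\[([^\[\]]*)\]")


def parse_tag_defs(value: str) -> dict:
    """Turn the OptionalTagDef value into {abbreviation: full name}."""
    defs = {}
    for item in _ITEM.findall(value):
        full, sep, abbrev = item.partition(SEP)
        if not sep or not full or not abbrev:
            raise ValueError("malformed definition: (%s)" % item)
        if abbrev in defs:
            raise ValueError("abbreviation defined twice: %s" % abbrev)
        defs[abbrev] = full
    return defs


def expand_optional_tag(field: str, defs: dict) -> list:
    """Expand '[A][B]' to full names; unknown tags are kept unchanged."""
    return [defs.get(tag, tag) for tag in _TAG.findall(field)]
```

One open design question this surfaces is whether an undefined abbreviation should be an error: passing it through, as done here, cannot distinguish a missing definition from a literal tag that happens to be short.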
Create a tiny example file that exercises all the weird escaped characters we could have
Once we resolve Issue #7
Then we need to create two diabolical files: one with all conceivable escaping applied correctly,
and one with all conceivable weird characters applied incorrectly, for the validator to catch
On page 5 of the spec doc we say:
It is expected that the common sequence database format will be used to capture requirements specified in MIAPE MSI. However, the format does not enforce MIAPE compliance itself and MAY be valid and useful without being fully MIAPE compliant.
Reviewer says:
In other formats we had a mapping table between MIAPE field and XML element (http://www.psidev.info/mzidentml-conformance-miape). Is that possible here? Maybe instead of a table just one to half a dozen sentences are enough.
How shall we address this?
On page 14: we need a better description, and maybe a diagram, of the annotation identifier section
TODO: propose neXtProt to change the annotation of disulfide bridges : from \ModRes=(2¦¦Disulfide) to \ModResPsi=(2¦MOD:00798¦half cystine)
The current spec disallows empty lines between entries. Is that too harsh? It's pretty easy for parsers to ignore them.
Reviewer wrote:
That is more strict than FASTA, where newlines are ignored until the next header token “>”. Is that really necessary? Of course, existing FASTA parsers will probably crash however, because of the description block
Shall we recant and allow empty lines, with parsers required to ignore them?
What do you think?
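To show how little it costs a parser to skip empty lines, here is a minimal Python reader sketch. The function name `read_entries` and the `allow_blank` flag are hypothetical, and treating lines that start with `#` as file header lines to be skipped is an assumption about the file layout, not a full implementation of the header block.

```python
from typing import Iterable, Iterator, Tuple


def read_entries(lines: Iterable[str], allow_blank: bool = True) -> Iterator[Tuple[str, str]]:
    """Yield (description line without '>', sequence) pairs.

    allow_blank=True  -> empty lines are ignored, as in plain FASTA
    allow_blank=False -> an empty line raises ValueError (current spec wording)
    """
    header = None
    seq = []
    for n, raw in enumerate(lines, 1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            if not allow_blank:
                raise ValueError("line %d: empty line not allowed" % n)
            continue
        if line.startswith("#"):
            continue  # assumed: file header block line, not parsed here
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            if header is None:
                raise ValueError("line %d: sequence before first entry" % n)
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)
```

The same reader serves both policies, so the decision is really about what the validator should report rather than about parser complexity.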
customKey can be used to create custom keywords for a PEFF document.
TODO: Document the following rule: if there is a key available for a bit of information in the CV, this CV term must be used.
Consider then adding a new term in the CV before using customKeys
In the description line of some entries, a VariantComplex item has no amino acid or * character.
What does that mean?
e.g. VariantComplex=(298|298| ) in NX_W5XKT8-1
Thank you!
Regards,
Heeyoun Hwang
KBSI, Rep. of Korea
TODO: Remove the \DbUniqueId=nnn from the example files (it's redundant with the >prefix:DbUniqueId identifier)
This affects all example files.
And neXtProt
Presumably we'll allow it as a deprecated form in the validator instead of a real error
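A validator could treat the redundant key as a deprecation warning along the following lines. This is only a Python sketch: the function name `check_db_unique_id`, the message wording, and the assumed `>prefix:identifier` pattern are illustrative rather than taken from the existing validator.

```python
import re

# Assumed header shape: '>' prefix ':' identifier, followed by whitespace or end.
_ID = re.compile(r"^>(?P<prefix>[^:\s]+):(?P<id>\S+)")
_DBUID = re.compile(r"\\DbUniqueId=(\S+)")


def check_db_unique_id(header: str) -> list:
    """Return validator messages about a redundant \\DbUniqueId key."""
    m = _ID.match(header)
    if not m:
        return ["error: description line does not start with >prefix:identifier"]
    messages = []
    found = _DBUID.search(header)
    if found:
        messages.append("warning: \\DbUniqueId is deprecated; the identifier after '>' is sufficient")
        if found.group(1) != m.group("id"):
            messages.append("error: \\DbUniqueId does not match the identifier after '>'")
    return messages
```

Keeping the mismatch case as a real error seems reasonable even during the deprecation period, since two conflicting identifiers in one entry are ambiguous for any consumer.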
Once all or most of the previous issues are resolved, write the formal response to the Steering Group review
signal peptides have two different styles:
\Processed=(1@40¦PEFF:0001021¦Signal)
\Signal=(1¦40)
In Heidelberg we decided to use the first one:
\Processed=(1@40¦PEFF:0001021¦Signal)
All examples, documentation, and the CV need to be unified. We need to make sure the neXtProt and UniProt exports are using the decided style
This is related to Issue #7
Reviewer asks on page 8 of spec: How do we encode parentheses in items in list?
Based on proposal in Issue #7 they MUST be escaped with a backslash.
In principle, a clever stack-based parser should be able to handle this:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five))
but then these diabolical examples would be very hard to deal with, although possible:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five)
or:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5)five)
We need a very clear policy and then response to the reviewer.
Note that backslashes are not faithfully rendered in a GitHub issue response, so we need to move this discussion to a Google doc
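To make the stack-based idea concrete, here is a minimal Python sketch that tracks nesting depth instead of relying on escapes. The function name `split_balanced` is hypothetical. It deliberately does not interpret backslashes, since the escaping policy is still undecided, so it accepts the balanced case and rejects both unbalanced ones rather than guessing what the author meant.

```python
def split_balanced(value: str) -> list:
    """Split a list value into top-level items, allowing balanced nested parentheses."""
    items = []
    depth = 0
    start = 0
    for i, ch in enumerate(value):
        if ch == "(":
            if depth == 0:
                start = i + 1  # first character inside a new top-level item
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unmatched ')' at offset %d" % i)
            depth -= 1
            if depth == 0:
                items.append(value[start:i])
        elif depth == 0 and not ch.isspace():
            raise ValueError("text outside parentheses at offset %d" % i)
    if depth:
        raise ValueError("unclosed '(' in list value")
    return items
```

The limitation is the point of the discussion: a depth counter can only say that the two unbalanced examples are malformed, it cannot recover the intended items, which is why a mandatory escaping rule would still be needed for values that legitimately contain a lone parenthesis.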
TODO: Change the spec doc: allow # GeneralComment in the database header block
I'm attempting to implement a more strict PEFF parser in Python, but after consulting the controlled vocabulary, I'm not sure I see how to type-check annotations which are defined by the regex "regular expression for PEFF description line"
[Term]
id: PEFF:1002001
name: regular expression for PEFF description line
def: "([0-9]+|[0-9]+|[a-zA-Z0-9]*)." [PSI:PEFF]
is_a: MS:1002479 ! regular expression
With syntax highlighting, the regex is:
/([0-9]+|[0-9]+|[a-zA-Z0-9]*)/
First, the expression translated into words seems partially redundant "One or more digits between 0 and 9 OR One or more digits between 0 and 9 OR Zero or more alphanumeric characters". The first two alternatives are identical, which seems odd. The reduced regex would be
/([0-9]+|[a-zA-Z0-9]*)/
This reads as "One or more digits between 0 and 9 OR Zero or more alphanumeric characters". This seems to suggest that implicitly each element of a | separated tuple will be interpreted separately, and that the indices of the tuple are not governed by the CV. This information is described in the format specification's text.
Is this interpretation consistent with the intentions of the authors?
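The redundancy can be checked directly with Python's `re` module, as in the sketch below. The `ORIGINAL` and `REDUCED` patterns are the two expressions quoted above. The `ESCAPED` pattern is purely a guess at one alternative reading, in which the pipes were meant as literal separators and their backslashes were lost somewhere along the way; nothing in the CV as quoted confirms that.

```python
import re

ORIGINAL = re.compile(r"([0-9]+|[0-9]+|[a-zA-Z0-9]*)")
REDUCED = re.compile(r"([0-9]+|[a-zA-Z0-9]*)")

# Assumption only: what the pattern would be if the pipes were literal separators.
ESCAPED = re.compile(r"[0-9]+\|[0-9]+\|[a-zA-Z0-9]*")


def same_verdict(text: str) -> bool:
    """True when both quoted patterns agree on whether `text` matches in full."""
    return bool(ORIGINAL.fullmatch(text)) == bool(REDUCED.fullmatch(text))


def elements_match(item: str) -> bool:
    """Apply the reduced pattern to each '|'-separated element on its own."""
    return all(REDUCED.fullmatch(part) for part in item.split("|"))


samples = ["298", "A", "abc123", "", "298|298|A", "1 2", "*"]
assert all(same_verdict(s) for s in samples)
```

Under the element-wise reading, `elements_match("298|298|A")` holds while the whole tuple does not match either quoted pattern; under the hypothetical escaped reading, the whole tuple would match as a single string.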
On page 5, we say:
For searches performed using a PEFF file, the downstream result in mzIdentML will need to encode a reference to the PEFF file used.
Reviewer says:
Although actually an action point for mzIdentML: have you already checked, whether referencing a PEFF file as database is encodable? Or have you already added a CV term? If Yes, that could be a comment here. (same for mzTab)
How shall we address this?
Somewhat related to Issue #12
TODO: add the Uniprot processed keywords in the CV
Mention ProForma somewhere, and also add the citation for the paper at the bottom
DocProc says that we should have a reference implementation.
What shall we name as the reference implementation?
Web page is out of date and not very prominently mentioned in spec doc.
Fix.
http://psidev.info/peff