peff's People

Contributors

edeutsch, gerbenm, pabinz

peff's Issues

add Proteoform key for protein sequence to CV and spec

TODO: add Proteoform key for protein sequence ; allowed format:

\Proteoform=(identifier¦positions¦indexes to variants and PTMs¦proteoform description)

example : \Proteoform=(P12345-pf1¦1-200,210-320¦1-4,8¦phosphorylated, active form of the enzyme)
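A minimal sketch of how a parser might split one such item, assuming the four-field layout proposed above and the ¦ separator as written in the example. The names `parse_ranges` and `parse_proteoform` are hypothetical, not part of any existing PEFF library.

```python
def parse_ranges(text):
    """Turn '1-200,210-320' into [(1, 200), (210, 320)]; a bare '8' becomes (8, 8)."""
    ranges = []
    for part in text.split(","):
        start, _, end = part.strip().partition("-")
        ranges.append((int(start), int(end or start)))
    return ranges


def parse_proteoform(item, separator="¦"):
    """Split the text between the parentheses of one \\Proteoform item."""
    # maxsplit=3 keeps any further separators inside the free-text description.
    identifier, positions, indexes, description = item.split(separator, 3)
    return {
        "identifier": identifier,
        "positions": parse_ranges(positions),
        "indexes": parse_ranges(indexes),
        "description": description,
    }


example = "P12345-pf1¦1-200,210-320¦1-4,8¦phosphorylated, active form of the enzyme"
print(parse_proteoform(example))
```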

How to encode special characters in a list?

There will be a need to encode special characters in the data that will sometimes interfere with parsing. We should address this in the spec.

Proposed general rules:

  • A backslash anywhere in the description line MUST be escaped with the backslash character
  • Open and close parentheses or square brackets in the data MUST be escaped with the backslash character
  • A pipe character ( | ) in pipe-separated-value (PSV) fields MUST be escaped with the backslash character, but only MAY be escaped in ordinary fields

In a scalar value:
Wrong: \GName=EPB\41
Correct: \GName=EPB\\41
Wrong: \Comment=I like \crazy characters
Correct: \Comment=I like \\crazy characters
Wrong: \Comment=I like parentheses like this ()
Correct: \Comment=I like parentheses like this \(\)

In a list:
Wrong: \PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five))
Correct: \PName=(Nucleolar protein \\NOP5)(Nucleolar protein 5\(five\))
Wrong: \PName=(EPB\41)(PAR)53)(PAR[53])
Correct: \PName=(EPB\\41)(PAR\)53)(PAR\[53\])
Correct: \PName=(sp|O75530|EED_HUMAN) okay because \PName is not a PSV field
Correct: \PName=(sp\|O75530\|EED_HUMAN)
Wrong: \VariantSimple=(1|I)(21|K|dbSNP|COS[]MIC) if the optionalTag is "dbSNP|COS[]MIC"
Correct: \VariantSimple=(1|I)(21|K|dbSNP\|COS\[\]MIC)
Correct: \VariantSimple=(1|I)(21|K|[dbSNP][COS\[\]MIC])

Icky, but it must be dealt with. XML neatly avoids all these problems.

What do you think?
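To make the proposed rules concrete, here is a minimal sketch of a writer-side escape function and its reader-side inverse. It assumes the three rules above are adopted as written; `escape_value` and `unescape_value` are hypothetical names, not functions from an existing PEFF tool.

```python
# Characters that the proposed rules always require to be escaped in data.
ALWAYS_ESCAPED = "\\()[]"


def escape_value(text, psv_field=False):
    """Backslash-escape special characters in a raw data value.

    The pipe is escaped only when the value sits in a PSV field.
    """
    special = ALWAYS_ESCAPED + ("|" if psv_field else "")
    return "".join("\\" + ch if ch in special else ch for ch in text)


def unescape_value(text):
    """Reverse escape_value: drop each escaping backslash, keep the next character."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(escape_value("Nucleolar protein 5(five)"))
print(escape_value("dbSNP|COS[]MIC", psv_field=True))
```

Escaping and unescaping round-trip for any input, which is the property a validator would most likely want to test.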

What about duplicate keys?

Assess what happens/should happen with duplicate keys (e.g. two \VariantSimple in the same record). Is that a validation error? or just concatenate?
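The two candidate behaviours can be compared side by side. The sketch below is only an illustration: it assumes keys look like `\Name=` and that no value contains text of that shape, and the function names, the `policy` argument and the `db:ID1` identifier are all hypothetical.

```python
import re
from collections import defaultdict

# A key is a backslash, a name, and an equals sign (assumed shape).
KEY_PATTERN = re.compile(r"\\([A-Za-z][A-Za-z0-9]*)=")


def collect_keys(description):
    """Map each key name to the list of raw values found for it, in order."""
    matches = list(KEY_PATTERN.finditer(description))
    found = defaultdict(list)
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(description)
        found[current.group(1)].append(description[current.end():end].strip())
    return dict(found)


def resolve_duplicates(found, policy="error"):
    """Either reject repeated keys or concatenate their values."""
    resolved = {}
    for key, values in found.items():
        if len(values) > 1 and policy == "error":
            raise ValueError(f"duplicate key \\{key}")
        resolved[key] = "".join(values)
    return resolved


line = r">db:ID1 \PName=(Protein A) \VariantSimple=(1|H) \VariantSimple=(21|K)"
print(resolve_duplicates(collect_keys(line), policy="concatenate"))
```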

Add the multiple optionalTag square bracket formalism to the spec

Decision from Heidelberg:
spec doc: OptionalTag : [ ] is to be used to separate elements of a list in OptionalTag only. [ ] is also recommended for a list of one element
Example: \VariantSimple=(1¦H¦[1000Genomes][dbSNP])
Example: \VariantSimple=(1¦H¦[1000Genomes])

Add this properly to the spec doc and create examples and test with validator
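For the validator test, a reader for the OptionalTag field might look like the sketch below. It assumes tags contain no escaped brackets and that a bare value without brackets is still accepted as a one-element list (since brackets are only recommended in that case); `parse_optional_tags` is a hypothetical name.

```python
import re

BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def parse_optional_tags(field):
    """Return the list of tags held in an OptionalTag field."""
    tags = BRACKETED.findall(field)
    if tags:
        # Reject stray text before, between, or after the bracketed elements.
        if "".join(f"[{tag}]" for tag in tags) != field:
            raise ValueError(f"malformed OptionalTag: {field!r}")
        return tags
    # No brackets at all: treat a non-empty field as a single tag.
    return [field] if field else []


print(parse_optional_tags("[1000Genomes][dbSNP]"))
print(parse_optional_tags("[1000Genomes]"))
```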

PEFF should have an official validator

Reviewer points out that PEFF should have an official validator. On-line one at PeptideAtlas is not mentioned in spec.

TODO: Formalize the one at PeptideAtlas a bit more and reference it in the spec doc.
Or does someone else volunteer to write a validator?

Allow spaces between list items?

Current spec says: There MUST NOT be spaces between items

A comment on page 8 of the spec doc concerns list items: the Tiny Valid example contains \PName=(Nucleolar protein NOP5) (Nucleolar protein 5) (NOP58), i.e. with spaces between the items.

Apparently my validator is tolerant of spaces between them.

Shall we fix the examples and the validator to NOT allow these?
Or should we adjust the specification to permit spaces here and require parsers to tolerate them?

What do you think?
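Either choice is a small change in a parser. The sketch below shows both behaviours behind one flag, assuming items contain no nested or escaped parentheses; `split_list` and `allow_spaces` are hypothetical names.

```python
import re

ITEM = re.compile(r"\(([^()]*)\)")


def split_list(value, allow_spaces=False):
    """Split '(a)(b)' into ['a', 'b'], optionally tolerating spaces between items."""
    items = []
    pos = 0
    while pos < len(value):
        match = ITEM.match(value, pos)
        if match is None:
            raise ValueError(f"unexpected text at offset {pos}: {value[pos:]!r}")
        items.append(match.group(1))
        pos = match.end()
        if allow_spaces:
            # Skip any run of spaces that separates this item from the next.
            while pos < len(value) and value[pos] == " ":
                pos += 1
    return items


tiny_valid = "(Nucleolar protein NOP5) (Nucleolar protein 5) (NOP58)"
print(split_list(tiny_valid, allow_spaces=True))
```

With `allow_spaces=False` the same input raises `ValueError`, which is what a strict validator following the current spec text would report.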

accession number of "Processed" header key in sequence entries

Hello,
It seems that the PEFF CV was incorporated into the PSI MS CV.
Originally, in the PEFF 1.0 draft, the accession numbers for the Processed key were PEFF:1027 and PEFF:1028, one for the signal sequence and the other for the mature protein.
However, I cannot find the corresponding items in the PSI MS CV.
Are they now the terms [Term] id: PEFF:0001021 and [Term] id: PEFF:0001022, whose names are Signal and Transit?
Thanks.

Add a Comment term for sequence entry header level

TODO: Add a Comment term for sequence entry header level (format in instance doc: \Comment= free text , can have a list of comments : \Comment=(comment 1)(comment 2) )

update spec, update CV, update examples

Flesh out OptionalTag abbreviations

Decision in Heidelberg:
OptionalTag : if values are to be repeated along the PEFF document, it is allowed to define abbreviations in the Database Header section
Example:
OptionalTagDef=(1000Genomes¦A)(Ensembl¦B)
...

UP:P12345 \VariantSimple=(1¦H¦[A][B])(100¦Q¦[B])

Flesh this out, add to spec doc, create examples, verify that validator handles it
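As a starting point for the validator check, here is a minimal sketch of reading the definitions and expanding abbreviations in an entry. It assumes the `name¦abbreviation` order and the ¦ separator shown in the example, and that undefined tags are passed through unchanged; both function names are hypothetical.

```python
import re

SEPARATOR = "¦"  # as written in the example above; a real file may use "|"


def parse_tag_definitions(value):
    """Turn '(1000Genomes¦A)(Ensembl¦B)' into {'A': '1000Genomes', 'B': 'Ensembl'}."""
    definitions = {}
    for item in re.findall(r"\(([^()]*)\)", value):
        name, abbreviation = item.split(SEPARATOR)
        definitions[abbreviation] = name
    return definitions


def expand_tags(optional_tag, definitions):
    """Replace each bracketed abbreviation with its full name, if defined."""
    tags = re.findall(r"\[([^\[\]]*)\]", optional_tag)
    return [definitions.get(tag, tag) for tag in tags]


definitions = parse_tag_definitions("(1000Genomes¦A)(Ensembl¦B)")
print(expand_tags("[A][B]", definitions))
print(expand_tags("[B]", definitions))
```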

Address MIAPE and PEFF

On page 5 of the spec doc we say:

It is expected that the common sequence database format will be used to capture requirements specified in MIAPE MSI. However, the format does not enforce MIAPE compliance itself and MAY be valid and useful without being fully MIAPE compliant.

Reviewer says:

In other formats we had a mapping table between MIAPE field and XML element (http://www.psidev.info/mzidentml-conformance-miape). Is that possible here? Maybe instead of a table just one to half a dozen sentences are enough.

How shall we address this?

Allow empty lines?

The current spec disallows empty lines between entries. Is that too harsh? It's pretty easy for parsers to ignore them.

Reviewer wrote:
That is more strict than FASTA, where newlines are ignored until the next header token “>”. Is that really necessary? Of course, existing FASTA parsers will probably crash however, because of the description block

Shall we relent and allow empty lines, with parsers required to ignore them?

What do you think?
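To show how little code the tolerant behaviour needs, here is a minimal reader sketch. It assumes a simplified layout in which every entry starts with a `>` line and anything before the first entry (such as the file header block) is skipped; `read_entries` and `allow_empty_lines` are hypothetical names.

```python
def read_entries(lines, allow_empty_lines=True):
    """Return (header, sequence) pairs from an iterable of text lines."""
    entries = []
    header, sequence = None, []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            if not allow_empty_lines:
                raise ValueError(f"empty line at line {number}")
            continue  # tolerant mode: ignore the empty line
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(sequence)))
            header, sequence = line[1:], []
        elif header is not None:
            sequence.append(line.strip())
    if header is not None:
        entries.append((header, "".join(sequence)))
    return entries


sample = [">db:A \\PName=(x)\n", "MKT\n", "\n", "LLV\n", "\n", ">db:B\n", "AAA\n"]
print(read_entries(sample))
```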

Document the avoidance of customKeys

customKey can be used to create custom keywords for a PEFF document.
TODO: Document the following rule: if there is a key available for a bit of information in the CV, this CV term must be used.
Consider then adding a new term in the CV before using customKeys

Quick question about variants

In the title of each entry, some VariantComplex items have no amino acid or * character.
What does that mean?
e.g. VariantComplex=(298|298| ) in NX_W5XKT8-1

Thank you!

Regards,

Heeyoun Hwang
KBSI, Rep. of Korea

Remove the \DbUniqueId=nnn

TODO: Remove the \DbUniqueId=nnn from the example files (it's redundant with the >prefix:DbUniqueId identifier)

This affects all example files.
And neXtProt
Presumably we'll allow it as a deprecated form in the validator instead of a real error

Finish and fix all documentation related to signal peptides

signal peptides have two different styles:
\Processed=(1@40¦PEFF:0001021¦Signal)
\Signal=(1¦40)

In Heidelberg we decided to use the first one:
\Processed=(1@40¦PEFF:0001021¦Signal)

All examples, documentation, and the CV need to be unified. We need to make sure the neXtProt and UniProt exports use the agreed style.
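If the \Processed style is adopted everywhere, existing \Signal items could be rewritten mechanically. A minimal sketch, assuming single-item \Signal values and the notation exactly as written in the two examples above; `unify_signal` is a hypothetical helper, not part of any exporter.

```python
import re

# Matches the second style shown above, e.g. \Signal=(1¦40)
SIGNAL_ITEM = re.compile(r"\\Signal=\((\d+)¦(\d+)\)")


def unify_signal(description):
    """Rewrite \\Signal=(start¦end) as the \\Processed form with PEFF:0001021."""
    return SIGNAL_ITEM.sub(r"\\Processed=(\1@\2¦PEFF:0001021¦Signal)", description)


print(unify_signal(r"\Signal=(1¦40)"))
```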

How do we encode parentheses in items in list?

This is related to Issue #7
Reviewer asks on page 8 of spec: How do we encode parentheses in items in list?

Based on proposal in Issue #7 they MUST be escaped with a backslash.

In principle, a clever stack-based parser should be able to handle this:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five))

but then these diabolical examples would be very hard to deal with, although possible:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5(five)
or:
\PName=(Nucleolar protein \NOP5)(Nucleolar protein 5)five)

We need a very clear policy and then response to the reviewer.
Note that backslashes are not faithfully rendered in a GitHub issue response, so we need to move this discussion to a Google doc
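Under a strict "MUST be escaped" policy the parser no longer needs to guess at nesting. The sketch below assumes that policy: any unescaped parenthesis inside an item is an error, and a backslash makes the next character literal. `split_escaped_list` is a hypothetical name.

```python
def split_escaped_list(value):
    """Split '(a)(b\\(c\\))' into ['a', 'b(c)'], rejecting unescaped nesting."""
    items, current = [], []
    inside = False
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            if not inside:
                raise ValueError(f"escape outside an item at offset {i}")
            current.append(value[i + 1])  # keep the escaped character literally
            i += 2
            continue
        if ch == "(":
            if inside:
                raise ValueError(f"unescaped '(' inside an item at offset {i}")
            inside = True
        elif ch == ")":
            if not inside:
                raise ValueError(f"unmatched ')' at offset {i}")
            items.append("".join(current))
            current, inside = [], False
        elif inside:
            current.append(ch)
        else:
            raise ValueError(f"text outside an item at offset {i}")
        i += 1
    if inside:
        raise ValueError("unterminated item at end of value")
    return items


print(split_escaped_list(r"(Nucleolar protein \\NOP5)(Nucleolar protein 5\(five\))"))
```

With this policy both diabolical cases stop being ambiguous: each is valid only if the stray parenthesis is escaped, and is otherwise reported as an error at a specific offset.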

Ambiguity in "regular expression for PEFF description line"

I'm attempting to implement a more strict PEFF parser in Python, but after consulting the controlled vocabulary, I'm not sure I see how to type-check annotations which are defined by the regex "regular expression for PEFF description line"

[Term]
id: PEFF:1002001
name: regular expression for PEFF description line
def: "([0-9]+|[0-9]+|[a-zA-Z0-9]*)." [PSI:PEFF]
is_a: MS:1002479 ! regular expression

With syntax highlighting, the regex is:

/([0-9]+|[0-9]+|[a-zA-Z0-9]*)/

First, the expression translated into words seems partially redundant: "One or more digits between 0 and 9 OR One or more digits between 0 and 9 OR Zero or more alphanumeric characters". The first two alternatives are identical, which seems odd. The reduced regex would be

/([0-9]+|[a-zA-Z0-9]*)/

This reads as "One or more digits between 0 and 9 OR Zero or more alphanumeric characters". This seems to suggest that implicitly each element of a | separated tuple will be interpreted separately, and that the indices of the tuple are not governed by the CV. This information is described in the format specification's text.

Is this interpretation consistent with the intentions of the authors?
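A small check of that reading, written as a sketch: it assumes the trailing period in the CV `def:` string is not part of the pattern, and that each |-separated element is matched in full against the expression. `fields_match` is a hypothetical helper.

```python
import re

CV_REGEX = re.compile(r"([0-9]+|[0-9]+|[a-zA-Z0-9]*)")
REDUCED_REGEX = re.compile(r"([0-9]+|[a-zA-Z0-9]*)")


def fields_match(item, pattern, separator="|"):
    """True if every separated element of the item fully matches the pattern."""
    return all(pattern.fullmatch(field) for field in item.split(separator))


# The duplicated alternative changes nothing: both patterns agree on each sample.
for sample in ["298", "K", "", "dbSNP", "a-b"]:
    original = bool(CV_REGEX.fullmatch(sample))
    reduced = bool(REDUCED_REGEX.fullmatch(sample))
    print(repr(sample), original, reduced)

print(fields_match("21|K|dbSNP", CV_REGEX))
```

Because `[a-zA-Z0-9]*` already accepts every all-digit string and the empty string, the pattern on its own cannot enforce that a given tuple position is numeric; that constraint would have to come from the specification text.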

Address PEFF and mzIdentML and mzTab

On page 5, we say:

For searches performed using a PEFF file, the downstream result in mzIdentML will need to encode a reference to the PEFF file used.

Reviewer says:

Although actually an action point for mzIdentML: have you already checked, whether referencing a PEFF file as database is encodable? Or have you already added a CV term? If Yes, that could be a comment here. (same for mzTab)

How shall we address this?
