hupo-psi / mzidentml


Repository for mzIdentML and the corresponding examples

HTML 46.92% XSLT 0.57% Batchfile 0.07% Shell 0.01% Java 43.90% Rich Text Format 8.54%

mzidentml's Issues

Cross-linking - protein interaction evidence

I have closed down the long, convoluted thread on various CV items: #2.

I have made a proposal for how I think we should represent protein interaction evidence in mzid; see the attached PDF.
xl_protein_interaction.pdf

Cross-linking CV

The following tasks are needed for the cross-linking CV:

Hosting here on GitHub (CSV format?)
Unique identifiers for all cross-link and dead-end mods
Put identifiers into XL example documents

Update list of anticipated CV terms during validation

The following terms ought to be anticipated when running the validator (and thus remove the warning messages):

MS:1002567 (phosphoRS score threshold) and MS:1002557 (D-Score threshold) ought to be anticipated terms under Threshold under SpectrumIdentificationProtocol?
MS:1002497 (group PSMs by sequence with modifications) (and the other new related terms) ought to be anticipated terms under AdditionalSearchParams?
MS:1000894 (retention time) ought to be an anticipated term under SpectrumIdentificationResult?
MS:1002471, MS:1002470 and MS:1002542 ought to be anticipated terms under ProteinAmbiguityGroup?

mzid 1.2 to do list from Gent

Hi all,

Various items from the sessions this morning in Gent:

- The mzid 1.2 validator is not easy to locate on GitHub. @germa, are you able to make it prominently available as a build under a "Downloads" web page on GitHub? Apparently it is fairly easy to build end-user web pages on GitHub; @tobias and @ypriverol have experience of this.
- There are various bugs in the validator - if you have identified one, please make sure to email it to @germa to fix
- @edeutsch has taken an action item to do some CV tidying, particularly around scores for PSMs,
  PTM, peptides, proteins and protein groups
- We need ~11 new CV terms, as detailed in Table 2 of the spec doc (just pushed to GitHub) for combined spectra as input to searches
- Term MS:1002499 (peptide level score) should be deprecated, as it was never intended to go in the CV. @germa, can you do this please?
- Terms starting "group PSMs by..." are under the wrong parent. They should be under a new parent
  term called "identification parameters", and a new link added to the mapping file for
  "AdditionalSearchParameters" allowing the child terms of this term to be used i.e. these terms are
  intended to be used in this part of the Protocol.
- These terms: de novo, proteogenomics spectral need an extra parent term: special processing (MS:1002489), so they trigger special behaviour of the validator.

Please add anything I missed to this list.
best wishes
@andrewrobertjones

Scientific notation not allowed in PTM localization scores?

Why is the following not allowed?

"MS:1001969" name="phosphoRS score" value="1:1.0468246849444967E-8:4:false"

Is scientific notation not supported?

Full error:
"The regular expression in SpectrumIdentificationItem (id='SII_4070_1') element at /MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/cvParam ('phosphoRS score') is not valid."
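
One way to see what the question is asking for is a pattern whose score field accepts an exponent. The sketch below is only an illustration: the pattern and the function name are hypothetical and are not the validator's actual regular expression, and the field layout (modification index, score, position, true/false) is assumed from the example value above.

```python
import re

# Hypothetical pattern, NOT the validator's actual regular expression.
# A float that may carry an exponent, e.g. 0.5, 1.0468246849444967E-8, 3e+2.
FLOAT = r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?"

# Assumed value layout: <mod index>:<score>:<position>:<true|false>
LOCALIZATION_SCORE = re.compile(rf"\d+:{FLOAT}:\d+:(?:true|false)")


def is_valid_localization_value(value: str) -> bool:
    """Return True when the whole value matches the assumed layout."""
    return LOCALIZATION_SCORE.fullmatch(value) is not None
```

With a pattern like this, both plain decimals and values written in scientific notation pass, while non-numeric scores are still rejected.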

Is mzid valid without spectra data in referenced file?

My recollection is that a valid mzIdentML file SHOULD be accompanied by the spectra searched in an external file, but I don't actually see this line in the 1.1 or 1.2-draft specs.

Should we add this line? If so:

- All our example files should have the searched file alongside them.
- The validator should check that the file exists as referenced in the mzid file. Even better, it would check that experimentalMassToCharge values match what is reported in the spectra file, based on the referencing system used. This would be tricky to implement beyond referencing MGF or mzML files, so it is arguable whether it would be useful in practice.
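
The simpler of the two checks (does the referenced file exist?) could look roughly like the sketch below. It is not part of the validator; the function name is hypothetical, and it assumes that relative locations are resolved against the directory of the mzid file.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlparse


def missing_spectra_files(mzid_path):
    """Return SpectraData locations that do not resolve to an existing file."""
    base_dir = os.path.dirname(os.path.abspath(mzid_path))
    missing = []
    for _, elem in ET.iterparse(mzid_path):
        # Compare local names so the check works for any schema namespace.
        if elem.tag.rsplit("}", 1)[-1] != "SpectraData":
            continue
        location = elem.get("location", "")
        parsed = urlparse(location)
        # file:// URIs are decoded; anything else is treated as a plain path.
        path = unquote(parsed.path) if parsed.scheme == "file" else location
        if not os.path.isabs(path):
            # Assumption: relative locations are relative to the mzid file.
            path = os.path.join(base_dir, path)
        if not os.path.exists(path):
            missing.append(location)
    return missing
```

An empty list means every referenced spectra file was found. Matching experimentalMassToCharge values would need a reader for each spectra format and is left out here.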

Crash of Validator (on invalid file)

I get a reproducible crash of the validator (1.4.18) when trying to validate the attached file: [pia_testMzIdentML.mzId.gz](https://github.com/HUPO-PSI/mzIdentML/files/315992/pia_testMzIdentML.mzId.gz)

The file is erroneous, but the validator should still not crash. After the last uncaught exception, the GUI stays where it is, pretending to do something.

This is the output on the console:

$ java -Xmx8G -jar mzIdentMLValidator-1.4.18-SNAPSHOT.jar
BrendaTissueOBO.obo
gene_ontology.obo
psi-ms.obo
PSI-MOD.obo
pato.obo
unimod.obo
unit.obo

There were errors validating against the XML schema:

ValidatorMessage{message='Non-fatal XML Parsing error detected on line 20215
Error message: cvc-complex-type.2.4.b: The content of element 'AnalysisCollection' is not complete. One of '{"http://psidev.info/psi/pi/mzIdentML/1.1":SpectrumIdentification}' is expected.', level=ERROR, context=null, rule=null}

ValidatorMessage{message='Non-fatal XML Parsing error detected on line 41189
Error message: Key 'FK_SoftwareContact' with value 'ORG_MSL' not found for identity constraint of element 'MzIdentML'.', level=ERROR, context=null, rule=null}

ValidatorMessage{message='Non-fatal XML Parsing error detected on line 20216
Error message: cvc-complex-type.2.4.b: The content of element 'AnalysisProtocolCollection' is not complete. One of '{"http://psidev.info/psi/pi/mzIdentML/1.1":SpectrumIdentificationProtocol}' is expected.', level=ERROR, context=null, rule=null}

ValidatorMessage{message='Non-fatal XML Parsing error detected on line 20261
Error message: cvc-complex-type.4: Attribute 'id' must appear on element 'SpectrumIdentificationList'.', level=ERROR, context=null, rule=null}
Number of rules to check: 35
Exception in thread "Thread-1" java.lang.IllegalStateException: Could not instantiate reference resolver: uk.ac.ebi.jmzidml.xml.jaxb.resolver.ContactRoleRefResolver
at uk.ac.ebi.jmzidml.xml.jaxb.unmarshaller.listeners.RawXMLListener.referenceResolving(RawXMLListener.java:192)
at uk.ac.ebi.jmzidml.xml.jaxb.unmarshaller.listeners.RawXMLListener.afterUnmarshal(RawXMLListener.java:55)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.Loader.fireAfterUnmarshal(Loader.java:221)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StructureLoader.leaveElement(StructureLoader.java:276)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.endElement(UnmarshallingContext.java:585)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector.endElement(SAXConnector.java:165)
at org.xml.sax.helpers.XMLFilterImpl.endElement(XMLFilterImpl.java:570)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1783)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2970)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:118)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:243)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:221)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:273)
at uk.ac.ebi.jmzidml.xml.io.MzIdentMLObjectIterator.next(MzIdentMLObjectIterator.java:88)
at uk.ac.ebi.jmzidml.xml.io.MzIdentMLObjectIterator.next(MzIdentMLObjectIterator.java:41)
at psidev.psi.pi.validator.MzIdentMLValidator.checkElementCvMapping(MzIdentMLValidator.java:1143)
at psidev.psi.pi.validator.MzIdentMLValidator.applyCVMappingRules(MzIdentMLValidator.java:779)
at psidev.psi.pi.validator.MzIdentMLValidator.doValidationWork(MzIdentMLValidator.java:511)
at psidev.psi.pi.validator.MzIdentMLValidator.startValidation(MzIdentMLValidator.java:429)
at psidev.psi.pi.validator.MzIdentMLValidatorGUI$4.construct(MzIdentMLValidatorGUI.java:681)
at psidev.psi.pi.validator.swingworker.SwingWorker.lambda$new$1(SwingWorker.java:138)
at java.lang.Thread.run(Thread.java:745)

CV priority, Unimod or XLMOD if both applicable?

Some of the cross-linkers have CV terms in Unimod, e.g. the mono-links for DSS (but not the cross-link).
If there are CV terms in Unimod available, should the Unimod terms be used, or should the XLMOD terms from the csv file have priority for cross-linking specific information?
The cross-link CV terms in Unimod are patchy, so if we prioritize Unimod we will have both Unimod and XLMOD terms in most files. Otherwise we could cover all cross-linking specific information with XLMOD terms and be more uniform.

MS:1002404 (count of identified proteins) ought to be a child of MS:1001184 (search statistics)?

MS:1002404 (count of identified proteins) is not a child of MS:1001184 (search statistics). Should it not be? Making it one would remove warning messages when running the validator.

generic neutral loss term with value slot for the formula

Should we define a generic neutral loss term as proposed by Steffen Neumann,
see https://sourceforge.net/p/psidev/mailman/psidev-ms-vocab/?viewmonth=201605&viewday=25

and should the terms
id: MS:1002455 ! H2O neutral loss,
id: MS:1002456 ! NH3 neutral loss and
id: MS:1002457 ! H3PO4 neutral loss
then be made obsolete?

Validation issues with mzidLib_Rosetta2a_Ecoli_spectra_msgfplus_fdr_threshold_groups.mzid

SchemaLocation is http://www.psidev.info/files/mzIdentML1.1.0.xsd
ERROR: cvParam PSM-level q-value should have units, but it does not!
WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001189 should be 'modification specificity peptide N-term' instead of 'modification specificity N-term'
WARNING: MS:1002241 should be 'mzidLib:ProteoGrouper' instead of 'ProteoGrouper'
WARNING: MS:1002404 should be 'count of identified proteins' instead of 'count of identified protein'

Proteogenomics encoding

I am adding the proteogenomics encoding to the spec doc. I am opening this issue to check with @germa that the validator verifies that this term is present in SIProtocol (SpectrumIdentificationProtocol):

<cvParam cvRef="PSI-MS" accession="MS:1002635" name="proteogenomics search"></cvParam>

And then expects ALL the following elements to be present on every PeptideEvidence:

 <cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="4"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002639" name="peptide start on chromosome" value="73417647"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002640" name="peptide end on chromosome" value="73418129"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002641" name="peptide exon count" value="2"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002642" name="peptide exon nucleotide sizes" value="24,42"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002643" name="peptide start positions on chromosome" value="73417647,73418087"></cvParam>

SearchDatabase MUST have the genome reference version:

 <SearchDatabase numDatabaseSequences="299106" location="PXD000764_34939_combined_concatenated_target_decoy.fasta" id="SearchDB_1">
    <FileFormat>
      <cvParam cvRef="PSI-MS" accession="MS:1001348" name="FASTA format"></cvParam>
    </FileFormat>
    <DatabaseName>
      <userParam name="PXD000764_34939_combined_concatenated_target_decoy.fasta"></userParam>      
    </DatabaseName>
   <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"/>
  </SearchDatabase>
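
The rule described above can be sketched as a small check. This is an illustration rather than the validator's implementation: the function names are hypothetical, and the check only looks at cvParam elements that are direct children of PeptideEvidence.

```python
import xml.etree.ElementTree as ET

PROTEOGENOMICS_SEARCH = "MS:1002635"
# The seven accessions listed above for every PeptideEvidence.
REQUIRED_ON_PEPTIDE_EVIDENCE = {
    "MS:1002637", "MS:1002638", "MS:1002639", "MS:1002640",
    "MS:1002641", "MS:1002642", "MS:1002643",
}


def local_name(tag):
    """Strip the XML namespace from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def is_proteogenomics_search(root):
    """True if any SpectrumIdentificationProtocol carries MS:1002635."""
    for protocol in root.iter():
        if local_name(protocol.tag) != "SpectrumIdentificationProtocol":
            continue
        for param in protocol.iter():
            if param.get("accession") == PROTEOGENOMICS_SEARCH:
                return True
    return False


def proteogenomics_problems(root):
    """Map PeptideEvidence id to the sorted list of missing accessions."""
    if not is_proteogenomics_search(root):
        return {}
    problems = {}
    for evidence in root.iter():
        if local_name(evidence.tag) != "PeptideEvidence":
            continue
        present = {child.get("accession") for child in evidence
                   if local_name(child.tag) == "cvParam"}
        missing = REQUIRED_ON_PEPTIDE_EVIDENCE - present
        if missing:
            problems[evidence.get("id")] = sorted(missing)
    return problems
```

A similar lookup for MS:1002644 under SearchDatabase would cover the genome reference version requirement.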

Consensus Spectra (labelled and unlabelled cross-linkers)

Current state:
As it is specified now, the two spectra are referenced as an unordered list of spectrum IDs in the SpectrumIdentificationResult.
For some use cases it could be useful to know which of these is the light and which is the heavy spectrum, since they are not necessarily treated equally.
Also, the experimentalMassToCharge and chargeState from only one spectrum are written in SpectrumIdentificationItem, and it is not specified which one, although it would actually be useful to have both.
For cleavable cross-linkers it was agreed to use different values here for the two SpectrumIdentificationItems corresponding to the MS3 spectra each containing one peptide.
Although the information is there, it is still unclear which spectrum ID in the list belongs to which SpectrumIdentificationItem, MassToCharge and chargeState, since the SpectrumIdentificationResult only has the IDs and the SpectrumIdentificationItems only have the other values.
But I think MassToCharge and chargeState values for each spectrum ID should be in the file and somehow linked to their corresponding IDs.

Proposals:
Could we add CV terms to make that clearer?
One easy thing to do would be to specify that, in the case of labelled cross-linkers, there is a fixed order for the spectrumIDs, e.g. the first ID must be the unlabelled or light spectrum, or more generally the IDs should be in ascending order of label weight or MassToCharge.
That would already add a crucial bit of information.
Or we could add lists of MassToCharge and chargeState values that must have the same order as the list of IDs.
I would like to avoid specific CV terms for light and heavy spectra and come up with something general enough to cover labelled and cleavable cross-linkers.

Duplicate mapping files

There seem to be two mapping files in the repo. I have been using:

33846 Mar 31 12:43 cv/mzIdentML-mapping_1.2.0.xml

But this one seems larger and is perhaps the right one?
41666 Jun 23 10:38 validator/trunk/src/main/resources/mzIdentML-mapping_1.2.0.xml

If the one in cv/ is obsolete, we should either make sure they are sync'ed, or only have one copy.

Or would you let me know if there is an important difference?
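
To answer the last question without reading both files by eye, a line-based diff is probably enough. The sketch below uses only the Python standard library; the function name is hypothetical and the two paths are the ones quoted above.

```python
import difflib


def mapping_diff(path_a, path_b):
    """Return unified-diff lines between two files; an empty list means in sync."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        lines_a, lines_b = fa.readlines(), fb.readlines()
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=path_a, tofile=path_b))
```

For example, `mapping_diff("cv/mzIdentML-mapping_1.2.0.xml", "validator/trunk/src/main/resources/mzIdentML-mapping_1.2.0.xml")` would list the rules present in only one copy. If both copies are kept, the same call could run as a repository check.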

more structure in XLMOD.obo

I think it would be very valuable to add some hierarchical structure to XLMOD.obo. Right now there are no parents. I propose that we have a top-level term, perhaps similar to the PSI-MS top term:

[Term]
id: XL:00000
name: Proteomics Standards Initiative cross-linking controlled vocabulary
def: "Proteomics Standards Initiative cross-linking controlled vocabulary." [PSI:XL]

and then a child something like:

[Term]
id: XL:00012
name: cross-linking entity
def: "Entity relevant to the domain of cross-linking in proteomics." [PSI:XL]
is_a: XL:00000 ! Proteomics Standards Initiative cross-linking controlled vocabulary
relationship: part_of XL:00000 ! Proteomics Standards Initiative cross-linking controlled vocabulary

[Term]
id: XL:00013
name: cross-linker
def: "Compound that can link one polymer chain to another." [PSI:XL]
is_a: XL:00012 ! cross-linking entity

This would allow us to grow the CV in a more tidy fashion, plus would allow the mapping file to stipulate that location X is the right place to put a child of XL:00013. I don't think this is possible in the current layout.

What do you think?
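
To illustrate why the hierarchy matters for the mapping file: with is_a links in place, "any child of XL:00013" becomes a simple graph walk. The sketch below is a minimal OBO reader written for this purpose, not a full parser; both function names are hypothetical.

```python
def parse_obo_terms(text):
    """Return {term id: {"name": str, "is_a": [parent ids]}} from OBO text."""
    terms, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Only [Term] stanzas are collected; other stanzas are skipped.
            current = {"name": "", "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        key, value = line.split(":", 1)
        value = value.strip()
        if key == "id":
            terms[value] = current
        elif key == "name":
            current["name"] = value
        elif key == "is_a":
            # Drop the trailing "! parent name" comment.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms


def descendants(terms, root_id):
    """Return the ids of all terms reachable from root_id via is_a links."""
    found = set()
    changed = True
    while changed:
        changed = False
        for term_id, term in terms.items():
            if term_id in found:
                continue
            if any(p == root_id or p in found for p in term["is_a"]):
                found.add(term_id)
                changed = True
    return found
```

With the flat layout, `descendants(terms, "XL:00013")` is always empty, which is the limitation described above.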

Encoding of cross-linker as part of SearchModifications

Currently SearchModifications do not encode for separate sites of e.g. a cross-linker.
To circumvent this we could use the same method we use to encode peptide modifications, i.e. encode the cross-linker as one modification that holds the mass and the specificities for one site, and have a second zero-mass modification that encodes the specificities of the second site of the cross-linker.

E.g. for BS3 alone:

      <SearchModification fixedMod="false" massDelta="138.06808" residues="S T Y K">
        <cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
        <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="138.06808" residues=".">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
        </SpecificityRules>
        <cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
        <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
        <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.0" residues=".">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
        </SpecificityRules>
        <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
      </SearchModification>

Both sides of the cross-linker are linked (the same as in peptide modifications) via CV terms:

<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="X"/>
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="X"/>

For a case with two cross-linkers (e.g. BS3-d0/BS3-d4) this would look like this:

      <SearchModification fixedMod="false" massDelta="138.06808" residues="S T Y K">
        <cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
        <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="138.06808" residues=".">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
        </SpecificityRules>
        <cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
        <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
        <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.0" residues=".">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
        </SpecificityRules>
        <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="142.09317" residues="S T Y K">
        <cvParam cvRef="XLMOD" accession="XL:00005" name="Xlink:BS3:d4"></cvParam>
        <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="1"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="142.09317" residues=".">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
        </SpecificityRules>
        <cvParam cvRef="XLMOD" accession="XL:00005" name="Xlink:BS3:d4"></cvParam>
        <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="1"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
        <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="1"></cvParam>
      </SearchModification>
      <SearchModification fixedMod="false" massDelta="0.0" residues=".">
        <SpecificityRules>
          <cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
        </SpecificityRules>
        <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="1"></cvParam>
      </SearchModification>
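
A reader of such a file has to reassemble each cross-linker from its donor and acceptor halves. The sketch below shows one way to do that by grouping SearchModification elements on the shared value of MS:1002509 and MS:1002510. It assumes the encoding proposed above; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DONOR, ACCEPTOR = "MS:1002509", "MS:1002510"


def local_name(tag):
    """Strip the XML namespace from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def cross_link_pairs(modification_params):
    """Group residue specificities by the value shared by donor and acceptor."""
    pairs = {}
    for mod in modification_params.iter():
        if local_name(mod.tag) != "SearchModification":
            continue
        # Donor/acceptor cvParams are direct children of SearchModification.
        for param in mod:
            accession = param.get("accession")
            if accession not in (DONOR, ACCEPTOR):
                continue
            entry = pairs.setdefault(param.get("value"),
                                     {"donor": [], "acceptor": []})
            side = "donor" if accession == DONOR else "acceptor"
            entry[side].append(mod.get("residues"))
    return pairs


def unpaired_values(pairs):
    """Return link values that have a donor without an acceptor, or vice versa."""
    return sorted(value for value, entry in pairs.items()
                  if not entry["donor"] or not entry["acceptor"])
```

For the BS3-d0/BS3-d4 example, this yields two groups keyed "0" and "1", each with two donor and two acceptor specificities.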

Validation issues with mzidLib_rosetta_2a_uniprot_proteogrouped.mzid

ERROR: cvParam product ion intensity should have units, but it does not!
ERROR: cvParam product ion m/z should have units, but it does not!
WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001149 should be 'param: b ion-NH3 DEPRECATED' instead of 'param: b ion-NH3'
WARNING: MS:1001150 should be 'param: b ion-H2O DEPRECATED' instead of 'param: b ion-H2O'
WARNING: MS:1001151 should be 'param: y ion-NH3 DEPRECATED' instead of 'param: y ion-NH3'
WARNING: MS:1001152 should be 'param: y ion-H2O DEPRECATED' instead of 'param: y ion-H2O'
WARNING: MS:1001171 should be 'Mascot:score' instead of 'mascot:score'
WARNING: MS:1001172 should be 'Mascot:expectation value' instead of 'mascot:expectation value'
WARNING: MS:1001189 should be 'modification specificity peptide N-term' instead of 'modification specificity N-term'
WARNING: MS:1001199 should be 'Mascot DAT format' instead of 'Mascot DAT file'
WARNING: MS:1001316 should be 'Mascot:SigThreshold' instead of 'mascot:SigThreshold'
WARNING: MS:1001370 should be 'Mascot:homology threshold' instead of 'mascot:homology threshold'
WARNING: MS:1001371 should be 'Mascot:identity threshold' instead of 'mascot:identity threshold'
WARNING: MS:1002241 should be 'mzidLib:ProteoGrouper' instead of 'ProteoGrouper'
WARNING: MS:1002404 should be 'count of identified proteins' instead of 'count of identified protein'

Error in PeptideLevelStatsObjectRule

When running the current validator on the PeptideShaker mzid 1.2 example file I now get the following error:

    Rule ID: PeptideLevelStatsObjectRule
    Level: ERROR
    Context(/MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem )
    --> The SpectrumIdentificationItem (id='SII_2388_1') element at /MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem doesn't contain the  triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) required in case of peptide-level scoring
    Tip: Add the triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) to each SpectrumIdentificationItem

However, the following CV terms are all there:

<cvParam cvRef="PSI-MS" accession="MS:1002468" name="PeptideShaker peptide score" value="17.781512503836435"/>
<cvParam cvRef="PSI-MS" accession="MS:1002500" name="peptide passes threshold" value="true"/>
<cvParam cvRef="PSI-MS" accession="MS:1002520" name="peptide group ID" value="AVVTVPAYFNDAQR"/>

And MS:1002468 (PeptideShaker peptide score) is a child of MS:1002358 (search engine specific score for distinct peptides), so all should be OK? Note that this error did not show up in the previous version of the validator.
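
For reference, the rule as worded in the error message reduces to the check sketched below. This is not the validator's code: the function name is hypothetical, and the set of children of MS:1002358 has to be supplied by the caller (here only MS:1002468, as stated above).

```python
PEPTIDE_GROUP_ID = "MS:1002520"
PEPTIDE_PASSES_THRESHOLD = "MS:1002500"


def has_peptide_level_triplet(accessions, peptide_score_terms):
    """Check the triplet required on a SpectrumIdentificationItem.

    accessions: cvParam accessions found on the item.
    peptide_score_terms: accessions that are children of MS:1002358.
    """
    present = set(accessions)
    return (PEPTIDE_GROUP_ID in present
            and PEPTIDE_PASSES_THRESHOLD in present
            and bool(present & set(peptide_score_terms)))
```

With the three cvParams shown above and `{"MS:1002468"}` as the score terms, the function returns True, which supports the reading that the file is fine and the rule (or the CV hierarchy it consults) is at fault.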

Incorrect CV term requirement for PTM localization scoring?

It seems to be mandatory to add MS:1002507 (modification rescoring:false localization rate) (or any of its child terms, of which there are none btw) when including MS:1002491 (modification localization scoring) in the AdditionalSearchParams list.

This means that terms such as MS:1001969 (phosphoRS score), MS:1002536 (D-Score), MS:1002550 (peptide:phosphoRS score) and MS:1002553 (peptide:D-Score) (and others) cannot be used for the PTM rescoring.

I think that either these terms have to be children of MS:1002507 (modification rescoring:false localization rate) or the mandatory term when using modification rescoring ought to be MS:1002505 (regular expression for modification localization scoring) or one of its children?

Referencing CVs

Many/most of the mzIdentML files do not reference the CVs properly. All files should be checked and corrected.

I believe that the correct CV locations should be:
https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo
https://raw.githubusercontent.com/bio-ontology-research-group/unit-ontology/master/unit.obo

Is that right? Corrections welcome.

Manual annotation of PSMs

We briefly discussed at the PSI meeting that manual annotations would be possible by manually setting the passThreshold value in SpectrumIdentificationItem.
Since manual examination of spectra is still very important in cross-linking pipelines, could we add more flexible annotations, maybe a CV term for free text comments?
Manually changing the passThreshold flag would not be transparent and reproducible.
Or is there a good reason to generally avoid such free text comments?

Validation issues with PeptideShaker_mzid_1_2_example.mzid

Latest attempt to validate yields these problems:

WARNING: CV term MS:1002694 ('frag: precursor ion - CH4OS') is not in the cv
WARNING: MS:1002674 should be 'X500R QTOF' instead of 'frag: b ion - CH4OS'
WARNING: MS:1002686 should be 'OpenXQuest:wTIC' instead of 'frag: y ion - CH4OS'

Looks like attempts to use terms that were proposed but rejected.

Support for .mzid.gz in the validator

@germa Can you add support for .mzid.gz directly in the validator? As gzip is the recommended compression for mzid, and we have compressed files on GitHub, it would be nice if we could validate without manually decompressing.
thanks
Andy
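
The validator itself is written in Java, so the following Python sketch only illustrates the idea: detect gzip by its two magic bytes rather than by file extension, and hand the XML parser a decompressing stream. The function names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def open_mzid(path):
    """Open an mzIdentML file, decompressing on the fly when it is gzipped."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    # Every gzip stream starts with the bytes 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")


def root_tag(path):
    """Return the tag of the root element without reading the whole file."""
    with open_mzid(path) as fh:
        for _, elem in ET.iterparse(fh, events=("start",)):
            return elem.tag
```

Checking the magic bytes means a file named .mzid that is in fact compressed, or the reverse, is still handled.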

Build of validator

@germa Apologies if I missed it somewhere - it's not obvious whether there is a release of the validator available for developers. It would be good to post this on the front page of the mzIdentML GitHub, or at least have it available inside the Git repository. I'm struggling to locate it.

thanks
Andy

Cross-Linking fragment annotations

We are trying to implement fragment annotations for cross-linked peptides in OpenMS and uploaded an example file here.
In the case of cross-linking, we have fragments resulting from cleavages of both peptides in one spectrum. Distributing the annotations between several SIIs would therefore complicate further use of these annotations (e.g. visualization), and the separation would not be possible for more complex fragmentation patterns, e.g. cross-linked fragments where both peptides were cleaved. So we would prefer to store the annotations for one PSM under one SII, e.g. the light alpha case.

The example suggests adapting FragmentArray to also carry strings (words) or have an additional array carrying these (not in the example).
The specific CVs we would need are:
One Array for "cross-link chain" containing CV strings for "donor" or "acceptor" for each fragment.
One Array for "cross-link ion category" containing CV strings for "common ion" or "cross-linked ion" for each fragment.
The example file shows roughly what that could look like.

Another way to do it would be to introduce additional IonTypes and thereby separate these types into different arrays. In that case we would need all four combinations of donor/acceptor with common/cross-linked.

MS:1001355 peptide descriptions

Can we make the following term obsolete?

[Term]
id: MS:1001355
name: peptide descriptions
def: "Descriptions of peptides." [PSI:PI]
is_a: MS:1001105 ! peptide result details

It has no children and no associated value, so it does not make much sense.

Validator: null pointer for missing version number in the cvList

I get the following null pointer when trying to run the validator on the PeptideShaker 1.2 example file:

Exception in thread "Thread-4" java.lang.NullPointerException
        at psidev.psi.pi.validator.objectrules.CvListObjectRule.check(CvListObjectRule.java:77)
        at psidev.psi.pi.validator.objectrules.CvListObjectRule.check(CvListObjectRule.java:21)
        at psidev.psi.pi.validator.MzIdentMLValidator.validate(MzIdentMLValidator.java:864)
        at psidev.psi.pi.validator.MzIdentMLValidator.checkElementObjectRule(MzIdentMLValidator.java:823)
        at psidev.psi.pi.validator.MzIdentMLValidator.applyObjectRules(MzIdentMLValidator.java:676)
        at psidev.psi.pi.validator.MzIdentMLValidator.doValidationWork(MzIdentMLValidator.java:447)
        at psidev.psi.pi.validator.MzIdentMLValidator.startValidation(MzIdentMLValidator.java:417)
        at psidev.psi.pi.validator.MzIdentMLValidatorGUI$4.construct(MzIdentMLValidatorGUI.java:675)
        at psidev.psi.pi.validator.swingworker.SwingWorker$2.run(SwingWorker.java:147)
        at java.lang.Thread.run(Unknown Source)

The problem seems to be that the validator requires a version number for the elements in the cvList, but as far as I can tell the version number is not mandatory here?

I think it can all be easily fixed by checking that the version number is not null before using the version object?

@germa Can you take a look at this one?

URI problem with mzidLib_rosetta_2a_uniprot_proteogrouped.mzid

Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_rosetta_2a_uniprot_proteogrouped.mzid, line 3474, char 72
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'file:///Rosetta peak list 2a.mgf' is NOT a valid URI .

URIs may not have spaces in them, according to my Xerces validator.
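
The usual fix on the writer side is to percent-encode the path when building the URI. A minimal sketch with the Python standard library follows; the helper names are hypothetical, and absolute POSIX-style paths are assumed.

```python
from urllib.parse import quote, unquote, urlparse


def to_file_uri(path):
    """Build a file URI from an absolute POSIX path, escaping spaces etc."""
    return "file://" + quote(path)


def from_file_uri(uri):
    """Recover the original path from a percent-encoded file URI."""
    return unquote(urlparse(uri).path)
```

For the location above, `to_file_uri("/Rosetta peak list 2a.mgf")` gives `file:///Rosetta%20peak%20list%202a.mgf`, and `from_file_uri` turns it back into the path with spaces.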

Validation issues with Panalyzer* files

These files do not properly declare that they are mzIdentML 1.2; they contain mixed references to 1.1 and 1.2, so they cannot be validated.

Validation issues with OpenxQuest_example.mzid

ERROR: cvParam retention time should have units, but it does not!
ERROR: cvParam search engine specific score for PSMs has a value, but it should not!
WARNING: cvParam product ion intensity has a legal unit MS:1000131 but its name 'MS:1000131' should be 'number of detector counts'!
WARNING: MS:1002510 should be 'cross-link acceptor' instead of 'cross-link receiver'

Cross-linking CV list needs versioning

for better referencing. Also see #2

Validation issues with OpenxQuest_example.mzid

WARNING: XL:00002 should be 'DSS' instead of 'Xlink:DSS'

ERROR: cvParam retention time has an illegal unit accession number "second"!

The latter is complaining about this:
<cvParam accession="MS:1000894" cvRef="PSI-MS" name="retention time" value="5458.13539999998" unitAccession="second" unitName="" unitCvRef="se"/>

The unit name is written in the unitAccession attribute, and unitName is empty. The unitCvRef is also wrong: it is "se", but in the cvList the ID is "UO".
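
The three problems with the unit attributes can be expressed as a small check. The sketch below is illustrative only: the function name is hypothetical, the cvParam is passed as a plain dictionary of attributes, and the accession pattern (prefix, colon, digits) is a simplifying assumption.

```python
import re

# Simplified accession shape such as "UO:0000010" or "MS:1000894".
ACCESSION = re.compile(r"[A-Za-z]+:\d+")


def unit_problems(cv_param, declared_cv_ids):
    """Return a list of problems with the unit attributes of one cvParam."""
    problems = []
    if not ACCESSION.fullmatch(cv_param.get("unitAccession", "")):
        problems.append("unitAccession is not an accession")
    if not cv_param.get("unitName"):
        problems.append("unitName is empty")
    if cv_param.get("unitCvRef") not in declared_cv_ids:
        problems.append("unitCvRef does not match any cv id in the cvList")
    return problems
```

Applied to the attributes above with `{"PSI-MS", "UO"}` as the declared CV ids, all three problems are reported. A cvParam with unitAccession="UO:0000010", unitName="second" and unitCvRef="UO" passes.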

How to annotate reporter ions as part of the Fragmentation section?

Related to #9, are there any examples of how reporter ions from, for example, iTRAQ and TMT are to be annotated as part of the Fragmentation section? I could not find the required CV terms.

Validator has issues with new threshold CV terms for mzid 1.1 files

If I recreate the PeptideShaker mzid 1.2 example file but remove all the new 1.2 additions, i.e., creating an mzid 1.1 compatible file, the validator is not happy and gives the following errors:

Message 1:
Rule ID: ProteinDetectionProtocolThreshold_rule
Level: ERROR
Context(/threshold/cvParam/@accession ) in 2 locations
--> The result found at: /threshold/cvParam/@accession for which the values is ''MS:1002369'' didn't match any of the 4 specified CV terms:

The sole term MS:1001447 (prot:FDR threshold) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
The sole term MS:1001494 (no threshold) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
Any children term of MS:1001153 (search engine specific score). The term can be repeated. The matching value has to be the identifier of the term, not its name.
Any children term of MS:1001302 (search engine specific input parameter). The term can be repeated. The matching value has to be the identifier of the term, not its name.

Message 2:
Rule ID: SpectrumIdentificationProtocolThreshold_must_rule
Level: ERROR
Context(/threshold/cvParam/@accession ) in 2 locations
--> The result found at: /threshold/cvParam/@accession for which the values are ''MS:1001364', 'MS:1002350', 'MS:1002567', 'MS:1002557'' didn't match any of the 4 specified CV terms:

The sole term MS:1001448 (pep:FDR threshold) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
The sole term MS:1001494 (no threshold) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
Any children term of MS:1001153 (search engine specific score). The term can be repeated. The matching value has to be the identifier of the term, not its name.
Any children term of MS:1001302 (search engine specific input parameter). The term can be repeated. The matching value has to be the identifier of the term, not its name.

Both seem to be due to the validator not considering new currently supported threshold terms?

In the first message it says that "MS:1001447 (prot:FDR threshold) or any of its children" ought to be used. I think the correct term now is "MS:1002572 (protein detection statistical threshold) or any of its children"? As this covers the new "protein group-level statistical threshold" terms?

Similar for the second message: "MS:1001448 (pep:FDR threshold) or any of its children" ought to be changed into "MS:1002484 (peptide-level statistical threshold) or any of its children"?

Multiple search engine encoding

We decided in Ghent to get rid of the concept of "final PSM list" and "intermediate list" - only final results are allowed to make reading easier. @germa Please remove this from the mapping file and validator once agreed. Log any complaints here!

I have added a schema-level check that the combination of spectrumID and spectraData_ref is unique, to lock down this potential for error in results files.

Please check that this is okay on all your example files.
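For anyone who wants to check their example files before running the schema validation, here is a rough sketch of the same uniqueness test done outside the schema. It assumes the pair lives on the spectrumID and spectraData_ref attributes of SpectrumIdentificationResult; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_spectrum_keys(mzid_xml):
    """Return (spectraData_ref, spectrumID) pairs that occur more than once."""
    root = ET.fromstring(mzid_xml)
    counts = Counter()
    for element in root.iter():
        # Strip the XML namespace, if any, to get the local element name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "SpectrumIdentificationResult":
            key = (element.get("spectraData_ref"), element.get("spectrumID"))
            counts[key] += 1
    return [key for key, count in counts.items() if count > 1]


# Tiny hand-written fragment, not a complete mzIdentML document.
example = """
<SpectrumIdentificationList>
  <SpectrumIdentificationResult id="SIR_1" spectrumID="index=1" spectraData_ref="SD_1"/>
  <SpectrumIdentificationResult id="SIR_2" spectrumID="index=2" spectraData_ref="SD_1"/>
  <SpectrumIdentificationResult id="SIR_3" spectrumID="index=1" spectraData_ref="SD_1"/>
</SpectrumIdentificationList>
"""
print(duplicate_spectrum_keys(example))
```

For large files a streaming parser such as ET.iterparse would be a better fit than loading the whole document.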

CV terms for PSM, peptide, protein, protein group and mod localisation - need clean up

Just opening a comment here to remind us we need to do this. In Ghent, we discussed making different CV terms for all of PSM, peptide, protein, protein group and mod localisation so that reading software could easily tell, say under SpectrumIdentificationItem, whether the score is for a PSM, peptide or mod localisation.

Mod localisation will need specific terms anyway, since the values have a special structure. The most important work is therefore separating PSM-level and peptide-level scores, so that a reader can tell them apart.

I have assigned to @edeutsch since Eric offered to take a look in Ghent, but other helpers would be very welcome!

Validation issues with mzidLib_Rosetta2a_Ecoli_spectra_msgfplus_fdr_threshold_groups.mzid

WARNING: MS:1002235 should be 'ProteoGrouper:PDH score' instead of 'mzidLib:ProteoGrouper:PDH score'
WARNING: MS:1002236 should be 'ProteoGrouper:PAG score' instead of 'mzidLib:ProteoGrouper:PAG score'
WARNING: MS:1002241 should be 'mzidLib:ProteoGrouper' instead of 'ProteoGrouper'

How to use MS:1001239 (frag: immonium ion) to annotate immonium ions on specific amino acids?

What is the intended use of MS:1001239 (frag: immonium ion) and how can this be used to annotate immonium ions on specific amino acids?

Here's what we currently do in the PeptideShaker mzIdentML export using the outdated PRIDE CV terms:

<IonType charge="1" index="4 5">
   <FragmentArray measure_ref="Measure_MZ" values="120.0808411"/>
   <FragmentArray measure_ref="Measure_Int" values="14483.125"/>
   <FragmentArray measure_ref="Measure_Error" values="6.53397579810644E-5"/>
   <cvParam cvRef="PRIDE" accession="PRIDE:0000244" name="immonium F"/>
</IonType>

How would the same be done with MS:1001239 (frag: immonium ion)?

Remove the value requirement for CV terms used to indicate data type for a list

At the meeting we agreed to remove the value requirement from the following MS:1001221 (fragmentation information) child terms:

MS:1001226 (product ion intensity)
MS:1002225 (average product ion intensity)
MS:1002226 (product ion intensity standard deviation)
MS:1000904 (product ion m/z delta)

The reason is that these terms are used to describe the data type in mzid files as part of the FragmentationTable element, and thus cannot have a value in this setup.

Validation issues with OpenxQuest_example_added_annotations.mzid

ERROR: cvParam retention time should have units, but it does not!
ERROR: cvParam search engine specific score for PSMs has a value, but it should not!
WARNING: cvParam product ion intensity has a legal unit MS:1000131 but its name 'MS:1000131' should be 'number of detector counts'!
WARNING: CV term MS:100xxxx ('crosslink ion category') is not in the cv
WARNING: MS:1002510 should be 'cross-link acceptor' instead of 'cross-link receiver'

Is the DenovoSearchType_may_rule used correctly?

What is the purpose of the DenovoSearchType_may_rule? I get the following for the PeptideShaker mzid 1.2 example file:

Rule ID: DenovoSearchType_may_rule
Level: INFO
Context(/additionalSearchParams/cvParam/@accession ) in 2 locations
--> Not all of the 6 values ParamList's CV terms ['MS:1001211', 'MS:1001256', 'MS:1002492', 'MS:1002490', 'MS:1002497', 'MS:1002491'] found using the Xpath '/additionalSearchParams/cvParam/@accession' matched any of the 1 CvTerm(s):

The sole term MS:1001010 (de novo search) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.

But this is not a de novo search. Shouldn't this rule only be triggered for mzid files that appear to be de novo searches?

Validator: too strict on "cross-link spectrum identification item" values

The CV accession="MS:1002511" name="cross-link spectrum identification item" must have the same value for SIIs that represent a pair of cross-linked peptides.
Currently the validator only allows exactly two equal values, but we have 4 SIIs for the same cross-link when using isotope labelled linkers.
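A sketch of a relaxed check, assuming the rule becomes "at least two SIIs share each value" rather than "exactly two". Whether the count should also be required to be even is an open question that this sketch does not decide; the function and the SII ids are illustrative only.

```python
from collections import defaultdict


def undersized_cross_link_groups(sii_values, minimum=2):
    """Group SII ids by their MS:1002511 value and return groups that are too small.

    sii_values maps an SII id to the value of its
    'cross-link spectrum identification item' cvParam.
    """
    groups = defaultdict(list)
    for sii_id, value in sii_values.items():
        groups[value].append(sii_id)
    return {value: ids for value, ids in groups.items() if len(ids) < minimum}


# Four SIIs for one cross-link (isotope-labelled linker) and one orphan.
values = {
    "SII_1": "1",
    "SII_2": "1",
    "SII_3": "1",
    "SII_4": "1",
    "SII_5": "2",
}
print(undersized_cross_link_groups(values))
```

With this rule the four SIIs sharing value "1" pass, and only the single SII with value "2" is reported.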

Validation issues with mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid

ERROR: cvParam local FDR should have units, but it does not!
ERROR: cvParam q-value for peptides should have units, but it does not!
(the previous two may be errors in the CV itself?)

WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001189 should be 'modification specificity peptide N-term' instead of 'modification specificity N-term'
WARNING: MS:1001330 should be 'X\!Tandem:expect' instead of 'X!Tandem:expect'
WARNING: MS:1001331 should be 'X\!Tandem:hyperscore' instead of 'X!Tandem:hyperscore'
WARNING: MS:1001401 should be 'X\!Tandem xml format' instead of 'X!Tandem xml file'
WARNING: MS:1001476 should be 'X\!Tandem' instead of 'X!Tandem'
WARNING: MS:1001868 should be 'distinct peptide-level q-value' instead of 'q-value for peptides'
WARNING: MS:1002244 should be 'mzidLib:FalseDiscoveryRate' instead of 'mzidLib:FalseDiscoveryRat'
WARNING: MS:1002404 should be 'count of identified proteins' instead of 'count of identified protein'

I'm not quite sure what to make of the X\!Tandem business. That is the way it is written in the OBO file, but I assume the backslash is an escape character that is not to be included in the XML? Should it even be in the OBO file? Is that a limitation of the OBO format? I assume so, because the OBO format uses an ! as a separator character in some places. This should not be repeated in the XML?
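If the backslash is indeed just OBO escaping, a CV-name comparison could unescape the OBO name before comparing it with the name in the XML. A minimal sketch, assuming the OBO file writes the name as X\!Tandem and that a backslash followed by n, t or W stands for newline, tab and space while any other escaped character stands for itself (my reading of the OBO 1.2 escapes; unescape_obo is a hypothetical helper):

```python
# Assumed special escapes; every other "\c" is treated as the literal character c.
_SPECIAL_ESCAPES = {"n": "\n", "t": "\t", "W": " "}


def unescape_obo(text):
    """Remove OBO-style backslash escapes from a tag value."""
    result = []
    chars = iter(text)
    for char in chars:
        if char == "\\":
            escaped = next(chars, "")  # a trailing lone backslash is dropped
            result.append(_SPECIAL_ESCAPES.get(escaped, escaped))
        else:
            result.append(char)
    return "".join(result)


obo_name = "X\\!Tandem:expect"   # as written in the OBO file: X\!Tandem:expect
xml_name = "X!Tandem:expect"     # as written in the mzid file
print(unescape_obo(obo_name) == xml_name)
```

Comparing the unescaped name would make the four X!Tandem warnings disappear without changing the example file.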

Validation of PIA examples

Just asking, because they were left out before I uploaded the new versions:
@edeutsch Are the PIA examples validating for you now? These are MSGFplus_tandem.pia.1.2.mzid.gz and mouse_dataset_-combination.pia.1.2.mzid.gz in multi_search and the PIA... in protein_inference.

Validation issues with OpenxQuest_example_added_annotations.mzid

Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/OpenxQuest_example_added_annotations.mzid, line 7, char 37
Message: Schema in mzIdentML1.2.0.xsd has a different target namespace from the one specified in the instance document http://psidev.info/psi/pi/mzIdentML/1.2.0.

Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/OpenxQuest_example_added_annotations.mzid, line 204, char 81
Message: Attribute 'values' is not declared for element 'userParam'

Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/OpenxQuest_example_added_annotations.mzid, line 446, char 13
Message: The key for identity constraint of element 'MzIdentML' is not found.

and many further instances of these same errors.

combined_fdr_1.2.mzid validation issues

My CV term validator finds these issues with this file:
ERROR: cvParam distinct peptide-level q-value should have units, but it does not!
WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001400 should be 'OMSSA xml format' instead of 'OMSSA xml file'
WARNING: MS:1002439 should be 'final PSM list' instead of 'final PSM list UNDER DISCUSSION'

The first error may be an error in the CV. I don't think we want units for q-value in the term? Should we remove units from all q-value terms? This issue affects several terms.

[Term]
id: MS:1001868
name: distinct peptide-level q-value
def: "Estimation of the q-value for distinct peptides once redundant identifications of the same peptide have been removed (id est multiple PSMs, possibly with different mass modifications, mapping to the same sequence have been collapsed to one entry)." [PSI:PI]
xref: value-type:xsd:double "The allowed value-type for this CV term."
is_a: MS:1002484 ! peptide-level statistical threshold
relationship: has_units UO:0000166 ! parts per notation unit
relationship: has_units UO:0000187 ! percent
relationship: has_domain MS:1002305 ! value between 0 and 1 inclusive
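To find all terms affected by this, one could scan the OBO file for has_units relationships. A rough sketch for a single stanza, written from scratch rather than taken from either validator (units_for_term is a made-up name, and the def line is shortened):

```python
def units_for_term(stanza):
    """Return the unit accessions listed in 'relationship: has_units' lines."""
    units = []
    for line in stanza.splitlines():
        line = line.strip()
        if line.startswith("relationship: has_units "):
            # Line layout: relationship: has_units UO:0000166 ! parts per notation unit
            units.append(line.split()[2])
    return units


stanza = """[Term]
id: MS:1001868
name: distinct peptide-level q-value
xref: value-type:xsd:double "The allowed value-type for this CV term."
is_a: MS:1002484 ! peptide-level statistical threshold
relationship: has_units UO:0000166 ! parts per notation unit
relationship: has_units UO:0000187 ! percent
relationship: has_domain MS:1002305 ! value between 0 and 1 inclusive
"""

units = units_for_term(stanza)
print(units)
print("validator expects units:", bool(units))
```

A term with an empty list here would not trigger the "should have units" error, which is the behaviour being proposed for the q-value terms.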

Cross-link site specificity might need refinement

Hi,
I wanted to point out that the current format, e.g. "(K,L,n-term)", might be insufficient to represent site-specific and site-unspecific terminal cross-linkers.
For example, Unimod distinguishes between a terminal modification ("N-term") and a terminal modification at a specific site ("N-term K").
Maybe we could also adopt this notation.

Mod re-scoring

I have opened an issue here to make sure that:

There are valid CV terms for mod rescoring
Example files are correctly validated.
The spec doc is correct and up-to-date.

Please add any other issues related to mod rescoring. We will close it once all is complete. I've assigned it to Harald - hope that is okay!

URI problem with mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid

And this file doesn't pass my XML validator with these errors:

Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid, line 352, char 146
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'C:\Work\PSI\mzIdentML\ProteinInference\Rosetta2\tandem\peaklist2a_plus_ecoli_versus_unimod_full.xml' is NOT a valid URI .

Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid, line 357, char 197
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'C:/Work/PSI/mzIdentML/ProteinInference/Rosetta2/FASTAs, neat/Rosetta_uniprot_20130402_mouse_FULL_UNIPROT_can+iso.fasta' is NOT a valid URI .

Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid, line 365, char 151
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'C:/Work/PSI/mzIdentML/ProteinInference/Rosetta2/Peak lists with ecoli/Rosetta peak list 2a + Ecoli spectra.mgf' is NOT a valid URI .

URIs need to start with file:/// for local files I think.

And they may not contain spaces.
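A small sketch of how an exporter could turn such Windows paths into file URIs, with spaces and other reserved characters percent-encoded. It uses only urllib.parse.quote from the standard library; the function name and the shortened example paths are illustrative.

```python
from urllib.parse import quote


def windows_path_to_file_uri(path):
    """Convert an absolute Windows path such as C:\\dir\\file.xml to a file URI."""
    # Normalise backslashes to forward slashes first.
    posix_style = path.replace("\\", "/")
    # Keep "/" and the drive-letter ":" as they are; percent-encode the rest.
    return "file:///" + quote(posix_style, safe="/:")


print(windows_path_to_file_uri(r"C:\Work\PSI\tandem\results.xml"))
print(windows_path_to_file_uri("C:/Work/PSI/Peak lists with ecoli/spectra 2a.mgf"))
```

This only handles absolute paths with a drive letter; UNC paths and relative paths would need separate treatment.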

Difference between two XLMOD files?

What is the distinction between:
XLMOD-1.0.0.csv
XLMOD.csv

It would be nice to have a proper OBO file for my validator, but I could code up handling for csv. What does the Java validator do?

XLMOD-1.0.0.csv shows this:
XL:00001,Xlink:BS3,2,138.06807961,"(K,S,T,Y,Protein N-term)&(K,S,T,Y,Protein N-term)"

while XLMOD.csv shows this:
XL:00001,BS3,bis(sulfosuccinimidyl)suberate,https://www.thermofisher.com/order/catalog/product/21580,Xlink:BS3,BS3,2,138.06807961,138.16,"C8 H10 O2",
"(K,S,T,Y,Protein N-term)&(K,S,T,Y,Protein N-term)"

The latter seems to be the more complete one?
I hope we're leaving the "Xlink:" prefix behind as not needed?
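In case it helps with coding up CSV handling, here is a sketch that reads the XLMOD-1.0.0.csv row shown above and splits the specificity string into one site list per linker end. The files shown here have no header, so the meaning of each column is my guess, and parse_specificity is a hypothetical helper.

```python
import csv
import io


def parse_specificity(specificity):
    """Split '(K,S)&(K,S)' into one list of sites per end of the cross-linker."""
    groups = []
    for group in specificity.split("&"):
        sites = group.strip().strip("()").split(",")
        groups.append([site.strip() for site in sites])
    return groups


line = ('XL:00001,Xlink:BS3,2,138.06807961,'
        '"(K,S,T,Y,Protein N-term)&(K,S,T,Y,Protein N-term)"')

# The csv module handles the quoted last field, which contains commas.
row = next(csv.reader(io.StringIO(line)))
accession = row[0]                      # assumed: CV accession
name = row[1]                           # assumed: term name
specificity = parse_specificity(row[-1])

print(accession, name)
print(specificity)
```

Reading the specificity from the last column works for both row layouts shown above, although the other column positions differ between the two files.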
