hupo-psi / mzidentml
Repository for mzIdentML and the corresponding examples
I have closed down the long, convoluted thread on various CV items: #2.
I have made a proposal for how I think we should represent protein interaction evidence in mzid - see attached pdf.
xl_protein_interaction.pdf
The following tasks are needed for the cross-linking CV:
The following terms ought to be anticipated when running the validator (and thus remove the warning messages):
Hi all,
Various items from the sessions this morning in Gent:
Please add to this list with things I missed,
best wishes
@andrewrobertjones
Why is the following not allowed?
"MS:1001969" name="phosphoRS score" value="1:1.0468246849444967E-8:4:false"
Is scientific notation not supported?
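To illustrate how an exponent can trip up a value pattern, here is a minimal sketch. Both patterns are hypothetical: the expression the validator actually applies to MS:1001969 is not shown in this thread. The only thing taken from the report is the value itself, which has four colon-separated fields.

```python
import re

# Hypothetical patterns, NOT the validator's real expression.
# First pattern: second field must be a plain decimal such as 0.95.
PLAIN_DECIMAL = re.compile(r"^\d+:\d+\.\d+:\d+:(true|false)$")
# Second pattern: second field may also carry an exponent such as E-8.
WITH_EXPONENT = re.compile(r"^\d+:\d+(\.\d+)?([eE][+-]?\d+)?:\d+:(true|false)$")

value = "1:1.0468246849444967E-8:4:false"  # value from the report above

print(PLAIN_DECIMAL.match(value) is not None)  # False
print(WITH_EXPONENT.match(value) is not None)  # True
```

If the real expression resembles the first pattern, allowing an optional exponent would be enough to accept values written in scientific notation.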
Full error:
"The regular expression in SpectrumIdentificationItem (id='SII_4070_1') element at /MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/cvParam ('phosphoRS score') is not valid."
My recollection is that a valid mzIdentML file SHOULD be accompanied by the spectra searched in an external file, but I don't actually see this line in the 1.1 or 1.2-draft specs.
Should we add this line?
If so,
I get a reproducible crash of the validator (1.4.18) when trying to validate the attached file: [pia_testMzIdentML.mzId.gz](https://github.com/HUPO-PSI/mzIdentML/files/315992/pia_testMzIdentML.mzId.gz)
The file is erroneous, but the validator should still not crash. After the last uncaught exception, the GUI stays where it is, pretending to do something.
This is the output on the console:
$ java -Xmx8G -jar mzIdentMLValidator-1.4.18-SNAPSHOT.jar
BrendaTissueOBO.obo
gene_ontology.obo
psi-ms.obo
PSI-MOD.obo
pato.obo
unimod.obo
unit.obo
There were errors validating against the XML schema:
- ValidatorMessage{message='Non-fatal XML Parsing error detected on line 20215
Error message: cvc-complex-type.2.4.b: The content of element 'AnalysisCollection' is not complete. One of '{"http://psidev.info/psi/pi/mzIdentML/1.1":SpectrumIdentification}' is expected.', level=ERROR, context=null, rule=null}
- ValidatorMessage{message='Non-fatal XML Parsing error detected on line 41189
Error message: Key 'FK_SoftwareContact' with value 'ORG_MSL' not found for identity constraint of element 'MzIdentML'.', level=ERROR, context=null, rule=null}
- ValidatorMessage{message='Non-fatal XML Parsing error detected on line 20216
Error message: cvc-complex-type.2.4.b: The content of element 'AnalysisProtocolCollection' is not complete. One of '{"http://psidev.info/psi/pi/mzIdentML/1.1":SpectrumIdentificationProtocol}' is expected.', level=ERROR, context=null, rule=null}
- ValidatorMessage{message='Non-fatal XML Parsing error detected on line 20261
Error message: cvc-complex-type.4: Attribute 'id' must appear on element 'SpectrumIdentificationList'.', level=ERROR, context=null, rule=null}
Number of rules to check: 35
Exception in thread "Thread-1" java.lang.IllegalStateException: Could not instantiate reference resolver: uk.ac.ebi.jmzidml.xml.jaxb.resolver.ContactRoleRefResolver
at uk.ac.ebi.jmzidml.xml.jaxb.unmarshaller.listeners.RawXMLListener.referenceResolving(RawXMLListener.java:192)
at uk.ac.ebi.jmzidml.xml.jaxb.unmarshaller.listeners.RawXMLListener.afterUnmarshal(RawXMLListener.java:55)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.Loader.fireAfterUnmarshal(Loader.java:221)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StructureLoader.leaveElement(StructureLoader.java:276)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.endElement(UnmarshallingContext.java:585)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector.endElement(SAXConnector.java:165)
at org.xml.sax.helpers.XMLFilterImpl.endElement(XMLFilterImpl.java:570)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1783)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2970)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:118)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:243)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:221)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:273)
at uk.ac.ebi.jmzidml.xml.io.MzIdentMLObjectIterator.next(MzIdentMLObjectIterator.java:88)
at uk.ac.ebi.jmzidml.xml.io.MzIdentMLObjectIterator.next(MzIdentMLObjectIterator.java:41)
at psidev.psi.pi.validator.MzIdentMLValidator.checkElementCvMapping(MzIdentMLValidator.java:1143)
at psidev.psi.pi.validator.MzIdentMLValidator.applyCVMappingRules(MzIdentMLValidator.java:779)
at psidev.psi.pi.validator.MzIdentMLValidator.doValidationWork(MzIdentMLValidator.java:511)
at psidev.psi.pi.validator.MzIdentMLValidator.startValidation(MzIdentMLValidator.java:429)
at psidev.psi.pi.validator.MzIdentMLValidatorGUI$4.construct(MzIdentMLValidatorGUI.java:681)
at psidev.psi.pi.validator.swingworker.SwingWorker.lambda$new$1(SwingWorker.java:138)
at java.lang.Thread.run(Thread.java:745)
Some of the cross-linkers have CV terms in Unimod, e.g. the mono-links for DSS (but not the cross-link).
If there are CV terms in Unimod available, should the Unimod terms be used, or should the XLMOD terms from the csv file have priority for cross-linking specific information?
The cross-link CV terms in Unimod are patchy, so if we prioritize Unimod we will have both Unimod and XLMOD terms in most files. Otherwise we could cover all cross-linking specific information with XLMOD terms and be more uniform.
MS:1002404 (count of identified proteins) is not a child of MS:1001184 (search statistics). Should it not be? Making it a child would remove warning messages when running the validator.
Should we define a generic neutral loss term as proposed by Steffen Neumann,
see https://sourceforge.net/p/psidev/mailman/psidev-ms-vocab/?viewmonth=201605&viewday=25
and should the terms
id: MS:1002455 ! H2O neutral loss,
id: MS:1002456 ! NH3 neutral loss and
id: MS:1002457 ! H3PO4 neutral loss
then be made obsolete?
SchemaLocation is http://www.psidev.info/files/mzIdentML1.1.0.xsd
ERROR: cvParam PSM-level q-value should have units, but it does not!
WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001189 should be 'modification specificity peptide N-term' instead of 'modification specificity N-term'
WARNING: MS:1002241 should be 'mzidLib:ProteoGrouper' instead of 'ProteoGrouper'
WARNING: MS:1002404 should be 'count of identified proteins' instead of 'count of identified protein'
I am adding the proteogenomics encoding to the spec doc. I am opening this issue to check that the validator @germa checks that this term is present in SIProtocol:
<cvParam cvRef="PSI-MS" accession="MS:1002635" name="proteogenomics search"></cvParam>
And then expects ALL the following elements to be present on every PeptideEvidence:
<cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="4"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002639" name="peptide start on chromosome" value="73417647"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002640" name="peptide end on chromosome" value="73418129"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002641" name="peptide exon count" value="2"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002642" name="peptide exon nucleotide sizes" value="24,42"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002643" name="peptide start positions on chromosome" value="73417647,73418087"></cvParam>
SearchDatabase MUST have the genome reference version:
<SearchDatabase numDatabaseSequences="299106" location="PXD000764_34939_combined_concatenated_target_decoy.fasta" id="SearchDB_1">
<FileFormat>
<cvParam cvRef="PSI-MS" accession="MS:1001348" name="FASTA format"></cvParam>
</FileFormat>
<DatabaseName>
<userParam name="PXD000764_34939_combined_concatenated_target_decoy.fasta"></userParam>
</DatabaseName>
<cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"/>
</SearchDatabase>
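The check described above could be sketched roughly as follows. This is only an illustration in Python, not the validator's code: the function name and the return shape are made up here, and the accessions are the ones listed in this issue.

```python
import xml.etree.ElementTree as ET

PROTEOGENOMICS_SEARCH = "MS:1002635"
# Accessions from this issue that every PeptideEvidence is expected to carry.
REQUIRED_ON_PEPTIDE_EVIDENCE = {
    "MS:1002637", "MS:1002638", "MS:1002639", "MS:1002640",
    "MS:1002641", "MS:1002642", "MS:1002643",
}


def local(tag):
    """Strip a '{namespace}' prefix so 1.1 and 1.2 files are handled alike."""
    return tag.rsplit("}", 1)[-1]


def missing_proteogenomics_terms(root):
    """Hypothetical helper: map PeptideEvidence id -> sorted missing accessions.

    Returns an empty dict when no SpectrumIdentificationProtocol declares
    a proteogenomics search.
    """
    protocol_accessions = {
        param.get("accession")
        for sip in root.iter() if local(sip.tag) == "SpectrumIdentificationProtocol"
        for param in sip.iter() if local(param.tag) == "cvParam"
    }
    if PROTEOGENOMICS_SEARCH not in protocol_accessions:
        return {}
    problems = {}
    for evidence in root.iter():
        if local(evidence.tag) != "PeptideEvidence":
            continue
        present = {c.get("accession") for c in evidence if local(c.tag) == "cvParam"}
        missing = REQUIRED_ON_PEPTIDE_EVIDENCE - present
        if missing:
            problems[evidence.get("id")] = sorted(missing)
    return problems
```

A similar loop over SearchDatabase elements could look for MS:1002644 (genome reference version).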
Current state:
As it is specified now, the two spectra are referenced as an unordered list of spectrum IDs in the SpectrumIdentificationResult.
For some use cases it could be useful to know which of these is the light and which is the heavy spectrum, since they are not necessarily treated equally.
Also the experimentalMassToCharge and chargeState from only one spectrum are written in SpectrumIdentificationItem and it is not specified from which, while it would actually be useful to have both.
For cleavable cross-linkers it was agreed to use different values here for the two SpectrumIdentificationItems corresponding to the MS3 spectra each containing one peptide.
Although the information is there, it is still unclear which spectrum ID in the list belongs to which SpectrumIdentificationItem, MassToCharge and chargeState, since one only has the IDs and the other only the other values.
But I think MassToCharge and chargeState values for each spectrum ID should be in the file and somehow linked to their corresponding IDs.
Proposals:
Could we add CV terms to make that clearer?
One easy thing to do would be to specify that, in the case of labelled cross-linkers, there is a fixed order for the spectrumIDs, e.g. the first ID must be the unlabelled or light spectrum, or more generally the IDs should be in ascending order of label weight or MassToCharge.
That would already add a crucial bit of information.
Or we could add lists of MassToCharge and chargeState values that must have the same order as the list of IDs.
I would like to avoid specific CV terms for light and heavy spectra and come up with something general enough to cover labelled and cleavable cross-linkers.
There seem to be two mapping files in the repo. I have been using:
33846 Mar 31 12:43 cv/mzIdentML-mapping_1.2.0.xml
But this one seems larger and is perhaps the right one?
41666 Jun 23 10:38 validator/trunk/src/main/resources/mzIdentML-mapping_1.2.0.xml
If the one in cv/ is obsolete, we should either make sure they are sync'ed, or only have one copy.
Or would you let me know if there is an important difference?
I think it would be very valuable to add some hierarchical structure to XLMOD.obo. Right now there are no parents. I propose that we have a top-level term, perhaps similar to the PSI-MS top term:
[Term]
id: XL:00000
name: Proteomics Standards Initiative cross-linking controlled vocabulary
def: "Proteomics Standards Initiative cross-linking controlled vocabulary." [PSI:XL]
and then a child something like:
[Term]
id: XL:00012
name: cross-linking entity
def: "Entity relevant to the domain of cross-linking in proteomics." [PSI:XL]
is_a: XL:00000 ! Proteomics Standards Initiative cross-linking controlled vocabulary
relationship: part_of XL:00000 ! Proteomics Standards Initiative cross-linking controlled vocabulary
[Term]
id: XL:00013
name: cross-linker
def: "Compound that can link one polymer chain to another." [PSI:XL]
is_a: XL:00012 ! cross-linking entity
This would allow us to grow the CV in a more tidy fashion, plus would allow the mapping file to stipulate that location X is the right place to put a child of XL:00013. I don't think this is possible in the current layout.
What do you think?
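To show what the hierarchy would buy us, here is a rough sketch of how a tool could ask for "any child of XL:00013". The tiny OBO reader is deliberately minimal and only understands `id`, `name` and `is_a` lines. Placing XL:00001 (Xlink:BS3) under XL:00013 is an assumption for the example, not something that exists in XLMOD.obo today.

```python
EXAMPLE_OBO = """
[Term]
id: XL:00000
name: Proteomics Standards Initiative cross-linking controlled vocabulary

[Term]
id: XL:00012
name: cross-linking entity
is_a: XL:00000 ! Proteomics Standards Initiative cross-linking controlled vocabulary

[Term]
id: XL:00013
name: cross-linker
is_a: XL:00012 ! cross-linking entity

[Term]
id: XL:00001
name: Xlink:BS3
is_a: XL:00013 ! cross-linker
"""


def parse_obo_terms(text):
    """Return {term id: {"name": str, "is_a": [parent ids]}} for [Term] stanzas."""
    terms, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Only [Term] stanzas are collected; other stanza types are skipped.
            current = {"name": "", "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop the trailing "! comment"
        if key == "id":
            terms[value] = current
        elif key == "name":
            current["name"] = value
        elif key == "is_a":
            current["is_a"].append(value)
    return terms


def descendants(terms, root_id):
    """Collect every term reachable from root_id by following is_a downwards."""
    children = {}
    for term_id, term in terms.items():
        for parent in term["is_a"]:
            children.setdefault(parent, []).append(term_id)
    found, stack = set(), [root_id]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


terms = parse_obo_terms(EXAMPLE_OBO)
print(sorted(descendants(terms, "XL:00013")))
```

With the current flat layout every term has an empty `is_a` list, so a query like this returns nothing, which is why a mapping rule cannot point at a parent.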
Currently SearchModifications cannot encode the separate sites of e.g. a cross-linker.
To circumvent this we could use the same method we use to encode the peptide modifications, i.e. encode the cross-linker as one modification that holds the mass and the specificities for one site, and have a second zero-mass modification that encodes the specificities of the second site of the cross-linker.
E.g. for BS3 alone:
<SearchModification fixedMod="false" massDelta="138.06808" residues="S T Y K">
<cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="138.06808" residues=".">
<SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
</SpecificityRules>
<cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="0.0" residues=".">
<SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
</SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
</SearchModification>
Both sides of the cross-linker are linked (the same as in peptide modifications) via CV terms:
<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="X">
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="X">
For a case with two cross-linkers (e.g. BS3-d0/BS3-d4) this would look like this:
<SearchModification fixedMod="false" massDelta="138.06808" residues="S T Y K">
<cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="138.06808" residues=".">
<SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
</SpecificityRules>
<cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="0.0" residues=".">
<SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
</SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="142.09317" residues="S T Y K">
<cvParam cvRef="XLMOD" accession="XL:00005" name="Xlink:BS3:d4"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="1"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="142.09317" residues=".">
<SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
</SpecificityRules>
<cvParam cvRef="XLMOD" accession="XL:00005" name="Xlink:BS3:d4"></cvParam>
<cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="1"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="1"></cvParam>
</SearchModification>
<SearchModification fixedMod="false" massDelta="0.0" residues=".">
<SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002057" name="modification specificity protein N-term"></cvParam>
</SpecificityRules>
<cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="1"></cvParam>
</SearchModification>
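A reader could reassemble the two sides of each cross-linker from this encoding roughly as sketched below. The function name and the output structure are made up for illustration. The XML is a shortened, namespace-free version of the example above, so a real file would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DONOR, ACCEPTOR = "MS:1002509", "MS:1002510"

EXAMPLE = """
<ModificationParams>
  <SearchModification fixedMod="false" massDelta="138.06808" residues="S T Y K">
    <cvParam cvRef="XLMOD" accession="XL:00001" name="Xlink:BS3"/>
    <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="0"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
    <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="0"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="142.09317" residues="S T Y K">
    <cvParam cvRef="XLMOD" accession="XL:00005" name="Xlink:BS3:d4"/>
    <cvParam cvRef="PSI-MS" accession="MS:1002509" name="cross-link donor" value="1"/>
  </SearchModification>
  <SearchModification fixedMod="false" massDelta="0.0" residues="S T Y K">
    <cvParam cvRef="PSI-MS" accession="MS:1002510" name="cross-link acceptor" value="1"/>
  </SearchModification>
</ModificationParams>
"""


def group_crosslinker_sites(modification_params):
    """Hypothetical helper: group SearchModifications by donor/acceptor value.

    Returns {value: {"donor": [(residues, massDelta)], "acceptor": [...]}}.
    """
    groups = defaultdict(lambda: {"donor": [], "acceptor": []})
    roles = {DONOR: "donor", ACCEPTOR: "acceptor"}
    for modification in modification_params.iter("SearchModification"):
        # Only direct cvParam children carry the donor/acceptor link;
        # cvParams inside SpecificityRules are not inspected here.
        for param in modification.findall("cvParam"):
            role = roles.get(param.get("accession"))
            if role:
                site = (modification.get("residues"), float(modification.get("massDelta")))
                groups[param.get("value")][role].append(site)
    return dict(groups)


groups = group_crosslinker_sites(ET.fromstring(EXAMPLE))
for value, sides in sorted(groups.items()):
    print(value, sides)
```

The shared value (0 for BS3, 1 for BS3-d4) is what ties the mass-carrying donor entries to the zero-mass acceptor entries.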
ERROR: cvParam product ion intensity should have units, but it does not!
ERROR: cvParam product ion m/z should have units, but it does not!
WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001149 should be 'param: b ion-NH3 DEPRECATED' instead of 'param: b ion-NH3'
WARNING: MS:1001150 should be 'param: b ion-H2O DEPRECATED' instead of 'param: b ion-H2O'
WARNING: MS:1001151 should be 'param: y ion-NH3 DEPRECATED' instead of 'param: y ion-NH3'
WARNING: MS:1001152 should be 'param: y ion-H2O DEPRECATED' instead of 'param: y ion-H2O'
WARNING: MS:1001171 should be 'Mascot:score' instead of 'mascot:score'
WARNING: MS:1001172 should be 'Mascot:expectation value' instead of 'mascot:expectation value'
WARNING: MS:1001189 should be 'modification specificity peptide N-term' instead of 'modification specificity N-term'
WARNING: MS:1001199 should be 'Mascot DAT format' instead of 'Mascot DAT file'
WARNING: MS:1001316 should be 'Mascot:SigThreshold' instead of 'mascot:SigThreshold'
WARNING: MS:1001370 should be 'Mascot:homology threshold' instead of 'mascot:homology threshold'
WARNING: MS:1001371 should be 'Mascot:identity threshold' instead of 'mascot:identity threshold'
WARNING: MS:1002241 should be 'mzidLib:ProteoGrouper' instead of 'ProteoGrouper'
WARNING: MS:1002404 should be 'count of identified proteins' instead of 'count of identified protein'
When running the current validator on the PeptideShaker mzid 1.2 example file I now get the following error:
Rule ID: PeptideLevelStatsObjectRule
Level: ERROR
Context(/MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem )
--> The SpectrumIdentificationItem (id='SII_2388_1') element at /MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem doesn't contain the triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) required in case of peptide-level scoring
Tip: Add the triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) to each SpectrumIdentificationItem
However, the following CV terms are all there:
<cvParam cvRef="PSI-MS" accession="MS:1002468" name="PeptideShaker peptide score" value="17.781512503836435"/>
<cvParam cvRef="PSI-MS" accession="MS:1002500" name="peptide passes threshold" value="true"/>
<cvParam cvRef="PSI-MS" accession="MS:1002520" name="peptide group ID" value="AVVTVPAYFNDAQR"/>
And MS:1002468 (PeptideShaker peptide score) is a child of MS:1002358 (search engine specific score for distinct peptides), so all should be ok? Note that this error did not show up in the previous version of the validator.
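For reference, this is roughly the logic I would expect the rule to apply. It is a sketch under assumptions: the validator's real implementation is not shown here, and the `parents` map is a hand-written stand-in for what would normally be read from psi-ms.obo.

```python
PEPTIDE_GROUP_ID = "MS:1002520"
PEPTIDE_PASSES_THRESHOLD = "MS:1002500"
PEPTIDE_SCORE_PARENT = "MS:1002358"


def is_descendant(accession, ancestor, parents):
    """True if `ancestor` is reachable from `accession` via the parents map."""
    seen, stack = set(), list(parents.get(accession, []))
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False


def has_peptide_level_triplet(accessions, parents):
    """Check the triplet named in the error message for one SII."""
    has_score = any(is_descendant(a, PEPTIDE_SCORE_PARENT, parents) for a in accessions)
    return (
        PEPTIDE_GROUP_ID in accessions
        and PEPTIDE_PASSES_THRESHOLD in accessions
        and has_score
    )


# Stand-in hierarchy: only the relationship stated in this issue.
parents = {"MS:1002468": ["MS:1002358"]}
sii_terms = ["MS:1002468", "MS:1002500", "MS:1002520"]
print(has_peptide_level_triplet(sii_terms, parents))  # True
```

If the validator's copy of the CV does not list MS:1002468 under MS:1002358, a check like this would fail in the way reported.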
It seems to be mandatory to add MS:1002507 (modification rescoring:false localization rate) (or any of its child terms, of which there are none btw) when including MS:1002491 (modification localization scoring) in the AdditionalSearchParams list.
This means that terms such as MS:1001969 (phosphoRS score), MS:1002536 (D-Score), MS:1002550 (peptide:phosphoRS score) and MS:1002553 (peptide:D-Score) (and others) cannot be used for the PTM rescoring.
I think that either these terms have to be children of MS:1002507 (modification rescoring:false localization rate) or the mandatory term when using modification rescoring ought to be MS:1002505 (regular expression for modification localization scoring) or one of its children?
Many/most of the mzIdentML files do not reference the CVs properly. All files should be checked and corrected.
I believe that the correct CV locations should be:
https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo
https://raw.githubusercontent.com/bio-ontology-research-group/unit-ontology/master/unit.obo
Is that right? Corrections welcome.
We briefly discussed at the PSI meeting that manual annotations would be possible by manually setting the passThreshold value in SpectrumIdentificationItem.
Since manual examination of spectra is still very important in cross-linking pipelines, could we add more flexible annotations, maybe a CV term for free text comments?
Manually changing the passThreshold flag would not be transparent and reproducible.
Or is there a good reason to generally avoid such free text comments?
Latest attempt to validate yields these problems:
WARNING: CV term MS:1002694 ('frag: precursor ion - CH4OS') is not in the cv
WARNING: MS:1002674 should be 'X500R QTOF' instead of 'frag: b ion - CH4OS'
WARNING: MS:1002686 should be 'OpenXQuest:wTIC' instead of 'frag: y ion - CH4OS'
Looks like attempts to use terms that were proposed but rejected.
@germa Can you add support for .mzid.gz directly in the validator? As gzip is the recommended compression for mzid, and we have compressed files in GitHub, it would be nice if we could validate without manually decompressing.
thanks
Andy
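As an illustration of what transparent handling could look like, here is a small Python sketch (the validator itself is Java, so this only shows the idea). The helper name `open_mzid` is made up. It sniffs the two-byte gzip magic number instead of trusting the file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_mzid(path):
    """Hypothetical helper: open .mzid or .mzid.gz as a binary stream."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")
```

Either way the caller gets a readable byte stream that can be passed straight to an XML parser.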
@germa Apologies if I missed it somewhere - it's not obvious if there is a release of the validator available for developers. It would be good to post this on the front page of the mzIdentML GitHub, or at least have it available inside the Git repository. I'm struggling to locate it?
thanks
Andy
We are trying to implement fragment annotations for cross-linked peptides in OpenMS and uploaded an example file here.
In the case of cross-linking we have fragments resulting from cleavages of both peptides in one spectrum. Therefore distributing the annotations between several SIIs would make further usage e.g. visualization of these annotations more complicated and the separation would not be possible in more complex fragmentation patterns, e.g. cross-linked fragments where both peptides were cleaved. So we would prefer to store the annotations for one PSM under one SII, e.g. the light alpha case.
The example suggests adapting FragmentArray to also carry strings (words) or have an additional array carrying these (not in the example).
The specific CVs we would need are:
One Array for "cross-link chain" containing CV strings for "donor" or "acceptor" for each fragment.
One Array for "cross-link ion category" containing CV strings for "common ion" or "cross-linked ion" for each fragment.
The example file shows roughly what that could look like.
Another way to do it would be to introduce additional IonTypes and therefore separate these types into different arrays. In that case we would need all four combinations of donor/acceptor with common/cross-linked.
Can we make the following term obsolete?
[Term]
id: MS:1001355
name: peptide descriptions
def: "Descriptions of peptides." [PSI:PI]
is_a: MS:1001105 ! peptide result details
It has no children and no associated value, so it does not make much sense.
I get the following null pointer when trying to run the validator on the PeptideShaker 1.2 example file:
Exception in thread "Thread-4" java.lang.NullPointerException
at psidev.psi.pi.validator.objectrules.CvListObjectRule.check(CvListObjectRule.java:77)
at psidev.psi.pi.validator.objectrules.CvListObjectRule.check(CvListObjectRule.java:21)
at psidev.psi.pi.validator.MzIdentMLValidator.validate(MzIdentMLValidator.java:864)
at psidev.psi.pi.validator.MzIdentMLValidator.checkElementObjectRule(MzIdentMLValidator.java:823)
at psidev.psi.pi.validator.MzIdentMLValidator.applyObjectRules(MzIdentMLValidator.java:676)
at psidev.psi.pi.validator.MzIdentMLValidator.doValidationWork(MzIdentMLValidator.java:447)
at psidev.psi.pi.validator.MzIdentMLValidator.startValidation(MzIdentMLValidator.java:417)
at psidev.psi.pi.validator.MzIdentMLValidatorGUI$4.construct(MzIdentMLValidatorGUI.java:675)
at psidev.psi.pi.validator.swingworker.SwingWorker$2.run(SwingWorker.java:147)
at java.lang.Thread.run(Unknown Source)
The problem seems to be that the validator requires a version number for the elements in the cvList, but as far as I can tell the version number is not mandatory here?
I think it can all be easily fixed by checking if the version number is not null before using the version object?
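The guard I have in mind looks roughly like this, written as a Python sketch rather than as a patch to CvListObjectRule (whose source is not shown here). It assumes, as suggested above, that `version` may be absent on a `cv` element. The function name and the `expected_versions` map are made up.

```python
import xml.etree.ElementTree as ET


def cv_version_mismatches(cv_list, expected_versions):
    """Hypothetical check: compare declared cv versions with expected ones.

    Entries without a version attribute are skipped instead of dereferenced.
    """
    mismatches = []
    for cv in cv_list.findall("cv"):
        version = cv.get("version")  # None when the attribute is absent
        expected = expected_versions.get(cv.get("id"))
        if version is None or expected is None:
            continue  # nothing to compare, so no error and no crash
        if version != expected:
            mismatches.append((cv.get("id"), version, expected))
    return mismatches
```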
@germa Can you take a look at this one?
Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_rosetta_2a_uniprot_proteogrouped.mzid, line 3474, char 72
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'file:///Rosetta peak list 2a.mgf' is NOT a valid URI .
URIs may not have spaces in them according to my Xerces validator
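A writer could avoid this by percent-encoding the path before putting it in the location attribute. A minimal Python sketch (the helper name is made up, and it assumes the `file:///` form used in the file above):

```python
from urllib.parse import quote


def file_uri(path):
    """Hypothetical helper: build a file URI with spaces percent-encoded."""
    return "file:///" + quote(path.lstrip("/"))


print(file_uri("Rosetta peak list 2a.mgf"))
# file:///Rosetta%20peak%20list%202a.mgf
```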
These files do not properly reference that they are mzIdentML 1.2. Mixed references to 1.1 and 1.2
cannot validate
ERROR: cvParam retention time should have units, but it does not!
ERROR: cvParam search engine specific score for PSMs has a value, but it should not!
WARNING: cvParam product ion intensity has a legal unit MS:1000131 but its name 'MS:1000131' should be 'number of detector counts'!
WARNING: MS:1002510 should be 'cross-link acceptor' instead of 'cross-link receiver'
for better referencing. Also see #2
WARNING: XL:00002 should be 'DSS' instead of 'Xlink:DSS'
ERROR: cvParam retention time has an illegal unit accession number "second"!
The latter is complaining about this:
<cvParam accession="MS:1000894" cvRef="PSI-MS" name="retention time" value="5458.13539999998" unitAccession="second" unitName="" unitCvRef="se"/>
The name is written in the accession. There is no name. The unitCvRef is also wrong: it is "se", but in the cvList the ID is "UO":
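The three problems can be caught with a very simple check on the unit attributes, sketched below. The function and the accession pattern are assumptions for illustration. The repaired values use UO:0000010, which is the Unit Ontology accession for "second".

```python
import re

# Assumed shape of a CV accession: prefix, colon, digits (e.g. UO:0000010).
ACCESSION_SHAPE = re.compile(r"^[A-Za-z]+:\d+$")


def unit_problems(attrs, declared_cv_ids):
    """Hypothetical check of the unit attributes on one cvParam."""
    problems = []
    if not ACCESSION_SHAPE.match(attrs.get("unitAccession", "")):
        problems.append("unitAccession is not of the form PREFIX:digits")
    if not attrs.get("unitName"):
        problems.append("unitName is empty")
    if attrs.get("unitCvRef") not in declared_cv_ids:
        problems.append("unitCvRef does not match a cv id in the cvList")
    return problems


reported = {"unitAccession": "second", "unitName": "", "unitCvRef": "se"}
repaired = {"unitAccession": "UO:0000010", "unitName": "second", "unitCvRef": "UO"}

print(unit_problems(reported, {"PSI-MS", "UO"}))
print(unit_problems(repaired, {"PSI-MS", "UO"}))  # []
```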
Related to #9, are there any examples of how reporter ions from, for example, iTRAQ and TMT are to be annotated as part of the Fragmentation section? I could not find the required CV terms.
If I recreate the PeptideShaker mzid 1.2 example file but remove all the new 1.2 additions, i.e., creating an mzid 1.1 compatible file, the validator is not happy and gives the following errors:
Message 1:
Rule ID: ProteinDetectionProtocolThreshold_rule
Level: ERROR
Context(/threshold/cvParam/@accession ) in 2 locations
--> The result found at: /threshold/cvParam/@accession for which the values is ''MS:1002369'' didn't match any of the 4 specified CV terms:
Message 2:
Rule ID: SpectrumIdentificationProtocolThreshold_must_rule
Level: ERROR
Context(/threshold/cvParam/@accession ) in 2 locations
--> The result found at: /threshold/cvParam/@accession for which the values are ''MS:1001364', 'MS:1002350', 'MS:1002567', 'MS:1002557'' didn't match any of the 4 specified CV terms:
Both seem to be due to the validator not considering new currently supported threshold terms?
In the first message it says that "MS:1001447 (prot:FDR threshold) or any of its children" ought to be used. I think the correct term now is "MS:1002572 (protein detection statistical threshold) or any of its children"? As this covers the new "protein group-level statistical threshold" terms?
Similar for the second message: "MS:1001448 (pep:FDR threshold) or any of its children" ought to be changed into "MS:1002484 (peptide-level statistical threshold) or any of its children"?
We decided in Ghent to get rid of the concept of "final PSM list" and "intermediate list" - only final results are allowed to make reading easier. @germa Please remove this from the mapping file and validator once agreed. Log any complaints here!
I have added a schema-level check that spectrumID + spectra_Data_ref are unique - to lock down this potential for error in results files.
Please check that this is okay on all your example files.
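If you want to pre-check files without running schema validation, something like the following sketch finds the offending pairs. It is a Python stand-in for the schema constraint, not the constraint itself, and the function name is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def duplicate_spectrum_keys(root):
    """Return (spectraData_ref, spectrumID) pairs that occur more than once."""
    counts = Counter(
        (result.get("spectraData_ref"), result.get("spectrumID"))
        for result in root.iter()
        if local(result.tag) == "SpectrumIdentificationResult"
    )
    return sorted(key for key, count in counts.items() if count > 1)
```

An empty list means every SpectrumIdentificationResult has a unique combination.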
Just opening a comment here to remind us we need to do this. In Ghent, we discussed making different CV terms for all of PSM, peptide, protein, protein group and mod localisation so that reading software could easily tell, say under SpectrumIdentificationItem, whether the score is for a PSM, peptide or mod localisation.
Mod localisation will have to have specific terms anyway, since the values have a special structure. The most important work is therefore separating PSM and peptide-level scores, so a reader can figure this out.
I have assigned to @edeutsch since Eric offered to take a look in Ghent, but other helpers would be very welcome!
WARNING: MS:1002235 should be 'ProteoGrouper:PDH score' instead of 'mzidLib:ProteoGrouper:PDH score'
WARNING: MS:1002236 should be 'ProteoGrouper:PAG score' instead of 'mzidLib:ProteoGrouper:PAG score'
WARNING: MS:1002241 should be 'mzidLib:ProteoGrouper' instead of 'ProteoGrouper'
What is the intended use of MS:1001239 (frag: immonium ion) and how can this be used to annotate immonium ions on specific amino acids?
Here's what we currently do in the PeptideShaker mzIdentML export using the outdated PRIDE CV terms:
<IonType charge="1" index="4 5">
<FragmentArray measure_ref="Measure_MZ" values="120.0808411"/>
<FragmentArray measure_ref="Measure_Int" values="14483.125"/>
<FragmentArray measure_ref="Measure_Error" values="6.53397579810644E-5"/>
<cvParam cvRef="PRIDE" accession="PRIDE:0000244" name="immonium F"/>
</IonType>
How would the same be done with MS:1001239 (frag: immonium ion)?
At the meeting we agreed to remove the value requirement from the following MS:1001221 (fragmentation information) child terms:
The reason being that these terms are used to describe the data type in mzid files as part of the FragmentationTable element and can thus have no value in this setup.
ERROR: cvParam retention time should have units, but it does not!
ERROR: cvParam search engine specific score for PSMs has a value, but it should not!
WARNING: cvParam product ion intensity has a legal unit MS:1000131 but its name 'MS:1000131' should be 'number of detector counts'!
WARNING: CV term MS:100xxxx ('crosslink ion category') is not in the cv
WARNING: MS:1002510 should be 'cross-link acceptor' instead of 'cross-link receiver'
What is the purpose of the DenovoSearchType_may_rule? I get the following for the PeptideShaker mzid 1.2 example file:
Rule ID: DenovoSearchType_may_rule
Level: INFO
Context(/additionalSearchParams/cvParam/@accession ) in 2 locations
--> Not all of the 6 values ParamList's CV terms ['MS:1001211', 'MS:1001256', 'MS:1002492', 'MS:1002490', 'MS:1002497', 'MS:1002491'] found using the Xpath '/additionalSearchParams/cvParam/@accession' matched any of the 1 CvTerm(s):
But this is not a de novo search. Shouldn't this rule only be triggered for mzid files that appear to be de novo searches?
The CV accession="MS:1002511" name="cross-link spectrum identification item" must have the same value for SIIs that represent a pair of cross-linked peptides.
Currently the validator only allows exactly two equal values, but we have 4 SIIs for the same cross-link when using isotope labelled linkers.
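A relaxed version of the check could group the SIIs by the MS:1002511 value and accept more than one group size. The sketch below is an illustration only: the function name is made up, and the allowed sizes of 2 and 4 are an assumption based on the labelled-linker case described here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

CROSSLINK_SII = "MS:1002511"


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def crosslink_group_problems(root, allowed_sizes=(2, 4)):
    """Return {MS:1002511 value: [SII ids]} for groups of an unexpected size."""
    groups = defaultdict(list)
    for item in root.iter():
        if local(item.tag) != "SpectrumIdentificationItem":
            continue
        for param in item:
            if local(param.tag) == "cvParam" and param.get("accession") == CROSSLINK_SII:
                groups[param.get("value")].append(item.get("id"))
    return {value: ids for value, ids in groups.items() if len(ids) not in allowed_sizes}
```

Calling it with `allowed_sizes=(2,)` reproduces the current behaviour, where a group of four is reported as a problem.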
ERROR: cvParam local FDR should have units, but it does not!
ERROR: cvParam q-value for peptides should have units, but it does not!
(the previous two may be errors in the CV itself?)
WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001189 should be 'modification specificity peptide N-term' instead of 'modification specificity N-term'
WARNING: MS:1001330 should be 'X\!Tandem:expect' instead of 'X!Tandem:expect'
WARNING: MS:1001331 should be 'X\!Tandem:hyperscore' instead of 'X!Tandem:hyperscore'
WARNING: MS:1001401 should be 'X\!Tandem xml format' instead of 'X!Tandem xml file'
WARNING: MS:1001476 should be 'X\!Tandem' instead of 'X!Tandem'
WARNING: MS:1001868 should be 'distinct peptide-level q-value' instead of 'q-value for peptides'
WARNING: MS:1002244 should be 'mzidLib:FalseDiscoveryRate' instead of 'mzidLib:FalseDiscoveryRat'
WARNING: MS:1002404 should be 'count of identified proteins' instead of 'count of identified protein'
I'm not quite sure what to make of the X\!Tandem business. That is the way it is written in the OBO file, but I assume the backslash is an escape character that is not to be included in the XML? Should it even be in the OBO file? Is it a limitation of the OBO format? I assume so, because the OBO format uses ! as a separator character in some places. It should not be repeated in the XML?
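If the backslash is indeed an OBO escape (in OBO flat files an unescaped `!` starts a trailing comment), a term-name check could unescape the OBO name before comparing it with the name in the mzid file. The following is a minimal sketch under that assumption; `unescape_obo` and `names_match` are hypothetical helper names, and only the escapes `\n`, `\t` and `\W` are treated specially.

```python
import re

# Escapes that do not simply stand for the character after the backslash.
_OBO_SPECIAL = {"n": "\n", "t": "\t", "W": " "}

def unescape_obo(text):
    """Resolve backslash escapes in an OBO tag value (e.g. '\\!' -> '!')."""
    def replace(match):
        char = match.group(1)
        return _OBO_SPECIAL.get(char, char)
    return re.sub(r"\\(.)", replace, text)

def names_match(obo_name, xml_name):
    """Compare a raw OBO term name with the name used in a cvParam."""
    return unescape_obo(obo_name) == xml_name

print(names_match(r"X\!Tandem:expect", "X!Tandem:expect"))
```

With this approach the escape stays in the OBO file and is never written to the XML.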
Just asking, because they were left out before I uploaded the new versions:
@edeutsch Are the PIA examples validating for you now? These are MSGFplus_tandem.pia.1.2.mzid.gz and mouse_dataset_-combination.pia.1.2.mzid.gz in multi_search and the PIA... in protein_inference.
Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/OpenxQuest_example_added_annotations.mzid, line 7, char 37
Message: Schema in mzIdentML1.2.0.xsd has a different target namespace from the one specified in the instance document http://psidev.info/psi/pi/mzIdentML/1.2.0.
Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/OpenxQuest_example_added_annotations.mzid, line 204, char 81
Message: Attribute 'values' is not declared for element 'userParam'
Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/OpenxQuest_example_added_annotations.mzid, line 446, char 13
Message: The key for identity constraint of element 'MzIdentML' is not found.
and many further instances of these same errors.
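The first error is a namespace mismatch between the instance document and the schema's target namespace. A quick way to see which namespace a file actually declares is to read it off the root element. This is a small sketch using only the Python standard library; `root_namespace` and `namespace_matches` are hypothetical helpers, and the expected namespace has to be taken from the `targetNamespace` of the XSD in use (the URI in the example below is a placeholder, not the real one).

```python
import io
import xml.etree.ElementTree as ET

def root_namespace(source):
    """Return the namespace URI of the root element ('' if it has none).

    source: a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("start",)):
        tag = elem.tag  # e.g. '{namespace-uri}MzIdentML'
        return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
    return ""

def namespace_matches(source, expected_ns):
    """True if the document's root namespace equals the schema's targetNamespace."""
    return root_namespace(source) == expected_ns

# Placeholder document and namespace, for illustration only.
doc = b'<MzIdentML xmlns="http://example.org/ns/1.2.0"><cv/></MzIdentML>'
print(root_namespace(io.BytesIO(doc)))
print(namespace_matches(io.BytesIO(doc), "http://example.org/ns/1.2"))
```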
My CV term validator finds these issues with this file:
ERROR: cvParam distinct peptide-level q-value should have units, but it does not!
WARNING: MS:1001062 should be 'Mascot MGF format' instead of 'Mascot MGF file'
WARNING: MS:1001400 should be 'OMSSA xml format' instead of 'OMSSA xml file'
WARNING: MS:1002439 should be 'final PSM list' instead of 'final PSM list UNDER DISCUSSION'
The first error may be an error in the CV. I don't think we want units for q-value in the term. Should we remove units from all q-value terms? This issue affects several of them.
[Term]
id: MS:1001868
name: distinct peptide-level q-value
def: "Estimation of the q-value for distinct peptides once redundant identifications of the same peptide have been removed (id est multiple PSMs, possibly with different mass modifications, mapping to the same sequence have been collapsed to one entry)." [PSI:PI]
xref: value-type:xsd:double "The allowed value-type for this CV term."
is_a: MS:1002484 ! peptide-level statistical threshold
relationship: has_units UO:0000166 ! parts per notation unit
relationship: has_units UO:0000187 ! percent
relationship: has_domain MS:1002305 ! value between 0 and 1 inclusive
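To see why a validator would demand units here, one can list the `has_units` relationships of the term. The sketch below parses a single `[Term]` stanza with plain string handling; it is a simplification (it strips anything after " ! " as a comment and does not handle escapes), and `parse_term` and `unit_accessions` are hypothetical names. Treating any `has_units` relationship as "units required" is an assumption about the validator's logic.

```python
def parse_term(stanza):
    """Parse one OBO [Term] stanza into a dict: tag -> list of values."""
    term = {}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("[") or ":" not in line:
            continue
        tag, value = line.split(":", 1)
        # Drop a trailing ' ! comment' (simplification, see lead-in).
        value = value.split(" ! ", 1)[0].strip()
        term.setdefault(tag, []).append(value)
    return term

def unit_accessions(term):
    """Return the unit accessions named in has_units relationships."""
    return [
        value.split()[1]
        for value in term.get("relationship", [])
        if value.startswith("has_units ")
    ]

stanza = """[Term]
id: MS:1001868
name: distinct peptide-level q-value
relationship: has_units UO:0000166 ! parts per notation unit
relationship: has_units UO:0000187 ! percent
relationship: has_domain MS:1002305 ! value between 0 and 1 inclusive
"""
term = parse_term(stanza)
print(term["id"][0], unit_accessions(term))
```

Removing the two `has_units` lines from the term would make the list empty, so a check based on it would no longer ask for units.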
Hi,
I wanted to point out that the current format, e.g. "(K,L,n-term)", might be insufficient to represent both site-specific and site-unspecific terminal cross-linkers.
For example, Unimod distinguishes between a terminal modification ("N-term") and a terminal modification at a specific site ("N-term K").
Maybe we could adopt this notation as well.
I have opened an issue here to make sure that:
Please add any other issues related to mod rescoring. We will close it once all is complete. I've assigned it to Harald - hope that is okay!
And this file doesn't pass my XML validator with these errors:
Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid, line 352, char 146
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'C:\Work\PSI\mzIdentML\ProteinInference\Rosetta2\tandem\peaklist2a_plus_ecoli_versus_unimod_full.xml' is NOT a valid URI .
Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid, line 357, char 197
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'C:/Work/PSI/mzIdentML/ProteinInference/Rosetta2/FASTAs, neat/Rosetta_uniprot_20130402_mouse_FULL_UNIPROT_can+iso.fasta' is NOT a valid URI .
Error at file /net/db/projects/PSI/mzIdentML/1.2/genDoc/mzidLib_peaklist2a_plus_ecoli_versus_unimod_full_xtandem_fdr_threshold_groups.mzid, line 365, char 151
Message: Datatype error: Type:InvalidDatatypeValueException, Message:Value 'C:/Work/PSI/mzIdentML/ProteinInference/Rosetta2/Peak lists with ecoli/Rosetta peak list 2a + Ecoli spectra.mgf' is NOT a valid URI .
URIs need to start with file:/// for local files, I think.
And they may not contain spaces.
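An exporter could convert local paths along these lines before writing them to the `location` attribute. This is a sketch only: `to_file_uri` is a hypothetical helper, it assumes absolute paths, and it percent-encodes every character outside the unreserved set (so a space becomes `%20`). The path in the example is made up.

```python
from urllib.parse import quote

def to_file_uri(path):
    """Turn an absolute local path (Windows or POSIX style) into a file URI."""
    normalized = path.replace("\\", "/")
    if not normalized.startswith("/"):
        # Windows drive paths such as 'C:/...' need a leading slash.
        normalized = "/" + normalized
    # Keep '/' and the drive colon; encode spaces and other characters.
    return "file://" + quote(normalized, safe="/:")

print(to_file_uri(r"C:\data\Peak lists\spectra.mgf"))
```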
What is the distinction between:
XLMOD-1.0.0.csv
XLMOD.csv
It would be nice to have a proper OBO file for my validator, but I could code up handling for csv. What does the Java validator do?
XLMOD-1.0.0.csv shows this:
XL:00001,Xlink:BS3,2,138.06807961,"(K,S,T,Y,Protein N-term)&(K,S,T,Y,Protein N-term)"
while XLMOD.csv shows this:
XL:00001,BS3,bis(sulfosuccinimidyl)suberate,https://www.thermofisher.com/order/catalog/product/21580,Xlink:BS3,BS3,2,138.06807961,138.16,"C8 H10 O2","(K,S,T,Y,Protein N-term)&(K,S,T,Y,Protein N-term)"
The latter seems to be the more complete one?
I hope we're leaving the "Xlink:" prefix behind as not needed?
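Until there is a proper OBO file, handling the csv does not need much code. The sketch below reads rows with Python's `csv` module and splits the specificity string into one list of sites per reactive end. It assumes that the first column is the accession and the last column is the specificity string, which holds for both rows quoted above but is not confirmed for the rest of either file; the function names are hypothetical.

```python
import csv
import io

def parse_specificities(spec):
    """'(K,S,T,Y,Protein N-term)&(K,...)' -> one list of sites per reactive end."""
    ends = []
    for part in spec.split("&"):
        part = part.strip()
        if part.startswith("(") and part.endswith(")"):
            part = part[1:-1]
        ends.append([site.strip() for site in part.split(",")])
    return ends

def read_xlmod_rows(text):
    """Read XLMOD csv text; assumes id in column 1, specificities in the last column."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if row:
            rows.append({"id": row[0], "specificities": parse_specificities(row[-1])})
    return rows

line = 'XL:00001,Xlink:BS3,2,138.06807961,"(K,S,T,Y,Protein N-term)&(K,S,T,Y,Protein N-term)"'
for entry in read_xlmod_rows(line):
    print(entry["id"], entry["specificities"])
```

A site-specific terminal entry such as "N-term K" would pass through this parser unchanged as a single site string.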