hupo-psi / mztab

mzTab Reporting MS-based Proteomics and Metabolomics Results

Home Page: https://hupo-psi.github.io/mzTab

Batchfile 0.11% Java 99.67% Shell 0.22%

mztab's People

timosachsenberg javizca nilshoffmann kayrein sneumann oliveralka ypriverol andrewrobertjones aspirincode mozartich svaksha ohwey bernt-matthias savita-nferx oscar-gr cmungall

mztab's Issues

In Identification Complete mzTab files, the number of protein columns grows very fast because of the mandatory fields referencing ms_run.

In Identification Complete mzTab files, the number of protein columns grows very quickly because of the mandatory fields that reference ms_run.

This situation is not very common, but it can occur when converting an mzIdentML file generated with a tool like pep2pro to mzTab.

As a temporary workaround, the file can be generated as an Identification Summary file, because in that case these fields are not mandatory.

Example:
20 ms_run
2 protein_search_engine_score

- mandatory columns in an Identification Complete mzTab unrelated to ms_run/protein_search_engine_score = 10 columns
- best_search_engine_score[1-n] = num protein_search_engine_score = 2 columns
- search_engine_score[1-n]_ms_run[1-n] = num protein_search_engine_score x num ms_run = 40 columns
- num_psms_ms_run[1-n] = num ms_run = 20 columns
- num_peptides_distinct_ms_run[1-n] = num ms_run = 20 columns
- num_peptides_unique_ms_run[1-n] = num ms_run = 20 columns

Protein section total columns = 112 columns
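
The arithmetic in this example can be written down as a small Java sketch. The class and method names are made up for illustration, and the constant of 10 run-independent columns is taken from the example in this issue rather than from the specification text.

```java
/** Sketch: counts protein-section columns of an Identification Complete file. */
final class ProteinColumnCount {
    // Assumption from the example: 10 mandatory columns that depend on neither
    // ms_run nor protein_search_engine_score.
    static final int FIXED_COLUMNS = 10;

    static int totalColumns(int msRuns, int searchEngineScores) {
        int best = searchEngineScores;                   // best_search_engine_score[1-n]
        int perRunScores = searchEngineScores * msRuns;  // search_engine_score[1-n]_ms_run[1-n]
        int perRunCounts = 3 * msRuns;                   // num_psms, num_peptides_distinct, num_peptides_unique
        return FIXED_COLUMNS + best + perRunScores + perRunCounts;
    }

    public static void main(String[] args) {
        // 20 ms_run entries and 2 protein_search_engine_score entries, as in the example
        System.out.println(totalColumns(20, 2));
    }
}
```

With 20 runs and 2 scores this gives 10 + 2 + 40 + 60 = 112, and the count grows linearly in the number of runs for each additional score.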

Original issue reported on code.google.com by noedelta on 9 Oct 2014 at 3:13

Define terms for derivatization agents

For GC and HPLC, derivatization is often applied to specifically target compounds that are otherwise hard to measure at all, being non-volatile or otherwise chemically / physically poorly suited to the separation method, and to increase ionization efficiency and selectivity for subsequent MS analysis [1,2].

For GC, the primary derivatization methods are

acylation
alkylation and esterification
silylation

Example:

MSTFA (N-Methyl-N-(trimethylsilyl)trifluoroacetamide) is a derivatization agent for silylation in GC applications (http://www.sigmaaldrich.com/catalog/product/sial/69479)
-- molecular weight=199.25 g/mol
-- PubChem Substance ID 329761689
-- CAS Number 24589-78-4
-- "modification mass" (the mass actually added to derivatized molecules, in this case k trimethylsilyl groups): k * 73.19. This could be optional, but it is important for distinguishing analytes (including groups added by derivatization) from the actual metabolites (the unmodified molecules).

For HPLC, some examples can be found for the following methods:

chiral derivatization

References:

[1] Qi et al.; Derivatization for liquid chromatography-mass spectrometry, TrAC Trends in Analytical Chemistry, Volume 59, https://doi.org/10.1016/j.trac.2014.03.013
[2] Halket et al.; Chemical derivatization and mass spectral libraries in metabolic profiling by GC/MS and LC/MS/MS, Journal of Experimental Botany, Volume 56, Issue 410, 1 January 2005, Pages 219–243, https://doi.org/10.1093/jxb/eri069

Add linkage to ISA-TAB protocols in MetaData section

It would be good to have examples showing how the MetaData section of mzTab could link to an ISA-TAB document for a richer description of the experimental design. @rsalek

Simplifying columns in Protein and Peptide table around Databases

In the metabolomics 1.1-draft, there is a plan to use a prefix system before identifiers to mark which database the given ID has come from, where the prefix is explained in the header.

Propose same change for Protein and peptide table for "Database" and "Database version".

Metadata would say following:

database[1-n] "UniProt human"
database[1-n]-prefix "uh"
database[1-n]-version "v82"
... accession .....
PRT uh:P678435

At the moment, we have proposed e.g. small_molecule-database[1-n], but the model can be the same for both kinds of databases, no need to split into two models.
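
A reader of such a file would resolve the prefix roughly as in the following sketch. The class name, the map, and the rule "split at the first colon" are assumptions made for illustration; the issue only proposes the metadata keys shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: resolves a prefixed accession such as "uh:P678435". */
final class PrefixedAccession {
    /** Returns {prefix, identifier}; an accession without a colon gets the empty prefix. */
    static String[] split(String accession) {
        int colon = accession.indexOf(':');
        if (colon < 0) {
            return new String[] {"", accession};
        }
        return new String[] {accession.substring(0, colon), accession.substring(colon + 1)};
    }

    public static void main(String[] args) {
        // prefix -> database name, as it would be declared in the metadata section
        Map<String, String> databases = new LinkedHashMap<>();
        databases.put("uh", "UniProt human");

        String[] parts = split("uh:P678435");
        System.out.println(databases.get(parts[0]) + " / " + parts[1]);
    }
}
```

Splitting at the first colon only works if identifiers of the referenced database never start with a registered prefix followed by a colon, which the specification would have to state.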

Further to this point, taxid and species seem to be null in almost all examples, since this information may not be available to most export software (especially with customised databases). Propose to suggest that these are optional columns or list possible species contained within database as part of the header.

jmzTab CLI jar assumes argument file paths are relative

What steps will reproduce the problem?
1. java -cp mzTabCLI.jar uk.ac.ebi.pride.jmztab.MZTabCommandLine -convert inFile=/some/absolute/path/input.mzid format=MZIDENTML -outFile /some/absolute/path/output.mzTab

What is the expected output? What do you see instead?

Expected = Successful conversion from mzid to mztab.
Actual =
Exception in thread "main" java.lang.IllegalStateException: XML File to index does not exist: /current/working/path/./some/absolute/path/input.mzid

Apparently, the command-line converter attempts to resolve the argument file 
path relative to the current working directory.  Not only does this break the 
use case of absolute paths, but it should not even be necessary in Java.  It 
should be sufficient to just literally pass the String argument value from the 
command line to a File constructor, and then test for existence and readability 
of the input file.
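
The suggested handling could look like the sketch below. This is not the jmzTab CLI code; the class and method names are hypothetical, and only java.io.File calls from the standard library are used.

```java
import java.io.File;

/** Sketch: validates an input path taken verbatim from the command line. */
final class InputFileCheck {
    static File requireReadable(String path) {
        // No manual prefixing with the working directory: java.io.File resolves
        // relative paths against the working directory and leaves absolute paths alone.
        File file = new File(path);
        if (!file.isFile() || !file.canRead()) {
            throw new IllegalArgumentException(
                "Input file does not exist or is not readable: " + file.getAbsolutePath());
        }
        return file;
    }
}
```

Because the argument is never concatenated with the current working directory, both /some/absolute/path/input.mzid and a relative input.mzid are handled by the same code path.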

Original issue reported on code.google.com by [email protected] on 28 Oct 2014 at 6:09

Question: at which places is polarity stored?

Modularization of jmztab

I have created a fork of the latest svn repository version of jmztab in order to make it more modular and to cut the dependencies down to the absolutely necessary ones. This may be especially useful for developers who do not need the peptide/proteomics parts of jmztab, but only the small molecules part, as in my case.

I therefore separated the model from the utils packages and provided separate modules for the pride-converter, the cli, and the gui.
Building the cli or the gui distributions can now be triggered by using maven profiles.
It would be great if someone else found this useful.

You can find the git repository here:

github.com/nilshoffmann/jmztab

Original issue reported on code.google.com by [email protected] on 21 Feb 2014 at 4:48

How to handle "no database" and UNKNOWN and their combinations

This issue was raised in #58

Order in Peptide columns

What steps will reproduce the problem?
1. During validation, the order of columns does not seem to matter. For example, when spectra_ref comes after the assay and study variable columns, the parser does not report any error, but when you try to retrieve the data, none is returned. The problem must be related to the order of the mandatory columns.

Original issue reported on code.google.com by ypriverol on 29 Sep 2014 at 2:04

small addition to specification document

change:

protein_search_engine_score[1-n] SC SC
peptide_search_engine_score[1-n] SC SC
psm_search_engine_score[1-n] SC SC
smallmolecule_search_engine_score[1-n] SC SC

to :
protein_search_engine_score[1-n] SC (if protein section present) SC (if protein section present)
peptide_search_engine_score[1-n] SC (if peptide section present) SC (if peptide section present)
psm_search_engine_score[1-n] SC (if psm section present) SC (if psm section present)
smallmolecule_search_engine_score[1-n] SC (if smallmolecule section present) SC (if smallmolecule section present)

in table and detailed specification

Original issue reported on code.google.com by [email protected] on 15 Jul 2014 at 12:20

Optional columns should contain null values

It would be interesting to generate optional columns with null value by default.

Original issue reported on code.google.com by ypriverol on 17 Oct 2014 at 3:06

Complete vs Summary; Quant vs ID in 1.1

Opening an issue here for discussion of Complete vs Summary and Quant vs ID as we want to encode it for mzTab 1.1

In brief - in metabolomics, there is no need of the ID only workflow.

Complete vs Summary is much simpler in the 1.1-metabolomics draft, 3 tables (SML, SMF, SME) vs 1 table (SML).

Reading back the 1.0 specs, the proteomics complete vs summary, quant vs ID split looks over-complicated to me. Plus if we want to release the metabolomics update, we may need to make breaking changes to 1.0.

As such, feels like we need to decide whether we can/should revisit the proteomics part now, or just make 1.1 a metabolomics only branch of the standard.

I've added a powerpoint with some items for discussion here: https://github.com/HUPO-PSI/mzTab/blob/master/specification_document/1_1_draft_specs/Version11_design_considerations.pptx

Please take a look. Would be good if @javizca could take a look, since I know you're away for the workshop coming up

Tag ms_run with fraction identifier

Currently, fraction support is rather limited.
One possibility to define the fraction number would be adding additional meta data.

One ms_run should only be associated with one fraction

Following up on discussions at the PSI meeting.
The specification document seems to be wrong or unclear in this regard.
Having one ms_run linking to several files is problematic because linking to source spectra is not easily possible.
Having one assay linking to multiple ms_runs might help but is currently not supported in the standard.

Specification of allowed cvParams

Hi, in the mzML and other formats, we have a mappingfile.xml,
which specifies which CV Terms are allowed in place of some cvParam.
We should be able to specify that e.g. a
https://github.com/HUPO-PSI/mzTab/blob/master/specification_document/1_1_draft_specs/mzTab_format_specification_1_1-M_draft.adoc#6217-quantification_method
must be a child of
https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1001833

This will require a new field in the spec doc, and the validator should check that.

Yours, Steffen

Unknown modification

What steps will reproduce the problem?
1. Unknown modifications in MS experiments exported from mzIdentML should be converted using the CHEMOD notation. The converter should be updated.

Original issue reported on code.google.com by ypriverol on 29 Sep 2014 at 2:14

StudyVariable vs assay as mandatory (formerly issue called: consider if assay only reports are allowed)

Mandatory reporting of study_variable columns prior to statistical downstream processing might not fit well with data processing workflows.
Relaxing the requirement so that summary files can also report assays only might yield cleaner files. Otherwise, assay information needs to be represented in study_variable columns that might not match the actual study variables (call it a hack).

Add IsNullable explicitly in proteomics section

While drafting metabolomics1.1, we added to the specs whether a given value was nullable. This seems really useful to implementers.

The 1.0 specs say the following
In general, “null” values SHOULD not be given within any column of a “Complete” file if the information is available.

But this is never really used in practice, since there are lots of cases where information is unknown to export software or lazy export writers don't want to locate something difficult (perfectly reasonably).

I vote for adding isNullable to each data type in the specs (and not separating by Complete or Summary - but that's another discussion).

Represent abundance_assay from not-grouped feature lists

Hi, I think we discussed this some time, but I couldn't find it,
and it doesn't harm to document here. I need advice how to represent XCMS results in mzTab-1.1
So, the SML has the summary of all grouped features after running group(xset) (i.e. a wide matrix). But where to put the peak picking results, which you get from peaks(xset) (i.e. the tall matrix) ?

IIRC we suggested that for file/sample no. 1 we'd have SMF
with abundance_assay[1]=value and abundance_assay[2..N]=NULL,

The issue I have with that approach is that we're encoding the fact
that this encodes the tall matrix in the pattern of values. I'd prefer
to either mention that in the MTD, or have just one column abundance_assay
and another column assay_name referencing for which assay this is the abundance.

Thoughts ? Yours, Steffen

> head(peaks(xset)[])
           mz mzmin mzmax       rt    rtmin    rtmax       into      intf  maxo       maxf sample
[1,] 200.1000 200.1 200.1 2928.610 2912.961 2942.695  147887.53  290506.9  9687  15899.054      1
[2,] 201.0638 201.0 201.1 2531.112 2515.463 2549.892  204572.42  280386.0  7726  13300.725      1
[3,] 205.0000 205.0 205.0 2784.635 2770.550 2800.284 1778568.94 3610059.7 84280 195026.431      1
[4,] 205.9819 205.9 206.0 2786.200 2772.115 2800.284  237993.62  448580.0 10681  23860.099      1
[5,] 207.0821 207.0 207.1 2712.647 2698.562 2726.731  380873.05  730980.9 18800  40065.736      1
[6,] 208.0671 208.0 208.1 2640.659 2625.009 2656.308   96070.72  150033.4  4112   7560.078      1
> tail(peaks(xset)[])
              mz mzmin mzmax       rt    rtmin    rtmax     into      intf  maxo     maxf sample
[4771,] 596.3574 596.3 596.4 3825.328 3811.244 3839.413 511236.1 1106531.3 25928 59878.60     12
[4772,] 596.3193 596.3 596.4 3615.625 3601.540 3628.144 249717.7  507054.4 14174 28983.33     12
[4773,] 597.3714 597.3 597.4 3825.328 3809.679 3840.978 206925.5  388002.7  9424 19741.25     12
[4774,] 597.3132 597.3 597.4 2803.414 2789.329 2817.498 122272.2  288468.1  7136 16469.23     12
[4775,] 599.2920 599.2 599.3 3662.573 3651.618 3676.658 236041.1  377861.0 12721 22822.38     12
[4776,] 599.3033 599.3 599.4 3615.625 3601.540 3628.144 341495.2  604176.8 17448 35206.17     12

> head(peakTable(xset)[,c(2,3,5,6,10:21)])
     mzmin    mzmax    rtmin    rtmax      ko15      ko16       ko18      ko19       ko21      ko22      wt15       wt16       wt18       wt19      wt21       wt22
1 200.1000 200.1000 2876.967 2931.740  147887.5  451600.7   65290.38        NA   91635.45  162012.4  175177.1   82619.48         NA   69198.22  153273.5   98144.28
2 205.0000 205.0000 2784.635 2795.591 1778568.9 1567038.1 1482796.38 1039129.8 1223132.35 1072037.7 1950287.5 1466780.60 1572679.16 1275312.76 1356014.3 1231442.16
3 205.9786 206.0023 2784.635 2795.591  237993.6  269714.0  201393.42  150107.3  176989.65  156797.0  276541.8  222366.15  211717.71  186850.88  188285.9  172348.76
4 207.0440 207.1000 2712.647 2726.731  380873.0  460629.7  351750.14  219288.0  286848.56  235022.6  417169.6  324892.46  277990.70  220972.35  252874.0  236728.16
5 219.0488 219.1000 2518.592 2529.547  235544.9  173623.4         NA        NA  185792.43  174458.8  244584.5  161184.05   72029.38         NA  238194.4  173829.95
6 231.0000 231.0812 2509.202 2535.807        NA        NA  222609.07  286232.1  435094.49        NA        NA         NA         NA  240261.21  201316.2  179437.72
> tail(peakTable(xset)[,c(2,3,5,6,10:21)])
       mzmin    mzmax    rtmin    rtmax      ko15      ko16      ko18      ko19      ko21     ko22     wt15      wt16      wt18      wt19      wt21      wt22
402 595.3000 595.3411 3603.105 3664.138 234262.30  339975.4        NA        NA 276909.61       NA       NA        NA        NA        NA 256556.99 1041407.3
403 595.2000 595.2633 2994.338 3006.858 178436.86  257167.1        NA        NA        NA       NA 857615.4        NA 195758.33        NA        NA  493703.1
404 595.8645 596.0487 3707.957 3736.128        NA  381904.3  65769.92  61369.30        NA       NA       NA  94422.29  47342.49        NA  52713.48        NA
405 596.3000 596.3645 3797.160 3831.590 195519.46 4482351.4 273204.87 161920.11 171784.27 137865.6       NA 620204.70 293010.81 142623.14 473434.14  511236.1
406 597.3073 597.3714 3798.724 3831.590  67248.65 1671423.3 104149.97  66215.79  67707.45       NA       NA 230637.90 124673.31  59829.39 180230.29  206925.5
407 598.3695 598.4778 3700.133 3726.738        NA  592732.1  51683.75  43868.52        NA       NA       NA  93591.09        NA        NA  55230.02        NA
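
For discussion, the first option above (one SMF row per picked peak, with abundance_assay filled only for the sample the peak came from) can be sketched in Java as follows. The class and method names are made up, and writing empty cells as the literal "null" is an assumption; the numbers are the first row of peaks(xset) shown above.

```java
import java.util.Arrays;

/** Sketch: abundance_assay[1..N] cells for one ungrouped peak. */
final class SparseFeatureRow {
    static String[] abundanceCells(int assayCount, int sample, double abundance) {
        if (sample < 1 || sample > assayCount) {
            throw new IllegalArgumentException("sample index out of range: " + sample);
        }
        String[] cells = new String[assayCount];
        Arrays.fill(cells, "null");                      // every other assay stays empty
        cells[sample - 1] = Double.toString(abundance);  // only the originating sample gets a value
        return cells;
    }

    public static void main(String[] args) {
        // First row of peaks(xset): into = 147887.53, sample = 1, 12 samples in total
        System.out.println(String.join("\t", abundanceCells(12, 1, 147887.53)));
    }
}
```

The sketch makes the concern visible: nothing in the row itself says that it is an ungrouped peak, a reader can only infer it from the fact that exactly one cell is filled.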

Reagent Terms in PSI-MS

What steps will reproduce the problem?
1. It would be interesting to define the proper CV terms in PSI-MS for the different reagents used with search engines. This Param should be used in the metadata:
  MTD   assay[1]-quantification_reagent
    We can use the PRIDE terms like http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=PRIDE&termId=PRIDE:0000433&termName=Reagents%20used%20in%20Labeled%20Methods
    but we should move some of them to PSI-MS.

Original issue reported on code.google.com by ypriverol on 16 Sep 2014 at 2:51

Multiplicity of [UNIT_ID]-uri

Currently, the multiplicity for [UNIT_ID]-uri is set to 0..*

Does it make sense to have multiple URIs for a single unit? Personally, I think 
it should be changed to 0..1.

Original issue reported on code.google.com by [email protected] on 27 Jun 2011 at 3:37

Protein group

We should have a way to represent Protein Ambiguity Groups in mzTab. My suggestion is to add an optional column with the CV term MS:1001591, which is the anchor protein. That way, we will know which protein is the anchor of the group and to which group each protein belongs.
Best regards

Original issue reported on code.google.com by ypriverol on 17 Oct 2014 at 3:13

go_terms in protein table

There is some concern about whether GO should be given an extra column, as it is just one of several systems for classifying proteins. For example, some people might want to add Reactome pathway accessions instead.

Original issue reported on code.google.com by [email protected] on 4 Jul 2011 at 2:27

add something like MTD {UNIT_ID}-used_obo

We implicitly refer to the most commonly used CV in the spec. doc.
With:
MTD {UNIT_ID}-used_obo
Description: A Connection between the used CV token and the URI
Type: String
Multiplicity: 0 .. *
Example:
MTD  PRIDE_1234-used_obo  MS:http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
MTD  PRIDE_1234-used_obo  PRIDE:http://ebi-pride.googlecode.com/svn/trunk/pride-core/schema/pride_cv.obo

Original issue reported on code.google.com by [email protected] on 11 Feb 2013 at 5:05

Assays of fractionated data needs rework.

One assay is reported for all fractions. This does not allow modeling a fractionated design with channel swaps between fractions (though I don't know how relevant this is).
Example: consider Assay 1 (A1). It is bound to one channel (e.g., 114 of an iTRAQ experiment)
A1 iTRAQ reagent 114

At the same time it is bound to 3 ms_runs (one for each fraction)
A1 F1,F2,F3

Handling / Reporting of multiple adduct masses in exp_mass_to_charge

From a recent Email exchange with Jürgen:

I have a question about the summary section. I am referring to the example MTBLS263.mztab. There, in the first line of the summary section, Creatinine is found with the SML_ID 469. As far as I can see, this molecule has been found by [M+H]+ and [M+Na]+ adducts, and has a theoretical mass of 113.0589. However, there is a column “exp_mass_to_charge”, which I assume is the experimental mass to charge ratio.

How is this value calculated when there is more than one adduct? Is it the mean of the neutral masses
of both adducts? Or is it weighted mean according to the abundance of the found hits?

Quoting the draft 1.1 standard:

The experimental mass to charge of the small molecule’s primary adduct form (e.g. mean m/z across assays), assumed by default to be the protonated (positive mode) or de-protonated (negative mode), otherwise the first reported adduct under the adduct ions column. For GC-MS approaches, this MAY be the m/z of the ion used for quantification.

We could also consider allowing multiple masses in this field, each separated by '|', following the same order as in adduct_ions.
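
A reader of such a field would pair the two lists by position, roughly as in this sketch. The class name is hypothetical and the m/z values used in main are illustrative only: they are the theoretical mass 113.0589 from the question plus the usual proton and sodium adduct masses, not values taken from MTBLS263.mztab.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: pairs '|'-separated adduct_ions with '|'-separated exp_mass_to_charge values. */
final class AdductMasses {
    static Map<String, Double> pair(String adductIons, String expMassToCharge) {
        String[] adducts = adductIons.split("\\|");
        String[] masses = expMassToCharge.split("\\|");
        if (adducts.length != masses.length) {
            throw new IllegalArgumentException("adduct_ions and exp_mass_to_charge differ in length");
        }
        Map<String, Double> result = new LinkedHashMap<>();  // keeps the reported order
        for (int i = 0; i < adducts.length; i++) {
            result.put(adducts[i].trim(), Double.parseDouble(masses[i].trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative values, not copied from the example file
        System.out.println(pair("[M+H]+|[M+Na]+", "114.0662|136.0481"));
    }
}
```

The length check is the part a validator would need: with positional pairing, a missing value silently shifts every following mass to the wrong adduct.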

For the Example, MTBLS263.mztab, there seems to be an inconsistency regarding exp_mass_to_charge:

According to the definition you sent me, it looks like the protonated form has to be reported in that column in the MTBLS263 example, which would be around 114 m/z. However, a value of 113.0582 is reported, which can only be the neutral mass.

I would argue that exp_mass_to_charge should only report the actually measured mass (adducts, derivatized ...)

Code Review Request for jmztab/branches/jmztab-modular

Purpose of code changes on this branch:
Modularization of the monolithic maven project. The following modules were 
added:

jmztab-model
jmztab-util
jmztab-cli
jmztab-gui
jmztab-converter-pride

The artifact Ids have been changed for now for disambiguation reasons, e.g. for 
the jmztab-model module, the artifact id is "jmztab-modular-model", where 
"jmztab-modular" is the artifact id of the parent pom.xml of the multi-module 
project. 
The artifact Ids are preliminary proposals. 

When reviewing my code changes, please focus on:
I checked that all test cases still run, however, please stay alert for 
possible issues. The main benefit of this modularization is that users do not 
have to include all dependencies since all modules have clear and minimal 
dependencies. This allows selective inclusion of the cli, gui, and 
converter-pride modules, for those who need them. The model module now only 
contains domain-dependent code. The util module contains the former 
...jmztab.utils package code, except for the pride converter related code, 
which now resides in its own module.  

After the review, I'll merge this branch into:
jmztab/trunk

Original issue reported on code.google.com by [email protected] on 28 Mar 2014 at 2:16

MTBLS263 description

Hi, as part of the mzTab-M discussions we are using
https://www.ebi.ac.uk/metabolights/MTBLS263 as an example.
There are some aspects unclear to me:

Description says samples came from 30 donors, but Source only has one value -> Just one person ? Everything pooled in one vial ?
There are 4 Preparation replicates
From the preparation 1 three injections were performed.

From my understanding of the ISA-Tab approach, there should be only four rows in the sample table, with four distinct sample names, and then six rows in the a_assay.txt, where three rows would have the same sample name but three different MS Assay Names.

Yours, Steffen

MzTab-M: RegEx for Adduct and question

Hi,
I put together a RegEx for the way we could encode adducts, and several possibilities arise:
https://regex101.com/r/9gcJZG/2

if we want to encode: [kM+nAdduct]charge(+|-)
e.g.:
[4M+2NH4]4+
[M-H]1-
we could use:

\[\d*M(\+|\-)\d*(([A-Z]\d*)+)\]\d+(\+|\-)

if we also want to allow:
[M-H]-
we could use:

\[\d*M(\+|\-)\d*(([A-Z]\d*)+)\]((\+|\-)|\d+(\+|\-))

Open question: How to encode multiple adducts on the same molecule: [M+Na+CH3OH]+ ?
The spec could be clearer (e.g., by adding more examples)
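
To make the behaviour of the second expression easier to discuss, here is a small Java check. The class name is made up, and the pattern is the expression from this issue with only Java string escaping added.

```java
import java.util.regex.Pattern;

/** Sketch: tries the second adduct expression from the issue on a few strings. */
final class AdductPattern {
    static final Pattern ADDUCT = Pattern.compile(
        "\\[\\d*M(\\+|\\-)\\d*(([A-Z]\\d*)+)\\]((\\+|\\-)|\\d+(\\+|\\-))");

    static boolean isValid(String adduct) {
        // matches() requires the whole string to fit the pattern
        return ADDUCT.matcher(adduct).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"[4M+2NH4]4+", "[M-H]1-", "[M-H]-", "[M+Na]+", "[M+Na+CH3OH]+"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

The first three strings are accepted. [M+Na]+ and [M+Na+CH3OH]+ are rejected, because [A-Z]\d* allows no lowercase letters in element symbols and the expression allows only one +/- group after M, so both points would need to be addressed for the open question above.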

The spec is also a bit weak on the SME section, which seems to contain quite a few copy-and-paste artefacts from the SMF section. I think this needs to be discussed and reworked a bit.

mzTab 1.0: PRT search_engine_score[1-n]_ms_run[1-n] column should be removed (or made optional)

In the protein section search_engine_score[1-n]_ms_run[1-n] is the protein score of a protein for an individual ms run.
This does not make sense if the inference (and probability score) is calculated based on IDs from several runs.
Given that this also leads to a ton of columns I would vote for removing it or to make it optional in subsequent versions of the format.

Tag ms_run with technical replicate group

It doesn't make sense to offload this to sample annotation. It would be correct to annotate this at the ms_run level.

MTD id_confidence_measure and some important issues related to scores

MTD id_confidence_measure could be renamed.
For one, id is used several times in mzTab in different contexts, thus it would be better to not abbreviate it.

Why not use identification_score?
Higher confidence = better, but scores may also have the opposite direction, e.g. lower p_value = better.
We do not specify the direction of the score. Determining this manually may be very difficult for simple parsers (e.g., it may require additional fields in an obo lookup, and if we are not even restricted to obo, it would be even more difficult). Note: in OpenMS we have had good experience with storing the score direction (higher is better / worse) along with the score type.
To which section is this score applied? This should be clear from the MTD entry.
The example needs an update to a metabolomics use case.

Example Files Outdated?

The example file "PSM_SQ.mzTab" contains the PSM header "search_engine_score", which is stored as a string. But section 6.5.8 of the specification version 1.0 indicates that it should be in the format "search_engine_score[1-n]" with type double. This example file, and others, do not follow this specification. I am assuming the specs are correct and these example files are invalid/out-of-date.

Original issue reported on code.google.com by [email protected] on 14 Jul 2014 at 4:58

Error in specification

In the small molecule section:
best_search_engine_score[1-n] and search_engine_score[1-n]_ms_run[1-n]  still 
have "Parameter List" as type.
This should be a double like in the other sections.

Original issue reported on code.google.com by [email protected] on 18 Jan 2015 at 5:07

Multiplicity of Instrument source, analyzer, detector

In mzML instruments can have multiple sources, analyzers and detectors. 
Currently we've only defined one cvParam for these attributes. 

I'd suggest to change that to any number of "|" delimited params.

Original issue reported on code.google.com by [email protected] on 27 Jun 2011 at 10:46

Peptides table: "accession" value if peptide assigned to multiple proteins

In the current specification it's stated (page 26): "The protein's accession 
the peptide is associated with. In case no protein section is present in the 
file or the peptide was not assigned to a protein the field should be filled 
with “NA”."

It's not clear from this description how peptides shared by several proteins should be treated. Should it be NA (but then the "unique" column doesn't make sense, since it's true iff the accession is not NA), or should it be a comma-separated list of the protein accession codes (in which case the "unique" column also looks redundant; maybe it could be replaced by a column specifying the number of proteins the peptide could be assigned to, "num_proteins_shared")?

Original issue reported on code.google.com by [email protected] on 30 Nov 2012 at 3:04

Issues with example files

Please post any issues you spot with example files here

Fix typo in mzTab 1.0 specification

In the spec document, we use both
protein-quantification-unit and protein-quantification_unit

For consistency reasons, it should be protein_quantification_unit or protein-quantification_unit

How to handle multi-species ?

This concerns metagenomics, but also cases where you align files from samples of different species.

Need some examples

Bugs in jmzTab related with ms_run-location

What steps will reproduce the problem?

1. ms_run[1]-location in the specification allows null values, but the jmzTab library fails the testing process.
2. Unknown modifications should be implemented in jmzTab using the CHEMOD notation for converters.

Original issue reported on code.google.com by ypriverol on 15 Sep 2014 at 3:48

Error in setting the PSM_ID

Setting the PSM_ID does not seem to work in 2.1.5.

When reading in a PSM line with getRecord, all PSM_IDs are null.
Likewise, for a PSM created with "new PSM(metadata)", setting the ID via setPSM_ID does not change it.

Every call to getPSM_ID() returns null.

Original issue reported on code.google.com by [email protected] on 2 Apr 2014 at 3:34

Example with PEP

There seems to be no example containing a PEP section - would it be possible to add one?

Inconsistent definition of fixed_mod or fixed_modifications (also variable_mod vs. variable_modification)

In the Specification document (1.0 rc 5, dated 20 June 2014) there are inconsistencies in how modifications are listed.

In the text in section 6.2 (Metadata Section) the last bullet point references 
"fixed_modification[1-n]" and "variable_modification[1-n]" but sections 6.2.24 
and 6.2.27 abbreviate modification to mod.

I noticed this inconsistency as well in section 5.8.

Original issue reported on code.google.com by [email protected] on 2 Jul 2014 at 4:25

Spec doc is missing text on standard numerical encoding

	TODO Insert some text in here about standard numerical encoding, e.g. US default style “x.x”, i.e. using a period for decimal separation and no commas to separate thousands.

@jmrein to look into it
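
As input for that text, this is how a Java writer or reader could stick to the “x.x” style regardless of the machine's default locale. The class name and the choice of four decimal places are assumptions for illustration, not something the specification prescribes.

```java
import java.util.Locale;

/** Sketch: locale-independent number formatting and parsing for table cells. */
final class NumberEncoding {
    static String format(double value) {
        // Locale.US gives '.' as decimal separator; %f without the ',' flag adds no grouping
        return String.format(Locale.US, "%.4f", value);
    }

    static double parse(String cell) {
        // Double.parseDouble ignores the default locale and rejects "1,5"
        return Double.parseDouble(cell);
    }

    public static void main(String[] args) {
        System.out.println(format(1234567.891));
        System.out.println(parse("113.0589"));
    }
}
```

Calling String.format without an explicit locale uses the default locale, which on a German or French system writes a decimal comma, so the explicit Locale.US is the relevant detail here.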

MzTab-M: id used with different meanings throughout MzTab

MzTab-M uses the abbreviations id and ID in many different contexts.

I first suggested renaming:
ms_run[1-n]-id_format to ms_run[1-n]-nativeID_format or just ms_run[1-n]-nativeID
which would break compatibility with MzTab 1.0

maybe we can still check for the newly introduced ones if replacing the abbreviation is less ambiguous

Param object's "value" argument cannot contain special characters

What steps will reproduce the problem?
1. Param param = new UserParam("Some parameter", "\"[...]\"");
2. System.out.println(param.getValue());

What is the expected output? What do you see instead?

Expected = [...]
Actual = ...

Please use labels and text to provide additional information.

CV parameters are normally encoded in string format using the standard square 
bracket-enclosed, comma-delimited tuple format:
[<label>,<accession>,<name>,<value>]

Because this format makes use of square brackets ("[]") and commas (","), these 
are generally reserved characters that should not appear in the element values. 
 However, some CV param names are known to contain commas, e.g.:

MOD:00648 - N,O-diacetylated L-serine

Therefore, it is well-documented that a Param's "name" argument, when 
containing illegal characters, should be enclosed in quotation marks ("") to 
inform the parsing engine that it should not treat those characters as 
delimiters in the overall parameter tuple string.

However, the Param's "value" argument is not treated in this same manner, even 
though it should be, since the semantics of this element can be arbitrary and 
user-defined.  Currently, even when the string "value" argument is explicitly 
enclosed in quotation marks, these special characters are always just stripped 
out of the stored string value.  This should not happen when the argument value 
is enclosed in quotation marks.
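
The behaviour asked for can be illustrated with a small quote-aware splitter. This is a sketch only, not the jmzTab parser: the class name is hypothetical, and it treats the value field exactly like the name field, which is the change requested in this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: splits "[label, accession, name, value]" on commas outside double quotes. */
final class ParamFields {
    static List<String> split(String param) {
        String body = param.trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("Not a bracketed parameter: " + param);
        }
        body = body.substring(1, body.length() - 1);  // strip only the outer brackets

        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : body.toCharArray()) {
            if (c == '"') {
                quoted = !quoted;                      // toggle quoting, drop the quote itself
            } else if (c == ',' && !quoted) {
                fields.add(current.toString().trim()); // unquoted comma ends the field
                current.setLength(0);
            } else {
                current.append(c);                     // brackets and quoted commas are kept
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }
}
```

For the input [MOD, MOD:00648, "N,O-diacetylated L-serine", "[...]"] this yields four fields, with the comma in the name and the brackets in the value preserved. Escaped quotes inside a field are not handled in this sketch.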

Original issue reported on code.google.com by [email protected] on 29 Sep 2014 at 11:13

Example mismatch `study_variable_function`

Hi, in
https://github.com/HUPO-PSI/mzTab/blob/master/specification_document/1_1_draft_specs/mzTab_format_specification_1_1-M_draft.adoc#6225-study_variable_function1-n
the example has MTD small_molecule-quantification_unit [PRIDE, PRIDE:0000395, Ratio, ]

Recreate Figure 1 for mzTab-M 1.1 document

Figure 1 is currently a PNG graphics file with the figure caption being part of the graphics file.

@andrewrobertjones do you have the original source, is the schematic good as-is, or should we redo it?

Defining modifications in Small Molecules Table needs to be revised

Modifications in Small Molecules must have a different structure than used for 
proteins / peptides.

Suggestion 1: support modifications without positional information.

Original issue reported on code.google.com by [email protected] on 11 Nov 2011 at 11:45

Peptide to protein mapping in quant files

If a peptide can be mapped to multiple proteins, the 1.0 specs recommend duplicating the rows, and just changing the accession. I have a strong preference to change this so that multiple accessions can be separated by semi-colons (or other second separator).

Otherwise this can cause problems for stats/visualisation or other software that wants to work with the quant data. Logic to work out duplicates would need to be encoded