bigbio / sdrf-pipelines

A repository to convert SDRF proteomics files into pipelines config files

License: Apache License 2.0

Topics: multiomics, msstats, sdrf, proteomics-data-analysis, proteomics, mass-spectrometry

sdrf-pipelines's Introduction

sdrf-pipelines


The SDRF pipelines provide a set of tools to validate SDRF files and convert them into configuration files for different workflows such as MSstats, OpenMS, and MaxQuant.

Installation

pip install sdrf-pipelines

Validate the SDRF

How to use it:

After installation, you can validate an SDRF file by executing the following command:

parse_sdrf validate-sdrf --sdrf_file {here_the_path_to_sdrf_file}

Convert to OpenMS: Usage

parse_sdrf convert-openms -s sdrf.tsv

Description:

The OpenMS converter generates two files:

  • experiment settings (search engine settings etc.)
  • experimental design

The experimental settings file contains one row for every raw file. Columns contain relevant parameters like precursor mass tolerance, modifications, etc. These settings can usually be derived from the SDRF file.

URI Filename FixedModifications VariableModifications Label PrecursorMassTolerance PrecursorMassToleranceUnit FragmentMassTolerance FragmentMassToleranceUnit DissociationMethod Enzyme
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/XX/PXD324343/A0218_1A_R_FR01.raw A0218_1A_R_FR01.raw Acetyl (Protein N-term) Gln->pyro-glu (Q),Oxidation (M) label free sample 10 ppm 10 ppm HCD Trypsin
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/XX/PXD324343/A0218_1A_R_FR02.raw A0218_1A_R_FR02.raw Acetyl (Protein N-term) Gln->pyro-glu (Q),Oxidation (M) label free sample 10 ppm 10 ppm HCD Trypsin

The experimental design file contains information on how to unambiguously map a single quantitative value. Most entries can be derived from the SDRF file. However, the definition of conditions might need manual changes.

  • Fraction_Group identifier that indicates which fractions belong together. In the case of label-free data, the fraction group identifier has the same cardinality as the sample identifier.
  • The Fraction identifier indicates which fraction was measured in this file. In the case of unfractionated data the fraction identifier is 1 for all samples.
  • The Label identifier: 1 for label-free, 1 and 2 for SILAC light/heavy, and e.g. 1-10 for TMT10plex
  • The Spectra_Filepath (e.g., path = "/data/SILAC_file.mzML")
  • MSstats_Condition the condition identifier as used by MSstats
  • MSstats_BioReplicate an identifier to indicate replication. (MSstats requires that there are no duplicate entries. E.g., if MSstats_Condition, Fraction_Group and Fraction number are the same, as in the case of biological or technical replication, one uses MSstats_BioReplicate to make entries unique.)
Fraction_Group Fraction Spectra_Filepath Label MSstats_Condition MSstats_BioReplicate
1 1 A0218_1A_R_FR01.raw 1 1 1
1 2 A0218_1A_R_FR02.raw 1 1 1
. . ... . . .
1 15 A0218_2A_FR15.raw 1 1 1
2 1 A0218_2A_FR01.raw 1 2 2
. . ... . . .
. . ... . . .
10 15 A0218_10A_FR15.raw 1 10 10

For details, please see the MSstats documentation
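
As a quick sanity check, the design table can be loaded with pandas and tested for the duplicate entries that MSstats rejects. This is a minimal sketch, not part of the tool, and the file name experimental_design.tsv is an assumption:

import pandas as pd

# Load a one-table experimental design (assumed file name) and flag rows
# that MSstats would consider duplicates.
design = pd.read_csv("experimental_design.tsv", sep="\t", dtype=str)
key_cols = ["Fraction_Group", "Fraction", "Label", "MSstats_Condition", "MSstats_BioReplicate"]
dupes = design[design.duplicated(subset=key_cols, keep=False)]
if not dupes.empty:
    print("Entries that need a distinct MSstats_BioReplicate:")
    print(dupes)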

Convert to MaxQuant: Usage

parse_sdrf convert-maxquant -s sdrf.tsv -f {here_the_path_to_protein_database_file} -m {True or False} -pef {default 0.01} -prf {default 0.01} -t {temporary folder} -r {raw_data_folder} -n {number of threads:default 1} -o1 {parameters(.xml) output file path} -o2 {maxquant experimental design(.txt) output file path}

e.g.

parse_sdrf convert-maxquant -s /root/ChengXin/Desktop/sdrf.tsv -f /root/ChengXin/MyProgram/search_spectra/AT/TAIR10_pep_20101214.fasta -r /root/ChengXin/MyProgram/virtuabox/share/raw_data/ -o1 /root/ChengXin/test.xml -o2 /root/ChengXin/test_exp.xml -t /root/ChengXin/MyProgram/virtuabox/share/raw_data/ -pef 0.01 -prf 0.01 -n 4
  • -s : SDRF file
  • -f : fasta file
  • -r : spectra raw file folder
  • -mcf : MaxQuant default configuration file path (if given, new modifications can be added)
  • -m : use match between runs to boost the number of identifications
  • -pef : posterior error probability calculation based on target-decoy search
  • -prf : protein score = product of peptide PEPs (one for each sequence)
  • -t : temporary folder; place it on an SSD (if possible) for faster searches. It is recommended that it not be the same as the raw file directory.
  • -n : number of threads. Each thread needs at least 2 GB of RAM, and the number of threads should be ≤ the number of logical cores available (otherwise, MaxQuant can crash).

Description

  • maxquant parameters file (mqpar.xml)
  • maxquant experimental design file (.txt)

The MaxQuant parameters file mqpar.xml contains the parameters required to run MaxQuant. Some settings can usually be derived from the SDRF file, such as enzyme, fixed modification, variable modification, instrument, fraction and label; other parameters are set to their defaults. The current version of MaxQuant supported by the script is 1.6.10.43. A minimal illustration of how such parameter nodes are written follows the parameter list below.

Some parameters are listed:

  • <fastaFilePath>TAIR10_pep_20101214.fasta</fastaFilePath>
  • <matchBetweenRuns>True</matchBetweenRuns>
  • <maxQuantVersion>1.6.10.43</maxQuantVersion>
  • <tempFolder>C:/Users/test</tempFolder>
  • <numThreads>2</numThreads>
  • <filePaths>
    • <string>C:\Users\search_spectra\AT\130402_08.raw</string>
    • <string>C:\Users\search_spectra\AT\130412_08.raw</string>
  • </filePaths>
  • <experiments>
    • <string>sample 1_Tr_1</string>
    • <string>sample 2_Tr_1</string>
  • </experiments>
  • <fractions>
    • <short>32767</short>
    • <short>32767</short>
  • </fractions>
  • <paramGroupIndices>
    • <int>0</int>
    • <int>1</int>
  • </paramGroupIndices>
  • <msInstrument>0</msInstrument>
  • <fixedModifications>
    • <string>Carbamidomethyl (C)</string>
  • </fixedModifications>
  • <enzymes>
    • <string>Trypsin</string>
  • </enzymes>
  • <variableModifications>
    • <string>Oxidation (M)</string>
    • <string>Phospho (Y)</string>
    • <string>Acetyl (Protein N-term)</string>
    • <string>Phospho (T)</string>
    • <string>Phospho (S)</string>
  • </variableModifications>

For details, please see the MaxQuant documentation
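
To make the structure above concrete, here is a minimal, illustrative sketch of emitting one such parameter node with Python's xml.dom.minidom; this is not the converter's actual code, only a demonstration of the mechanism:

from xml.dom import minidom

# Illustrative only: build a single mqpar.xml-style parameter node.
doc = minidom.Document()
root = doc.createElement("MaxQuantParams")
doc.appendChild(root)

threads = doc.createElement("numThreads")
threads.appendChild(doc.createTextNode("2"))  # createTextNode requires a str, not an int
root.appendChild(threads)

print(doc.toprettyxml(indent="  "))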

The MaxQuant experimental design file contains Name, Fraction, Experiment and PTM columns. Most entries can be derived from the SDRF file.

  • Name raw data file name.
  • Fraction In the Fraction column you must indicate whether the corresponding files shown in the left column belong to a gel fraction. If your data was not obtained through gel-based pre-fractionation, you must assign the same number (default 1) to all files in the Fraction column.
  • Experiment If you want to combine all experimental replicates into a single dataset to be analyzed by MaxQuant, you must enter the same identifier in the Experiment column for the files that should be concatenated. However, if you want each individual file to be treated as a separate experiment for further comparison, you should assign a different identifier to each of the files, as shown below.
Name Fraction Experiment PTM
130402_08.raw 1 sample 1_Tr_1
130412_08.raw 1 sample 2_Tr_1

Convert to MSstats annotation file: Usage

parse_sdrf convert-msstats -s ./testdata/PXD000288.sdrf.tsv -o ./test1.csv
  • -s : SDRF file
  • -c : Create conditions from provided (e.g., factor) columns as used by MSstats
  • -o : annotation out file path
  • -swath : convert from OpenSWATH output to MSstats (default: false)
  • -mq : convert from MaxQuant output to MSstats (default: false)

Convert to NormalyzerDE design file: Usage

parse_sdrf convert-normalyzerde -s ./testdata/PXD000288.sdrf.tsv -o ./testPXD000288_design.tsv
  • -s : SDRF file
  • -c : Create groups from provided (e.g., factor) columns as used by NormalyzerDE, for example -c ["characteristics[spiked compound]"] (optional)
  • -o : NormalyzerDE design out file path
  • -oc : Out file path for comparisons towards first group (optional)
  • -mq : Path to MaxQuant experimental design file for mapping MQ sample names. (optional)

Citations

  • Dai C, Füllgrabe A, Pfeuffer J, Solovyeva EM, Deng J, Moreno P, Kamatchinathan S, Kundu DJ, George N, Fexova S, Grüning B, Föll MC, Griss J, Vaudel M, Audain E, Locard-Paulet M, Turewicz M, Eisenacher M, Uszkoreit J, Van Den Bossche T, Schwämmle V, Webel H, Schulze S, Bouyssié D, Jayaram S, Duggineni VK, Samaras P, Wilhelm M, Choi M, Wang M, Kohlbacher O, Brazma A, Papatheodorou I, Bandeira N, Deutsch EW, Vizcaíno JA, Bai M, Sachsenberg T, Levitsky LI, Perez-Riverol Y. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat Commun. 2021 Oct 6;12(1):5854. doi: 10.1038/s41467-021-26111-3. PMID: 34615866; PMCID: PMC8494749. Manuscript

  • Perez-Riverol, Yasset, and European Bioinformatics Community for Mass Spectrometry. "Toward a Sample Metadata Standard in Public Proteomics Repositories." Journal of Proteome Research 19.10 (2020): 3906-3909. Manuscript

sdrf-pipelines's People

Contributors

bgruening, daichengxin, fabianegli, flevander, jpfeuffer, jspaezp, lazear, levitsky, timosachsenberg, veitveit, wanghong007, ypriverol


sdrf-pipelines's Issues

Unimod and OLS error

If the user annotates the Unimod modifications using lowercase in the name of the ontology, the OLS search will not work. I have tried the following:

Unimod -> fail
unimod -> fail
UNIMOD -> success

[DOC] Link to relevant repos

It would be nice to have links to the most relevant repositories and resources related to sdrf-pipelines in the README.md, e.g. to the https://github.com/bigbio/proteomics-metadata-standard repo.

There is already a link to the publication in the Citation paragraph, but nothing else yet. Is there anything else that would be nice to link to besides the proteomics-metadata-standard repo?

Error SDRF pipeline to OpenMS

Dataset PXD012255

Error executing process > 'sdrf_parsing (1)'

Caused by:
  Process `sdrf_parsing (1)` terminated with an error exit status (1)

Command executed:

  ## -t2 since the one-table format parser is broken in OpenMS2.5
  ## -l for legacy behavior to always add sample columns
  parse_sdrf convert-openms -t2 -l -s PXD012255-Sample-1.tsv > sdrf_parsing.log

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/bin/parse_sdrf", line 10, in <module>
      sys.exit(main())
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/sdrf_pipelines/parse_sdrf.py", line 109, in main
      cli()
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/click/core.py", line 829, in __call__
      return self.main(*args, **kwargs)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/click/core.py", line 782, in main
      rv = self.invoke(ctx)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
      return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/click/core.py", line 610, in invoke
      return callback(*args, **kwargs)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
      return f(get_current_context(), *args, **kwargs)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/sdrf_pipelines/parse_sdrf.py", line 36, in openms_from_sdrf
      OpenMS().openms_convert(sdrf, raw, onetable, legacy, verbose, conditionsfromcolumns)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/sdrf_pipelines/openms/openms.py", line 163, in openms_convert
      variable_mods_string = self.openms_ify_mods(var_mods)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-dbfe576f0ba2eaab3a44c499db5d3312/lib/python3.8/site-packages/sdrf_pipelines/openms/openms.py", line 82, in openms_ify_mods
      aa = ta.split(",")  # multiply target site e.g., S,T,Y including potentially termini "C-term"
  UnboundLocalError: local variable 'ta' referenced before assignment

Work dir:
  /hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/32/6f42c67549e5629673f5e7abc9e54b

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

[CODE] Use pandas to parse the SDRF

This repo contains a lot of custom code for parsing files/strings and putting them back together. I strongly believe that this is a sure way to have weird file input pass validation testing, besides other issues. The package already depends on the pandas package and thus could make more extensive use of it for parsing inputs and later validation.
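
A minimal sketch of the suggested direction (the file name and normalization steps are illustrative):

import pandas as pd

# Let pandas do the tokenization so malformed input fails early, then run
# all later validation against the resulting DataFrame.
sdrf = pd.read_csv("sdrf.tsv", sep="\t", dtype=str, keep_default_na=False)
sdrf.columns = [c.strip() for c in sdrf.columns]  # drop stray whitespace in headers
if "source name" not in (c.lower() for c in sdrf.columns):
    raise ValueError("Missing mandatory 'source name' column; is this an SDRF file?")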

[DOC] Document how to run tests

Developers need to run tests locally and it would be nice to have documentation about how to run the tests in this repo during package development.

Erroneous TMT label extraction logic

label_list = sorted(label)

Sorting the default label results in label_list being ['2', 'M', 'T', 'T'] with a length of 4, which then probably leads to the plex check being true, and we get a 6-plex from a 2-plex.

I think the whole label extraction should be more stringent and fail on unknown input, with a note that an issue should be opened in this repo to adjust for new labels (and fix bugs). Guessing could lead to issues that are hard for users to diagnose, and errors might even go unnoticed.
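
A small sketch reproduces the problem and shows one possible stricter alternative (the plex table is illustrative, not the package's actual mapping):

# Reproduce the bug: sorting a *string* sorts its characters.
label = "TMT2"
label_list = sorted(label)  # ['2', 'M', 'T', 'T'] -> length 4, misread as a larger plex
print(label_list, len(label_list))

# Stricter alternative: fail on anything unknown instead of guessing.
KNOWN_PLEXES = {"TMT2": 2, "TMT6": 6, "TMT10": 10, "TMT11": 11, "TMT16": 16}  # illustrative
if label not in KNOWN_PLEXES:
    raise ValueError(f"Unknown label {label!r}; please open an issue in bigbio/sdrf-pipelines")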

[Usability] Fix uncaught exception when unsupported labels are used in SDRF file

Currently, the use of other label specifications causes the following error:

File "/opt/conda/envs/nf-core-proteomicstmt-1.0dev/lib/python3.8/site-packages/sdrf_pipelines/openms/openms.py", line 674, in save_search_settings_to_file
    URI + "\t" + raw + "\t" + f2c.file2mods[raw][0] + "\t" + f2c.file2mods[raw][1] + "\t" + label + "\t" +
UnboundLocalError: local variable 'label' referenced before assignment

Solution

Add proper error handling and a well-formatted (helpful) error message.
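
A sketch of what this could look like; the function and the set of supported labels are hypothetical, not the package's actual code:

SUPPORTED_LABELS = {"label free sample", "TMT10plex", "SILAC light", "SILAC heavy"}  # illustrative

def resolve_label(sdrf_label: str) -> str:
    # Hypothetical helper: validate the label before it is used, so users get
    # a clear message instead of an UnboundLocalError.
    if sdrf_label not in SUPPORTED_LABELS:
        raise ValueError(
            f"Unsupported label {sdrf_label!r} in SDRF file. "
            f"Supported labels: {', '.join(sorted(SUPPORTED_LABELS))}"
        )
    return sdrf_label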

(I will create a PR for this later on)

Error exporting openMS

SDRF: https://github.com/multiomics/multiomics-configs/blob/master/datasets/cancer-celllines-samples/sample-specific/PXD015270-Sample-1.tsv

Error executing process > 'sdrf_parsing (1)'

Caused by:
  Process `sdrf_parsing (1)` terminated with an error exit status (1)

Command executed:

  ## -t2 since the one-table format parser is broken in OpenMS2.5
  ## -l for legacy behavior to always add sample columns
  parse_sdrf convert-openms -t2 -l -s PXD015270-Sample-1.tsv > sdrf_parsing.log

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/bin/parse_sdrf", line 10, in <module>
      sys.exit(main())
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/sdrf_pipelines/parse_sdrf.py", line 109, in main
      cli()
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/click/core.py", line 829, in __call__
      return self.main(*args, **kwargs)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/click/core.py", line 782, in main
      rv = self.invoke(ctx)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
      return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/click/core.py", line 610, in invoke
      return callback(*args, **kwargs)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
      return f(get_current_context(), *args, **kwargs)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/sdrf_pipelines/parse_sdrf.py", line 36, in openms_from_sdrf
      OpenMS().openms_convert(sdrf, raw, onetable, legacy, verbose, conditionsfromcolumns)
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/sdrf_pipelines/openms/openms.py", line 294, in openms_convert
      self.writeTwoTableExperimentalDesign("experimental_design.tsv", sdrf, f2c.file2technical_rep,
    File "/hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/conda/nf-core-proteomicslfq-1.1.0dev-44454b721375d966e976bcdaee627e4d/lib/python3.8/site-packages/sdrf_pipelines/openms/openms.py", line 489, in writeTwoTableExperimentalDesign
      f.write(str(sample) + "\t" + condition + "\t" + MSstatsBioReplicate + "\n")
  TypeError: can only concatenate str (not "int") to str

Work dir:
  /hps/nobackup2/proteomics/yperez_temp/proteogenomics_project/datasets/cell-lines/work/c3/1b32093276ea6741203e5bb20e4417

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

[ENH] Need for more comprehensive integration testing

@ypriverol Do you have a set of SDRF files for experiments of all different kinds of expected flavors?

If not, we should systematically generate SDRF example files with all the different labelling techniques, fractionations, biological and technical replicates, and file names (spaces/weird characters, if allowable) to have test cases for at least the most common experimental setups. Based on those, we can then generate examples of SDRF files that do not comply with the standard but should be readable nonetheless, and some that are wrong in the various ways we expect users to mess up when writing SDRF files with common tools or even by hand.

Error in MaxQuant converter

Hi,

I tried to run the MaxQuant converter with the sdrf file from here: https://github.com/bigbio/sdrf-pipelines/tree/master/sdrf_pipelines/testdata and a newly downloaded human uniprot fasta file:

parse_sdrf convert-maxquant -s sdrf_test.tsv -f /test_uniprot.fasta -m False -pef 0.01 -prf 0.01

results in:

None PROCESSING: sdrf_test.tsv
Traceback (most recent call last):
  File "/home/meli/.local/bin/parse_sdrf", line 8, in <module>
    sys.exit(main())
  File "/home/meli/.local/lib/python3.8/site-packages/sdrf_pipelines/parse_sdrf.py", line 98, in main
    cli()
  File "/home/meli/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/meli/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/meli/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/meli/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/meli/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/meli/.local/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/meli/.local/lib/python3.8/site-packages/sdrf_pipelines/parse_sdrf.py", line 56, in maxquant_from_sdrf
    Maxquant().maxquant_convert(sdrf, fastafilepath, matchbetweenruns, peptidefdr, proteinfdr,
  File "/home/meli/.local/lib/python3.8/site-packages/sdrf_pipelines/maxquant/maxquant.py", line 690, in maxquant_convert
    tempFolder_node.appendChild(doc.createTextNode(tempFolder))
  File "/usr/lib/python3.8/xml/dom/minidom.py", line 1659, in createTextNode
    raise TypeError("node contents must be a string")
TypeError: node contents must be a string

Any idea where the error comes from?

Thanks,
Melanie

500 Server Error

I tried to run a validation, but got an error on reaching an ebi.ac.uk server:

> parse_sdrf validate-sdrf --sdrf_file .\annotated-projects\PXD011839\sdrf.tsv
Traceback (most recent call last):
  File "c:\users\henry\.conda\envs\metadataproj\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\henry\.conda\envs\metadataproj\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Henry\.conda\envs\metadataproj\Scripts\parse_sdrf.exe\__main__.py", line 7, in <module>
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\sdrf_pipelines\parse_sdrf.py", line 98, in main
    cli()
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\click\decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\sdrf_pipelines\parse_sdrf.py", line 77, in validate_sdrf
    errors = df.validate(template)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\sdrf_pipelines\sdrf\sdrf.py", line 52, in validate
    errors = default_schema.validate(self)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\sdrf_pipelines\sdrf\sdrf_schema.py", line 157, in validate
    error_ontology_terms = self.validate_columns(panda_sdrf)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\sdrf_pipelines\sdrf\sdrf_schema.py", line 214, in validate_columns
    errors += column.validate(series)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\pandas_schema\column.py", line 27, in validate
    return [error for validation in self.validations for error in validation.get_errors(series, self)]
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\pandas_schema\column.py", line 27, in <listcomp>
    return [error for validation in self.validations for error in validation.get_errors(series, self)]
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\pandas_schema\validation.py", line 85, in get_errors
    simple_validation = ~self.validate(series)
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\sdrf_pipelines\sdrf\sdrf_schema.py", line 116, in validate
    ontology_terms = client.search(term[TERM_NAME], ontology=self._ontology_name, exact="true")
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\sdrf_pipelines\zooma\ols.py", line 157, in search
    req.raise_for_status()
  File "c:\users\henry\.conda\envs\metadataproj\lib\site-packages\requests\models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://www.ebi.ac.uk/ols/api/search?q=homo+sapiens&exact=on&type=class&ontology=ncbitaxon

[ENH] Using some Python coding best practices for open source repositories

I propose that this project adopt some of the standard Python development tools to make contributions easier and reduce some of the burden on new contributors.

  • black and black-disable-checker for code formatting
  • isort to systematically sort the imports
  • pylint to spot and fix mistakes, errors and code smells

Further tools that are not yet very useful here but might become handy in the future:

  • blacken-docs for documentation
  • mypy if type annotations become a thing in this package.

[DISCUSSION] A more concise CLI that is well behaved

When looking at the sdrf-pipelines package there are many different terms that have a range of meanings and uses.

The string sdrf-pipelines is only used for the installation and the import of the package. I think that is fine.

Now for the CLI. It introduces the command parse_sdrf. I think this is inherently not a bad name if all the tool does is parse one or more SDRF files. But the tool actually does more: it validates SDRF files when called with parse_sdrf validate-sdrf ... and converts them (parse_sdrf convert-openms ...) to input files for other tools. Validation requires parsing, and conversion does as well, so the parse in parse_sdrf seems redundant and even somewhat misleading, because the tool advertises parsing but goes much further than parsing SDRF files. That's why I think the name of the command line tool is not ideal.

Also note that the "conversion" is not actually a pure conversion of the information in the SDRF. In the case of the MaxQuant output it is also an enrichment. parse_sdrf as a command name doesn't do that justice.

Because of all the above I propose to adopt a new CLI naming and behaviour as follows:

  • a command called sdrf which can be used to validate and write SDRF files.
  • sdrf validate only validates SDRF files. There might be something like a --strict flag to make it validate only byte-perfect SDRF files and complain about trailing whitespace and other errors ignored by the permissive parser.

The CLI should be well behaved. Some specific properties that come to mind are pipes: it would be great if input and output could be piped.

The name sdrf is NOT already taken as a

  • debian/ubuntu/redhat package name
  • Python package name
  • MacPorts package
  • bioconda package
  • biocontainer

I know there is already sdrf-pipelines/parse_sdrf/convert-openms/convert-maxquant/..., but I think the tool is still relatively new and would profit from a change in the long run. Implementing the proposals above would also not require that the current syntax break immediately.

For the conversion, there could be an analog sdrf command:

  • sdrf convert --from-format [input-format] --in [file] --to-format [output-format] --out [file] [additional configurations]. This should also handle the --strict flag mentioned above and convert from and to the SDRF file format. The format only has to be specified for the non-SDRF file.

Thoughts and discussions are welcome.

Order of properties not validated

Can we validate the following rules:

  • No comment can be before assay name.
  • No characteristics can be after assay name.
  • Can you implement the logic of splitting an SDRF by columns? For example, the user can say "source name" and it will split the SDRF into multiple SDRFs where all the rows of each SDRF correspond to one specific value. This will help, for example, to split SDRFs by factor value, sample, instrument, etc. The user should be able to pass a prefix for the output files, or you can take the SDRF name and then add the value, replacing spaces by -, for example: PXD000621.sdrf.tsv can be
  • We also need to implement a mechanism to merge SDRFs.

Access to yml and xml files

Until now, these files only seem to be accessible when in the same folder, and thus one cannot use pkg_resources to access them from somewhere else.

I suspect that this is due to a missing general setting of the main folder in setup.py, but I admit to not knowing enough about Python packaging.

[ENH] Use logging instead of print

There is a range of print statements scattered throughout the package. Using logging would allow better control of what is written to stdout and stderr.

Hint: Find print in the repo with git grep "print("
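
The usual replacement pattern would look something like this (a sketch, not existing code in the package):

import logging

logger = logging.getLogger(__name__)

def convert(sdrf_file: str) -> None:
    logger.info("Processing %s", sdrf_file)  # routed by handler configuration, not hard-wired to stdout
    logger.warning("No fragment mass tolerance set. Assuming 20 ppm.")

# The application entry point configures the output once:
logging.basicConfig(level=logging.INFO)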

[ENH] Empty rows in the middle of the dataset are erroneously accepted

The test for empty rows is probably good enough if these rows are at the beginning or end of the file, but it is not sufficient for the case where there are empty rows in the middle, because that could also be a hint that the file content does not comply with the SDRF standard or contains more parts than only the SDRF.

df = df.dropna(axis='index', how='all')
if df.shape[0] < nrows:
    logging.warning('There were empty lines.')
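
A sketch of a stricter check that distinguishes interior empty rows from leading/trailing ones (not existing code):

import pandas as pd

def has_interior_empty_rows(df: pd.DataFrame) -> bool:
    # Report empty rows that occur *between* data rows, which likely indicate
    # a malformed file rather than harmless padding.
    empty = df.isna().all(axis=1).to_numpy()
    data_idx = (~empty).nonzero()[0]
    if data_idx.size == 0:
        return False  # entirely empty file; a different kind of error
    first, last = data_idx[0], data_idx[-1]
    return bool(empty[first:last + 1].any())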

KeyError in validate-sdrf when optional SDRF column is missing in the file

The absence of optional columns should not be registered as an error, but it leads to a KeyError exception in validate_sdrf:

for column in columns_to_pair:
    if column.name not in panda_sdrf and column._optional == False:
        message = 'The column {} is not present in the SDRF'.format(column.name)
        errors.append(LogicError(message, error_type=logging.ERROR))
    else:
        column_pairs.append((panda_sdrf[column.name], column))

The last line results in an error for any optional column not present in the file. The solution would be to check that the column is actually present in the DataFrame before accessing it.
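
A sketch of the fixed loop, based on the snippet above (LogicError, columns_to_pair and the rest come from the surrounding code):

for column in columns_to_pair:
    if column.name not in panda_sdrf:
        if not column._optional:  # only a missing *mandatory* column is an error
            message = 'The column {} is not present in the SDRF'.format(column.name)
            errors.append(LogicError(message, error_type=logging.ERROR))
        # a missing optional column is silently skipped
    else:
        column_pairs.append((panda_sdrf[column.name], column))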

strange method name

The Zooma class has a method called process_zumma_results. Shouldn't that be process_zooma_results?

[DISCUSSION] Permissive or strict parser?

There is a good discussion to be had about how permissive or strict a parser for a file standard should be and, if it is permissive, which errors in the format should be tolerated and which should not. To me, the answer to this question for the case of SDRF files is not yet clear, and I would welcome a discussion about it with contributors and users of sdrf-pipelines. A list of questions follows (not comprehensive, at all):

  1. What are permissible errors?
    1. Can a trailing whitespace always be stripped? Or can a trailing whitespace have meaning?
    2. Can an empty line be tolerated? At the beginning? At the end? In the middle?
  2. Can we make valid assumptions about strings? Is the encoding UTF-8? Are file names supposed to be composed of only a limited charset?
    1. Filenames?
    2. Column names?
    3. Fields in the SDRF table?
  3. How thoroughly is the content checked?
    1. Are empty fields allowed? Or filled with some value?
    2. Do we need the same number of value X and Y in a column?
    3. Is invalid content detected? e.g. labelling information in a fraction column?
  4. Which detected issues are how severe?
    1. What do they affect?
    2. Should they be handled silently, trigger warning or raise an error?

Some of these questions have clear answers, others not so much. I would very much welcome a discussion around and about them.

These questions might also have different answers for different use cases. The SDRF is a tool expected to be applied in a broad range of environments and use cases. Discussing these questions will help us anticipate the requirements better and help in the design and implementation of the next iteration of the sdrf-pipelines package.

Since I am new to this project, such a discussion will also help me get going with contributions. Or in other words, keep me from straying into territories that are better left uncharted.

[Feature request] Make parsed SDRF data usable

Once SDRF information is parsed and validated it should be possible to use it in a convenient way.

Possible use cases are:

  • exposing it via an API for consumption in scripts or
  • writing it to a file.

Benefits

Python-based proteomics data analysis tools wouldn't need to reinvent the wheel, i.e. the SDRF parsing/validation.

A somewhat permissive parser could be combined with a writer that enforces strict adherence to the SDRF standard. This would have obvious benefits for anyone creating SDRF files, as it would allow them to create byte-perfect SDRF files.
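
For illustration, a consuming script might look like this; the class and method names follow the internals visible in the tracebacks on this page and should be treated as assumptions rather than a stable public API:

from sdrf_pipelines.sdrf.sdrf import SdrfDataFrame

sdrf = SdrfDataFrame.parse("sdrf.tsv")   # parse (assumed entry point) ...
errors = sdrf.validate("default")        # ... validate against a template ...
if not errors:
    print(sdrf["source name"].unique())  # ... then use it like a pandas DataFrame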

[Feature request] Allow "merging" of technical replicates in MQ exporter

E.g. I would like to be able to construct an mqpar.xml with the following:

	<experiments>
		<string>Sample1</string>
		<string>Sample1</string>
		<string>Sample1</string>
		<string>Sample2</string>
		<string>Sample2</string>
		<string>Sample2</string>
...
	</experiments>

Instead of the current:

	<experiments>
		<string>Sample1_Tr_1</string>
		<string>Sample1_Tr_2</string>
		<string>Sample1_Tr_3</string>
		<string>Sample2_Tr_1</string>
		<string>Sample2_Tr_2</string>
		<string>Sample2_Tr_3</string>
...
	</experiments>

Whether this should be the default is debatable I guess.

[Feature request / bug] Missing warning for default value selection

Currently, when using an SDRF file containing a precursor mass tolerance in the opposite unit of the fragment mass tolerance, the converter defaults to a set value depending on the unit of the fragment mass tolerance.

It would be nice to have a warning message, just as there is when no precursor or fragment tolerance is provided at all.
These are the lines:
        if 'comment[precursor mass tolerance]' in row:
            pc_tol_str = row['comment[precursor mass tolerance]']
            if "ppm" in pc_tol_str or "Da" in pc_tol_str:
                pc_tmp = pc_tol_str.split(" ")
                file2pctol[raw] = pc_tmp[0]
                file2pctolunit[raw] = pc_tmp[1]
            else:
                warning_message = "Invalid precursor mass tolerance set. Assuming 4.5 ppm."
                self.warnings[warning_message] = self.warnings.get(warning_message, 0) + 1
                file2pctol[raw] = "4.5"
                file2pctolunit[raw] = "ppm"
        else:
            warning_message = "No precursor mass tolerance set. Assuming 4.5 ppm."
            self.warnings[warning_message] = self.warnings.get(warning_message, 0) + 1
            file2pctol[raw] = "4.5"
            file2pctolunit[raw] = "ppm"

        if 'comment[fragment mass tolerance]' in row:
            f_tol_str = row['comment[fragment mass tolerance]']
            f_tol_str.replace("PPM", "ppm")  # workaround
            if "ppm" in f_tol_str:
                f_tmp = f_tol_str.split(" ")
                file2fragtol[raw] = f_tmp[0]
                file2fragtolunit[raw] = f_tmp[1]
                if "Da" in file2pctolunit[raw]:
                    file2pctol[raw] = "4.5"
                    file2pctolunit[raw] = "ppm"
            elif "Da" in f_tol_str:
                f_tmp = f_tol_str.split(" ")
                file2fragtol[raw] = f_tmp[0]
                file2fragtolunit[raw] = f_tmp[1]
                if "ppm" in file2pctolunit[raw]:
                    file2pctol[raw] = "0.01"
                    file2pctolunit[raw] = "Da"
            else:
                warning_message = "Invalid fragment mass tolerance set. Assuming 20 ppm."
                self.warnings[warning_message] = self.warnings.get(warning_message, 0) + 1
                file2fragtol[raw] = "20"
                file2fragtolunit[raw] = "ppm"
        else:
            warning_message = "No fragment mass tolerance set. Assuming 20 ppm."
            self.warnings[warning_message] = self.warnings.get(warning_message, 0) + 1
            file2fragtol[raw] = "20"
            file2fragtolunit[raw] = "ppm"

Furthermore, the setting for the unit of the two tolerances (ppm or Da) is hard-coded to always be True and does not depend on the submitted unit:

searchTolInPpm = doc.createElement('searchTolInPpm')
searchTolInPpm.appendChild(doc.createTextNode('True'))

It would be beneficial to have a check that selects either True or False depending on the unit of the two.
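
A sketch of such a check, reusing the variable names from the snippets above:

# Derive searchTolInPpm from the actual precursor tolerance unit instead of
# hard-coding 'True'.
in_ppm = file2pctolunit[raw].lower() == "ppm"
searchTolInPpm = doc.createElement('searchTolInPpm')
searchTolInPpm.appendChild(doc.createTextNode(str(in_ppm)))  # 'True' or 'False'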

problems with dependency in conda

@daichengxin the newly introduced dependency pyecharts is stopping the project from being built in conda, because this dependency is not in conda. Can we use another library for this that is at least in conda-forge?

maxquant parameter file cannot have fragment and precursor tolerance in different units

The sdrf-parser runs into problems when the tolerances are given in different units (Da and ppm), as MaxQuant needs them to be the same.

I am not sure about the actual outcome, but when running an SDRF file with the fragment tolerance given in Da, the MaxQuant parameter <searchTolInPpm>True</searchTolInPpm> is still set to ppm. This means that the tolerance will most likely be wrongly translated.

And there is no warning or anything of the sort.

You might need to convert the units.

[Pitch] Partial keep raw in OpenMS

Hello,

I am working on a project where I would like to convert an SDRF to an OpenMS workflow while keeping the file extension as is, but only for some files.

(I am also implementing the workflow... nf-core/quantms#64 (comment))

Ideas:

parse_sdrf convert-openms {other options/args} --raw {.d/.raw/all/none}
# --raw .d would keep the raw extension if it is .d, and make it .mzML otherwise
# --raw all would keep all as raw
# --raw none would convert all to mzML
parse_sdrf convert-openms {other options/args} --extension_convert "raw:mzML,mzml:mzML,mzML:mzML,d:d"

# I am not that creative with names ...
# if --extension_convert is set, it will try to replace with the passed options
# and error out if a file does not match

# In this example
# specs.raw -> specs.mzML
# specs.mzml -> specs.mzML
# specs.mzML -> specs.mzML
# specs.d -> specs.d
# specs.docx -> nothing, ERROR raised

Let me know what you think, and whether this feature is something you would want me to make a PR for.

Places where changes would go:

if not keep_raw:
    ext = os.path.splitext(raw)
    out = ext[0] + ".mzML"
else:
    out = raw

(there is another place in the same file)

https://github.com/bigbio/sdrf-pipelines/blob/adf9279a5c1c03d578575c4a89dfd535af8715c2/sdrf_pipelines/parse_sdrf.py#L37C1-L37C1
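
A sketch of what the --extension_convert handling could look like (a hypothetical helper, not existing code):

import os

def apply_extension_map(filename: str, mapping_spec: str) -> str:
    # mapping_spec like "raw:mzML,mzml:mzML,mzML:mzML,d:d"
    mapping = dict(pair.split(":", 1) for pair in mapping_spec.split(","))
    stem, ext = os.path.splitext(filename)
    ext = ext.lstrip(".")
    if ext not in mapping:
        raise ValueError(f"No conversion rule for extension {ext!r} ({filename})")
    return stem + "." + mapping[ext]

# apply_extension_map("specs.raw", "raw:mzML,d:d")   -> "specs.mzML"
# apply_extension_map("specs.docx", "raw:mzML,d:d")  -> ValueError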

OpenMS Exporter recently broke

Hi,

Recent commits broke the OpenMS converter. It has to do with making the conditions a semicolon-separated list and only putting the first element into the OpenMS experimental design.
E.g. for PXD0001819 we now get (for the sample table):

Sample  MSstats_Condition       MSstats_BioReplicate
1       CT=mixture      1
2       CT=mixture      2
3       CT=mixture      3
4       CT=mixture      4
5       CT=mixture      5
6       CT=mixture      6
7       CT=mixture      7
8       CT=mixture      8
9       CT=mixture      9

But we would like to have either the full entry "CT=mixture;QN=500amol...." or, maybe even better for readability, only the values of this "dictionary" (e.g. something like "mixture UPS1 500amol").
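
For reference, extracting only the values from the key=value;key=value form could be as simple as this sketch:

condition = "CT=mixture;QN=500amol"
values_only = " ".join(part.split("=", 1)[1] for part in condition.split(";"))
print(values_only)  # mixture 500amol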
This has quite a high priority since it breaks the whole pipeline, so go ahead with what is easiest/quickest for now.
Thanks!

[BUG] openms/unimod.py

UnimodDatabase.modifications is a list but gets treated as if it were a dict.

pylint --disable all --enable E sdrf_pipelines
************* Module sdrf_pipelines.openms.unimod
sdrf_pipelines/openms/unimod.py:109:10: E1101: Instance of 'list' has no 'get' member (no-member)
sdrf_pipelines/openms/unimod.py:119:13: E1101: Instance of 'list' has no 'keys' member (no-member)
sdrf_pipelines/openms/unimod.py:126:10: E1101: Instance of 'list' has no 'get' member (no-member)
sdrf_pipelines/openms/unimod.py:139:10: E1101: Instance of 'list' has no 'get' member (no-member)
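
One possible direction for a fix is sketched below; the accessor used as the dict key is an assumption, since the Modification fields are not shown here:

# Hypothetical fix: index the modification list by name once, so dict-style
# lookups (.get(), .keys()) work as the calling code expects.
mods_by_name = {mod.get_name(): mod for mod in unimod_db.modifications}  # key accessor assumed
oxidation = mods_by_name.get("Oxidation")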

conda implementation misses file

When using conda version 0.05, the openms folder seems to lack the unimod.xml file, resulting in an error when running parse_sdrf convert-openms.
