taxprofiler / taxpasta

TAXonomic Profile Aggregation and STAndardisation

Home Page: https://taxpasta.readthedocs.io/

License: Apache License 2.0

metagenomics python standardisation bioinformatics profiling classification metagenomic-classification taxonomic-classifications taxonomic-profiling

taxpasta's Introduction

TAXonomic Profile Aggregation and STAndardisation

About

The main purpose of taxpasta is to standardise taxonomic profiles created by a range of bioinformatics tools. We call those tools taxonomic profilers. They each come with their own particular tabular output format. Across the profilers, relative abundances can be reported in read counts, fractions, or percentages, as well as any number of additional columns with extra information. We therefore decided to take the lessons learnt to heart and provide our own solution to deal with this pasticcio. With taxpasta you can ingest all of those formats and, at a minimum, output taxonomy identifiers and their integer counts. Taxpasta can not only standardise profiles but also merge them across samples for the same profiler into a single table.

Supported Taxonomic Profilers

Taxpasta currently supports standardisation and generation of comparable taxonomic tables for:

See supported profilers for more information.

Install

It's as simple as:

pip install taxpasta

Taxpasta is also available from the Bioconda channel

conda install -c bioconda taxpasta

and thus automatically generated Docker and Singularity BioContainers images also exist.

Optional Dependencies

Taxpasta supports a number of extras that you can install for additional features; primarily support for additional output file formats. You can install them by specifying a comma separated list within square brackets, for example,

pip install 'taxpasta[rich,biom]'

rich provides rich-formatted command line output and logging.
arrow supports writing output tables in Apache Arrow format.
parquet supports writing output tables in Apache Parquet format.
biom supports writing output tables in BIOM format.
ods supports writing output tables in ODS format.
xlsx supports writing output tables in Microsoft Excel format.
all includes all of the above.
dev provides all tools needed for contributing to taxpasta.

Usage

The main entry point for taxpasta is its command-line interface (CLI). You can interactively explore the offered commands through the help system.

taxpasta -h

Taxpasta currently offers two commands corresponding to the main use-cases. You can find out more in the commands' documentation.

Standardise

Since the supported profilers all produce their own flavour of tabular output, a quick way to normalise such files is to standardise them with taxpasta. You need to let taxpasta know what tool the file was created by. As an example, let's standardise a MetaPhlAn profile. (You can find an example file in our test data.)

curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/metaphlan/MOCK_002_Illumina_Hiseq_3000_se_metaphlan3-db.metaphlan3_profile.txt
taxpasta standardise -p metaphlan -o standardised.tsv MOCK_002_Illumina_Hiseq_3000_se_metaphlan3-db.metaphlan3_profile.txt

With these minimal arguments, taxpasta produces a two column output consisting of

taxonomy_id	count

You can count on the second column being integers 😉. Having such a simple and tidy table should make your downstream analysis much smoother to start out with. Please have a look at the full getting started tutorial for a more thorough introduction.
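As a minimal sketch of a downstream step, the two-column table can be read with Python's standard library. Only the column names taxonomy_id and count come from the output described above; the sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical contents of standardised.tsv; real values depend on your profile.
STANDARDISED_TSV = "taxonomy_id\tcount\n562\t120\n1280\t45\n"


def read_standardised(handle):
    """Return a dict mapping taxonomy ID (str) to integer count."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["taxonomy_id"]: int(row["count"]) for row in reader}


counts = read_standardised(io.StringIO(STANDARDISED_TSV))
total = sum(counts.values())
print(counts, total)
```

For a real file, replace the io.StringIO object with open("standardised.tsv", newline="").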

Merge

Converting single tables is nice, but hopefully you have many shiny samples to analyze. The taxpasta merge command works similarly to standardise except that you provide multiple profiles as input. You can grab a few more 'MOCK' examples from our test data and try it out.

LOCATION=https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/metaphlan
curl -O "${LOCATION}/MOCK_001_Illumina_Hiseq_3000_se_metaphlan3-db.metaphlan3_profile.txt"
curl -O "${LOCATION}/MOCK_002_Illumina_Hiseq_3000_se_metaphlan3-db.metaphlan3_profile.txt"
curl -O "${LOCATION}/MOCK_003_Illumina_Hiseq_3000_se_metaphlan3-db.metaphlan3_profile.txt"

taxpasta merge -p metaphlan -o merged.tsv MOCK_*.metaphlan3_profile.txt

The output of the merge command has one column for the taxonomic identifier and one more column for each input profile. Again, have a look at the full getting started tutorial for a more thorough introduction.
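The wide layout described above (one identifier column plus one column per profile) can be reshaped into long form for tools that expect tidy data. The following is a sketch under assumptions: the sample column names and counts are hypothetical, and only the taxonomy_id column name is taken from the text.

```python
import csv
import io

# Hypothetical merged.tsv; sample column names and counts are invented.
MERGED_TSV = (
    "taxonomy_id\tMOCK_001\tMOCK_002\n"
    "562\t10\t0\n"
    "1280\t3\t7\n"
)


def wide_to_long(handle):
    """Yield (taxonomy_id, sample, count) tuples from a wide table."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column except the identifier is treated as one sample.
    samples = [name for name in reader.fieldnames if name != "taxonomy_id"]
    for row in reader:
        for sample in samples:
            yield row["taxonomy_id"], sample, int(row[sample])


long_rows = list(wide_to_long(io.StringIO(MERGED_TSV)))
for record in long_rows:
    print(record)
```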

Citation

If you use TAXPASTA in your academic work, please cite our article in the Journal of Open Source Software.

Beber, M. E., Borry, M., Stamouli, S., & Fellows Yates, J. A. (2023). TAXPASTA: TAXonomic Profile Aggregation and STAndardisation. Journal of Open Source Software, 8(87), 5627. https://doi.org/10.21105/joss.05627

Acknowledgments

Many thanks to:

nf-core for bringing together the original developers
Zandra Fagernäs for the logo design

Copyright

Copyright © 2022-2024, Moritz E. Beber, Maxime Borry, James A. Fellows Yates, and Sofia Stamouli.
Free software distributed under the Apache Software License 2.0.

taxpasta's Issues

Document a curl command to download sample files in the readme

Based on pyOpenSci/software-submission#84 (comment)

New Functionality: merge

If taxpasta runs on a per-sample basis for standardising outputs, we should offer the ability to merge each standardised output (of a given tool/database) into a single table containing all samples

Add Python 3.11 to the test matrix

[BUG] inconsistency formatting behaviour between standardise and merge

Is there an existing issue for this?

I have searched the existing issues

Problem description

I noticed while writing the tutorial that for standardise the output header column is named count, whereas in merge it is the file name.

I wonder if we should match the behaviour between the two, so both merge and standardise use the same column header format.

However, I now wonder whether we could even collapse the two commands into one: simply have standardise, which can accept one or more profiles (if more than one is provided, all are merged automatically). But then someone may wish to merge profiles themselves later on, so maybe it is safer as it is.

Code sample

Code run:

Traceback:

Environment

Anything else?

For example, if I merge output of standardise of one tool, and merge of another tool

taxonomy_id	2612_pe-ERR5766176-db_mOTU	2612_se-ERR5766180-db_mOTU	count
40518	20	2	NA
216816	1	0	NA
1680	6	1	NA
1262820	1	0	NA
74426	2	1	NA
1907654	1	0	NA
1852370	3	1	NA
39491	3	0	NA
33039	2	0	NA
39486	1	0	NA

Where count was from a standardise run on Kraken output
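To illustrate why the generic header is awkward, here is a small sketch of joining a merge result with a standardise result after renaming count to a sample-specific name. This is not taxpasta code: the helper functions, the name kraken2_sample, and the Kraken row are hypothetical, while the mOTUs values come from the table above.

```python
# mOTUs values are taken from the table above; the Kraken row is invented.
merged = {
    "40518": {"2612_pe-ERR5766176-db_mOTU": 20, "2612_se-ERR5766180-db_mOTU": 2},
}
standardised = {"9606": {"count": 15}}


def rename_count(table, sample_name):
    """Replace the generic 'count' column with a sample-specific name."""
    return {
        tax_id: {sample_name if key == "count" else key: value
                 for key, value in row.items()}
        for tax_id, row in table.items()
    }


def outer_join(left, right):
    """Join two {taxonomy_id: {column: value}} tables, filling gaps with 'NA'."""
    columns = []
    for table in (left, right):
        for row in table.values():
            for name in row:
                if name not in columns:
                    columns.append(name)
    tax_ids = list(left) + [tax_id for tax_id in right if tax_id not in left]
    joined = {}
    for tax_id in tax_ids:
        row = {**left.get(tax_id, {}), **right.get(tax_id, {})}
        joined[tax_id] = {name: row.get(name, "NA") for name in columns}
    return joined


joined = outer_join(merged, rename_count(standardised, "kraken2_sample"))
print(joined)
```

With a named column, each value in the joined table can at least be traced back to its source profile.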

[BUG] taxpasta deletes nodes/names.dmp from --taxonomy dir

Is there an existing issue for this?

I have searched the existing issues

Problem description

Twice now, I've Ctrl-C'd taxpasta when I made some simple mistake, and then found that both nodes.dmp and names.dmp were deleted from the dir provided with --taxonomy. This does not happen when I let it complete successfully, and might happen upon a crash?

I dug in to see if I could find the issue, but AFAICT taxpasta uses the keep_files=True option to taxopy's TaxDB(), so in theory no files should be deleted. Perhaps there's an issue within taxopy whereby a KeyboardInterrupt causes the deletion to be triggered even when not asked for (e.g. an incorrect try/except block). Again I couldn't see anything obvious, but perhaps the author (@apcamargo) has wise words
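Until the cause is understood, one defensive workaround is to run against a disposable copy of the taxonomy files. The sketch below is an assumption-laden illustration, not part of taxpasta or taxopy: copy_taxonomy is a hypothetical helper and the demonstration uses placeholder files instead of a real taxdump.

```python
import shutil
import tempfile
from pathlib import Path


def copy_taxonomy(source_dir, files=("nodes.dmp", "names.dmp")):
    """Copy the taxonomy dump files into a fresh temporary directory."""
    target = Path(tempfile.mkdtemp(prefix="taxonomy_copy_"))
    for name in files:
        shutil.copy2(Path(source_dir) / name, target / name)
    return target


# Demonstration with placeholder files standing in for a real taxdump.
source = Path(tempfile.mkdtemp(prefix="taxonomy_src_"))
for name in ("nodes.dmp", "names.dmp"):
    (source / name).write_text("placeholder\n")

working_copy = copy_taxonomy(source)
copied = sorted(path.name for path in working_copy.iterdir())
print(working_copy, copied)
```

The path of the working copy can then be passed to --taxonomy, so an interrupted run can only affect the copy.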

Code sample

Code run:

taxpasta merge \
    --long  \
    --add-name  \
    --taxonomy /db/NCBI_taxonomy/2023-04-18  \
    --summarise-at species  \
    --output-format csv  \
    -o all.csv  \
    --profiler bracken output/kraken/*bracken.txt

Traceback:

(none)

Environment

version 0.3.0 from conda

Anything else?

No response

Research: default output formats for each tool - per sample vs merged

Tool	Supported
MALT	Via custom script on rma2info output
KRAKEN2
CENTRIFUGE
METAPHLAN3	merge_abundance_tables.py
KAIJU	kaiju-mergeOutput
DIAMOND
MOTUS	motus-merge

Auto-generate help output for command quick reference

We can include other files into the documentation. Thus it should be possible to auto-generate the help output for each command. This would ensure always up-to-date quick reference docs.

Add support for new metagenomic tool: MALT

MALT: uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/malt
Default output:
RMA6 file, which is transformed in gzip compressed text file using rma2info https://github.com/nf-core/taxprofiler/blob/dev/modules/nf-core/modules/megan/rma2info/meta.yml

Avoid type coercion for domain models

We are in full control of the creation of domain models, hence, we should not allow domain models to coerce their input types. This increases overall type strictness.

Rename MALT to MEGAN6

Problem description

Even though we are using MALT in taxprofiler, it produces an RMA6 format file, which is intended for use in MEGAN6.

The current input files that we parse here in taxpasta are actually from a MEGAN6 utility tool called rma2info.

I propose we rename all references to MALT to MEGAN6, as this more accurately describes where you get the data file type from. Furthermore, MEGAN6 is much more widely used than MALT (which stands for 'MEGAN ALignment Tool' btw), so it would be more accessible from that point of view.

Research: list of tools and which taxonomies they support

GTDB vs NCBI for example? Fix/non-fixed

Tool	Default Supported	via	profile identifier output formats	Supports raw read counts
MALT	NCBI only (limitation of having to use version 0.4 due to broken LCA in 0.5)	custom MEGAN mapping file (0.4: `.abin`, 0.5: `megan*.db`)	Taxonomy ID Nr. OR Taxonomy name OR Taxonomy path (via MEGAN's rma2info)	☑️
KRAKEN2	NCBI/Greengenes/RDP/SILVA	`names.dmp/nodes.dmp/.accession2taxid` / Custom taxonomy	Taxonomy name, Taxonomy ID Nr., Path (`--use-map-style`)	☑️(? fragment == read or just kmer?)
CENTRIFUGE	NCBI	`nodes.dmp/names.dmp/.accession2taxid` (other taxonomies, same structure)	Taxonomy ID, Taxonomy Name (different files, also supports kraken-style report - although presumably without mpa-style)	☑️
METAPHLAN3	CHOCOPhlAn	mpa_v31_CHOCOPhlAn_201901_marker_info.txt.bz2 (containing NCBI tax ID)	rank, NCBI tax ID	☑️ (with `-t rel_ab_w_read_stats`)
KAIJU	NCBI	`nodes.dmp/names.dmp`	Taxonomy ID Nr., Taxon names (via `kaiju2table` or `kaiju-addTaxonNames`)	☑️(? fragment == read or just kmer?)
DIAMOND	NCBI	prot.accession2taxid.FULL.gz, `nodes.dmp`, `names.dmp`	Taxonomy ID	Not directly, only by summarising
MOTUS	NCBI	custom NCBI database file (genomes.tax)	Taxonomy Name, Taxonomy ID Nr., Taxonomy levels	☑️ (`-c` or `-M`?)
KrakenUniq	NCBI	Custom `taxDB` file but has Taxonomy ID, parent, and scientific name	see Kraken (+extra)	☑️(? fragment == read or just kmer?)
Bracken	See Kraken	See Kraken	Taxonomy ID, Scientific Name	☑️

Note tax_from_gtdb.py from GTDB or even

Issue regarding Taxpasta application in the kaiju classification tables.[BUG]

Is there an existing issue for this?

I have searched the existing issues

Problem description

Issue regarding Taxpasta application in the kaiju classification tables.

Code sample

Code run:

nextflow run nfcore/taxprofiler/. -profile hasta,dev_prio,singularity --input samplesheet.csv --databases databases.csv --outdir kaiju/nfcore_default --save_preprocessed_reads --perform_shortread_qc --shortread_qc_mergepairs --perform_shortread_complexityfilter --save_complexityfiltered_reads --perform_longread_qc --perform_shortread_hostremoval --perform_longread_hostremoval --hostremoval_reference /home/proj/development/microbial/metagenomics/lili/references/GCF_000001405.39_GRCh38.p13_genomic.fna --save_hostremoval_index --save_hostremoval_mapped --save_hostremoval_unmapped --perform_runmerging --save_runmerged_reads --run_kaiju --kaiju_taxon_rank species --run_profile_standardisation --run_krona --standardisation_taxpasta_format tsv --taxpasta_taxonomy_dir database/taxonomy --taxpasta_add_name --taxpasta_add_rank -ansi-log false

Traceback:

nf-core/taxprofiler execution completed unsuccessfully!
The exit status of the task that caused the workflow execution to fail was: 1.

The full error message was:

Error executing process > 'NFCORE_TAXPROFILER:TAXPROFILER:STANDARDISATION_PROFILES:TAXPASTA_MERGE (kaiju)'

Caused by:
  Process `NFCORE_TAXPROFILER:TAXPROFILER:STANDARDISATION_PROFILES:TAXPASTA_MERGE (kaiju)` terminated with an error exit status (1)

Command executed:

  taxpasta merge \
      -p kaiju -o kaiju_kaiju.tsv --taxonomy database/taxonomy --add-name --add-rank \
       \
       \
      sample1_se_kaiju.kaijutable.txt sample2_se_kaiju.kaijutable.txt sample3_se_kaiju.kaijutable.txt sample4_se_kaiju.kaijutable.txt sampel5_se_kaiju.kaijutable.txt sample6_se_kaiju.kaijutable.txt sample7_se_kaiju.kaijutable.txt sample8_se_kaiju.kaijutable.txt sample9_se_kaiju.kaijutable.txt sample10_se_kaiju.kaijutable.txt sample11_se_kaiju.kaijutable.txt sample12_se_kaiju.kaijutable.txt
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_TAXPROFILER:TAXPROFILER:STANDARDISATION_PROFILES:TAXPASTA_MERGE":
      taxpasta: $(taxpasta --version)
  END_VERSIONS

Command exit status:
  1

Command output:
  [18:21:48] CRITICAL Error in sample 'sample1_se_kaiju.kaijutable'  merge.py:422
                      with profile                                                
                      'sample1_se_kaiju.kaijutable.txt'.                         
             CRITICAL   schema_context   column             check     merge.py:425
                      check_number  failure_case index                            
                      0         Column  percent  compositionality                 
                      2         False  None                                       

Command wrapper:
  [18:21:48] CRITICAL Error in sample 'sample1_se_kaiju.kaijutable'  merge.py:422
                      with profile                                                
                      'sample1_se_kaiju.kaijutable.txt'.                         
             CRITICAL   schema_context   column             check     merge.py:425
                      check_number  failure_case index                            
                      0         Column  percent  compositionality                 
                      2         False  None                                       

Work dir:
  work/9d/97336e25a9468c7fa536cddd606eb8

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Environment

Anything else?

Here is the header of the sample1_se_kaiju.kaiju.tsv.

file percent reads taxon_id taxon_name
sample1_se_kaiju.kaiju.tsv 2.190930 76859 727 taxonid:727
sample1_se_kaiju.kaiju.tsv 1.868985 65565 470 taxonid:470
sample1_se_kaiju.kaiju.tsv 1.720755 60365 90241 taxonid:90241
sample1_se_kaiju.kaiju.tsv 1.217598 42714 1906665 taxonid:1906665
sample1_se_kaiju.kaiju.tsv 1.048986 36799 664683 taxonid:664683

[BUG] Certain parser problems are only warnings

Certain parser issues should be raised as exceptions instead of warnings. Reported in pyOpenSci/software-submission#84 (comment).

taxpasta merge -p motus -o /tmp/motus.tsv \
    tests/data/motus/2612_pe-ERR5766176-db_mOTU.out \
    tests/data/motus/2612_se-ERR5766180-db_mOTU.out \
    tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt

Add repostatus.org badge

Based on review at pyOpenSci/software-submission#84 (comment)

[Feature] Reorder columns after merging when options "--add-name/rank/lineage" and wide format are selected

Checklist

There are no similar issues or pull requests for this yet.

Problem

When merging multiple profiles of Kraken2, the current workflow is that first all profiles are merged and then second the additional taxonomic information is added. This leads to the case that when selecting the output format "wide", the columns containing the additional taxonomic information, such as name, rank, or lineage, are appearing at the far right side of the table. For very wide tables with many samples, this requires a lot of scrolling before finding the additional information on the taxonomic ID.

Solution

I think that reordering the columns before saving the table to file, so that all taxonomy information appears in the first columns, would make the table produced by taxpasta more user-friendly because the taxonomic information is immediately visible.
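The proposed reordering could look roughly like the sketch below. It is only an illustration of the idea: the column names name, rank, and lineage are assumed from the option names in the issue title, and reorder_columns is a hypothetical helper, not an existing taxpasta function.

```python
def reorder_columns(header, taxonomy_columns=("taxonomy_id", "name", "rank", "lineage")):
    """Move taxonomy-related columns to the front, keeping sample order."""
    front = [column for column in taxonomy_columns if column in header]
    rest = [column for column in header if column not in front]
    return front + rest


def reorder_row(row, header, new_header):
    """Rearrange one row of values to match the new column order."""
    by_name = dict(zip(header, row))
    return [by_name[column] for column in new_header]


# Hypothetical wide table with taxonomy columns at the far right.
header = ["taxonomy_id", "sample_a", "sample_b", "name", "rank"]
row = ["562", 10, 4, "Escherichia coli", "species"]

new_header = reorder_columns(header)
new_row = reorder_row(row, header, new_header)
print(new_header)
print(new_row)
```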

Alternatives

No response

Anything else?

No response

Support tool: MetaPhlAn3

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier MetaPhlAn3

Additional context

Support tool: kaiju

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier Kaiju

Additional context

Expose the diamond profiler

Diamond profile is already supported but not yet exposed in CLI.

Support tool: mOTUs

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier mOTUs

Additional context

Duplicate taxIDs in mOTUs

From a comment in mOTUs PR so we do not forget:

Apparently, mOTUs profiles can contain duplicate tax IDs. Clarify with Sofia and Maxime. For now, sum up read counts.
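The interim approach of summing read counts for duplicate tax IDs can be sketched as follows. The records are invented and sum_duplicate_taxa is a hypothetical helper; the actual implementation in taxpasta may differ.

```python
from collections import defaultdict


def sum_duplicate_taxa(records):
    """Collapse (taxonomy_id, count) pairs, summing counts of repeated IDs."""
    totals = defaultdict(int)
    for tax_id, count in records:
        totals[tax_id] += count
    return dict(totals)


# Hypothetical profile in which tax ID 1680 occurs twice.
records = [("1680", 6), ("40518", 20), ("1680", 1)]
collapsed = sum_duplicate_taxa(records)
print(collapsed)
```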

[Feature] Add support for ganon 'tre' report

Checklist

There are no similar issues or pull requests for this yet.

Problem

taxprofiler should support ganon output

Solution

Support

https://pirovc.github.io/ganon/outputfiles/#ganon-report

and/or

https://pirovc.github.io/ganon/outputfiles/#ganon-classify

(the output is the same)

Alternatives

No response

Anything else?

No response

Add taxonomy to BIOM format

BIOM expects taxonomic lineages in full rectangular format, i.e., all lineages need to be padded to the same length.
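A minimal sketch of the padding step, assuming lineages are held as lists of rank names and that an empty string is an acceptable fill value (both are assumptions, not something the BIOM format or taxpasta prescribes here):

```python
def pad_lineages(lineages, fill=""):
    """Pad every lineage on the right so that all have the same length."""
    width = max((len(lineage) for lineage in lineages), default=0)
    return [list(lineage) + [fill] * (width - len(lineage)) for lineage in lineages]


# Lineage names borrowed from the MetaPhlAn example elsewhere in this page.
lineages = [
    ["Bacteria", "Actinobacteria", "Corynebacteriales"],
    ["Bacteria"],
]
padded = pad_lineages(lineages)
print(padded)
```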

[BUG] Bracken compositionality tolerance is too strict

Is there an existing issue for this?

I have searched the existing issues

Problem description

Reported by @alexhbnr

I would like to merge multiple Bracken output tables. Prior to merging them, taxpasta performs a check whether the relative abundances that Bracken reports add up to 1. The default threshold for rounding errors is 0.001. However, from my 32 profiled samples, the sum of all relative abundances is less than 0.999 for 25 of them and taxpasta throws a critical error. Since all of the samples have a total relative abundance of 0.995, I would personally consider the samples perfectly fine for merging and would like taxpasta to continue anyway.
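The check described above amounts to comparing the sum of fractions against 1 with an absolute tolerance. The sketch below shows that idea only; is_compositional is a hypothetical function and the fractions are invented to reproduce the reported total of 0.995.

```python
import math


def is_compositional(fractions, tolerance=0.001):
    """Return True if the fractions sum to 1 within an absolute tolerance."""
    return math.isclose(sum(fractions), 1.0, rel_tol=0.0, abs_tol=tolerance)


# Hypothetical sample whose relative abundances sum to 0.995.
fractions = [0.5, 0.3, 0.195]

strict = is_compositional(fractions)                   # default tolerance 0.001
relaxed = is_compositional(fractions, tolerance=0.01)  # looser tolerance
print(strict, relaxed)
```

With the default tolerance the sample is rejected, while a tolerance of 0.01 accepts it, which is the behaviour the reporter asks to be able to choose.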

Code sample

No response

Environment

Package	Version
taxpasta	0.2.2

Anything else?

No response

Document that setting the output format overrules any given file extension

Based on pyOpenSci/software-submission#84 (comment)

Support tool: centrifuge

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier: centrifuge

Additional context

Add support for kaiju-mergeOutputs

Oops wrong repo

Make taxonomy identifiers integer always

Basically, we expect a taxonomy created in a form similar to taxonkit's taxdump.

[BUG] Some internal links on readthedocs broken

Is there an existing issue for this?

I have searched the existing issues

Problem description

pyOpenSci/software-submission#84 (comment)

Code sample

Code run:

Traceback:

Environment

Anything else?

No response

Document extras in installation docs

Based on comment from pyOpenSci/software-submission#84

Should we support the OPAL format as output?

In order to use the OPAL tool for analysis and visualization, it might be useful to convert any supported profiler to that format.

[BUG] Improve docs on tutorial data & warning against cross-profile merging

Is there an existing issue for this?

I have searched the existing issues

Problem description

Based on reviewer here pyOpenSci/software-submission#84 (comment)

To do:

Provide more biological context on example data
Add more docs warnings about merging across profilers
Give examples of how to merge across profilers if a user really wants

Add support for new metagenomic tool: metaphlan3

metaphlan3: github.com/biobakery/MetaPhlAn

Default output:

#mpa_v30_CHOCOPhlAn_201901
#/n/huttenhower_lab/tools/metaphlan3/bin/metaphlan SRS014476-Supragingival_plaque.fasta.gz --input_type fasta
#SampleID       Metaphlan_Analysis
#clade_name     NCBI_tax_id     relative_abundance      additional_species
k__Bacteria     2       100.0   
k__Bacteria|p__Actinobacteria   2|201174        100.0   
k__Bacteria|p__Actinobacteria|c__Actinobacteria 2|201174|1760   100.0   
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales    2|201174|1760|85007     65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales        2|201174|1760|85006     34.74319        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae      2|201174|1760|85007|1653        65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae      2|201174|1760|85006|1268        34.74319        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae|g__Corynebacterium   2|201174|1760|85007|1653|1716   65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae|g__Rothia    2|201174|1760|85006|1268|32207  34.74319        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Corynebacteriales|f__Corynebacteriaceae|g__Corynebacterium|s__Corynebacterium_matruchotii    2|201174|1760|85007|1653|1716|43768     65.25681        
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Micrococcaceae|g__Rothia|s__Rothia_dentocariosa     2|201174|1760|85006|1268|32207|2047     34.74319        k__Bacteria|p__Actinobacteria|
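Reading this format can be sketched in a few lines. The assumptions here are that fields are tab-separated (the listing above shows them as runs of spaces), that lines starting with # are header comments, and that the last entry of the NCBI_tax_id lineage identifies the clade; parse_metaphlan is a hypothetical helper, not taxpasta's reader.

```python
def parse_metaphlan_line(line):
    """Split one data line into (clade_name, taxonomy_id, relative_abundance)."""
    fields = line.rstrip("\n").split("\t")
    clade_name, lineage_ids, abundance = fields[0], fields[1], fields[2]
    # The lineage column lists one ID per rank; the last one is the clade itself.
    return clade_name, lineage_ids.split("|")[-1], float(abundance)


def parse_metaphlan(lines):
    """Parse all data lines, skipping '#' comments and blank lines."""
    return [
        parse_metaphlan_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Two data lines from the example above, with tabs restored by assumption.
example = [
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n",
    "k__Bacteria\t2\t100.0\t\n",
    "k__Bacteria|p__Actinobacteria\t2|201174\t100.0\t\n",
]
parsed = parse_metaphlan(example)
print(parsed)
```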

Output format: long (tidy) Taxon Table

Column one: Sample name
Column two: Species name
Column three: read count

This would be a good format for consensus building

Add parquet as output table option

Metaphlan3 output contains reads classified as taxid '-1' [BUG]

Is there an existing issue for this?

I have searched the existing issues

Problem description

Apparently this shouldn't happen! Values for the taxid are either 0 or 100000000 in all cases.

Occurred when running taxpasta through nf-core/taxprofiler - I don't have an independent taxpasta installation.

Code sample

nextflow run nf-core/taxprofiler \
    -resume \
    -r 1.0.0 \
    -profile scw \
    --input "inputs/samples.csv" \
    --databases "inputs/databases.csv" \
    --preprocessing_qc_tool "falco" \
    --run_kraken2 \
    --run_metaphlan3 \
    --run_motus \
    --outdir "results/"

Environment

N/A

Anything else?

No response

[Feature] Add support for Sourmash

I think sourmash is an interesting tool, as it is so fast in scanning vast libraries of genomes. We should add support for its output.

Support tool: bracken

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier Bracken

Additional context

[Feature] Allow to set threshold for compositionality check of Bracken results

Checklist

There are no similar issues or pull requests for this yet.

Problem

I would like to merge multiple Bracken output tables. Prior to merging them, taxpasta performs a check whether the relative abundances that Bracken reports add up to 1. The default threshold for rounding errors is 0.001. However, from my 32 profiled samples, the sum of all relative abundances is less than 0.999 for 25 of them and taxpasta throws a critical error. Since all of the samples have a total relative abundance of 0.995, I would personally consider the samples perfectly fine for merging and would like taxpasta to continue anyway.

Solution

I am currently not aware that there is an option to set this threshold oneself to avoid taxpasta failing for slightly sub-optimal data. Therefore, I would suggest to add a commandline option to allow to set the parameter for the user.

Alternatives

No response

Anything else?

No response

Support tool: DIAMOND

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier DIAMOND

Additional context

Add tests

Tests would be nice to have, both to ease the task of adding new taxonomic profilers and to support test-driven development (TDD).

@Midnighter Do you want to create a base example of tests for Kraken2?

[BUG] Getting started tutorial is too complex

Is there an existing issue for this?

I have searched the existing issues

Problem description

x-ref pyOpenSci/software-submission#84 (comment)

Should simplify to just main workflow

Current getting started should move to a 'deep dive' type tutorial

Code sample

No response

Environment

Anything else?

No response

Output format: Wide "Taxon Table"

First column: species
Subsequent columns: samples
Cells: (read) counts
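The layout above can be produced from long-format records with a simple pivot. This is a sketch under assumptions: the records are invented, long_to_wide is a hypothetical helper, and filling missing combinations with 0 is one possible choice.

```python
def long_to_wide(rows, fill=0):
    """Pivot (sample, taxon, count) records into a header and table body."""
    samples = []
    table = {}
    for sample, taxon, count in rows:
        if sample not in samples:
            samples.append(sample)
        table.setdefault(taxon, {})[sample] = count
    header = ["taxon"] + samples
    body = [
        [taxon] + [counts.get(sample, fill) for sample in samples]
        for taxon, counts in table.items()
    ]
    return header, body


# Hypothetical long-format records: (sample, taxon, read count).
rows = [
    ("sample_1", "Escherichia coli", 10),
    ("sample_2", "Escherichia coli", 4),
    ("sample_2", "Staphylococcus aureus", 7),
]
header, body = long_to_wide(rows)
print(header)
print(body)
```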

Support tool: KrakenUniq

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier KrakenUniq

Additional context

Support tool: MALT/MEGAN

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier MALT/MEGAN

Additional context

This might be slightly tricky due to the default format being a binary file (RMA6), but the MEGAN utility script rma2info can generate text files

Add the Zenodo webhook and badge

Based on comment from pyOpenSci/software-submission#84

[BUG] CRITICAL error when parsing MetaPhlAn4's output

Is there an existing issue for this?

I have searched the existing issues

Problem description

I'm getting the following error message when trying to parse MetaPhlAn4's outputs with taxpasta.

[21:40:18] CRITICAL Error in sample 'ERR7569997' with profile 'metaphlan/ERR7569997.txt'.          merge.py:422
           CRITICAL     schema_context  ... index                                                  merge.py:425
                    4  DataFrameSchema  ...  None
                    0           Column  ...     0
                    1           Column  ...  None
                    2           Column  ...  None
                    3           Column  ...  None

                    [5 rows x 6 columns]

I wonder if this has anything to do with the non-standard taxa in MetaPhlAn4's output (see example below).

k__Bacteria|p__Atribacterota|c__CFGB8897|o__OFGB8897|f__FGB8897 2|67818|||      0.01864

Full MetaPhlAn4 output available here.
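Whether these taxa cause the error is not confirmed here, but the example line does show what a parser has to cope with: the ID lineage ends in empty entries. The sketch below illustrates that observation and one possible fallback (using the nearest ancestor with an ID); last_known_taxonomy_id is a hypothetical helper, not taxpasta behaviour.

```python
# ID lineage copied from the MetaPhlAn4 example line above.
LINEAGE_IDS = "2|67818|||"


def last_known_taxonomy_id(lineage_ids):
    """Return the deepest non-empty ID in the lineage, or None if all are empty."""
    known = [part for part in lineage_ids.split("|") if part]
    return known[-1] if known else None


naive = LINEAGE_IDS.split("|")[-1]              # simply taking the last entry
fallback = last_known_taxonomy_id(LINEAGE_IDS)  # nearest ancestor with an ID
print(repr(naive), repr(fallback))
```

Taking the last entry yields an empty string, which cannot be converted to an integer identifier, whereas the fallback returns the ID of the closest annotated ancestor.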

Code sample

taxpasta merge -o taxpasta/taxpasta.tsv --taxonomy /home/geoadmin/geonomecontainer/KMCP/taxdump --profiler metaphlan --add-lineage metaphlan/*.txt

Environment

Package Information

Package	Version
taxpasta	0.3.0

Dependency Information

Package	Version
bash-kernel	missing
biom-format	2.1.15
depinfo~	missing
jupyter	missing
mkdocs-awesome-pages-plugin~	missing
mkdocs-exclude~	missing
mkdocs-material~	missing
mkdocstrings[python]~	missing
numpy~	missing
odfpy	missing
openpyxl	missing
pandas~	missing
pandera~	missing
pre-commit	missing
pyarrow	11.0.0
rich	13.4.2
tabulate~	missing
taxopy~	missing
tox~	missing
typer~	missing

Build Tools Information

Package	Version
pip	23.0.1
setuptools	67.4.0
wheel	0.38.4

Platform Information


Linux	5.15.0-1037-azure-x86_64
CPython	3.11.0

Anything else?

No response

Kaiju documentation specifies incorrect input source

The five-column output table that is supported by taxpasta is from kaiju2table, not kaiju itself

Support tool: Kraken2

Checklist

There are no similar issues or pull requests for this yet.

Describe the solution you would like.

We should support the metagenomic classifier Kraken2

Additional context

Documentation: add workflow for adding new profiler

Variability in KrakenUniq output causes bugs

Tried running taxpasta 1.1 with a command that included the files in the attached zip, and it errored out.

@sofstam suspects it's due to the CI test file having a slightly different structure of the contents

Command executed:

  taxpasta merge \
      -p krakenuniq -o krakenuniq_krakenuniq-db.tsv \
       \
       \
      MOCK_001_Illumina_Hiseq_3000_1.krakenuniq.report.txt MOCK_002_Illumina_Hiseq_3000_1.krakenuniq.report.txt MOCK_003_Illumina_Hiseq_3000.krakenuniq.report.txt MOCK_001_Minion_R9_1.krakenuniq.report.txt MOCK_002_Minion_R9_1.krakenuniq.report.txt MOCK_003_Minion_R9_1.krakenuniq.report.txt
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_TAXPROFILER:TAXPROFILER:STANDARDISATION_PROFILES:TAXPASTA_MERGE":
      taxpasta: $(taxpasta --version)
  END_VERSIONS

Command exit status:
  1

Command output:
  [17:26:39] CRITICAL Error in sample                                 merge.py:331
                      'MOCK_001_Illumina_Hiseq_3000_1.krakenuniq.repo             
                      rt' with profile                                            
                      'MOCK_001_Illumina_Hiseq_3000_1.krakenuniq.repo             
                      rt.txt'.                                                    
             CRITICAL      schema_context column  ...  failure_case   merge.py:334
                      index                                                       
                      0   DataFrameSchema   None  ...         3.418               
                      None                                                        
                      1   DataFrameSchema   None  ...         15525               
                      None                                                        
                      16  DataFrameSchema   None  ...          rank               
                      None                                                        
                      15  DataFrameSchema   None  ...         taxID               
                      None                                                        
                      14  DataFrameSchema   None  ...           cov               
                      None                                                        
                      13  DataFrameSchema   None  ...           dup               
                      None                                                        
                      12  DataFrameSchema   None  ...         kmers               
                      None                                                        
                      11  DataFrameSchema   None  ...      taxReads               
                      None                                                        
                      10  DataFrameSchema   None  ...         reads               
                      None                                                        
                      9   DataFrameSchema   None  ...             %               
                      None                                                        
                      8   DataFrameSchema   None  ...  unclassified               
                      None                                                        
                      7   DataFrameSchema   None  ...       no rank               
                      None                                                        
                      6   DataFrameSchema   None  ...             0               
                      None                                                        
                      5   DataFrameSchema   None  ...            NA               
                      None                                                        
                      4   DataFrameSchema   None  ...          1.17               
                      None                                                        
                      3   DataFrameSchema   None  ...       5226988               
                      None                                                        
                      2   DataFrameSchema   None  ...       15525.1               
                      None                                                        
                      17  DataFrameSchema   None  ...       taxName               
                      None                                                        
                                                                                  
                      [18 rows x 6 columns]

taxpasta_ku_error.zip