Comments (16)

d-callan commented on August 27, 2024

@Midnighter I think that's as reasonable as anything. If I want strains from bioBakery tools, I'd look to StrainPhlAn rather than MetaPhlAn. A warning might be good, or maybe making the behavior configurable.

from taxpasta.

Midnighter commented on August 27, 2024

Thank you for the detailed report. It is somewhat curious that the names of the species are distinguished by a letter suffix, but the numeric identifier is the same... And no identifiers for the strains at all. I will look into it but I'm actually not sure what the correct solution should be.

Midnighter commented on August 27, 2024

@MajoroMask, the only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.
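
A minimal Python sketch of the summing approach described above. It is not taxpasta's actual code: the function names and the `unclassified` label are hypothetical, and the two species rows are invented for illustration. The two strain rows reuse the ID strings and abundances from the MetaPhlAn 4 example posted later in this thread.

```python
from collections import defaultdict

UNCLASSIFIED = "unclassified"  # hypothetical label for rows without an identifier


def leaf_taxon_id(lineage_ids: str) -> str:
    """Return the last field of a '|'-separated ID lineage ('' if it is empty)."""
    return lineage_ids.split("|")[-1]


def sum_by_taxon_id(rows):
    """Sum relative abundances per leaf taxon ID; ID-less rows go to UNCLASSIFIED."""
    totals = defaultdict(float)
    for lineage_ids, abundance in rows:
        totals[leaf_taxon_id(lineage_ids) or UNCLASSIFIED] += abundance
    return dict(totals)


rows = [
    # Two hypothetical species rows that share one NCBI ID (abundances invented).
    ("2|1239|186801|186802|||1898207", 0.25),
    ("2|1239|186801|186802|||1898207", 0.5),
    # Strain-level rows from this thread: the trailing '|' leaves the last ID empty.
    ("2|1239|186801|186802|||1898207|", 0.02366),
    ("2|1239|186801|186802|||1898207|", 0.02308),
]

totals = sum_by_taxon_id(rows)
print(totals)
```

The species rows collapse into a single entry keyed by `1898207`, while both strain rows end up in the unclassified bucket, which is the information loss noted above.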

Can you think of a better solution?

luozhy88 commented on August 27, 2024

I get the same error with MetaPhlAn. How can I solve it?
I downloaded this database: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/bowtie2_indexes/mpa_vJan21_CHOCOPhlAnSGB_202103_bt2.tar

MajoroMask commented on August 27, 2024

@MajoroMask, the only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.

Can you think of a better solution?

@Midnighter I have no idea... Can the authors of MetaPhlAn 4 be reached? Maybe they have a solution for adding an ID to the output.

d-callan commented on August 27, 2024

Another example, if it helps:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143     2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159     2|1239|186801|186802|||1898207| 0.02308 

Midnighter commented on August 27, 2024

@d-callan thank you for the additional data. Do you have any thoughts on the following? I'm not clear on how to solve this at the moment.

only solution that I can see immediately, is to sum up the relative abundances for the same taxon identifier. This would mean all relative abundances for the species are added together, while the strains would be added to unclassified as there is no identifier. Not really ideal.

d-callan commented on August 27, 2024

@Midnighter I'm also wondering if you have a sense of how much effort this would take? I am very interested in getting this working and would be willing to put effort into it if you wanted.

Midnighter commented on August 27, 2024

I think the code change is minimal: 2-3 lines. It will need an extra test case or so.

d-callan commented on August 27, 2024

This one, I think, is fun:

k__Bacteria|p__Firmicutes|c__Negativicutes      2|1239|909932   5.20485
k__Bacteria|p__Actinobacteria|c__Actinomycetia  2|201174|1760   2.05981
k__Bacteria|p__Firmicutes|c__CFGB2834   2|1239| 0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria     2|1224|28216    0.81827
k__Bacteria|p__Actinobacteria|c__Coriobacteriia 2|201174|84998  0.46979
k__Bacteria|p__Firmicutes|c__CFGB1227   2|1239| 0.404
k__Bacteria|p__Firmicutes|c__CFGB3038   2|1239| 0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054   2|1239| 0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified    2|1239| 0.12308
k__Bacteria|p__Firmicutes|c__CFGB2906   2|1239| 0.03655
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    2|1224|1236     0.02883
k__Bacteria|p__Firmicutes|c__CFGB1765   2|1239| 0.02468
k__Bacteria|p__Candidatus_Melainabacteria|c__Candidatus_Melainabacteria_unclassified    2|1798710|      0.00509
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales     2|976|200643|171549     57.0883
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales        2|74201|203494|48461    7.59373
k__Bacteria|p__Firmicutes|c__Negativicutes|o__Veillonellales    2|1239|909932|1843489   5.20485
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834       2|1239||        0.94398
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales  2|1224|28216|80840      0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227       2|1239||        0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales     2|201174|84998|84999    0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038       2|1239||        0.18149
k__Bacteria|p__Firmicutes|c__CFGB3054|o__OFGB3054       2|1239||        0.16661
k__Bacteria|p__Firmicutes|c__Firmicutes_unclassified|o__Firmicutes_unclassified 2|1239||        0.12308
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriaceae      2|1239|186801|186802|186806     1.31747
k__Bacteria|p__Firmicutes|c__CFGB2834|o__OFGB2834|f__FGB2834    2|1239|||       0.94398
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Clostridiaceae      2|1239|186801|186802|31979      0.85
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae        2|1224|28216|80840|995019       0.81827
k__Bacteria|p__Firmicutes|c__CFGB1227|o__OFGB1227|f__FGB1227    2|1239|||       0.404
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae        2|201174|84998|84999|84107      0.38515
k__Bacteria|p__Firmicutes|c__CFGB3038|o__OFGB3038|f__FGB3038    2|1239|||       0.18149

harper357 commented on August 27, 2024

Sorry for randomly jumping in here, but I have used MetaPhlAn a fair bit. The clade tax ID values come from NCBI, but the taxa/clade names come from their own clustering/GTDB.

I believe that the authors even kind of discourage using the tax ids.

I don't know if this would cause problems when merging across profilers, but you could add/use the last section of the clade name.

Ex:

k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15143     2|1239|186801|186802|||1898207| 0.02366
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium|t__SGB15159     2|1239|186801|186802|||1898207| 0.02308 

becomes

1898207_SGB15143 0.02366
1898207_SGB15159 0.02308 
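
A small Python sketch of how such compound identifiers could be derived from a profile row. The function names are hypothetical, and the parsing assumes whitespace-separated columns in the order clade name, ID lineage, relative abundance, as in the rows quoted in this thread.

```python
def parse_profile_line(line: str):
    """Split a profile row into clade name, ID lineage, and relative abundance."""
    clade_name, lineage_ids, abundance = line.split()[:3]
    return clade_name, lineage_ids, float(abundance)


def compound_identifier(clade_name: str, lineage_ids: str) -> str:
    """Join the last non-empty taxon ID with the last clade label."""
    ids = [field for field in lineage_ids.split("|") if field]
    # Drop the rank prefix such as 't__' or 's__' from the final clade label.
    _, _, label = clade_name.split("|")[-1].partition("__")
    return f"{ids[-1]}_{label}"


line = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales"
    "|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified"
    "|s__Clostridiales_bacterium|t__SGB15143"
    "\t2|1239|186801|186802|||1898207|\t0.02366"
)

clade_name, lineage_ids, abundance = parse_profile_line(line)
print(compound_identifier(clade_name, lineage_ids), abundance)
```

Because the sketch falls back to the last non-empty ID, a row such as `c__CFGB2834` with lineage `2|1239|` would get `1239_CFGB2834`, i.e. the parent's NCBI ID plus the clade label. The resulting identifiers are strings rather than integers.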

Midnighter commented on August 27, 2024

Hi @harper357,

No need to apologize, more information is always welcome. Thank you for the explanation also, I was not aware how MetaPhlAn handles this.

Unfortunately, even though such a change looks simple from the outside, it would change taxpasta's internal logic a lot. There's not only the validation part which assumes integers, but also the whole integration with an existing taxonomy. Basically, we only maintain the identifiers and if users desire, we add back names and lineages using the identifiers to get information from a taxonomy. @harper357 do you know if they publish their taxonomy in a format that can be read by taxopy?

harper357 commented on August 27, 2024

@Midnighter I am not completely sure what format taxopy expects. MetaPhlAn 4's second column contains the NCBI TaxIDs. Are you talking about the first column needing to be in a different format?

Midnighter commented on August 27, 2024

We use taxopy to load taxonomies in taxdump format. That means, we normally drop all information from individual profiles except taxon identifiers and their relative abundances. If a user wishes to output names, ranks, or lineages, we retrieve that from the taxonomy.

There are two things that concern me with MetaPhlAn then. 1) You say that they use NCBI identifiers but actually apply a custom clustering. I don't know if that makes a big difference in practice, but it's nonetheless misleading if true. 2) If they have their own clustering, it should be straightforward for them to create taxdump output, which would also assign unique identifiers that could be used.
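
To make point 2 concrete, here is a rough Python sketch that derives taxdump-style rows with unique integer identifiers from MetaPhlAn clade names. The ID numbering, the rank mapping, and the function names are assumptions for illustration only. Real `nodes.dmp` files carry more columns than shown, and it is not verified here whether taxopy accepts files trimmed like this.

```python
# Hypothetical mapping from MetaPhlAn rank prefixes to rank names.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species", "t": "strain"}


def build_taxdump_rows(clade_names):
    """Assign a synthetic integer ID to every distinct lineage prefix."""
    ids = {}
    nodes = [(1, 1, "no rank")]  # (tax_id, parent_tax_id, rank); 1 is the root
    names = [(1, "root", "", "scientific name")]
    for clade in clade_names:
        parts = clade.split("|")
        parent_id = 1
        for depth, part in enumerate(parts, start=1):
            prefix = "|".join(parts[:depth])
            if prefix not in ids:
                ids[prefix] = len(ids) + 2  # IDs start at 2 because 1 is the root
                rank_key, _, name = part.partition("__")
                nodes.append((ids[prefix], parent_id, RANKS.get(rank_key, "no rank")))
                names.append((ids[prefix], name, "", "scientific name"))
            parent_id = ids[prefix]
    return nodes, names


def format_dmp_row(fields) -> str:
    # NCBI taxdump rows separate fields with <tab>|<tab> and end with <tab>|.
    return "\t|\t".join(str(field) for field in fields) + "\t|"


lineage = ("k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales"
           "|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified"
           "|s__Clostridiales_bacterium")
nodes, names = build_taxdump_rows([lineage + "|t__SGB15143", lineage + "|t__SGB15159"])
for row in nodes:
    print(format_dmp_row(row))
```

In this sketch the two SGBs from the example above receive distinct identifiers under the same species node, which is exactly what the shared NCBI ID cannot express.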

I realize that that will not happen soon, so we still need a solution right now. While I like your suggestion @harper357, it does have big consequences for how taxpasta is built. Need to think about that. It would also mean that the way we use taxonomies would not work for MetaPhlAn.

d-callan commented on August 27, 2024

I'll put this here in case it proves a helpful reference http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html

Also, I'll comment that mapping MetaPhlAn outputs to the NCBI taxonomy seems a reasonable use case nonetheless, and it makes sense to support it even if imperfectly.
