Comments (8)

donovan-h-parks commented on July 17, 2024

Hey Connor,

The set of marker genes used to place a genome into the reference genome tree is not the same as the set used to calculate completeness and contamination. Having said that, these results are still very surprising. Can you please run the following and see how many phylogenetically informative marker genes these bins have (out of 43):

checkm tree_qa [output_dir]

By default, bins with fewer than 10 marker genes should be assigned to the root (see checkm lineage_wf or checkm lineage_set). However, there may be a bug here, as it is hard to imagine that all the bins you have listed actually have this many phylogenetic marker genes.

Cheers,
Donovan

from checkm.

ctSkennerton commented on July 17, 2024

Hi Donovan,

These are the results from tree_qa for the bins listed above

final.contigs.fa.metabat-bins-.53   12  0   k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobacteraceae
final.contigs.fa.metabat-bins-.54   28  4   k__Bacteria;p__Chloroflexi
final.contigs.fa.metabat-bins-.59   35  7   k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfuromonadales;f__Geobacteraceae
final.contigs.fa.metabat-bins-.6    27  0   k__Archaea;p__Euryarchaeota;c__Methanomicrobia;o__Methanosarcinales
final.contigs.fa.metabat-bins-.60   9   0   k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria


donovan-h-parks commented on July 17, 2024

Hey Connor,

Bins are placed using the phylogenetically informative set of 43 marker genes. Since all your bins have 1 or more of these marker genes, they are all placed in the reference genome tree. Bin 60 only has 9 of these genes (fewer than the default of 10 required), so it is evaluated with a domain-level set of marker genes. The others are all evaluated using lineage-specific marker genes. What is surprising here is that these genomes all seem to have phylogenetically informative marker genes, but no lineage-specific marker genes. Is this just a small set among many that appear correct? I'm guessing these results are accurate even if they are surprising.

As an additional check, you could run domain-specific sets over these bins just to verify that they indeed appear to be missing these expected genes.
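The threshold rule described above can be sketched in a few lines of Python. This is only an illustration, not CheckM's own code: the function name `classify_bins` is hypothetical, and the column order (bin id, unique markers out of 43, multi-copy markers, taxonomy) is an assumption based on the rows posted earlier in this thread.

```python
# Rows copied from the tree_qa output posted in this thread.
# Assumed column order: bin id, unique markers (of 43), multi-copy markers, taxonomy.
TREE_QA_ROWS = """\
final.contigs.fa.metabat-bins-.53 12 0 k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobacteraceae
final.contigs.fa.metabat-bins-.54 28 4 k__Bacteria;p__Chloroflexi
final.contigs.fa.metabat-bins-.59 35 7 k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfuromonadales;f__Geobacteraceae
final.contigs.fa.metabat-bins-.6 27 0 k__Archaea;p__Euryarchaeota;c__Methanomicrobia;o__Methanosarcinales
final.contigs.fa.metabat-bins-.60 9 0 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria
"""

MIN_UNIQUE_MARKERS = 10  # default threshold mentioned in the thread


def classify_bins(text, min_unique=MIN_UNIQUE_MARKERS):
    """Map each bin id to the kind of marker set the thread says it is evaluated with."""
    result = {}
    for line in text.strip().splitlines():
        bin_id, unique, _multi_copy, _taxonomy = line.split()
        if int(unique) < min_unique:
            result[bin_id] = "domain-level"
        else:
            result[bin_id] = "lineage-specific"
    return result


if __name__ == "__main__":
    for bin_id, kind in classify_bins(TREE_QA_ROWS).items():
        print(f"{bin_id}\t{kind}")
```

With the rows above, only bin 60 (9 unique markers) falls below the threshold.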


ctSkennerton commented on July 17, 2024

Hi Donovan,

I think the lineage_wf command has some non-deterministic behaviour! I took the bins listed above and ran just the 5 of them in a separate checkm run and got very different results (see below). I think these results are correct, or at least more correct than when I ran them with the other bins, as the completeness matches up better with the count from the universal set of markers.

final.contigs.fa.metabat_bins_59    c__Deltaproteobacteria  (UID3216)   83    248  156  58   146  39  5  0  0  70.09  17.98
final.contigs.fa.metabat_bins_54    k__Bacteria             (UID1452)   924   163  110  55   100  8   0  0  0  62.92  3
final.contigs.fa.metabat_bins_60    k__Bacteria             (UID203)    5449  104  58   63   40   1   0  0  0  62.07  0.86
final.contigs.fa.metabat_bins_6     p__Euryarchaeota        (UID49)     95    229  154  100  125  4   0  0  0  48.63  2.11
final.contigs.fa.metabat_bins_53    c__Deltaproteobacteria  (UID3216)   83    248  156  215  30   3   0  0  0  10.11  0.85
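A quick way to sanity-check rows like these is to confirm that the copy-number columns add up to the marker count. The sketch below assumes the column order is bin id, marker lineage, UID, # genomes, # markers, # marker sets, counts of markers found 0, 1, 2, 3, 4 and 5+ times, completeness, contamination; that layout is not stated in the thread and is inferred from the numbers. The helper names are hypothetical, and the "fraction present" value is a rough figure only, not CheckM's completeness estimate, which as far as I know is computed over marker sets rather than individual markers.

```python
# Rows copied from the single-run results posted in this thread.
QA_ROWS = """\
final.contigs.fa.metabat_bins_59 c__Deltaproteobacteria (UID3216) 83 248 156 58 146 39 5 0 0 70.09 17.98
final.contigs.fa.metabat_bins_54 k__Bacteria (UID1452) 924 163 110 55 100 8 0 0 0 62.92 3
final.contigs.fa.metabat_bins_60 k__Bacteria (UID203) 5449 104 58 63 40 1 0 0 0 62.07 0.86
final.contigs.fa.metabat_bins_6 p__Euryarchaeota (UID49) 95 229 154 100 125 4 0 0 0 48.63 2.11
final.contigs.fa.metabat_bins_53 c__Deltaproteobacteria (UID3216) 83 248 156 215 30 3 0 0 0 10.11 0.85
"""


def parse_qa_row(line):
    """Split one whitespace-separated row into named fields (assumed layout)."""
    fields = line.split()
    return {
        "bin_id": fields[0],
        "lineage": fields[1],
        "uid": fields[2].strip("()"),
        "genomes": int(fields[3]),
        "markers": int(fields[4]),
        "marker_sets": int(fields[5]),
        "copy_counts": [int(x) for x in fields[6:12]],  # found 0, 1, 2, 3, 4, 5+ times
        "completeness": float(fields[12]),
        "contamination": float(fields[13]),
    }


def check_row(row):
    """Return (counts_add_up, fraction_of_markers_present) for a parsed row."""
    counts_add_up = sum(row["copy_counts"]) == row["markers"]
    missing = row["copy_counts"][0]
    fraction_present = 1.0 - missing / row["markers"]
    return counts_add_up, fraction_present


if __name__ == "__main__":
    for line in QA_ROWS.strip().splitlines():
        row = parse_qa_row(line)
        ok, present = check_row(row)
        print(f"{row['bin_id']}\tcounts add up: {ok}\tmarkers present: {present:.1%}")
```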


ctSkennerton commented on July 17, 2024

I've just done some more sleuthing and found that running the bins through lineage_wf with a single thread gives the correct results (before I was running the full dataset through with 32 threads). It could be that there is a thread lock missing somewhere such that data is getting overwritten.
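To make the hypothesis concrete, here is a generic Python sketch of the kind of shared-state update that needs a lock when several threads write to the same structure. It is not taken from CheckM and says nothing about how CheckM actually parallelises its work; `tally_markers` and its arguments are hypothetical names used only for illustration.

```python
import threading


def tally_markers(bin_hits, n_threads=4):
    """Accumulate per-bin hit counts from several worker threads.

    bin_hits is a list of (bin_id, count) pairs (hypothetical input).
    """
    totals = {}
    lock = threading.Lock()

    def worker(chunk):
        for bin_id, count in chunk:
            # Read-modify-write on shared state: without the lock, two threads
            # could read the same old value and one update would be lost.
            with lock:
                totals[bin_id] = totals.get(bin_id, 0) + count

    # Deal the work out round-robin, one slice per thread.
    chunks = [bin_hits[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


if __name__ == "__main__":
    hits = [("bin_53", 1)] * 1000 + [("bin_60", 2)] * 500
    print(tally_markers(hits))
```

If a lock like this were missing around a shared result table, the totals could depend on thread timing, which would match the symptom of results changing with the thread count.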


donovan-h-parks commented on July 17, 2024

The lineage_wf command is not deterministic, as pplacer is not deterministic. I think it is telling that the selected node (as given by the UIDs) is identical in the two examples you have sent me (first message and last message). Note that the UIDs will be different between the qa and tree_qa commands. The first indicates which node was used to calculate lineage-specific marker sets, while the second indicates the closest node to where the genome was inserted by pplacer. As such, the latter will always be more specific, as is the case here.

I will play around with multi-threading today to see if there is indeed some sort of race condition.


donovan-h-parks commented on July 17, 2024

I can't reproduce the problem on my end. I get identical results over a set of 30 bins between runs, regardless of whether I use multiple threads or a single thread. I've run this six times without any deviation. Are your results jumping from reasonable estimates to estimates of 0% completeness and contamination (as per your first post)? Can you put your bins somewhere for me so I can play with them and send me the exact command you are running? Ta!
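Comparing runs like this can be automated with a small script. The sketch below assumes each run has already been reduced to a mapping from bin id to (completeness, contamination); `diff_runs` is a hypothetical helper, the single-thread values are taken from the results posted earlier in this thread, and the multi-thread values are made up purely to show what a discrepancy would look like.

```python
def diff_runs(run_a, run_b, tol=0.01):
    """Return bins whose (completeness, contamination) differ by more than tol,
    or that are present in only one of the two runs."""
    diffs = {}
    for bin_id in sorted(set(run_a) | set(run_b)):
        a = run_a.get(bin_id)
        b = run_b.get(bin_id)
        if a is None or b is None or any(abs(x - y) > tol for x, y in zip(a, b)):
            diffs[bin_id] = (a, b)
    return diffs


# Values from the single-run results posted in this thread.
single_thread = {
    "metabat_bins_59": (70.09, 17.98),
    "metabat_bins_54": (62.92, 3.0),
    "metabat_bins_60": (62.07, 0.86),
}

# Hypothetical values, for illustration only (not real CheckM output).
multi_thread = {
    "metabat_bins_59": (0.0, 0.0),
    "metabat_bins_54": (62.92, 3.0),
    "metabat_bins_60": (0.0, 0.0),
}

if __name__ == "__main__":
    for bin_id, (a, b) in diff_runs(single_thread, multi_thread).items():
        print(f"{bin_id}\tsingle={a}\tmulti={b}")
```

An empty result from `diff_runs` would correspond to the "identical results between runs" outcome described above.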


donovan-h-parks commented on July 17, 2024

Fixed after update to latest version.

