Comments (8)
Hey Connor,
The set of marker genes used to place a genome into the reference genome tree is not the same as the set used to calculate completeness and contamination. Having said that, these results are still very surprising. Can you please run the following and see how many phylogenetically informative marker genes these bins have (out of 43):
checkm tree_qa [output_dir]
By default, bins with fewer than 10 marker genes should be assigned to root (see checkm lineage_wf or checkm lineage_set). However, perhaps there is a bug here, as it is hard to imagine that all the bins you have listed actually have this many phylogenetic marker genes.
Cheers,
Donovan
from checkm.
Hi Donovan,
These are the results from tree_qa for the bins listed above:
final.contigs.fa.metabat-bins-.53 12 0 k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobacteraceae
final.contigs.fa.metabat-bins-.54 28 4 k__Bacteria;p__Chloroflexi
final.contigs.fa.metabat-bins-.59 35 7 k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfuromonadales;f__Geobacteraceae
final.contigs.fa.metabat-bins-.6 27 0 k__Archaea;p__Euryarchaeota;c__Methanomicrobia;o__Methanosarcinales
final.contigs.fa.metabat-bins-.60 9 0 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria
from checkm.
Hey Connor,
Bins are placed using the phylogenetically informative set of 43 marker genes. Since all your bins have 1 or more of these marker genes, they are all placed in the reference genome tree. Bin 60 only has 9 of these genes (fewer than the default of 10 required), so it is evaluated with a domain-level set of marker genes. The others are all evaluated using lineage-specific marker genes. What is surprising here is that these genomes all seem to have phylogenetically informative marker genes, but no lineage-specific marker genes. Is this just a small set among many that appear correct? I'm guessing these results are accurate even if they are surprising.
As an additional check, you could run domain-specific sets over these bins just to verify that they indeed appear to be missing these expected genes.
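The threshold rule described above can be sketched in Python. This is only an illustration, not CheckM's own code: it assumes each tree_qa row has four whitespace-separated fields (bin ID, number of unique phylogenetic markers, number of multi-copy markers, taxonomy), which is an assumption read off the rows pasted in this thread, and marker_set_level is a hypothetical helper name.

```python
# Illustrative sketch only: marker_set_level is a hypothetical helper,
# not a CheckM function. The column layout is assumed from the rows above.
MIN_UNIQUE_MARKERS = 10  # default threshold quoted in this thread


def parse_tree_qa_row(line):
    """Split one row into (bin_id, unique_markers, multi_copy, taxonomy)."""
    bin_id, unique, multi, taxonomy = line.split(None, 3)
    return bin_id, int(unique), int(multi), taxonomy.strip()


def marker_set_level(unique_markers, threshold=MIN_UNIQUE_MARKERS):
    """Return which kind of marker set a bin would be evaluated with."""
    if unique_markers >= threshold:
        return "lineage-specific"
    return "domain-level"


# Rows copied from the tree_qa output posted earlier in this thread.
rows = [
    "final.contigs.fa.metabat-bins-.53 12 0 k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobacteraceae",
    "final.contigs.fa.metabat-bins-.54 28 4 k__Bacteria;p__Chloroflexi",
    "final.contigs.fa.metabat-bins-.59 35 7 k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfuromonadales;f__Geobacteraceae",
    "final.contigs.fa.metabat-bins-.6 27 0 k__Archaea;p__Euryarchaeota;c__Methanomicrobia;o__Methanosarcinales",
    "final.contigs.fa.metabat-bins-.60 9 0 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
]

levels = {}
for row in rows:
    bin_id, unique, _multi, _taxonomy = parse_tree_qa_row(row)
    levels[bin_id] = marker_set_level(unique)

for bin_id, level in levels.items():
    print(bin_id, level)
```

With these rows, only bin 60 falls below the threshold; the other four would be evaluated with lineage-specific markers.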
from checkm.
Hi Donovan,
I think the lineage_wf command has some non-deterministic behaviour! I took the five bins listed above and ran them on their own in a separate checkm run, and got very different results (see below). I think these results are correct, or at least more correct than when I ran them with the other bins, as the completeness matches up better with the count from the universal set of markers:
final.contigs.fa.metabat_bins_59 c__Deltaproteobacteria (UID3216) 83 248 156 58 146 39 5 0 0 70.09 17.98
final.contigs.fa.metabat_bins_54 k__Bacteria (UID1452) 924 163 110 55 100 8 0 0 0 62.92 3
final.contigs.fa.metabat_bins_60 k__Bacteria (UID203) 5449 104 58 63 40 1 0 0 0 62.07 0.86
final.contigs.fa.metabat_bins_6 p__Euryarchaeota (UID49) 95 229 154 100 125 4 0 0 0 48.63 2.11
final.contigs.fa.metabat_bins_53 c__Deltaproteobacteria (UID3216) 83 248 156 215 30 3 0 0 0 10.11 0.85
from checkm.
I've just done some more sleuthing and found that running the bins through lineage_wf with a single thread gives the correct results (before, I was running the full dataset with 32 threads). It could be that there is a thread lock missing somewhere, such that data is getting overwritten.
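A comparison like this could be automated by diffing the completeness and contamination columns of two runs. The following is a minimal sketch, assuming each row starts with the bin ID and ends with completeness and contamination, as in the rows pasted above. The helper names are hypothetical, and the second run's values are made up purely to show what a discrepancy would look like; they are not real output.

```python
# Minimal sketch for comparing two qa tables. parse_qa_row and diff_runs are
# hypothetical helpers, not part of CheckM.


def parse_qa_row(line):
    """Return (bin_id, (completeness, contamination)) for one table row.

    Assumes the bin ID is the first field and the last two fields are
    completeness and contamination.
    """
    fields = line.split()
    return fields[0], (float(fields[-2]), float(fields[-1]))


def diff_runs(rows_a, rows_b, tol=0.01):
    """Map bin ID to ((comp_a, cont_a), (comp_b, cont_b)) where runs differ."""
    a = dict(parse_qa_row(r) for r in rows_a)
    b = dict(parse_qa_row(r) for r in rows_b)
    changed = {}
    for bin_id in sorted(a.keys() & b.keys()):
        if any(abs(x - y) > tol for x, y in zip(a[bin_id], b[bin_id])):
            changed[bin_id] = (a[bin_id], b[bin_id])
    return changed


# Two rows copied from the single-thread results in this thread.
single_thread = [
    "final.contigs.fa.metabat_bins_59 c__Deltaproteobacteria (UID3216) 83 248 156 58 146 39 5 0 0 70.09 17.98",
    "final.contigs.fa.metabat_bins_53 c__Deltaproteobacteria (UID3216) 83 248 156 215 30 3 0 0 0 10.11 0.85",
]

# HYPOTHETICAL rows standing in for a multi-threaded run (values invented).
multi_thread = [
    "final.contigs.fa.metabat_bins_59 c__Deltaproteobacteria (UID3216) 83 248 156 58 146 39 5 0 0 70.09 17.98",
    "final.contigs.fa.metabat_bins_53 c__Deltaproteobacteria (UID3216) 83 248 156 248 0 0 0 0 0 0.00 0.00",
]

changed = diff_runs(single_thread, multi_thread)
for bin_id, (run_a, run_b) in changed.items():
    print(bin_id, run_a, run_b)
```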
from checkm.
The lineage_wf command is not deterministic, as pplacer is not deterministic. I think it is telling that the selected node (as given by the UIDs) is identical in the two examples you have sent me (first message and last message). Note that the UIDs will be different between the qa and tree_qa commands. The first indicates which node was used to calculate lineage-specific marker sets, while the second indicates the closest node to where the genome was inserted by pplacer. As such, the latter will always be more specific, as is the case here.
I will play around with multi-threading today to see if there is indeed some sort of race condition.
from checkm.
I can't reproduce the problem on my end. I get identical results over a set of 30 bins between runs, regardless of whether I use multiple threads or a single thread. I've run this six times without any deviation. Are your results jumping from reasonable estimates to estimates of 0% completeness and contamination (as per your first post)? Can you put your bins somewhere for me so I can play with them, and send me the exact command you are running? Ta!
from checkm.
Fixed after update to latest version.
from checkm.