Comments (5)
No problem! Let me know if you have more questions :)
from genomad.
Hi @alienzj
There are a couple of points here:
- geNomad's predictions won't necessarily match VirSorter2's. geNomad's precision is higher than VirSorter2's (and even higher if you run VirSorter2 with all the models), so geNomad's predictions tend to be more conservative (although this can vary from dataset to dataset). I can't really say what is causing the classification conflicts with VirSorter2 without looking at the data.
- I can't really say why those contigs are being classified as plasmids without looking at the data. Can you share the _genes.tsv and _summary.tsv files? Keep in mind that VirSorter2 and phamb don't take plasmids into account, so there's a chance they classify plasmids as viruses.
- There are two reasons that only 4932 out of 8439 sequences got a taxonomic classification: (1) some sequences were classified as plasmids, and (2) because you used a very conservative cutoff (--min-score 0.8), some sequences won't be classified as either viruses or plasmids. The good news is that you can still check the taxonomic assignment of all sequences (regardless of their classification) in the annotation directory (try looking for vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv).
Aggregating the output of several classification tools is difficult because they will often diverge. geNomad is, on average, more accurate than VirSorter2 (see figure below), but VS2 is an amazing tool and I can't guarantee that geNomad will be correct in every single case where they diverge. You should gather as much information as possible.
The good news is that geNomad's output includes some information that makes it easier to understand why a given sequence was classified as a plasmid or virus:
This is an example of a _plasmid_summary.tsv file:
seq_name     length  topology  n_genes  genetic_code  plasmid_score  fdr  n_hallmarks  marker_enrichment  conjugation_genes
-----------  ------  --------  -------  ------------  -------------  ---  -----------  -----------------  -----------------
NC_002128.1  92721   Linear    88       11            0.9942         NA   5            46.4458            T_virB11;MOBP1
NC_002127.1  3306    Linear    3        11            0.9913         NA   1            1.6586             NA
Here you can see that these sequences encode plasmid hallmarks, which is a very good indication that those sequences are indeed plasmids. Try to check if your sequences also encode those. In addition, the marker_enrichment field is a number that increases proportionally to the number of plasmid markers. So, if the marker_enrichment of a given sequence is high (say, higher than 6), it is probably a plasmid, not a virus.
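The checks described above can be sketched in a few lines of Python. This is only a minimal illustration, not part of geNomad: it parses the two example rows shown above (assumed to be tab-separated, as the .tsv extension suggests), and the cutoff of 6 is the rough rule of thumb from this comment rather than an official threshold. The function name plasmid_evidence is made up for this example.

```python
import csv
import io

# The two rows from the example _plasmid_summary.tsv above, tab-separated.
EXAMPLE_TSV = (
    "seq_name\tlength\ttopology\tn_genes\tgenetic_code\tplasmid_score\tfdr\t"
    "n_hallmarks\tmarker_enrichment\tconjugation_genes\n"
    "NC_002128.1\t92721\tLinear\t88\t11\t0.9942\tNA\t5\t46.4458\tT_virB11;MOBP1\n"
    "NC_002127.1\t3306\tLinear\t3\t11\t0.9913\tNA\t1\t1.6586\tNA\n"
)

# Assumption: 6 is the informal "say, higher than 6" cutoff, not an official value.
ENRICHMENT_CUTOFF = 6.0


def plasmid_evidence(handle, cutoff=ENRICHMENT_CUTOFF):
    """Collect the plasmid evidence columns for each row of a plasmid summary."""
    evidence = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        genes = row["conjugation_genes"]
        evidence[row["seq_name"]] = {
            "high_enrichment": float(row["marker_enrichment"]) > cutoff,
            "n_hallmarks": int(row["n_hallmarks"]),
            # "NA" means no conjugation genes were reported for the sequence.
            "conjugation_genes": [] if genes == "NA" else genes.split(";"),
        }
    return evidence


result = plasmid_evidence(io.StringIO(EXAMPLE_TSV))
for name, info in result.items():
    print(name, info)
```

To run this on a real file, pass an open file handle (for example open(path, newline="")) instead of the io.StringIO wrapper.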
The same is true for the _virus_summary.tsv output. Try to run the classification again with a lower --min-score and see if the sequences look viral from the summary (if you would rather do the filtering yourself, based on your own criteria, just use --min-score 0). You might have some false positives in your dataset.
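If you do run with --min-score 0 and filter afterwards, the filtering step could look roughly like the sketch below. Everything here is illustrative: the rows are made up, the thresholds are arbitrary, and the column name virus_score is an assumption that mirrors the plasmid_score column in the plasmid summary. Check the header of your own _virus_summary.tsv before relying on it.

```python
import csv
import io

# Made-up rows for illustration only; names and scores are hypothetical.
# Assumption: the virus summary has virus_score and n_hallmarks columns.
HYPOTHETICAL_TSV = (
    "seq_name\tvirus_score\tn_hallmarks\n"
    "contig_a\t0.95\t4\n"
    "contig_b\t0.72\t1\n"
    "contig_c\t0.55\t0\n"
)


def keep_viral(handle, min_score=0.7, min_hallmarks=1):
    """Return names of sequences passing user-chosen score and hallmark cutoffs."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        passes_score = float(row["virus_score"]) >= min_score
        passes_hallmarks = int(row["n_hallmarks"]) >= min_hallmarks
        if passes_score and passes_hallmarks:
            kept.append(row["seq_name"])
    return kept


kept = keep_viral(io.StringIO(HYPOTHETICAL_TSV))
print(kept)
```

The point of the design is that the cutoffs live in your own script, so you can tighten or relax them without rerunning the classification.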
Again, if you are only interested in the taxonomy, just look at vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv :)
Hope this helps!
Dear @apcamargo, thanks a lot for your quick and detailed reply.
Sure sure, here are the tsv files generated by geNomad using the above command line: genomad_output_tsv.tar.gz
- From the figure you provided, it is excellent that geNomad has such an accurate performance.
- Yes, there's a chance VirSorter2 and phamb classify plasmids as viruses.
- I checked vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv, and it recorded 8269 taxonomic assignments. It is quite useful. Yes, I will change the --min-score to see what happens, based on your suggestions.
Thanks a lot again!
Thanks @alienzj
There are certainly lots of plasmids in your data. You can easily see that in the _plasmid_summary.tsv file:
- Sequences with very high marker_enrichment, which means that there are multiple plasmid markers in them.
- Sequences with multiple plasmid hallmark genes (n_hallmarks).
- Sequences with multiple conjugation genes (conjugation_genes). It is important to note that there are phages capable of conjugation, though.
Hi, @apcamargo,
Thanks a lot for your reply.
Yes, I shall remove those plasmids when doing virome profiling.
It is quite interesting to find so many plasmids among the vMAGs identified as viral by VirSorter2 and phamb.
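Removing the plasmid calls before virome profiling amounts to a set difference between the vMAG names and the names in the plasmid summary. A minimal sketch, assuming both lists of names have already been read from the respective outputs. The function name and the sequence names below are hypothetical.

```python
def split_by_plasmid_call(viral_ids, plasmid_ids):
    """Split viral candidates into those kept and those also called plasmids."""
    plasmid_set = set(plasmid_ids)
    kept = [seq for seq in viral_ids if seq not in plasmid_set]
    flagged = [seq for seq in viral_ids if seq in plasmid_set]
    return kept, flagged


# Hypothetical names standing in for vMAGs from VirSorter2/phamb and for the
# seq_name column of geNomad's plasmid summary.
vmag_ids = ["vmag_1", "vmag_2", "vmag_3", "vmag_4"]
genomad_plasmid_ids = ["vmag_2", "vmag_4", "other_contig"]

kept, flagged = split_by_plasmid_call(vmag_ids, genomad_plasmid_ids)
print("kept for virome profiling:", kept)
print("flagged as plasmid:", flagged)
```

Keeping the flagged list rather than discarding it makes it easier to go back and inspect borderline cases, such as sequences with conjugation genes that might still be phages.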