Comments (10)
This shouldn't happen. The only thing that would lead to a difference in the taxonomy is a variance in the annotation process, but changing the input should not alter the annotation of a given sequence.
Can you provide the rows of both _genes.tsv
files that contain the genes of this example you mentioned?
from genomad.
Thanks for the quick response. Please see below:
all assembled contigs as input:
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.002410.VV 8.55e-08 58 0 0 1 11844 Phycodnaviridae NA NA PF16903 Major capsid protein N-terminus
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.171343.VP 3.504e-16 80 0 0 0 12031 Mimiviridae NA NA PF18406 Ferredoxin-like domain in Api92-like protein
contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.068590.VV 0.0001592 48 0 0 1 2561 Caudoviricetes NA NA PF13871;PF11242 C-terminal domain on Strawberry notch homologue
only viral contigs as input:
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.088220.VV 7.588e-11 68 0 0 0 2561 Caudoviricetes NA NA NA NA
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.055798.VV 8.984e-50 178 0 0 0 2561 Caudoviricetes NA NA PF19174 NA
contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.056032.VV 1.96e-05 51 0 0 0 11844 Phycodnaviridae NA NA PF00176 ERCC4-related helicase
from genomad.
Did you use the same geNomad version for both runs? Which version was it? What version of the database did you use?
Not related to this, are these all the genes in your contigs? Be extra careful when working with such short sequences.
from genomad.
Yes, I used the same geNomad version 1.5.2 and database version 1.2.
We had to use contigs >= 1kb since the majority of the assembled contigs were short than 3 kb.
from genomad.
Indeed, it seems that the size of the input does change the annotation in my testing. I didn't expect that since the E-value depends on the database size, not query size.
My guess is that this is a MMseqs2 thing. @milot-mirdita, do you know what could be causing this? geNomad basically uses the proteins encoded by the input sequences as query and searches a profile database. I then take the best hit of each query protein. You can find the commands here.
Maybe this is because I'm using --max-rejected
? If the order of the alignments of a given query is different depending on the number of queries, --max-rejected
would stop the alignment computation at different points.
from genomad.
@xwu35 I did some more investigation and I found the cause of this issue.
Short story: the annotation should be very slightly more reliable when your input has less sequences (in your case, the input with only viral sequences). In any case, I wouldn't really recommend using the taxonomy of very short sequences.
Long story: It seems that the order of the sequences in the input matter when using --max-rejected
. I don't really understand why, because I'd assume that the counter resets for each query. The problem is that if I stop using --max-rejected
the runtime will increase significantly, so I'll look into other options to solve this.
from genomad.
Thanks for looking into it.
Will this issue make the virus identification step unreliable since part of the identification method is to find aligned viral marker genes?
from genomad.
The effects will be minimal, you shouldn't worry. This issue affects very few proteins within a sample and only alignments with high E-value are susceptible to variance.
from genomad.
Thanks, it is good to know.
from genomad.
I just pushed a change to the way the MMseqs2 searches are performed that will mitigate this issue
from genomad.
Related Issues (20)
- Error downloading database HOT 2
- Inquiry on virus from MAG HOT 4
- [feature request] query database clustering HOT 1
- Whether measures have been taken by genomad to avoid identifying genomic islands as viruses? HOT 5
- AMR annotations on chromsome? HOT 1
- Errors when download and the same issue when running genomad -h HOT 3
- The virus identified by genomad weren't annotated as virus sequence by VIBRANT? HOT 3
- geNomad taxonomy about Baltimore classification HOT 1
- Error with geNomad v1.8.0, missing tensorflow.keras HOT 5
- mmseqs2 error HOT 3
- Different protein number from genomad and pyrodigal-gv HOT 2
- Small (reference) data for testing HOT 9
- Error while classifying sequences HOT 6
- Error mmseqs prefilter HOT 4
- genomad annotate fastq file is empty or contains multiple entries HOT 3
- plasmid classified as virus? HOT 7
- Optimization Request for Analyzing Large Number of MAGs with geNomad HOT 5
- Fewer viral contigs identified from genomad vs virsorter2 HOT 4
- The question about --disable-nn-classification HOT 1
- Provirus detection in genomad vs checkv HOT 6
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from genomad.