
Comments (10)

apcamargo commented on July 29, 2024

This shouldn't happen. The only thing that would lead to a difference in the taxonomy is a variance in the annotation process, but changing the input should not alter the annotation of a given sequence.

Can you provide the rows of both _genes.tsv files that contain the genes of this example you mentioned?

from genomad.

xwu35 commented on July 29, 2024

Thanks for the quick response. Please see below:

all assembled contigs as input:
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.002410.VV 8.55e-08 58 0 0 1 11844 Phycodnaviridae NA NA PF16903 Major capsid protein N-terminus
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.171343.VP 3.504e-16 80 0 0 0 12031 Mimiviridae NA NA PF18406 Ferredoxin-like domain in Api92-like protein
contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.068590.VV 0.0001592 48 0 0 1 2561 Caudoviricetes NA NA PF13871;PF11242 C-terminal domain on Strawberry notch homologue

only viral contigs as input:
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.088220.VV 7.588e-11 68 0 0 0 2561 Caudoviricetes NA NA NA NA
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.055798.VV 8.984e-50 178 0 0 0 2561 Caudoviricetes NA NA PF19174 NA
contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.056032.VV 1.96e-05 51 0 0 0 11844 Phycodnaviridae NA NA PF00176 ERCC4-related helicase
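To see at a glance which genes changed between the two runs, the rows can be compared by gene name. This is only an illustrative sketch: it assumes the rows are whitespace-separated as pasted above, and that the 9th, 10th and 16th fields hold the marker, the E-value and the taxon name. `parse_rows` and `changed_annotations` are hypothetical helper names, not part of geNomad.

```python
ALL_CONTIGS = """
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.002410.VV 8.55e-08 58 0 0 1 11844 Phycodnaviridae NA NA PF16903 Major capsid protein N-terminus
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.171343.VP 3.504e-16 80 0 0 0 12031 Mimiviridae NA NA PF18406 Ferredoxin-like domain in Api92-like protein
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.068590.VV 0.0001592 48 0 0 1 2561 Caudoviricetes NA NA PF13871;PF11242 C-terminal domain on Strawberry notch homologue
"""

VIRAL_ONLY = """
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.088220.VV 7.588e-11 68 0 0 0 2561 Caudoviricetes NA NA NA NA
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.055798.VV 8.984e-50 178 0 0 0 2561 Caudoviricetes NA NA PF19174 NA
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.056032.VV 1.96e-05 51 0 0 0 11844 Phycodnaviridae NA NA PF00176 ERCC4-related helicase
"""


def parse_rows(text):
    """Map gene name -> (marker, E-value, taxon name); column positions are assumed."""
    genes = {}
    for line in text.strip().splitlines():
        # maxsplit keeps the free-text description (last field) in one piece.
        fields = line.split(maxsplit=19)
        genes[fields[0]] = (fields[8], fields[9], fields[15])
    return genes


def changed_annotations(run_a, run_b):
    """Return the genes present in both runs whose annotation tuple differs."""
    return {
        gene: (run_a[gene], run_b[gene])
        for gene in sorted(run_a.keys() & run_b.keys())
        if run_a[gene] != run_b[gene]
    }


diff = changed_annotations(parse_rows(ALL_CONTIGS), parse_rows(VIRAL_ONLY))
for gene, (before, after) in diff.items():
    print(gene, before, "->", after)
```

For real `_genes.tsv` files, reading with a tab delimiter would be more robust than splitting on whitespace.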


apcamargo commented on July 29, 2024

Did you use the same geNomad version for both runs? Which version was it? What version of the database did you use?

Not related to this, are these all the genes in your contigs? Be extra careful when working with such short sequences.


xwu35 commented on July 29, 2024

Yes, I used the same geNomad version 1.5.2 and database version 1.2.

We had to use contigs >= 1 kb since the majority of the assembled contigs were shorter than 3 kb.


apcamargo commented on July 29, 2024

Indeed, in my testing the size of the input does change the annotation. I didn't expect that, since the E-value depends on the database size, not on the number of queries.

My guess is that this is a MMseqs2 thing. @milot-mirdita, do you know what could be causing this? geNomad basically uses the proteins encoded by the input sequences as query and searches a profile database. I then take the best hit of each query protein. You can find the commands here.
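As a rough illustration of the "best hit per query" step, here is a minimal sketch. It is not geNomad's actual code: the `Hit` record and `best_hits` function are hypothetical, ranking by bitscore (ties broken by lower E-value) is an assumption, and the two candidate hits simply reuse the markers reported for contig_38880_2 in the two runs above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical, simplified search hit (not the real MMseqs2 output format)."""
    query: str
    target: str
    evalue: float
    bitscore: int


def best_hits(hits):
    """Keep one hit per query: highest bitscore, ties broken by lower E-value."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or (hit.bitscore, -hit.evalue) > (current.bitscore, -current.evalue):
            best[hit.query] = hit
    return best


# Assumption: both markers align to the same protein and both alignments are computed.
hits = [
    Hit("contig_38880_2", "GENOMAD.171343.VP", 3.504e-16, 80),
    Hit("contig_38880_2", "GENOMAD.055798.VV", 8.984e-50, 178),
]
selected = best_hits(hits)
print(selected["contig_38880_2"].target)
```

Under these assumptions the stronger hit wins whenever both alignments are actually computed, so a different result points to one of them never being evaluated.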

Maybe this is because I'm using --max-rejected? If the order of the alignments of a given query is different depending on the number of queries, --max-rejected would stop the alignment computation at different points.
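To make that hypothesis concrete, here is a toy model of an early-stopping rule of this kind. It is a sketch under stated assumptions, not MMseqs2's real implementation: the candidate names, the E-value cutoff and the exact counting rule are all made up for illustration.

```python
def search_with_max_rejected(candidates, max_rejected, evalue_cutoff=1e-3):
    """Toy model: align candidates in order and stop once max_rejected fail the cutoff.

    Each candidate is (target, evalue, bitscore). Returns the accepted hit with the
    highest bitscore, or None if nothing was accepted before stopping.
    """
    accepted = []
    rejected = 0
    for target, evalue, bitscore in candidates:
        if evalue <= evalue_cutoff:
            accepted.append((target, evalue, bitscore))
        else:
            rejected += 1
            if rejected >= max_rejected:
                break  # remaining candidates are never aligned
    return max(accepted, key=lambda hit: hit[2], default=None)


# Hypothetical candidates for a single query protein.
weak = ("profile_weak", 3.5e-16, 80)
strong = ("profile_strong", 9.0e-50, 178)
noise_1 = ("profile_noise_1", 5.0, 20)
noise_2 = ("profile_noise_2", 8.0, 18)

order_a = [weak, noise_1, noise_2, strong]  # strong candidate listed last
order_b = [strong, weak, noise_1, noise_2]  # strong candidate listed first

best_a = search_with_max_rejected(order_a, max_rejected=2)
best_b = search_with_max_rejected(order_b, max_rejected=2)
best_full = search_with_max_rejected(order_a, max_rejected=10**9)
print(best_a[0], best_b[0], best_full[0])
```

In this toy model the same candidate set yields a different best hit depending only on the order in which candidates are aligned, and the discrepancy disappears when the limit is effectively removed.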


apcamargo commented on July 29, 2024

@xwu35 I did some more investigation and I found the cause of this issue.

Short story: the annotation should be very slightly more reliable when your input has fewer sequences (in your case, the input with only viral sequences). In any case, I wouldn't really recommend using the taxonomy of very short sequences.

Long story: It seems that the order of the sequences in the input matters when using --max-rejected. I don't really understand why, because I'd assume that the counter resets for each query. The problem is that if I stop using --max-rejected the runtime will increase significantly, so I'll look into other options to solve this.


xwu35 commented on July 29, 2024

Thanks for looking into it.

Will this issue make the virus identification step unreliable, since part of the identification method relies on finding aligned viral marker genes?


apcamargo commented on July 29, 2024

The effects will be minimal, so you shouldn't worry. This issue affects very few proteins within a sample, and only alignments with high E-values are susceptible to this variance.


xwu35 commented on July 29, 2024

Thanks, that is good to know.


apcamargo commented on July 29, 2024

I just pushed a change to the way the MMseqs2 searches are performed that will mitigate this issue.

