
How to understand the results? (genomad, closed)

alienzj commented on July 29, 2024
How to understand the results?

from genomad.

Comments (5)

apcamargo commented on July 29, 2024

No problem! Let me know if you have more questions :)


apcamargo commented on July 29, 2024

Hi @alienzj

There are a couple of points here:

  • geNomad's predictions won't necessarily match VirSorter2's. geNomad's precision is higher than VirSorter2's (and even higher if you run VirSorter2 with all the models), so geNomad's predictions tend to be more conservative (although this can vary from dataset to dataset). I can't really say what is causing the classification conflicts with VirSorter2 without looking at the data.
  • I can't really say why those contigs are being classified as plasmids without looking at the data. Can you share the _genes.tsv and _summary.tsv files? Keep in mind that VirSorter2 and phamb don't take plasmids into account, so there's a chance they classify plasmids as viruses.
  • There are two reasons why only 4932 out of 8439 sequences got a taxonomic classification: (1) some sequences were classified as plasmids, and (2) because you used a very conservative cutoff (--min-score 0.8), some sequences weren't classified as either viruses or plasmids. The good news is that you can still check the taxonomic assignment of all sequences (regardless of their classification) in the annotation directory (try to look for vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv).
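To see how the input sequences split across the cases described in the last point, a tally along these lines might help. This is only a rough sketch: the function name and the contig names are hypothetical, and it assumes you have already collected the seq_name values from the input and from the virus and plasmid summary files.

```python
def tally_classification(all_seqs, virus_seqs, plasmid_seqs):
    """Count how many input sequences ended up as virus, plasmid, or neither."""
    all_seqs = set(all_seqs)
    virus = all_seqs & set(virus_seqs)
    plasmid = all_seqs & set(plasmid_seqs)
    # Sequences in neither summary fell below the score cutoff for both classes.
    unclassified = all_seqs - virus - plasmid
    return {
        "virus": len(virus),
        "plasmid": len(plasmid),
        "unclassified": len(unclassified),
    }


# Hypothetical contig names, for illustration only.
counts = tally_classification(
    all_seqs=["c1", "c2", "c3", "c4", "c5"],
    virus_seqs=["c1", "c2"],
    plasmid_seqs=["c3"],
)
print(counts)  # {'virus': 2, 'plasmid': 1, 'unclassified': 2}
```

In this toy case, only the two sequences counted under "virus" would show up in the virus summary; the other three correspond to reasons (1) and (2).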

Aggregating the output of several classification tools is difficult because they will often diverge. geNomad is, on average, more accurate than VirSorter2 (see figure below), but VS2 is an amazing tool and I can't guarantee that geNomad will be correct in every single scenario in which they diverge. You should gather as much information as possible.

[Figure: VIRUS_MAIN_BENCHMARK]

The good news is that geNomad's output includes some information that makes it easier to understand why a given sequence was classified as a plasmid or virus:

This is an example of a _plasmid_summary.tsv file:

seq_name      length   topology   n_genes   genetic_code   plasmid_score   fdr   n_hallmarks   marker_enrichment   conjugation_genes
-----------   ------   --------   -------   ------------   -------------   ---   -----------   -----------------   -----------------
NC_002128.1   92721    Linear     88        11             0.9942          NA    5             46.4458             T_virB11;MOBP1
NC_002127.1   3306     Linear     3         11             0.9913          NA    1             1.6586              NA

Here you can see that these sequences encode plasmid hallmarks, which is a very good indication that those sequences are indeed plasmids. Try to check if your sequences also encode those. In addition, the marker_enrichment field is a number that increases proportionally to the number of plasmid markers. So, if the marker_enrichment of a given sequence is high (say, higher than 6), it is probably a plasmid, not a virus.
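As a minimal sketch of that check, the snippet below flags sequences whose marker_enrichment is above a threshold. The rows are the two from the example table, typed out as tab-separated text. The function name is hypothetical, and the default threshold of 6 is just the rough value mentioned above, not a hard rule.

```python
import csv
import io

# The two rows from the example _plasmid_summary.tsv, as tab-separated text.
SAMPLE_TSV = (
    "seq_name\tlength\ttopology\tn_genes\tgenetic_code\tplasmid_score\tfdr\t"
    "n_hallmarks\tmarker_enrichment\tconjugation_genes\n"
    "NC_002128.1\t92721\tLinear\t88\t11\t0.9942\tNA\t5\t46.4458\tT_virB11;MOBP1\n"
    "NC_002127.1\t3306\tLinear\t3\t11\t0.9913\tNA\t1\t1.6586\tNA\n"
)


def strong_plasmid_candidates(handle, min_enrichment=6.0):
    """Return seq_names whose marker_enrichment exceeds min_enrichment."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["seq_name"]
        for row in reader
        if float(row["marker_enrichment"]) > min_enrichment
    ]


candidates = strong_plasmid_candidates(io.StringIO(SAMPLE_TSV))
print(candidates)  # ['NC_002128.1']
```

To run it on a real file, pass an open file handle instead of the io.StringIO object. Note that NC_002127.1 falls below the threshold even though it has a high plasmid_score and one hallmark, so a low marker_enrichment alone should not be read as evidence against a plasmid.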

The same is true for the _virus_summary.tsv output. Try to run the classification again with a lower --min-score and see if the sequences look viral from the summary (if you would like to do the filtering yourself, based on your own criteria, just use --min-score 0). You might have some false positives in your dataset.
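If you go the --min-score 0 route, the filtering could look roughly like this. It is a sketch under assumptions: the virus_score column name is assumed by analogy with plasmid_score in the plasmid summary, the function name is hypothetical, and the rows below are made up for illustration (they are not real geNomad output).

```python
import csv
import io


def filter_by_score(handle, score_column, min_score):
    """Keep the rows whose score_column value is at least min_score."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[score_column]) >= min_score]


# Made-up rows for illustration only; column names are assumptions.
EXAMPLE_TSV = (
    "seq_name\tvirus_score\tn_hallmarks\tmarker_enrichment\n"
    "contig_a\t0.95\t4\t12.3\n"
    "contig_b\t0.72\t1\t2.1\n"
    "contig_c\t0.41\t0\t-0.8\n"
)

kept = filter_by_score(io.StringIO(EXAMPLE_TSV), "virus_score", 0.7)
kept_names = [row["seq_name"] for row in kept]
print(kept_names)  # ['contig_a', 'contig_b']
```

Because each kept row still carries its other columns, you can combine the score cutoff with your own criteria (for example, n_hallmarks) before deciding what to keep.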

Again, if you are only interested in the taxonomy, just look at vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv :)

Hope this helps!


alienzj commented on July 29, 2024

Dear @apcamargo, thanks a lot for your quick and detailed reply.
Sure sure, here are the tsv files generated by geNomad using the above command line: genomad_output_tsv.tar.gz

  1. From the figure you provided, it is excellent that geNomad has such an accurate performance.
  2. Yes, there's a chance VirSorter2 and phamb classify plasmids as viruses.
  3. I checked vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv, and it recorded 8269 taxonomic assignments, which is quite useful. Yes, I will change --min-score to see what happens, as you suggested.

Thanks a lot again!


apcamargo commented on July 29, 2024

Thanks @alienzj

There are certainly lots of plasmids in your data. You can easily see that in the _plasmid_summary.tsv file:

  • Sequences with very high marker_enrichment, which means that there are multiple plasmid markers in them.
  • Sequences with multiple plasmid hallmark genes (n_hallmarks)
  • Sequences with multiple conjugation genes (conjugation_genes). It is important to note that there are phages capable of conjugation, though.
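The three lines of evidence above can be pulled out of the summary file with something like the sketch below. The function name is hypothetical. It assumes that conjugation_genes is a semicolon-separated list and that NA means none, which is inferred from the example rows shown earlier in this thread rather than from a documented format.

```python
import csv
import io

# The two rows from the example _plasmid_summary.tsv, as tab-separated text.
SAMPLE_TSV = (
    "seq_name\tlength\ttopology\tn_genes\tgenetic_code\tplasmid_score\tfdr\t"
    "n_hallmarks\tmarker_enrichment\tconjugation_genes\n"
    "NC_002128.1\t92721\tLinear\t88\t11\t0.9942\tNA\t5\t46.4458\tT_virB11;MOBP1\n"
    "NC_002127.1\t3306\tLinear\t3\t11\t0.9913\tNA\t1\t1.6586\tNA\n"
)


def plasmid_evidence(row):
    """Collect the three plasmid indicators for one summary row."""
    genes = row["conjugation_genes"]
    # Assumption: "NA" means no conjugation genes; otherwise split on ";".
    n_conjugation = 0 if genes == "NA" else len(genes.split(";"))
    return {
        "seq_name": row["seq_name"],
        "marker_enrichment": float(row["marker_enrichment"]),
        "n_hallmarks": int(row["n_hallmarks"]),
        "n_conjugation_genes": n_conjugation,
    }


rows = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
# Strongest marker enrichment first.
evidence = sorted(
    (plasmid_evidence(row) for row in rows),
    key=lambda e: e["marker_enrichment"],
    reverse=True,
)
for entry in evidence:
    print(entry)
```

Sorting by marker_enrichment is just one convenient ordering. Given the caveat about conjugative phages, the conjugation gene count is probably best read alongside the other two fields rather than on its own.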


alienzj commented on July 29, 2024

Hi, @apcamargo,

Thanks a lot for your reply.
Yes, I shall remove those plasmids when doing virome profiling.

It is quite interesting to find so many plasmids among the viral vMAGs identified by VirSorter2 and phamb.
