geNomad: Identification of mobile genetic elements

Home Page: https://portal.nersc.gov/genomad/

License: Other

Python 99.91% Dockerfile 0.09%

genomad's Introduction

geNomad

geNomad: Identification of mobile genetic elements

Features

geNomad's primary goal is to identify viruses and plasmids in sequencing data (isolates, metagenomes, and metatranscriptomes). It also provides a few additional features that can help you in your analysis:

Taxonomic assignment of viral genomes.
Identification of viruses integrated in host genomes (proviruses).
Functional annotation of proteins.

Documentation

For installation instructions, information about how geNomad works, and a detailed explanation of how to execute it, please check the full documentation: https://portal.nersc.gov/genomad/

Web app

geNomad is available as a web app in the NMDC EDGE platform. There you can upload your sequence data, visualize the results in your browser, and download the data to your computer.

Citing geNomad

If you use geNomad in your work, please consider citing its manuscript:

Identification of mobile genetic elements with geNomad

Camargo, A. P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., Chain, P. S. G., Nayfach, S., & Kyrpides, N. C. — Nature Biotechnology (2023), DOI: 10.1038/s41587-023-01953-y.

Quick start

We recommend that users read the documentation before they start using geNomad. If you are in a rush, however, you can follow this quick step-by-step example.

Installation

First, you need to install geNomad. There are a few ways to do that, but here we will use mamba, as it will handle all dependencies for us.

# Create an environment for geNomad
mamba create -n genomad -c conda-forge -c bioconda genomad
# Activate the geNomad environment
mamba activate genomad

Alternatively, you can run geNomad using Docker.

# Pull the image
docker pull antoniopcamargo/genomad
# Run the image
docker run --rm -ti -v "$(pwd):/app" antoniopcamargo/genomad

Downloading the database

geNomad depends on a database that contains the profiles of the markers that are used to classify sequences, their taxonomic information, their functional annotation, etc. So, you should first download the database to your current directory:

genomad download-database .

The database will be contained within the genomad_db directory.

If you prefer, you can also download the database from Zenodo and extract it manually.

Executing geNomad

Now you are ready to go! geNomad works by executing a series of modules sequentially (you can find more information about this in the pipeline documentation), but we provide a convenient end-to-end command that will execute the entire pipeline for you in one go.

In this example, we will use a Klebsiella pneumoniae genome (GCF_009025895.1) as input. You can use any FASTA file containing nucleotide sequences as input. geNomad will work for isolate genomes, metagenomes, and metatranscriptomes.

The command to execute geNomad is structured like this:

genomad end-to-end [OPTIONS] INPUT OUTPUT DATABASE

So, to run the full geNomad pipeline (end-to-end command), taking a nucleotide FASTA file (GCF_009025895.1.fna.gz) and the database (genomad_db) as input, we will execute the following command:

genomad end-to-end --cleanup --splits 8 GCF_009025895.1.fna.gz genomad_output genomad_db

The results will be written inside the genomad_output directory.

Three important details about the command above:

The --cleanup option was used to force geNomad to delete intermediate files that were generated during the execution. This will save you some storage space.
The --splits 8 parameter was used here to make it possible to run this example in a notebook. geNomad searches a big database of protein profiles that take up a lot of space in memory. To prevent the execution from failing due to insufficient memory, we can use the --splits parameter to split the search into chunks. If you are running geNomad on a large server, you might not need to split your search, which will increase the execution speed.
Note that the input FASTA file used here was compressed. This is possible because geNomad supports input files compressed as .gz, .bz2, or .xz.
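If you launch geNomad from a script, the command above can be assembled programmatically. The following is a minimal sketch and not part of geNomad itself: build_end_to_end_command is a hypothetical helper, and it uses only the two options shown in this example (--cleanup and --splits).

```python
import shlex


def build_end_to_end_command(input_fasta, output_dir, database, splits=8, cleanup=True):
    """Assemble the argument list for `genomad end-to-end` (hypothetical helper)."""
    command = ["genomad", "end-to-end"]
    if cleanup:
        command.append("--cleanup")  # delete intermediate files after the run
    if splits:
        command += ["--splits", str(splits)]  # split the search to reduce memory use
    # Positional arguments follow the order INPUT OUTPUT DATABASE.
    command += [input_fasta, output_dir, database]
    return command


command = build_end_to_end_command(
    "GCF_009025895.1.fna.gz", "genomad_output", "genomad_db"
)
print(shlex.join(command))
```

The returned list can be passed to subprocess.run(command, check=True) on a machine where geNomad is installed; the sketch only prints the command so that it can be inspected first.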

Note By default, geNomad applies a series of post-classification filters to remove likely false positives. For example, sequences are required to have a plasmid or virus score of at least 0.7 and sequences shorter than 2,500 bp are required to encode at least one hallmark gene. If you want to disable the post-classification filters, add the --relaxed flag to your command. On the other hand, if you want to be very conservative with your classification, you may use the --conservative flag. This will make the post-classification filters more aggressive, preventing sequences without strong support from being classified as plasmid or virus. You can check out the default, relaxed, and conservative post-classification filters here.
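To make the two default rules quoted in this note concrete, here is a small illustrative sketch. It is not geNomad's implementation: the Candidate class and passes_example_filters function are hypothetical, and only the two thresholds mentioned above (score of at least 0.7, and at least one hallmark gene for sequences shorter than 2,500 bp) are modelled. The real default filters include more criteria than these.

```python
from dataclasses import dataclass

# Thresholds quoted in the note above; everything else is illustrative.
MIN_SCORE = 0.7
SHORT_SEQUENCE_LENGTH = 2500


@dataclass
class Candidate:
    name: str
    length: int        # sequence length in bp
    score: float       # plasmid or virus score
    n_hallmarks: int   # number of hallmark genes


def passes_example_filters(candidate: Candidate) -> bool:
    """Apply only the two example rules described in the text."""
    if candidate.score < MIN_SCORE:
        return False
    if candidate.length < SHORT_SEQUENCE_LENGTH and candidate.n_hallmarks < 1:
        return False
    return True


candidates = [
    Candidate("long_high_score", 50000, 0.98, 0),
    Candidate("short_no_hallmark", 2000, 0.95, 0),
    Candidate("short_with_hallmark", 2000, 0.95, 1),
    Candidate("low_score", 50000, 0.50, 3),
]
kept = [c.name for c in candidates if passes_example_filters(c)]
print(kept)
```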

Understanding the outputs

In this example, the results of geNomad's analysis will be written to the genomad_output directory, which will look like this:

genomad_output
├── GCF_009025895.1_aggregated_classification
├── GCF_009025895.1_aggregated_classification.log
├── GCF_009025895.1_annotate
├── GCF_009025895.1_annotate.log
├── GCF_009025895.1_find_proviruses
├── GCF_009025895.1_find_proviruses.log
├── GCF_009025895.1_marker_classification
├── GCF_009025895.1_marker_classification.log
├── GCF_009025895.1_nn_classification
├── GCF_009025895.1_nn_classification.log
├── GCF_009025895.1_summary
╰── GCF_009025895.1_summary.log

As mentioned above, geNomad works by executing several modules sequentially. Each one of these will produce a log file (<prefix>_<module>.log) and a subdirectory (<prefix>_<module>).
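The naming scheme can be expressed in a few lines of Python. This is only a sketch: expected_outputs is a hypothetical helper, the module names are copied from the directory listing above, and the prefix is passed in explicitly rather than derived from the input file name.

```python
from pathlib import Path

# Module names as they appear in the directory listing above.
MODULES = [
    "aggregated_classification",
    "annotate",
    "find_proviruses",
    "marker_classification",
    "nn_classification",
    "summary",
]


def expected_outputs(output_dir, prefix):
    """Map each module to its (subdirectory, log file) pair."""
    base = Path(output_dir)
    return {
        module: (base / f"{prefix}_{module}", base / f"{prefix}_{module}.log")
        for module in MODULES
    }


outputs = expected_outputs("genomad_output", "GCF_009025895.1")
summary_dir, summary_log = outputs["summary"]
print(summary_dir)
print(summary_log)
```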

For this example, we will only look at the files within GCF_009025895.1_summary. The <prefix>_summary directory contains files that summarize the results that were generated across the pipeline. If you just want a list of the plasmids and viruses identified in your input, this is what you are looking for.

genomad_output
╰── GCF_009025895.1_summary
    ├── GCF_009025895.1_plasmid.fna
    ├── GCF_009025895.1_plasmid_genes.tsv
    ├── GCF_009025895.1_plasmid_proteins.faa
    ├── GCF_009025895.1_plasmid_summary.tsv
    ├── GCF_009025895.1_summary.json
    ├── GCF_009025895.1_virus.fna
    ├── GCF_009025895.1_virus_genes.tsv
    ├── GCF_009025895.1_virus_proteins.faa
    ╰── GCF_009025895.1_virus_summary.tsv

First, let's look at GCF_009025895.1_virus_summary.tsv:

seq_name                                 length   topology              coordinates       n_genes   genetic_code   virus_score   fdr   n_hallmarks   marker_enrichment   taxonomy
--------------------------------------   ------   -------------------   ---------------   -------   ------------   -----------   ---   -----------   -----------------   -----------------------------------------------------------------
NZ_CP045015.1|provirus_2885510_2934610   49101    Provirus              2885510-2934610   69        11             0.9776        NA    14            76.0892             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
NZ_CP045015.1|provirus_3855947_3906705   50759    Provirus              3855947-3906705   79        11             0.9774        NA    16            75.1552             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
NZ_CP045018.1                            51887    No terminal repeats   NA                57        11             0.9774        NA    14            67.7749             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
…

This tabular file lists all the viruses that geNomad found in your input and gives you some convenient information about them. Here's what each column contains:

seq_name: The identifier of the sequence in the input FASTA file. Proviruses will have the following name scheme: <sequence_identifier>|provirus_<start_coordinate>_<end_coordinate>.
length: Length of the sequence (or the provirus, in the case of integrated viruses).
topology: Topology of the viral sequence. Possible values are: No terminal repeats, DTR (direct terminal repeats), ITR (inverted terminal repeats), or Provirus (viruses integrated in host genomes).
coordinates: 1-indexed coordinates of the provirus region within host sequences. Will be NA for viruses that were not predicted to be integrated.
n_genes: Number of genes encoded in the sequence.
genetic_code: Predicted genetic code. Possible values are: 11 (standard code for Bacteria and Archaea), 4 (recoded TGA stop codon), or 15 (recoded TAG stop codon).
virus_score: A measure of how confident geNomad is that the sequence is a virus. Sequences that have scores close to 1.0 are more likely to be viruses than the ones that have lower scores.
fdr: The estimated false discovery rate (FDR) of the classification (that is, the expected proportion of false positives among the sequences up to this row). To estimate FDRs geNomad requires score calibration, which is turned off by default. Therefore, this column will only contain NA values in this example.
n_hallmarks: Number of genes that matched a hallmark geNomad marker. Hallmarks are genes that were previously associated with viral function and their presence is a strong indication that the sequence is indeed a virus.
marker_enrichment: A score that represents the total enrichment of viral markers in the sequence. The value goes up as the number of virus markers in the sequence increases, so sequences with multiple markers will have higher scores. Chromosome and plasmid markers will reduce the score.
taxonomy: Taxonomic assignment of the virus genome. Lineages follow the taxonomy contained in ICTV's VMR number 19. Viruses can be taxonomically assigned up to the family level, but not to specific genera or species within that family. The taxonomy is presented with a fixed number of fields (corresponding to taxonomic ranks) separated by semicolons, with empty fields left blank.
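The summary table can be loaded with Python's standard csv module. The sketch below assumes that the file is tab-separated (as the .tsv extension suggests) and embeds two rows from the example as a string so that it runs on its own; read_virus_summary is a hypothetical helper, not part of geNomad.

```python
import csv
import io

# Two rows copied from the example table, joined with tabs.
SAMPLE = (
    "seq_name\tlength\ttopology\tcoordinates\tn_genes\tgenetic_code\t"
    "virus_score\tfdr\tn_hallmarks\tmarker_enrichment\ttaxonomy\n"
    "NZ_CP045015.1|provirus_2885510_2934610\t49101\tProvirus\t2885510-2934610\t"
    "69\t11\t0.9776\tNA\t14\t76.0892\t"
    "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;\n"
    "NZ_CP045018.1\t51887\tNo terminal repeats\tNA\t57\t11\t0.9774\tNA\t14\t"
    "67.7749\tViruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;\n"
)


def read_virus_summary(handle):
    """Read the summary table and convert a few columns to Python types."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["length"] = int(row["length"])
        row["virus_score"] = float(row["virus_score"])
        row["n_hallmarks"] = int(row["n_hallmarks"])
        # Coordinates are "start-end" (1-indexed) for proviruses and "NA" otherwise.
        if row["coordinates"] == "NA":
            row["coordinates"] = None
        else:
            start, end = row["coordinates"].split("-")
            row["coordinates"] = (int(start), int(end))
        rows.append(row)
    return rows


rows = read_virus_summary(io.StringIO(SAMPLE))
proviruses = [row for row in rows if row["topology"] == "Provirus"]
for row in proviruses:
    start, end = row["coordinates"]
    # With 1-indexed inclusive coordinates, the length is end - start + 1.
    print(row["seq_name"], end - start + 1 == row["length"])
```

To read a real result, replace io.StringIO(SAMPLE) with an open file handle, for example open("genomad_output/GCF_009025895.1_summary/GCF_009025895.1_virus_summary.tsv", newline="").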

In our example, geNomad identified several proviruses integrated into the K. pneumoniae genome and one extrachromosomal phage. Since they all have high scores and marker enrichment, we can be confident that these are indeed viruses. They were all predicted to use the genetic code 11 and were assigned to the Caudoviricetes class, which contains all the tailed bacteriophages. In the taxonomy field for these viruses, after Caudoviricetes, there are two consecutive semicolons because geNomad could only assign them to the class level, leaving the order and family ranks empty.
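The fixed-width taxonomy string is easy to split into ranks. In the sketch below, the rank labels for the first four fields (root, realm, kingdom, phylum) are an assumption based on the standard ICTV rank names; the text above only confirms the class, order, and family positions. parse_lineage and deepest_assignment are hypothetical helpers.

```python
# Rank labels for the seven semicolon-separated fields (partly assumed, see above).
RANKS = ["root", "realm", "kingdom", "phylum", "class", "order", "family"]


def parse_lineage(taxonomy):
    """Pair each field of the taxonomy string with its rank label."""
    return dict(zip(RANKS, taxonomy.split(";")))


def deepest_assignment(taxonomy):
    """Return the (rank, name) of the most specific non-empty field."""
    assigned = [(rank, name) for rank, name in parse_lineage(taxonomy).items() if name]
    return assigned[-1] if assigned else None


lineage = "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;"
print(parse_lineage(lineage))
print(deepest_assignment(lineage))
```

For the viruses in this example, the deepest assignment is the class Caudoviricetes, and the order and family fields are empty strings.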

Another important file is GCF_009025895.1_virus_genes.tsv. During its execution, geNomad annotates the genes encoded by the input sequences using a database of chromosome, plasmid, and virus-specific markers. The <prefix>_virus_genes.tsv file summarizes the annotation of the genes encoded by the identified viruses.

gene              start   end     length   strand   gc_content   genetic_code   rbs_motif     marker              evalue       bitscore   uscg   plasmid_hallmark   virus_hallmark   taxid   taxname          annotation_conjscan   annotation_amr   annotation_accessions              annotation_description
---------------   -----   -----   ------   ------   ----------   ------------   -----------   -----------------   ----------   --------   ----   ----------------   --------------   -----   --------------   -------------------   --------------   --------------------------------   --------------------------------------------------------------------------------------
NZ_CP045018.1_1   1       399     399      1        0.536        11             None          GENOMAD.108715.VP   2.536e-32    123        0      0                  1                2561    Caudoviricetes   NA                    NA               PF05100;COG4672;TIGR01600          Phage minor tail protein L
NZ_CP045018.1_2   401     1111    711      1        0.568        11             AGGAG         GENOMAD.168265.VP   9.279e-47    170        0      0                  0                2561    Caudoviricetes   NA                    NA               PF14464;COG1310;K21140;TIGR02256   Proteasome lid subunit RPN8/RPN11, contains Jab1/MPN domain metalloenzyme (JAMM) motif
NZ_CP045018.1_3   1143    1493    351      1        0.382        11             AGGAG         GENOMAD.147875.VV   1.495e-14    71         0      0                  0                2561    Caudoviricetes   NA                    NA               COG5633;TIGR03066                  NA
NZ_CP045018.1_4   1509    2120    612      1        0.477        11             GGA/GAG/AGG   GENOMAD.143103.VP   1.958e-50    179        0      0                  1                2561    Caudoviricetes   NA                    NA               PF06805;COG4723;TIGR01687          Phage-related protein, tail component
NZ_CP045018.1_5   2183    13516   11334    1        0.566        11             None          GENOMAD.159864.VP   1.225e-268   923        0      0                  0                2561    Caudoviricetes   NA                    NA               PF12421;PF09327                    Fibronectin type III protein
NZ_CP045018.1_6   13585   15084   1500     1        0.550        11             AGGAG         GENOMAD.195756.VP   2.017e-14    79         0      0                  0                2561    Caudoviricetes   NA                    NA               NA                                 NA
NZ_CP045018.1_7   15163   16128   966      -1       0.469        11             GGAGG         NA                  NA           NA         0      0                  0                1       NA               NA                    NA               NA                                 NA
…

The columns in this file are:

gene: Identifier of the gene (<sequence_name>_<gene_number>). Usually, gene numbers start with 1 (first gene in the sequence). However, genes encoded by proviruses integrated in the middle of the host chromosome may start with a different number, depending on their position within the chromosome.
start: 1-indexed start coordinate of the gene.
end: 1-indexed end coordinate of the gene.
length: Length of the gene locus (in base pairs).
strand: Strand that encodes the gene. Can be 1 (direct strand) or -1 (reverse strand).
gc_content: GC content of the gene locus.
genetic_code: Predicted genetic code (see details in the explanation of the summary file).
rbs_motif: Detected motif of the ribosome-binding site.
marker: Best matching geNomad marker. If this gene doesn't match any markers, the value will be NA.
evalue: E-value of the alignment between the protein encoded by the gene and the best matching geNomad marker.
bitscore: Bitscore of the alignment between the protein encoded by the gene and the best matching geNomad marker.
uscg: Whether the marker assigned to this gene corresponds to a universal single-copy gene (USCG, as defined in BUSCO v5). These genes are expected to be found in chromosomes and are rare in plasmids and viruses. Can be 1 (gene is USCG) or 0 (gene is not USCG).
plasmid_hallmark: Whether the marker assigned to this gene represents a plasmid hallmark.
virus_hallmark: Whether the marker assigned to this gene represents a virus hallmark.
taxid: Taxonomic identifier of the marker assigned to this gene (you can ignore this as it is meant to be used internally by geNomad).
taxname: Name of the taxon associated with the assigned geNomad marker. In this example, we can see that the annotated proteins are all characteristic of Caudoviricetes (which is why the virus was assigned to this class).
annotation_conjscan: If the marker that matched the gene is a conjugation-related gene (as defined in CONJscan), this field will show which CONJscan accession was assigned to the marker.
annotation_amr: If the marker that matched the gene was annotated with an antimicrobial resistance (AMR) function (as defined in NCBIfam-AMRFinder), this field will show which NCBIfam accession was assigned to the marker.
annotation_accessions: Some of the geNomad markers are functionally annotated. This column tells you which entries in Pfam, TIGRFAM, COG, and KEGG were assigned to the marker.
annotation_description: A text describing the function assigned to the marker.
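As an illustration of how these columns can be used, the sketch below counts, for each sequence, the total number of genes, the genes that matched a geNomad marker, and the genes that matched a virus hallmark. It assumes a tab-separated file; the embedded sample keeps only a subset of the columns and four of the rows shown above so that the code runs on its own. summarize_genes is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Four genes from the example table, reduced to the columns used below.
SAMPLE = (
    "gene\tstart\tend\tlength\tstrand\tmarker\tvirus_hallmark\n"
    "NZ_CP045018.1_1\t1\t399\t399\t1\tGENOMAD.108715.VP\t1\n"
    "NZ_CP045018.1_2\t401\t1111\t711\t1\tGENOMAD.168265.VP\t0\n"
    "NZ_CP045018.1_4\t1509\t2120\t612\t1\tGENOMAD.143103.VP\t1\n"
    "NZ_CP045018.1_7\t15163\t16128\t966\t-1\tNA\t0\n"
)


def summarize_genes(handle):
    """Count genes, marker matches, and virus hallmarks per sequence."""
    totals, with_marker, hallmarks = Counter(), Counter(), Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        # Gene identifiers follow <sequence_name>_<gene_number>.
        seq_name, _gene_number = row["gene"].rsplit("_", 1)
        totals[seq_name] += 1
        if row["marker"] != "NA":
            with_marker[seq_name] += 1
        if row["virus_hallmark"] == "1":
            hallmarks[seq_name] += 1
    return {
        name: {
            "genes": totals[name],
            "with_marker": with_marker[name],
            "virus_hallmarks": hallmarks[name],
        }
        for name in totals
    }


summary = summarize_genes(io.StringIO(SAMPLE))
print(summary)
```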

In the example above we can see the information of the first seven genes encoded by NZ_CP045018.1. The last entry didn't match any geNomad marker. The first six were all assigned to protein families, some of which are typical of tailed bacteriophages (such as the minor tail protein), reassuring us that these are indeed Caudoviricetes.

One important detail here is that the primary purpose of geNomad's markers is classification. They were designed to be specific to chromosomes, plasmids, or viruses, enabling the distinction of sequences belonging to these classes. Therefore, you should not expect that every single viral gene will be annotated with a geNomad marker. If you want to annotate the genes within your sequences as thoroughly as possible, you should use databases such as Pfam or COG.

The other two virus-related files within the summary directory are GCF_009025895.1_virus.fna and GCF_009025895.1_virus_proteins.faa. These are FASTA files of the identified virus sequences and their proteins, respectively. Proviruses are automatically excised from the host sequence.

Moving on to plasmids, the data related to their identification can be found in the <prefix>_plasmid_summary.tsv, <prefix>_plasmid_genes.tsv, <prefix>_plasmid.fna, and <prefix>_plasmid_proteins.faa files. These are mostly very similar to their virus counterparts. The differences in <prefix>_plasmid_summary.tsv (shown below) are the following:

Virus-specific columns that are in <prefix>_virus_summary.tsv (coordinates and taxonomy) are not present.
The conjugation_genes column lists genes that might be involved in conjugation. It's important to note that the presence of such genes is not sufficient to tell whether a given plasmid is conjugative or mobilizable. If you are interested in identifying conjugative plasmids, we recommend analyzing the plasmids you identified using geNomad with CONJscan.
The amr_genes column lists genes annotated with antimicrobial resistance function. You can check the specific functions associated with each accession on the AMRFinderPlus website.

seq_name        length   topology              n_genes   genetic_code   plasmid_score   fdr   n_hallmarks   marker_enrichment   conjugation_genes                                                                                       amr_genes
-------------   ------   -------------------   -------   ------------   -------------   ---   -----------   -----------------   -----------------------------------------------------------------------------------------------------   -----------------------------------
NZ_CP045020.1   28729    No terminal repeats   36        11             0.9955          NA    7             25.8098             F_traE                                                                                                  NA
NZ_CP045022.1   50635    No terminal repeats   61        11             0.9947          NA    9             46.4657             T_virB1;T_virB3;virb4;T_virB5;T_virB6;T_virB8;T_virB9                                                   NA
NZ_CP045019.1   44850    No terminal repeats   52        11             0.9945          NA    3             28.7110             F_traE                                                                                                  NA
NZ_CP045016.1   82240    No terminal repeats   110       11             0.9939          NA    11            33.4021             T_virB8;T_virB9;F_traF;F_traH;F_traG;T_virB1                                                            NF000225;NF000270;NF012171;NF000052
NZ_CP045017.1   61331    No terminal repeats   76        11             0.9934          NA    16            36.2817             I_trbB;I_trbA;MOBP1;I_traI;I_traK;I_traL;I_traN;I_traO;I_traP;I_traQ;I_traR;traU;I_traW;I_traY;F_traE   NA
NZ_CP045021.1   5251     No terminal repeats   7         11             0.9910          NA    1             1.4225              NA                                                                                                      NA
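The conjugation_genes and amr_genes columns hold semicolon-separated lists, with NA marking an empty list, so they can be split into Python lists. The following sketch assumes a tab-separated file and embeds two rows from the table above, reduced to three columns; split_field and read_plasmid_annotations are hypothetical helpers.

```python
import csv
import io

# Two rows copied from the table above, reduced to three columns for brevity.
SAMPLE = (
    "seq_name\tconjugation_genes\tamr_genes\n"
    "NZ_CP045016.1\tT_virB8;T_virB9;F_traF;F_traH;F_traG;T_virB1\t"
    "NF000225;NF000270;NF012171;NF000052\n"
    "NZ_CP045021.1\tNA\tNA\n"
)


def split_field(value):
    """Turn a semicolon-separated field into a list; "NA" becomes an empty list."""
    return [] if value == "NA" else value.split(";")


def read_plasmid_annotations(handle):
    """Collect the conjugation and AMR annotations of each plasmid."""
    annotations = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        annotations[row["seq_name"]] = {
            "conjugation_genes": split_field(row["conjugation_genes"]),
            "amr_genes": split_field(row["amr_genes"]),
        }
    return annotations


plasmids = read_plasmid_annotations(io.StringIO(SAMPLE))
with_amr = [name for name, info in plasmids.items() if info["amr_genes"]]
print(with_amr)
```

As noted above, a non-empty conjugation_genes list does not by itself show that a plasmid is conjugative or mobilizable.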

genomad's Issues

`OSError: Could not find/load shared object file: libllvmlite.so` from broken conda-forge package. Found a fix using the numba channel.

I'm getting the following error when installing genomad in a fresh environment:

OSError: Could not find/load shared object file: libllvmlite.so
 Error was: libstdc++.so.6: failed to map segment from shared object

A couple of things I've tried, with the same result:

Reinstalling llvmlite from conda-forge (https://anaconda.org/conda-forge/llvmlite).
Downgrading to v1.3.2 and v1.3.1.

What ended up working was swapping the llvmlite install from the conda-forge channel to the numba channel.

Here's the log:

(base) [jespinoz@login02 jcl110]$ mamba create -n genomad_env -c bioconda genomad

                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (1.2.0) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████


Looking for: ['genomad']

bioconda/linux-64                                           Using cache
bioconda/noarch                                             Using cache
conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
pkgs/r/linux-64                                               No change
pkgs/main/linux-64                                            No change
pkgs/main/noarch                                              No change
pkgs/r/noarch                                                 No change
jolespin/linux-64                                             No change
qiime2/noarch                                                 No change
jolespin/noarch                                               No change
qiime2/linux-64                                               No change
Transaction

  Prefix: /expanse/projects/jcl110/anaconda3/envs/genomad_env

  Updating specs:

   - genomad


  Package                         Version  Build                    Channel                    Size
─────────────────────────────────────────────────────────────────────────────────────────────────────
  Install:
─────────────────────────────────────────────────────────────────────────────────────────────────────

  + _libgcc_mutex                     0.1  conda_forge              conda-forge/linux-64     Cached
  + _openmp_mutex                     4.5  2_gnu                    conda-forge/linux-64     Cached
  + _py-xgboost-mutex                 2.0  cpu_0                    conda-forge/linux-64        8kB
  + absl-py                         1.4.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + aiohttp                         3.8.3  py310h5764c6d_1          conda-forge/linux-64     Cached
  + aiosignal                       1.3.1  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + appdirs                         1.4.4  pyh9f0ad1d_0             conda-forge/noarch       Cached
  + aragorn                        1.2.41  hec16e2b_0               bioconda/linux-64        Cached
  + aria2                          1.23.0  0                        bioconda/linux-64          29MB
  + astunparse                      1.6.3  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + async-timeout                   4.0.2  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + attrs                          22.2.0  pyh71513ae_0             conda-forge/noarch       Cached
  + blinker                           1.5  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + brotlipy                        0.7.0  py310h5764c6d_1005       conda-forge/linux-64     Cached
  + bzip2                           1.0.8  h7f98852_4               conda-forge/linux-64     Cached
  + c-ares                         1.18.1  h7f98852_0               conda-forge/linux-64     Cached
  + ca-certificates             2022.12.7  ha878542_0               conda-forge/linux-64     Cached
  + cached-property                 1.5.2  hd8ed1ab_1               conda-forge/noarch          4kB
  + cached_property                 1.5.2  pyha770c72_1             conda-forge/noarch       Cached
  + cachetools                      5.3.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + certifi                     2022.12.7  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + cffi                           1.15.1  py310h255011f_3          conda-forge/linux-64     Cached
  + charset-normalizer              2.1.1  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + click                           8.1.3  unix_pyhd8ed1ab_2        conda-forge/noarch       Cached
  + cryptography                   39.0.0  py310h34c0648_0          conda-forge/linux-64     Cached
  + flatbuffers                  22.12.06  hcb278e6_2               conda-forge/linux-64     Cached
  + frozenlist                      1.3.3  py310h5764c6d_0          conda-forge/linux-64     Cached
  + gast                            0.4.0  pyh9f0ad1d_0             conda-forge/noarch       Cached
  + gawk                            5.1.0  h7f98852_0               conda-forge/linux-64     Cached
  + genomad                         1.3.3  pyhdfd78af_0             bioconda/noarch          Cached
  + gettext                        0.21.1  h27087fc_0               conda-forge/linux-64     Cached
  + giflib                          5.2.1  h36c2ea0_2               conda-forge/linux-64     Cached
  + google-auth                    2.16.0  pyh1a96a4e_1             conda-forge/noarch       Cached
  + google-auth-oauthlib            0.4.6  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + google-pasta                    0.2.0  pyh8c360ce_0             conda-forge/noarch       Cached
  + grpcio                         1.51.1  py310h4a5735c_1          conda-forge/linux-64     Cached
  + h5py                            3.8.0  nompi_py310h0311031_100  conda-forge/linux-64     Cached
  + hdf5                           1.12.2  nompi_h4df4325_101       conda-forge/linux-64     Cached
  + icu                              70.1  h27087fc_0               conda-forge/linux-64     Cached
  + idna                              3.4  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + importlib-metadata              6.0.0  pyha770c72_0             conda-forge/noarch       Cached
  + joblib                          1.2.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + jpeg                               9e  h166bdaf_2               conda-forge/linux-64     Cached
  + keras                          2.11.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + keras-preprocessing             1.1.2  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + keyutils                        1.6.1  h166bdaf_0               conda-forge/linux-64     Cached
  + krb5                           1.20.1  h81ceb04_0               conda-forge/linux-64     Cached
  + ld_impl_linux-64                 2.40  h41732ed_0               conda-forge/linux-64     Cached
  + libabseil                  20220623.0  cxx17_h05df665_6         conda-forge/linux-64     Cached
  + libaec                          1.0.6  hcb278e6_1               conda-forge/linux-64     Cached
  + libblas                         3.9.0  16_linux64_openblas      conda-forge/linux-64     Cached
  + libcblas                        3.9.0  16_linux64_openblas      conda-forge/linux-64     Cached
  + libcurl                        7.87.0  hdc1c0ab_0               conda-forge/linux-64     Cached
  + libedit                  3.1.20191231  he28a2e2_2               conda-forge/linux-64     Cached
  + libev                            4.33  h516909a_1               conda-forge/linux-64     Cached
  + libffi                          3.4.2  h7f98852_5               conda-forge/linux-64     Cached
  + libgcc                          7.2.0  h69d50b8_2               conda-forge/linux-64     Cached
  + libgcc-ng                      12.2.0  h65d4601_19              conda-forge/linux-64     Cached
  + libgfortran-ng                 12.2.0  h69a702a_19              conda-forge/linux-64     Cached
  + libgfortran5                   12.2.0  h337968e_19              conda-forge/linux-64     Cached
  + libgomp                        12.2.0  h65d4601_19              conda-forge/linux-64     Cached
  + libgrpc                        1.51.1  h4fad500_1               conda-forge/linux-64     Cached
  + libiconv                         1.17  h166bdaf_0               conda-forge/linux-64     Cached
  + libidn2                         2.3.4  h166bdaf_0               conda-forge/linux-64     Cached
  + liblapack                       3.9.0  16_linux64_openblas      conda-forge/linux-64     Cached
  + libllvm11                      11.1.0  he0ac6c6_5               conda-forge/linux-64     Cached
  + libnghttp2                     1.51.0  hff17c54_0               conda-forge/linux-64     Cached
  + libnsl                          2.0.0  h7f98852_0               conda-forge/linux-64     Cached
  + libopenblas                    0.3.21  pthreads_h78a6416_3      conda-forge/linux-64     Cached
  + libpng                         1.6.39  h753d276_0               conda-forge/linux-64     Cached
  + libprotobuf                   3.21.12  h3eb15da_0               conda-forge/linux-64     Cached
  + libsqlite                      3.40.0  h753d276_0               conda-forge/linux-64     Cached
  + libssh2                        1.10.0  hf14f497_3               conda-forge/linux-64     Cached
  + libstdcxx-ng                   12.2.0  h46fd767_19              conda-forge/linux-64     Cached
  + libunistring                   0.9.10  h7f98852_0               conda-forge/linux-64     Cached
  + libuuid                        2.32.1  h7f98852_1000            conda-forge/linux-64     Cached
  + libxgboost                      1.7.1  cpu_ha3b9936_0           conda-forge/linux-64     Cached
  + libxml2                        2.10.3  h7463322_0               conda-forge/linux-64     Cached
  + libzlib                        1.2.13  h166bdaf_4               conda-forge/linux-64     Cached
  + llvmlite                       0.39.1  py310h58363a5_1          conda-forge/linux-64     Cached
  + markdown                        3.4.1  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + markdown-it-py                  2.1.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + markupsafe                      2.1.2  py310h1fa729e_0          conda-forge/linux-64     Cached
  + mdurl                           0.1.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + mmseqs2                      14.7e284  pl5321hf1761c0_0         bioconda/linux-64        Cached
  + multidict                       6.0.4  py310h1fa729e_0          conda-forge/linux-64     Cached
  + ncurses                           6.3  h27087fc_1               conda-forge/linux-64     Cached
  + numba                          0.56.4  py310ha5257ce_0          conda-forge/linux-64     Cached
  + numpy                          1.23.5  py310h53a5b5f_0          conda-forge/linux-64     Cached
  + oauthlib                        3.2.2  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + openssl                         3.0.7  h0b41bf4_2               conda-forge/linux-64     Cached
  + opt_einsum                      3.3.0  pyhd8ed1ab_1             conda-forge/noarch       Cached
  + packaging                        23.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + perl                           5.32.1  2_h7f98852_perl5         conda-forge/linux-64     Cached
  + pip                              23.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + pooch                           1.6.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + prodigal-gv                    2.10.0  h7132678_0               bioconda/linux-64        Cached
  + protobuf                      4.21.12  py310heca2aa9_0          conda-forge/linux-64     Cached
  + py-xgboost                      1.7.1  cpu_py310hd1aba9c_0      conda-forge/linux-64     Cached
  + pyasn1                          0.4.8  py_0                     conda-forge/noarch       Cached
  + pyasn1-modules                  0.2.7  py_0                     conda-forge/noarch       Cached
  + pycparser                        2.21  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + pygments                       2.14.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + pyjwt                           2.6.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + pyopenssl                      23.0.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + pysocks                         1.7.1  pyha2e5f31_6             conda-forge/noarch       Cached
  + python                         3.10.8  h4a9ceb5_0_cpython       conda-forge/linux-64     Cached
  + python-crfsuite                 0.9.8  py310hbf28c38_1          conda-forge/linux-64     Cached
  + python-flatbuffers            23.1.21  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + python_abi                       3.10  3_cp310                  conda-forge/linux-64     Cached
  + pyu2f                           0.1.5  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + re2                        2023.02.01  hcb278e6_0               conda-forge/linux-64     Cached
  + readline                        8.1.2  h0f457ee_0               conda-forge/linux-64     Cached
  + requests                       2.28.2  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + requests-oauthlib               1.3.1  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + rich                           13.3.1  pyhd8ed1ab_1             conda-forge/noarch       Cached
  + rich-click                      1.6.1  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + rsa                               4.9  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + scikit-learn                    1.2.1  py310h209a8ca_0          conda-forge/linux-64     Cached
  + scipy                          1.10.0  py310h8deb116_0          conda-forge/linux-64     Cached
  + setuptools                     67.1.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + six                            1.16.0  pyh6c4a22f_0             conda-forge/noarch       Cached
  + snappy                          1.1.9  hbd366e4_2               conda-forge/linux-64     Cached
  + sqlite                         3.40.0  h4ff8645_0               conda-forge/linux-64     Cached
  + taxopy                         0.11.0  pyhdfd78af_0             bioconda/noarch          Cached
  + tensorboard                    2.11.2  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + tensorboard-data-server         0.6.1  py310h600f1e7_4          conda-forge/linux-64     Cached
  + tensorboard-plugin-wit          1.8.1  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + tensorflow                     2.11.0  cpu_py310hd1aba9c_0      conda-forge/linux-64     Cached
  + tensorflow-base                2.11.0  cpu_py310hc9b7e7f_0      conda-forge/linux-64     Cached
  + tensorflow-estimator           2.11.0  cpu_py310hfed9998_0      conda-forge/linux-64     Cached
  + termcolor                       2.2.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + threadpoolctl                   3.1.0  pyh8a188c0_0             conda-forge/noarch       Cached
  + tk                             8.6.12  h27826a3_0               conda-forge/linux-64     Cached
  + typing-extensions               4.4.0  hd8ed1ab_0               conda-forge/noarch       Cached
  + typing_extensions               4.4.0  pyha770c72_0             conda-forge/noarch       Cached
  + tzdata                          2022g  h191b570_0               conda-forge/noarch       Cached
  + urllib3                       1.26.14  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + werkzeug                        2.2.2  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + wget                           1.20.3  ha35d2d1_1               conda-forge/linux-64     Cached
  + wheel                          0.38.4  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + wrapt                          1.14.1  py310h5764c6d_1          conda-forge/linux-64     Cached
  + xgboost                         1.7.1  cpu_py310hd1aba9c_0      conda-forge/linux-64     Cached
  + xz                              5.2.6  h166bdaf_0               conda-forge/linux-64     Cached
  + yarl                            1.8.2  py310h5764c6d_0          conda-forge/linux-64     Cached
  + zipp                           3.12.0  pyhd8ed1ab_0             conda-forge/noarch       Cached
  + zlib                           1.2.13  h166bdaf_4               conda-forge/linux-64     Cached

  Summary:

  Install: 147 packages

  Total download: 29MB

─────────────────────────────────────────────────────────────────────────────────────────────────────


Confirm changes: [Y/n] y
_py-xgboost-mutex                                    7.9kB @  44.8kB/s  0.2s
cached-property                                      4.1kB @  21.0kB/s  0.2s
aria2                                               29.3MB @  38.9MB/s  0.8s

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

To activate this environment, use

     $ mamba activate genomad_env

To deactivate an active environment, use

     $ mamba deactivate

(base) [jespinoz@login02 jcl110]$ conda activate genomad_env
(genomad_env) [jespinoz@login02 jcl110]$ genomad -h
Traceback (most recent call last):
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/llvmlite/binding/ffi.py", line 160, in <module>
    lib = ctypes.CDLL(str(_lib_handle.__enter__()))
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libstdc++.so.6: failed to map segment from shared object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/bin/genomad", line 6, in <module>
    from genomad.cli import cli
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/genomad/__init__.py", line 5, in <module>
    from genomad.modules import (
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/genomad/modules/aggregated_classification.py", line 4, in <module>
    from genomad import sequence, utils
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/genomad/sequence.py", line 9, in <module>
    from numba import njit
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/numba/__init__.py", line 19, in <module>
    from numba.core import config
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/numba/core/config.py", line 16, in <module>
    import llvmlite.binding as ll
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/llvmlite/binding/__init__.py", line 4, in <module>
    from .dylib import *
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/llvmlite/binding/dylib.py", line 3, in <module>
    from llvmlite.binding import ffi
  File "/expanse/projects/jcl110/anaconda3/envs/genomad_env/lib/python3.10/site-packages/llvmlite/binding/ffi.py", line 167, in <module>
    raise OSError(msg)
OSError: Could not find/load shared object file: libllvmlite.so
 Error was: libstdc++.so.6: failed to map segment from shared object

I got it to work by installing llvmlite from the numba channel instead of conda-forge:

(genomad_env) [jespinoz@login02 jcl110]$ conda install -c numba llvmlite
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /expanse/projects/jcl110/anaconda3/envs/genomad_env

  added / updated specs:
    - llvmlite


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    llvmlite-0.39.1            |  py310he1b5a44_0        28.2 MB  numba
    ------------------------------------------------------------
                                           Total:        28.2 MB

The following packages will be SUPERSEDED by a higher-priority channel:

  llvmlite           conda-forge::llvmlite-0.39.1-py310h58~ --> numba::llvmlite-0.39.1-py310he1b5a44_0


Proceed ([y]/n)? y


Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Does a prediction with no coordinates indicate that the whole sequence is viral?

I just found the answer:

coordinates: 1-indexed coordinates of the provirus region within host sequences. Will be NA for viruses that were not predicted to be integrated.

Normal prediction result.

seq_name                            length   topology   coordinates     n_genes   genetic_code   virus_score   fdr   n_hallmarks   marker_enrichment   taxonomy
---------------------------------   ------   --------   -------------   -------   ------------   -----------   ---   -----------   -----------------   ---------------------------------------------------------------
CP000388.1|provirus_774617_826650   52034    Provirus   774617-826650   58        11             0.9401        NA    11            45.1836             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes

Prediction without coordinates (genome is GCA_000166735.2, a MAG).

seq_name         length   topology              coordinates   n_genes   genetic_code   virus_score   fdr   n_hallmarks   marker_enrichment   taxonomy
--------------   ------   -------------------   -----------   -------   ------------   -----------   ---   -----------   -----------------   ---------------------------------------------------------------
AEMJ01000831.1   698      No terminal repeats   NA            1         11             0.9638        NA    0             1.7183              Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
AEMJ01000737.1   1746     No terminal repeats   NA            2         11             0.9244        NA    0             1.7183              Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
AEMJ01000706.1   826      No terminal repeats   NA            1         11             0.8908        NA    0             1.4495              Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
AEMJ01000847.1   3369     No terminal repeats   NA            2         11             0.8785        NA    0             1.7183              Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
AEMJ01000526.1   288      No terminal repeats   NA            2         11             0.8497        NA    0             0.0000              Unclassified
AEMJ01000792.1   2672     No terminal repeats   NA            3         11             0.8414        NA    0             1.7183              Unclassified
AEMJ01000320.1   1885     No terminal repeats   NA            2         11             0.8297        NA    0             1.7183              Unclassified
AEMJ01000546.1   283      No terminal repeats   NA            2         11             0.8262        NA    0             0.0000              Unclassified
AEMJ01000712.1   349      No terminal repeats   NA            2         11             0.8238        NA    0             0.0000              Unclassified

$ seqkit stats GCA_000166735.2.fna.gz 
file                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
GCA_000166735.2.fna.gz  FASTA   DNA        893  2,298,088      101  2,573.4   82,336

$ seqkit seq -n GCA_000166735.2.fna.gz | head -n 3
AEMJ01000893.1 UNVERIFIED_ORG: Leuconostoc inhae KCTC 3774 contig00909, whole genome shotgun sequence
AEMJ01000892.1 UNVERIFIED_ORG: Leuconostoc inhae KCTC 3774 contig00908, whole genome shotgun sequence
AEMJ01000891.1 UNVERIFIED_ORG: Leuconostoc inhae KCTC 3774 contig00907, whole genome shotgun sequence
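
The rule quoted above (coordinates is NA unless the virus was predicted to be integrated) can be applied programmatically. The following is a minimal sketch, assuming a tab-separated summary with the seq_name and coordinates columns shown in the tables above. The function name split_by_integration and the inline sample are illustrative only and are not part of geNomad.

```python
import csv
import io


def split_by_integration(tsv_text):
    """Split summary rows into proviruses and non-integrated predictions.

    A row counts as a provirus when its `coordinates` field is not "NA".
    """
    proviruses, free_viruses = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["coordinates"] == "NA":
            free_viruses.append(row["seq_name"])
        else:
            proviruses.append(row["seq_name"])
    return proviruses, free_viruses


# Illustrative rows based on the two tables above (only three columns kept).
sample = (
    "seq_name\tlength\tcoordinates\n"
    "CP000388.1|provirus_774617_826650\t52034\t774617-826650\n"
    "AEMJ01000831.1\t698\tNA\n"
)
proviruses, free_viruses = split_by_integration(sample)
print("proviruses:", proviruses)
print("not integrated:", free_viruses)
```

An NA in coordinates says that the whole input sequence was classified as viral. It does not by itself say that the sequence is a complete genome.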

Non-detailed taxonomy classification. Possibility of IMG/VR database integration

First of all, thank you very much for developing this tool. Using this tool was a great experience.

I have four questions.

First, in 95% of cases the taxonomy results only go as far as the "order" level; "family" is reported only in some cases.
Can this be improved? Is it possible to reach lower ranks such as genus or even species, or at least to report a family in all cases?

Second,
most of the results (over 90%) are placed in the Unclassified group. Is there a way to classify them?

Third,
the Zenodo record at https://zenodo.org/record/7084650 also contains three other databases. What are they used for? Can they be used with geNomad, for example genomad_hmm_v1.1.tar.gz?

Fourth,
is it possible to use the IMG/VR 4 database (as mentioned in its paper) alongside this pipeline, or as a database for it, to obtain a more accurate taxonomy and fewer Unclassified results?

Thank you very much in advance,
NP

mmseqs prefilter error: database has wrong type

Hi,
I am trying to annotate virus contigs (5 kb and above) identified with VirSorter2 and DeepVirFinder. However, the mmseqs prefilter step throws the following error:

[14:07:34] Executing genomad annotate.
[14:07:34] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option to force the execution of all the steps again.
[14:07:34] final.vcontigs.fixed_proteins.faa was found. Skipping gene prediction with prodigal-gv.
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/genomad/mmseqs2.py", line 190, in run_mmseqs2
    subprocess.run(command, stdout=fout, stderr=fout, check=True)
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mmseqs', 'prefilter', PosixPath('0.6.viral_taxo/0.2.genomad/final.vcontigs.fixed_annotate/final.vcontigs.fixed_mmseqs2/query_db/query_db'), PosixPath('/home/user/database/genomad-1.5/genomad_db'), PosixPath('0.6.viral_taxo/0.2.genomad/final.vcontigs.fixed_annotate/final.vcontigs.fixed_mmseqs2/search_db/prefilter_db'), '--threads', '30', '-s', '4.2', '--split', '0', '--split-mode', '0', '--max-seqs', '10000000', '--min-ungapped-score', '25', '-k', '5']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/genomad/cli.py", line 441, in annotate
    genomad.annotate.main(
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/genomad/modules/annotate.py", line 203, in main
    mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, splits)
  File "/home/user/miniconda3/envs/genomad/lib/python3.8/site-packages/genomad/mmseqs2.py", line 193, in run_mmseqs2
    raise Exception(f"'{command_str}' failed.") from e
Exception: 'mmseqs prefilter 0.6.viral_taxo/0.2.genomad/final.vcontigs.fixed_annotate/final.vcontigs.fixed_mmseqs2/query_db/query_db /home/user/database/genomad-1.5/genomad_db 0.6.viral_taxo/0.2.genomad/final.vcontigs.fixed_annotate/final.vcontigs.fixed_mmseqs2/search_db/prefilter_db --threads 30 -s 4.2 --split 0 --split-mode 0 --max-seqs 10000000 --min-ungapped-score 25 -k 5' failed.

I checked the mmseqs2.log and it says Input database has the wrong type (Generic):

Time for merging to query_db: 0h 0m 0s 8ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 124ms
prefilter 0.6.viral_taxo/0.2.genomad/final.vcontigs.fixed_annotate/final.vcontigs.fixed_mmseqs2/query_db/query_db /home/user/database/genomad-1.5/genomad_db 0.6.viral_taxo/0.2.genomad/final.vcontigs.fixed_annotate/final.vcontigs.fixed_mmseqs2/search_db/prefilter_db --threads 30 -s 4.2 --split 0 --split-mode 0 --max-seqs 10000000 --min-ungapped-score 25 -k 5 

MMseqs Version:           	14.7e284
Substitution matrix       	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix  	aa:VTML80.out,nucl:nucleotide.out
Sensitivity               	4.2
k-mer length              	5
k-score                   	seq:2147483647,prof:2147483647
Alphabet size             	aa:21,nucl:5
Max sequence length       	65535
Max results per query     	10000000
Split database            	0
Split mode                	0
Split memory limit        	0
Coverage threshold        	0
Coverage mode             	0
Compositional bias        	1
Compositional bias        	1
Diagonal scoring          	true
Exact k-mer matching      	0
Mask residues             	1
Mask residues probability 	0.9
Mask lower case residues  	0
Minimum diagonal score    	25
Selected taxa             	
Include identical seq. id.	false
Spaced k-mers             	1
Preload mode              	0
Pseudo count a            	substitution:1.100,context:1.400
Pseudo count b            	substitution:4.100,context:5.800
Spaced k-mer pattern      	
Local temporary path      	
Threads                   	30
Compressed                	0
Verbosity                 	3

Input database "/home/user/database/genomad-1.5/genomad_db" has the wrong type (Generic).

Allowed input:
- Index
- Nucleotide
- Profile
- Aminoacid

I tried re-downloading the database and changing the output directory, but got the same error.
The database files were manually downloaded and extracted to /home/user/database/genomad-1.5.

Environment info

genomad --version
geNomad, version 1.7.0  (installed through conda)

mmseqs version
14.7e284

Database version: 1.5

ls /home/user/database/genomad-1.5
genomad_db
genomad_hmm_v1.5  
genomad_metadata_v1.5.tsv  
genomad_msa_v1.5  
mmseqs_vrefseq  
version.txt
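
The prefilter command in the log above uses /home/user/database/genomad-1.5/genomad_db as the MMseqs2 database prefix. An MMseqs2 database is a set of sibling files that share a prefix (the data file plus .index and .dbtype files), so one thing worth checking is whether those files exist directly under the directory passed to geNomad. The following is a minimal sketch of that check. The function name check_mmseqs_prefix is hypothetical, and this is a diagnostic aid rather than a confirmed diagnosis of the report.

```python
from pathlib import Path


def check_mmseqs_prefix(database_dir, prefix="genomad_db"):
    """Report whether the MMseqs2 files for `prefix` are regular files.

    Returns a dict mapping each expected file name to True or False.
    """
    base = Path(database_dir)
    expected = [prefix, f"{prefix}.index", f"{prefix}.dbtype"]
    return {name: (base / name).is_file() for name in expected}


# Path taken from the report above; adjust it to your own setup.
for name, ok in check_mmseqs_prefix("/home/user/database/genomad-1.5").items():
    print(f"{name}: {'found' if ok else 'missing or not a regular file'}")
```

If genomad_db turns out to be a directory rather than a file, the database files may sit one level deeper. That is an assumption suggested by the directory listing above, not something the report confirms.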

Would there be a smaller database available to use for testing?

Do you know of a smaller database, on the order of megabytes, that could be used to test this tool?

Thanks!

Assistance Needed: Unexpected Error and Output Truncation During geNomad Tool Execution

I am encountering errors and incomplete output when running geNomad. The error messages point to a character encoding problem, and the output appears truncated, which makes the results difficult to interpret.

Issue Details

Error Description:
When I run geNomad, I get a character encoding error from the 'charmap' codec:

UnicodeEncodeError: 'charmap' codec can't encode characters in position ...: character maps to <undefined>

This suggests that the tool has trouble with file paths that contain special characters, or with environments that do not use UTF-8.

Output Truncation:
Even in successful executions, the output appears truncated, so the information it reports is incomplete.

Steps to Reproduce

Install geNomad.
Run it with a command that involves file paths with special characters, or on a system with a non-UTF-8 character encoding.

Expected Behavior

The tool should run without character encoding errors.
The output should be complete, without truncation.

Additional Information

I tried to resolve the encoding issue by setting the PYTHONIOENCODING environment variable to utf-8, but the error persists.

Executing geNomad annotate (v1.7.0). This will perform gene calling in the input sequences and annotate the predicted │
│ proteins with geNomad's markers. │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ Outputs: │
│ genomad_output\E. coli K12 MG1655 (NC_000913)_annotate │
│ ├── E. coli K12 MG1655 (NC_000913)_annotate.json (execution parameters) │
│ ├── E. coli K12 MG1655 (NC_000913)_genes.tsv (gene annotation data) │
│ ├── E. coli K12 MG1655 (NC_000913)_taxonomy.tsv (taxonomic assignment) │
│ ├── E. coli K12 MG1655 (NC_000913)_mmseqs2.tsv (MMseqs2 output file) │
│ └── E. coli K12 MG1655 (NC_000913)_proteins.faa (protein FASTA file) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Scripts\genomad.exe\__main__.py", line 7, in <module>
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\rich_click\rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\genomad\cli.py", line 1240, in end_to_end
    ctx.invoke(
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\genomad\cli.py", line 441, in annotate
    genomad.annotate.main(
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\genomad\modules\annotate.py", line 82, in main
    utils.display_header(
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\genomad\utils.py", line 286, in display_header
    console.print(
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\genomad\utils.py", line 96, in print
    self.write_print(*args, **kwargs)
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\genomad\utils.py", line 80, in write_print
    self.writer_console.print(*args, **kwargs)
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\rich\console.py", line 1673, in print
    with self:
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\rich\console.py", line 865, in __exit__
    self._exit_buffer()
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\rich\console.py", line 823, in _exit_buffer
    self._check_buffer()
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\site-packages\rich\console.py", line 2039, in _check_buffer
    write(text)
  File "C:\Users\DavidIbarra\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 381-498: character maps to <undefined>
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***
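
The last frame of the traceback is the cp1252 codec, which has no mapping for the box-drawing characters that the header panel is drawn with. The following is a minimal sketch that reproduces the failure in isolation and shows that UTF-8 handles the same text. The helper name can_encode is illustrative.

```python
def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Same kind of characters as the panel border in the log above.
panel_border = "╰──────╯"

print(can_encode(panel_border, "cp1252"))  # False
print(can_encode(panel_border, "utf-8"))   # True
```

PYTHONIOENCODING only affects stdin, stdout, and stderr. Python's UTF-8 mode (-X utf8 or PYTHONUTF8=1, available since Python 3.7) also changes the default encoding of files opened without an explicit encoding. Whether that resolves this particular report is not confirmed here.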

mmseqs2 prefilter requires too much memory

Dear Developer,
I am using geNomad to annotate viruses from metagenomic sequencing data and ran into a problem with the MMseqs2 prefilter step. I have read the FAQ and used --splits 8, but it still reports that there is not enough memory.

Environment:

Linux x86_64
1000G memory
8 threads
genomad: 1.5.1
MMseqs2 version: 14.7e284

Input file: FASTA.fa (5G size)

My annotate command is:

$MY_PATH/genomad annotate --splits 8 --threads 8 --cleanup $MY_PATH/FASTA.fa $MY_PATH/demo $MY_PATH/genomad_db_v1.1

$MY_PATH stands for the actual working directory path.

The error is as follows:

prefilter $MY_PATH/FASTA_annotate/FASTA_mmseqs2/query_db/query_db $MY_PATH/genomad_db_v1.1/genomad_db $MY_PATH/FASTA_annotate/FASTA_mmseqs2/tmp/11571856592932011841/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 4.2 -k 5 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 8 --split-mode 0 --split-memory-limit 0 -c 0.2 --cov-mode 1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 20 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 8 --compressed 0 -v 3  
Query database size: 33938637 type: Aminoacid
Target split mode. Searching through 8 splits
Estimated memory consumption: 577M
Target database size: 227897 type: Profile
Process prefiltering step 1 of 8
Index table k-mer threshold: 89 at k-mer size 5
Index table: counting k-mers
[=================================================================] 28.46K 85h 53m 21s 171ms
Index table: Masked residues: 0
Can not allocate entries memory in IndexTable::initMemory
Error: Prefilter died

I look forward to your reply. You can contact me by e-mail at [email protected]
Thank you!

nn-classification module uses all available CPUs instead of the limit set by --threads

Hello, thank you for maintaining this useful tool. When running geNomad end-to-end, every module up to the nn-classification step uses up to 30 CPUs, which is the limit I set with --threads 30. However, when the nn-classification step runs, all available CPUs on the machine are used. Could you please help me limit the CPUs used by this step? Skipping this module avoids the problem, but I'd really like to use the neural network classification. Thank you!

Finding chimeric sequences

Thank you very much for making this amazing tool. This is becoming very useful in my research.

I performed a metaSPAdes assembly on paired-end 150 bp Illumina reads and subsequently used VirSorter2 to isolate viral sequences.
Currently, I am utilizing geNomad to further refine my dataset by excluding non-viral sequences, determining taxonomy, and identifying potential chimeric viral sequences.

I ran geNomad using the following command. My fasta file contains 143117 sequences.

genomad end-to-end --enable-score-calibration --cleanup --threads 10 --composition metagenome VirSorter_combined.fasta VirSorter genomad_db

Regarding chimeric sequences, I am treating those that contain genes from two distinct groups, such as "S03-NODE_15397_length_3234_cov_0.690698||full", as chimeric.

File: VirSorter_combined_virus_summary.tsv

seq_name	length	topology	coordinates	n_genes	genetic_code	virus_score	fdr	n_hallmarks	marker_enrichment	taxonomy
S03-NODE_15397_length_3234_cov_0.690698||full	3234	No terminal repeats	NA	4	11	0.9507	0.005	1	3.4366	Viruses
S05-NODE_6353_length_8129_cov_0.841540||full	8129	No terminal repeats	NA	12	11	0.8916	0.0092	0	2.9931	Viruses;Bicaudaviridae

File: VirSorter_combined_genes.tsv

gene	start	end	length	strand	gc_content	genetic_code	rbs_motif	marker	evalue	bitscore	uscg	plasmid_hallmark	virus_hallmark	taxid	taxname	annotation_conjscan	annotation_amr	annotation_accessions	annotation_description
S03-NODE_15397_length_3234_cov_0.690698||full_1	3	89	87	-1	0.379	11	AGxAGG/AGGxGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S03-NODE_15397_length_3234_cov_0.690698||full_2	304	1230	927	-1	0.372	11	AGGAG	GENOMAD.044416.VV	3.03E-05	48	0	0	1	2561	Caudoviricetes	NA	NA	TIGR01537	NA
S03-NODE_15397_length_3234_cov_0.690698||full_3	1507	2265	759	1	0.461	11	AGGAG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S03-NODE_15397_length_3234_cov_0.690698||full_4	2622	3233	612	1	0.451	11	AGGAG	GENOMAD.096491.VV	5.21E-13	70	0	0	0	352	Marseilleviridae	NA	NA	NA	NA

S05-NODE_6353_length_8129_cov_0.841540||full_1	96	419	324	-1	0.358	11	GGA/GAG/AGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_2	601	1605	1005	1	0.506	11	AGxAGG/AGGxGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_3	1620	2165	546	1	0.579	11	AGGAG	GENOMAD.167727.VP	3.31E-05	46	0	0	0	2561	Caudoviricetes	NA	NA	PF18306;COG4474;TIGR00725	Uncharacterized SPBc2 prophage-derived protein YoqJ
S05-NODE_6353_length_8129_cov_0.841540||full_4	2237	2680	444	-1	0.563	11	AGGAG/GGAGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_5	2673	3134	462	-1	0.552	11	GGAGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_6	3197	4777	1581	-1	0.593	11	GGAGG	GENOMAD.184062.VV	1.80E-05	50	0	0	0	3654	Bicaudaviridae	NA	NA	COG4245;K16630	Uncharacterized conserved protein YegL, contains vWA domain of TerY type
S05-NODE_6353_length_8129_cov_0.841540||full_7	4901	5332	432	-1	0.528	11	AGGAG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_8	5450	5896	447	-1	0.597	11	GGAGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_9	5915	7267	1353	-1	0.582	11	AGxAGG/AGGxGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_10	7281	7631	351	-1	0.553	11	GGA/GAG/AGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_11	7631	7813	183	-1	0.563	11	GGAGG	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA
S05-NODE_6353_length_8129_cov_0.841540||full_12	7810	8127	318	-1	0.513	11	None	NA	NA	NA	0	0	0	1	NA	NA	NA	NA	NA

Now, I have a question about the "Viruses;Bicaudaviridae" taxonomy. Upon examining the "genes.tsv" file, I noticed the presence of two different genes from different families. Should I consider this sequence as chimeric as well?

Thanks and regards,

Bhim
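The manual check described above (looking for sequences whose genes carry different taxname values) can be sketched in Python. This is only an illustration that assumes the genes.tsv layout shown in this post, where each gene ID is the sequence name followed by "_" and the gene index. `taxa_per_sequence` is a hypothetical helper, and finding several taxa in one sequence does not by itself prove that the sequence is chimeric.

```python
import csv
import io
from collections import defaultdict


def taxa_per_sequence(handle):
    """Map each sequence to the set of taxname values assigned to its genes."""
    taxa = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        # Gene IDs are assumed to be "<sequence name>_<gene index>".
        seq_name = row["gene"].rsplit("_", 1)[0]
        if row["taxname"] != "NA":
            taxa[seq_name].add(row["taxname"])
    return dict(taxa)


# Reduced example with only the two columns the function needs.
example = (
    "gene\ttaxname\n"
    "S03-NODE_15397_length_3234_cov_0.690698||full_2\tCaudoviricetes\n"
    "S03-NODE_15397_length_3234_cov_0.690698||full_3\tNA\n"
    "S03-NODE_15397_length_3234_cov_0.690698||full_4\tMarseilleviridae\n"
)

for seq, names in taxa_per_sequence(io.StringIO(example)).items():
    if len(names) > 1:
        print(seq, sorted(names))
```

For a real run, the `io.StringIO` object would be replaced by the opened genes.tsv file.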

provirus info appears in find_proviruses/provirus.tsv but lost in summary/summary.tsv

Hi, geNomad works well with most of my data. However, when I checked in detail whether the proviruses in the find_proviruses directory exactly match the proviruses in the summary directory, I found some mismatches.

I guess some of those mismatches might come from post-classification filtering. Nevertheless, I also found some sequences that appear in the summary without the bacterial sequence being cut off, unlike in the find_proviruses output.

For example, in contig_find_proviruses/contig_provirus.tsv contig_1 was cut
seq_name source_seq start end length n_genes v_vs_c_score in_seq_edge integrases
contig_1|provirus_5838_32914 contig_1|full 5838 32914 27077 41 24.1463 True NA

On the contrary, in contig_summary/contig_summary.tsv, contig_1 was not cut
seq_name length topology coordinates n_genes genetic_code virus_score fdr n_hallmarks marker_enrichment taxonomy
contig_1 32914 Linear NA 48 11 0.9121 NA 1 14.5901 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes

This mismatch confused me a lot. Looking forward to the clarification. Thank you very much!

Best,
Menghao
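The comparison described above can be sketched as follows. The code assumes that provirus names have the form "<contig>|provirus_<start>_<end>", as in the example row, and that both tables have a seq_name column. `uncut_sources` is a hypothetical helper and does not reproduce any geNomad logic.

```python
def uncut_sources(provirus_names, summary_names):
    """Find proviruses absent from the summary whose contig is listed uncut."""
    summary = set(summary_names)
    report = {}
    for name in provirus_names:
        # The contig name is assumed to be everything before the first "|".
        contig = name.split("|", 1)[0]
        if name not in summary and contig in summary:
            report[name] = contig
    return report


# Names taken from the two tables quoted in this post.
proviruses = ["contig_1|provirus_5838_32914"]
summary = ["contig_1"]

for provirus, contig in uncut_sources(proviruses, summary).items():
    print(f"{provirus} is missing from the summary, but {contig} is listed whole")
```

The two name lists could be filled by reading the seq_name column of each TSV file with `csv.DictReader(handle, delimiter="\t")`.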

Is it better to run geNomad on single genome assemblies?

I was wondering if there is a difference between

(1) Running geNomad 1000 times, once per WGS assembly, for 1000 genomes
or
(2) Running geNomad once, after combining the 1000 WGS assemblies into a single FASTA file

Is there a difference in how it works? Will option (2) be treated like a 'metagenome' and hence run with different parameters? I have personally run (2) but am concerned that the accuracy may be affected.

Also, if a contig is identified as a virus (either prophage or non-integrated), will geNomad return only the part it thinks is viral, or the whole contig? I have noticed that sometimes 'coordinates' is NA and it simply gave me the entire bacterial contig unchanged.

mmseqs2: returned non-zero exit status 1.

Dear Developer,
I am trying to use geNomad to assign taxonomy to my viral contigs. Before running geNomad, I assembled contigs with SPAdes and used them as input. However, I got the following error, which might be associated with mmseqs2.py.

genomad end-to-end --min-score 0.7 --cleanup --splits 8 spade_lim1_1_old/contigs.fasta lim1_1_genomad_output genomad_db

Traceback (most recent call last):
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 131, in run_mmseqs2
    subprocess.run(command, stdout=fout, stderr=fout, check=True)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mmseqs', 'search', PosixPath('lim1_1_genomad_output/contigs_annotate/contigs_mmseqs2/query_db/query_db'), PosixPath('/dss/dssfs02/lwp-dss-0001/u7b03/u7b03-dss-0000/ra78zut/genomad_db/genomad_db'), PosixPath('lim1_1_genomad_output/contigs_annotate/contigs_mmseqs2/search_db/search_db'), PosixPath('lim1_1_genomad_output/contigs_annotate/contigs_mmseqs2/tmp'), '--threads', '56', '-s', '4.2', '--cov-mode', '1', '-c', '0.2', '-e', '0.001', '--split', '8', '--split-mode', '0']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1208, in end_to_end
    ctx.invoke(
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 425, in annotate
    genomad.annotate.main(
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/annotate.py", line 202, in main
    mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, splits)
  File "/dss/dsshome1/0F/ra78zut/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 134, in run_mmseqs2
    raise Exception(f"'{command_str}' failed.") from e
Exception: 'mmseqs search lim1_1_genomad_output/contigs_annotate/contigs_mmseqs2/query_db/query_db /dss/dssfs02/lwp-dss-0001/u7b03/u7b03-dss-0000/ra78zut/genomad_db/genomad_db lim1_1_genomad_output/contigs_annotate/contigs_mmseqs2/search_db/search_db lim1_1_genomad_output/contigs_annotate/contigs_mmseqs2/tmp --threads 56 -s 4.2 --cov-mode 1 -c 0.2 -e 0.001 --split 8 --split-mode 0' failed.

Any help would be appreciated.

Take circularity into account when applying post-classification filters

In response to Issue #23, geNomad's post-classification filters should be refined to take into account the circularity/completeness of input sequences. This could prevent the erroneous exclusion of legitimate plasmids due to the current stringent filtering criteria aimed at genomic island fragments.

Key Points:

Use circularity data (DTRs or user input) to adjust post-classification filters, making them more lenient for complete, circular sequences.

aragorn package does not exist

Hello! I am trying to install geNomad using mamba. However, it will not install because it says that aragorn does not exist:
error: the following package could not be installed.
aragorn does not exist (perhaps a typo or a missing channel)

It is not an issue with my typing, and from what I have seen my channels appear to be fine. That being said, I am new to bioinformatics and would appreciate any help. I would love to be able to start using geNomad :)

Understanding Inputs and Outputs

Hello, I am doing some analysis on my dataset and have been reading through the documentation but I would like more clarification to make the best judgement.

For the command-line input, I understand that the --splits parameter splits the search into chunks, but I am unsure exactly what "--splits 8" means in the example. I have tried different numbers and do not notice a difference in my results. Does the number 8 in this case refer to the threads being used by the computer?
When looking at my summary output file (the virus summary file in my case), I notice that it reports 7 predicted viral genes and a virus score of 0.9428, but the number of hallmarks is 0. Judging from the virus score, which is close to 1, it looks promising that there could be viral genes, but I am confused as to why the number of hallmarks would be zero if there are possibly 7 genes. I am using the default settings, and my next step would be to add the --conservative parameter, but I would like to hear your input first. Thanks!

KeyError: 'provirus_names' when running genomad score_calibration

I encountered the following error when running genomad score_calibration:

Traceback (most recent call last):
  File "/home/remi/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1074, in end_to_end
    ctx.invoke(
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 675, in score_calibration
    genomad.score_calibration.main(input, output, composition, force_auto, verbose)
  File "/home/remi/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/score_calibration.py", line 316, in main
    len(score_dict["contig_names"]) + len(score_dict["provirus_names"])
KeyError: 'provirus_names'

I was running the following command:

genomad end-to-end -t 30 --composition metagenome --enable-score-calibration ZSM005_contigs.filtered.sorted.fasta ZSM005_contigs_score_calibration genomad_db

I am using genomad version 1.3.2 installed using conda.

I can provide the fasta file ZSM005_contigs.filtered.sorted.fasta if it would be helpful in troubleshooting this error.
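The failing line in the traceback adds the lengths of two entries of score_dict and raises when the "provirus_names" key is absent. A generic way to guard such a lookup is sketched below. It assumes score_dict behaves like a plain Python dict, `total_sequences` is a hypothetical helper, and this is not presented as geNomad's actual fix.

```python
def total_sequences(score_dict):
    """Count contigs plus proviruses, tolerating a missing provirus entry."""
    contigs = score_dict["contig_names"]
    # Fall back to an empty list when no provirus was recorded.
    proviruses = score_dict.get("provirus_names", [])
    return len(contigs) + len(proviruses)


print(total_sequences({"contig_names": ["c1", "c2"]}))
print(total_sequences({"contig_names": ["c1"], "provirus_names": ["c1|provirus_1_10"]}))
```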

Too many threads used in numpy and tensorflow

Possibility to include NAs in taxonomy csv format for easier parsing?

Hi,

I'm trying to parse the taxonomy column of the _virus_summary.tsv file, but unclassified taxonomic levels are omitted which causes inconsistencies. In the table below for example, for the first entry, Straboviridae would end up in the Order column while it is a viral family.

seq_name	length	topology	coordinates	n_genes	genetic_code	virus_score	fdr	n_hallmarks	marker_enrichment	taxonomy
NODE_A75_length_5000_cov_17.505180_BlackFly28	5000	No_terminal_repeats	NA	4	11	0.9997	0.0003	1	6.8731	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Straboviridae
NODE_A20_length_7615_cov_6.791589_BlackFly6	7615	No_terminal_repeats	NA	11	11	0.9997	0.0003	6	15.3910	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NODE_A8_length_6639_cov_9.592350_BlackFly34	6639	No_terminal_repeats	NA	8	11	0.9997	0.0003	1	13.4767	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NODE_A4_length_12404_cov_730.807901_BlackFly44	12404	No_terminal_repeats	NA	6	11	0.9996	0.0004	2	3.4366	Viruses;Riboviria;Orthornavirae;Negarnaviricota;Monjiviricetes;Mononegavirales;Rhabdoviridae

Would it be possible to output the taxonomy column as:
Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;Straboviridae;; or
Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;unclassified;Straboviridae?

Thanks in advance!
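Until such an option exists, the padding can be done after the fact. The sketch below assigns each name to a rank using the standard ICTV suffixes (-viria, -virae, -viricota, -viricetes, -virales, -viridae) and fills missing ranks with a placeholder. `pad_lineage` is a hypothetical helper. It covers only these six main ranks, so names at other ranks (subranks, genus, species) are dropped.

```python
# ICTV suffixes for the six main ranks above genus, from realm to family.
RANK_SUFFIXES = [
    ("realm", "viria"),
    ("kingdom", "virae"),
    ("phylum", "viricota"),
    ("class", "viricetes"),
    ("order", "virales"),
    ("family", "viridae"),
]


def pad_lineage(lineage, placeholder="NA"):
    """Return the lineage with one field per rank, padding missing ranks."""
    names = lineage.split(";")
    root, rest = names[0], names[1:]
    padded = [root]
    for _rank, suffix in RANK_SUFFIXES:
        # Take the first name carrying this rank's suffix, if any.
        match = next((name for name in rest if name.endswith(suffix)), placeholder)
        padded.append(match)
    return ";".join(padded)


print(pad_lineage("Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Straboviridae"))
# Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;NA;Straboviridae
```

Passing `placeholder="unclassified"` gives the second format suggested above.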

nn classification stalls for large datasets

Hi again @apcamargo!

I've been able to successfully run genomad on several datasets (metagenomes). During the nn-classification I always receive an error in the log that says:

2022-10-16 22:38:39.339717: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

But the software keeps running and was able to finish successfully in some cases. However, for some very large datasets, the log shows this error and the software keeps running for >20h without any additional info on whether or not something is happening in the background. I cannot find a way to see what process is being executed. I left them running for now, still hoping that they will finish like the others, but it would be great if there was more info in the log on what is going on.

Thanks!

mmseqs2 prefilter failed: No k-mer could be extracted for the database genomad_db/genomad_db

Hello! I am currently trying to use the genomad annotate module to annotate a .fna file of MEGAHIT-assembled contigs. After downloading the database and attaching a unique identifier to each .fna header line (because they all started with k127), I ran the following command:

genomad annotate final_vOTUs_numbered.fna ./genomad_output ./genomad_db

I get this error directly from genomad:

Traceback (most recent call last):
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 137, in run_mmseqs2
subprocess.run(command, stdout=fout, stderr=fout, check=True)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mmseqs', 'search', PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db'), PosixPath('genomad_db/genomad_db'), PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db'), PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp'), '--threads', '48', '-s', '4.2', '--cov-mode', '1', '-c', '0.2', '-e', '0.001', '--split', '0', '--split-mode', '0', '--max-seqs', '1000000', '--min-ungapped-score', '20', '--max-rejected', '225']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/hhallow1/.conda/envs/genomad/bin/genomad", line 8, in <module>
sys.exit(cli())
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
rv = super().main(*args, standalone_mode=False, **kwargs)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 425, in annotate
genomad.annotate.main(
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/modules/annotate.py", line 202, in main
mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, splits)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 140, in run_mmseqs2
raise Exception(f"'{command_str}' failed.") from e
Exception: 'mmseqs search genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp --threads 48 -s 4.2 --cov-mode 1 -c 0.2 -e 0.001 --split 0 --split-mode 0 --max-seqs 1000000 --min-ungapped-score 20 --max-rejected 225' failed.

Here is the output from mmseqs2:

Converting sequences
[=====
Time for merging to query_db_h: 0h 0m 0s 79ms
Time for merging to query_db: 0h 0m 0s 31ms
Database type: Aminoacid
Time for processing: 0h 0m 1s 380ms
search genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp --threads 48 -s 4.2 --cov-mode 1 -c 0.2 -e 0.001 --split 0 --split-mode 0 --max-seqs 1000000 --min-ungapped-score 20 --max-rejected 225

MMseqs Version: 13.45111
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.2
Coverage mode 1
Max sequence length 65535
Compositional bias 1
Max reject 225
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Threads 48
Compressed 0
Verbosity 3
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 4.2
k-mer length 5
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 1000000
Split database 0
Split mode 0
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 20
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.1
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false

prefilter genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp/4444936417411739143/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 4.2 -k 5 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1000000 --split 0 --split-mode 0 --split-memory-limit 0 -c 0.2 --cov-mode 1 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 20 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 48 --compressed 0 -v 3

Query database size: 56046 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 227897 type: Profile
Process prefiltering step 1 of 1

Index table k-mer threshold: 104 at k-mer size 5
Index table: counting k-mers
[=================================================================] 227.90K 10s 479ms
Index table: Masked residues: 0
No k-mer could be extracted for the database genomad_db/genomad_db.
Maybe the sequences length is less than 14 residues.
Error: Prefilter died

The .fna file, the genomad_output directory, and the genomad_db directory are all in the same directory, and I am running the command from that directory as well. Any ideas how to fix this? Thanks!!

UnicodeEncodeError

Installation: mamba create -n genomad -c conda-forge -c bioconda genomad
Relevant versions:
- genomad: 1.5.0
- click: 8.1.3
- rich: 13.3.2
- rich-click: 1.6.1
Sys info: Linux ubuntu 18 x86_64

There seems to be an encoding issue with the rich console output. I encountered the error below when running the following commands:
- genomad end-to-end -h
- genomad download-database . (after the tarball is downloaded)
- PYTHONIOENCODING="utf-8" genomad download-database . (the error suggests adding an environment variable to specify the encoding, but that did not work)

Error stack:

Traceback (most recent call last):
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/download.py", line 93, in main
    console.log(f"Database extracted to [green]{database_path}[/green].")
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/utils.py", line 99, in log
    self.regular_console.log(*args, **kwargs)
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/console.py", line 1940, in log
    with self:
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/console.py", line 864, in __exit__
    self._exit_buffer()
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/console.py", line 822, in _exit_buffer
    self._check_buffer()
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/console.py", line 2060, in _check_buffer
    self.file.write(text)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u280f' in position 280: ordinal not in range(256)
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage1/data14/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 335, in download_database
    genomad.download.main(destination, keep, verbose)
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/download.py", line 91, in main
    with console.status(f"Extracting the database to [green]{database_path}[/green]."):
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/status.py", line 106, in __exit__
    self.stop()
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/status.py", line 91, in stop
    self._live.stop()
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/live.py", line 147, in stop
    with self.console:
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/console.py", line 864, in __exit__
    self._exit_buffer()
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/console.py", line 822, in _exit_buffer
    self._check_buffer()
  File "/storage1/data14/miniconda3/envs/genomad/lib/python3.10/site-packages/rich/console.py", line 2060, in _check_buffer
    self.file.write(text)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u280f' in position 280: ordinal not in range(256)
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***
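The error itself can be reproduced without geNomad. The character in the message, '\u280f', is a Braille pattern (the kind rich uses for spinner frames) and has no representation in latin-1. The sketch below only illustrates the encoding behavior. It makes no claim about how geNomad or rich configure their output stream, and `can_encode` is a hypothetical helper.

```python
import io

SPINNER = "\u280f"  # the character named in the error message


def can_encode(text, encoding):
    """Report whether text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(SPINNER, "latin-1"))  # False
print(can_encode(SPINNER, "utf-8"))    # True

# A text stream opened with errors="replace" degrades instead of raising.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="latin-1", errors="replace")
stream.write(SPINNER)
stream.flush()
print(buffer.getvalue())  # b'?'
```

Running `python -c "import sys; print(sys.stdout.encoding)"` in the same shell shows which encoding Python picked for standard output. That encoding usually follows the locale settings (LANG, LC_ALL).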

Long fasta headers with white spaces may interfere with downstream processing

I downloaded a multi-fasta file from Genbank and passed it as input to genomad. I only get expected results with genomad when I rename the fasta headers in the file.

Rename command: awk '/^>/{print ">Seq"++i; next}{print}' input.fasta > output.fasta

Original header:

>gi|29366675|ref|NC_000866.4| Enterobacteria phage T4, complete genome

New header:

>Seq1

If I don't rename the FASTA headers before running geNomad, the *taxonomy.tsv file created by 'genomad end-to-end' is empty. FASTA headers longer than 30 characters or containing white space can cause bugs in downstream processing, because some tools limit the length of the header lines they can handle or use white space as a delimiter when parsing the header. Headers that exceed these limits may therefore cause errors or unexpected behavior.
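A Python equivalent of the awk rename command is sketched below. Unlike the one-liner, it also returns a mapping from the new names back to the original headers, so the results can be traced afterwards. `rename_headers` is a hypothetical helper, and the "Seq" prefix simply mirrors the awk command.

```python
def rename_headers(lines, prefix="Seq"):
    """Replace FASTA headers with short names and remember the originals."""
    mapping = {}
    renamed = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            new_name = f"{prefix}{count}"
            # Keep the full original header, without ">" and the newline.
            mapping[new_name] = line[1:].rstrip("\n")
            renamed.append(f">{new_name}\n")
        else:
            renamed.append(line)
    return renamed, mapping


fasta = [
    ">gi|29366675|ref|NC_000866.4| Enterobacteria phage T4, complete genome\n",
    "ACGTACGT\n",
]
renamed, mapping = rename_headers(fasta)
print("".join(renamed), end="")
print(mapping)
```

With real files, `lines` can be an open file handle and the renamed lines can be written out with `writelines`.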

Support for GFF and GBK outputs

Include functionality to export GFF and GBK files, as per Issue #28. This can be achieved via parsing the genes tabular output or by leveraging Pyrodigal's write_gff and write_genbank methods. A script to convert the tabular output into a GFF can be found here.

Key Points:

Determine inclusion of sequences in GBK files for consistency with tools like Prokka.
Consider performance; enable feature via --write-gff and --write-gbk flags if there's a significant processing delay.
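As a rough illustration of the first approach (parsing the genes tabular output), the sketch below turns rows of a genes.tsv file into GFF3 feature lines. It is not the linked script. It assumes the gene, start, end and strand columns of the genes table, that the sequence name is the gene ID without its trailing "_<index>", and that names contain no characters that need escaping in GFF3. `genes_to_gff` and the "geNomad" source label are hypothetical.

```python
import csv
import io


def genes_to_gff(handle):
    """Convert rows of a genes table into GFF3 lines (CDS features only)."""
    lines = ["##gff-version 3"]
    for row in csv.DictReader(handle, delimiter="\t"):
        # The sequence name is assumed to be the gene ID minus "_<index>".
        seqid = row["gene"].rsplit("_", 1)[0]
        strand = "+" if row["strand"] == "1" else "-"
        fields = [
            seqid, "geNomad", "CDS", row["start"], row["end"],
            ".", strand, "0", f"ID={row['gene']}",
        ]
        lines.append("\t".join(fields))
    return lines


table = (
    "gene\tstart\tend\tstrand\n"
    "S03-NODE_15397_length_3234_cov_0.690698||full_2\t304\t1230\t-1\n"
)
print("\n".join(genes_to_gff(io.StringIO(table))))
```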

SystemError: initialization of _internal failed without raising an exception

I installed on a cori login node using conda which ran without errors:
conda create -n genomad-conda -c conda-forge -c bioconda genomad

After activating and testing it produces errors:

source activate genomad-conda
genomad

Traceback (most recent call last):
File "/global/homes/s/snayfach/.conda/envs/genomad-conda/bin/genomad", line 6, in <module>
from genomad.cli import cli
File "/global/homes/s/snayfach/.conda/envs/genomad-conda/lib/python3.10/site-packages/genomad/__init__.py", line 5, in <module>
from genomad.modules import (
File "/global/homes/s/snayfach/.conda/envs/genomad-conda/lib/python3.10/site-packages/genomad/modules/aggregated_classification.py", line 4, in <module>
from genomad import sequence, utils
File "/global/homes/s/snayfach/.conda/envs/genomad-conda/lib/python3.10/site-packages/genomad/sequence.py", line 9, in <module>
from numba import njit
File "/global/homes/s/snayfach/.conda/envs/genomad-conda/lib/python3.10/site-packages/numba/__init__.py", line 42, in <module>
from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
File "/global/homes/s/snayfach/.conda/envs/genomad-conda/lib/python3.10/site-packages/numba/np/ufunc/__init__.py", line 3, in <module>
from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
File "/global/homes/s/snayfach/.conda/envs/genomad-conda/lib/python3.10/site-packages/numba/np/ufunc/decorators.py", line 3, in <module>
from numba.np.ufunc import _internal
SystemError: initialization of _internal failed without raising an exception

Here's a GitHub thread on the issue: numba/numba#8615

From skimming that, the issue might be the latest version of numpy and installing a lower version of numpy (<1.24) may fix the problem.

Fasta contains multiple entries with the same identifier

Hello, I got the error "con1.fa is either empty or contains multiple entries with the same identifier. Please check your input FASTA file and execute genomad annotate again."
I saw on the website that genomad can be used for metagenomic data, but when I use this file (con1.fa) as the input, it doesn't work. As I understand it, the identifiers of reads can be repeated. Should I change all the identifiers?
Could you advise me on how to solve this problem? I would really appreciate it. Best wishes!
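To see which identifiers are repeated before renaming anything, the headers can be counted directly. The sketch below assumes the identifier is the header text up to the first whitespace, which is a common FASTA convention; whether geNomad compares identifiers in exactly this way is an assumption, and `duplicated_fasta_ids` is a hypothetical helper, not part of geNomad.

```python
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return the identifiers that occur in more than one FASTA header line."""
    counts = Counter(
        line[1:].split()[0]                      # text after '>' up to first whitespace
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Small in-memory example; the records are invented for illustration.
example = [
    ">contig_1 sample A", "ACGT",
    ">contig_2", "GGCC",
    ">contig_1 sample B", "TTAA",
]
print(duplicated_fasta_ids(example))
```

For a real file, passing an open handle works the same way, for instance `duplicated_fasta_ids(open("con1.fa"))`. An empty result would suggest the file itself is empty or that the problem lies elsewhere.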

Difference between "Unclassified" and "Viruses" taxonomy?

Thanks for making such a great tool! It's super easy to run and exactly what I need for my project.

Can you explain what the difference is between "Unclassified" and just "Viruses" in the taxonomy output? Does it mean that Unclassified hits are unknown if they're viruses at all?

For example, I have 292 'Unclassified' hits and 7668 'Viruses' hits across 2000 genomes, does this mean the unclassified could have possibly been plasmid/chromosome?

Step skipping doesn't work properly in the classification modules when the output of `annotate` changes

When the output of annotate changes (due to a change in the sensitivity of the search, for instance), marker-classification and nn-classification will still skip some steps.

In marker-classification this will cause some features to be incompatible with the actual gene annotations (e.g., marker frequency remains the same, when it should have changed). In both marker-classification and nn-classification, the provirus outputs will remain intact, even if no provirus was detected in the second execution (leading to an error in summary, as exemplified below).

Reproducing the bug

Run the end-to-end module twice to classify LC735414.1, first with -s 4.2 and then with -s 1. A provirus will be detected when running with -s 4.2 but not with -s 1, causing a bug in the summary module.

Is it safe to run genomad directly on nanopore R10.4.1 long reads?

Hi,

Thanks for providing genomad, it is very useful.

I was wondering whether it would be OK to run genomad directly on raw nanopore reads (>1000 bp, median q19) instead of assembled contigs. Should the parameters be tuned in this case?

Interpretation of Results

I have enjoyed using geNomad and find it to be a very useful tool. When geNomad identifies a plasmid or virus on a particular contig, is it saying that entire contig likely makes up the plasmid? Because the annotated genes cover the length of the contig, so I wanted to make sure I am interpreting this correctly.

Is there additional documentation on the significance of assigning the three types of topology to plasmids in particular? I was told DTR plasmids are perhaps more likely to be closed than ITR, but it would be helpful to have some documentation or links to information about interpreting the topology.

Finally, is there a way to identify where geNomad found the direct or inverted terminal repeats in a contig?

Thank you!
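For locating repeats by hand, one option is to compare the two ends of the contig directly. The sketch below is not geNomad's own detection routine: it only finds exact matches, and the `min_len` and `max_len` bounds and the function names are assumptions chosen for illustration. A direct terminal repeat (DTR) is reported when the first k bases equal the last k bases, and an inverted terminal repeat (ITR) when the first k bases equal the reverse complement of the last k bases.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_repeat(seq, min_len=4, max_len=1000):
    """Return (kind, length) of the longest exact terminal repeat, or (None, 0)."""
    seq = seq.upper()
    limit = min(max_len, len(seq) // 2)          # the two ends must not overlap
    for k in range(limit, min_len - 1, -1):      # try the longest candidate first
        head, tail = seq[:k], seq[-k:]
        if head == tail:
            return ("DTR", k)
        if head == reverse_complement(tail):
            return ("ITR", k)
    return (None, 0)


# Invented toy sequences: the same 7-base end, once direct and once inverted.
dtr_contig = "AGGCTAC" + "T" * 10 + "AGGCTAC"
itr_contig = "AGGCTAC" + "T" * 10 + reverse_complement("AGGCTAC")
print(terminal_repeat(dtr_contig))
print(terminal_repeat(itr_contig))
```

With a reported length k, the repeat occupies the first k and the last k bases of the contig, which gives the coordinates to inspect.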

Can genomad be applied to the metagenome bin?

Hello, I have some metagenomic data and obtained some bins after processing with megahit and metabat2. Can I use genomad on the bins to find viruses?
Which is the better input for genomad: the contigs assembled by megahit or the bins produced by metabat2?
I have a lot of memory and CPUs; can I speed things up?
Do you have any other suggestions? Thanks!

Issues running genomad

genomad end-to-end --cleanup --splits 8 "/mnt/c/Users/DavidIbarra/OneDrive - Cemvita Factory Inc/Desktop/GCF001999325.1.fasta" genomad_output "/mnt/c/Users/DavidIbarra/genomad_db"
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Executing geNomad annotate (v1.7.0). This will perform gene calling in the input sequences and annotate the predicted │
│ proteins with geNomad's markers. │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ Outputs: │
│ genomad_output/GCF001999325.1_annotate │
│ ├── GCF001999325.1_annotate.json (execution parameters) │
│ ├── GCF001999325.1_genes.tsv (gene annotation data) │
│ ├── GCF001999325.1_taxonomy.tsv (taxonomic assignment) │
│ ├── GCF001999325.1_mmseqs2.tsv (MMseqs2 output file) │
│ └── GCF001999325.1_proteins.faa (protein FASTA file) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[13:50:35] Executing genomad annotate.
[13:50:35] Previous execution detected. Steps will be skipped unless their outputs are not found. Use the --restart option
to force the execution of all the steps again.
[13:50:35] GCF001999325.1_proteins.faa was found. Skipping gene prediction with prodigal-gv.
Traceback (most recent call last):
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 190, in run_mmseqs2
    subprocess.run(command, stdout=fout, stderr=fout, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mmseqs', 'prefilter', PosixPath('genomad_output/GCF001999325.1_annotate/GCF001999325.1_mmseqs2/query_db/query_db'), PosixPath('/mnt/c/Users/DavidIbarra/genomad_db/genomad_db'), PosixPath('genomad_output/GCF001999325.1_annotate/GCF001999325.1_mmseqs2/search_db/prefilter_db'), '--threads', '12', '-s', '4.2', '--split', '8', '--split-mode', '0', '--max-seqs', '10000000', '--min-ungapped-score', '25', '-k', '5']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/david1ibarra/.local/bin/genomad", line 8, in <module>
    sys.exit(cli())
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1240, in end_to_end
    ctx.invoke(
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 441, in annotate
    genomad.annotate.main(
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/genomad/modules/annotate.py", line 203, in main
    mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, splits)
  File "/home/david1ibarra/.local/pipx/venvs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 193, in run_mmseqs2
    raise Exception(f"'{command_str}' failed.") from e
Exception: 'mmseqs prefilter genomad_output/GCF001999325.1_annotate/GCF001999325.1_mmseqs2/query_db/query_db /mnt/c/Users/DavidIbarra/genomad_db/genomad_db genomad_output/GCF001999325.1_annotate/GCF001999325.1_mmseqs2/search_db/prefilter_db --threads 12 -s 4.2 --split 8 --split-mode 0 --max-seqs 10000000 --min-ungapped-score 25 -k 5' failed.

I'm not sure what went wrong or how to fix it.

enable-score-calibration option

Hello. I've been reading the documentation and the use of --enable-score-calibration is not clear to me. When I apply it in the end-to-end pipeline, it tells me that I have less than 1000 sequences and that another option will be used by default if I do not apply an automatic option.

I am working with prokaryotic genome assemblies (less than 1000 contigs) and I would like to know the estimated probabilities. What is the most recommended option? Should I do genomad end-to-end without any tags?

Thanks in advance!

TensorFlow error for nn-classification

Hi
When running the end-to-end module, I got this error for the genomad (version 1.5.0) nn-classification.

genomad end-to-end --cleanup --threads 25 GFS_2469.fa GFS_2469_genomad ~/Desktop/Databases/Genomad/genomad_db/

I'm running this on an Ubuntu machine with 250 GB of RAM, and it stops without really using any of the memory.

[10:22:24] Executing genomad nn-classification.
[10:22:24] Creating the GFS_2469_genomad/GFS_2469_nn_classification directory.
[10:22:24] Creating the GFS_2469_genomad/GFS_2469_nn_classification/GFS_2469_encoded_sequences directory.
[10:22:26] Encoded sequence data written to GFS_2469_encoded_sequences.
Traceback (most recent call last):
  File "/home/river/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1239, in end_to_end
    ctx.invoke(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 694, in nn_classification
    genomad.nn_classification.main(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/nn_classification.py", line 285, in main
    contig_predictions = nn_model.predict(
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:

Detected at node 'model_1/model/conv1d/Pad' defined at (most recent call last):
    File "/home/river/miniconda3/envs/genomad/bin/genomad", line 10, in <module>
      sys.exit(cli())
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
      return self.main(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
      rv = super().main(*args, standalone_mode=False, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
      rv = self.invoke(ctx)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
      return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
      return f(get_current_context(), *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1239, in end_to_end
      ctx.invoke(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 694, in nn_classification
      genomad.nn_classification.main(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/nn_classification.py", line 285, in main
      contig_predictions = nn_model.predict(
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2350, in predict
      tmp_batch_outputs = self.predict_function(iterator)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2137, in predict_function
      return step_function(self, iterator)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2123, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2111, in run_step
      outputs = model.predict_step(data)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 2079, in predict_step
      return self(x, training=False)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/training.py", line 561, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 511, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/functional.py", line 668, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1132, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/home/river/miniconda3/envs/genomad/lib/python3.10/site-packages/keras/layers/convolutional/base_conv.py", line 276, in call
      inputs = tf.pad(inputs, self._compute_causal_padding(inputs))
Node: 'model_1/model/conv1d/Pad'
OOM when allocating tensor with shape[128,6002,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node model_1/model/conv1d/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_predict_function_942]

Error running Genomad 1.7.1

Version 1.6.1 worked fine for me.

After installing Genomad 1.7.1 (fresh install), I get the following error:

Traceback (most recent call last):
  File "/gpfs/work5/0/gusr0570/conda/genomad_1.7.1/bin/genomad", line 6, in <module>
    from genomad.cli import cli
  File "/gpfs/work5/0/gusr0570/conda/genomad_1.7.1/lib/python3.10/site-packages/genomad/__init__.py", line 5, in <module>
    from genomad.modules import (
  File "/gpfs/work5/0/gusr0570/conda/genomad_1.7.1/lib/python3.10/site-packages/genomad/modules/annotate.py", line 4, in <module>
    from genomad import database, mmseqs2, prodigal, sequence, taxonomy, utils
  File "/gpfs/work5/0/gusr0570/conda/genomad_1.7.1/lib/python3.10/site-packages/genomad/prodigal.py", line 5, in <module>
    import pyrodigal_gv
  File "/gpfs/work5/0/gusr0570/conda/genomad_1.7.1/lib/python3.10/site-packages/pyrodigal_gv/__init__.py", line 11, in <module>
    from .meta import METAGENOMIC_BINS, METAGENOMIC_BINS_VIRAL
  File "/gpfs/work5/0/gusr0570/conda/genomad_1.7.1/lib/python3.10/site-packages/pyrodigal_gv/meta.py", line 12, in <module>
    METAGENOMIC_BINS = pyrodigal.MetagenomicBins([
AttributeError: module 'pyrodigal' has no attribute 'MetagenomicBins'. Did you mean: 'MetagenomicBin'?

Is there a way to get the annotate outputs in a genbank format file?

Hello,

Some tools, such as clinker, require a GFF or GenBank file as input for synteny analysis. Is there a way to obtain these file formats from the outputs of $ genomad annotate? If not, do you recommend any script or program to convert the outputs to these formats?

Thank you very much!

download database manually

Is it possible to download the database manually instead of using genomad download-database? Thanks!

Could you update to ICTV's VMR 21 (MSL #37)?

Dear @apcamargo!

Thanks a lot for your excellent work.
I noticed the software uses ICTV's VMR number 19; however, the ICTV has since released an updated taxonomy. Could you update the database to the new taxonomy, or share the procedure for building the database?

Thanks a lot.
 Warm Regards
 Jiandui Mi

Siphoviridae was not detected in metagenomic analysis

Dear Jamie Morton,

Thanks a lot for your excellent software. Previous studies found Siphoviridae to be the dominant family in the gut, but we did not find it in our results from the software. We also checked the ICTV VMR_19 list and did not find it there either. Was Siphoviridae removed? Could you please tell me which order or family Siphoviridae was merged into?
   Thanks a lot.
   Warm Regards
   Jiandui Mi

question about nn-classification

Hello, thanks for the impressive tool for provirus identification. I have two questions about running it:

Can this tool be applied to eukaryotic organisms, such as axenic/authentic algae or fungi? If so, do any parameters need to be adjusted (like the genetic code or anything else)?
When I ran nn-classification separately using the test dataset (note: I ran it on a server without a GPU), I got the errors below. Is there any suggestion for dealing with them?

Command:
genomad nn-classification --cleanup --threads 2 GCF_009025895.1.fa output

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Executing geNomad nn-classification (v1.7.0). This will classify the input sequences into chromosome, plasmid, or virus based on the nucleotide sequence. │
│ ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ Outputs: │
│ out_1/GCF_009025895.1_nn_classification │
│ ├── GCF_009025895.1_nn_classification.json (execution parameters) │
│ ├── GCF_009025895.1_encoded_sequences (directory containing encoded sequence data) │
│ ├── GCF_009025895.1_nn_classification.tsv (contig classification: tabular format) │
│ ├── GCF_009025895.1_nn_classification.npz (contig classification: binary format) │
│ ├── GCF_009025895.1_encoded_proviruses (directory containing encoded sequence data) │
│ ├── GCF_009025895.1_provirus_nn_classification.tsv (provirus classification: tabular format) │
│ └── GCF_009025895.1_provirus_nn_classification.npz (provirus classification: binary format) │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[16:24:41] Executing genomad nn-classification.
[16:24:42] Creating the out_1/GCF_009025895.1_nn_classification/GCF_009025895.1_encoded_sequences directory.
[16:24:45] Encoded sequence data written to GCF_009025895.1_encoded_sequences.
[16:24:45] Creating the out_1/GCF_009025895.1_nn_classification/GCF_009025895.1_encoded_proviruses directory.
[16:24:46] Encoded provirus data written to GCF_009025895.1_encoded_proviruses.
Traceback (most recent call last):
  File "/path/to/python3.9.6/bin/genomad", line 8, in <module>
    sys.exit(cli())
  File "/path/to/python3.9.6/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/path/to/python3.9.6/lib/python3.9/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/path/to/python3.9.6/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/path/to/python3.9.6/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/path/to/python3.9.6/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/path/to/python3.9.6/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/path/to/python3.9.6/lib/python3.9/site-packages/genomad/cli.py", line 719, in nn_classification
    genomad.nn_classification.main(
  File "/path/to/python3.9.6/lib/python3.9/site-packages/genomad/modules/nn_classification.py", line 304, in main
    TimeRemainingColumn(elapsed_when_finished=True),
TypeError: __init__() got an unexpected keyword argument 'elapsed_when_finished'

Software version I used:

The python version: 3.9.6
TensorFlow version: I tried 2.8.0, 2.10.0 and 2.13.0
GeNomad database version: v1.5

Provirus detection changes when more input is given

I have a large number of metagenomic assemblies (virus-enriched in the wet lab). When I concatenate the assemblies into one file and run geNomad end-to-end (with score calibration), I see a few proviruses that are not detected when I run geNomad on the individual assemblies, with the same settings. The proviruses appear in the find-proviruses folder as well as in the summary.

This might be an expected behaviour but it's not really clear to me. An explanation would be highly appreciated. Also, if this is expected, would the prophage prediction of the concatenated input be more accurate than that of the individual assemblies?

In case it helps to clarify what I mean, this is the summary from the concatenated input:

$ grep "NODE_A77_length_14406_cov_85.369321_CH_17692_sum_ile_d" ALL.SAMPLES_1kb.contigs_virus_summary.tsv 
NODE_A77_length_14406_cov_85.369321_CH_17692_sum_ile_d|provirus_6327_14404	8078	Provirus	6327-14404	14	11	0.9952	0.0014	1	11.4948	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes

And this is from the individual assembly:

$ grep "NODE_A77_length_14406_cov_85.369321_CH_17692_sum_ile_d" CH_17692_sum_ile_d_1kb.contigs_virus_summary.tsv 
NODE_A77_length_14406_cov_85.369321_CH_17692_sum_ile_d	14406	No terminal repeats	NA	20	11	0.9429	0.0109	1	9.7680	Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes

Cheers!

decrease in the number of hits with the new version

Hello,

Thank you for this great tool. I ran both the old version (I don't remember the exact version number, but I used it in November) and the new version (1.5.0) on the same samples. The new version gives fewer hits (generally a 50% decrease) than the older version. I kept the virus_score at 0.7 for both versions. Why is there such a dramatic change in the number of hits between the two versions? I am a bit confused.

Best,

Kadir

Error: prokka: Failed to download resource "aragorn"

Hi everyone!
I am new to this field and am currently installing Prokka on my MacBook M2. Despite my attempts with both the Homebrew and Conda package managers, the result is the same: Aragorn, a dependency of Prokka, is not available. I would greatly appreciate any help in resolving this!

Error: prokka: Failed to download resource "aragorn"
Failure while executing; /usr/bin/env /opt/homebrew/Library/Homebrew/shims/shared/curl --disable --cookie /dev/null --globoff --show-error --user-agent Homebrew/4.1.14\ $Macintosh\;\ arm64\ Mac\ OS\ X\ 13.6$\ curl/8.1.2 --header Accept-Language:\ en --retry 3 --fail --location --silent --head http://mbio-serv2.mbioekol.lu.se/ARAGORN/Downloads/aragorn1.2.38.tgz exited with 56. Here's the output:
curl: (56) Recv failure: Connection reset by peer

genomad terminates during mmseqs

Hi,

I'm trying to run genomad for the first time! I'm using it on a compute cluster but with shared resources, so trying to control memory and threads.
I've tried several times, each time adjusting cores and memory resources, and also using the split option you indicated in the manual. But I always end up here:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jorap2/.conda/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 1015, in end_to_end
    ctx.invoke(
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 338, in annotate
    genomad.annotate.main(
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/genomad/modules/annotate.py", line 201, in main
    mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, splits)
  File "/home/jorap2/.conda/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 134, in run_mmseqs2
    raise Exception(f"'{command_str}' failed.") from e
Exception: 'mmseqs search all-samples_VIRUSES_out/all-samples_5kb_1.5kb-cir_annotate/all-samples_5kb_1.5kb-cir_mmseqs2/query_db/query_db genomad_db/genomad_db_v1.1/genomad_db all-samples_VIRUSES_out/all-samples_5kb_1.5kb-cir_annotate/all-samples_5kb_1.5kb-cir_mmseqs2/search_db/search_db all-samples_VIRUSES_out/all-samples_5kb_1.5kb-cir_annotate/all-samples_5kb_1.5kb-cir_mmseqs2/tmp --threads 128 -s 6.4 --cov-mode 1 -c 0.2 -e 0.001 --split 16 --split-mode 0' failed.
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=15098211.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

It seems like mmseqs is using 128 threads and I don't know how to contain it. Do you think this is the issue?

Thanks!

Input file does not exist error

Hello,

Thank you for making this. I am looking forward to using it. I have received the following error:

Invalid value for 'INPUT': Path 'sample_1.fa.gz' does not exist.

I have entered the correct path and also tried moving the file to my working directory, to no avail. I am using Singularity to run the Docker image, in case that matters.

Thank you for your time

Processing many genomes

Thanks for the tool, it's been really easy to install and fast to run!
I was just wondering: if I have ~100 SAGs/MAGs to classify, would you suggest concatenating them into a single FASTA file for processing, in order to FDR-correct across all genomes rather than on a per-genome basis?
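If the genomes do end up pooled into one file, contig names that repeat across genomes would collide, and another report on this page shows geNomad rejecting inputs with repeated identifiers. One way around that is to prefix every header with its genome name while merging. This is a sketch, not a recommendation from the geNomad authors: `merge_genomes` and the `__` separator are hypothetical choices (the separator avoids `|`, which appears in geNomad's provirus names elsewhere on this page).

```python
def merge_genomes(genomes, sep="__"):
    """Yield FASTA lines for all genomes, prefixing each header with its genome name."""
    for genome_name, lines in genomes.items():
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # Keep the original header text after the genome prefix.
                yield f">{genome_name}{sep}{line[1:]}"
            elif line:
                yield line                       # sequence lines pass through unchanged


# Invented example: two genomes that both contain a record named contig_1.
genomes = {
    "MAG1": [">contig_1", "ACGT"],
    "MAG2": [">contig_1", "GGCC"],
}
merged = list(merge_genomes(genomes))
print("\n".join(merged))
```

The genome prefix also makes it straightforward to split the results back into per-genome tables afterwards.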

Taxonomy changed when input contigs are different

Hi,

I used geNomad to identify viruses in addition to VirSorter2 and Cenote-taker2. Subsequently, I used geNomad's annotate module to assign taxonomy to all viral contigs, including those identified by other tools. I noticed that three of them received completely different taxonomic assignment when the input contigs were altered.

For instance, one of them was initially categorized as Algavirales (Varidnaviria › Bamfordvirae › Nucleocytoviricota › Megaviricetes) when the input consisted of all assembled contigs (approximately 70,000), but it was then classified as Caudoviricetes (Duplodnaviria › Heunggongvirae › Uroviricota) when only viral contigs were used as the input (around 5000). Is this expected? I would think that taxonomy annotation should be more stable across different input contigs.

Many thanks.

How to understand the results?

Dear @apcamargo,
thank you for developing such a great tool!

I used it to do viral genome taxonomic assignment:

genomad end-to-end --min-score 0.8 --cleanup --splits 16 \
results/09.dereplicate/genomes/virome/representative/vMAGs_hmq.megahit.rep.fa.gz \
genomad_output ~/databases/ecogenomics/geNomad/genomad_db \
>genomad.log 2>&1

Here is the summary of results:

➤ zcat results/09.dereplicate/genomes/virome/representative/vMAGs_hmq.megahit.rep.fa.gz | rg -c "^>"
8439

➤ wc -l genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_plasmid_summary.tsv
483 genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_plasmid_summary.tsv

➤ wc -l genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_virus_summary.tsv
4933 genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_virus_summary.tsv

All vMAGs were identified by VirSorter2 and phamb, and CheckV rated them as complete, high quality, or medium quality.
Given that, there are two things I don't currently understand:

Why does geNomad identify plasmids among viral genomes (vMAGs)? It found 482 plasmids.
The input contains 8439 genomes, so why do only 4932 viral genomes have taxonomic assignments?

Thanks a lot!
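The counts quoted above follow from the `wc -l` output once the header line of each summary table is subtracted: 483 - 1 = 482 plasmids and 4933 - 1 = 4932 viruses. Assuming no sequence is listed in both tables, that leaves 8439 - 482 - 4932 = 3025 input sequences that appear in neither summary. A minimal sketch of the same bookkeeping; the function name is hypothetical, and the only assumption about the files is that they are tab-separated with a single header row.

```python
import csv


def count_data_rows(tsv_path):
    """Count the rows of a tab-separated file, excluding its single header row."""
    with open(tsv_path, newline="") as fin:
        reader = csv.reader(fin, delimiter="\t")
        next(reader, None)  # skip the header; tolerate an empty file
        return sum(1 for _ in reader)


def unclassified(n_input, n_plasmid, n_virus):
    """Sequences in neither table, assuming the two tables do not overlap."""
    return n_input - n_plasmid - n_virus


if __name__ == "__main__":
    # Numbers taken from the report above.
    print(unclassified(8439, 483 - 1, 4933 - 1))
```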

AttributeError

Hi,
following my previous issue from the run with a pretty big input file, I have tried to run a much smaller assembly (the input fa.gz file is about 65 MB), but there is another error this time:

[14:10:19] Executing genomad annotate.
[14:10:19] Creating the ANT01_genomad_output/contigs_annotate directory.
Traceback (most recent call last):
  File "/projappl/project_2006548/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/genomad/cli.py", line 1208, in end_to_end
    ctx.invoke(
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/genomad/cli.py", line 425, in annotate
    genomad.annotate.main(
  File "/opt/conda/lib/python3.10/site-packages/genomad/modules/annotate.py", line 178, in main
    prodigal_obj.run_parallel_prodigal(threads)
  File "/opt/conda/lib/python3.10/site-packages/genomad/prodigal.py", line 92, in run_parallel_prodigal
    self._append_prodigal_fasta(current_file_path, protid_start)
  File "/opt/conda/lib/python3.10/site-packages/genomad/prodigal.py", line 42, in _append_prodigal_fasta
    match.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

Could you please help?
Best,
Tatiana
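The last frames of the traceback show `match.group(1)` being called on `None`, which is what Python's `re` functions return when a pattern does not match. The sketch below reproduces that failure mode in isolation and shows a guard that names the offending header instead. The regular expression is a hypothetical stand-in, not the pattern geNomad actually uses, and the header strings are made up for illustration.

```python
import re

# Hypothetical pattern: an "ID=<contig>_<gene>" field in a gene-caller header.
HEADER_ID = re.compile(r"ID=(\d+)_(\d+)")


def parse_header_id(header):
    """Return (contig_index, gene_index), or raise a descriptive error."""
    match = HEADER_ID.search(header)
    if match is None:
        # Without this check, match.group(1) raises
        # AttributeError: 'NoneType' object has no attribute 'group'
        raise ValueError(f"Header has no ID= field: {header!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_header_id(">contig_1_1 # 3 # 98 # 1 # ID=1_1;partial=00"))
    try:
        parse_header_id(">truncated header")
    except ValueError as err:
        print(err)
```

An error that quotes the header makes it easier to tell whether the input FASTA or an intermediate file is malformed.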

FileNotFoundError

Hi,
Thanks for developing this tool. I am using it for viral taxonomic classification, but I ran into a problem.
My command:
genomad annotate ../checkv_contigs.fa genomad /public/zycheng/database/virus.db/genomad_db_v1.3
And the log:

  Executing geNomad annotate (v1.5.2). This will perform gene calling in the input sequences and annotate the predicted proteins with geNomad's markers.
[09:58:03] Executing genomad annotate.
[09:58:03] Creating the genomad/checkv_contigs_annotate directory.
Traceback (most recent call last):
  File "/public/home/zycheng/anaconda3/envs/genomad/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 425, in annotate
    genomad.annotate.main(
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/genomad/modules/annotate.py", line 167, in main
    database_obj = database.Database(database_path)
  File "/public/home/zycheng/anaconda3/envs/genomad/lib/python3.10/site-packages/genomad/database.py", line 10, in __init__
    with open(database_directory / "version.txt") as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/public/zycheng/database/virus.db/genomad_db/version.txt'

I downloaded the database from Zenodo and extracted it manually. It contains only four files, and version.txt is not among them.

Best,
Zhongyi
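The traceback shows geNomad opening version.txt inside the database directory as its first step, so the run fails if that file is absent from the path given on the command line. A minimal pre-flight check along those lines; the function name is hypothetical, and version.txt is the only file it looks for because it is the only one named in the traceback.

```python
from pathlib import Path


def check_database(database_directory):
    """Return None if the directory looks usable, else a message describing the problem."""
    db = Path(database_directory)
    if not db.is_dir():
        return f"{db} is not a directory"
    if not (db / "version.txt").is_file():
        return f"{db} does not contain version.txt"
    return None


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    problem = check_database("genomad_db")
    print(problem or "database directory looks fine")
```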

How to organize the file structure of geNomad database

Hi Antônio,

I really love this tool. It has really nice docs with beautiful charts and is effortless to use!

I downloaded the database archives from Zenodo and extracted them manually:

./: 32.66 GB
  22.89 GB      genomad_hmm_v1.3
   3.70 GB      genomad_msa_v1.3
   3.27 GB      genomad_hmm_v1.3.tar.gz
   1.37 GB      genomad_db
 810.10 MB      genomad_db_v1.3.tar.gz
 653.91 MB      genomad_msa_v1.3.tar.gz
   6.49 MB      genomad_metadata_v1.3.tsv.g

Then I tested with a genome from GTDB (GCA_000010645.1), which seemed to work as expected, successfully identifying the four plasmids in the file (only one when not using --relaxed).

genomad end-to-end --relaxed --cleanup --threads 40 GCA_000010645.1.fna.gz genomad ~/ws/db/genomad/genomad_db

I have one small question.

Are any files other than genomad_db needed? genomad_hmm_v1.3 and genomad_msa_v1.3, for example, sit outside the genomad_db directory.

-- EDIT --

Hmm, I think the answer is no. It still works after moving the other files elsewhere.
