ebi-metagenomics / genomes-catalogue-pipeline

MGnify genome analysis pipeline

License: Other

Shell 3.35% Python 63.45% R 0.31% Dockerfile 2.33% Perl 6.55% Nextflow 24.02%

genomes-catalogue-pipeline's Introduction

MGnify genomes catalogue pipeline

A pipeline to perform taxonomic and functional annotation and to generate a catalogue from a set of isolate and/or metagenome-assembled genomes (MAGs) using the workflow described in the following publication:

Gurbich TA, Almeida A, Beracochea M, Burdett T, Burgin J, Cochrane G, Raj S, Richardson L, Rogers AB, Sakharova E, Salazar GA and Finn RD. (2023) MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues. J Mol Biol. doi: https://doi.org/10.1016/j.jmb.2023.168016

Detailed information about existing MGnify catalogues: https://docs.mgnify.org/src/docs/genome-viewer.html

Tools used in the pipeline

Tool/Database	Version	Purpose
CheckM2	1.0.1	Determining genome quality
dRep	3.2.2	Genome clustering
Mash	2.3	Sketch for the catalogue; placement of genomes into clusters (update only); strain tree
GUNC	1.0.3	Quality control
GUNC DB	2.0.4	Database for GUNC
GTDB-Tk	2.3.0	Assigning taxonomy; generating alignments
GTDB	r214	Database for GTDB-Tk
Prokka	1.14.6	Protein annotation
IQ-TREE 2	2.2.0.3	Generating a phylogenetic tree
Kraken 2	2.1.2	Generating a kraken database
Bracken	2.6.2	Generating a bracken database
MMseqs2	13.45111	Generating a protein catalogue
eggNOG-mapper	2.1.11	Protein annotation (eggNOG, KEGG, COG, CAZy)
eggNOG DB	5.0.2	Database for eggNOG-mapper
Diamond	2.0.11	Protein annotation (eggNOG)
InterProScan	5.62-94.0	Protein annotation (InterPro, Pfam)
CRISPRCasFinder	4.3.2	Annotation of CRISPR arrays
AMRFinderPlus	3.11.4	Antimicrobial resistance gene annotation; virulence factors, biocide, heat, acid, and metal resistance gene annotation
AMRFinderPlus DB	3.11 2023-02-23.1	Database for AMRFinderPlus
SanntiS	0.9.3.2	Biosynthetic gene cluster annotation
Infernal	1.1.4	RNA predictions
tRNAscan-SE	2.0.9	tRNA predictions
Rfam	14.9	Identification of SSU/LSU rRNA and other ncRNAs
Panaroo	1.3.2	Pan-genome computation
Seqtk	1.3	Generating a gene catalogue
VIRify	2.0.1	Viral sequence annotation
Mobilome annotation pipeline	2.0.1	Mobilome annotation
samtools	1.15	FASTA indexing

Setup

Environment

The pipeline is implemented in Nextflow.

Requirements:

Singularity or Docker

Reference databases

The pipeline needs the following reference databases and configuration files (roughly 150 GB in total):

ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/gunc_db_2.0.4.dmnd.gz
ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/eggnog_db_5.0.2.tgz
ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/rfam_14.9/
ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/kegg_classes.tsv
ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/continent_countries.csv
https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz
ftp://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/3.11/2023-02-23.1
https://zenodo.org/records/4626519/files/uniref100.KO.v1.dmnd.gz

Containers

The pipeline requires Singularity or Docker as the container engine.

The containers are hosted on BioContainers and in the quay.io/microbiome-informatics repository.

It's possible to build the containers from scratch using the following script:

cd containers && bash build.sh

Running the pipeline

Data preparation

Download your genomes into local directories before running the pipeline and make sure they are uncompressed. Scripts to fetch genomes from ENA (fetch_ena.py) and NCBI (fetch_ncbi.py) are provided; they must be run separately from the pipeline. If you have downloaded genomes from both ENA and NCBI, put them into separate folders.
When genomes are fetched from ENA with the fetch_ena.py script, a CSV file with contamination and completeness statistics is also created in the directory where the genomes are saved. If you download genomes another way, this CSV file needs to be created manually (each line should contain genome accession, % completeness, % contamination). The ENA fetching script also pre-filters genomes to satisfy the QS50 cut-off (QS = % completeness - 5 * % contamination).
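The QS50 rule can be sketched in Python as follows. This is a minimal illustration, not the code used by fetch_ena.py: the function names are hypothetical, the CSV is assumed to have no header row, and treating the cut-off as strict (QS > 50) rather than inclusive is also an assumption.

```python
import csv
import io


def quality_score(completeness: float, contamination: float) -> float:
    """QS = % completeness - 5 * % contamination."""
    return completeness - 5.0 * contamination


def filter_qs50(csv_text: str, cutoff: float = 50.0) -> list[str]:
    """Return accessions whose QS passes the cutoff.

    Assumes rows of: accession, % completeness, % contamination (no header).
    Treating the cutoff as strict (QS > cutoff) is an assumption.
    """
    passing = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        accession, completeness, contamination = (field.strip() for field in row)
        if quality_score(float(completeness), float(contamination)) > cutoff:
            passing.append(accession)
    return passing
```

For example, a genome with 90% completeness and 2% contamination scores 80 and passes, while one with 60% completeness and 3% contamination scores 45 and is dropped.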
You will need the following information to run the pipeline:

catalogue name (for example, zebrafish-faecal)
catalogue version (for example, 1.0)
catalogue biome (for example, root:Host-associated:Human:Digestive system:Large intestine:Fecal)
minimum and maximum accession numbers to be assigned to the genomes (MGnify-specific only); Max - Min must equal the total number of genomes (NCBI + ENA)
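A quick sanity check of the accession range before launching might look like the sketch below. Both helper names are hypothetical, and the nine-digit zero padding is an assumption inferred from accessions such as MGYG000000001; the pipeline's own numbering logic may differ.

```python
def format_accession(prefix: str, number: int, width: int = 9) -> str:
    """Zero-pad the number; width=9 mirrors accessions like MGYG000000001 (assumed)."""
    return f"{prefix}{number:0{width}d}"


def check_range(mgyg_start: int, mgyg_end: int, n_genomes: int) -> bool:
    """True when Max - Min equals the total genome count (NCBI + ENA)."""
    return mgyg_end - mgyg_start == n_genomes
```

With the example values used in the run command in this README (start 0, end 10), `check_range(0, 10, 10)` holds for a set of ten genomes.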

Execution

The pipeline is built in Nextflow and uses containers to run the software (conda is not currently supported). To run the pipeline, create a profile that suits your environment; the ebi profile in nextflow.config can be used as a template.

After downloading the databases and adjusting the config file:

nextflow run EBI-Metagenomics/genomes-pipeline -c <custom.config> -profile <profile> \
--genome-prefix=MGYG \
--biome="root:Host-associated:Fish:Digestive system" \
--ena_genomes=<path to genomes> \
--ena_genomes_checkm=<path to genomes quality data> \
--mgyg_start=0 \
--mgyg_end=10 \
--preassigned_accessions=<path to file with preassigned accessions if using> \
--catalogue_name=zebrafish-faecal \
--catalogue_version="1.0" \
--ftp_name="zebrafish-faecal" \
--ftp_version="v1.0" \
--outdir="<path-to-results>"

Development

Install development tools (including pre-commit hooks to run Black code formatting).

pip install -r requirements-dev.txt
pre-commit install

Code style

Use Black; it is configured when you install the pre-commit tools as above.

To run it manually: black .

Testing

This repo has two sets of tests: Python unit tests for some of the most critical Python scripts, and nf-test scripts for the Nextflow code.

To run the Python tests:

pip install -r requirements-test.txt
pytest

To run the Nextflow tests, the databases have to be downloaded manually; we are working to improve this.

nf-test test tests/*

genomes-catalogue-pipeline's Issues

Can't fetch ena data with biome list

https://github.com/EBI-Metagenomics/genomes-pipeline/blob/853487f6dda1420fd8b6b41dd4aff5c8540c7e37/bin/fetch_ena.py#L66

The method above returns nothing. I believe it's because metagenome_source returns empty results via the API:

https://www.ebi.ac.uk/ena/portal/api/search?result=wgs_set&query=assembly_type%3D%22metagenome-assembled%20genome%20%28mag%29%22&fields=study_accession%2Cmetagenome_source&limit=10&format=json&download=false

[
  {
    "study_accession": "PRJEB35770",
    "metagenome_source": "",
    "accession": "CAEMXZ010000000"
  },
  ....
  {
    "study_accession": "PRJEB35770",
    "metagenome_source": "",
    "accession": "CAESAJ010000000"
  }
]

I'm not sure which field would work here to get the biome. When I query the API for the search fields of wgs_set, I get a 500: https://www.ebi.ac.uk/ena/portal/api/searchFields?dataPortal=metagenome&result=wgs_set&format=json - Is there an alternative field that contains the biome, or any workaround? Thanks!
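To try candidate fields quickly, one could build the search URL with a custom field list and measure how often a field comes back empty. The sketch below is an illustration only: the helper names are hypothetical, the query string is copied from the URL in this report, and no alternative biome field is implied to exist.

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(fields: list[str], limit: int = 10) -> str:
    """Build the wgs_set search URL from this report with a custom field list."""
    params = {
        "result": "wgs_set",
        "query": 'assembly_type="metagenome-assembled genome (mag)"',
        "fields": ",".join(fields),
        "limit": limit,
        "format": "json",
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"


def empty_fraction(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is missing or an empty string."""
    if not records:
        return 0.0
    empty = sum(1 for record in records if not record.get(field))
    return empty / len(records)
```

The response body can be loaded with `json.load(urllib.request.urlopen(url))` and passed to `empty_fraction`; for the two records quoted above, the fraction for metagenome_source is 1.0.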

dead links

https://github.com/EBI-Metagenomics/genomes-pipeline/blob/ac2e8f1be5c8f6d31ea56abd266c528be7ccabbf/README.md?plain=1#L78

These links don't work:

(fetch_ena.py)

and

(fetch_ncbi.py)

Updating pre-existing catalog and functional profiling

Hi, I had a couple of questions regarding the pipeline and the generated catalogues:

Could you guide me through the process of updating a pre-existing catalogue? This is something I've seen you mention in the release, but I haven't seen it documented on GitHub.
Which of the generated catalogue files would be used to make a functional metagenome profile, and which program could I use (Humann3, etc.)?

Thank you very much in advance.
Best regards, Sam

Improve metadata script

The metadata script makes many ENA API calls to get information for each genome. If the ENA API is unstable, the whole script fails and so does the pipeline; -resume then reruns everything from scratch.
It may make sense to improve this script:

create a backup file for already fetched metadata
fetch as one big file and parse it
fetch metadata in the start of pipeline (in parallel with fetch ...)
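The first suggestion (a backup file for already fetched metadata) could be sketched as below. This is not the existing metadata script: `fetch_with_cache`, the JSON cache layout, and the `fetch_one` callback standing in for the real ENA request are all hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable


def fetch_with_cache(
    accessions: list[str],
    fetch_one: Callable[[str], dict],
    cache_path: Path,
    retries: int = 3,
    delay: float = 1.0,
) -> dict[str, dict]:
    """Fetch metadata per accession, saving progress so reruns skip finished work."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for accession in accessions:
        if accession in cache:
            continue  # already fetched in an earlier run
        for attempt in range(1, retries + 1):
            try:
                cache[accession] = fetch_one(accession)
                break
            except OSError:  # network errors such as URLError derive from OSError
                if attempt == retries:
                    raise  # give up; results fetched so far stay on disk
                time.sleep(delay * attempt)  # simple linear back-off
        cache_path.write_text(json.dumps(cache))  # checkpoint after each genome
    return cache
```

Writing the cache after every genome costs some I/O but means a failure late in the run loses at most one request's worth of work.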

Fetch singularity images for offline use

Hi, thank you for sharing this great pipeline! Is there a way to fetch all the required images before executing the pipeline?

I've tried nf-core download (which may not be a proper use for this project), and it ended with the errors below. I look forward to your response.

$ nf-core download --revision v2.2.0 --outdir genomes-pipeline_v2.2.0 -s singularity EBI-Metagenomics/genomes-pipeline
INFO    Process workflow revision v2.2.0, found 24 container images in total
Pulling singularity images
ERROR   Please try to rerun 
            "Singularity pull --name ..../quay.io-biocontainers-bracken-2.8--py...img quay.io-biocontainers-bracken-2.8--py...img" manually with a different registry.f

Software & system information:

System: Red Hat 8.5.0-10
Singularity-ce: version 3.10.2-1.el8
Nextflow: version 23.10.0 build 5889
nf-core: version 2.10
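One possible workaround is to pull each image by hand into a cache directory. The sketch below only builds the command lines; the helper names are hypothetical, and the file-naming scheme is an assumption modelled on the cached image name that appears in another report here (quay.io-microbiome-informatics-genomes-pipeline.mash2nwk-v1.img). Whether Nextflow picks up files named this way from its Singularity cache directory should be verified locally.

```python
import re


def cache_file_name(image: str) -> str:
    """Flatten an image reference into a single file name ending in .img."""
    name = re.sub(r"^[^/]+://", "", image)  # drop a scheme such as docker://
    return re.sub(r"[:/]", "-", name) + ".img"


def pull_command(image: str, cache_dir: str) -> list[str]:
    """Argument list for `singularity pull <output file> <URI>`."""
    uri = image if "://" in image else f"docker://{image}"
    return ["singularity", "pull", f"{cache_dir}/{cache_file_name(image)}", uri]
```

For instance, `pull_command("quay.io/microbiome-informatics/sanntis:0.9.3.2", "/path/to/cache")` yields a command that can be passed to `subprocess.run` on a machine with network access.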

Completeness and contamination missing from fetch_ncbi.py

Are we supposed to compile this data ourselves? fetch_ena.py seems to generate this data, but fetch_ncbi.py does not.
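If the statistics do have to be compiled by hand, one option is to run a quality tool such as CheckM2 (listed among the pipeline's tools) on the NCBI genomes and convert its report into the three-column CSV the README describes. The sketch below assumes a tab-separated report with columns named Name, Completeness and Contamination; the function name is hypothetical, and whether the pipeline expects a header row or file extensions in the accession column is not stated in the README.

```python
import csv
import io


def checkm2_to_csv(report_tsv: str) -> str:
    """Convert a CheckM2-style TSV report to 'accession,completeness,contamination' lines.

    Assumes columns named Name, Completeness and Contamination.
    """
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([row["Name"], row["Completeness"], row["Contamination"]])
    return out.getvalue()
```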

Interproscan error

Seeing the following error pretty randomly:

ERROR ~ Error executing process > 'GAP:ANNOTATE:IPS (4)'

Caused by:
  Process `GAP:ANNOTATE:IPS (4)` terminated with an error exit status (231)

Command executed:

  interproscan.sh     -cpu 8     -dp     --goterms     -pa     -f TSV     --input protein_catalogue-90.4.faa     -o protein_catalogue-90.4.IPS.tsv

Command exit status:
  231

Command output:
  2023-12-21 04:36:10,501 [main] [uk.ac.ebi.interpro.scan.jms.master.StandaloneBlackBoxMaster:190] WARN - StepInstance 7 is being re-run following a failure.
  2023-12-21 04:36:34,277 [amqEmbeddedWorkerJmsContainer-6] [uk.ac.ebi.interpro.scan.management.model.implementations.RunBinaryStep:199] ERROR - Command line failed with exit code: 1
  Command: bin/hmmer/hmmer3/3.3/hmmsearch -Z 17929 --cut_ga --cpu 1 -o temp/defdc38590ee_20231221_041344507_5lx8//jobPfam/000000005001_000000010000.raw.out data/pfam/35.0/pfam_a.hmm temp/defdc38590ee_20231221_041344507_5lx8//jobPfam/000000005001_000000010000.fasta 
  Error output from binary:
  
  Error: File existence/permissions problem in trying to open HMM file data/pfam/35.0/pfam_a.hmm.
  HMM file data/pfam/35.0/pfam_a.hmm not found (nor an .h3m binary of it)
  
  
  2023-12-21 04:36:34,277 [amqEmbeddedWorkerJmsContainer-6] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:216] ERROR - Execution thrown when attempting to executeInTransaction the StepExecution.  All database activity rolled back.
  java.lang.IllegalStateException: Command line failed with exit code: 1
  Command: bin/hmmer/hmmer3/3.3/hmmsearch -Z 17929 --cut_ga --cpu 1 -o temp/defdc38590ee_20231221_041344507_5lx8//jobPfam/000000005001_000000010000.raw.out data/pfam/35.0/pfam_a.hmm temp/defdc38590ee_20231221_041344507_5lx8//jobPfam/000000005001_000000010000.fasta 
  Error output from binary:
  
  Error: File existence/permissions problem in trying to open HMM file data/pfam/35.0/pfam_a.hmm.
  HMM file data/pfam/35.0/pfam_a.hmm not found (nor an .h3m binary of it)
  
  
        at uk.ac.ebi.interpro.scan.management.model.implementations.RunBinaryStep.execute(RunBinaryStep.java:201) ~[interproscan-management-5.62-94.0.jar:?]
        at uk.ac.ebi.interpro.scan.jms.activemq.StepExecutionTransactionImpl.executeInTransaction(StepExecutionTransactionImpl.java:87) ~[interproscan-5.jar:?]
        at jdk.internal.reflect.GeneratedMethodAccessor101.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:367) ~[spring-tx-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:118) ~[spring-tx-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at com.sun.proxy.$Proxy141.executeInTransaction(Unknown Source) ~[?:?]
        at uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener.onMessage(LocalJobQueueListener.java:200) [interproscan-5.jar:?]
        at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:761) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:699) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:674) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:318) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:257) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1186) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1176) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1073) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at java.lang.Thread.run(Thread.java:829) [?:?]
  2023-12-21 04:36:34,278 [amqEmbeddedWorkerJmsContainer-6] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:218] ERROR - The exception is :
  2023-12-21 04:36:34,278 [amqEmbeddedWorkerJmsContainer-6] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:222] ERROR - StepExecution with errors - stepName: stepPfamRunHmmer3
  2023-12-21 04:36:34,322 [main] [uk.ac.ebi.interpro.scan.jms.activemq.NonZeroExitOnUnrecoverableError:25] FATAL - Analysis step 360 : Run HMMER 3 Binary for selected proteins for proteins 5001 to 10000 has failed irretrievably.  Available StackTraces follow.
  2023-12-21 04:36:34,323 [main] [uk.ac.ebi.interpro.scan.jms.activemq.NonZeroExitOnUnrecoverableError:42] FATAL - The JVM will now exit with a non-zero exit status.
  2023-12-21 04:36:34,323 [main] [uk.ac.ebi.interpro.scan.jms.master.StandaloneBlackBoxMaster:363] ERROR - Exception thrown by StandaloneBlackBoxMaster: 
  java.lang.IllegalStateException: InterProScan exiting with non-zero status, see logs for further information.
        at uk.ac.ebi.interpro.scan.jms.activemq.NonZeroExitOnUnrecoverableError.failed(NonZeroExitOnUnrecoverableError.java:43) ~[interproscan-5.jar:?]
        at uk.ac.ebi.interpro.scan.jms.master.StandaloneBlackBoxMaster.run(StandaloneBlackBoxMaster.java:169) [interproscan-5.jar:?]
        at uk.ac.ebi.interpro.scan.jms.main.Run.main(Run.java:413) [interproscan-5.jar:?]

Command error:
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1186)
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1176)
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1073)
        at java.base/java.lang.Thread.run(Thread.java:829)
  2023-12-21 04:36:34,277 [amqEmbeddedWorkerJmsContainer-6] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:216] ERROR - Execution thrown when attempting to executeInTransaction the StepExecution.  All database activity rolled back.
  java.lang.IllegalStateException: Command line failed with exit code: 1
  Command: bin/hmmer/hmmer3/3.3/hmmsearch -Z 17929 --cut_ga --cpu 1 -o temp/defdc38590ee_20231221_041344507_5lx8//jobPfam/000000005001_000000010000.raw.out data/pfam/35.0/pfam_a.hmm temp/defdc38590ee_20231221_041344507_5lx8//jobPfam/000000005001_000000010000.fasta 
  Error output from binary:
  
  Error: File existence/permissions problem in trying to open HMM file data/pfam/35.0/pfam_a.hmm.
  HMM file data/pfam/35.0/pfam_a.hmm not found (nor an .h3m binary of it)
  
  
        at uk.ac.ebi.interpro.scan.management.model.implementations.RunBinaryStep.execute(RunBinaryStep.java:201) ~[interproscan-management-5.62-94.0.jar:?]
        at uk.ac.ebi.interpro.scan.jms.activemq.StepExecutionTransactionImpl.executeInTransaction(StepExecutionTransactionImpl.java:87) ~[interproscan-5.jar:?]
        at jdk.internal.reflect.GeneratedMethodAccessor101.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:367) ~[spring-tx-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:118) ~[spring-tx-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212) ~[spring-aop-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at com.sun.proxy.$Proxy141.executeInTransaction(Unknown Source) ~[?:?]
        at uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener.onMessage(LocalJobQueueListener.java:200) [interproscan-5.jar:?]
        at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:761) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:699) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:674) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:318) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:257) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1186) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1176) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1073) [spring-jms-5.2.24.RELEASE.jar:5.2.24.RELEASE]
        at java.lang.Thread.run(Thread.java:829) [?:?]
  2023-12-21 04:36:34,278 [amqEmbeddedWorkerJmsContainer-6] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:218] ERROR - The exception is :
  2023-12-21 04:36:34,278 [amqEmbeddedWorkerJmsContainer-6] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:222] ERROR - StepExecution with errors - stepName: stepPfamRunHmmer3
  2023-12-21 04:36:34,322 [main] [uk.ac.ebi.interpro.scan.jms.activemq.NonZeroExitOnUnrecoverableError:25] FATAL - Analysis step 360 : Run HMMER 3 Binary for selected proteins for proteins 5001 to 10000 has failed irretrievably.  Available StackTraces follow.
  2023-12-21 04:36:34,323 [main] [uk.ac.ebi.interpro.scan.jms.activemq.NonZeroExitOnUnrecoverableError:42] FATAL - The JVM will now exit with a non-zero exit status.
  2023-12-21 04:36:34,323 [main] [uk.ac.ebi.interpro.scan.jms.master.StandaloneBlackBoxMaster:363] ERROR - Exception thrown by StandaloneBlackBoxMaster: 
  java.lang.IllegalStateException: InterProScan exiting with non-zero status, see logs for further information.
        at uk.ac.ebi.interpro.scan.jms.activemq.NonZeroExitOnUnrecoverableError.failed(NonZeroExitOnUnrecoverableError.java:43) ~[interproscan-5.jar:?]
        at uk.ac.ebi.interpro.scan.jms.master.StandaloneBlackBoxMaster.run(StandaloneBlackBoxMaster.java:169) [interproscan-5.jar:?]
        at uk.ac.ebi.interpro.scan.jms.main.Run.main(Run.java:413) [interproscan-5.jar:?]
  java.lang.IllegalStateException: InterProScan exiting with non-zero status, see logs for further information.
        at uk.ac.ebi.interpro.scan.jms.activemq.NonZeroExitOnUnrecoverableError.failed(NonZeroExitOnUnrecoverableError.java:43)
        at uk.ac.ebi.interpro.scan.jms.master.StandaloneBlackBoxMaster.run(StandaloneBlackBoxMaster.java:169)
        at uk.ac.ebi.interpro.scan.jms.main.Run.main(Run.java:413)
  InterProScan analysis failed. Exception thrown by StandaloneBlackBoxMaster. Check the log file for details
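The log shows hmmsearch failing because data/pfam/35.0/pfam_a.hmm could not be opened. Since the failure is intermittent, a transient file-system or mount problem is one possible cause, though that is a guess. A pre-flight check along the following lines could confirm whether the file is visible before InterProScan starts; the function name and the idea of running it inside the task are assumptions.

```python
from pathlib import Path


def hmm_available(interproscan_dir: str, relative_hmm: str = "data/pfam/35.0/pfam_a.hmm") -> bool:
    """True if the HMM file, or its .h3m binary, exists under the InterProScan directory."""
    hmm = Path(interproscan_dir) / relative_hmm
    return hmm.is_file() or Path(str(hmm) + ".h3m").is_file()
```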

Contigs naming is not consistent between genomes

Hi,

In the new UHGG release, contig names contain a '.fa' suffix after the genome accession in some cases.

MGYG000005536.fa_69
MGYG000004909.fa_32
MGYG000006874.fa_4
MGYG000005065.fa_9

In most cases, this suffix is absent:

MGYG000015734_2
MGYG000276370_5
MGYG000003188_5
MGYG000002642_60

I think this is a bug (obviously not critical) that could be fixed on occasion.

Best,
Florian
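Until the names are harmonised upstream, downstream users could normalise them on their side. A minimal sketch, assuming every affected name has the exact form shown in the examples above (accession, ".fa", underscore, contig number); the function name is hypothetical:

```python
import re

_FA_INFIX = re.compile(r"^(MGYG\d+)\.fa_(\d+)$")


def normalise_contig_name(name: str) -> str:
    """Rewrite MGYG000005536.fa_69 as MGYG000005536_69; leave other names untouched."""
    return _FA_INFIX.sub(r"\1_\2", name)
```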

Missing EBI profile

https://github.com/EBI-Metagenomics/genomes-pipeline/blob/853487f6dda1420fd8b6b41dd4aff5c8540c7e37/nextflow.config#L27

I can't seem to find the ebi profile in nextflow.config that's referenced in the README and above

No valid AMRFinder database is found

Hi, I had one question about the installation of this pipeline

I feel quite confused because the README did not mention the need to install the AMRFinder database during installation, yet the nextflow.config file requires me to specify the path to this database. Following your earlier suggestion, I downloaded the corresponding version of the database from the official AMRFinder website and put it in this directory: /media/chenwen/miniconda3/share/amrfinderplus/data/2023-02-23.1. Yet, during a test of this pipeline, I encountered issues.

*** ERROR ***
No valid AMRFinder database is found.
This directory (or symbolic link to directory) is not found: /media/chenwen/miniconda3/share/amrfinderplus/data/2023-02-23.1
To download the latest version to the default directory run: amrfinder -u

HOSTNAME: 410be4c3662f
SHELL: ?
PWD: /mnt/chenwen/02.program/10.mgnify_test/00.script/work/f3/f342e386c93a2c661490879fd67962
PATH: /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/media/pipeline/anaconda3/envs/nextflow/.nextflow/assets/EBI-Metagenomics/genomes-pipeline/bin
Progam name:  amrfinder
Command line: amrfinder --plus -n MGYG000000000.fna -p MGYG000000000.faa -g MGYG000000000.gff -d /media/chenwen/miniconda3/share/amrfinderplus/data/2023-02-23.1 -a prokka --output MGYG000000000_amrfinderplus.tsv --threads 1

Work dir:
/mnt/chenwen/02.program/10.mgnify_test/00.script/work/f3/f342e386c93a2c661490879fd67962

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

-- Check '.nextflow.log' file for details

Upon inspecting the modules/amrfinder_plus.nf file, I noticed the use of the amrfinder image. Hence, I suspect that my local database path might not be properly mapped into this image. However, I am uncertain if this is the issue and do not know how to resolve it.

Thank you very much in advance.
Best regards,
May

Container image is missing the `ps` command

Hi, I hit an error in process GAP:MASH_TO_NWK.

# .command.err
Command error:
   Command 'ps' required by nextflow to collect task metrics cannot be found

# .command.run
nxf_launch() {
    set +u; env - PATH="$PATH" ${TMP:+SINGULARITYENV_TMP="$TMP"} \
${TMPDIR:+SINGULARITYENV_TMPDIR="$TMPDIR"} \
${NXF_TASK_WORKDIR:+SINGULARITYENV_NXF_TASK_WORKDIR="$NXF_TASK_WORKDIR"} \
SINGULARITYENV_NXF_DEBUG="${NXF_DEBUG:=0}" \
singularity exec --no-home --pid -B /home/user /home/user/singularity-images/quay.io-microbiome-informatics-genomes-pipeline.mash2nwk-v1.img /bin/bash -c "cd $PWD; eval $(nxf_container_env); /bin/bash /home/user/temp/work/1b/7f283388b4cb40d93918bcf18e0b97/.command.run nxf_trace"
}

I think the container genomes-pipeline.mash2nwk lacks an important tool, ps. A possible solution is to add ps to the Dockerfile of mash2nwk (as below). Since Docker is not installed on the system I work on, I had some trouble converting the Dockerfile to a Singularity image. I would be very grateful if you could publish an updated container on quay.io.

# containers/mash2nwk/Dockerfile
FROM r-base:4.1.0

LABEL software="mash2nwk"
LABEL software.version="1.0.0"
LABEL description="Generate Mash distance tree of conspecific genomes"
LABEL website="https://github.com/EBI-Metagenomics/genomes-pipeline"
LABEL license="GPLv3"

# Line added, to make sure ps is available
RUN apt-get update && apt-get install -y procps g++ && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN install2.r \
        reshape2 \
        fastcluster \
        optparse \
        data.table \
        ape

RUN mkdir /tools
COPY mash2nwk1.R /tools
RUN chmod a+x /tools/*
ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools

# Workdir
RUN mkdir /data
WORKDIR /data

# Entrypoint
CMD ["Rscript", "/tools/mash2nwk1.R"]

where can I download taxcheck?

Thank you for sharing the code

Where can I download taxcheck.sh?
Is taxcheck a piece of software?
I searched GitHub and Google but can't find it.

MGnify merges distinct species in the same cluster

Hi,

I have noticed that MGnify merges some distinct species in the same cluster.
Some examples are:

Phocaeicola dorei and Phocaeicola vulgatus
Adlercreutzia celatus_A and Adlercreutzia equolifaciens

In these cases, the species delineation ANI cutoff is slightly above 95%.

I would suggest to perform dereplication in two steps:

Group genomes with the same GTDB annotation at species level if available
Use dRep for genomes without annotation at species level

What do you think?
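The proposed two-step scheme can be sketched as follows. This is an illustration of the suggestion, not how the pipeline currently works; the function name and input layout (genome mapped to its GTDB species label, with an empty string or a bare "s__" meaning no species assignment) are assumptions.

```python
from collections import defaultdict


def split_for_dereplication(species_by_genome: dict[str, str]) -> tuple[dict[str, list[str]], list[str]]:
    """Step 1: group genomes sharing a GTDB species label.
    Step 2: return genomes lacking a species label, to be clustered separately (e.g. with dRep).
    """
    groups = defaultdict(list)
    unassigned = []
    for genome, species in species_by_genome.items():
        label = species.strip()
        # an empty label or a bare 's__' is treated as "no species assigned"
        if label in ("", "s__"):
            unassigned.append(genome)
        else:
            groups[label].append(genome)
    return dict(groups), unassigned
```

Genomes in the second list would then go through ANI-based clustering, while each group from step 1 already forms one species cluster.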

Question about lowercase/uppercase genomes

Hi, I noticed that some genome .fna files (e.g. MGYG000002322) are all lowercase, while others (e.g. MGYG000000001) are all uppercase. Could you explain the significance of the lowercase/uppercase genomes?

Best,
Francisco

SanntiS floods CPUs

Hi, developers

I got a warning about CPU usage from process SANNTIS, which floods the CPUs regardless of the process cpus (1) assigned in the config.

I checked the scripts and found that the SanntiS execution command lacks a CPU parameter. I made some modifications like this and reran the step.

However, CPU usage still exceeded the limit. I also reported this issue to SanntiS (Finn-Lab/SanntiS#6), but I'm not sure how to fix it. Is there an option to skip SanntiS temporarily?

Environment:

System: Red Hat 8.5.0-10
Singularity-ce: version 3.10.2-1.el8
Linux Cgroups v1  # thus cannot limit cpus when creating container
Nextflow: version 23.10.0 build 5889
Container: "quay.io/microbiome-informatics/sanntis:0.9.3.2"

Execution scripts and log files:
logfiles.zip

KRAKEN2_BUILD cp *.txt permission error

Hi, I hit an error in process GAP:KRAKEN_SWF:KRAKEN2_BUILD (1).

ERROR~ Error executing process > 'GAP:KRAKEN_SWF:KRAKEN2_BUILD (1)'

Caused by:
  Process `GAP:KRAKEN_SWF:KRAKEN2_BUILD (1)` terminated with an error exit status (1)

Command executed:

  kraken2-build --build     --db kraken2_db_zebrafish-faecal_v1.0     --threads 1

Command exit status:
  1

Command output:
  (empty)

Command error:
  cp: cannot open '/mnt/02.program/10.mgnify_test/00.script_test/work/f1/41e40f8fb880a273d678abd448e214/kraken2_db_zebrafish-faecal_v1.0/library/added/prelim_map_Nyw1kzZg3u.txt' for reading: Permission denied
  cp: cannot open '/mnt/02.program/10.mgnify_test/00.script_test/work/f1/41e40f8fb880a273d678abd448e214/kraken2_db_zebrafish-faecal_v1.0/library/added/prelim_map_c3xTok5gdJ.txt' for reading: Permission denied
  cp: cannot open '/mnt/02.program/10.mgnify_test/00.script_test/work/f1/41e40f8fb880a273d678abd448e214/kraken2_db_zebrafish-faecal_v1.0/library/added/prelim_map_RIfYUEdjBD.txt' for reading: Permission denied

Work dir:
  /mnt/02.program/10.mgnify_test/00.script_test/work/a4/3db458ad6a66b51a0da32baa40ae28

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Then I changed the directory to the mentioned folder above.
I checked these *txt files and observed their permissions, but I'm uncertain about the specific step in which these files were generated.
Hope to receive some guidance!

drwxr-xr-x 2 root root    4096 Dec 26 15:11 ./
drwxr-xr-x 3 root root    4096 Dec 26 15:11 ../
-rw-r--r-- 1 root root 2317728 Dec 26 15:11 GNBKq8t8Va.fna
-rw-r--r-- 1 root root       0 Dec 26 15:11 GNBKq8t8Va.fna.masked
-rw-r--r-- 1 root root 2281926 Dec 26 15:11 M1kPl0XaQL.fna
-rw-r--r-- 1 root root       0 Dec 26 15:11 M1kPl0XaQL.fna.masked
-rw-r--r-- 1 root root 1769393 Dec 26 15:11 jMxWgZVgaq.fna
-rw-r--r-- 1 root root       0 Dec 26 15:11 jMxWgZVgaq.fna.masked
-rw------- 1 root root    1071 Dec 26 15:11 prelim_map_Nyw1kzZg3u.txt
-rw------- 1 root root    1041 Dec 26 15:11 prelim_map_RIfYUEdjBD.txt
-rw------- 1 root root    3603 Dec 26 15:11 prelim_map_c3xTok5gdJ.txt

Environment
Nextflow version: 23.04.1
Java version: 17
Operating system: Linux
Bash version: GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)

Taxonomic annotation is not consistent in metadata file

Hello,

I have noticed that the taxonomic annotation is not consistent among genomes assigned to the same species representative.

Below is my code:

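library(tidyverse)

# Load the metadata for all genomes in the human-gut v2.0 catalogue
genomes_all_metadata = read_tsv('https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0/genomes-all_metadata.tsv')

# Count genomes for each (species representative, lineage) pair
genomes_all_metadata = genomes_all_metadata %>%
  select(Species_rep, Lineage) %>%
  group_by(Species_rep, Lineage) %>%
  summarise(num_genomes = n())

# Keep only species representatives that have more than one distinct lineage
genomes_all_metadata = genomes_all_metadata %>%
  group_by(Species_rep) %>%
  filter(n() > 1)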

For instance, genomes assigned to MGYG000002478 are classified as Phocaeicola dorei in some rows and as Bacteroides_B dorei in others.
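
For readers who do not use R, the same check can be sketched in Python with only the standard library. The column names `Species_rep` and `Lineage` come from the metadata file used above; the function name, the genome IDs and the shortened lineage strings in the example rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def inconsistent_lineages(tsv_text):
    """Return {Species_rep: {Lineage: genome count}} for reps with more than one distinct lineage."""
    counts = defaultdict(lambda: defaultdict(int))
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        counts[row["Species_rep"]][row["Lineage"]] += 1
    return {rep: dict(lineages) for rep, lineages in counts.items() if len(lineages) > 1}


# Hypothetical rows: genome IDs and lineage strings are placeholders, not real catalogue data
example = (
    "Genome\tSpecies_rep\tLineage\n"
    "GENOME_A\tMGYG000002478\ts__Phocaeicola dorei\n"
    "GENOME_B\tMGYG000002478\ts__Bacteroides_B dorei\n"
    "GENOME_C\tREP_X\ts__Some species\n"
)
print(inconsistent_lineages(example))
```

To run it on the real file, the downloaded TSV content would be passed in as a string in place of `example`.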

Could you fix this?

Florian

Prokka error

I'm getting an error at the following step. Any help would be appreciated!

[66/af4c33] process > GAP:PROCESS_SINGLETON_GENOMES:PROKKA (MGYG000000010)  [ 42%] 3 of 7, failed: 1

Error:

ERROR ~ Error executing process > 'GAP:PROCESS_SINGLETON_GENOMES:PROKKA (MGYG000000030)'

Caused by:
  Process `GAP:PROCESS_SINGLETON_GENOMES:PROKKA (MGYG000000030)` terminated with an error exit status (2)

Command executed:

  cat MGYG000000030.fa | tr '-' ' ' > MGYG000000030_cleaned.fasta
  
  export JAVA_TOOL_OPTIONS="-XX:-UsePerfData"
  
  prokka MGYG000000030_cleaned.fasta     --cpus 8     --kingdom 'Bacteria'     --outdir MGYG000000030_prokka     --prefix MGYG000000030     --force     --locustag MGYG000000030

Command exit status:
  2

Command output:
  (empty)

Command error:
  [18:57:17] Modify product: Probable FKBP-type peptidyl-prolyl cis-trans isomerase FkpA => putative FKBP-type peptidyl-prolyl cis-trans isomerase FkpA
  [18:57:17] Modify product: Putative zinc metalloprotease aq_1964 => Putative zinc metalloprotease
  [18:57:17] Modify product: Uncharacterized ABC transporter ATP-binding protein TM_0288 => putative ABC transporter ATP-binding protein
  [18:57:17] Modify product: Uncharacterized ABC transporter ATP-binding protein Rv1273c => putative ABC transporter ATP-binding protein
  [18:57:17] Modify product: Probable ATP-dependent transporter SufC => putative ATP-dependent transporter SufC
  [18:57:17] Modify product: Probable dual-specificity RNA methyltransferase RlmN => putative dual-specificity RNA methyltransferase RlmN
  [18:57:17] Modify product: Probable butyrate:acetyl-CoA coenzyme A-transferase => putative butyrate:acetyl-CoA coenzyme A-transferase
  [18:57:17] Modify product: Probable FMN/FAD exporter YeeO => putative FMN/FAD exporter YeeO
  [18:57:17] Modify product: Putative multidrug export ATP-binding/permease protein SAV1866 => Putative multidrug export ATP-binding/permease protein
  [18:57:17] Modify product: Probable transcriptional regulatory protein HP_0162 => putative transcriptional regulatory protein
  [18:57:17] Modify product: Probable bifunctional oligoribonuclease and PAP phosphatase NrnA => putative bifunctional oligoribonuclease and PAP phosphatase NrnA
  [18:57:17] Modify product: UPF0758 protein YsxA => hypothetical protein
  [18:57:17] Modify product: 23S rRNA 5-hydroxycytidine C2501 synthase => 23S rRNA 5-hydroxycytidine synthase
  [18:57:17] Modify product: UPF0194 membrane protein YbhG => hypothetical protein
  [18:57:17] Modify product: Uncharacterized zinc protease Rv2782c => putative zinc protease
  [18:57:17] Modify product: Probable sulfoacetate transporter SauU => putative sulfoacetate transporter SauU
  [18:57:17] Modify product: Probable cysteine desulfurase => putative cysteine desulfurase
  [18:57:17] Modify product: Probable branched-chain-amino-acid aminotransferase => putative branched-chain-amino-acid aminotransferase
  [18:57:17] Modify product: Probable multidrug resistance ABC transporter ATP-binding/permease protein YheI => putative multidrug resistance ABC transporter ATP-binding/permease protein YheI
  [18:57:17] Modify product: UPF0353 protein Rv1481 => hypothetical protein
  [18:57:17] Modify product: UPF0353 protein Rv1481 => hypothetical protein
  [18:57:18] Modify product: Probable copper-transporting ATPase PacS => putative copper-transporting ATPase PacS
  [18:57:18] Modify product: Uncharacterized sugar kinase YdjH => putative sugar kinase YdjH
  [18:57:18] Modify product: UPF0371 protein DIP2346 => hypothetical protein
  [18:57:18] Modify product: UPF0053 protein HI_0107 => hypothetical protein
  [18:57:18] Modify product: Putative glutamine amidotransferase Rv2859c => Putative glutamine amidotransferase
  [18:57:18] Modify product: Tricorn protease homolog 1 => Tricorn protease 
  [18:57:18] Modify product: Probable L-galactonate transporter => putative L-galactonate transporter
  [18:57:18] Modify product: UPF0701 protein HI_0467 => hypothetical protein
  [18:57:18] Modify product: Uncharacterized protein YhaP => putative protein YhaP
  [18:57:18] Modify product: Uncharacterized protein YqeN => putative protein YqeN
  [18:57:18] Modify product: GTP cyclohydrolase 1 type 2 homolog => GTP cyclohydrolase 1 type 2 
  [18:57:18] Modify product: Uncharacterized ABC transporter ATP-binding protein YheS => putative ABC transporter ATP-binding protein YheS
  [18:57:18] Modify product: Bifunctional protein FolD => Bifunctional protein FolD protein
  [18:57:18] Modify product: Uncharacterized epimerase/dehydratase SA0511 => putative epimerase/dehydratase
  [18:57:18] Modify product: Putative 2-hydroxyacid dehydrogenase HI_1556 => Putative 2-hydroxyacid dehydrogenase
  [18:57:18] Modify product: Probable glycine dehydrogenase (decarboxylating) subunit 2 => putative glycine dehydrogenase (decarboxylating) subunit 2
  [18:57:18] Modify product: Probable chromate transport protein => putative chromate transport protein
  [18:57:18] Modify product: Probable manganese efflux pump MntP => putative manganese efflux pump MntP
  [18:57:18] Modify product: Probable GTP-binding protein EngB => putative GTP-binding protein EngB
  [18:57:18] Modify product: RutC family protein HI_0719 => RutC family protein
  [18:57:18] Modify product: Probable glycine dehydrogenase (decarboxylating) subunit 1 => putative glycine dehydrogenase (decarboxylating) subunit 1
  [18:57:18] Modify product: Dihydroorotate dehydrogenase B (NAD(+)), electron transfer subunit homolog => Dihydroorotate dehydrogenase B (NAD(+)), electron transfer subunit 
  [18:57:18] Cleaned 45 /product names
  [18:57:18] Deleting unwanted file: MGYG000000030_prokka/MGYG000000030.sprot.tmp.115.faa
  [18:57:18] Deleting unwanted file: MGYG000000030_prokka/MGYG000000030.sprot.tmp.115.blast
  [18:57:18] There are still 1050 unannotated CDS left (started with 1689)
  [18:57:18] Will use hmmer3 to search against /usr/local/db/hmm/HAMAP.hmm with 8 CPUs
  [18:57:18] Running: cat MGYG000000030_prokka\/MGYG000000030\.HAMAP\.hmm\.tmp\.115\.faa | parallel --gnu --plain -j 8 --block 23723 --recstart '>' --pipe hmmscan --noali --notextw --acc -E 1e-09 --cpu 1 /usr/local/db/hmm/HAMAP.hmm /dev/stdin > MGYG000000030_prokka\/MGYG000000030\.HAMAP\.hmm\.tmp\.115\.hmmer3 2> /dev/null
  [18:57:23] Could not run command: cat MGYG000000030_prokka\/MGYG000000030\.HAMAP\.hmm\.tmp\.115\.faa | parallel --gnu --plain -j 8 --block 23723 --recstart '>' --pipe hmmscan --noali --notextw --acc -E 1e-09 --cpu 1 /usr/local/db/hmm/HAMAP.hmm /dev/stdin > MGYG000000030_prokka\/MGYG000000030\.HAMAP\.hmm\.tmp\.115\.hmmer3 2> /dev/null
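
Note that the command in the log ends with `2> /dev/null`, so whatever was written to stderr is discarded and the log shows only "Could not run command". One possible way to see the underlying message is to rerun that command without the redirection and capture stderr. This is a generic sketch, not something Prokka provides; the helper name is made up, and it assumes the command is rerun in the same work directory and environment as the failed task.

```python
import subprocess


def run_and_show_stderr(command):
    """Run a shell command and return (exit code, stderr text) instead of discarding stderr."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.returncode, result.stderr


# Stand-in command that fails and writes to stderr; in practice, pass the
# command string from the log with the trailing "2> /dev/null" removed.
code, err = run_and_show_stderr("echo 'simulated failure' >&2; exit 3")
print(code, err)
```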

The file MGYG000000030.HAMAP.hmm.tmp.115.faa seems to be fine:

>1
MERSGFSTRLGFILVSAGCAIGLGNVWRFPYITGQYGGASFVLIYLLFLAILGLPIMVAEFAVGRASIRSAAMSFDVLEPKGTKWHWHKYTAIGGNMILMMFYTTVAGWMLYYLYKTATGAFDGLDAAGIGAVFGDLLQDPVTMGGYMAAIVLLCGGVCYLGVEAGVERITKWMMTCLLLLMVILGINSVLLPGAGEGLKYYLYPDFGRLMEHGLKEVIFAAMGQAFFTLSIGMGSLAVFGSYIGKSKRLTGEAVWIIILDTFVAIMAGLIIFPACFSYGVSPSSGPNLLFVTLPNVFNAMPLGRLWGTLFFLFMTFAAMSTVIAVFENLVVCFFDLLRIDRHKIICAGMPIVILLSLPCVLGFNEWSWIQPLGKGSGILDLEDFLVSNNILPLGSLVYMAFCTSRYGWGWDNFIKEADTGRGMEFPKWIRFYVTYILPLIVLVIFVNGYYALFFSR
>2
MAILLVITAILALIHLYHLSDKMKRLDYVYLMVLLIVMGLNTDNADYAAYERIYQKVSYATTWEEIMRAHTDKGYVFLNWLATVIGMDYKCFHLGLFTLLLGGIFIIAKRIGTPICALFLAYTMYPMFMDAIQIRNFIISAVLLFSIYCYAHANVRWYAIGVISLTIAVTIHPFILIFIPFIVFYKMYDTERFRPITYIPICLGLLSIAIKILIDTYWNEVTAMLTVLADWASRGHSYIGHQVLTSRQFKIYLVVVIFAWLLYKAKKYLSTSECVNDIQKKFVELSFVAFLYLICWMPLFALDINLATRMPRDLFLVAYMSLGIYLSKCTSQRIKIGLLFGIMFLAFFFGLVDLYISTDRFNVDVILSHNLLL
>3
MHIILVGPEYYNYSEGIISAFISLGCEVDFFPSVEFYENCSWYQRRLYKLGYKSLEKIWNNDWETRFREFIANRIRKDTIFLFCTGNMISVRLLKDLDAYVKVLYMWDSVKRYSDDFQSRLKLYDYLFAFEFGDIEFVRKKYGVSMQYMPLGYDSDFYYPDDNVVKDIDISFVGSCTKMRMNLLEKVAR
.....
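
A quick way to go beyond eyeballing the file is a small format check. The sketch below is an illustration only: the function name is made up, and the allowed alphabet (the IUPAC amino acid letters plus `*`) is an assumption. A clean result would not rule out other causes such as the HMM database or the `parallel`/`hmmscan` executables.

```python
def check_protein_fasta(text):
    """Return a list of problems found in protein FASTA text; an empty list means none found."""
    allowed = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*")  # assumed alphabet
    problems = []
    header = None
    seq_len = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record; flag it if it had no residues
            if header is not None and seq_len == 0:
                problems.append(f"record {header!r} has no sequence")
            header = line[1:]
            seq_len = 0
            if not header:
                problems.append(f"line {lineno}: empty header")
        else:
            if header is None:
                problems.append(f"line {lineno}: sequence before first header")
            bad = set(line.upper()) - allowed
            if bad:
                problems.append(f"line {lineno}: unexpected characters {sorted(bad)}")
            seq_len += len(line)
    if header is not None and seq_len == 0:
        problems.append(f"record {header!r} has no sequence")
    return problems
```

For example, `check_protein_fasta(open(path).read())` could be run on the `.faa` file named in the log, where `path` is its location in the task work directory.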

Duplicated genomes in the UHGG v2

Hi,

I have noticed that some genomes in UHGG v2 are exact duplicates of each other.

Here is the list:

duplicated_genomes_metadata.txt
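
This post does not say how the duplicates were identified. One simple approach, shown here only as a sketch with made-up function names, is to hash the sequence content of each genome FASTA file and group genomes with the same digest. It catches only genomes whose contigs are identical and in the same order; reordered or reverse-complemented contigs would not match.

```python
import hashlib
from collections import defaultdict


def sequence_digest(fasta_text):
    """Hash the sequences only (header text ignored, case-folded), in file order."""
    h = hashlib.sha256()
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            h.update(b">")  # mark record boundaries so contig splits still matter
        else:
            h.update(line.upper().encode("ascii"))
    return h.hexdigest()


def find_duplicates(genomes):
    """genomes: dict of name -> FASTA text. Return sorted groups of names with identical content."""
    groups = defaultdict(list)
    for name, text in genomes.items():
        groups[sequence_digest(text)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```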

It would be great to fix this in future versions.