metagenome-atlas / tutorial

A tutorial for Metagenome-Atlas

License: GNU General Public License v3.0

HTML 10.62% Jupyter Notebook 17.52% MAXScript 71.60% Shell 0.01% Python 0.22% R 0.03%

tutorial's Introduction

Metagenome-Atlas Tutorial

This is a tutorial for Metagenome-Atlas. Metagenome-Atlas is an easy-to-use pipeline for analyzing metagenomic data. It handles all steps from QC, Assembly, Binning, to Annotation.

⁉️ If you have any questions or run into errors, write to us.

Setup

Go to the setup page and follow the instructions.

Analyze the output of Atlas

Before installing a program, I usually want to make sure that it gives the output I want. Therefore, we start by analyzing the output of Metagenome-Atlas.

We prepared an interactive Rmarkdown with the code for differential analysis.

✨ Follow this link to the interactive tutorial.

Here is another tutorial, based on human samples, with only the reports.

Install and run atlas with three commands

In this part of the tutorial you will install Metagenome-Atlas, either in GitHub Codespaces or on your server, and test it with a small dataset. As real metagenomic assembly can take more than 250 GB of RAM and multiple processors, you would ideally do this directly on a high-performance system, e.g. the cluster of your university. You can install Miniconda in your home directory if it is not installed on your system.

Follow this link

See also the get started section in the documentation.

Use this code for your project

First, clone this git repository.

Copy atlas files to your local machine.

I made some handy scripts to copy the most important atlas output files from a server to your local machine. As the output files might change between different versions of atlas, I use the file atlas_output_files.yaml to specify them. Check which atlas version listed there is closest to the version you used.

You can run get_atlas_files.py or get_atlas_files.R to do this.

The Python script asks for the following information and stores it in .connection_details.yaml.

    "output_dir": 'atlas_data',
    "atlas_version": "v2.17",
    "username": "me",
    "server": "myserver.server.com",
    "base_path_server": '/home/user/my_atlas_run',
    "private_key_path": None # "C:/Users/User/.ssh/id_rsa"

For the R script, you need to hard-code these values in the script itself.

⚠️ Some output atlas files might be very large, e.g. the gene catalog.
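To make the role of these connection details concrete, here is a minimal Python sketch of how they could be combined into an `scp` call for a single file. It is not taken from get_atlas_files.py, which may transfer files differently: the helper `build_scp_command` and the relative path `genomes/taxonomy.tsv` are assumptions made for illustration, and the real relative paths come from atlas_output_files.yaml.

```python
import shlex

# Connection details in the shape shown above (the tutorial's placeholder values).
connection = {
    "output_dir": "atlas_data",
    "atlas_version": "v2.17",
    "username": "me",
    "server": "myserver.server.com",
    "base_path_server": "/home/user/my_atlas_run",
    "private_key_path": None,  # e.g. "C:/Users/User/.ssh/id_rsa"
}


def build_scp_command(connection, relative_path):
    """Return an scp argument list that would copy one atlas output file.

    Hypothetical helper for illustration only.
    """
    remote = (
        f"{connection['username']}@{connection['server']}:"
        f"{connection['base_path_server'].rstrip('/')}/{relative_path}"
    )
    local = f"{connection['output_dir']}/{relative_path}"
    command = ["scp"]
    if connection.get("private_key_path"):
        # -i tells scp which private key file to use for authentication
        command += ["-i", connection["private_key_path"]]
    command += [remote, local]
    return command


# Made-up relative path; it only demonstrates how the pieces fit together.
cmd = build_scp_command(connection, "genomes/taxonomy.tsv")
print(shlex.join(cmd))
```

The sketch only builds and prints the command. Before running such a command, the local target directory has to exist, and very large files such as the gene catalog are better copied separately.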

Use files specified in the `atlas_output_files.yaml`

This is a somewhat complicated but generic way to access the atlas files. You can also simply copy the paths specified in atlas_output_files.yaml.

In R you can use

data_dir <- "atlas_data" # path specified as output_dir in the get_atlas_files script
atlas_version <- "v2.17"
file_config_files <- "../atlas_output_files.yaml"

files <- yaml::yaml.load_file(file_config_files)[[atlas_version]]

for (key1 in names(files)) {
  value1 <- files[[key1]]
  if (is.character(value1)) {
    # It's a direct path
    files[[key1]] <- file.path(data_dir, value1)
  } else if (is.list(value1)) {
    # It's a nested list, go deeper
    for (key2 in names(value1)) {
      value2 <- value1[[key2]]
      files[[key1]][[key2]] <- file.path(data_dir, value2)
    }
  }
}


taxonomy_file <- files[["genomes"]][["taxonomy"]]
tree_file <- files[["genomes"]][["tree_bacteria"]]
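If you prefer Python, the same idea can be sketched as follows. This is an assumed equivalent rather than code from the repository: `resolve_paths` and `load_atlas_files` are hypothetical helper names, loading the YAML requires the third-party PyYAML package, and the example entry uses placeholder file names instead of the real content of atlas_output_files.yaml. Unlike the R loop, which handles two levels, the recursion here handles any nesting depth.

```python
import os


def resolve_paths(files, data_dir):
    """Prefix every relative path in a (possibly nested) mapping with data_dir."""
    resolved = {}
    for key, value in files.items():
        if isinstance(value, str):
            # A direct path
            resolved[key] = os.path.join(data_dir, value)
        elif isinstance(value, dict):
            # A nested mapping, go deeper
            resolved[key] = resolve_paths(value, data_dir)
        else:
            # Leave anything else (e.g. None) untouched
            resolved[key] = value
    return resolved


def load_atlas_files(config_file, atlas_version, data_dir):
    """Load the entry for one atlas version and resolve its paths."""
    import yaml  # PyYAML; imported here so resolve_paths works without it

    with open(config_file) as handle:
        files = yaml.safe_load(handle)[atlas_version]
    return resolve_paths(files, data_dir)


# Illustrative entry with placeholder file names.
example = {"genomes": {"taxonomy": "taxonomy.tsv", "tree_bacteria": "bacteria.nwk"}}
files = resolve_paths(example, "atlas_data")
taxonomy_file = files["genomes"]["taxonomy"]
tree_file = files["genomes"]["tree_bacteria"]
```

With the real file, the call would look like `load_atlas_files("../atlas_output_files.yaml", "v2.17", "atlas_data")`.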

tutorial's People

Contributors

Stargazers

Watchers

Forkers

mebigi aiqbal94 rintukutum animesh sewunet-abera felansky mdsoapbrain justan6 farhadm1990 leitemfa promec-ntnu

tutorial's Issues

Where to find contaminant_references ?

Question from @psychesha21

I’m following the online tutorial this afternoon and I’ve made it all the way to the contaminant sequences part, but I don’t know where to find these two sequences:

contaminant_references:
  PhiX: /path/to/databases/phiX174_virus.fa
  host: /path/to/databases/host_genome.fasta

Unable to open 'Genecatalog/protein_catalog/db'

The previous issue was solved by assigning 500 GB of memory and 50 jobs.
It went smoothly till I got this error.

rule get_rep_proteins:
input: Genecatalog/all_genes/predicted_genes, Genecatalog/clustering/mmseqs
output: Genecatalog/orf2gene_oldnames.tsv, Genecatalog/protein_catalog, Genecatalog/representatives_of_clusters.fasta
log: logs/Genecatalog/clustering/get_rep_proteins.log
jobid: 140
threads: 50
resources: tmpdir=/tmp, mem=500, mem_mb=500000, time=5

Activating conda environment: /home/nioo/sewuneta/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas/databases/conda_envs/80a2955775f066104b58bf5b10a3ed68
[Tue Sep 14 07:47:46 2021]
Error in rule get_rep_proteins:
jobid: 140
output: Genecatalog/orf2gene_oldnames.tsv, Genecatalog/protein_catalog, Genecatalog/representatives_of_clusters.fasta
log: logs/Genecatalog/clustering/get_rep_proteins.log (check log file(s) for error message)
conda-env: /home/nioo/sewuneta/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas/databases/conda_envs/80a2955775f066104b58bf5b10a3ed68
shell:

        mmseqs createtsv Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/clustering/mmseqs/clusterdb Genecatalog/orf2gene_oldnames.tsv  &> logs/Genecatalog/clustering/get_rep_proteins.log

        mkdir Genecatalog/protein_catalog 2>> logs/Genecatalog/clustering/get_rep_proteins.log

        mmseqs result2repseq Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/clustering/mmseqs/clusterdb Genecatalog/protein_catalog/db  &>> logs/Genecatalog/clustering/get_rep_proteins.log

        mmseqs result2flat Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/protein_catalog/db Genecatalog/representatives_of_clusters.fasta  &>> logs/Genecatalog/clustering/get_rep_proteins.log


    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job get_rep_proteins since they might be corrupted:
Genecatalog/protein_catalog
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Note the path to the log file for debugging.
Documentation is available at: https://metagenome-atlas.readthedocs.io
Issues can be raised at: https://github.com/metagenome-atlas/atlas/issues
Complete log: /home/nioo/sewuneta/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas/.snakemake/log/2021-09-14T074744.372307.snakemake.log
[2021-09-14 07:47 CRITICAL] Command 'snakemake --snakefile /home/nioo/sewuneta/.conda/envs/atlasenv/lib/python3.8/site-packages/atlas/Snakefile --directory /home/nioo/sewuneta/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas --jobs 80 --rerun-incomplete --configfile '/home/nioo/sewuneta/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas/config.yaml' --nolock --use-conda --conda-prefix /home/nioo/sewuneta/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas/databases/conda_envs --scheduler greedy all --keep-going ' returned non-zero exit status 1.
(atlasenv) sewuneta@nioo0003:~/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas$

And the output in the log file is:
Program call:
Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/clustering/mmseqs/clusterdb Genecatalog/orf2gene_oldnames.tsv

MMseqs Version: 3.be8f6
first sequence as respresentative false

Query file is Genecatalog/all_genes/predicted_genes/inputdb
Data file is Genecatalog/clustering/mmseqs/clusterdb
Could not open data file Genecatalog/clustering/mmseqs/clusterdb!
Program call:
Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/clustering/mmseqs/clusterdb Genecatalog/protein_catalog/db

MMseqs Version: 3.be8f6
Threads 80
Verbosity 3

Could not open data file Genecatalog/clustering/mmseqs/clusterdb!
Program call:
Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/all_genes/predicted_genes/inputdb Genecatalog/protein_catalog/db Genecatalog/representatives_of_clusters.fasta

MMseqs Version: 3.be8f6
Use fasta header false
Verbosity 3

Query file is Genecatalog/all_genes/predicted_genes/inputdb
Target file is Genecatalog/all_genes/predicted_genes/inputdb
Data file is Genecatalog/protein_catalog/db
Could not open data file Genecatalog/protein_catalog/db!
logs/Genecatalog/clustering/get_rep_proteins.log (END)

Could you please help me resolve it?
(Note that I am working my way into Atlas on the tutorial data and want to use it on my 300 GB of shotgun data.)
Thanks.

Feature Request - Genecatalog analysis and updated tutorial

This is a bit of a vague feature request, but I was wondering if there are any plans to update this tutorial or add to it. It's been very helpful in parsing the large amounts of data generated from atlas.

Memory requirement and path for database to run metagenome-atlas

Hi,
I have 70 Gb of shotgun Illumina paired-end data from 4 samples with 3 replicates each.
I need your help with running metagenome-atlas for the analysis:

How much RAM do I need? I have a workstation Dell 7920 with 96 Gb RAM with 26 core (52 cpu) and processor Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz.
Is it enough to run the samples for downstream analysis?

I have installed metagenome-atlas in a new conda environment. Do I need to download the databases and link the path?
Please consider me a beginner when suggesting how to set up the tool and run it on the samples.

Thanks
rgds
RNS

How does this work on paired-end data (1 sample: 1.fq.gz and 2.fq.gz)?

Hi ! Thank you very much for developing such a convenient tool.

When I run "atlas init --db-dir databases /PATH/", it tells me "Found 24 samples under /PATH/".
Actually I have 12 samples (24 fq.gz files, because they are paired-end: 12 x 2 = 24).

In the dry run I see "B3-2_raw_se.fastq.gz" and "verifypaired=f".
So it seems that the default setting does not support paired-end data.

My current idea is: cat A-1.fq.gz A-2.fq.gz > A-fq.gz, so that it may work properly on sample A, and then I just need to do this for all 12 samples.

Best
kzr

Broken link to view notebook

At the end of Tutorial part 1, it's stated

You can always have a look at the notebook to see what would be the output.

The link to view the notebook is broken and gives a 404.

Host genome reference format

Hi, I think I'm having trouble specifying a host genome. I've listed it in the contaminants (see below), but the results suggest that none of the reads were mapped to it, nor removed.

contaminant_references:
PhiX: /data/users/danross/databases/phiX174_virus.fa
Acer: /data/users/danross/refGens/acersaccharum/GCA_030490815.1/GCA_030490815.1_UCONN_Acsa_1.0_genomic.fna

The Acer (host) reference genome is comprised of 388 scaffolds, is this what's causing the problem?

Thanks,
David

Specifying memory and number of jobs

Hello,
I'm trying to find my way into atlas.
I'm currently running the tutorial data set on a server, and I ran it with generous resources for 4 data sets:
atlas run all --resources mem=500 --jobs 50
But downstream I got this error:
Error in rule error_correction:
jobid: 43
output: sample2/assembly/reads/QC.errorcorr_R1.fastq.gz, sample2/assembly/reads/QC.errorcorr_R2.fastq.gz, sample2/assembly/reads/QC.errorcorr_se.fastq.gz
log: sample2/logs/assembly/pre_process/error_correction_QC.log (check log file(s) for error message)
conda-env: /home/nioo/sewuneta/sorghum_shotgun_metagenome_analysis/atlas.trial/atlas/path/to/fastq/conda_envs/ac12f856310e8bfb503b4ccb5cc5fb23
shell:

    tadpole.sh -Xmx51G             prealloc=1             in1=sample2/assembly/reads/QC_R1.fastq.gz,sample2/assembly/reads/QC_se.fastq.gz in2=sample2/assembly/reads/QC_R2.fastq.gz             out1=sample2/assembly/reads/QC.errorcorr_R1.fastq.gz,sample2/assembly/reads/QC.errorcorr_se.fastq.gz out2=sample2/assembly/reads/QC.errorcorr_R2.fastq.gz             mode=correct             threads=8             ecc=t ecco=t 2>> sample2/logs/assembly/pre_process/error_correction_QC.log

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Does this mean the specified figures are not enough, or should I have set them in the configuration file?

Thanks, and I hope to hear from you soon.