MAGpy

MAGpy (pronounced mag-pie) is a Snakemake pipeline for the downstream analysis of metagenome-assembled genomes (MAGs).

Citation

Robert Stewart, Marc Auffret, Tim Snelling, Rainer Roehe, Mick Watson (2018) MAGpy: a reproducible pipeline for the downstream analysis of metagenome-assembled genomes (MAGs). Bioinformatics, bty905

How to install

Follow the "10 minute install" here

Clean your MAGs

There are a few things you will need to do before you run MAGpy, and these are due to limitations imposed by the software MAGpy runs, rather than by MAGpy itself.

These are:

  • The names of contigs in your MAGs must be globally unique. Some assemblers, e.g. Megahit, output very generic contig names, e.g. "scaffold_22", which may be duplicated across your MAGs if you have assembled multiple samples. This is not allowed. BioPython and/or BioPerl can help you rename your contigs.
  • The MAG FASTA file names must start with a letter.
  • The MAG FASTA file names should not contain any "." characters other than the final "." before the file extension, e.g. mag1.faa is fine, mag.1.faa is not.
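
As a rough illustration of the first point, here is a minimal sketch that makes contig names unique by prefixing each one with the name of its MAG file. It uses only the Python standard library rather than BioPython, and the function name rename_contigs is our own invention, not part of MAGpy. It assumes that your MAG file names are themselves unique and that contig names are already unique within each file.

```python
from pathlib import Path


def rename_contigs(fasta_in: Path, fasta_out: Path) -> int:
    """Prefix every contig name with the MAG file stem; return the contig count."""
    prefix = fasta_in.stem  # e.g. "mag1" for mag1.fa
    count = 0
    with fasta_in.open() as src, fasta_out.open("w") as dst:
        for line in src:
            if line.startswith(">"):
                count += 1
                # Keep the original name after the prefix so contigs stay traceable.
                dst.write(f">{prefix}_{line[1:].strip()}\n")
            else:
                # Sequence lines are copied through unchanged.
                dst.write(line)
    return count
```

With this, "scaffold_22" in mag1.fa becomes "mag1_scaffold_22". Write the renamed file somewhere other than the input path, since the function reads and writes at the same time.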

Reproducibility

For workflows to be reproducible, we recommend that, while following the "10 minute install", you name the downloaded databases with the database version and/or the download date.

If you do this, the name of the database will be propagated throughout all of the Snakemake outputs and will therefore provide an exact record of what was done, by what software, and on which version of the database (software versions are controlled and recorded by the yaml files in the envs directory).

Doing this will ensure reproducibility across different platforms and groups.
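
One way to keep such names consistent is to build them with a small helper. This is only a sketch: the function versioned_db_name, the underscore-separated layout and the example database name are all assumptions of ours, and MAGpy does not require any particular naming scheme.

```python
from datetime import date
from typing import Optional


def versioned_db_name(base: str, version: Optional[str] = None,
                      downloaded: Optional[date] = None) -> str:
    """Join a database base name with its version and/or download date."""
    parts = [base]
    if version:
        parts.append(version)
    if downloaded:
        # Year-month-day sorts chronologically when listed alphabetically.
        parts.append(downloaded.strftime("%Y%m%d"))
    return "_".join(parts)


# Hypothetical example: a database called "mydb", release 2018_09.
print(versioned_db_name("mydb", "2018_09", date(2018, 10, 1)))
```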

Run the tests!

It's always a good idea to run the tests and this has the added advantage of installing all of the conda environments.

Clone the repo:

git clone https://github.com/WatsonLab/MAGpy.git
cd MAGpy

Run the tests:

snakemake --use-conda -s MAGpy test

Test outputs will be in test/outputs and you should have an error_log file in your current working directory.

How to run

MAGpy is slightly different to other software tools that you may install. Rather than installing MAGpy once, in e.g. a global software directory, it is better to clone the repo each time you wish to use it. Therefore, for each project you wish to run MAGpy on, simply clone the repo into that project's folder and run MAGpy from there. New project? Clone the repo again. The disk footprint of MAGpy is tiny, so you lose nothing by doing this (and conda should handle the multiple environments fine).

Clone the repo:

git clone https://github.com/WatsonLab/MAGpy.git
cd MAGpy

Edit config.json so that it points to the databases and directories created during the "10 minute install".

In this directory, put all of your genomes into the mags folder, one file per genome, with a .fa file extension.
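
Before starting a run, it may be worth checking that every file in the mags folder follows the naming rules from "Clean your MAGs". The sketch below is our own helper, not part of MAGpy; the function name invalid_mag_names and the regular expression are assumptions that encode those rules as we read them (a leading letter, no extra "." characters, and a .fa extension).

```python
import re
from pathlib import Path
from typing import List

# A letter first, then any characters except ".", then the ".fa" extension.
VALID_NAME = re.compile(r"[A-Za-z][^.]*\.fa")


def invalid_mag_names(mags_dir: Path) -> List[str]:
    """Return the names of files in mags_dir that break the naming rules."""
    return sorted(
        p.name
        for p in mags_dir.iterdir()
        if p.is_file() and not VALID_NAME.fullmatch(p.name)
    )
```

Calling invalid_mag_names(Path("mags")) from the cloned repo should return an empty list if all of the file names are acceptable.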

Then to run in basic (linear, non-cluster) mode:

snakemake --use-conda -s MAGpy

Outputs will be placed into the current working directory, so make sure you have write access.

To test which commands snakemake will run, you can try:

snakemake -np -s MAGpy

However, on any serious number of MAGs, this basic operation will take a very long time, as each job is run in serial (i.e. one after the other). Fortunately, snakemake can submit jobs to most HPC clusters. There are some instructions here.

Here at Edinburgh, we run an SGE cluster and this is how we run MAGpy on the cluster:

snakemake --use-conda -s MAGpy --cluster-config MAGpy.json --cluster "qsub -V -cwd -pe sharedmem {cluster.core} -l h_rt={cluster.time} -l h_vmem={cluster.vmem} -P {cluster.proj}" --jobs 1000

This mode looks into the MAGpy.json file for cluster configurations relating to each type of job; the jobs are "rules" within the MAGpy snakefile.
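
To make the placeholders less mysterious, here is a small sketch of how a cluster configuration of this kind is turned into a qsub command. It imitates the behaviour of snakemake's --cluster-config (a "__default__" entry that individual rules can override) but is not snakemake's own code. The keys core, time, vmem and proj come from the command above; the rule name "example_rule" and every value are made up, so check the real MAGpy.json for the actual contents.

```python
import json
from types import SimpleNamespace

# Hypothetical cluster config: only the key names are taken from the qsub
# template; the rule name "example_rule" and all values are invented.
CLUSTER_CONFIG = json.loads("""
{
  "__default__":  {"core": 1, "time": "01:00:00", "vmem": "4G", "proj": "myproject"},
  "example_rule": {"core": 4, "vmem": "8G"}
}
""")

QSUB_TEMPLATE = (
    "qsub -V -cwd -pe sharedmem {cluster.core} -l h_rt={cluster.time} "
    "-l h_vmem={cluster.vmem} -P {cluster.proj}"
)


def qsub_command(rule: str, config: dict = CLUSTER_CONFIG) -> str:
    """Merge the defaults with the rule's overrides and fill in the template."""
    settings = {**config.get("__default__", {}), **config.get(rule, {})}
    # str.format resolves "{cluster.core}" via attribute access on `cluster`.
    return QSUB_TEMPLATE.format(cluster=SimpleNamespace(**settings))


print(qsub_command("example_rule"))
```

A rule with no entry of its own simply gets the "__default__" values, which is why it is usually enough to list only the rules that need more cores, time or memory.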

The integration of PhyloPhlAn

OK, this is a bit complex. Essentially, PhyloPhlAn has a few foibles, which are:

  • Input to PhyloPhlAn has to be placed in the input/ directory contained within the PhyloPhlAn install directory
  • Output from PhyloPhlAn is written to the output/ directory within the PhyloPhlAn install directory
  • The PhyloPhlAn process has to be run from the root of the PhyloPhlAn install directory

Therefore, whichever user is running the MAGpy process, whether on a cluster or a single machine, must have read and write access to the input/ and output/ directories in the PhyloPhlAn install directory.

Here is what MAGpy attempts to do:

  • It copies protein files into the input directory of PhyloPhlAn
  • It then attempts to cd into the PhyloPhlAn install directory
  • From there, it runs PhyloPhlAn
  • When finished, it attempts to mv the results folder back to the original directory (to a folder named tree)
  • MAGpy then changes back to the original working directory

Now obviously this is a bit, erm, hacky but as long as permissions are set on the PhyloPhlAn directory correctly, it should work.
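
For the curious, the steps above can be sketched in Python roughly as follows. This is not the code MAGpy runs: the function run_from_install_dir is our own, the command to execute is passed in by the caller rather than being a real PhyloPhlAn invocation, and the use of a per-project subfolder inside input/ and output/ is an assumption.

```python
import os
import shutil
import subprocess
from pathlib import Path
from typing import List, Sequence


def run_from_install_dir(install_dir: Path, proteins: Sequence[Path], project: str,
                         command: List[str], dest: Path) -> Path:
    """Copy inputs in, run `command` from install_dir, move the results to dest."""
    start = Path.cwd()
    dest = dest.resolve()  # make absolute before changing directory
    # Assumption: inputs and outputs live in a subfolder named after the project.
    input_dir = install_dir / "input" / project
    input_dir.mkdir(parents=True, exist_ok=True)
    for faa in proteins:
        shutil.copy(faa, input_dir)          # 1. copy protein files into input/
    try:
        os.chdir(install_dir)                # 2. cd into the install directory
        subprocess.run(command, check=True)  # 3. run the tool from its root
        # 4. mv the results folder back to the original location
        shutil.move(str(Path("output") / project), str(dest))
    finally:
        os.chdir(start)                      # 5. always return to where we started
    return dest
```

The try/finally means the working directory is restored even if the run fails, which matters when later steps expect to be back in the project folder. Note that shutil.move places the results inside dest if dest already exists as a directory.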

Drawing the tree

The way we have snakemake set up, it is in a Python 3.5 env and GraPhlAn is in a Python 2.7 env. So we create the tree outside of Snakemake:

conda env create -f envs/basic2.yaml
source activate basic2

perl scripts/produce_tree.pl checkm_plus.txt tree/folder/path_to_newick.nwk
