Bio: Micromeda is a platform for rapid annotation and functional comparison of denovo-assembled and metagenome-assembled genomes (MAGs).

Micromeda Platform

Overview

The Micromeda platform allows users to predict which genome properties are possessed by organisms and then compare the presence and absence of these properties across organisms. The platform has three core components:

  • Micromeda-Visualizer -- A web-based visualization tool that draws interactive heat maps of genome property and property step assignments. It has two components, Micromeda-Client and Micromeda-Server, and is available at micromeda.uwaterloo.ca.
  • Pygenprop -- A Jupyter Notebooks compatible Python library that allows for the programmatic comparison of property and step assignments across organisms. Pygenprop also allows users to explore the InterProScan annotations and protein sequences that support genome property assignments. Pygenprop also has a command-line interface (CLI) for generating Micromeda files.
  • Micromeda Files -- These files allow for the aggregation and transfer of genome property assignments, step assignments, and supporting InterProScan annotations and protein sequences from multiple organisms. They allow for the transfer of complete property analysis datasets between researchers. They are also used to transfer datasets to Micromeda-Visualizer.

Analysis Workflow

Analyzing datasets using Micromeda involves the following steps:

  • Acquiring an organism's protein sequences
  • Annotating an organism's proteins using InterProScan5
  • Creating a Micromeda file using Pygenprop's CLI
  • Uploading the Micromeda file to Micromeda-Visualizer

An Overview of creating Micromeda files.

Micromeda files take FASTA protein sequences and InterProScan5 output files as inputs. Protein sequences are predicted by gene prediction software such as Prodigal.

Installing Software and Databases

The following pieces of software must be installed to generate Micromeda files:

  • InterProScan5
  • Pygenprop

The following pieces of software are used in the tutorial but are optional:

  • Prodigal
  • GNU Parallel

InterProScan5 takes an organism's predicted protein sequences as input. To get these proteins, genes must first be predicted using a gene prediction application (e.g., https://en.wikipedia.org/wiki/List_of_gene_prediction_software) and then translated to proteins, either using the same software or a second piece of software. Different types of software must be used on eukaryotic vs prokaryotic genomes. For this tutorial, it is assumed that prokaryotic organisms are being analyzed. Prodigal is used to predict these organisms' proteins.

Install Docker and Conda (optional)

InterProScan5 and Prodigal are more easily installed when you have the following installed:

  • Docker
  • Conda

Install InterProScan

NOTE: Docker for Mac is not a native MacOS application but instead runs inside a Linux virtual machine. By default, this VM is only allocated half the total CPUs and 2 GB of RAM. InterProScan may require more resources, such as additional RAM or CPU. Users can follow the instructions in the Docker for Mac documentation to adjust RAM or CPU allocations, available at https://docs.docker.com/desktop/settings/mac/.

InterProScan5 can be installed manually or it can be easily installed using our Docker-based installation.

docker build https://raw.githubusercontent.com/Micromeda/InterProScan-Docker/master/Dockerfile -t micromeda/interproscan-docker

You should also install our wrapper script that allows you to run InterProScan on files found outside of its container.

wget https://raw.githubusercontent.com/Micromeda/InterProScan-Docker/master/run_docker_interproscan.sh
chmod +x run_docker_interproscan.sh

Install Pygenprop

pip install numpy # Numpy needs to be installed separately
pip install pygenprop

or alternatively

conda install -c conda-forge -c lbergstrand pygenprop

Install Prodigal (optional)

Prodigal can be installed using the author's installation tutorial. Alternatively, Prodigal can be installed using Conda.

conda install -c bioconda prodigal

Install GNU Parallel (optional)

GNU parallel can be used to parallelize some workflows such as gene prediction across multiple processes.

# Debian/Ubuntu
apt-get install parallel

# OSX
brew install parallel 

Download The Genome Properties Database

wget https://raw.githubusercontent.com/ebi-pf-team/genome-properties/master/flatfiles/genomeProperties.txt

Create a Micromeda file for One Organism

Below is a tutorial that overviews how to build a single Micromeda file for one prokaryotic organism. Generating Micromeda files for multiple organisms is discussed later in the document.

1. Predict Proteins

Use Prodigal to predict the organism's proteins. The -a flag is used to write the predicted protein sequences to a file.

prodigal -i ./genome.fasta -a ./genome.faa 

2. Remove * Characters

Prodigal adds * characters, representing stop codons, to the end of its output protein sequences. The * character is not in the IUPAC protein alphabet. As a result, InterProScan5 throws an error when annotating Prodigal's output sequences. These * characters need to be removed to make the previously produced .faa file compatible with InterProScan5. Alternative gene prediction programs may produce .faa files without * characters. When using these programs' output, the sed step below can be skipped.

Use sed to remove these * characters.

sed -i 's/\*$//' genome.faa
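The effect of this substitution can be checked on a small test file before touching real data. The sketch below is illustrative only: the two protein records are made up, and the file names (sample.faa, sample.clean.faa) are hypothetical. It writes the cleaned output to a new file instead of using -i, which also avoids differences between GNU and BSD sed.

```shell
# Hypothetical two-protein file mimicking Prodigal output (sequences are made up).
tmpdir=$(mktemp -d)
printf '>prot_1\nMKTAYIAKQR*\n>prot_2\nMSLNPEELLK\nAAGW*\n' > "$tmpdir/sample.faa"

# Strip a trailing * from every line; write to a new file so the original is kept.
sed 's/\*$//' "$tmpdir/sample.faa" > "$tmpdir/sample.clean.faa"

# Count lines that still contain an asterisk; 0 means the file is clean.
# grep exits non-zero when nothing matches, hence the "|| true".
remaining=$(grep -c '\*' "$tmpdir/sample.clean.faa" || true)
echo "lines with '*': $remaining"
```

The same grep count can be run on a real .faa file after the sed step to confirm that no * characters remain.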

3. Annotate Domains

Use the InterProScan5 Docker container to domain annotate the previously sanitized .faa file. For convenience, one can use run_docker_interproscan.sh, which simplifies using the container.

./run_docker_interproscan.sh genome.faa

This step produces an InterProScan .tsv annotation file called genome.tsv.

4. Build a Micromeda File

When Pygenprop is installed, its CLI is also installed and can be used to build Micromeda files, among a few other functions. The CLI's build command takes the previous output .tsv file generated by run_docker_interproscan.sh as input.

pygenprop build -d ./genomeProperties.txt -i genome.tsv -o ./data.micro -p

The build command's -p flag is used to add protein sequences to the output Micromeda file. With this flag active, Pygenprop searches the FASTA files that were scanned by InterProScan for proteins that support genome property steps and adds them to the output Micromeda file. The FASTA files must be in the same directory as the InterProScan5 files and share the same basename (i.e., the filename without its file extension).
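Because -p depends on this naming convention, it can help to check the pairing before running the build. The following is a minimal sketch, not part of Pygenprop: check_pairs is a hypothetical helper, and it assumes the FASTA files use the .faa extension as in this tutorial.

```shell
# check_pairs DIR: report every .tsv file in DIR that lacks a same-named .faa file.
# Returns success only when every .tsv has a matching FASTA file.
check_pairs() {
    dir="$1"
    missing=0
    for tsv in "$dir"/*.tsv; do
        [ -e "$tsv" ] || continue          # the glob matched nothing
        faa="${tsv%.tsv}.faa"              # same basename, .faa extension
        if [ ! -f "$faa" ]; then
            echo "missing FASTA for $tsv (expected $faa)"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage: check the current directory before running "pygenprop build ... -p".
check_pairs . || echo "Fix the files listed above before building with -p."
```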

5. Overview

As discussed above, creating a Micromeda file involves converting an input genome file through a series of steps.

genome.fasta --> genome.faa --> genome.tsv --> data.micro

Create a Micromeda file for Multiple Organisms

The above steps can be applied to multiple genomes to create larger analysis datasets. Below we use parallel to parallelize specific steps across multiple cores. However, the same task could also be accomplished by placing the above commands, except for step four, in a shell script that loops through a series of input FASTA file paths.
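The looping alternative might look like the sketch below. It is an illustration under stated assumptions rather than a tested pipeline: process_genome, run, and the DRY_RUN variable are hypothetical names introduced here, and the input files are assumed to end in .fasta. With DRY_RUN left at its default of 1, the script only prints the commands it would run; set DRY_RUN=0 to execute them.

```shell
# Sketch: run steps 1-3 sequentially for every FASTA file in the current directory.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

process_genome() {
    fasta="$1"
    base="${fasta%.fasta}"                         # path without the .fasta extension
    run prodigal -i "$fasta" -a "$base.faa"        # step 1: predict proteins
    run sed -i 's/\*$//' "$base.faa"               # step 2: remove * characters
    run ./run_docker_interproscan.sh "$base.faa"   # step 3: annotate domains
}

for fasta in ./*.fasta; do
    [ -e "$fasta" ] || continue                    # skip when no FASTA files match
    process_genome "$fasta"
done
```

Step four is left out of the loop because a single pygenprop build call takes all of the .tsv files at once.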

For the steps below, let us assume that we have the following directory structure:

data/
├── ecoli_one.fasta
├── ecoli_two.fasta

1. Predict Proteins

The find . -maxdepth 1 -name "*.fasta" command finds all the FASTA files in the current working directory. This list is piped to parallel, which runs a Prodigal process on each file in parallel.

find . -maxdepth 1 -name "*.fasta" | parallel prodigal -i {} -a {.}.faa
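In the command above, parallel replaces {} with each input path and {.} with the same path minus its file extension. The sketch below mimics that expansion with plain shell parameter expansion so the generated commands can be inspected without running Prodigal. The prodigal_command function is a hypothetical helper used only for illustration.

```shell
# Print the Prodigal command that would be generated for one FASTA path.
# ${path%.*} strips the final extension, mirroring parallel's {.} replacement string.
prodigal_command() {
    path="$1"
    echo "prodigal -i $path -a ${path%.*}.faa"
}

prodigal_command ./ecoli_one.fasta
prodigal_command ./ecoli_two.fasta
```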

Resulting Folder Structure

data/
├── ecoli_one.fasta
├── ecoli_one.faa
├── ecoli_two.fasta
├── ecoli_two.faa

2. Remove * Characters

Find all the .faa files and run sed on them in parallel to remove * characters.

find . -maxdepth 1 -name "*.faa" | parallel sed -i 's/\*$//' {}

Resulting Folder Structure: N/A (the .faa files are edited in place)

3. Annotate Domains

Find all the .faa files and run InterProScan on them. The -j 1 flag of parallel ensures that only one copy of InterProScan is run at a time (equivalent to xargs). Because InterProScan is already multi-threaded, we don't need to run multiple processes.

find . -maxdepth 1 -name "*.faa" | parallel -j 1 ./run_docker_interproscan.sh {}
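For systems without GNU parallel, the xargs equivalent mentioned above can be sketched as follows. This is a demonstration only: it uses a temporary directory with empty placeholder files, and echo stands in for actually invoking the wrapper script, so it shows one invocation per file without running InterProScan.

```shell
# Create two empty placeholder .faa files in a scratch directory.
demo=$(mktemp -d)
touch "$demo/ecoli_one.faa" "$demo/ecoli_two.faa"

# xargs -n 1 passes one path per invocation, so the calls run one after another.
# "echo" is a stand-in; drop it to really call ./run_docker_interproscan.sh.
calls=$(find "$demo" -maxdepth 1 -name "*.faa" | sort |
    xargs -n 1 echo ./run_docker_interproscan.sh)
echo "$calls"
```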

Resulting Folder Structure

data/
├── ecoli_one.fasta
├── ecoli_one.faa
├── ecoli_one.tsv
├── ecoli_two.fasta
├── ecoli_two.faa
├── ecoli_two.tsv

4. Build a Micromeda File

Build an output Micromeda file from multiple input InterProScan .tsv files.

pygenprop build -d ./genomeProperties.txt -i *.tsv -o ./data.micro -p

Resulting Folder Structure

data/
├── ecoli_one.fasta
├── ecoli_one.faa
├── ecoli_one.tsv
├── ecoli_two.fasta
├── ecoli_two.faa
├── ecoli_two.tsv
├── data.micro

Uploading a Micromeda File for Visualization

Micromeda-Visualizer is available at micromeda.uwaterloo.ca. Users can upload Micromeda files to Micromeda-Visualizer for visualization via a drag and drop interface. The steps for creating Micromeda heat map visualizations are overviewed below.

1. Upload Micromeda Files

Navigate to micromeda.uwaterloo.ca.

1a. Click on the Upload Button

Clicking the upload button will redirect the browser to the Micromeda file upload page.

1b. Upload File

Upload a Micromeda file using the drag and drop zone. Alternatively, the drop zone can be clicked to bring up a file selection menu.

2. Wait for File Processing

It may take from a few minutes to tens of minutes for the Micromeda file to be uploaded and processed. After upload and processing are complete, the browser window will automatically be redirected to the heat map display page. Do not navigate away from this page manually after file upload, or all progress will be lost. Note that the upload progress bar may show 100%, but the page will not be redirected until the Micromeda file has been completely processed on the server. Please be patient while waiting for the page redirect when uploading large Micromeda files (>30 organisms). The time taken to process the file on the server grows exponentially with input file size.

Wait For File Upload To Complete

3. Use the Heat Map

After server-side processing, the heat map will be available for viewing. Only a single heat map can be viewed at one time. To create a new heat map or view a previous one, the corresponding Micromeda file must be re-uploaded. Micromeda files and heat maps are only stored on the server for two hours. Afterward, the original Micromeda file must be re-uploaded.

The Heat Map Can Be Viewed

Installing Micromeda-Server (Optional)

Before installing Micromeda-Server you should install Docker and Docker Compose.

Steps:

  1. Download the docker compose file.

    wget https://raw.githubusercontent.com/Micromeda/micromeda-server/master/docker-compose.yml
    
  2. Edit the Docker Compose file:

    For the line:

    - BACKEND_URL=http://0.0.0.0:5000/
    

    Replace 0.0.0.0 with the server's recognized URL (e.g., micromeda.uwaterloo.ca). If you are running Micromeda-Server on your personal computer, then leave 0.0.0.0 in place.

    - BACKEND_URL=http://micromeda.uwaterloo.ca:5000/
    
  3. Build the front-end and back-end.

    docker-compose build
    
  4. Run the front-end and back-end.

    docker-compose up
    

    Note: You may want to run docker-compose up as a background process.

    docker-compose up -d
    

    -d is for detached (i.e., background) mode.
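The edit in step 2 can also be scripted. The sketch below is illustrative only: it works on a hypothetical one-line stand-in for docker-compose.yml in a temporary directory (the real file contains more than this line), and the host value is an example to replace with your own server's hostname.

```shell
# Hypothetical one-line stand-in for the real docker-compose.yml.
workdir=$(mktemp -d)
printf '      - BACKEND_URL=http://0.0.0.0:5000/\n' > "$workdir/docker-compose.yml"

host="micromeda.uwaterloo.ca"   # example value; use your server's hostname

# Rewrite only the BACKEND_URL line. Writing to a new file and moving it into
# place behaves the same with GNU and BSD sed, unlike the -i flag.
sed "s|BACKEND_URL=http://0\.0\.0\.0:5000/|BACKEND_URL=http://$host:5000/|" \
    "$workdir/docker-compose.yml" > "$workdir/docker-compose.yml.new"
mv "$workdir/docker-compose.yml.new" "$workdir/docker-compose.yml"

cat "$workdir/docker-compose.yml"
```

To apply this to the downloaded file, point the sed command at that file instead of the stand-in and review the result before running docker-compose build.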

micromeda's Projects

micromeda-client

Interactive and comparative data visualization of the genome properties of novel genomes, metagenome assembled genomes (MAGs) and metagenomes.

micromeda-workflow

An overview of how to generate Micromeda files and upload them to Micromeda-Server.

pygenprop

A Python library for programmatic usage of EBI InterPro Genome Properties.
