This project forked from motrpac/motrpac-rna-seq-pipeline

WDL implementation of harmonized RNA-seq pipeline for MoTrPAC project

MoTrPAC RNA-SEQ Pipeline

Overview

This repo contains the RNA-seq data processing pipeline, implemented in Workflow Description Language (WDL) and based on the harmonized RNA-seq MOP. The pipeline uses caper, a Python wrapper package for the workflow management system Cromwell. All data were processed on the Google Cloud Platform (GCP). The pipeline uses the STAR aligner and the RSEM and featureCounts read quantification tools. It also generates a QC metrics file, which is useful for outlier detection and covariate adjustment during differential analysis.

Details

GCP set-up

The WDL/Cromwell framework is optimized to run pipelines in high-performance computing environments. The MoTrPAC Bioinformatics Center runs pipelines on Google Cloud Platform (GCP). We used a number of fantastic tools developed by our colleagues from the ENCODE project to run pipelines on GCP (and other HPC platforms).

A brief summary of the steps to set up a VM to run the MoTrPAC pipelines on GCP (for details, please check the caper repo):

  • Create a GCP account.
  • Enable cloud APIs.
  • Install the Google Cloud SDK (Software Development Kit) on your local machine.
  • Create a service account and download the key file to your local computer (e.g. “service-account-191919.json”).
  • Create a bucket for pipeline inputs and outputs (e.g. gs://pipelines/). Note: a GCP bucket is similar to a folder on your computer or a storage unit, but it is stored on Google's servers in the cloud instead of on your local computer.
  • Set up a VM on GCP: create a Virtual Machine (VM) instance from which the pipelines will be run. We recommend the script available in the caper repo. To use it, clone the caper repo on your local machine and run the following command:
$ bash create_instance.sh [INSTANCE_NAME] [PROJECT_ID] [GCP_SERVICE_ACCOUNT_KEY_JSON_FILE] [GCP_OUT_DIR]

# Example for the pipeline:
./create_instance.sh pipeline-instance your-gcp-project-name service-account-191919.json gs://pipelines/results/
  • Finally, clone the repo:

git clone https://github.com/MoTrPAC/motrpac-rna-seq-pipeline
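
The create_instance.sh call above takes four positional arguments. As a minimal sketch (not part of this pipeline or of caper), the Python helper below assembles that command line and applies two basic sanity checks before anything is run. The function name build_create_instance_cmd and both checks are assumptions made for illustration; only the argument order comes from the usage line above.

```python
import shlex


def build_create_instance_cmd(instance_name, project_id, key_json, out_dir):
    """Return the create_instance.sh command line as one shell-safe string.

    Hypothetical helper: only the argument order is taken from the usage line.
    """
    # Assumed check: the output directory is a bucket path such as gs://pipelines/results/
    if not out_dir.startswith("gs://"):
        raise ValueError(f"GCP_OUT_DIR should start with gs://, got {out_dir!r}")
    # Assumed check: the service account key is a JSON file
    if not key_json.endswith(".json"):
        raise ValueError(f"expected a .json key file, got {key_json!r}")
    args = ["bash", "create_instance.sh", instance_name, project_id, key_json, out_dir]
    # shlex.join quotes any argument that contains shell-special characters
    return shlex.join(args)


if __name__ == "__main__":
    print(build_create_instance_cmd(
        "pipeline-instance",
        "your-gcp-project-name",
        "service-account-191919.json",
        "gs://pipelines/results/",
    ))
```

Building the command as a list and joining it with shlex.join keeps names containing spaces or other special characters from being split by the shell.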

Software / Dockerfiles

Several tools are required to run the RNA-seq pipeline. All of them are pre-installed in Docker containers, which are publicly available in the Artifact Registry. To find out which versions of the tools are used to run the pipeline, check the dockerfiles/*.Dockerfile files.

Configuration Files

An input configuration file (in JSON format) is required to process the data through the pipeline. This configuration file contains key-value pairs that specify the inputs and outputs of the workflow, the location of the input files, default pipeline parameters, Docker containers, the execution environment, and other parameters needed for execution.

The optimal way to generate the configuration files is to run the make_json_rnaseq.py script. Check this help guide to find out more.
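
For orientation only, the sketch below shows the general shape of such a JSON file. It is not the output of make_json_rnaseq.py, which remains the recommended route. Cromwell-style input keys take the form "<workflow_name>.<input_name>". The workflow name rnaseq_pipeline, every input name, the bucket paths, and the Docker image string are hypothetical placeholders, not the pipeline's real parameters.

```python
import json

# Hypothetical workflow name; the real one is defined in the WDL file.
WORKFLOW = "rnaseq_pipeline"


def make_inputs(sample_to_fastqs, docker_image, output_prefix):
    """Build a dict of workflow inputs. All input names are placeholders."""
    samples = sorted(sample_to_fastqs)
    return {
        f"{WORKFLOW}.sample_ids": samples,
        # One R1 and one R2 file per sample, in the same order as sample_ids
        f"{WORKFLOW}.fastq_r1": [sample_to_fastqs[s][0] for s in samples],
        f"{WORKFLOW}.fastq_r2": [sample_to_fastqs[s][1] for s in samples],
        f"{WORKFLOW}.docker": docker_image,
        f"{WORKFLOW}.output_prefix": output_prefix,
    }


if __name__ == "__main__":
    inputs = make_inputs(
        {
            "sampleA": (
                "gs://pipelines/fastq/sampleA_R1.fastq.gz",  # illustrative path
                "gs://pipelines/fastq/sampleA_R2.fastq.gz",  # illustrative path
            )
        },
        docker_image="example-registry/rnaseq:latest",  # illustrative image name
        output_prefix="gs://pipelines/results/",
    )
    print(json.dumps(inputs, indent=2))
```

Generating the file from a sample table, rather than editing JSON by hand, keeps the paired FASTQ lists aligned with the sample IDs.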

Run the pipeline

Connect to the VM and submit the job using the command below:

caper submit rnaseq_pipeline_scatter.wdl -i input_json/set1_rnaseq.json

To check the status of the workflows, run caper list on the VM instance that is running the job and confirm that each workflow is reported as Succeeded.
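
When many workflows are submitted, scanning the listing by eye becomes tedious. The sketch below is one possible way to summarize it, assuming only that each workflow appears on its own line and that its Cromwell status appears as a separate whitespace-delimited field. The function names are hypothetical, and the exact column layout of caper list is not assumed.

```python
import subprocess
from collections import Counter

# Workflow states reported by Cromwell
STATUSES = ("Submitted", "Running", "Succeeded", "Failed", "Aborted")


def count_statuses(listing):
    """Count how many lines of a workflow listing mention each status."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        for status in STATUSES:
            if status in fields:
                counts[status] += 1
    return counts


def all_succeeded(listing):
    """True if at least one workflow is listed and every listed one succeeded."""
    counts = count_statuses(listing)
    return counts["Succeeded"] > 0 and sum(counts.values()) == counts["Succeeded"]


def fetch_listing():
    """Run `caper list` (must be on PATH, e.g. on the VM) and return its output."""
    result = subprocess.run(
        ["caper", "list"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the VM, all_succeeded(fetch_listing()) returns False while any workflow is still running or has failed, so it can gate a downstream step.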

Contributors

archanaraja, mihirsamdarshi, shrutimarwaha, akre96, biodavidjm, hershman, nicolerg
