fredyr / vireflow

This project forked from niemasd/vireflow

An elastically-scaling automated AWS pipeline for viral consensus sequence generation

License: GNU General Public License v3.0

Python 100.00%

ViReflow

ViReflow is a tool for constructing elastically scaling, parallelized, automated AWS pipelines for viral consensus sequence generation. Given sequence data from a viral sample as well as information about the reference genome and primers, ViReflow generates a Reflow file that contains all steps of the workflow, including AWS instance specifications. Because ViReflow is intended to be used with Reflow, the workflows it produces automatically distribute independent tasks to run in parallel and elastically scale AWS instances to fit each individual step of the workflow. ViReflow uses compact, minimal Docker images for each step of the viral analysis workflow; details about them can be found in the Niema-Docker GitHub organization.

Workflow Summary

Read Trimmers   Read Mappers   Variant Callers   Optional Analyses
fastp           Bowtie2        FreeBayes         coronaSPAdes
iVar Trim       BWA-MEM        iVar Variants     MEGAHIT
PRINSEQ         Minimap2       LoFreq            metaviralSPAdes
pTrimmer                                         minia
                                                 Pangolin (COVID-19)
                                                 rnaviralSPAdes
                                                 VirStrain
                                                 π Diversity Metric

Installation

ViReflow is written in Python 3. You can simply download ViReflow.py to your machine and make it executable:

wget "https://raw.githubusercontent.com/niemasd/ViReflow/master/ViReflow.py"
chmod a+x ViReflow.py
sudo mv ViReflow.py /usr/local/bin/ViReflow.py # optional step to install globally

While ViReflow itself only depends on Python 3, the pipelines it produces are Reflow files that run on AWS. Thus, in order to run the pipelines ViReflow produces, one must first install Reflow.

Usage

ViReflow can be used as follows:

usage: ViReflow.py [-o OUTPUT_RF] -d DESTINATION -rf REFERENCE_FASTA -rg REFERENCE_GFF -p PRIMER_BED [OPTIONAL ARGS] FASTQ1 [FASTQ2 ...]

For extensive details about each command line argument, see the Command Line Argument Descriptions section of the ViReflow wiki.

Example Usage

We have provided demo files, and ViReflow can be executed as follows:

ViReflow.py -o demo.rf                                                                          `# output Reflow run file` \
            -d s3://my-s3-folder/vireflow-demo                                                  `# output S3 folder` \
            -rf https://github.com/niemasd/ViReflow/raw/main/demo/NC_045512.2.fas               `# reference genome (FASTA)` \
            -rg https://github.com/niemasd/ViReflow/raw/main/demo/NC_045512.2.gff3              `# reference genome annotation (GFF3)` \
            -p https://github.com/niemasd/ViReflow/raw/main/demo/sarscov2_v2_primers_swift.bed  `# primer coordinates file (BED)` \
            https://github.com/niemasd/ViReflow/raw/main/demo/test_R1.fastq                     `# FASTQ 1` \
            https://github.com/niemasd/ViReflow/raw/main/demo/test_R2.fastq                     `# FASTQ 2`

This will result in the creation of a file called demo.rf, which is the Reflow workflow file. Assuming Reflow is properly installed and configured, the workflow can now be run as follows:

reflow run demo.rf

Batch Parallel Execution

In a given sequencing experiment, if you have multiple samples you want to run (e.g. sample1, sample2, ..., sampleN), you can use ViReflow to process all of them in parallel (assuming your AWS account has access to spin up sufficient EC2 instances). First, you need to use ViReflow to produce a Reflow run file (.rf) for each sample:

for s in sample1 sample2 [REST_OF_SAMPLES] sampleN ; do ViReflow.py -id $s -o $s.rf [REST_OF_VIREFLOW_ARGS] ; done
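The shell loop above can also be sketched in Python, which makes it easier to attach different FASTQ files to each sample. This is a minimal sketch, not part of ViReflow: the function name `vireflow_commands`, the sample-to-FASTQ mapping, and the example paths are hypothetical, and only the `-id`, `-o`, and `-d` flags are taken from the commands shown in this README. It prints the commands rather than running them.

```python
import shlex


def vireflow_commands(samples, shared_args):
    """Build one ViReflow.py command (as an argv list) per sample.

    samples     -- dict mapping a sample ID to its list of FASTQ paths (hypothetical layout)
    shared_args -- arguments common to every sample, i.e. [REST_OF_VIREFLOW_ARGS]
    """
    commands = []
    for sample_id, fastqs in samples.items():
        # Options first, FASTQ files last, matching the usage line.
        commands.append(
            ["ViReflow.py", "-id", sample_id, "-o", f"{sample_id}.rf", *shared_args, *fastqs]
        )
    return commands


# Illustrative values only; substitute your own samples and arguments.
samples = {
    "sample1": ["sample1_R1.fastq", "sample1_R2.fastq"],
    "sample2": ["sample2_R1.fastq", "sample2_R2.fastq"],
}
shared = ["-d", "s3://my-s3-folder/vireflow-demo"]

for argv in vireflow_commands(samples, shared):
    # Dry run: print a shell-quoted command instead of executing it.
    print(shlex.join(argv))
```

To actually execute each command, the printed line could be replaced with `subprocess.run(argv, check=True)`.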

Alternatively, you can create a CSV file in which the first column contains the run ID and all remaining columns denote that run's FASTQ files:

sample1   sample1_R1.fastq   s3://my_samples/sample1_R2.fastq
sample2   sample2_R1.fastq   s3://my_samples/sample2_R2.fastq
...       ...                ...

You can then run ViReflow as follows to generate the Reflow files for all runs:

ViReflow.py [VIREFLOW_ARGS] my_samples.csv
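For many samples, the sample sheet can be generated rather than typed by hand. The sketch below assumes the file is comma-delimited, as the .csv extension suggests; the README does not state the delimiter explicitly, so check this against the ViReflow wiki before relying on it. The helper name `sample_sheet_text` is hypothetical.

```python
import csv
import io


def sample_sheet_text(samples):
    """Return sample-sheet text: one row per run, run ID first, then its FASTQ files.

    samples -- dict mapping a run ID to its list of FASTQ paths or URLs
    """
    buffer = io.StringIO()
    # Assumption: comma-delimited rows with plain "\n" line endings.
    writer = csv.writer(buffer, lineterminator="\n")
    for run_id, fastqs in samples.items():
        writer.writerow([run_id, *fastqs])
    return buffer.getvalue()


sheet = sample_sheet_text({
    "sample1": ["sample1_R1.fastq", "s3://my_samples/sample1_R2.fastq"],
    "sample2": ["sample2_R1.fastq", "s3://my_samples/sample2_R2.fastq"],
})
print(sheet, end="")
```

The returned text can be saved with `open("my_samples.csv", "w").write(sheet)` and then passed to ViReflow.py as shown above.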

Then, you can use the rf_batch.py script to create a batch Reflow run file that will execute all of the individual sample Reflow run files:

rf_batch.py -o batch_samples.rf sample1.rf sample2.rf [REST_OF_SAMPLES].rf sampleN.rf
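When the per-sample run files are collected from a directory, the rf_batch.py command can be assembled programmatically. The following is a rough sketch under stated assumptions: `rf_batch_command` is a hypothetical helper, and only the `-o` flag and the positional run files come from the command above. It skips the batch output file itself so that a glob over `*.rf` does not feed a previous batch file back in.

```python
import shlex


def rf_batch_command(run_files, output="batch_samples.rf"):
    """Build the rf_batch.py argv list for the given per-sample .rf files."""
    # Sort for a stable order and drop the batch output file if it was globbed in.
    inputs = sorted(str(path) for path in run_files if str(path) != output)
    if not inputs:
        raise ValueError("no per-sample Reflow run files were given")
    return ["rf_batch.py", "-o", output, *inputs]


# In practice run_files might come from pathlib.Path(".").glob("*.rf").
argv = rf_batch_command(["sample2.rf", "sample1.rf", "batch_samples.rf"])
print(shlex.join(argv))  # prints: rf_batch.py -o batch_samples.rf sample1.rf sample2.rf
```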

Now, you can simply run Reflow on the newly created batch_samples.rf, and it will automatically execute all of the individual sample Reflow run files:

reflow run batch_samples.rf

Citing ViReflow

If you use ViReflow in your work, please cite:

Moshiri N, Fisch KM, Birmingham A, DeHoff P, Yeo GW, Jepsen K, Laurent LC, Knight R (2022). "The ViReflow pipeline enables user friendly large scale viral consensus genome reconstruction." Scientific Reports. 12:5077. doi:10.1038/s41598-022-09035-w

Please also cite the mapper, trimmer, variant caller, and optional analysis tool(s) you used in your ViReflow run(s).
