Pandora

Identification and Discovery of Tumor Associated Microbes via RNAseq

Introduction

Pandora is a multi-step pipeline for finding pathogen sequences in RNA-seq data. It includes modules for host read subtraction, de novo assembly, BLAST searches of the assembled contigs, and ORF discovery. As input, Pandora takes paired FASTQ files; as output, it produces a report.

Dependencies

The following programs must be in your PATH:

Python 2.7.x
Samtools
STAR
Bowtie2
Trinity
BLAST 2.3.x
featureCounts (Subread)

Pandora depends on the following Python modules:

Biopython
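
Before a first run, it can help to confirm that everything listed above is reachable. The following is a minimal sketch of such a check, not part of Pandora itself. It is written for Python 3 (it relies on shutil.which) and is meant to be run separately from the pipeline. The executable names (samtools, STAR, bowtie2, Trinity, blastn, featureCounts) are assumptions based on each tool's usual command name, and the helper names find_missing and has_module are hypothetical.

```python
import importlib.util
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ["samtools", "STAR", "bowtie2", "Trinity", "blastn", "featureCounts"]


def find_missing(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def has_module(name):
    """Return True if the top-level Python module `name` can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
    # Biopython is imported under the package name "Bio".
    if not has_module("Bio"):
        print("Biopython does not appear to be installed for this interpreter.")
```

Note that this checks the interpreter running the script, which may not be the Python 2.7 installation that Pandora itself uses.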

Workflow

Pandora provides several subcommands, in the style of git. The primary subcommand is scan, a pipeline comprising the following steps:

1. Subtraction of reads mapping to the host genome
2. De novo assembly of the remaining reads
3. BLAST of the assembled contigs
4. ORF search in contigs of unknown origin
5. Filtering and parsing of the BLAST results into a tidy, human-readable report
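
To make the ORF search step more concrete, here is a minimal sketch of one common way to look for open reading frames: scan all six reading frames for an ATG followed in frame by a stop codon. This is an illustration under simplifying assumptions, not Pandora's own implementation. The function names and the min_codons threshold are hypothetical, and the code is written for Python 3.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned. min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # close the ORF and keep scanning
    return orfs
```

A call such as find_orfs(contig_sequence, min_codons=100) would keep only ORFs of at least 100 codons; a suitable threshold depends on the data.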

The aggregate subcommand [...].

Additional Files

Pandora requires various references and annotation files.

For scan step 1, please provide:

a host genome indexed for STAR
a host genome indexed for bowtie2
(optional) a GTF file describing the genes of the host

For scan step 3, please provide:

the BLAST nucleotide collection nt database

For scan step 4, you can optionally provide:

the BLAST protein collection nr database

For scan step 5, you can optionally provide:

a text file of "blacklist" non-pathogen taxids for filtering. If you do not provide one, the script uses resources/blacklist.txt by default. This list contains every taxid that descends from the nodes Chordata (Taxonomy ID: 7711) or "other sequences" (Taxonomy ID: 28384).
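
The idea behind blacklist filtering can be sketched in a few lines. The example below assumes a file with one numeric taxid per line and represents each BLAST hit as a dictionary with a "taxid" key; both are assumptions made for illustration, so check resources/blacklist.txt for the actual layout. The function names are hypothetical.

```python
def load_blacklist(lines):
    """Parse an iterable of lines (for example an open file) into a set of taxids.

    Assumes one integer taxid per line; blank lines and lines starting
    with '#' are skipped.
    """
    taxids = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            taxids.add(int(line))
    return taxids


def filter_hits(hits, blacklist):
    """Keep only the hits whose taxid is not in the blacklist."""
    return [hit for hit in hits if hit["taxid"] not in blacklist]


# Example: 9606 (Homo sapiens) is blacklisted, 10376 (Epstein-Barr virus) is kept.
blacklist = load_blacklist(["9606\n", "\n", "# comment\n"])
hits = [{"contig": "c1", "taxid": 9606}, {"contig": "c2", "taxid": 10376}]
kept = filter_hits(hits, blacklist)
```

Using a set keeps each membership test fast even when the blacklist holds a large number of taxids.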

Because there are a considerable number of files involved, you can specify their paths with a configuration file instead of command line flags. See pandora.config.txt for example formatting. Note that options specified as flags take precedence over options specified via the configuration file.
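
The precedence rule can be pictured as a simple merge in which values given as flags overwrite values read from the configuration file. The sketch below only illustrates that rule and is not Pandora's actual option handling. The function name merge_options is hypothetical, and the keys are borrowed from the flag names used in the usage examples.

```python
def merge_options(config_options, flag_options):
    """Combine two option dictionaries, letting flags override the config file.

    A flag value of None means the flag was not given on the command line,
    so the value from the configuration file (if any) is kept.
    """
    merged = dict(config_options)
    for key, value in flag_options.items():
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical values: the config file sets both references,
# while the command line overrides only the BLAST database.
from_config = {"refstar": "/path/ref/STAR", "db": "/path/ref/blastdb/nt"}
from_flags = {"refstar": None, "db": "/other/blastdb/nt"}
options = merge_options(from_config, from_flags)
```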

Usage Examples

Here is an example command using command line flags:

pandora.py scan -id patient1 -r1 mate_1.fastq.gz -r2 mate_2.fastq.gz --gzip --refstar /path/ref/STAR --refbowtie /path/ref/bowtie/hg19 -db /path/ref/blastdb/nt

Here is an example command using a configuration file:

pandora.py scan -id patient1 -r1 mate_1.fastq.gz -r2 mate_2.fastq.gz --gzip --verbose -c pandora.config.txt

Notes

Currently, Pandora uses the Oracle Grid Engine (formerly Sun Grid Engine, SGE) by default, because BLAST is computationally intensive, embarrassingly parallel, and well suited to cluster computing. You can turn this off with the --noSGE flag, but BLAST will then be very slow.
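
BLAST parallelizes well because each contig can be searched independently of the others. As a rough sketch of that idea, the code below splits FASTA records into chunks that could each be searched as a separate job. It does not show how Pandora submits work to the grid engine; the function names and the round-robin scheme are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


# Three contigs split into two chunks, each a candidate for its own job.
fasta_lines = [">a", "AC", "GT", ">b", "TT", ">c", "GG"]
chunks = split_records(read_fasta(fasta_lines), 2)
```

Round-robin assignment keeps the chunks similar in record count; balancing by total sequence length instead may be preferable when contig sizes vary widely.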

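Note that RNA-seq libraries enriched for poly-A transcripts will largely miss prokaryotic pathogens, whose transcripts generally lack the long poly-A tails that this enrichment selects for.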

Status: Active Development
