Coder Social home page Coder Social logo

ncbi-hackathons / viruspy Goto Github PK

View Code? Open in Web Editor NEW
25.0 16.0 4.0 8.88 MB

A pipeline for viral identification from metagenomic samples

License: MIT License

Perl 15.58% Shell 60.87% Python 23.54%
virus viral metagenomics viral-contigs sra genome-assembly virus-discovery virus-integration

viruspy's Introduction

ViruSpy: a pipeline for viral identification from metagenomic samples

Table of Contents

What is ViruSpy?
Why is this important?
ViruSpy Workflow
Quickstart
Installing ViruSpy
ViruSpy Usage
ViruSpy Testing and Validation
Additional Functionality

What is ViruSpy?

ViruSpy is a pipeline designed for virus discovery from metagenomic sequencing data available in NCBI’s SRA database. The first step identifies viral reads in the metagenomic sample with Magic-BLAST, which allows this step without needing to download the (often quite large) metagenomic dataset. The extracted raw reads are assembled into contigs using MEGAHIT and annotated for genes by Glimmer and for conserved domains by RPS-tBLASTn. Following annotation, the Building Up Domains (BUD) algorithm allows us to tell whether the viral genomes are non-native (i.e. integrated) to a host genome.

Why is this important?

Viruses compose a large amount of the genomic biodiversity on the planet, but only a small fraction of the viruses that exist are known. To help fill this gap in knowledge we created a pipeline that can identify putative viral sequences from large scale metagenomic datasets that already exist in the SRA database.

Viruses across multiple virus families are found integrated in host genomes. By including the BUD algorith in the pipeline, we are able to identify these and distinguish them from exogenous viruses.

ViruSpy Workflow

The ViruSpy pipeline requires the user to provide the SRA ID of the metagenomic sample to be searched through and a reference viral genome database. The reference viral genome database can be either supplied by the user in the form of a FASTA file or BLAST database. If neither is provided, ViruSpy will default to the RefSeq viral genome database and attempt to download those sequences in FASTA format.

In the first step Magic-BLAST returns all of the virus-like sequences from the SRA sample, which are assembled into contigs using the MEGAHIT assembler.

The contigs are verified as viral sequences through two methods: prediction of open reading frames within the contigs using Glimmer3, and prediction of conserved protein domains using RPS-tBLASTn. The viral conserved domains (CD) are determined using the NCBI CDD database. Output files from both of these methods are then combined to identify a set of high confidence viral contigs.

Using the identified viral reads, the determination of endogenous reads within a host relies upon the Building Up Domains (BUD) algorithm. BUD takes as input an identified peprocessed viral contig from a metagenomics dataset and feeds the contig ends from both sides to Magic-BLAST, which searches for overlapping reads in the SRA dataset. The reads are then used to extend the contig in both directions. This process continues until non-viral domains are identified on either side of the original viral contig, implying that the original contig was endogenous in the host, or until a specified number of iterations has been reached (default iteration value was set to 10). This process is depicted below:

Useful References

Magic-BLAST

BLAST Command Line Manual
Magic-BLAST GitHub repo
Magic-BLAST NCBI Insights

MEGAHIT

MEGAHIT GitHub repo
MEGAHIT Paper

Protein Domain Identification

BLAST Command Line Manual
NCBI Conserved Domain and Protein Classification

Glimmer3

Glimmer3 Page at JHU
Glimmer3 Paper
Glimmer3 Manual

Installing ViruSpy

Required software

The ViruSpy /scripts/ directory should be added to the user's $PATH.

ViruSpy Usage

Example usage

viruspy.sh [-d] [-f viral_genomes.fasta/-b viral_db] -s SRR1553459 -o output_folder

Required arguments:

Option Description
-s SRR acession number from SRA database
-o Folder to be used for pipeline output

Optional arguments:

Option Description
-f FASTA file containing viral sequences to be used in construction of a BLAST database. If neither this argument nor -b are used, ViruSpy will default to using the Refseq viral genome database.
-b BLAST database with viral sequences to be used with Magic-BLAST. If neither this argument nor -f are used, ViruSpy will default to using the Refseq viral genome database.
-d Determines signature of viruses that are integrated into a host genome (runs the BUD algorithm)

ViruSpy Testing and Validation

Additional Functionality

viruspy's People

Contributors

dcgenomics avatar glickmac avatar jkwaldman avatar karinazile avatar kkupkova avatar mae92 avatar pcantalupo avatar vineet1992 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

viruspy's Issues

BUD Algorithm Contig Carry over

I have not implemented a method to remove empty fa files (generated by the contig_extend script) from the next iteration of the Bud Algorithm. This should reduce the time for the subsequent iterations.

Brainstorming ideas to do this:

  1. a bash script to retain .fa files > than 0 bytes then filter contigs in combined file by names of files left

  2. A script that creates a list of empty file names and filters contig keys in a dictionary format.

can't run viruspy with data not in the SRA database

This issue is for those who want to run viruspy against sequences that are not in the SRA database. A clear case for this is somebody that did their own sequencing and has a set of FASTQ files that they want to analyze.

A potential reason for short contigs

We may need to do an after-magicBLAST cleanup, where we extract the reads and pairs back out of the original SRA. Duplications and unpaired reads may be throwing off the aligners. Unpaireds may be interesting, but we should be able to pull them back out with BUD.

(vdb-dump) works well for this.

To wit, with the results from ViruSpy:

grep SRR1161435.123159 SRR1161435_putative_viral_reads.fastq
@SRR1161435.123159.1/1
@SRR1161435.123159.2/2
@SRR1161435.123159.2/2
@SRR1161435.123159.2/2
@SRR1161435.123159.2/2

grep SRR1553459.9969 SRR1553459_putative_viral_reads.fastq
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2

grep SRR5675673.32209 SRR5675673_putative_viral_reads.fastq
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2

BUD Algorithm Bug

There is a memory error with the implementation of the bud algorithm. The process of checking if a fasta file is extended is not completed due to too many files being created for individual contigs to check against the contigs_extend perl script. Either moving the fasta data into a dictionary or reading each contig by line is necessary to avoid creating the large amount of files for both the input and output.

Pragmatically ealing with repeats and evolutionarily diverged regions

Eventually, possibly in another hackathon, BUD should have a state dependent magicBLAST parameter modifier, where:

the parameters are tightened in regions with > average # of reads (repeats are presumably a main driver of this)

the parameters are loosened in regions with < average # of reads (divergence is presumably a main driver of this)

Not using full paths within scripts

Script paths need straightening out - while trying to run viruspy.sh and magicblast_w_opts.sh I got repeated "file not found" errors.

./viruspy.sh -s SRR5675673 -o /zfs1/ncbi-workshop/virus-discovery/JW/Pipeline_Rerun/

srr: SRR5675673
outdir: /zfs1/ncbi-workshop/virus-discovery/JW/Pipeline_Rerun/
blastDB:
paired:
./viruspy.sh: line 73: magicblast_w_opts.sh: command not found

This error is because of line 73 in the script:

$magic_blast -s $srr -o $srr.fastq

without ./$magic_blast or bash $magic_blast, my shell won't be able to interpret the command.

Once that's fixed, I get:

bash: magicblast_w_opts.sh: No such file or directory

That's because of line 67:

cd $magic_dir

Which takes us out of the current directory into out_dir/data_magicblast, and the full path for magicblast_w_opts is not given, so it can't find the file. Since this is a larger issue (most of our paths in scripts are not full paths) I'm putting it up as an issue rather than just fixing it in the script.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.