ncbi-hackathons / viruspy Goto Github PK

A pipeline for viral identification from metagenomic samples

License: MIT License

Perl 15.58% Shell 60.87% Python 23.54%

virus viral metagenomics viral-contigs sra genome-assembly virus-discovery virus-integration

viruspy's Introduction

ViruSpy: a pipeline for viral identification from metagenomic samples

What is ViruSpy?
Why is this important?
ViruSpy Workflow
Quickstart
Installing ViruSpy
ViruSpy Usage
ViruSpy Testing and Validation
Additional Functionality

What is ViruSpy?

ViruSpy is a pipeline designed for virus discovery from metagenomic sequencing data available in NCBI’s SRA database. The first step identifies viral reads in the metagenomic sample with Magic-BLAST, which allows this step without needing to download the (often quite large) metagenomic dataset. The extracted raw reads are assembled into contigs using MEGAHIT and annotated for genes by Glimmer and for conserved domains by RPS-tBLASTn. Following annotation, the Building Up Domains (BUD) algorithm allows us to tell whether the viral genomes are non-native (i.e. integrated) to a host genome.

Why is this important?

Viruses compose a large amount of the genomic biodiversity on the planet, but only a small fraction of the viruses that exist are known. To help fill this gap in knowledge we created a pipeline that can identify putative viral sequences from large scale metagenomic datasets that already exist in the SRA database.

Viruses across multiple virus families are found integrated in host genomes. By including the BUD algorith in the pipeline, we are able to identify these and distinguish them from exogenous viruses.

ViruSpy Workflow

The ViruSpy pipeline requires the user to provide the SRA ID of the metagenomic sample to be searched through and a reference viral genome database. The reference viral genome database can be either supplied by the user in the form of a FASTA file or BLAST database. If neither is provided, ViruSpy will default to the RefSeq viral genome database and attempt to download those sequences in FASTA format.

In the first step Magic-BLAST returns all of the virus-like sequences from the SRA sample, which are assembled into contigs using the MEGAHIT assembler.

The contigs are verified as viral sequences through two methods: prediction of open reading frames within the contigs using Glimmer3, and prediction of conserved protein domains using RPS-tBLASTn. The viral conserved domains (CD) are determined using the NCBI CDD database. Output files from both of these methods are then combined to identify a set of high confidence viral contigs.

Using the identified viral reads, the determination of endogenous reads within a host relies upon the Building Up Domains (BUD) algorithm. BUD takes as input an identified peprocessed viral contig from a metagenomics dataset and feeds the contig ends from both sides to Magic-BLAST, which searches for overlapping reads in the SRA dataset. The reads are then used to extend the contig in both directions. This process continues until non-viral domains are identified on either side of the original viral contig, implying that the original contig was endogenous in the host, or until a specified number of iterations has been reached (default iteration value was set to 10). This process is depicted below:

Useful References

Installing ViruSpy

Required software

Magic-BLAST (>= v1.3): download documentation
BLAST+
Samtools (>= version 1.5)
Prinseq
MEGAHIT
Glimmer3

The ViruSpy /scripts/ directory should be added to the user's $PATH.

ViruSpy Usage

Example usage

viruspy.sh [-d] [-f viral_genomes.fasta/-b viral_db] -s SRR1553459 -o output_folder

Required arguments:

Option	Description
-s	SRR acession number from SRA database
-o	Folder to be used for pipeline output

Optional arguments:

Option	Description
-f	FASTA file containing viral sequences to be used in construction of a BLAST database. If neither this argument nor -b are used, ViruSpy will default to using the Refseq viral genome database.
-b	BLAST database with viral sequences to be used with Magic-BLAST. If neither this argument nor -f are used, ViruSpy will default to using the Refseq viral genome database.
-d	Determines signature of viruses that are integrated into a host genome (runs the BUD algorithm)

ViruSpy Testing and Validation

Additional Functionality

viruspy's People

Contributors

Stargazers

Watchers

Forkers

b-rich peiwenliu18 5l1v3r1 pcantalupo

viruspy's Issues

BUD Algorithm Contig Carry over

I have not implemented a method to remove empty fa files (generated by the contig_extend script) from the next iteration of the Bud Algorithm. This should reduce the time for the subsequent iterations.

Brainstorming ideas to do this:

a bash script to retain .fa files > than 0 bytes then filter contigs in combined file by names of files left
A script that creates a list of empty file names and filters contig keys in a dictionary format.

can't run viruspy with data not in the SRA database

This issue is for those who want to run viruspy against sequences that are not in the SRA database. A clear case for this is somebody that did their own sequencing and has a set of FASTQ files that they want to analyze.

A potential reason for short contigs

We may need to do an after-magicBLAST cleanup, where we extract the reads and pairs back out of the original SRA. Duplications and unpaired reads may be throwing off the aligners. Unpaireds may be interesting, but we should be able to pull them back out with BUD.

(vdb-dump) works well for this.

To wit, with the results from ViruSpy:

grep SRR1161435.123159 SRR1161435_putative_viral_reads.fastq
@SRR1161435.123159.1/1
@SRR1161435.123159.2/2
@SRR1161435.123159.2/2
@SRR1161435.123159.2/2
@SRR1161435.123159.2/2

grep SRR1553459.9969 SRR1553459_putative_viral_reads.fastq
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2
@SRR1553459.9969.1/1
@SRR1553459.9969.2/2

grep SRR5675673.32209 SRR5675673_putative_viral_reads.fastq
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2
@SRR5675673.32209.1/1
@SRR5675673.32209.2/2

BUD Algorithm Bug

There is a memory error with the implementation of the bud algorithm. The process of checking if a fasta file is extended is not completed due to too many files being created for individual contigs to check against the contigs_extend perl script. Either moving the fasta data into a dictionary or reading each contig by line is necessary to avoid creating the large amount of files for both the input and output.

Pragmatically ealing with repeats and evolutionarily diverged regions

Eventually, possibly in another hackathon, BUD should have a state dependent magicBLAST parameter modifier, where:

the parameters are tightened in regions with > average # of reads (repeats are presumably a main driver of this)

the parameters are loosened in regions with < average # of reads (divergence is presumably a main driver of this)

magicblast versions and unaligned reads

Not really an issue but something to remember. MagicBlast version 1.2 by default does not return unaligned reads. HOWEVER, version 1.3 does return unaligned reads by default. To turn this off, use the '-no_unaligned' option. See https://gist.github.com/pcantalupo/fe755e79d5b195573f80ec9ef179e562 for comparsion of v1.2 vs v1.3 options.

Not using full paths within scripts

Script paths need straightening out - while trying to run viruspy.sh and magicblast_w_opts.sh I got repeated "file not found" errors.

./viruspy.sh -s SRR5675673 -o /zfs1/ncbi-workshop/virus-discovery/JW/Pipeline_Rerun/

srr: SRR5675673
outdir: /zfs1/ncbi-workshop/virus-discovery/JW/Pipeline_Rerun/
blastDB:
paired:
./viruspy.sh: line 73: magicblast_w_opts.sh: command not found

This error is because of line 73 in the script:

$magic_blast -s $srr -o $srr.fastq

without ./$magic_blast or bash $magic_blast, my shell won't be able to interpret the command.

Once that's fixed, I get:

bash: magicblast_w_opts.sh: No such file or directory

That's because of line 67:

cd $magic_dir

Which takes us out of the current directory into out_dir/data_magicblast, and the full path for magicblast_w_opts is not given, so it can't find the file. Since this is a larger issue (most of our paths in scripts are not full paths) I'm putting it up as an issue rather than just fixing it in the script.