
broadinstitute / viral-ngs

187 stars, 29 watchers, 66 forks, 66.06 MB

Viral genomics analysis pipelines

License: Other

Languages: Python 90.93%, Shell 4.56%, Roff 0.04%, WDL 4.05%, Perl 0.27%, Dockerfile 0.15%
Topics: viral-ngs, viral, genomics, genome, genome-assembly, genome-sequencing, bam, fastq, variant-calling, variant-annotations

viral-ngs's Introduction

viral-ngs

A set of scripts and tools for the analysis of viral NGS data.

More detailed documentation can be found at http://viral-ngs.readthedocs.org/. It includes installation instructions, usage instructions for the command-line tools, and usage of the pipeline infrastructure.

viral-ngs's People

Contributors

09aaronl, biocyberman, dpark01, hannaml, haydenm, iljungr, katrinleinweber, lakras, mlin, notestaff, pvanheus, tomkinsc, yesimon

viral-ngs's Issues

CoordMapper: take pre-built alignments

Update the interhost.CoordMapper class to:

  1. Take pre-built alignments from aligned FASTA files (one per chromosome/segment) instead of always building from scratch
  2. Accept multiple alignments (not just pairwise) and expose an interface to map between arbitrary samples; alternatively, as an initial implementation, just specify in the constructor which two samples to load a pairwise mapping for.

Subclass CoordMapper into two classes that have different data loaders / constructors, but load into the same data structure. Leave most of the coordinate mapping machinery in the base class.

Faster Travis tests

Current builds are starting to approach 20 minutes. Over 90% of this time is spent downloading and compiling dependencies.

Explore some of Travis's container-based options so that normal builds can skip some of the tool download-compile-install and pip install steps. We can stick to the full test suite for pull requests. We can encrypt the container for Travis the same way we currently handle the Novoalign/GATK bundle.

We should be able to get a normal CI build for a standard commit down to about 2 minutes.

post_download parameter for return code

Add an optional parameter to DownloadPackage.post_download, defaulting to zero, that specifies the return code we assert for equality. If the caller passes None, skip the return-code check entirely.
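
A minimal sketch of the intended behavior; the rest of the DownloadPackage internals are elided, and the exact signature is an assumption based on this issue:

    import subprocess

    class DownloadPackage:
        def post_download(self, command, expected_returncode=0):
            # Run the post-download command and capture its return code.
            returncode = subprocess.call(command, shell=True)
            # None means "don't check"; otherwise assert equality with the expected code.
            if expected_returncode is not None:
                assert returncode == expected_returncode, (
                    'post_download exited with %s, expected %s'
                    % (returncode, expected_returncode))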

RTD command line usage docs are missing descriptions

The initial description of each subcommand in all of the command-line python scripts is missing from http://viral-ngs.readthedocs.org/; they all show up as "Undocumented". The documentation of each individual argument is fine; we're just missing the preamble paragraph that describes the purpose of each command. These are supposed to be read automatically from the docstring of each command's main method.

I think sphinx-argparse is looking in the wrong place (the subcommand's "help" field instead of its "description" field).

Toolify V-Phaser2

The V-Phaser2 iSNV caller (written by Xiao, Broad Viral Genomics) belongs to us, so we can just incorporate the relevant C code into this project, make sure the Makefile works more generally, and wrap it all up in a new Tool class.

This program takes a BAM file as input (representing aligned reads) and spits out a custom-formatted text file that describes all the intra-host variation in the sample. It is one of the few such tools that calls indels as well.

The resulting output is pretty permissive: subsequent steps will need to filter the calls based on various criteria, and the coordinates will also need to be remapped in most cases.

Stop redistributing maf-convert.py

Stop redistributing maf-convert.py. Instead, run 2to3 on it during install and edit the result so that it runs on both 2.x and 3.x (the raw output of 2to3 runs on neither without editing).

vphaser p-value computation is unstable

When V-Phaser 2 is tested on the sample file it is distributed with, 4528.454.indelRealigned.bam, some of the p-values in the resulting .txt files vary from run to run. Only the .txt files vary; the other output files are identical. Across 10 runs, 5 produced the same p-values, 3 of the remaining runs matched each other, and the p-values of the last 2 runs differed both from each other and from all the rest.

Here's a sample diff:

diff output1/V4528_assembly.var.raw.txt output2
32c32
< 2872 A G 0.2215 snp 0.7186 A:1:5 G:436:393
---
> 2872 A G 0.3361 snp 0.7186 A:1:5 G:436:393
41c41
< 3260 C A 0.4104 snp 0.7018 A:506:343 C:1:5
---
> 3260 C A 0.5249 snp 0.7018 A:506:343 C:1:5
57c57
< 4236 G A 0.6867 snp 0.5566 A:534:538 G:2:4
---
> 4236 G A 0.6939 snp 0.5566 A:534:538 G:2:4
134c134
< 8191 A T 0.6692 snp 1.354 A:1:5 T:296:141
---
> 8191 A T 0.7388 snp 1.354 A:1:5 T:296:141

For now, TestVPhaser works around the instability by not comparing the .txt files; that workaround should be removed once the instability is fixed.

Reduce disk-space footprint

Our fully-installed setup consumes a significant amount of disk space, which might be causing random Travis failures. Places where we can trim some fat:

  1. NCBI BLAST tool installation: we download a tarball of statically compiled binaries totaling > 500MB. Most of these can be deleted, as we only use blastn and makeblastdb currently.
  2. Trinity tool installation: this is > 400MB, but most of this is in sample_data and trinity-plugins, which we may be able to do without.
  3. Travis tests: the bmtagger test creates a huge index file (>100MB), which should be deleted in the test's tearDown.

update location of dynamically-linked library specified in variant_caller?

This may be a quirk of my setup (OSX 10.10.4 with Xcode 6.4 installed and gcc 4.9 via Homebrew), but I had to change the variant_caller binary to have the correct location for the libgomp.1.dylib library via the install_name_tool command, as described here:

sudo install_name_tool -change /usr/local/opt/gcc49/lib/gcc/x86_64-apple-darwin13.4.0/4.9.1/libgomp.1.dylib /usr/local/opt/gcc/lib/gcc/4.9/libgomp.1.dylib /Users/tomkinsc/git-repositories/viral-ngs/tools/binaries/V-Phaser-2.0/MacOSX/variant_caller

Without this fix, any command that calls variant_caller (including the unit tests) fails with the following error:

tools.vphaser2: DEBUG: OMP_NUM_THREADS=4 /Users/tomkinsc/git-repositories/viral-ngs/tools/binaries/V-Phaser-2.0/MacOSX/variant_caller -i /Users/tomkinsc/git-repositories/viral-ngs/test/input/TestPerSample/in.indels.bam -o /var/folders/bx/90px0g1n1v122slzjnyvrsk5gn42vm/T/tmp-TestPerSample-Ykv2o6/tmpvsT9FGvphaser2
tools.vphaser2: ERROR: dyld: Library not loaded: /usr/local/opt/gcc49/lib/gcc/x86_64-apple-darwin13.4.0/4.9.1/libgomp.1.dylib
Referenced from: /Users/tomkinsc/git-repositories/viral-ngs/tools/binaries/V-Phaser-2.0/MacOSX/variant_caller
Reason: image not found

implement automated Hs depletion pipeline

In particular, implement the wrapping framework around the component steps, gluing together all the processes needed for human & library contaminant read depletion.

Add multi_align_mafft snakemake rule

Add a variable number_of_chromosomes to the config.json file to represent the number of chromosomes/segments in the organism. Then add a "multi_align_mafft" rule to the interhost.rules file that calls ./interhost.py multichr_mafft, passing in k assembly files, where k is the number_of_chromosomes specified in config.json.
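
A rough sketch of such a rule; the config keys, paths, and the read_samples_file helper are placeholders, and the real rule should follow the conventions already used in interhost.rules:

    rule multi_align_mafft:
        input:
            # one assembly fasta per sample; read_samples_file is a hypothetical helper
            expand(config["dataDir"] + "/assembly/{sample}.fasta",
                   sample=read_samples_file(config["samples_assembly"]))
        output:
            # one aligned fasta per chromosome/segment
            expand(config["dataDir"] + "/multi_align/aligned_{chrom}.fasta",
                   chrom=range(1, config["number_of_chromosomes"] + 1))
        shell:
            "{config[binDir]}/interhost.py multichr_mafft {input} {output}"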

iSNV all-sample merge step (step 2)

Each set of iSNV calls starts off separate for each sample. The calls are in their own coordinate spaces (having been called against their own consensus sequences) and have their own relative definitions of "reference" and "alternate" alleles. We'd like to merge all of this into a single master table, using consistent coordinates and alleles and a standard file format (e.g. VCF). Make a new command in intrahost.py (perhaps called something like "merge_to_vcf").

  • input:
    • a set of marked-up text files, one per sample, with iSNV calls: the output of the command described in #88.
    • each individual sample's consensus assembly (fasta) that describes their reference alleles and coordinate space.
    • a master reference assembly (fasta) that describes our target coordinate space and reference alleles for the output file.
  • output: a single VCF file

Steps:

  1. For each sample:

    a. Align the individual assembly to the master reference with MUSCLE. Write custom code that reads that alignment and can map arbitrary coordinates (chr name, position) between the two coordinate spaces at will.

    b. Map all iSNV calls to the target coordinate space. Shift deletion positions by one to match the VCF conventions for describing deletions.

    c. Flip alleles (ref & alt) if necessary to match the target reference genome.

  2. Across all samples:

    a. Sort and group all data per genomic position. The VCF format describes one row per position in the genome. This requires us to merge together data from all files into a single output row since our input describes each sample separately. But more than that, V-Phaser often describes a single position multiple times for a single sample, if multiple types of variants (SNPs, insertions, deletions) are seen at the same place. In this step, we must merge all variants together across all samples, and pad out their allele lengths with extra genome sequence if necessary.

    b. If V-Phaser had no output at a given position for a sample, reach back into that sample's consensus assembly and record that consensus allele at 100% (V-Phaser only reports variant positions). If the consensus assembly is an N (insufficient read support at this position), don't do this. If the consensus assembly is gapped at this position, figure out the proper indel to put in this place (at 100%).

The existing "vphaser_to_vcf" command is a good starting point to refer to for the code, but it doesn't do all of the above, and it also does a few steps that belong in #88 instead. In particular, it doesn't do the MUSCLE coordinate remapping and doesn't handle any consensus-level indels, because none of these existed in our Summer 2014 data set, but they definitely exist in our current EBOV data. (They existed in previous Lassa data sets, but we weren't using V-Phaser back then.)

Clearly, this step will need some decent unit testing...
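
A skeleton of the control flow described above. The helpers here (the CoordMapper usage, load_isnv_calls, flip_alleles, write_vcf_row) are hypothetical names sketching the shape of the command, not the real implementation:

    from collections import defaultdict

    def merge_to_vcf(ref_fasta, sample_fastas, isnv_files, out_vcf):
        rows = defaultdict(list)  # (chrom, pos) -> per-sample calls in ref coordinates
        for sample_fasta, isnv_file in zip(sample_fastas, isnv_files):
            # Step 1a: align this sample's assembly to the master reference (MUSCLE)
            mapper = CoordMapper(ref_fasta, sample_fasta)
            for call in load_isnv_calls(isnv_file):
                # Step 1b: map the call into the target coordinate space
                # (with VCF-style, off-by-one positions for deletions)
                chrom, pos = mapper.mapBtoA(call.chrom, call.pos)
                # Step 1c: flip ref/alt if this sample's consensus differs from the reference
                call = flip_alleles(call, ref_fasta, chrom, pos)
                rows[(chrom, pos)].append((sample_fasta, call))
        # Step 2: emit one VCF row per genomic position, merging all samples
        # and variant types, padding allele lengths where necessary
        for (chrom, pos) in sorted(rows):
            write_vcf_row(out_vcf, chrom, pos, rows[(chrom, pos)])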

VCF annotation with snpEff

Annotate a VCF file (the output of #90) with gene/protein coding consequences using snpEff. Create a new top-level python script called annotate.py and add two new commands to it: "snpEff_makedb" and "snpEff_vcf". The first will take a genbank-formatted annotation file (or perhaps even a GenBank accession number, querying Entrez directly?) and load it into snpEff's internal database under a given name. The second will use that database to annotate a VCF file; the only difference in the VCF file should be the INFO column. @dpark01 has some python code for this already elsewhere.
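
A rough sketch of the two commands as thin wrappers around the snpEff CLI. The exact snpEff invocations and data-directory conventions shown here are from memory and would need checking against the snpEff documentation:

    import subprocess

    def snpEff_makedb(db_name, snpeff_jar='snpEff.jar'):
        # assumes the genbank file has been placed under snpEff's data/<db_name>/
        # directory and the database name registered in snpEff.config
        subprocess.check_call(
            ['java', '-jar', snpeff_jar, 'build', '-genbank', '-v', db_name])

    def snpEff_vcf(db_name, in_vcf, out_vcf, snpeff_jar='snpEff.jar'):
        # annotation should change only the INFO column of each record
        with open(out_vcf, 'w') as outf:
            subprocess.check_call(
                ['java', '-jar', snpeff_jar, 'eff', db_name, in_vcf], stdout=outf)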

iSNV filtering based on tuneable criteria (step 3)

Given a VCF file of iSNV calls (the output of #89), filter the iSNV calls based on their frequencies, library support, etc. (mostly looking at the # of libraries and a chi-sq bias test). The exact cutoffs used here will be refined by the experimental wet-lab work Shirlee is doing, so we'll set some defaults that we used last summer and otherwise expose them as optional parameters. Make a new command in intrahost.py (maybe something like "filter_vcf").

  • Input: VCF
  • Output: VCF

Steps:

  1. For each sample and each call:

    a. filter on number of libraries (any call below a certain allele frequency cutoff will require more independent libraries to believe it)

    b. filter on library concordance (anything with a discordant chi-sq above some cutoff will be tossed; see the sketch after this list)

  2. Filter out alleles that have no more samples supporting them

  3. Filter out rows that have no more samples with iSNVs
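
A minimal sketch of the per-call filters, assuming a hypothetical call object carrying per-library counts and scipy being available; the default cutoffs below are placeholders, not last summer's values:

    from scipy.stats import chi2_contingency

    def keep_call(call, freq_cutoff=0.005, min_libs_low_freq=2, max_chisq=20.0):
        # 1a: low-frequency calls require support from more independent libraries
        if call.freq < freq_cutoff and call.n_libraries < min_libs_low_freq:
            return False
        # 1b: toss calls whose allele counts are discordant across libraries
        # (library_count_table: one row per library, one column per allele)
        if call.n_libraries > 1:
            chisq, pval, dof, expected = chi2_contingency(call.library_count_table)
            if chisq > max_chisq:
                return False
        return True

Steps 2 and 3 then drop alleles, and whole rows, that are left with no supporting samples.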

possible bug in VPhaser2 variant percentage calculation

I was running viral-ngs/intrahost.py vphaser_one_sample on the command line when I noticed what I think might be a bug in the calculation of the variant percentage.

When V-Phaser2 reports two SNPs at the same location, the Var_percent column seems to be the sum of the variants' percents. Was this intended? Below is an example of what I am talking about:

1275 A G 0.02322 snp 0.400866 G:5009:7414:1 A:6:24:1 T:13:7:1

0.4% is not the percent of the variant called (in this case, A), but is instead the sum of both the A and T percents, as the arithmetic below shows.
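
Reading the first two colon-separated numbers per allele as forward and reverse read counts (an assumption about V-Phaser's output format), the arithmetic bears this out exactly:

    total reads = 5009 + 7414 + 6 + 24 + 13 + 7 = 12473
    A percent   = (6 + 24) / 12473 = 0.2405%
    T percent   = (13 + 7) / 12473 = 0.1603%
    A + T       = 50 / 12473       = 0.400866%, the reported Var_percent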

I tried to email the authors of VPhaser2, but both Xiao Yang's and M Zody's Broad Inst. emails seem to be no longer active and the google group appears to be dead.

Toolify MAFFT

Create new Tool subclass for the MAFFT multi-aligner, available here: http://mafft.cbrc.jp/alignment/software/

MAFFT has precompiled MacOSX and Linux binaries, as well as source. Try the Linux "portable package" version (not the RPM or DEB version) and the Mac "all-in-one" version (basically, never assume you have root privs). If the Linux binary succeeds on Travis as well as on Broad CentOS (copper, tin, nickel) and Broad RHEL (gold, silver), then we're all set and you can skip the source-compile installer. Otherwise, you might need to implement a compile-from-source step.

The MUSCLE tool has an example of a two-installer approach depending on the OS, sketched below. If the OS is Linux or Darwin x86, two install methods are appended: the first downloads the precompiled binary; the second downloads and unpacks the source and calls make. If the OS is not one of those two known ones, then only the source installer is appended.
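
A sketch of what the MAFFT Tool subclass could look like, following that MUSCLE pattern. The URL table, target paths, and the exact tools.DownloadPackage arguments are placeholders rather than the real installer API:

    import os
    import tools

    class MafftTool(tools.Tool):
        def __init__(self, install_methods=None):
            if install_methods is None:
                install_methods = []
                if os.uname()[0] in ('Linux', 'Darwin'):
                    # first choice: the precompiled portable / all-in-one package
                    install_methods.append(tools.DownloadPackage(
                        MAFFT_BINARY_URLS[os.uname()[0]],   # hypothetical URL table
                        target_rel_path='mafft/bin/mafft'))
                # fallback (and the only option on unknown OSes): build from source
                install_methods.append(tools.DownloadPackage(
                    MAFFT_SOURCE_URL,                       # hypothetical
                    target_rel_path='mafft/bin/mafft',
                    post_download_command='make -C mafft/core'))
            tools.Tool.__init__(self, install_methods=install_methods)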

nosetests -v test.unit.test_tools.TestToolsInstallation will tell you if your installer succeeded. On your local Mac checkout, it may fail on some other tools (like m-vicuna and v-phaser2) unless you have a special build environment set up, but you can ignore those errors.

Novoalign mangles read groups in BAMs

Apparently, running Novoalign (at least the 3.02.02 version we use) on BAM input with BAM output will appear to run fine, but the output BAM header will contain only one of the read groups (if the input BAM had many), and all of the reads in the output BAM get reassigned to that one read group!

Solution: see if a newer version of Novoalign fixes this. Alternatively, rework our Novoalign tool wrapper to break multi-RG BAMs into pieces (one per RG), run Novoalign separately on each one, and merge the results back together. A pysam-based sketch of the split step follows.
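
A minimal pysam sketch of that split; the real wrapper would also run Novoalign on each piece and merge the outputs:

    import pysam

    def split_bam_by_rg(in_bam_path):
        out_paths = []
        with pysam.AlignmentFile(in_bam_path, 'rb', check_sq=False) as in_bam:
            full_header = in_bam.header.to_dict()
            for rg in full_header.get('RG', []):
                header = dict(full_header, RG=[rg])   # keep only this read group
                out_path = '%s.%s.bam' % (in_bam_path, rg['ID'])
                with pysam.AlignmentFile(out_path, 'wb', header=header) as out_bam:
                    for read in in_bam.fetch(until_eof=True):
                        if read.get_tag('RG') == rg['ID']:
                            out_bam.write(read)
                in_bam.reset()                        # rewind for the next read group
                out_paths.append(out_path)
        return out_paths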

Testing: write a unit test with an input BAM with two read pairs, each from a different RG, and assert that output BAMs have properly assigned RGs (same as they were originally assigned) and that the output header describes both RGs.

VCF files to text tables

VCF files are the standard representation for such data, but tabular text is easier for many people to manipulate in R, scripts, spreadsheets, etc. intrahost.py iSNV_table contains the code used last summer to accomplish this. We should just verify that it continues to work properly in our soon-to-be-implemented iSNV pipeline. There might not be much to do here.

implement coordinate mapper

Implement a generic mapping utility (a class called interhost.CoordMapper) that allows us to map coordinates between related genomes.

Example input genomes:

Genome A:

>chr1
ATGCATGC
>chr2
GGTTTTC

Genome B:

>first_chrom
GCATTTGC
>second_chr
GGTTTCCAC

Expected outputs:

  • mapAtoB('chr1', 5) should return ('first_chrom', 3)
  • mapBtoA('first_chrom', 7) should return ('chr1', 7)
  • mapAtoB('chr1', 1) should ... maybe raise an IndexError, or maybe return a negative number or something (we can figure that out). Actually, I think IndexError might be better.
  • mapBtoA('first_chrom', 4) or 5 or 6 should all return the same answer
  • mapAtoB('chr1', 6) should return a deterministic answer (perhaps 4)

I've checked a basic framework into interhost.py and test/unit/test_interhost.py, but it's full of NotImplementedErrors (and @unittest.skip decorators).
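
Once an alignment exists, the core bookkeeping is straightforward. A sketch of building the mapping table from two gapped sequences out of one alignment, using 1-based positions as in the examples above; how to answer queries that land in gaps (IndexError vs. a nearest-base convention) is the open design question:

    def build_map(aligned_a, aligned_b):
        # aligned_a and aligned_b are equal-length gapped strings from one alignment
        a_to_b = {}
        pos_a = pos_b = 0
        for col_a, col_b in zip(aligned_a, aligned_b):
            if col_a != '-':
                pos_a += 1
            if col_b != '-':
                pos_b += 1
            if col_a != '-' and col_b != '-':
                a_to_b[pos_a] = pos_b
        return a_to_b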

update ncbi annotation transfer to use pre-built alignments

ncbi.py tbl_transfer currently takes a pair of fasta files and pairwise aligns them.

Create a new command in ncbi.py, tbl_transfer_prealigned, that does the same thing as tbl_transfer but, instead of taking two fasta files as input, takes pre-made alignments along with the names of the reference sequence and the alternate sequence within the alignment. The input alignment may contain more than these two sequences; the other sequences will be ignored.

Also update the Snakemake rule (rule annot_transfer in interhost.rules) accordingly.

This depends on #127 .

Make MAFFT alignment tool work with multiple samples containing multiple chromosomes

In the case where MAFFT is being used to align sequences stored in sample-specific FASTA files, MAFFT should be able to do an alignment among the respective chromosomes from the FASTA files, such that the output is a series of chromosome-specific alignments in FASTA files.

In the case of two samples from organisms with two chromosomes, the input and output would be as follows:

Input: 
    sample1.fasta would contain chr1, chr2
    sample2.fasta would also contain chr1, chr2
Output:
    chr1.fasta would contain chr1_sample1, chr1_sample2
    chr2.fasta would contain chr2_sample1, chr2_sample2

Before MAFFT runs, an intermediate step is needed to transpose the chromosomes from their respective per-sample FASTA files into chromosome-specific FASTA files. This can be done via itertools.izip (Python 2) / zip (Python 3) and Bio.SeqIO.parse(), which returns an iterator. A sketch follows.
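
A sketch of that transposition, assuming every per-sample FASTA lists its chromosomes in the same order (Python 3 zip shown; use itertools.izip on Python 2). The record and file naming is illustrative:

    import Bio.SeqIO

    def transpose_chromosomes(sample_fastas, sample_names):
        parsers = [Bio.SeqIO.parse(f, 'fasta') for f in sample_fastas]
        # zip yields one record per sample for chr1, then for chr2, etc.
        for i, records in enumerate(zip(*parsers), start=1):
            for rec, sample in zip(records, sample_names):
                rec.id = '%s_%s' % (rec.id, sample)   # e.g. chr1_sample1
                rec.description = ''
            Bio.SeqIO.write(records, 'chr%d.fasta' % i, 'fasta')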

Ensure no tests have the executable bit

Any tests that have the executable bit set will be silently excluded by nose when the automatic tests run in Travis. To detect this, put a check in test/__init__.py that asserts that no test has the executable bit set (a sketch is below). Also remove scripts/__init__.py, since the python files in scripts are not import-safe.
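
A minimal sketch of the check, to live in test/__init__.py:

    import os
    import stat
    import unittest

    class TestNoExecutableTests(unittest.TestCase):
        def test_no_executable_bit(self):
            exe_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
            for dirpath, dirnames, filenames in os.walk(os.path.dirname(__file__)):
                for fn in filenames:
                    if fn.endswith('.py'):
                        path = os.path.join(dirpath, fn)
                        self.assertFalse(
                            os.stat(path).st_mode & exe_bits,
                            '%s is executable; nose will silently skip it' % path)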

Create simple MAFFT command line entry point

In interhost.py, create a new command called "align_mafft". It should take one or more input fasta files and produce one output fasta alignment file. You may expose some of MAFFT's optional parameters via argparse; the ones we will likely use include --localpair, --maxiterate, --reorder, --ep, --preservecase, and --thread.

Depends on #118

Replace and generalize align_and_orient steps.

The current align_and_orient steps rely on VFAT and some other custom scripts. They should be rewritten to remove the dependency on Mosaik and to work properly on multi-chromosome genomes. A good starting point would be to steal the Bellini python code for merge_contigs and order_and_orient.

Make pysam work with Travis

pysam mostly works with Travis. The exceptions are a few functions that embed samtools functionality (like pysam.view), which do some weird redirection of stdin/stdout; Travis does some weird redirection of stdin/stdout as well (and so does nosetests, but we can turn that off). pysam.view and friends seem to work fine in real life, but the tests on Travis fail when they use them.

If we can figure out how to make pysam.view work, we can get rid of the Samtools tool entirely, which is one less big thing to download and compile (and it should be cleaner than invoking subprocess.call every time).

Don't fail if path contains spaces

Currently, most of our commands will fail if any of the paths contain spaces or other characters interpreted by the shell. The reason is that the paths get interpreted by the shell twice: once when the user invokes our command, and again when we call os.system. This can easily be fixed by putting single quotes around all the paths in our calls to os.system. Alternatively, we could use subprocess.call instead of os.system, since that is now the recommended way to do this, but that would be more work and I'm not sure what the benefit is. Both options are sketched below.
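
A sketch of both fixes; assembly_tool and the paths are made up for illustration, and shlex.quote is Python 3 (use pipes.quote on Python 2):

    import os
    import shlex
    import subprocess

    in_path = 'my reads.bam'
    out_path = 'out dir/result.fasta'

    # Fix 1: keep os.system, but shell-quote every path
    os.system('assembly_tool -i %s -o %s'
              % (shlex.quote(in_path), shlex.quote(out_path)))

    # Fix 2: pass an argument list to subprocess.call, so no shell
    # ever re-interprets the paths
    subprocess.call(['assembly_tool', '-i', in_path, '-o', out_path])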

resolve irritating whitespace/tab issue

There is an irritating tab/whitespace issue due to one of our pre-commit hooks. Specifically, it sees leading spaces as errors and changes them to tabs. This is mostly an issue when spaces are used to align arguments across multiple lines.

iSNV per-sample processing pipeline (step 1)

After V-Phaser2 produces iSNV calls, a few filtration and markup steps should be performed. Create a new command in intrahost.py (maybe something like "vphaser_one_sample") that looks like the following:

  • Input: a single BAM file, representing reads from one sample, mapped to its own consensus assembly. It may contain multiple read groups and libraries.
  • Output: a single tabular text file, similar to V-Phaser's output format, but with some additional columns added (# libraries, chi-sq for library discordance).

Steps:

  1. Call V-Phaser2: BAM -> text file with iSNV calls
  2. Hard filtering of iSNVs based on read counts and strand bias on a per-library basis. Use pysam to reach into the original BAM file and query the read pileup at each iSNV position (see the sketch after this list). At each position, count how many reads are on each strand in each library. Toss out the counts for any library that fails the strand-bias / minimum-read-count criteria. Toss out iSNVs that have no libraries left at this point. Pass any remaining iSNVs downstream along with their per-library read counts.
  3. For each iSNV, add columns that denote the number of libraries that survived the above filters and a chi-sq or Fisher's exact test score for the agreement of allele frequencies across libraries (for iSNVs with > 1 library).
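
A sketch of the per-library, per-strand counting in step 2. The lib_of_rg mapping (read group ID -> library) and the surrounding cutoff logic are left out, and positions are 1-based:

    import pysam
    from collections import defaultdict

    def library_strand_counts(bam_path, chrom, pos, allele, lib_of_rg):
        counts = defaultdict(lambda: [0, 0])          # library -> [fwd, rev]
        with pysam.AlignmentFile(bam_path, 'rb') as bam:
            for col in bam.pileup(chrom, pos - 1, pos, truncate=True):
                for p in col.pileups:
                    if p.is_del or p.is_refskip:
                        continue                      # no base aligned here
                    if p.alignment.query_sequence[p.query_position] != allele:
                        continue                      # read supports a different allele
                    lib = lib_of_rg[p.alignment.get_tag('RG')]
                    counts[lib][1 if p.alignment.is_reverse else 0] += 1
        return counts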

update iSNV pipeline to use pre-built alignments

intrahost.py merge_to_vcf currently has a required --assemblies input parameter that takes a ton of fasta files (one per sample). It then pairwise aligns everything before doing anything useful; this often takes an hour or so and is very suboptimal.

Instead, it should take a required --alignments input parameter: one aligned fasta file per chromosome/segment. Each alignment must contain all samples as well as the reference genome.

Also update the Snakemake rule (rule isnvs_vcf in intrahost.rules) accordingly.

This depends on #127 .
