hall-lab / svtyper

Bayesian genotyper for structural variants

License: MIT License

Python 98.84% Shell 0.26% R 0.68% Makefile 0.22%

genotype vcf bioinformatics genomics

svtyper's Introduction

SVTyper

Bayesian genotyper for structural variants

Overview

SVTyper performs breakpoint genotyping of structural variants (SVs) using whole genome sequencing data. Users must supply a VCF file of sites to genotype (which may be generated by LUMPY) as well as a BAM/CRAM file of Illumina paired-end reads aligned with BWA-MEM. SVTyper assesses discordant and concordant reads from paired-end and split-read alignments to infer genotypes at each site. Algorithm details and benchmarking are described in Chiang et al., 2015.

Installation

Requirements:

Python 2.7.x

Install via `pip`

pip install git+https://github.com/hall-lab/svtyper.git

svtyper depends on pysam (version 0.15.0 or newer), numpy, and scipy; svtyper-sso additionally depends on cytoolz. If the dependencies aren't already available on your system, pip will attempt to download and install them.

`svtyper` vs `svtyper-sso`

svtyper is the original implementation of the genotyping algorithm and works with multiple samples. svtyper-sso is an alternative, parallelized implementation that is optimized for genotyping a single sample. It takes advantage of multiple CPU cores via the multiprocessing module and can offer a 2x or greater speedup when genotyping a single sample, depending on how many CPU cores are used.

NOTE: svtyper-sso is not yet stable. There are minor logging differences between the two, and svtyper-sso may exit prematurely with an error when processing CRAM files.
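The batch-and-parallelize idea described above can be sketched in a few lines. This is a minimal illustration, not svtyper-sso's actual code: `make_batches`, `genotype_batch`, and `genotype_parallel` are hypothetical names, and the worker only returns a placeholder genotype.

```python
from multiprocessing import Pool


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def genotype_batch(batch):
    """Hypothetical worker: a real one would genotype every SV in the batch."""
    return [(sv_id, "./.") for sv_id in batch]


def genotype_parallel(sv_ids, cores=2, batch_size=1000):
    """Distribute batches over worker processes and flatten the results."""
    batches = make_batches(sv_ids, batch_size)
    pool = Pool(processes=cores)
    try:
        # Pool.map returns results in the same order as the input batches.
        results = pool.map(genotype_batch, batches)
    finally:
        pool.close()
        pool.join()
    return [record for batch in results for record in batch]
```

Because the workers are separate processes, `genotype_parallel` should be called from a script's `if __name__ == "__main__":` section. The speedup is bounded by the number of batches, so a very large batch size on a small VCF leaves cores idle.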

Example Usage

`svtyper`

As a Command Line Python Script

svtyper \
    -i sv.vcf \
    -B sample.bam \
    -l sample.bam.json \
    > sv.gt.vcf

As a Python Library

import svtyper.classic as svt

input_vcf = "/path/to/input.vcf"
input_bam = "/path/to/input.bam"
library_info = "/path/to/library_info.json"
output_vcf = "/path/to/output.vcf"

with open(input_vcf, "r") as inf, open(output_vcf, "w") as outf:
    svt.sv_genotype(bam_string=input_bam,
                    vcf_in=inf,
                    vcf_out=outf,
                    min_aligned=20,
                    split_weight=1,
                    disc_weight=1,
                    num_samp=1000000,
                    lib_info_path=library_info,
                    debug=False,
                    alignment_outpath=None,
                    ref_fasta=None,
                    sum_quals=False,
                    max_reads=None)

# Results will be inside the /path/to/output.vcf file

`svtyper-sso`

As a Command Line Python Script

# --core: number of CPU cores to use
# --batch_size: number of SVs to process in a single batch (default: 1000)
# --max_reads: skip genotyping if an SV has more valid reads than this threshold (default: 1000)
svtyper-sso \
    --core 2 \
    --batch_size 1000 \
    --max_reads 1000 \
    -i sv.vcf \
    -B sample.bam \
    -l sample.bam.json \
    > sv.gt.vcf

As a Python Library

import svtyper.singlesample as sso

input_vcf = "/path/to/input.vcf"
input_bam = "/path/to/input.bam"
library_info = "/path/to/library_info.json"
output_vcf = "/path/to/output.vcf"

with open(input_vcf, "r") as inf, open(output_vcf, "w") as outf:
    sso.sso_genotype(bam_string=input_bam,
                     vcf_in=inf,
                     vcf_out=outf,
                     min_aligned=20,
                     split_weight=1,
                     disc_weight=1,
                     num_samp=1000000,
                     lib_info_path=library_info,
                     debug=False,
                     alignment_outpath=None,
                     ref_fasta=None,
                     sum_quals=False,
                     max_reads=1000,
                     cores=2,
                     batch_size=1000)

# Results will be inside the /path/to/output.vcf file

Development

Requirements:

Python 2.7 or newer
GNU Make
virtualenv (or conda for anaconda or miniconda users)

Setting Up a Development Environment

Using `virtualenv`

git clone https://github.com/hall-lab/svtyper.git
cd svtyper
virtualenv myvenv
source myvenv/bin/activate
pip install -e .
<add, edit, or delete code>
make test

# when you're finished with development
git push <remote-name> <branch>
deactivate
cd .. && rm -rf svtyper

Using `conda`

git clone https://github.com/hall-lab/svtyper.git
cd svtyper
conda create --channel bioconda --name mycenv pysam numpy scipy cytoolz # type 'y' when prompted with "proceed ([y]/n)?"
source activate mycenv
pip install -e .
<add, edit, or delete code>
make test


# when you're finished with development
git push <remote-name> <branch>
source deactivate
cd .. && rm -rf svtyper
conda remove --name mycenv --all

Troubleshooting

Many common issues are related to abnormal insert size distributions in the BAM file. SVTyper provides methods to assess and visualize the characteristics of sequencing libraries.

Running SVTyper with the -l flag creates a JSON file with essential metrics on a BAM file. SVTyper will sample the first N reads from the file (1 million by default) to parse the libraries, read groups, and insert size histograms. This can be done in the absence of a VCF file.

svtyper \
    -B my.bam \
    -l my.bam.json

The lib_stats.R script produces insert size histograms from the JSON file:

scripts/lib_stats.R my.bam.json my.bam.json.pdf
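For a quick numeric check without R, an insert size histogram can also be summarized in Python. The sketch below assumes the histogram is available as a mapping from insert size to read-pair count; the exact key layout of the JSON file is not described here, so `summarize_histogram` (a hypothetical helper) takes the mapping directly rather than guessing at the file structure.

```python
import math


def summarize_histogram(hist):
    """Return mean, sd, and median of an insert size histogram.

    hist maps insert size (int or numeric string) to a count or weight.
    """
    items = sorted((int(size), float(count)) for size, count in hist.items())
    total = sum(count for _, count in items)
    if total <= 0:
        raise ValueError("histogram is empty")
    mean = sum(size * count for size, count in items) / total
    variance = sum(count * (size - mean) ** 2 for size, count in items) / total
    # Median: first insert size at which the cumulative count reaches half.
    running = 0.0
    median = items[-1][0]
    for size, count in items:
        running += count
        if running >= total / 2.0:
            median = size
            break
    return {"mean": mean, "sd": math.sqrt(variance), "median": median}
```

A mean far from the median, or a very large standard deviation, is one sign of the abnormal insert size distributions mentioned above.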

Citation

C Chiang, R M Layer, G G Faust, M R Lindberg, D B Rose, E P Garrison, G T Marth, A R Quinlan, and I M Hall. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Meth 12, 966–968 (2015). doi:10.1038/nmeth.3505.

http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html

svtyper's Issues

reads dropped in non deletion events

This notation is not allowed by python

https://github.com/hall-lab/svtyper/blob/master/svtyper#L457

svtyper gets stuck

It looks like the program gets stuck at a certain point: it stops outputting results at a certain chromosome. Here is the command I used:

svtyper -B ${bamfile} -S ${splitters} -i in.vcf -M

Did I miss any necessary input parameters?

Thanks.

Downsample high depth regions

SVTyper may become slow or unresponsive in very deep regions of the genome, especially centromeres and other repetitive areas. Add a parameter to downsample regions of excessive depth (perhaps with reservoir sampling?) to N reads (default maximum probably 200).
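Reservoir sampling, as suggested above, keeps a uniform random sample of at most N items from a stream of unknown length in a single pass. The following is a generic sketch of the classic algorithm (Algorithm R), not code from svtyper; `reservoir_sample` and its parameters are illustrative names.

```python
import random


def reservoir_sample(reads, n, rng=None):
    """Return a uniform random sample of at most n items from an iterable."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, read in enumerate(reads):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(read)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = read
    return reservoir
```

Memory use stays bounded by N regardless of depth, which is the property that matters in centromeric or repetitive regions.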

How to genotype many samples

I have about 100 samples. I ran lumpyexpress with -P on each and got sample1.vcf, sample2.vcf, sample3.vcf, etc.
lumpyexpress \
    -B sample1.bam \
    -S sample1.splitters.bam \
    -D sample1.discordants.bam \
    -P \
    -o sample1.vcf

Then I used l_sort.py and l_merge.py to merge all the VCF files into samples.sorted.merge.vcf:
python l_sort.py sample1.vcf, sample2.vcf, sample3.vcf > samples.sorted.vcf
python l_merge.py -i samples.sorted.vcf > samples.sorted.merge.vcf

I tried to run SVTyper to genotype each sample.
svtyper \
    -B sample1.bam \
    -S sample1.splitters.bam \
    -i samples.sorted.merge.vcf \
    -M \
    > sample1.gt.vcf

The errors below occurred. Should I extract each sample's VCF from samples.sorted.merge.vcf again and then run SVTyper for each sample? Thanks in advance for your help!

Traceback (most recent call last):
File "/WORK/app/bcbio//bin/svtyper", line 1413, in <module>
sys.exit(main())
File "/WORK/app/bcbio//bin/svtyper", line 1400, in main
args.debug)
File "/WORK/app/bcbio//bin/svtyper", line 1234, in sv_genotype
out_bam)
File "/WORK/app/bcbio//bin/svtyper", line 574, in count_pairedend
mate_mapq = get_mate_mapq(sample.bam, read) # move this for speed
File "/WORK/app/bcbio//bin/svtyper", line 367, in get_mate_mapq
mq = bam.mate(read).mapq
File "pysam/calignmentfile.pyx", line 1007, in pysam.calignmentfile.AlignmentFile.mate (pysam/calignmentfile.c:12008)
ValueError: mate not found

SVTyper discarding VCF entries

I recently used SVTyper to genotype a big dataset. To do so, I split the dataset into several files with 10000 VCF entries each. A simple wc -l on the resulting files shows that their line count is around 2000-3000 lines. Which lines are being dropped, and why? I couldn't find it in the docs. I would guess that it depends on QUAL, but there are still many entries with QUAL = 0.0.

Thank you

speed up options

Hello,
I am using svtyper on the results of a lumpy somatic VCF and am therefore running it with 2 bwa-mem BAMs and 2 split-read BAM files. I find that progress is very slow at this point: ~200 variants out of 9000 have been genotyped after 20 hrs. Just wondering if there are any flags or ways to speed up the process other than splitting the VCF into smaller chunks.

thanks
Arun

VCF header preservation

Pervasive bug in svtyper and several scripts that removes unrecognized header info (FILTER, weird ##GATKCommandLine= stuff, etc).

Need to be more permissive in header fields, while preventing duplication of header info IDs.

recommendations for pysam version

I'm having a difficult time getting a version of pysam (installed via conda) that works with svtyper, for various reasons. I have the PM field populated in the RG tag, which pysam does not appear to recognize as a valid RG tag field (I tried the most current version of pysam as well as an older one and got the same errors; see https://groups.google.com/forum/#!topic/pysam-user-group/9Cbe1M2Y7gQ). The newer version of pysam causes problems with their samtools distro of 1.5.2 because of how they distribute bzip. Reverting to samtools 1.4.1 with conda fixes the samtools issue, but now I have a different issue with pysam (which gets downgraded to 0.11.2.2 when installing samtools 1.4.1). This is the new error with pysam, which looks like a bzip shared library problem again. The OS is RHEL 6.7.

khetric1@sunrhel2> svtyper -M --dump $TMPDIR/temp.sv -B 15546-0183702140.bam -S 15546-0183702140.lumpy.discordant.split.reads.sort.bam -i lumpy.filtered.bam.vcf -o lumpy.filtered.bam.gt.vcf
Traceback (most recent call last):
File "/isilon/sequencing/Kurt/Programs/SVTyper/temp/svtyper/svtyper", line 3, in <module>
import pysam
File "/isilon/sequencing/peng/bin/anaconda/lib/python2.7/site-packages/pysam/__init__.py", line 5, in <module>
from pysam.libchtslib import *
ImportError: libbz2.so.1.0: cannot open shared object file: No such file or directory

ignore proper pair flag when defining distribution

this is biasing the expectation of the insert size distribution

add GT description to header

error/warning on singleton BND

if both ends of a BND are not present in the VCF sent for genotyping, svtyper will silently ignore the one that is present. It would be nice if this at least printed a warning message.
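The check requested above could look roughly like the sketch below. It is illustrative only: records are represented as plain dictionaries with hypothetical keys `id`, `svtype`, and `mateid` (standing in for the VCF ID column and the SVTYPE and MATEID INFO fields), and `find_singleton_bnds` is not a function in svtyper.

```python
def find_singleton_bnds(records):
    """Return IDs of BND records whose mate is missing from the same set."""
    ids = set(record["id"] for record in records)
    singletons = []
    for record in records:
        if record.get("svtype") != "BND":
            continue
        mate = record.get("mateid")
        # A BND is a singleton if it names no mate or the mate is absent.
        if mate is None or mate not in ids:
            singletons.append(record["id"])
    return singletons
```

A caller could print one warning per returned ID to stderr before genotyping, instead of skipping those records silently.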

AttributeError: 'csamtools.AlignedRead' object has no attribute 'tlen'

Hi,
I am using speedseq and an error occurred:

Traceback (most recent call last):
File "/risapps/src6/speedseq//bin/svtyper", line 1044, in <module>
sys.exit(main())
File "/risapps/src6/speedseq//bin/svtyper", line 1036, in main
args.debug)
File "/risapps/src6/speedseq//bin/svtyper", line 735, in sv_genotype
ins_hist_list[lib] = insert_hist(bam, lib_to_rg[lib])
File "/risapps/src6/speedseq//bin/svtyper", line 658, in insert_hist
or read.tlen <= 0
AttributeError: 'csamtools.AlignedRead' object has no attribute 'tlen'

Do you have an idea what's wrong?
Thanks!
Ming

Update test for v0.1.0

The test script and output haven't yet been updated in the v0.1.0 repository. This came up for: hall-lab/svtools#167

stricter joining of vcfs

check the ids when joining vcfs.

GQ should be Type=Integer

according to the VCF spec.

Running SVTyper on any SV tool

Hi,

I'm hoping to be able to run SVTyper in a generic mode on any SV tool.

I have tried running SVTyper on Manta and SvAbA vcfs, and am running into errors because SVTyper is looking for information that is not listed in these files. For example, some entries in Manta do not have CIPOS and SvAbA does not list CIPOS for any of its entries. SVTyper stops and throws an error because it cannot find CIPOS. There have also been other format-specific errors I've run into, such as SVTyper throwing an error when an additional '>' is found in the FORMAT header section.

Will there be a version of SVTyper that is more lenient with the vcf format, so that it can be run on other SV callers besides Lumpy?

Interpreting Genotype Likelihoods of Complex Structural Variants

We have performed WGS at 40x of a number of families and processed structural variants through the speedseq v0.0.3a pipeline.

I was wondering how to interpret genotype-likelihoods (GL) / genotype-quality (GQ) scores, in particular for more complex structural variants than simple deletions, duplications, and inversions?

We identified a few de novo complex structural variants with very poor GL / GQ scores (e.g. 0 GQ), however we went ahead and successfully validated them through PCR.

At the other end of the scale, we have breakpoints that are called in almost everyone and have 0 genotype quality scores, however a few individuals may have GQ scores > 50. These are probably erroneous.

Also, we have complex events with for example two pairs of breakpoints, where one pair has a 0 GQ score, and the other has a reasonable GQ score (>50, n.b. I grouped breakpoints into single events based on their proximity and overlap, e.g. two pairs of breakpoints overlap and one end of each pair are adjacent to each other in opposing orientations).

I suspect it is very challenging to accurately estimate GLs in complex events that have multiple signals at the same locus.

My temptation is to keep anything de novo regardless of GQ score if it involves >1 pair of breakpoints, since the numbers are manageable. I would also keep anything else with >1 pair of breakpoints where at least one pair of breakpoints has a good median GQ score (>50 or >100?) across the individuals in the cohort that have this pair of breakpoints. Otherwise, I would filter any events with consistently low or zero GQ scores. Please advise!

AttributeError: 'csamtools.AlignedRead' object has no attribute 'has_tag'

I am facing this problem when running HiCExplorer software:

command:
hicBuildMatrix -s mapping/SRR1956527_1.bam mapping/SRR1956527_2.bam -rs dpnII_positions_GRCm38.bed -seq GATC -b hiCmatrix/SRR1956527_ref.bam -o hiCmatrix/SRR1956527.matrix

reading mapping/SRR1956527_1.bam and mapping/SRR1956527_2.bam to build hic_matrix
Minimum distance considered between restriction sites is 300
Max distance: 800
Matrix size: 2666241
dangling sequences to check are {'pat_forw': 'ATC', 'pat_rev': 'GAT'}
Traceback (most recent call last):
File "/usr/local/bin/hicBuildMatrix", line 7, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/hicexplorer/hicBuildMatrix.py", line 644, in main
mate1_supplementary_list = get_supplementary_alignment(mate1, str1)
File "/usr/local/lib/python2.7/dist-packages/hicexplorer/hicBuildMatrix.py", line 410, in get_supplementary_alignment
if read.has_tag('SA'):
AttributeError: 'csamtools.AlignedRead' object has no attribute 'has_tag'

pysam version:
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pkg_resources
>>> pkg_resources.get_distribution("pysam").version
'0.9.1.4'

Can you please help me solve this?
Thanks in advance

problem in running svtyper

Hi,
Recently, I used lumpy to call SVs from a citrus genome successfully. I then wanted to genotype them with svtyper, but ran into a problem.
svtyper genotyped successfully until it reached chromosome 7 at around position 4378666. The output file stopped at this position. There were no error messages and the process was still running.
Do you know what the problem is?

Cheers,

Something funny about reading header

I am getting the following error:

Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/u6000294/src/svtyper/svtyper", line 1129, in <module>
    sys.exit(main())
  File "/uufs/chpc.utah.edu/common/home/u6000294/src/svtyper/svtyper", line 1121, in main
    args.debug)
  File "/uufs/chpc.utah.edu/common/home/u6000294/src/svtyper/svtyper", line 818, in sv_genotype
    vcf.add_header(header)
  File "/uufs/chpc.utah.edu/common/home/u6000294/src/svtyper/svtyper", line 67, in add_header
    self.add_info(*[b.split('=')[1] for b in r.findall(a)])
IndexError: list index out of range

The offending line appears to be

##INFO=<ID=GQT_0,Number=1,Type=Float,Description="GQT maf result from phenotype:'' genotype:'maf()>0'">
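The traceback above shows the header parser splitting on '=' and indexing the second element, which fails when a Description contains characters such as '=' or '>'. One more tolerant approach, sketched here as an assumption rather than as svtyper's actual fix, is to anchor on the fixed field order and treat everything between the Description quotes as free text. `INFO_RE` and `parse_info_header` are hypothetical names, and the pattern only covers the ID, Number, Type, Description order used by the offending line.

```python
import re

# Everything between Description=" and the final "> is taken as free text,
# so '=', '>' and quotes inside the description do not break parsing.
INFO_RE = re.compile(
    r'^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">$'
)


def parse_info_header(line):
    """Parse one ##INFO header line into a dict, or return None if no match."""
    match = INFO_RE.match(line.strip())
    if match is None:
        return None
    keys = ("id", "number", "type", "description")
    return dict(zip(keys, match.groups()))
```

Returning None for unrecognized lines lets the caller pass them through unchanged, which also fits the header preservation request in the "VCF header preservation" issue.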

Numerical result out of range error

Hi,

I'm getting an error running svtyper:
File "/usr/local/packages/seq/lumpy-sv/tools/svtyper/svtyper", line 964, in
gt_sum = sum(10**x for x in gt_lplist)
OverflowError: (34, 'Numerical result out of range')

The entry in the VCF file that causes this error is:
chr1 17684780 42 a <DEL> 0.00 . TOOL=LUMPY;SVTYPE=DEL;SVLEN=-317;END=17685097;STR=+-:20;CIPOS=0,0;CIEND=0,0;EVENT=42;SUP=20;PESUP=6;SRSUP=14;EVTYPE=PE,SR;PRIN GT:SUP:PE:SR 0/1:20:6:14

However, when I run the svtyper version that comes with speedseq I don't get the error for this entry, but for another one:
chr1 188539456 333 a <DEL> 0.00 . TOOL=LUMPY;SVTYPE=DEL;SVLEN=-773;END=188540229;STR=+-:155;CIPOS=0,0;CIEND=0,0;EVENT=333;SUP=155;PESUP=117;SRSUP=38;EVTYPE=PE,SR;PRIN GT:SUP:PE:SR 0/1:155:117:38

I also noticed that the GQ is 0.00 or -0.00 for all SVs that worked.
The input VCF file was generated using the speedseq sv module. The original BAM file was aligned using bwa mem. Discordant BAM files were generated as shown on the lumpy-sv page. I used the current version of speedseq.

Best Regards,
Thomas
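In Python, raising 10 to a float power above roughly 308 produces exactly this OverflowError. A common way to sum values stored as log10 without leaving the representable range is to factor out the maximum before exponentiating. The sketch below shows that general technique; it is not taken from svtyper, and `log10_sum` and `normalize_log10` are illustrative names.

```python
import math


def log10_sum(log10_values):
    """Return log10(sum(10**x for x in log10_values)) without overflow."""
    largest = max(log10_values)
    if largest == float("-inf"):
        return largest
    # Every exponent below is <= 0, so 10.0 ** (...) cannot overflow.
    return largest + math.log10(sum(10.0 ** (x - largest) for x in log10_values))


def normalize_log10(log10_values):
    """Shift log10 likelihoods so that the probabilities sum to 1."""
    total = log10_sum(log10_values)
    return [x - total for x in log10_values]
```

Terms much smaller than the maximum underflow quietly to 0.0, which does not change the sum at double precision.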

'module' object has no attribute 'AlignmentFile'

Hello,

I'm doing a little testing with version 0.1.0 and I've run into an issue.

my command based on the svtyper usage statement:

svtyper -B 101263-101263.bam -i 101263-101263.bam_temp.vcf -o result.test.vcf

And the result:

Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/ucgdstor/common/apps/ember.arches/svtyper/0.1.0/svtyper", line 1519, in <module>
    sys.exit(main())
  File "/uufs/chpc.utah.edu/common/home/ucgdstor/common/apps/ember.arches/svtyper/0.1.0/svtyper", line 1514, in main
    args.dump)
  File "/uufs/chpc.utah.edu/common/home/ucgdstor/common/apps/ember.arches/svtyper/0.1.0/svtyper", line 1152, in sv_genotype
    bam_list = [pysam.AlignmentFile(b, 'rb') for b in bam_string.split(',')]
AttributeError: 'module' object has no attribute 'AlignmentFile'

Which made me think that pysam was not installed but:

$ python
Python 2.7.3 (default, Mar 28 2013, 15:20:19)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysam
>>>

And of course the usage statement works so it's finding pysam:

usage: svtyper [-h] -B BAM [-i VCF] [-o VCF] [-l JSON] [-m INT] [-n INT] [--split_weight FLOAT] [--disc_weight FLOAT] [--debug] [--dump STR]
svtyper: error: argument -B/--bam is required

Any ideas?

--Shawn

python3 svtyper

Dear svtyper team,

Do you intend to provide a python3 version of svtyper?
I am using svtyper within a pipeline written in python3 and incorporating a python3 version would greatly facilitate the installation of the pipeline on other platforms.

Thank you in advance for your answer,
Thomas Faraut

region option

Would it be difficult to add a region option?

SVtyper with long reads

So - I'm trying to generate SV frequencies from long read data (ONT) aligned with a split-read aligner (BWA MEM) using SVtyper, but I keep getting an error:

Error:
Traceback (most recent call last):
File "/home/timp/Code/svtyper/svtyper", line 1078, in <module>
sys.exit(main())
File "/home/timp/Code/svtyper/svtyper", line 1070, in main
args.debug)
File "/home/timp/Code/svtyper/svtyper", line 767, in sv_genotype
sample = Sample(bam_list[i], spl_bam_list[i], num_samp)
File "/home/timp/Code/svtyper/svtyper", line 709, in __init__
self.name = bam.header['RG'][0]['SM']
KeyError: 'RG'

Command:

##BWAMEM
~/Code/bwa/bwa mem -x ont2d /mithril/Data/NGS/Reference/human/hg19.fa ${fastq}_BC${bc}.fastq.gz >${outdir}/${prefix}.bwa.sam

samtools view -b -S ${outdir}/${prefix}.bwa.sam | samtools sort - ${outdir}/${prefix}.bwa
samtools index ${outdir}/${prefix}.bwa.bam

##From LUMPY readme

samtools view -b -F 1294 -S ${outdir}/${prefix}.bwa.sam >${outdir}/${prefix}.bwa.discordants.unsorted.bam

# Extract the split-read alignments
samtools view -h -S ${outdir}/${prefix}.bwa.sam \
    | ~/Code/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin \
    | samtools view -Sb - \
    > ${outdir}/${prefix}.bwa.splitters.unsorted.bam

# Sort both alignments
samtools sort ${outdir}/${prefix}.bwa.discordants.unsorted.bam ${outdir}/${prefix}.bwa.discordants
samtools sort ${outdir}/${prefix}.bwa.splitters.unsorted.bam ${outdir}/${prefix}.bwa.splitters

samtools index ${outdir}/${prefix}.bwa.splitters.bam


~/Code/lumpy-sv/bin/lumpy \
    -mw 4 \
    -tt 0 \
    -sr id:${prefix}.bwa,bam_file:${outdir}/${prefix}.bwa.splitters.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > ${outdir}/${prefix}.bwa.vcf


cat ${outdir}/${prefix}.bwa.vcf | \
    ~/Code/svtyper/svtyper \
    -B ${outdir}/${prefix}.bwa.bam \
    -S ${outdir}/${prefix}.bwa.splitters.bam \
    > ${outdir}/${prefix}.bwa.gt.vcf
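The KeyError above is raised because the BAM header has no @RG entry; the bwa mem command shown does not pass a read group (bwa mem accepts one through its -R option). A defensive lookup can be sketched as follows, using a plain dictionary shaped like the header dictionary pysam exposes. `sample_name_from_header` and its default value are hypothetical, and whether a fallback name is appropriate for genotyping is a separate question.

```python
def sample_name_from_header(header, default="unknown_sample"):
    """Return the SM value of the first read group that has one.

    header is a dict such as {"RG": [{"ID": "rg1", "SM": "NA12878"}]}.
    Falls back to default when no read group carries a sample name.
    """
    for read_group in header.get("RG", []):
        if "SM" in read_group:
            return read_group["SM"]
    return default
```

An explicit error message naming the missing @RG line would be an equally reasonable design, since library statistics are computed per read group.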

svtyper on tumor/normal pair, which BAM to supply

Hi, I am using svtyper on a VCF file produced using lumpyexpress given both a tumor and a matched normal sample. I thus have:

One "somatic" lumpy VCF file
Two BAM files for the tumor and normal sample

Do I:

1. Run svtyper pointing -B to the tumor BAM
2. Run svtyper pointing -B to the normal BAM
3. Both 1 and 2

IndexError: list index out of range

Hi, I am getting the following error "IndexError: list index out of range"
with the following Traceback:
Traceback (most recent call last):
File "path_to_svtyper", line 1078, in <module>
sys.exit(main())
File "path_to_svtyper", line 1070, in main
args.debug)
File "path_to_svtyper", line 767, in sv_genotype
sample = Sample(bam_list[i], spl_bam_list[i], num_samp)
File "path_to_svtyper", line 734, in __init__
self.lib_dict[name].calc_insert_hist()
File "path_to_svtyper", line 689, in calc_insert_hist
med = median(valueCounts)
File "path_to_svtyper", line 567, in median
v = valueList[i]

I used lumpyexpress to generate the vcf file, from set of bam files originally generated with bwa-mem. Samblaster was used to extract splitters.

Thanks for any help, Chris

AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

I was originally getting an "IndexError: list index out of range" error when using the latest release. Now that I've updated to the latest master version, I get a "AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'" error.

I have successfully run svtyper on 10 multisample VCFs before this one. The BAMs were all aligned and processed the same way. There are two readgroups per BAM, one more represented than the other, but a similar ratio to previous BAMs that have not had an issue.

Based on some of the info previously requested from similar issues, I have appended some things that may be helpful:

Python 2.7.5 (default, Jul 24 2013, 15:39:37)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pkg_resources
>>> pkg_resources.get_distribution("pysam").version
'0.7.5'

~/apps/svtyper-master/svtyper -B BAMs/sample1.processed.bam ^C BAMs/sample1.splitreads.bam -i sample1.lumpy.vcf -n 2 -M -o sample1.lumpy.gt.vcf --debug

Traceback (most recent call last):
File "/isilon/bgcusers/bjk003/apps/svtyper-master/svtyper", line 1418, in <module>
sys.exit(main())
File "/isilon/bgcusers/bjk003/apps/svtyper-master/svtyper", line 1405, in main
args.debug)
File "/isilon/bgcusers/bjk003/apps/svtyper-master/svtyper", line 1040, in sv_genotype
sample = Sample.from_bam(bam_list[i], spl_bam_list[i], num_samp, min_lib_prevalence)
File "/isilon/bgcusers/bjk003/apps/svtyper-master/svtyper", line 953, in from_bam
new_lib = Library.from_bam(lib_name, bam, num_samp)
File "/isilon/bgcusers/bjk003/apps/svtyper-master/svtyper", line 784, in from_bam
num_samp)
File "/isilon/bgcusers/bjk003/apps/svtyper-master/svtyper", line 727, in __init__
self.calc_read_length()
File "/isilon/bgcusers/bjk003/apps/svtyper-master/svtyper", line 809, in calc_read_length
if read.inferred_length > max_rl:
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'
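The installation notes above state that svtyper depends on pysam 0.15.0 or newer, while this report shows 0.7.5, so an outdated pysam may explain the missing attribute. A version string can be checked before running with a small helper like the one below. It is a sketch: `version_tuple` and `meets_minimum` are hypothetical names, and only leading numeric components are compared.

```python
import re


def version_tuple(version):
    """Convert '0.9.1.4' to (0, 9, 1, 4), stopping at a non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="0.15.0"):
    """Return True if the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    # Pad with zeros so that '0.15' compares equal to '0.15.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

Comparing integer tuples avoids the string-comparison trap in which "0.9" sorts after "0.15".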

KeyError: 'MATEID'

Hi,

I'm running svtyper on the output of Delly2. Then I got these error messages:

Traceback (most recent call last):
  File "/home/work01/tools/SVTyper/svtyper/svtyper", line 1808, in <module>
    sys.exit(main())
  File "/home/work01/tools/SVTyper/svtyper/svtyper", line 1803, in main
    args.max_reads)
  File "/home/work01/tools/SVTyper/svtyper/svtyper", line 1480, in sv_genotype
    if var.info['MATEID'] in breakend_dict:
KeyError: 'MATEID'

Thanks!

memory error

Hi, I've been experiencing a memory error in a couple samples

File "/frazer01/software/speedseq-20170419/bin/svtyper", line 1806, in <module>
sys.exit(main())
File "/frazer01/software/speedseq-20170419/bin/svtyper", line 1801, in main
args.max_reads)
File "/frazer01/software/speedseq-20170419/bin/svtyper", line 1527, in sv_genotype
read_batch, many = gather_reads(sample, chromA, posA, ciA, z, read_batch, max_reads)
File "/frazer01/software/speedseq-20170419/bin/svtyper", line 1331, in gather_reads
fragment_dict[read.query_name] = SamFragment(read, lib)
File "/frazer01/software/speedseq-20170419/bin/svtyper", line 769, in __init__
self.query_name = read.query_name
MemoryError

I have tried using max_reads argument (used 2000) to limit reads sampled in high read depth regions- but I have noticed that this resulted in many sites not getting genotyped (./.). Any ideas? One of the samples has a totally normal insert size distribution, the other is slightly skewed low (attached). How much memory should these jobs be consuming?

Thanks!

Dropping reference support in certain cases

I noticed that svtyper will force reference evidence, e.g. RP, to 0 if there is no corresponding alternate evidence for that kind but alternate evidence for the other.

As an example, I was annotating a variant with the following evidence:

sr_a (ref, alt) 25 4
pe_a (ref, alt) 39 7
sr_b (ref, alt) 15 4
pe_b (ref, alt) 31 7
sr_a_scaled (ref, alt) 24.571404 3.999996
pe_a_scaled (ref, alt) 37.691501499 0.184318874161
sr_b_scaled (ref, alt) 14.999985 3.999996
pe_b_scaled (ref, alt) 30.6593866175 0.184318874161

Because pe_*_scaled (alt) rounds to 0 but there is split read alt evidence, the ref PE evidence was not reported.

Is there a reason not to report reference evidence if the alternate evidence is zero?

Reference genotypes with no support

I'm seeing samples genotyped as follows:

GT:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:ASC:RP:AP:AB 0/0:1:4.77:0,0,0:0:0:0:0:0:0:0:0:0:0:.

It seems like this should be reported as a null genotype instead...

Use empirical distribution histogram

Paired end evidence would be more reliable if we used an empirical insert size distribution, rather than a normal curve.

Sort order of input vcf for SVtyper

Hi Guys,
After producing lumpy output I filter the vcf in several ways before using SVtyper. Should the input vcf to SVtyper be sorted in any particular way?

Genotyping Output Size Doesn't Match Input Size

I have a quick question. I have generated a lumpy vcf, using an exclude.bed to exclude areas of high read depth (as described on lumpy git), and then removing calls in low complexity regions, seg dupes, centromeric/telomeric regions and calls in non-autosomal contigs before using svtyper to genotype. My input vcf is around 6K regions, and the output from svtyper ends up at around 4800, is this expected? What happens to these regions that do not appear in the output?

Wondering if I'm doing something wrong or if additional things might need to be filtered before using svtyper.
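To see which records are missing, the ID columns of the input and output VCFs can be compared directly. This is a minimal sketch that reads uncompressed, tab-delimited VCF lines and assumes record IDs are unique; `variant_ids` and `dropped_ids` are hypothetical helper names, not part of svtyper.

```python
def variant_ids(vcf_lines):
    """Return the ID column (third field) of every non-header VCF line."""
    ids = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        ids.append(line.rstrip("\n").split("\t")[2])
    return ids


def dropped_ids(input_lines, output_lines):
    """Return input IDs, in input order, that are absent from the output."""
    seen = set(variant_ids(output_lines))
    return [vid for vid in variant_ids(input_lines) if vid not in seen]
```

With files on disk, each argument can be an open file handle, for example `dropped_ids(open("in.vcf"), open("out.vcf"))`. Inspecting the SVTYPE of the dropped records is a natural next step.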

Question

I was wondering if svtyper does "joint genotyping" of multiple samples, analogous to what GATK does for SNPs?

Does svtyper do breakpoint resolution? For example, if I have a population of samples with structural variants in the same region, can svtyper use that information to try to figure out the boundaries of the structural variant?

Thank you,

Luz

Error while processing Lumpy results

This is the command I used:
svtyper -i sample.vcf -B K1_aln.unique.sorted.bam -l K1.bam.json > K1.gt.vcf

And this is the output/error I am getting.
Calculating library metrics from K1_aln.unique.sorted.bam... done
Writing library metrics to K1.bam.json... done
Traceback (most recent call last):
File "/home/huws/bandy016/softwares/svtyper/svtyper", line 1808, in <module>
sys.exit(main())
File "/home/huws/bandy016/softwares/svtyper/svtyper", line 1803, in main
args.max_reads)
File "/home/huws/bandy016/softwares/svtyper/svtyper", line 1528, in sv_genotype
read_batch, many = gather_reads(sample, chromA, posA, ciA, z, read_batch, max_reads)
File "/home/huws/bandy016/softwares/svtyper/svtyper", line 1330, in gather_reads
fragment_dict[read.query_name].add_read(read)
File "/home/huws/bandy016/softwares/svtyper/svtyper", line 793, in add_read
if split_candidate.is_valid():
File "/home/huws/bandy016/softwares/svtyper/svtyper", line 964, in is_valid
a = self.SplitPiece(self.read.reference_name,
AttributeError: 'pysam.calignmentfile.AlignedSegment' object has no attribute 'reference_name'

I got the same error while running the test file for svtyper.

SVTYPE=INS

Hi @cc2qe,

I've split a VCF into many pieces. Some are finishing just fine; others contain some kind of call that throws the following error.

It looks like INS is being genotyped in most cases. Is this accidental? It runs fine without INS.

File "/net/eichler/vol8/home/zevk/tools/svtyper/svtyper", line 1519, in <module>
sys.exit(main())
File "/net/eichler/vol8/home/zevk/tools/svtyper/svtyper", line 1514, in main
args.dump)
File "/net/eichler/vol8/home/zevk/tools/svtyper/svtyper", line 1273, in sv_genotype
if o1_is_reverse: posA += 1
UnboundLocalError: local variable 'o1_is_reverse' referenced before assignment
Error in job genotype while creating output file split_calls/split_bh.genotyped.vcf.
RuleException:
CalledProcessError in line 12 of /net/eichler/vol24/projects/structural_variation/nobackups/zevk/wham/genotype/Snakefile:
Command '

CRAM support

to be added in v0.1.0

ValueError: mate not found

Hi, this sounds like an interesting tool. Is there a detailed description of what it actually does? Does it compare the coverage within and outside of the event? How many flanking bases are considered?

This is the error message I get:
File "pysam/calignmentfile.pyx", line 836, in pysam.calignmentfile.AlignmentFile.mate (pysam/calignmentfile.c:10945)
ValueError: mate not found

I used BWA MEM for the alignment and called multiple samples with lumpy-sv. From looking at the svtyper script, it seems to support multiple samples if the BAMs are provided as a comma-separated list. Whether I split the multi-sample VCF per sample and run svtyper in single-sample mode, or run svtyper in multi-sample mode, the error message stays the same.

Do you have any idea what went wrong?

Some lines actually got processed. The last output line is:
1 823747 12 N <DEL> 733.59 . SVTYPE=DEL;SVLEN=-206;END=823953;STRANDS=+-:2;IMPRECISE;CIPOS=-7,8;CIEND=-1,9;CIPOS95=-2,4;CIEND95=0,9;SU=2;PE=0;SR=2 GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO 0/1:0:0:0:200:733.59:-83,-9,-106:333:252:81 ./.:0:0:0:.:.:.:.:.:. ./.:0:0:0:.:.:.:.:.:. ./.:0:0:0:.:.:.:.:.:. ./.:2:0:2:.:.:.:.:.:.

Strange that only one sample got genotyped. The corresponding lines in the input file:
1 823747 12 N <DEL> . . SVTYPE=DEL;STRANDS=+-:2;SVLEN=-206;END=823953;CIPOS=-7,8;CIEND=-1,9;CIPOS95=-2,4;CIEND95=0,9;IMPRECISE;SU=2;PE=0;SR=2 GT:SU:PE:SR ./.:0:0:0 ./.:0:0:0 ./.:0:0:0 ./.:0:0:0 ./.:2:0:2
1 829170 13 N <DEL> . . SVTYPE=DEL;STRANDS=+-:9;SVLEN=-35;END=829205;CIPOS=-9,8;CIEND=-5,3;CIPOS95=0,0;CIEND95=0,0;SU=9;PE=1;SR=8 GT:SU:PE:SR ./.:4:0:4 ./.:2:0:2 ./.:1:1:0 ./.:2:0:2 ./.:0:0:0

Thanks
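
To see at a glance which samples in an output record actually received a genotype, the GT value of each sample column can be inspected. A minimal sketch, assuming a standard tab-delimited VCF record with FORMAT in column 9; `genotyped_flags` is a hypothetical helper:

```python
def genotyped_flags(record):
    """Return one boolean per sample: True if its GT value is not missing."""
    fields = record.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")
    flags = []
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        flags.append(gt not in ("./.", "."))
    return flags
```

For a record like the output line above, where only the first sample shows `0/1`, this returns `True` for the first sample and `False` for the rest.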

Priors for split vs paired discordant ratios

We can estimate the expected PE/SR ratio from the read length and insert size. SVs that deviate from this ratio (e.g. few or no reads of one evidence type) will have lower confidence; SVs with the expected ratio (both types of evidence) will have higher confidence.

In addition, for joint SV genotyping, we can adjust these ratios based on the observation in the proband sample (the sample in which the SV was discovered).
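
One way to picture the first idea is a toy geometric model. This is an assumption made for illustration, not SVTyper's implementation: a fragment yields a discordant pair when the breakpoint falls in the unsequenced gap between its two reads, and a split read when the breakpoint falls inside a read with at least `min_clip` bases on each side. The function name and the `min_clip` default are hypothetical.

```python
def expected_pe_sr_ratio(insert_size, read_length, min_clip=20):
    """Toy estimate of the expected paired-end : split-read evidence ratio."""
    # Breakpoint positions between the two reads -> discordant pair
    pe_positions = max(insert_size - 2 * read_length, 0)
    # Breakpoint positions inside either read, leaving `min_clip` bases
    # on both sides of the breakpoint -> split read
    sr_positions = 2 * max(read_length - 2 * min_clip, 0)
    if sr_positions == 0:
        return float("inf") if pe_positions else float("nan")
    return float(pe_positions) / sr_positions
```

Under this model, longer reads relative to the insert size shift the expected evidence toward split reads, which is the kind of prior the proposal describes.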

SVtyper: Filtering SVs based on QUAL column

Hi,

I wonder whether the QUAL column represents Phred-scaled probabilities. If so, is it reasonable to filter out any quality score smaller than 10 to get at least 90% accuracy?

Best,
Mehmet
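
For reference, the standard Phred relationship is Q = -10 * log10(P), where P is the error probability. The sketch below shows the conversion in both directions. Whether svtyper's QUAL column follows this scale is an assumption that the thread does not confirm.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled score to an error probability."""
    return 10 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)
```

On this scale a score of 10 corresponds to an error probability of 0.1, so a cutoff of 10 matches the 90% figure in the question if QUAL is indeed Phred-scaled.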

Stopped SVtyper

Hi,

Thanks for the nice tool. I have been running SVtyper on 1000 Genomes WGS samples (~30X) after a Lumpy-SV run.

The tool worked fine when I ran it on chr22. Then I ran it on all chromosomes and saw strange behaviour: SVTyper stopped after writing a few lines (~100-200 lines including the header) of VCF output.

I then experimented further. Since the chr22 run was fine, I assumed the number of lines might be the issue, so I split the VCF file into chunks of 200 lines and ran each of them, but that did not work either.

I am wondering whether you or other users have noticed this issue before.

Thanks
Joon
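
When splitting a VCF to narrow down which records trigger the stop, every chunk needs to carry the full header. A minimal sketch, assuming all header lines precede the records; `chunk_vcf` is a hypothetical helper:

```python
def chunk_vcf(lines, records_per_chunk=200):
    """Yield lists of lines; each chunk is the full header plus up to N records."""
    header, chunk = [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)  # collect ## meta lines and the #CHROM line
            continue
        chunk.append(line)
        if len(chunk) == records_per_chunk:
            yield header + chunk
            chunk = []
    if chunk:
        yield header + chunk  # final, possibly shorter chunk
```

Running each chunk separately and noting the last record written before the stop can help isolate the problematic call.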

Using svtyper for other sv callers

Hi,

I wonder whether anyone has used svtyper to genotype pindel and delly output.

Do I need to modify their outputs to make svtyper work or should I give their outputs directly to svtyper?

Can I use svtyper to genotype Hydra output?

svtyper on lumpy vcf

Hi,
I ran lumpy and got the VCF output file.
I would like to run svtyper, but I keep getting this error message:
Traceback (most recent call last):
  File "./svtyper", line 1078, in <module>
    sys.exit(main())
  File "./svtyper", line 1070, in main
    args.debug)
  File "./svtyper", line 767, in sv_genotype
    sample = Sample(bam_list[i], spl_bam_list[i], num_samp)
  File "./svtyper", line 709, in __init__
    self.name = bam.header['RG'][0]['SM']
KeyError: 'RG'

Is it related to the -M issue?
Thanks
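
The traceback shows the sample name being read from the first read group's `SM` tag, so the `KeyError` indicates a BAM header with no `@RG` lines. A minimal sketch for checking header text (for example, the output of `samtools view -H`); `read_group_samples` is a hypothetical helper:

```python
def read_group_samples(header_text):
    """Return the SM value of each @RG header line (None where SM is absent)."""
    samples = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs
        tags = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        samples.append(tags.get("SM"))
    return samples
```

An empty list means the BAM carries no read groups, which matches the error above.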

argument list too long

Hi!

I am trying to genotype lumpy results for around 10000 samples. I divided the samples into 4 batches, ran lumpy on each batch, and then used SVTOOLS lsort and lmerge to get the merged call set. However, in the last step, using SVTYPER to genotype those calls across my 10000 samples, it fails with the error that the argument list is too long. Any idea how to solve this issue?

The script looks like:
svtyper \
    -i lmerge.vcf \
    -B sample1.bam,.....,sample10000.bam \
    -S sample1.splitters.bam,....,sample10000.splitters.bam > svtyper.vcf

Error:
svtyper: Argument list too long

I am using this SVTYPER (https://github.com/cc2qe/svtyper). With this package, I successfully genotyped the calls from each batch's lumpy result.

Best,
Gang
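
"Argument list too long" comes from the operating system rather than from svtyper: the command line exceeded a kernel limit. The sketch below estimates the size of the comma-separated list passed as a single argument. The 128 KiB per-argument figure is an assumption about a typical Linux host (32 pages of 4 KiB), and both function names are hypothetical.

```python
# Assumed per-argument limit on Linux: 32 pages of 4 KiB each.
ASSUMED_MAX_SINGLE_ARG = 32 * 4096


def joined_arg_bytes(paths):
    """Size in bytes of the comma-separated list passed as one argument."""
    return len(",".join(paths).encode("utf-8")) + 1  # +1 for the trailing NUL


def exceeds_limit(paths, limit=ASSUMED_MAX_SINGLE_ARG):
    """Return True if the joined list would be larger than `limit` bytes."""
    return joined_arg_bytes(paths) > limit
```

On Unix systems, `os.sysconf("SC_ARG_MAX")` reports the limit for the whole command line plus environment. Ten thousand paths of even modest length exceed the assumed per-argument limit, which is consistent with the error reported here.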

Missing 'STR' field

Hey,
I am trying to convert site specific vcfs to genotyped vcfs using svtyper. The issue I am running into is that some of them get a key error corresponding to a 'STR' key. Is this a start value? What variant caller(s) do you use that can create the STR value in the .vcf file? Thanks.

P.S. you can close this and email me if you want, my email is on my profile.
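
To find which records would raise the key error, the INFO column of each record can be checked for the key in question. A minimal sketch, assuming a tab-delimited VCF; `records_missing_info_key` is a hypothetical helper, and the key name is passed in because the thread does not say what `STR` stands for:

```python
def records_missing_info_key(vcf_lines, key):
    """Return the ID column of every record whose INFO column lacks `key`."""
    missing = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info_keys = set(item.split("=", 1)[0] for item in fields[7].split(";"))
        if key not in info_keys:
            missing.append(fields[2])
    return missing
```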

unknown error <- the worst bug report.

tabix -h ../ALL_Illumina_Integrate_20170206.vcf.gz chr4:140000000-150000000 | python ~/tools/svtyper/svtyper -B /net/eichler/vol24/projects/structural_variation/nobackups/bams/final/HG00512.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/HG00513.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/HG00514.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/HG00731.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/HG00732.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/HG00733.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/NA19238.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/NA19239.bam,/net/eichler/vol24/projects/structural_variation/nobackups/bams/final/NA19240.bam -l lib.json 2> error

error

Reading library metrics from lib.json...Reading library metrics from lib.json...Reading library metrics from lib.json...Reading library metrics from lib.json...Reading library metrics from lib.json...Reading library metrics from lib.json...Reading library metrics from lib.json...Reading library metrics from lib.json...Reading library metrics from lib.json... done
Traceback (most recent call last):
  File "/net/eichler/vol8/home/zevk/tools/svtyper/svtyper", line 1738, in <module>
    sys.exit(main())
  File "/net/eichler/vol8/home/zevk/tools/svtyper/svtyper", line 1733, in main
    args.alignment_outpath)
  File "/net/eichler/vol8/home/zevk/tools/svtyper/svtyper", line 1380, in sv_genotype
    vcf.add_header(header)
  File "/net/eichler/vol8/home/zevk/tools/svtyper/svtyper", line 70, in add_header
    self.add_info(*[b.split('=')[1] for b in r.findall(a)])
TypeError: add_info() takes exactly 5 arguments (4 given)
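
The `TypeError` says `add_info()` received one value fewer than it expects, which would be consistent with an `##INFO` header line that lacks one of the usual ID, Number, Type and Description fields. That reading is an inference from the traceback, not something confirmed in the thread. A minimal sketch for spotting such lines; both function names are hypothetical and this is not svtyper's own parser:

```python
import re

REQUIRED_KEYS = ("ID", "Number", "Type", "Description")


def parse_info_header(line):
    """Parse a ##INFO=<...> header line into a dict of its key=value pairs."""
    match = re.match(r"##INFO=<(.*)>\s*$", line)
    if match is None:
        raise ValueError("not an INFO header line: %r" % line)
    # A value is either double-quoted (may contain commas and '=') or bare.
    pairs = re.findall(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)', match.group(1))
    return dict((key, value.strip('"')) for key, value in pairs)


def missing_info_keys(line):
    """Return the required keys that are absent from an ##INFO header line."""
    parsed = parse_info_header(line)
    return [key for key in REQUIRED_KEYS if key not in parsed]
```

Running `missing_info_keys` over every `##INFO` line of the input VCF would show which header entry, if any, is incomplete.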

output AD for depth

svtyper uses AO and RO, but the spec has now settled on AD for ref and alt depths.
Outputting AD would simplify getting alt depths from a VCF without caring whether it is GATK or SV output.
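
Until such a field exists, AD can be derived from the existing RO and AO values as a comma-separated ref,alt pair. A minimal sketch; `ad_from_sample` is a hypothetical helper, not part of svtyper:

```python
def ad_from_sample(format_keys, sample_value):
    """Build an AD string ("ref,alt") from the RO and AO fields of one sample."""
    fields = dict(zip(format_keys.split(":"), sample_value.split(":")))
    ro = fields.get("RO", ".")
    ao = fields.get("AO", ".")
    if "." in (ro, ao):
        return "."  # missing depth for this sample
    return "%s,%s" % (ro, ao)
```

For a sample column with `RO` of 252 and `AO` of 81, this gives `252,81`.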
