astrazeneca-ngs / vardictjava Goto Github PK

View Code? Open in Web Editor NEW

127.0 40.0 55.0 95.26 MB

VarDict Java port

License: MIT License

Java 98.50% Shell 1.50%

vardictjava's Introduction

This is the Final Version of VarDict. No longer maintained.

VarDictJava

Introduction

VarDictJava is a variant discovery program written in Java and Perl. It is a Java port of VarDict variant caller.

The original Perl VarDict is a sensitive variant caller for both single and paired sample variant calling from BAM files. VarDict implements several novel features such as amplicon bias aware variant calling from targeted sequencing experiments, rescue of long indels by realigning bwa soft clipped reads and better scalability than many other Java based variant callers. The Java port is around 10x faster than the original Perl implementation.

Please cite VarDict:

Lai Z, Markovets A, Ahdesmaki M, Chapman B, Hofmann O, McEwen R, Johnson J, Dougherty B, Barrett JC, and Dry JR. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 2016, pii: gkw227.

The link to is article can be accessed through: https://academic.oup.com/nar/article/44/11/e108/2468301?searchresult=1

Original coded by Zhongwu Lai 2014.

VarDictJava can run in single sample (see Single sample mode section), paired sample (see Paired variant calling section), or amplicon bias aware modes. As input, VarDictJava takes reference genomes in FASTA format, aligned reads in BAM format, and target regions in BED format.

Requirements

JDK 1.8 or later
R language (uses /usr/bin/env R)
Perl (uses /usr/bin/env perl)
Internet connection to download dependencies using gradle.

To see the help page for the program, run

 <path_to_vardict_folder>/build/install/VarDict/bin/VarDict -H.

Getting started

Getting source code

The VarDictJava source code is located at https://github.com/AstraZeneca-NGS/VarDictJava.

To load the project, execute the following command:

git clone --recursive https://github.com/AstraZeneca-NGS/VarDictJava.git

Note that the original VarDict project is placed in this repository as a submodule and its contents can be found in the sub-directory VarDict in VarDictJava working folder. So when you use teststrandbias.R and var2vcf_valid.pl. (see details and examples below), you have to add prefix VarDict: VarDict/teststrandbias.R and VarDict/var2vcf_valid.pl.

Compiling

The project uses Gradle and already includes a gradlew script.

To build the project, in the root folder of the project, run the following command:

./gradlew clean installDist

Clean will remove all old files from build folder.

To generate Javadoc, in the build/docs/javadoc folder, run the following command. If you want to save content of build folder as it is (for example after building the project), run it without clean option:

./gradlew clean javadoc

To generate release version in the build/distributions folder as tar or zip archive, run the following command:

./gradlew distZip

./gradlew distTar

Distribution Package Structure

When the build command completes successfully, the build/install/VarDict folder contains the distribution package.

The distribution package has the following structure:

bin/ - contains the launch scripts
lib/ - has the jar file that contains the compiled project code and the jar files of the third-party libraries that the project uses.

You can move the distribution package (the content of the build/install/VarDict folder) to any convenient location.

Generated zip and tar releases will also contain scripts from VarDict Perl repository in bin/ directory (teststrandbias.R, testsomatic.R, var2vcf_valid.pl, var2vcf_paired.pl).

You can add VarDictJava on PATH by adding this line to .bashrc:

export PATH=/path/to/VarDict/bin:$PATH

After that you can run VarDict by Vardict command instead of full path to <path_to_vardict_folder>/build/install/VarDict/bin/VarDict.

Third-Party Libraries

Currently, the project uses the following third-party libraries:

JRegex (http://jregex.sourceforge.net, BSD license) is a regular expressions library that is used instead of the standard Java library because its performance is much higher than that of the standard library.
Commons CLI (http://commons.apache.org/proper/commons-cli, Apache License) – a library for parsing the command line.
HTSJDK (http://samtools.github.io/htsjdk/) is an implementation of a unified Java library for accessing common file formats, such as SAM and VCF.
Mockito and TestNG are the testing frameworks (not included in distribution, used only in tests).

Single sample mode

To run VarDictJava in single sample mode, use a BAM file specified without the | symbol and perform Steps 3 and 4 (see the Program workflow section) using teststrandbias.R and var2vcf_valid.pl. The following is an example command to run in single sample mode with BED file.
You have to set options -c, -S, -E, -g using number of columns in your BED file for chromosome, start, end and gene of region respectively:

AF_THR="0.01" # minimum allele frequency
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N sample_name -b /path/to/my.bam -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f $AF_THR > vars.vcf

VarDictJava can also be invoked without a BED file if the region is specified in the command line with -R option. The following is an example command to run VarDictJava for a region (chromosome 7, position from 55270300 to 55270348, EGFR gene) with -R option:

<path_to_vardict_folder>/build/install/VarDict/bin/VarDict  -G /path/to/hg19.fa -f 0.001 -N sample_name -b /path/to/sample.bam -R  chr7:55270300-55270348:EGFR | VarDict/teststrandbias.R | VarDict/var2vcf_valid.pl -N sample_name -E -f 0.001 > vars.vcf

In single sample mode, output columns contain a description and statistical info for variants in the single sample. See section Output Columns for list of columns in the output.

Paired variant calling

To run paired variant calling, use BAM files specified as BAM1|BAM2 and perform Steps 3 and 4 (see the Program Workflow section) using testsomatic.R and var2vcf_paired.pl.

In this mode, the number of statistics columns in the output is doubled: one set of columns is for the first sample, the other - for second sample.

The following is an example command to run in paired mode.
You have to set options -c, -S, -E, -g using number of columns in your bed file for chromosome, start, end and gene of region respectively:

AF_THR="0.01" # minimum allele frequency
<path_to_vardict_folder>/build/install/VarDict/bin/VarDict -G /path/to/hg19.fa -f $AF_THR -N tumor_sample_name -b "/path/to/tumor.bam|/path/to/normal.bam" -c 1 -S 2 -E 3 -g 4 /path/to/my.bed | VarDict/testsomatic.R | VarDict/var2vcf_paired.pl -N "tumor_sample_name|normal_sample_name" -f $AF_THR

Amplicon based calling

This mode is active if the BED file uses 8-column format and the -R option is not specified.

In this mode, only the first list of BAM files is used even if the files are specified as BAM1|BAM2 - like for paired variant calling.

For each segment, the BED file specifies the list of positions as start and end positions (columns 7 and 8 of the BED file). The Amplicon based calling mode outputs a record for every position between start and end that has any variant other than the reference one (all positions with the -p option). For any of these positions, VarDict in amplicon based calling mode outputs the following:

Same columns as in the single sample mode for the most frequent variant
Good variants for this position with the prefixes GOOD1, GOOD2, etc.
Bad variants for this position with the prefixes BAD1, BAD2, etc.

For this running mode, the -a option (default: 10:0.95) specifies the criteria of discarding reads that are too far away from segments. A read is skipped if its start and end are more than 10 positions away from the segment ends and the overlap fraction between the read and the segment is less than 0.95.

Running Tests

Integration testing

The list of integration test cases is stored in files in testdata/intergationtestcases directory. To run all integration tests, the command is:

./gradlew test --tests com.astrazeneca.vardict.integrationtests.IntegrationTest

The results of the tests can be viewed in the build/reports/tests/index.html file.

User extension of testcases

Each file in testdata/intergationtestcases directory represents a test case with input data and expected output

1. Create a txt file in testdata/intergationtestcases folder.

The file contains testcase input (of format described in Test cases file format) in the first line and expected output in the remaining file part.

2. Extend or create thin-FASTA in testdata/fastas folder.

3. Run tests.

Test cases file format

Each input file represents one test case input description. In the input file the first line consists of the following fields separated by , symbol:

Required fields:

test case name (Amplicon/Somatic/Simple mode)
reference name
bam file name
chromosome name
start of region
end of region

Optional fields:

start of region with amplicon case
end of region with amplicon case

Parameters field:

the last field can be any other command line parameters string

Example of first line of input file:

Amplicon,hg19.fa,Colo829-18_S3-sort.bam,chr1,933866,934466,933866,934466,-a 10:0.95 -D
Somatic,hg19.fa,Colo829-18_S3-sort.bam|Colo829-19_S4-sort.bam,chr1,755917,756517
Simple,hg19.fa,Colo829-18_S3-sort.bam,chr1,9922,10122,-p

Test coverage Report

To build test coverage report run the following command:

./gradlew test jacocoTestReport

Then HTML report could be found in build/reports/jacoco/test/html/index.html

Thin-FASTA Format

Thin fasta is needed to store only needed for tests regions of real references to decrease disk usage. Each thin-FASTA file is .csv file, each line of which represent part of reference data with information of:

chromosome name
start position of contig
end position of contig
and nucleotide sequence that corresponds to region

thin-FASTA example:

chr1,1,15,ATGCCCCCCCCCAAA
chr1,200,205,GCCGA
chr2,10,12,AC

Note: VarDict expands given regions by 1200bp to left and right (plus given value by -x option).

Program Workflow

The main workflow

The VarDictJava program follows the workflow:

Get regions of interest from a BED file or the command line.
For each segment:
1. Find all variants for this segment in mapped reads:
  1. Optionally skip duplicated reads, low mapping-quality reads, and reads having a large number of mismatches.
  2. Skip unmapped reads.
  3. Skip a read if it does not overlap with the segment.
  4. Preprocess the CIGAR string for each read (section CIGAR Preprocessing).
  5. For each position, create a variant. If a variant is already present, adjust its count using the adjCnt function.
  6. Combine SNVs into MNV or SNV with indel to complex if variants are located closer than -X option (default <=2 bases) and have good base qualities.
2. Find structural variants (optionally can be disabled by option -U).
3. Realign some of the variants using realignment of insertions, deletions, large insertions, and large deletions using unaligned parts of reads (soft-clipped ends). This step is optional and can be disabled using the -k 0 switch.
4. Apply variant filtering rules (hard filters) defined in Variant filtering.
5. Assign a type to each variant.
6. Output variants in an intermediate internal format (tabular). Columns of the table are described in the Output Columns section.
Note: To perform Steps 1 and 2, use Java VarDict.
Perform a statistical test for strand bias using an R script.
Note: Use R scripts teststrandbias.R or testsomatic.R for this step.
Transform the intermediate tabular format to VCF. Output the variants with filtering and statistical data.
Note: Use the Perl scripts var2vcf_valid.pl or var2vcf_paired.pl for this step. Be aware that var2vcf_valid.pl or var2vcf_paired.pl by default will output the variant with the highest AF on position: if few variants start at one position, only the highest will be added in VCF. To output all variants use -A option with these perl scripts.

CIGAR Preprocessing (Initial Realignment)

Read alignment is specified in a BAM file as a CIGAR string. VarDict modifies this string (and alignment) in the following special cases:

Soft clipping next to insertion/deletion is replaced with longer soft-clipping. The same takes place if insertion/deletion is separated from soft clipping by no more than 10 matched bases.
Short matched sequence and insertion/deletion at the beginning/end are replaced by soft-clipping.
Two close deletions and insertions are combined into one deletion and one insertion
Three close deletions are combined in one
Three close insertions/deletions are combined in one deletion or in insertion+deletion
Two close deletions are combined into one
Two close insertions/deletions are combined into one
Mis-clipping at the start/end are changed to matched sequences

Variants

Simple variants (SNV, simple insertions, and deletions) are constructed in the following way:

Single-nucleotide variation (SNV). VarDict inserts an SNV into the variants structure for every matched or mismatched base in the reads. If an SNV is already present in variants, VarDict adjusts its counts and statistics.
Simple insertion variant. If read alignment shows an insertion at the position, VarDict inserts +BASES string into the variants structure. If the variant is already present, VarDict adjusts its count and statistics.
Simple Deletion variant. If read alignment shows a deletion at the position, VarDict inserts -NUMBER into the variants structure. If the variant is already present, VarDict adjusts its count and statistics.
Complex variants: VarDict also handles complex variants (for example, an insertion that is close to SNV or to deletion) using specialized ad-hoc methods.

Structural Variants are looked for after simple variants. VarDict supported DUP, INV and DEL structural variants.

Variant Description String

The description string encodes a variant for VarDict internal use.

The following table describes Variant description string encoding:

String	Description
[ATGC]	for SNPs
+[ATGC]+	for insertions
-[0-9]+	for deletions
...#[ATGC]+	for insertion/deletion variants followed by a short matched sequence
...^[ATGC]+	something followed by an insertion
...^[0-9]+	something followed by a deletion
...&[ATGC]+	for insertion/deletion variants followed by a matched sequence

Variant Filtering

A variant appears in the output if it satisfies the following criteria (in this order). If variant doesn't fit criteria on the step, it will be filtered out and the next steps won't be checked (except for the step 8, read the explanation below):

Frequency of the variant exceeds the threshold set by the -f option (default = 1%).
The minimum number of high-quality reads supporting variant is larger than the threshold set by the -r option (default = 2).
The mean position of the variant in reads is larger than the value set by the -P option (default = 5).
The mean base quality (phred score) for the variant is larger than the threshold set by the -q option (default = 22.5).
Variant frequency is more than 25% or reference allele does not have much better mapping quality than the variant.
Deletion variants are not located in the regions where the reference genome is missing.
The ratio of high-quality reads to low-quality reads is larger than the threshold specified by -o option (default=1.5).
Variant frequency exceeds 30%. If so, next steps won't be checked and variant considered as "good". Otherwise, other steps will be also checked.
Mean mapping quality exceeds the threshold set by the -O option (default: no filtering)
In the case of an MSI region, the variant size is less than 12 nucleotides for the non-monomer MSI or 15 for the monomer MSI. Variant frequency is more than 10% for the non-monomer MSI (or set by --nmfreq option) and 25% for the monomer MSI (or set by --mfreq option).
Variant has not "2;1" bias or variant frequency more than 20%. If both conditions aren't met, then variant mustn't be SNV and any of variants refallele or varallele lengths must be more than 3 nucleotides.

Bias flag explanation

Bias flag can take values [0-2];[0-2] (i.e. "0;2", "2;1" and separator can be another in paired and single VCF). The first value refers to reads that support the reference allele, and the second to reads that support the variant allele.

0 - small total count of reads (less than 12 for the sum of forward and reverse reads) 1 - strand bias 2 - no strand bias

Variant classification in paired(somatic) analysis

In paired analysis, VarDict will classify each variant into the following types that are propagated into STATUS info tag after var2vcf_paired.pl script.

When both samples have coverage for the variant:

Germline: detected in germline sample (pass all quality parameters)
StrongSomatic: detected in tumor sample only
LikelySomatic: the variant has at least one read support OR allele frequency < 5% (defined by –V option with default 0.05)
StrongLOH: detected in germline sample only, opposite of StrongSomaitc
LikelyLOH: detected in germline but either lost in tumor OR 20-80% in germline, but increased to 1-opt_V (95%).
AFDiff: detected in tumor (pass quality parameters) and present in germline but didn’t pass quality parameters.

When only one sample has coverage for the variant:

SampleSpecific: detected in tumor sample, but no coverage in germline sample (it’s more technical than biological, as it’s unlikely a tumor sample can gain a piece of sequence in reference that germline sample lacks).
Deletion: detected in germline sample, but no coverage in tumor sample

These are only rough classification. You need to examine the p-value (after testsomatic.R script) to determine whether or not it's significant.

Program Options

VarDictJava options

-H|-?
Print help page
-h|--header
Print a header row describing columns
-i|--splice Output splicing read counts
-p
Do pileup regardless of the frequency
-C
Indicate the chromosome names are just numbers, such as 1, 2, not chr1, chr2 (deprecated)
-D|--debug
Debug mode. Will print some error messages and append full genotype at the end.
-y|--verbose
Verbose mode. Will output variant calling process.
-t|--dedup
Indicate to remove duplicated reads. Only one pair with identical start positions will be kept
-3
Indicate to move indels to 3-prime if alternative alignment can be achieved.
-K Include Ns in the total depth calculation.
-F bit
The hexical to filter reads. Default: 0x504 (filter unmapped reads, 2nd alignments and duplicates). Use -F 0 to turn it off.
-z 0/1
Indicate whether the BED file contains zero-based coordinates, the same way as the Genome browser IGV does. -z 1 indicates that coordinates in a BED file start from 0. -z 0 indicates that the coordinates start from 1. Default: 1 for a BED file or amplicon BED file (0-based). Use 0 to turn it off. When using -R option, it is set to 0
-a|--amplicon int:float
Indicate it is amplicon based calling. Reads that do not map to the amplicon will be skipped. A read pair is considered to belong to the amplicon if the edges are less than int bp to the amplicon, and overlap fraction is at least float. Default: 10:0.95
-k 0/1
Indicate whether to perform local realignment. Default: 1 or yes. Set to 0 to disable it.
-G Genome fasta
The reference fasta. Should be indexed (.fai). Defaults to: /ngs/reference_data/genomes/Hsapiens/hg19/seq/hg19.fa
Also short commands can be used to set path to:
hg19 - /ngs/reference_data/genomes/Hsapiens/hg19/seq/hg19.fa
hg38 - /ngs/reference_data/genomes/Hsapiens/hg38/seq/hg38.fa
mm10 - /ngs/reference_data/genomes/Mmusculus/mm10/seq/mm10.fa
-R Region
The region of interest. In the format of chr:start-end. If chr is not start-end but start (end is omitted), then it is a single position. No BED is needed.
-d delimiter
The delimiter for splitting region_info, defaults to tab "\t"
-n regular_expression
The regular expression to extract sample names from bam filenames. Defaults to: /([^\/\._]+?)_[^\/]*.bam/
-N string
The sample name to be used directly. Will overwrite -n option
-b string
The indexed BAM file. Multiple BAM files can be specified with the “:” delimiter.
-c INT
The column for chromosome
-S INT
The column for the region start, e.g. gene start
-E INT
The column for the region end, e.g. gene end
-s INT
The column for a segment starts in the region, e.g. exon starts
-e INT
The column for a segment ends in the region, e.g. exon ends
-g INT
The column for a gene name, or segment annotation
-x INT
The number of nucleotides to extend for each segment, default: 0
-f double
The threshold for allele frequency, default: 0.01 or 1%
-r minimum reads
The minimum # of variance reads, default: 2
-B INT
The minimum # of reads to determine strand bias, default: 2
-Q INT
If set, reads with mapping quality less than INT will be filtered and ignored
-q double
The phred score for a base to be considered a good call. Default: 22.5 (for Illumina). For PGM, set it to ~15, as PGM tends to underestimate base quality.
-m INT
If set, reads with mismatches more than INT will be filtered and ignored. Gaps are not counted as mismatches. Valid only for bowtie2/TopHat or BWA aln followed by sampe. BWA mem is calculated as NM - Indels. For STAR you have to increase default if nM tag (for "paired" alignment) is presented in reads.Default: 8, or reads with more than 8 mismatches will not be used.
-T|--trim INT
Trim bases after [INT] bases in the reads
-X INT
Extension of bp to look for mismatches after insersion or deletion. Default to 2 bp, or only calls when they are within 2 bp.
-P number
The read position filter. If the mean variants position is less that specified, it is considered false positive. Default: 5
-Z|--downsample double
For downsampling fraction, e.g. 0.7 means roughly 70% downsampling. Default: No downsampling. Use with caution. The downsampling will be random and non-reproducible.
-o Qratio
The Qratio of (good_quality_reads)/(bad_quality_reads+0.5). The quality is defined by -q option. Default: 1.5
-O MapQ
The variant should has at least mean MapQ to be considered a valid variant. Default: no filtering
-V freq
The lowest frequency in a normal sample allowed for a putative somatic mutations. Used only in paired mode. Defaults to 0.05
-I INT
The indel size. Default: 50bp. Be cautious with -I option, especially in the amplicon mode, as amplicon sequencing is not a way to find large indels. Increasing the search size might be slow and false positives may appear in low complexity regions. Increasing it to 200-300 bp is only recommend for hybrid capture sequencing.
-M INT
The minimum matches for a read to be considered. If, after soft-clipping, the matched bp is less than INT, then the read is discarded. It's meant for PCR based targeted sequencing where there's no insert and the matching is only the primers. Default: 0, or no filtering
-th [threads]
If this parameter is missing, then the mode is one-thread. If you add the -th parameter, the number of threads equals to the number of processor cores. The parameter -th threads sets the number of threads explicitly.
-VS STRICT | LENIENT | SILENT
How strict to be when reading a SAM or BAM. STRICT - throw an exception if something looks wrong. LENIENT - Emit warnings but keep going if possible. SILENT - Like LENIENT, only don't emit warning messages. Default: LENIENT
-u
Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once using forward read only. Default: unique mode disabled, all reads are counted.
-UN
Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once using first read only. Default: unique mode disabled, all reads are counted.
--chimeric
Indicate to turn off chimeric reads filtering. Chimeric reads are artifacts from library construction, where a read can be split into two segments, each will be aligned within 1-2 read length distance, but in opposite direction. Default: filtering enabled
-U|--nosv
Turn off structural variant calling
-L INT
The minimum structural variant length to be presented using <DEL> <DUP> <INV> <INS>, etc. Default: 1000. Any indel, complex variants less than this will be spelled out with exact nucleotides
-w|--insert-size INT INSERT_SIZE
The insert size. Used for SV calling. Default: 300
-W|--insert-std INT INSERT_STD
The insert size STD. Used for SV calling. Default: 100
-A INT INSERT_STD_AMT
The number of STD. A pair will be considered for DEL if INSERT > INSERT_SIZE + INSERT_STD_AMT * INSERT_STD. Default: 4
-Y|--ref-extension INT
Extension of bp of reference to build lookup table. Default to 1200 bp. Increasing the number will slow down the program. The main purpose is to call large indels with 1000 bp that can be missed by discordant mate pairs.
--deldupvar
Turn on deleting of duplicate variants in output that can appear due to VarDict linear work on regions. Variants in this mode are considered and outputted only if start position of variant is inside the region interest.
-DP|--default-printer
The printer type used for different outputs. Default: OUT (i.e. System.out).
--adaptor
Filter adaptor sequences so that they are not used in realignment. Multiple adaptors can be supplied by setting them with comma, like:
--adaptor ACGTTGCTC,ACGGGGTCTC,ACGCGGCTAG
-J|--crispr CRISPR_cutting_site
The genomic position that CRISPR/Cas9 suppose to cut, typically 3bp from the PAM NGG site and within the guide. For CRISPR mode only. It will adjust the variants (mostly In-Del) start and end sites to as close to this location as possible, if there are alternatives. The option should only be used for CRISPR mode.
-j CRISPR_filtering_bp
In CRISPR mode, the minimum amount in bp that a read needs to overlap with cutting site. If a read does not meet the criteria, it will not be used for variant calling, since it is likely just a partially amplified PCR. Default: not set, or no filtering
--nmfreq
The variant frequency threshold to determine variant as good in case of non-monomer MSI. Default: 0.1
--mfreq
The variant frequency threshold to determine variant as good in case of monomer MSI. Default: 0.25
--fisher
EXPERIMENTAL FEATURE: to exclude R script from the VarDict pipeline we added this option to calculate pvalue and oddratio from Fisher Test. It will decrease time processing on big samples because R script uses slow textConnection function. If you use this, do NOT run teststrandbias.R or testsomatic.R after Vardict, but use var2vcf_valid.pl or var2vcf_paired.pl after VarDictJava as usual.

Important var2vcf_valid.pl options

The full list of options in VarDictPerl var2vcf_valid.pl -h

-A
Indicate to output all variants at the same position. By default, only the variant with the highest allele frequency is converted to VCF.
-S
If set, variants that didn't pass filters will not be present in VCF file.
-N string
The sample name to be used directly.
-f float
The minimum allele frequency. Default to 0.02

Important var2vcf_paired.pl options

The full list of options in VarDictPerl var2vcf_paired.pl -h

-M
If set, will increase stringency for candidate somatic: flag P0.01Likely and InDelLikely, and add filter P0.05
-A
Indicate to output all variants at the same position. By default, only the variant with the highest allele frequency is converted to VCF.
-S
If set, variants that didn't pass filters will not be present in VCF file.
-N string
The sample name(s). If only one name is given, the matched will be simply names as "name-match". Two names are given separated by "|", such as "tumor|blood".
-f float
The minimum allele frequency. Default to 0.02

Output columns

Simple mode:

Sample - sample name
Gene - gene name from a BED file
Chr - chromosome name
Start - start position of the variation
End - end position of the variation
Ref - reference sequence
Alt - variant sequence
Depth (DP) - total coverage
AltDepth (VD) - variant coverage
RefFwdReads (REFBIAS) - reference forward strand coverage
RefRevReads (REFBIAS) - reference reverse strand coverage
AltFwdReads (VARBIAS) - variant forward strand coverage
AltRevReads (VARBIAS) - variant reverse strand coverage
Genotype - genotype description string
AF - allele frequency
Bias - strand bias flag
PMean - mean position in read
PStd - flag for read position standard deviation (1 if the variant is covered by at least 2 read segments with different positions, otherwise 0).
QMean - mean base quality
QStd - flag for base quality standard deviation
MAPQ - mapping quality
QRATIO (SN) - ratio of high quality reads to low-quality reads
HIFREQ (HIAF) - variant frequency for high-quality reads
EXTRAFR (ADJAF) - Adjusted AF for indels due to local realignment
SHIFT3 - No. of bases to be shifted to 3 prime for deletions due to alternative alignment
MSI - MicroSatellite. > 1 indicates MSI
MSILEN - MicroSatellite unit length in bp
NM - average number of mismatches for reads containing the variant
HICNT - number of high-quality reads with the variant
HICOV - position coverage by high quality reads
5pFlankSeq (LSEQ) - neighboring reference sequence to 5' end
3pFlankSeq (RSEQ) - neighboring reference sequence to 3' end
SEGMENT:CHR_START_END - position description
VARTYPE - variant type
DUPRATE - duplication rate in fraction
SV splits-pairs-clusters: Splits (SPLITREAD) - No. of split reads supporting SV, Pairs (SPANPAIR) - No. of pairs supporting SV, Clusters - No. of clusters supporting SV
CRISPR - only in crispr mode - how close to a CRISPR site is the variant

Amplicon mode

In amplicon mode columns from #35 are changed to:
(35) GoodVarCount (GDAMP) - number of good variants on amplicon
(36) TotalVarCount (TLAMP) - number of good and bad variants on amplicon
(37) Nocov (NCAMP) - number of variants on amplicon that has depth less than 1/50 of the max depth (they will be considered not working and thus not used).
(38) Ampflag - if there are different good variants on different amplicons, it will be 1.

Somatic mode

In somatic mode we have information from both samples:

Sample - sample name
Gene - gene name from a BED file
Chr - chromosome name
Start - start position of the variation
End - end position of the variation
Ref - reference sequence
Alt - variant sequence
Fields from first sample:
Depth (DP) - total coverage
AltDepth (VD) - variant coverage
RefFwdReads (REFBIAS) - reference forward strand coverage
RefRevReads (REFBIAS) - reference reverse strand coverage
AltFwdReads (VARBIAS) - variant forward strand coverage
AltRevReads (VARBIAS) - variant reverse strand coverage
Genotype - genotype description string
AF - allele frequency
Bias - strand bias flag
PMean - mean position in read
PStd - flag for read position standard deviation
QMean - mean base quality
QStd - flag for base quality standard deviation
MAPQ - mapping quality
QRATIO (SN) - ratio of high quality reads to low-quality reads
HIFREQ (HIAF) - variant frequency for high-quality reads
EXTRAFR (ADJAF) - Adjusted AF for indels due to local realignment
NM - average number of mismatches for reads containing the variant
Fields from second sample:
Depth - total coverage
AltDepth - variant coverage
RefFwdReads (REFBIAS) - reference forward strand coverage
RefRevReads (REFBIAS) - reference reverse strand coverage
AltFwdReads (VARBIAS) - variant forward strand coverage
AltRevReads (VARBIAS) - variant reverse strand coverage
Genotype - genotype description string
AF - allele frequency
Bias - strand bias flag
PMean - mean position in read
PStd - flag for read position standard deviation
QMean - mean base quality
QStd - flag for base quality standard deviation
MAPQ - mapping quality
QRATIO (SN) - ratio of high quality reads to low-quality reads
HIFREQ (HIAF) - variant frequency for high-quality reads
EXTRAFR (ADJAF) - Adjusted AF for indels due to local realignment
NM - average number of mismatches for reads containing the variant
Common fields:
SHIFT3 - No. of bases to be shifted to 3 prime for deletions due to alternative alignment
MSI - MicroSatellite. > 1 indicates MSI
MSILEN - MicroSatellite unit length in bp
5pFlankSeq (LSEQ) - neighboring reference sequence to 5' end
3pFlankSeq (RSEQ) - neighboring reference sequence to 3' end
SEGMENT:CHR_START_END - position description
VarLabel - variant label due to type: StrongLOH, StrongSomatic...
VARTYPE - variant type
DUPRATE1 - duplication rate in fraction from first sample
SV_info1 - Splits - No. of split reads supporting SV, Pairs - No. of pairs supporting SV, Clusters - No. of clusters supporting SV from first sample
DUPRATE2 - duplication rate in fraction from second sample
SV_info2: Splits - No. of split reads supporting SV, Pairs - No. of pairs supporting SV, Clusters - No. of clusters supporting SV from second sample

Input Files

BED File – Regions

VarDict uses 2 types of BED files for specifying regions of interest: 8-column and all others. The 8-column file format is used for targeted DNA deep sequencing analysis (amplicon based calling), amplicon analysis will try to start if BED with 8 columns was provided. Otherwise you can start single and paired sample analysis by providing options -c, -S, -E, -g with number of columns for chromosome, start, end, gene of the region respectively.

All lines starting with #, browser, and track in a BED file are skipped. The column delimiter can be specified as the -d option (the default value is a tab “\t“).

The 8-column amplicon BED file format involves the following data:

Chromosome name
Region start position
Region end position
Gene name
Score - not used by VarDict
Strand - not used by VarDict
Start position – VarDict starts outputting variants from this position
End position – VarDict ends outputting variants from this position

For example 4-column BED file format involves the following data and VarDict must be start with -c 1 -S 2 -E 3 -g 4 to recognize it:

Chromosome name
Region start position
Region end position
Gene name

FASTA File - Reference Genome

The reference genome in FASTA format is read using HTSJDK library. For every invocation of the VarDict pipeline (usually 1 for a region in a BED file) and for every BAM file, a part of the reference genome is extracted from the FASTA file. In some cases of Structural Variants finding the reference can be reread in other regions.

Region of FASTA extends and this extension can be regulated via the REFEXT variable (option -Y INT, default 1200 bp).

Errors and warnings

Information about some of the errors and their causes is located in wiki

License

The code is freely available under the MIT license.

Contributors

Java port of VarDict implemented based on the original Perl version (Zhongwu Lai) by:

vardictjava's People

Contributors

Stargazers

Watchers

Forkers

dh10 pkrusche antonstamov tfenne hrk2109 alanhoyle davidroberson limbus-medtec microtsiu biocyberman cbrueffer chapmanb nh13 polinabevad wangdi2014 libhet luisff andurill sarahjamieson nkarulin clintval zhuzhoujie thkitapci baiyuanxiang gerde leornardzhou annahyan roh329 mihin vkirst songtr quantori lbergelson zhujiaqi2014 newalbert123456 kennedy-lab-uw bkohrn crutching liuzhen51032 bgruening graeme-smith dujidan humaasif mahajrod schaudge melisandre18 bulksecuritygeneratorprojectv2 suyanan hejunyi6721 belay-diagnostics emlec tysonp612 roysomak4

vardictjava's Issues

StringIndexOutOfBoundsException: String index out of range: 1

I get this error while running VarDict:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(String.java:658)
at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:2221)
at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2552)
at com.astrazeneca.vardict.VarDict.vardictNotParallel(VarDict.java:301)
at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:254)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:65)
at com.astrazeneca.vardict.Main.run(Main.java:134)
at com.astrazeneca.vardict.Main.main(Main.java:25)
Argument "0,0909" isn't numeric in numeric lt (<) at /home/kua/Downloads/Installer/CNVtools/VarDict-1.4.6/VarDict/var2vcf_valid.pl line 124, <> line 71.
Argument "124,5" isn't numeric in numeric lt (<) at /home/kua/Downloads/Installer/CNVtools/VarDict-1.4.6/VarDict/var2vcf_valid.pl line 125, <> line 71.
Argument "93,0" isn't numeric in numeric lt (<) at /home/kua/Downloads/Installer/CNVtools/VarDict-1.4.6/VarDict/var2vcf_valid.pl line 127, <> line 71.
Argument "0,0" isn't numeric in numeric lt (<) at /home/kua/Downloads/Installer/CNVtools/VarDict-1.4.6/VarDict/var2vcf_valid.pl line 128, <> line 71.
....

Could you please help me??

Increasing minimum evidence required in normal bam to be classified as "germline"

Hello @mjafin ,

We are testing vardict using our proxy bam file for somatic indel detection and are noticing several cases where very low evidence in the proxy bam (3-4 reads at an allele frequency of 0.007 - 0.009) are causing the variant to be classified as germline instead of likely or strongly somatic. The tumor variant AFs in these cases are between 0.38 and 0.51.

Here is a generalized example of the command we ran for this analysis : VarDict -th 5 -G hs37d5.fa -f 0.003 -N HTB-44 -b "chromosome_21.recal.bam|proxy_chromosome_21.recal.bam" -C -c 1 -S 2 -E 3 21.bed | testsomatic.R | var2vcf_paired.pl -N "samp|proxy" -f 0.003 1> vardict_chrom_21_paired.vcf 2> vardict_chrom_21_paired.err

From reading your exposed parameter options that there are no filters which we could change to simply allow these somatic variants with low frequency proxy evidence to no longer be detected as germline.

If necessary we could send you bam segments as well.

Thanks,
Sean

Amplicon mode output too many columns

Hello

I am trying to use VarDictJava in amplicon mode but when I pipe the results into TestStrandBias.R I get an error that the input is incorrect. It looks like teststrandbias expects a 38 column file from VarDict but I am getting a 40 column table. Any thoughts?

Here is my command
VarDict -D -G /fdb/GATK_resource_bundle/b37/human_g1k_v37.fasta -f $AF_THR -a 10:0.95 -N "NY152" -b mapping/NY152.merged.bam -C -c 1 -S 2 -E 3 -g 4 test.interval.bed

Bcbio Vardict 1.4.5 Exception in thread "main" java.lang.StringIndexOutOfBoundsException:

Hi,

I am trying to call somatic mutations using vardict with bcbio pipeline system. I came across with this error during the process. Even though it has called mutations for chr 1-15, i dont know what might have caused this error. I also checked the closed answers but i couldnt find and answer. It was concluded that new version was released.

Thank you for your help,

Best,

Tunc.

My vardict version is;

vardict,2016.02.19
vardict-java,1.4.5

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(String.java:658)
    at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1896)
    at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2552)
    at com.astrazeneca.vardict.VarDict.somaticNotParallel(VarDict.java:354)
    at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:249)
    at com.astrazeneca.vardict.VarDict.start(VarDict.java:65)
    at com.astrazeneca.vardict.Main.run(Main.java:134)
    at com.astrazeneca.vardict.Main.main(Main.java:25)
' returned non-zero exit status 1

Issues with DNP, MNP and Complex events

Hello VarDict Team,

In our testing of VarDict we noticed some issues when an event is called as a DNP, MNP or a complex event that are germline

For mock example in DNP:
if there is germline:
chr1:10 = A
chr1:11 = T
The variant called is
AT > CG
VarDict properly assign DP,VD to DNP for both case and control as well as tagging it as germline.

But then VarDict also call's
chr1:11 = T
The variant called is
T>G
VarDict assigns this variant with very little read support in VD in case and no support in control. It is also tagged as somatic. Any reason why this event is still called separately.

This happens in multiple cases of DNP, MNP and complex events that are germline.

Let me know if I am not clear.

Attached is the real example for DNP:

Best,
Ronak

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0

Hi there,
I'm calling variants in hg38 and am getting this error:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
        at java.lang.String.charAt(String.java:658)
        at com.astrazeneca.vardict.VarDict.varType(VarDict.java:5905)
        at com.astrazeneca.vardict.VarDict.vardict(VarDict.java:966)
        at com.astrazeneca.vardict.VarDict.vardictNotParallel(VarDict.java:305)
        at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:257)
        at com.astrazeneca.vardict.VarDict.start(VarDict.java:68)
        at com.astrazeneca.vardict.Main.run(Main.java:134)
        at com.astrazeneca.vardict.Main.main(Main.java:25)
' returned non-zero exit status 1

This is in the chrUn_GL000216v2 contig that is fairly short (0-176608). I can provide you the data via Box again when you have time to pick this up.

Thanks,
Miika

VardictJava Reproducibly Breaking At Specific RNA Recal/Realigned Bam Regions

Hello,

I have been trying to apply the java version of your verdict program on a set of tumor/normal samples that have been modified according to GATK best practices for RNA (Recal, Realign, dedup, etc.).

The samples run correctly for the perl version of your pipeline, however, they are breaking at very specific regions in the recal realigned bam file. For example, it breaks in all samples at the KLF6 gene. There does not appear to be anything strange about this region when inspecting the bam (~2,000x coverage and no easily observed variants in IGV - Similar to other regions that pass). However, it always gets hung up here in the java version.

Have you observed this in the past? Do you have any suggestions on what I could do to avoid this?

Here is an example of my command:
VarDict -th 4 \ -G hs37d5.fa \ -N {Samp}_spiked_UC3_RNA \ -b "StarAligned_RDSQ_Recal_Final.bam|Core_DNA_Normal_Merged.bam" \ -C -c 1 -S 2 -E 3 \ target_regions/10a.bed

hs37d5 is effectively b37 + decoys. The region in question is on chr 10.

And here is the resulting error position:

1 30.9 1 60.0 26.000 0.3171 0 1.5 0 1.000 1 GAACTGCACGCTAGGGAAGG GAATGACCAGAACGCAAAAG 10:3208380-3208592 Deletion SNV HCC1187_spiked_UC3_RNA 10 10 3212392 3212392 T A 169 2 49 118 1 1 T/A 0.0118 2;2 7.0 0 34.0 1 60.0 4.000 0.0127 0 1.0 260 0 127 133 0 0 T/A 0 2;0 48.0 0 16.0 0 60.0 0.000 0 0 2.0 0 2.000 2 CATCGCCACGCTCTGTGGTG GCATGTCTAGATTAAAAGTC 10:3212295-3212472 StrongSomatic SNV HCC1187_spiked_UC3_RNA 10 10 3214369 3214369 A G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 35 43 12 28 7 A/G 0.3889 2;2 27.5 1 37.1 1 60.0 34.000 0.3953 0 1.1 0 2.000 1 TGGGCCGGCAGCGCTGAACT TTACCTTATATTAAACACGG 10:3214342-3214531 Deletion SNV HCC1187_spiked_UC3_RNA 10 10 3214498 3214498 C T 3 0 1 2 0 0 C/C 0 2;0 22.3 1 34.3 1 60.0 6.000 1.0000 0 0.3 168 2 66 100 1 1 C/T 0.0119 2;2 53.0 1 32.5 1 60.0 4.000 0.0120 0 2.0 0 2.000 2 GGAGGCGTCAGTTCTCTGAA ACAGCGAGACTTTAAGTATG 10:3214342-3214531 StrongLOH SNV HCC1187_spiked_UC3_RNA 10 10 3214516 3214516 A G 1 1 0 0 0 1 G/G 1.0000 0;0 15.0 0 36.0 0 60.0 2.000 1.0000 0 1.0 155 68 30 57 23 45 A/G 0.4387 2;2 36.3 1 35.7 1 60.0 21.667 0.4333 0 1.1 0 1.000 1 AACACAGCGAGACTTTAAGT TGGGCTGTGGGCGCCTCGGG 10:3214342-3214531 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3214995 3214996 AC A 5 3 1 1 1 2 C/-1 0.6000 2;2 45.7 1 35.3 1 60.0 6.000 0.6000 0 0.0 268 125 75 68 79 46 C/-1 0.4664 2;2 34.5 1 35.4 1 60.0 24.000 0.4688 0.0112 0.4 4 5.000 1 GAGCACCTGGCTGGCGAGGA CCCCTCTTAACCCGACGCAG 10:3214887-3215100 Germline Deletion java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at com.astrazeneca.vardict.VarDict.somaticParallel(VarDict.java:341) at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:254) at com.astrazeneca.vardict.VarDict.start(VarDict.java:68) at com.astrazeneca.vardict.Main.run(Main.java:134) at com.astrazeneca.vardict.Main.main(Main.java:25) Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5342) at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5325) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.charAt(String.java:695) at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1881) at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2537) at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5320) at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5294)

To assist in determining the issue, I have isolated a bam region which has this problem. I have also run this snippet of bam on my end to demonstrate that that issue is there. There is an additional line mentioning "Ignoring SAM validation error". This is due to my extracting the region and was not in the original issue.

` /hpc/env/dev/apps/vardict_java/1.4.3/build/install/VarDict/bin/VarDict -th 4 \

-G /nfs/science/data/genomes/reference/hs37d5/raw/hs37d5.fa
-N HCC1187_spiked_UC3_RNA
-b "/nfs/projects/home/sboyle/temp/10a_BrokenVardictRegion_Tumor_ToSend.bam|/nfs/projects/home/sboyle/temp/10a_BrokenVardictRegion_Normal_ToSend.bam"
-C -c 1 -S 2 -E 3
/nfs/projects/home/sboyle/temp/10a_BrokenVardictRegion_ToSend.bed

HCC1187_spiked_UC3_RNA 10 10 3190265 3190265 A G 8 4 2 2 1 3 A/G 0.5000 2;2 24.5 1 36.0 1 60.0 8.000 0.5000 0 2.0 166 85 43 38 40 45 A/G 0.5120 2;2 33.2 1 36.4 1 60.0 170.000 0.5152 0 1.9 0 1.000 1 AAAAGCACACCCTGAACCAA GAAAACACAGAAGAAAGGAA 10:3190173-3190483 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3190298 3190298 C T 7 5 1 1 2 3 C/T 0.7143 2;2 35.6 1 37.0 0 60.0 10.000 0.7143 0 1.6 172 91 40 41 45 46 C/T 0.5291 2;2 26.9 1 33.8 1 60.0 44.500 0.5298 0 1.9 2 1.000 1 GAAAGGAATTAGGGCCGAAT GAACAGTGACAGACACTGGA 10:3190173-3190483 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3200249 3200249 C T 142 104 14 24 46 58 C/T 0.7324 2;2 15.0 1 35.9 1 60.0 33.667 0.7372 0 1.6 101 39 38 24 23 16 C/T 0.3861 2;2 32.0 1 33.7 1 60.0 18.500 0.3737 0 1.9 0 1.000 1 CACTCAACTACTTCATCAAT GTTCTGTCTATGAGGCTTCT 10:3200186-3200426 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3200292 3200292 G A 150 108 23 19 54 54 G/A 0.7200 2;2 29.7 1 35.1 1 60.0 20.600 0.7203 0 1.6 116 47 37 32 26 21 G/A 0.4052 2;2 31.6 1 32.9 1 60.0 94.000 0.4052 0 1.7 1 1.000 1 CGGTCTCAATGTCTTTCTCC CAATCCCTTGGAGGCCGACA 10:3200186-3200426 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3201097 3201097 T C 2 1 1 0 1 0 T/C 0.5000 0;0 42.0 0 37.0 0 60.0 2.000 0.5000 0 1.0 169 87 60 22 56 31 T/C 0.5148 2;2 28.3 1 35.5 1 60.0 20.750 0.5123 0 1.4 0 3.000 1 CTACTGATCTAATCTAATCA AAACTCACCCAACATCAGGA 10:3201094-3201280 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3202065 3202065 T C 418 256 93 68 145 111 T/C 0.6124 2;2 22.9 1 35.6 1 60.0 35.571 0.6103 0 1.2 247 111 98 37 81 30 T/C 0.4494 2;2 31.0 1 33.1 1 60.0 17.500 0.4412 0 1.6 0 2.000 1 TAAGAGGAAGCTAACGCTGA GGTTGTTTGTTTAGAGGGAT 10:3202031-3202237 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3202140 3202140 C T 14 6 3 5 2 4 C/T 0.4286 2;2 46.5 1 35.7 1 60.0 12.000 0.4286 0 1.5 341 153 95 93 72 81 C/T 0.4487 2;2 31.8 1 34.0 1 60.0 75.500 0.4507 0 1.5 0 2.000 5 GAATTCCCTCTGTAAAATGA GTGACGTTGTGAGTAGGCAC 10:3202031-3202237 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3206027 3206027 A G 212 121 43 47 61 60 A/G 0.5708 2;2 26.5 1 35.4 1 60.0 29.250 0.5707 0 1.2 58 33 6 19 9 24 A/G 0.5690 2;2 33.5 1 34.4 1 60.0 66.000 0.5690 0 1.2 0 2.000 1 ACCACTGAGTACGTGTGGTC GGAAGAAGTCTGTTCTGAAG 10:3205874-3206107 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3208557 3208557 A G 245 170 38 37 82 88 A/G 0.6939 2;2 14.2 1 35.5 1 59.7 33.000 0.6904 0 1.1 68 31 13 24 14 17 A/G 0.4559 2;2 23.6 1 33.3 1 60.6 6.750 0.4286 0 1.6 0 1.000 1 CCAGTACTGTCCATGGGAGT GTACGGAACTGCACGCTAGG 10:3208380-3208592 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3208567 3208592 T TGCACGCTAGGGAAGAGAGAGGAATG 274 0 144 129 0 0 T/T 0 2;0 5.8 1 35.9 1 59.6 33.125 1.0000 0 0.7 69 27 16 26 9 18 T/+25 0.3913 2;2 28.1 1 34.1 1 60.4 26.000 0.3939 0.3188 1.3 14 2.000 1 CCATGGGAGTAGTACGGAAC GCACGCTAGGGAAGGAGAAT 10:3208380-3208592 StrongLOH Insertion
HCC1187_spiked_UC3_RNA 10 10 3208583 3208583 A C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 43 13 9 20 0 13 A/C 0.3023 2;1 25.5 1 30.9 1 60.0 26.000 0.3171 0 1.5 0 1.000 1 GAACTGCACGCTAGGGAAGG GAATGACCAGAACGCAAAAG 10:3208380-3208592 Deletion SNV
HCC1187_spiked_UC3_RNA 10 10 3214369 3214369 A G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 35 43 12 28 7 A/G 0.3889 2;2 27.5 1 37.1 1 60.0 34.000 0.3953 0 1.1 0 2.000 1 TGGGCCGGCAGCGCTGAACT TTACCTTATATTAAACACGG 10:3214342-3214531 Deletion SNV
HCC1187_spiked_UC3_RNA 10 10 3214516 3214516 A G 1 1 0 0 0 1 G/G 1.0000 0;0 15.0 0 37.0 0 60.0 2.000 1.0000 0 1.0 155 68 30 57 23 45 A/G 0.4387 2;2 36.3 1 35.7 1 60.0 21.667 0.4333 0 1.1 0 1.000 1 AACACAGCGAGACTTTAAGT TGGGCTGTGGGCGCCTCGGG 10:3214342-3214531 Germline SNV
HCC1187_spiked_UC3_RNA 10 10 3214995 3214996 AC A 5 3 1 1 1 2 C/-1 0.6000 2;2 45.7 1 37.0 0 60.0 6.000 0.6000 0 0.0 268 125 75 68 79 46 C/-1 0.4664 2;2 34.5 1 35.4 1 60.0 24.000 0.4688 0.0112 0.4 4 5.000 1 GAGCACCTGGCTGGCGAGGA CCCCTCTTAACCCGACGCAG 10:3214887-3215100 Germline Deletion
Ignoring SAM validation error: ERROR: Read name HWI-D00528:212:C81CAANXX:1:1314:13296:35671, No real operator (M|I|D|N) in CIGAR
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
at java.util.concurrent.FutureTask.get(FutureTask.java:111)
at com.astrazeneca.vardict.VarDict.somaticParallel(VarDict.java:341)
at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:254)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:68)
at com.astrazeneca.vardict.Main.run(Main.java:134)
at com.astrazeneca.vardict.Main.main(Main.java:25)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
at java.util.concurrent.FutureTask.get(FutureTask.java:111)
at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5342)
at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5325)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:695)
at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1881)
at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2537)
at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5320)
at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5294)`

Vardict_Specific_Issue_Region.zip

Thank you for your assistance.

Cheers,
Sean

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 8137, Read name 0801tcm7, MAPQ should be 0 for unmapped read.

Hi there,
I'm getting an error from samtools:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 8137, Read name 0801tcm7, MAPQ should be 0 for unmapped read.

It seems to be related to poor alignments by bwa (aln), see section here http://broadinstitute.github.io/picard/faq.html

Q: Why am I getting errors from Picard like "MAPQ should be 0 for unmapped read" or "CIGAR should have zero elements for unmapped read?"
...

I can't seem to be able to copy the answer, but in Picard it's possible to relax the validation of reads. I wonder if there's a flag in samtools-java to do the same?

Java crashes when VarDict file modified to do profiling with hprof

I am trying to profile VarDictJava by modifying the DEFAULT_JVM_OPTS in the file build/install/VarDict/bin/VarDict to '"-Xrunhprof:cpu=times,depth=10" "-Xmx768m"'. However, when I try running the software, Java itself crashes, and I get the error message

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fb87dcdfc0c, pid=41165, tid=140429174187776

JRE version: Java(TM) SE Runtime Environment (7.0_45-b18) (build 1.7.0_45-b18)

Java VM: Java HotSpot(TM) 64-Bit Server VM (24.45-b08 mixed mode linux-amd64 compressed oops)

Problematic frame:

V [libjvm.so+0x80ec0c] os::PlatformEvent::unpark()+0x1c

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/gpfs01/home/kpbh388/tests/VarDictJava/build/install/VarDict/bin/hs_err_pid41165.log

If you would like to submit a bug report, please visit:

http://bugreport.sun.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

Aborted (core dumped)

The log file is attached. Thanks for your help.

hs_err_pid41165.txt

Error when local realignment is turned off:

This produces an error (used -k 0 to turn off local realignment):
VarDict -G human_g1k_v37_decoy.fasta -f 0.01 -h -b 'tumor.k.bam|normal.k.bam' -k 0 -Q 1 -c 1 -S 2 -E 3 -g 4 1.k.bed

The error message is:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 101
at java.lang.String.charAt(String.java:658)
at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:2194)
at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2536)
at com.astrazeneca.vardict.VarDict.somaticNotParallel(VarDict.java:356)
at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:251)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:67)
at com.astrazeneca.vardict.Main.run(Main.java:145)
at com.astrazeneca.vardict.Main.main(Main.java:25)

This runs without a problem (default, so local realignment is on):
VarDict -G human_g1k_v37_decoy.fasta -f 0.01 -h -b 'tumor.k.bam|normal.k.bam' -Q 1 -c 1 -S 2 -E 3 -g 4 1.k.bed

The data are stored here:
https://drive.google.com/folderview?id=0B9pfRlnkG-Z7fnAwUzFvYllnOGpOODNIdUdCaVI5Z1hRcl9zYi1vVmFTN0o4bURrelJHZlk&usp=sharing

Post processing Germline projects

Hi Vlad,

I would like to run all but the 'Variants' step in post processing for this project, /ngs/oncology/analysis/dev/Dev_0192_HiSeq4000_NormPlasma_WGS/bcbio/final, which has normal-only samples. I commented out the 'Variants' step in the config/run_info.yaml and it ran Seq2C and TargQC but not the full project report. Is there a way to do that? The script is not finding final/sample/*vardict.vcf files (there won't be any) and that I think is where it is interrupting.

Bcbio generated variants are in the datestamp/var/raw/*batch-vardict.vcf files and I am running vcf2txt standlone to generate variants.txt applying minimal filters, to be delivered in txt format.

Thanks,
Sakina

VardictJava having extensive runtime on very highly expressed RNA regions

We are running into cases where VardictJava is taking days to finish running on regions with very high coverage in RNA (highly expressed genes, coverage >700,000x). Have you worked through similar problems in the past? Do you have suggestions on how to improve performance on these very deep regions? Mutect, for example, has an option (dcov) where you can limit the max depth considered by the caller, greatly speeding up variant calling on deep regions.

Exception in thread "main" java.lang.NullPointerException

Hi,
I'm testing VarDict Java on bcbio-nextgen's integration test data and it's dying out on me likely due to edge cases:

Exception in thread "main" java.lang.NullPointerException
        at com.astrazeneca.vardict.VarDict.toRegions(VarDict.java:101)
        at com.astrazeneca.vardict.VarDict.start(VarDict.java:64)
        at com.astrazeneca.vardict.Main.run(Main.java:135)
        at com.astrazeneca.vardict.Main.main(Main.java:24)

When running VarDict (the perl version) it often throws missing variable warnings (but doesn't die out). I suspect this would cause the Java version to crash though if it's trying to access variables that don't exist. I'm just guessing here since the error message isn't very verbose.

3 Column Bed file

Java VarDict gives an immediate java.lang.ArrayIndexOutOfBoundsException when I try a 3-column bed file. The bed file looks like (tab delimited)
chr1 65509 65625
chr1 65831 65973
chr1 69481 69600
chr1 721381 721519
chr1 721530 721806
...

Information from an individual line works with the -R option.
An 8-column bed file does work with columns 4,5 and 6 empty like
chr1 65509 65625 65509 65625
...

Does this suggest that "amplicon mode" is always on? I have tried the -z option and that hasn't changed this behavior. Where have I gone wrong?

-O flag in Java version vs. -Q flag in Perl version

I noticed that in the Perl version, -Q enforces a minimum mapping quality, but in Java version it's -O.
I'm just wondering if it's intentional, or the tail in Q was accidentally lost to become an O.

Improving memory usage for longer regions

We've been testing VarDictJava on whole genome analyses chunked into analyzable sections. We typically break up the genome at natural breakpoints without coverage to avoid accidentally splitting an analysis across a larger indel or multi-nucleotide change.

This works for 99% of cases but we end up with some regions where we have slightly larger regions (2-5Mb) that cause out of memory errors with VarDictJava. VarDict also struggles with these larger chunks. Thanks are due to @lpantano for diagnosing the underlying issue.

Would it be possible to adjust memory usage so we can work with these larger regions without needing to drastically increase memory? @zhongwulai and @jabbarish might have ideas about places where we can reset some shared state and keep memory usage more stable. Our target has been to do regions up to 5Mb in 3Gb of memory.

Here is a reproducible test case with a small shell script that shows the issue:

https://s3.amazonaws.com/chapmanb/az/vardict_memory.tar.gz

The original region that causes memory issues is:

22  16050035    19050035

Chunking it into 1Mb blocks lets it finish without any issues on the same memory:

22  16050035    17050035
22  17050035    18050035
22  18050035    19050035

Here is the memory issue we see -- just a straight up out of memory error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.regex.Pattern.compile(Pattern.java:1655)
        at java.util.regex.Pattern.<init>(Pattern.java:1337)
        at java.util.regex.Pattern.compile(Pattern.java:1022)
        at com.astrazeneca.vardict.VarDict.findMSI(VarDict.java:3126)
        at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2899)
        at com.astrazeneca.vardict.VarDict.somaticNotParallel(VarDict.java:355)
        at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:251)
        at com.astrazeneca.vardict.VarDict.start(VarDict.java:67)
        at com.astrazeneca.vardict.Main.run(Main.java:137)
        at com.astrazeneca.vardict.Main.main(Main.java:24)

Thanks for any thoughts or suggestions.

VarDictJava instillation problem Execution failed for task ':javadoc'.

Hi,
I am getting an error in the second step of the instillation of the vardictjava. Could you help me to solve this problem ?

My java version is 1.8.0_71

Thanks

* What went wrong:
Execution failed for task ':javadoc'.
> Javadoc generation failed. Generated Javadoc options file (useful for troubleshooting): '/Users/morova/VarDictJava/build/tmp/javadoc/javadoc.options'

* Try:
Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':javadoc'.
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:69)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:46)
    at org.gradle.api.internal.tasks.execution.PostExecutionAnalysisTaskExecuter.execute(PostExecutionAnalysisTaskExecuter.java:35)
    at org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter.execute(SkipUpToDateTaskExecuter.java:64)
    at org.gradle.api.internal.tasks.execution.ValidatingTaskExecuter.execute(ValidatingTaskExecuter.java:58)
    at org.gradle.api.internal.tasks.execution.SkipEmptySourceFilesTaskExecuter.execute(SkipEmptySourceFilesTaskExecuter.java:52)
    at org.gradle.api.internal.tasks.execution.SkipTaskWithNoActionsExecuter.execute(SkipTaskWithNoActionsExecuter.java:52)
    at org.gradle.api.internal.tasks.execution.SkipOnlyIfTaskExecuter.execute(SkipOnlyIfTaskExecuter.java:53)
    at org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter.execute(ExecuteAtMostOnceTaskExecuter.java:43)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter$EventFiringTaskWorker.execute(DefaultTaskGraphExecuter.java:203)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter$EventFiringTaskWorker.execute(DefaultTaskGraphExecuter.java:185)
    at org.gradle.execution.taskgraph.AbstractTaskPlanExecutor$TaskExecutorWorker.processTask(AbstractTaskPlanExecutor.java:62)
    at org.gradle.execution.taskgraph.AbstractTaskPlanExecutor$TaskExecutorWorker.run(AbstractTaskPlanExecutor.java:50)
    at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor.process(DefaultTaskPlanExecutor.java:25)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter.execute(DefaultTaskGraphExecuter.java:110)
    at org.gradle.execution.SelectedTaskExecutionAction.execute(SelectedTaskExecutionAction.java:37)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:37)
    at org.gradle.execution.DefaultBuildExecuter.access$000(DefaultBuildExecuter.java:23)
    at org.gradle.execution.DefaultBuildExecuter$1.proceed(DefaultBuildExecuter.java:43)
    at org.gradle.execution.DryRunBuildExecutionAction.execute(DryRunBuildExecutionAction.java:32)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:37)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:30)
    at org.gradle.initialization.DefaultGradleLauncher$4.run(DefaultGradleLauncher.java:154)
    at org.gradle.internal.Factories$1.create(Factories.java:22)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:90)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:52)
    at org.gradle.initialization.DefaultGradleLauncher.doBuildStages(DefaultGradleLauncher.java:151)
    at org.gradle.initialization.DefaultGradleLauncher.access$200(DefaultGradleLauncher.java:32)
    at org.gradle.initialization.DefaultGradleLauncher$1.create(DefaultGradleLauncher.java:99)
    at org.gradle.initialization.DefaultGradleLauncher$1.create(DefaultGradleLauncher.java:93)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:90)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:62)
    at org.gradle.initialization.DefaultGradleLauncher.doBuild(DefaultGradleLauncher.java:93)
    at org.gradle.initialization.DefaultGradleLauncher.run(DefaultGradleLauncher.java:82)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter$DefaultBuildController.run(InProcessBuildActionExecuter.java:94)
    at org.gradle.tooling.internal.provider.ExecuteBuildActionRunner.run(ExecuteBuildActionRunner.java:28)
    at org.gradle.launcher.exec.ChainingBuildActionRunner.run(ChainingBuildActionRunner.java:35)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:43)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:28)
    at org.gradle.launcher.exec.ContinuousBuildActionExecuter.execute(ContinuousBuildActionExecuter.java:77)
    at org.gradle.launcher.exec.ContinuousBuildActionExecuter.execute(ContinuousBuildActionExecuter.java:47)
    at org.gradle.launcher.exec.DaemonUsageSuggestingBuildActionExecuter.execute(DaemonUsageSuggestingBuildActionExecuter.java:51)
    at org.gradle.launcher.exec.DaemonUsageSuggestingBuildActionExecuter.execute(DaemonUsageSuggestingBuildActionExecuter.java:28)
    at org.gradle.launcher.cli.RunBuildAction.run(RunBuildAction.java:43)
    at org.gradle.internal.Actions$RunnableActionAdapter.execute(Actions.java:170)
    at org.gradle.launcher.cli.CommandLineActionFactory$ParseAndBuildAction.execute(CommandLineActionFactory.java:237)
    at org.gradle.launcher.cli.CommandLineActionFactory$ParseAndBuildAction.execute(CommandLineActionFactory.java:210)
    at org.gradle.launcher.cli.JavaRuntimeValidationAction.execute(JavaRuntimeValidationAction.java:35)
    at org.gradle.launcher.cli.JavaRuntimeValidationAction.execute(JavaRuntimeValidationAction.java:24)
    at org.gradle.launcher.cli.CommandLineActionFactory$WithLogging.execute(CommandLineActionFactory.java:206)
    at org.gradle.launcher.cli.CommandLineActionFactory$WithLogging.execute(CommandLineActionFactory.java:169)
    at org.gradle.launcher.cli.ExceptionReportingAction.execute(ExceptionReportingAction.java:33)
    at org.gradle.launcher.cli.ExceptionReportingAction.execute(ExceptionReportingAction.java:22)
    at org.gradle.launcher.Main.doAction(Main.java:33)
    at org.gradle.launcher.bootstrap.EntryPoint.run(EntryPoint.java:45)
    at org.gradle.launcher.bootstrap.ProcessBootstrap.runNoExit(ProcessBootstrap.java:54)
    at org.gradle.launcher.bootstrap.ProcessBootstrap.run(ProcessBootstrap.java:35)
    at org.gradle.launcher.GradleMain.main(GradleMain.java:23)
    at org.gradle.wrapper.BootstrapMainStarter.start(BootstrapMainStarter.java:30)
    at org.gradle.wrapper.WrapperExecutor.execute(WrapperExecutor.java:129)
    at org.gradle.wrapper.GradleWrapperMain.main(GradleWrapperMain.java:61)
Caused by: org.gradle.api.GradleException: Javadoc generation failed. Generated Javadoc options file (useful for troubleshooting): '/Users/morova/VarDictJava/build/tmp/javadoc/javadoc.options'
    at org.gradle.api.tasks.javadoc.internal.JavadocGenerator.execute(JavadocGenerator.java:57)
    at org.gradle.api.tasks.javadoc.internal.JavadocGenerator.execute(JavadocGenerator.java:31)
    at org.gradle.api.tasks.javadoc.Javadoc.executeExternalJavadoc(Javadoc.java:143)
    at org.gradle.api.tasks.javadoc.Javadoc.generate(Javadoc.java:131)
    at org.gradle.internal.reflect.JavaMethod.invoke(JavaMethod.java:75)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.doExecute(AnnotationProcessingTaskFactory.java:227)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.execute(AnnotationProcessingTaskFactory.java:220)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.execute(AnnotationProcessingTaskFactory.java:209)
    at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:585)
    at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:568)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeAction(ExecuteActionsTaskExecuter.java:80)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:61)
    ... 60 more
Caused by: org.gradle.process.internal.ExecException: Process 'command '/Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/bin/javadoc'' finished with non-zero exit value 1
    at org.gradle.process.internal.DefaultExecHandle$ExecResultImpl.assertNormalExitValue(DefaultExecHandle.java:367)
    at org.gradle.process.internal.DefaultExecAction.execute(DefaultExecAction.java:31)
    at org.gradle.api.tasks.javadoc.internal.JavadocGenerator.execute(JavadocGenerator.java:52)
    ... 71 more


BUILD FAILED

Total time: 7.8 secs
Stopped 0 compiler daemon(s).

-k 0 to turn off local realignment and speed up

I am working with chicken genomes and there are many contigs cannot placed on the chromosomes. When I used vardict to call SNVs and indels in those contigs, vardict need more time (at least 5 times more than other chromosomes). I have followed GATK best practice to map reads to genome and done indel realignment. My questions are since I already did indel realignment, can I turn off local realignment of vardict (and still have reliable indel calles)? And can turning off local realignment speed up vardict ?

Best,
Hongen

Ignoring SAM validation error

Hello,
I am getting this error when running verdict-java on alignments from RNA-seq. The alignments were generated by the following command line

@PG ID:GSNAP.1  PN:gsnap    CL:gsnap -D /home/shared/ref_genome/Poplar/ -d poplarv3_gsnap --nthreads=16 -s gsnap_db/stringtie_cuffmerge/gtf_splicesite_file.iit -V gsnap_db/hisat_freebayes/snp/local_vcf_db_snp.iit --use-sarray=0 -B 5 -a paired -N 1 -m 4 -M 1 -i 2 -w 500000 -E 4 -n 100 --pairmax-rna=200000 --gmap-mode=pairsearch -A sam -O --read-group-id=SRR073112 --read-group-name=BELA --gunzip fastq//SRR073112_1.fastq.gz fastq//SRR073112_2.fastq.gz    VN:2015-12-31

The error message is

Ignoring SAM validation error: ERROR: Record 6874, Read name SRR073112.8947314, Mate Alignment start (25847531) must be <= reference sequence length (25263035) on reference Chr02
java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask.report(Unknown Source)
    at java.util.concurrent.FutureTask.get(Unknown Source)
    at com.astrazeneca.vardict.VarDict.vardictParallel(VarDict.java:289)
    at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:259)
    at com.astrazeneca.vardict.VarDict.start(VarDict.java:68)
    at com.astrazeneca.vardict.Main.run(Main.java:134)
    at com.astrazeneca.vardict.Main.main(Main.java:25)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(Unknown Source)
    at com.astrazeneca.vardict.VarDict.isHasAndNotEquals(VarDict.java:4068)
    at com.astrazeneca.vardict.VarDict.modifyCigar(VarDict.java:6179)
    at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1534)
    at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2538)
    at com.astrazeneca.vardict.VarDict$VardictWorker.call(VarDict.java:5374)
    at com.astrazeneca.vardict.VarDict$VardictWorker.call(VarDict.java:5352)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

And the information on the bam file for that read is the following

SRR073112.8947314   435 Chr02   75272   0   50M =   25847531    965 ACGTGAACGTCTTTATGAGAAGATGGGAATTGGCAGCAGTTATCTTTTTA  II@4:I8IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  RG:Z:SRR073112  MD:Z:7A1A40 NH:i:2  HI:i:2  NM:i:2  SM:i:0  XQ:i:0  X2:i:0  XO:Z:PM XS:A:-
SRR073112.8947314   371 Chr02   76187   0   21M29H  =   75272   965 TTGCAACTGTGGTTCTAGGAA   ;;FIIIIHIIIIIIIHIIIII   RG:Z:SRR073112  XH:Z:GTGCCGGGGATCATTGAATTAGGAATTAT  XI:Z:IIIIIIIIIIIIIIIIIIIIIIIIIIIII  MD:Z:21 NH:i:2  HI:i:2  NM:i:0  SM:i:0  XQ:i:0  X2:i:0  XO:Z:PM XS:A:+  XT:Z:GT-AG,2.00,2.00,[email protected]@25847531
SRR073112.8947314   67  Chr05   25847531    96  29M85N21M   =   25848531    965 ATAATTCCTAATTCAATGATCCCCGGCACTTCCTAGAACCACAGTTGCAA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIF;;  RG:Z:SRR073112  MD:Z:50 NH:i:2  HI:i:1  NM:i:0  SM:i:96 XQ:i:40 X2:i:0  XO:Z:PM XS:A:-
SRR073112.8947314   355 Chr05   25847531    0   29M21H  Chr02   75272   965 ATAATTCCTAATTCAATGATCCCCGGCAC   IIIIIIIIIIIIIIIIIIIIIIIIIIIII   RG:Z:SRR073112  XH:Z:TTCCTAGAACCACAGTTGCAA  XI:Z:IIIIIHIIIIIIIHIIIIF;;  MD:Z:29 NH:i:2  HI:i:2  NM:i:0  SM:i:0  XQ:i:0  X2:i:0  XO:Z:PM XS:A:-  XT:Z:GT-AG,2.00,2.00,[email protected]@25847531
SRR073112.8947314   131 Chr05   25848531    96  50M =   25847531    -965    TAAAAAGATAACTGCTGCCAATTCCCATCTTCTCATAAAGACGTTCACGT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII8I:4@II  RG:Z:SRR073112  MD:Z:50 NH:i:2  HI:i:1  NM:i:0  SM:i:96 XQ:i:40 X2:i:0  XO:Z:PM XS:A:+

and chromosomes sizes are

Chr02   25263035        51126598        80      81
Chr03   21816808        76705428        80      81
Chr04   24267051        98794954        80      81
Chr05   25890704        123365351       80      81

is this the same problem reported in #7?

Thanks in advance

VarDict producing illegal CIGAR strings in modifyCigar method and crashing

Hello,

When processing the attached bam (contains a single well-formatted read), using hg38 as a reference, VarDict produces a malformed cigar string with a negative number, causing samtools to throw an illegal argument exception.

java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Malformed CIGAR string: 3S38M6D-6I115M
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.astrazeneca.vardict.VarDict.vardictParallel(VarDict.java:286)
at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:256)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:65)
at com.astrazeneca.vardict.Main.run(Main.java:134)
at com.astrazeneca.vardict.Main.main(Main.java:25)

Command: VarDict -G ./hg38.fa -b "problem.bam" -N "problem" -c 1 -S 2 -E 3 -g 4 problem.bed > out.txt

Downloads.zip

java.lang.StringIndexOutOfBoundsException

I have seen the following exit message on error:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:658)
at com.astrazeneca.vardict.VarDict.modifyCigar(VarDict.java:6065)
at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1529)
at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2515)
at com.astrazeneca.vardict.VarDict.somaticNotParallel(VarDict.java:356)
at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:251)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:67)
at com.astrazeneca.vardict.Main.run(Main.java:137)
at com.astrazeneca.vardict.Main.main(Main.java:24)

I have two small BAM files of 1.5 MB each in size that can reproduce this error.

Command used were:
VarDictJava/build/install/VarDict/bin/VarDict -G human_g1k_v37_decoy.fasta -f 0.01 -h -b 'tumor.errored.bam|normal.errored.bam' -C -c 1 -S 2 -E 3 -g 4 errored.5kbps.bed 2> errored.msg > errored.var

Is there a way I can get the BAM files to you guys?

Thanks.

Differences in output for perl and java versions:

Hi there, can you take a look at the java and perl outputs in this gist:
https://gist.github.com/mjafin/2ef6660ec45bfa31ddbc

I'm going to refer to Excel notation for row and column numbering. If you look at column AB, the numerical precision in Perl is 1 decimal, whereas the Java output seems much higher. For consistency it would be great if you used the same precision.

I don't know if it's a consequence of numerical precision, but the following cells are different between the implementations:
22-W (1 vs. 0.976)
39-W (0.98 vs. 0.961)
39-AD (50 vs. 51)

Other than that it looks fantastic! Happy to continue testing this on bigger data sets. Let me know when you have time to look into fixing these.

build folder not showing up after running ./gradlew installApp clean

After running the "./gradlew installApp clean" command, I do not see the build folder anywhere. Gradle reports a successful build however. I suspected the problem could be due to permissions, so I ran chmod 760 * in the directory the vardict folder is in to no avail. This problem occurs on both my Windows PC and in a server running Centos 6.5. Do you know what the problem could be? Thanks.

CIGAR string parsing issues with alignments from bwa mem 0.7.12

Viktor and Zaal;
While running additional whole genome testing with VarDictJava, we ran into two issues parsing
CIGAR strings. These are both with alignment output from the latest bwa (bwa mem 0.7.12).

I put together a reproducible test case here with the two failures (and also the VarDict perl comparisons). You should only need to adjust the path to the GRCh37 reference genome to run the shell script:

https://s3.amazonaws.com/chapmanb/az/cigar_string_issues.tar.gz

The first issue looks related to #11. VarDict finished cleanly on this small
region (with no calls) while VarDictJava errors out with:

Exception in thread "main" java.lang.IllegalArgumentException: Malformed CIGAR string: 60S-1M1I25M15S
  at htsjdk.samtools.TextCigarCodec.decode(TextCigarCodec.java:69)
  at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1533)
  at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2515)
  at com.astrazeneca.vardict.VarDict.vardictNotParallel(VarDict.java:303)
  at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:256)
  at com.astrazeneca.vardict.VarDict.start(VarDict.java:52)
  at com.astrazeneca.vardict.Main.run(Main.java:137)
  at com.astrazeneca.vardict.Main.main(Main.java:24)

For the second issue VarDict raises a lot of Perl errors but finished (again
without any output), while VarDictJava errors out with:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
  at java.lang.String.charAt(String.java:658)
  at com.astrazeneca.vardict.VarDict.modifyCigar(VarDict.java:6065)
  at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1529)
  at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2515)
  at com.astrazeneca.vardict.VarDict.vardictNotParallel(VarDict.java:303)
  at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:256)
  at com.astrazeneca.vardict.VarDict.start(VarDict.java:52)
  at com.astrazeneca.vardict.Main.run(Main.java:137)
  at com.astrazeneca.vardict.Main.main(Main.java:24)

Both of these may be problematic CIGAR strings but it would be great if VarDictJava could cleanly handle them and continue.

Thanks much for all the work, please let me know if any other information would be helpful.

java.lang.StringIndexOutOfBoundsException

I have seen the following exit message on error:

I have two small BAM files of 1.5 MB each in size that can reproduce this error.

Is there a way I can get the BAM files to you guys?

Thanks.

Internal sentinel '^' character included with alternative allele in output

Viktor and Zaal;
Thanks again for all the fixes in the latest release. I've been testing on a larger whole genome dataset and ran into an additional problem where alternative alleles remain prefixed with a ^. Here is a minimal example with a shell script and subset BAM files:

https://s3.amazonaws.com/chapmanb/az/vardict_altallele_debug.tar.gz

The call at 103577362 has the extra carrot when called with VarDictJava but not in VarDict:

test-vardict-java.tsv:
syn3-tumor        1       1       103577362       103577364       GAT     ^NTTATAGATAG    14      22      10      0       2       G/-3iNTTATAGATAG        0.143   2;0     16.5    1       29.6    1       54.5    4.0000.154      0.071   0.5     8       0       3       5       0       0       G/G     0       2;0     20.0    1       34.1    158.0   16.000  1.000   0       0.5     0       5.000   1       TGTTTCATCCCCCTATTTTT    ATAGATAGATAGATAGATAG    1:103571062-103580719   StrongSomatic   Complex
test-vardict.tsv:
syn3-tumor     1       1       103577362       103577364       GAT     NTTATAGATAG     14      2       210     0       2       G/-3iNTTATAGATAG        0.143   2;0     16.5    1       29.6    1       54.5    4.000   0.1540.071      0.5     8       0       3       5       0       0       G/G     0       2;0     20.0    1       34.1    1       58.0    16.000  1.000   0       0.5     0       5.000   1       TGTTTCATCCCCCTATTTTT    ATAGATAGATAGATAGATAG    1:103571062-103580719   StrongSomatic   Complex

This will cause downstream tools to fail on the output since it's not a valid character. I dug into the code a bit and it looks like it's used as a sentinel internally but does not get removed in this case.

Please let me know if any other information would be helpful to debug this.

using more CPU than given by -th 4

Hi,
with the following command line

vardict-java \
-G /home/shared/ref_genome/Poplar/Ptrichocarpa_210_v3.0.fa -N DENA \
-b mapping/samples/hisat/DENA.sorted.bam \
-f 0.0 -c 1 -S 2 -E 3 -g 4 **-th 4** \
regions.bed |   teststrandbias.R

I am getting vardict-java using more than 4 CPUs.

I have come across similar issue with GATK before and I seem to remember the solution was to set the JVM option -XX:ParallelGCThreads=, http://gatkforums.broadinstitute.org/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster

Is there any chance verdict-java would set the number threads used on this way? I tried it by first getting the command that was being executed, adding the -XX:ParallelGCThreads= option and it definitively restricts the number of threads used.

Thanks in advance

from http://broadinstitute.github.io/picard/faq.html

Q: Why does a Picard program use so many threads?
A: This can be caused by the GC method of Java when used on 64 bit Java. By default the JVM switches to 'server' settings when on 64 bit, this automatically implements parallel GC and will use as many cores as it can get it's hands on. The approach we decided on to get round this was to define the number of threads we would allow Java for GC.

-XX:ParallelGCThreads=
An alternative approach is to turn off Parallel Gc (boolean option so note the '-' to indicate it is turned off):

-XX:+UseSerialGC
. We found this to be sub-optimal as the process has to stop completely when GC occurs and takes much longer as (from what I can tell) a full GC sweep is the only type performed which in many cases is not required (parallel GC employs ~7 different types of GC). See here for further details of the tuneable parameters.

Feature Request

Thanks again for being so responsive, It will be great if you guys could add the ability to genotype other samples from the vcf output of VarDict, especially for the merged variants.

Filtering insertions in homopolymer regions

Hi,

utilizing VarDictJava v1.3 for calling variants from RNA-seq of tumor cells gives me in an unreasonable amount of called insertions (> 5500 on the whole genome). I had a closer look on some of the suspicious inserts and they seem to be called at positions where a homopolymer of 6 or more equal nucleotides start. At these positions, most reads align perfectly with the reference sequence. However, a few have an additional nucleotide (e.g., AF=7%).

Currently, I am using quite straight forward filter settings (looking at DP, HICNT, etc). Do you have any experience on how to filter these variants for a high-quality prediction? Does VarDict somehow try to fix this during realignment or report some value which allows to filter this (except removing all insertions in homopolymer regions).

Thanks in advance
Manuel

Shortened line count with repeated start/end/ref/alt coordinates inserted

Thanks for the work on the VarDict java port. The speed improvements are very
welcome, and I've been working on validating the latest version (1.2.1) against
the DREAM ICGC-TCGA challenge data used previously to validate VarDict Perl
(http://bcb.io/2015/03/05/cancerval/).

I ran into some issues with truncated lines where the VarDictJava output will
have 37 lines instead of 51. A small reproducible test case is here with the BAM
files and a small shell script you'll have to adjust to point to a local copy of GRCh37:

https://s3.amazonaws.com/chapmanb/az/vardict_line_count_problem.tar.gz

with the output of vardict and vardict-java included. Specifically it differs at
the call at 237754437, where VarDictJava appears to repeat the start/end/ref/alt
in the middle of the line, but writes out the start and end values correctly:

['syn3-tumor', '1', '1', '237754437', '237754437', 'C', 'T', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '4', '2',
'1', '1', '0', '2', 'C/T', '0.500', '2;0', '13.5', '1', '32.0', '1', '60.0',
'4.000', '0.667', '0', '3.5', '0', '4.000', '3', 'CCCCTTCTCCTCCTCCCCCT',
'CTCCTCCTCCCCCTCCTCCT', '1:237753695-237754577', 'Deletion', 'SNV\n']

['syn3-tumor', '1', '1', '237754437', '237754437', 'C', 'T', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
'237754437', '237754437', 'C', 'T', '0', '4.000', '3', 'CCCCTTCTCCTCCTCCCCCT',
'CTCCTCCTCCCCCTCCTCCT', '1:237753695-237754577', 'Deletion', 'SNV\n']

Let me know if any other information would be helpful to debug this.

@mjafin @zhongwulai While tracking it down I ran into this potential issues with
the R code:

https://github.com/AstraZeneca-NGS/VarDict/blob/master/testsomatic.R#L5

If anything goes wrong it'll swallow up all the output, silently producing an
empty VCF file. Do we every expect it to be okay that R data frame import will
fail? If not, it would be good to raise an error here and remove the
try/catch. teststrandbias.R has the same issue. What do you think?

Thanks for looking at this.

Avoid duplicated `-Xmx` options for shell wrapper

When running with the VarDict shell script and passing in options via VAR_DICT_OPTS the Java memory specification ends up on the commandline twice since it is initially set in DEFAULT_JVM_OPTS. Apparently the behavior of this is that the later specification will take precedence, although it's not explicitly defined so is the potential source of hard to debug issues later:

http://stackoverflow.com/questions/2740725/duplicated-java-runtime-options-what-is-the-order-of-preference

I couldn't find the shell script in the distribution to submit a proper patch but a quick fix is to skip the default arguments if VAR_DICT_OPTS gets passed:

if [ -n "$VAR_DICT_OPTS" ] ; then
    DEFAULT_JVM_OPTS=''
else
    DEFAULT_JVM_OPTS='"-Xmx512m"'
fi

Would it be possible to include this in the shell script? Thanks much.

Ignoring SAM validation error + memory error

Hi,

I ran VarDictJava with the following code. Since I aimed to call mutations across the genome, I used the hg19_15k_150bpOL_.seg.txt file which I believe contained a regions across the genome with 5k bp intervals. Program had been running almost more than a day( I know that this depends on the hardware) I got an error says about I got out of memory.

When I got checked the last couple lines, I have seen two error such as MAPQ and that traceback error.

My questions;

Was my command entry right ?
How can I tackle this memory problem?
Is this about storage or ram ? because I keep all of the information in harddisk and even though I checked its storage it confuses me?
My output vcf was only comprised of a vcf header and nothing more.? Is this something expected?

Please help me.

Best,

Tunc

Ignoring SAM validation error: ERROR: Record 622, Read name HWI-ST1120:209:D1KP6ACXX:1:2307:10614:18487, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 623, Read name HWI-ST1120:209:D1KP6ACXX:1:1101:13014:47142, MAPQ should be 0 for unmapped read.
WARNING: BAM index file /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam.bai is older than BAM /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam
Ignoring SAM validation error: ERROR: Record 2, Read name HWI-ST1120:209:D1KP6ACXX:2:1209:16879:99921, MAPQ should be 0 for unmapped read.
WARNING: BAM index file /media/morova/blackhole/tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam.bai is older than BAM /media/morova/blackhole/tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam
Ignoring SAM validation error: ERROR: Record 622, Read name HWI-ST1120:209:D1KP6ACXX:1:2307:10614:18487, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 623, Read name HWI-ST1120:209:D1KP6ACXX:1:1101:13014:47142, MAPQ should be 0 for unmapped read.
WARNING: BAM index file /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam.bai is older than BAM /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam
Ignoring SAM validation error: ERROR: Record 2, Read name HWI-ST1120:209:D1KP6ACXX:2:1209:16879:99921, MAPQ should be 0 for unmapped read.
WARNING: BAM index file /media/morova/blackhole/tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam.bai is older than BAM /media/morova/blackhole/tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam
Ignoring SAM validation error: ERROR: Record 622, Read name HWI-ST1120:209:D1KP6ACXX:1:2307:10614:18487, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 623, Read name HWI-ST1120:209:D1KP6ACXX:1:1101:13014:47142, MAPQ should be 0 for unmapped read.
WARNING: BAM index file /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam.bai is older than BAM /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam
Ignoring SAM validation error: ERROR: Record 2, Read name HWI-ST1120:209:D1KP6ACXX:2:1209:16879:99921, MAPQ should be 0 for unmapped read.
WARNING: BAM index file /media/morova/blackhole/tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam.bai is older than BAM /media/morova/blackhole/tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam
Ignoring SAM validation error: ERROR: Record 622, Read name HWI-ST1120:209:D1KP6ACXX:1:2307:10614:18487, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 623, Read name HWI-ST1120:209:D1KP6ACXX:1:1101:13014:47142, MAPQ should be 0 for unmapped read.
WARNING: BAM index file /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam.bai is older than BAM /media/morova/blackhole/normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam
Ignoring SAM validation error: ERROR: Record 2, Read name HWI-ST1120:209:D1KP6ACXX:2:1209:16879:99921, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 3, Read name HWI-ST1120:209:D1KP6ACXX:2:1104:8070:91549, MAPQ should be 0 for unmapped read.

 *** caught illegal operation ***
address 0x7fed45e162fa, cause 'illegal operand'

Traceback:
 1: fisher.test(matrix(c(tref, d[i, 9], rref, d[i, 27]), nrow = 2))
aborting ...
Use of uninitialized value $sample in concatenation (.) or string at /home/morova/VarDict-1.4.4/VarDict/var2vcf_paired.pl line 35.

My command line entry.

sudo ~/VarDict-1.4.4/bin/VarDict -G reference/Homo_sapiens_assembly19.fasta -f $AF_THR -N tumor_sample -b "tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam|normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam" -z 1 -c 1 -S 2 -E 3 -g 4 /home/morova/VarDict-1.4.4/VarDict/human_hg19_5k_150bpOL_seg.txt | /home/morova/VarDict-1.4.4/VarDict/testsomatic.R | /home/morova/VarDict-1.4.4/VarDict/var2vcf_paired.pl -N "tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam|normal/TCGA-CH-5741-10A-01D-1572_/home/morova/VarDict-1.4.4/VarDict130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam" > vardict_asil.vcf.txt

Java version with bed fails where Perl or Java with -R option succeeds

Hello,

I have been running the Perl version of varDict before I realized I can switch to the Java version. Running nearly identical commands:

./build/install/VarDict/bin/VarDict -G ucsc.hg19.fasta -f 0.01 -N sampleName -b "../SL107646_rg_ordered.bam|../SL107661_rg_ordered.bam" -z -c 1 -S 2 -E 3 -g 4 /home/ubuntu/VarDict/test.bed

However, this gives me the following error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at htsjdk.samtools.Cigar.getCigarElement(Cigar.java:57) at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1494) at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2552) at com.astrazeneca.vardict.VarDict.ampVardictNotParallel(VarDict.java:5514) at com.astrazeneca.vardict.VarDict.start(VarDict.java:60) at com.astrazeneca.vardict.Main.run(Main.java:134) at com.astrazeneca.vardict.Main.main(Main.java:25)
When I replace the test.bed file with a -R and the region from that same bed file, everything works fine. This leads me to believe that there is no sam/bam processing issue (I already added read groups and ordered and indexed the bams). Could there be another underlying issue?

thanks,
sara

VarDict Amplicon Mode throws IndexOutOfBoundsException

I previously ran successfully using VarDict with the 6-column bed file without the amplicon mode. I then added the 7th and 8th column to the bed file and re-run the same sample using the amplicon mode, but failed. Here it is the error:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at htsjdk.samtools.Cigar.getCigarElement(Cigar.java:57)
at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1498)
at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2556)
at com.astrazeneca.vardict.VarDict.ampVardictNotParallel(VarDict.java:5518)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:60)
at com.astrazeneca.vardict.Main.run(Main.java:134)
at com.astrazeneca.vardict.Main.main(Main.java:25)

Here is how my 8-column bed file looks like:

chr1 4367240 4367362 rs1490413 1 + 4367262 4367342
chr1 45973896 45974012 r2275276 1 + 45973918 45973991
chr1 67861470 67861609 rs2229546 1 + 67861490 67861589
chr1 89388893 89389015 rs7532151 1 + 89388922 89388987
chr1 107984045 107984164 rs776284 1 + 107984068 107984143
chr1 167849380 167849521 rs203849 1 + 167849401 167849500
chr1 179520426 179520568 rs1410592 1 + 179520447 179520548
chr1 209811837 209811940 rs2076356 1 + 209811857 209811920

RNA-seq Variant Indel/SNV calling

Hello we would like to use your tool on a large set of RNA WTx samples (deep coverage) that have been aligned via STAR. Are there any notes on best practices for parameters to tweak for RNA-Seq data? So far we are thinking of "turning off" strand bias filter and reducing allele freq to 1%.

Thank you

Dave

Discrepancy between -q threshold and reported QMean value

Hi,
VarDict v1.3 does not call an SNV for which there seems to be a pileup due to low base quality, which is fine in general. However, there is a discrepancy between the filtering (-q 24 or -q 25) and the reported QMean value (27.3) which I believe to be the corresponding value in the output:

$ VarDict -q 25 -t -G $REF -b $BAM -R chr2:202508105-202508105
$ VarDict -q 24 -h -t -G $REF -b $BAM -R chr2:202508105-202508105
Sample  Gene    Chr     Start   End     Ref     Alt     Depth   AltDepth        RefFwdReads     RefRevReads     AltFwdReads     AltRevReads     Genotype        AF      Bias    PMean   PStd    QMean ...
data    chr2    chr2    202508105   202508105   C   A   62  12  25  25  6   6   C/A 0.1935  2;2 30.1    1   27.3    1   ...

I did not calculate the mean quality by my self but you can probably have a look at this nonetheless.

By the way, some cells are skipped in the header (using -h) but that's just cosmetics ;-)

Have a nice weekend!

Missing heterogenous deletion

@zhongwulai and all;
@ohofmann ran into a variant calling case where VarDict appears to be missing a real indel. Running FreeBayes on the same region results in a deletion call with high depth and a clear separation between tumor (first sample) and normal (second):

17      29562656        .       CTGTTT  CT      2102.99 .       AB=0.337979;ABP=68.4496;AC=1;AF=0.25;AN=4;AO=97;CIGAR=1M4D1M;DP=322;DPB=275.833;DPRA=8.2;EPP=6.79359;EPPR=3.49176;GTI=0;LEN=4;MEANALT=4;MQM=60;MQMR=60;NS=2;NUMALT=1;ODDS=48.4
244;PAIRED=1;PAIREDR=0.99095;PAO=9;PQA=224.5;PQR=1014.5;PRO=32;QA=3480;QR=8273;RO=221;RPL=49;RPP=3.03269;RPPR=3.80618;RPR=48;RUN=1;SAF=43;SAP=5.71904;SAR=54;SRF=119;SRP=5.84992;SRR=102;TYPE=del;technology.illumina=1   GT:DP:RO:QR:AO:QA:GL    0/1:287:187:7007:97:3480:-218.303,0,-600.114    0/0:35:34:1266:0:0:0,-10.8371,-120.597

The read support looks reasonable in terms of strand and positioning, so I couldn't identify an obvious reason why VarDict skips it. The VarDict output has no variants listed at all so I'm a big unsure how best to debug. Here is a minimal example and shell script that demonstrate the issue:

https://s3.amazonaws.com/chapmanb/vardict_missing_indel.tar.gz

I'd welcome any thoughts on digging into this to improve sensitivity for this case.

'nu' inserted into sequence in VarDict Java

Hi there,
Could I ask you to look (as a priority item where possible) into fixing a bug that enters nu into the sequence:

< SXFS_2147     chr12   chr12   125451750       125451783       A       ACACCAGGATGGATAACAGAGGGGCCCAnuCTTT      174
     2       102     70      1       1       A/+33   0.0115  2;2     26.0    1       33.2    1       60.0    4.000   0.0124  0.0115  10      3.000   1       0.0     2       161     CTCAGCCTGCCCCGTCAGGA    CACCAGGATGCCACCTGTGG    chr12:125451156-125451923       Insertion

---
> SXFS_2147     chr12   chr12   125451750       125451782       A       ACACCAGATGGATAACAGAGGGGCCCATTCTTT       174
     2       102     70      1       1       A/+32   0.0115  2;2     26.0    1       33.2    1       60.0    4.000   0.0124  0.0115  6       3.000   1       0.0     2       161     CTCAGCCTGCCCCGTCAGGA    CACCAGGATGCCACCTGTGG    chr12:125451156-125451923       Insertion

The first variant is from Java and the second the same one in Perl.

I can provide you a test case if you let me know where to put it (box, email, it's about 70MB)

"start point 90355650 lies after end point 90354753"

Actually it's a wrong region that has gone beyond the reference..... maybe this is the intended behavior?
The program exits when it runs into it, although the Perl version runs through it.

Exception in thread "main" htsjdk.samtools.SAMException: Malformed query; start point 90355650 lies after end point 90354753
at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:176)
at com.astrazeneca.vardict.VarDict.retriveSubSeq(VarDict.java:1022)
at com.astrazeneca.vardict.VarDict.getREF(VarDict.java:3136)
at com.astrazeneca.vardict.VarDict.somaticNotParallel(VarDict.java:355)
at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:252)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:68)
at com.astrazeneca.vardict.Main.run(Main.java:134)
at com.astrazeneca.vardict.Main.main(Main.java:25)

Vardict detecting no somatic mutations in DNA panel using proxy normal

Hello, I am intersted in applying Vardict to identify somatic indels using a broad panel with a proxy normal sample. We know this sample has true somatic indels, however, when we run the suggested pipeline porting in a proxy normal sample no indels are listed as "strongly somatic" or "likely somatic". Before we look into this deeper on our end, I wanted to ask whether this is expected behavior when using a proxy normal with your pipeline. There may be issues with having a higher density of idels detected when using a proxy normal as private mutations may come up as somatic and I was wondering whether you expect something like this to cause true somatic indels to loose statistical significance. We are currently using your tool for exome paired tumor-normal analysis and wanted to apply it to tumor-only panel applications as well. Thanks in advance, Sean

VarDict should clearly document that reads without `NM` tags are skipped

VarDictJava (and presumably VarDict) completely ignore reads that are missing the optional NM tag, but this isn't clearly stated in any documentation, nor are any warnings issued while running. It would be nice if the usage documentation mentioned this, and also if it either logged a warning the first time this happens, or logged the number of missing NM tags at the end of a run.

It would be even nicer if it just recomputed the NM tag on the fly since it has the read and the reference available. There are a number of methods in htsjdk for calculating NM or just the number of mismatches (in SequenceUtil), and since VarDictJava already relies on htsjdk that would be a relatively simple route.

about bed file input + running VarDictJava

Hi again,

Thank you for your previous fast response. It helped me a lot. I believe there is a problem with my mutation calling process.

I started running with this command.

~/VarDict-1.4.4/bin/VarDict -G reference/Homo_sapiens_assembly19.fasta -f $AF_THR -N tumor_sample -b "tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam|normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam" -z -c 1 -S 2 -E 3 -g 4 are_high_normal_tumor_sample.bed.txt | ~/VarDict-1.4.4/VarDict/testsomatic.R | ~/VarDict-1.4.4/VarDict/var2vcf_somatic.pl -N "tumor/TCGA-CH-5741-01A-11D-1572_130104_SN1120_0209_BD1KP6ACXX_s_1_rg.sorted.bam|normal/TCGA-CH-5741-10A-01D-1572_130104_SN1120_0209_BD1KP6ACXX_s_2_rg.sorted.bam" -f $AF_THR > vardict.vcf

My bed input is a typical bed format. However, I am calling my mutations based on a chip-seq bed file therefore I have peak names instead of gene names.

chr6 51217257 51217593 tumor_high_1802 523.998413
chr7 66969316 66969642 tumor_high_1803 422.902191
chr6 51567727 51568019 tumor_high_1804 618.637512
chr6 51602453 51602721 tumor_high_1805 717.315308
chr6 51877947 51878195 tumor_high_1806 656.460632
chr7 68195932 68196279 tumor_high_1807 247.741989
chr11 117046945 117047229 tumor_high_1808 340.127594
chr6 52590531 52590912 tumor_high_1809 275.999542
chr6 52610638 52610983 tumor_high_1810 290.982056
chr10 29565502 29565809 tumor_high_1812 284.837311
chr7 69061173 69061526 tumor_high_1813 757.335510

My first question is, is it have to be gene name all the time ?
During the process, I am getting an error about bam index file is older than the bam file. I dont think this is an important warning but does it change anything?
During this process, my verdict.vcf output size does not change? This is the actual reason that I am writing this post. I think file size should increase?

Thank you very much for your patience and help.

Best,

Tunc.

uninitialized sample

Hello
I am using vardict for a paired comparison and here is the bash script I am using. This is basically the paired example provided for the use case.

#!/bin/bash

module load samtools R vardictjava
cd results/vardict


AF_THR=0.01


VarDict -C -th 32 -G /hpf/tools/centos6/mugqic-pipelines/source/resource/genomes/species/Homo_sapiens.GRCh37/genome/Homo_sapiens.GRCh37.fa -f $AF_THR -N B3 -b "/hpf/largeprojects/ccmbio/arun/DIPG/Heterogeneity/results/alignment/B3/B3.sorted.dup.recal.bam|/hpf/largeprojects/ccmbio/arun/DIPG/Heterogeneity/results/alignment/B1/B1.sorted.dup.recal.bam" -z -F 1 -c 1 -S 2 -E 3 -g 4 /hpf/largeprojects/ccmbio/arun/DIPG/Heterogeneity/annots/ens.coding.exons.Sorted.S04380110_overlap.uniq.bed | testsomatic.R | var2vcf_paired.pl -N "B1|B3" -f $AF_THR > vardict.java.B3.bed.out

and get the following error:

Use of uninitialized value $sample in concatenation (.) or string at /hpf/tools/centos6/vardictjava/1.4.6/VarDict/var2vcf_paired.pl line 35.

Thanks
Arun

Oncofuse error

I have an error running bcbio 0.9.5 for RNA-Seq, when including Oncofuse in the pipeline. The error is
_Could not find java jar Oncofuse in /apps/bcbio-nextgen/0.9.5/rhel6-x64/share/java/oncofuse _
It seems the Oncofuse file is in /apps/bcbio-nextgen/0.9.5/rhel6-x64/bin/oncofuse and not in java share.
Please can you investigate?
Thanks.

Version 1.4.5.beta - fix for issues #7 and #26

Applied changes from 'ccb858dbcc4d279b6cdfa5983eba4f135fc63d04' revision of perl version (fixed issues 7, 26)

Due to discovered diffs in results of perl and java version on some data files, we are placing here binaries for tests.
VarDict-1.4.5.beta.zip

Exception in thread "main" java.lang.IllegalArgumentException: Malformed CIGAR string: 66S-3M38S

This error occurs on HCC1143 data downloaded from CGHUB.
The analysis_id or uuid is:
c727c60e-8ddd-1ddc-e040-ad451e414a77 for tumor
c727c60e-95b8-1455-e040-ad451e414a79 for normal

I stored the truncated BAM files that can reproduce the error in Google Drive:
https://drive.google.com/folderview?id=0B9pfRlnkG-Z7fngxOWtXV05FWG52SU5oMEUzcklMdmFtQVk3V3FDWm1ubk05TDJvX3F6N2s

VarDict -G genome.GRCh37.fa -f 0.01 -h -b 'tumor.error.bam|normal.error.bam' -c 1 -S 2 -E 3 -g 4 error.bed > error.var 2> error.warning.msg

-th option

Hello,
I just downloaded VarDictJava but I do not see the -th option as mentionned in the documentation.

Thanks,

VarDict throws stringIndexOutOfBoundsException

Hello,

Sometimes when encountering severely clipped reads VarDict will throw an index out of bounds exception and exit:Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:646)
at com.astrazeneca.vardict.VarDict.isHasAndNotEquals(VarDict.java:4084)
at com.astrazeneca.vardict.VarDict.modifyCigar(VarDict.java:6238)
at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1548)
at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2554)
at com.astrazeneca.vardict.VarDict.vardictNotParallel(VarDict.java:301)
at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:254)
at com.astrazeneca.vardict.VarDict.start(VarDict.java:65)
at com.astrazeneca.vardict.Main.run(Main.java:134)
at com.astrazeneca.vardict.Main.main(Main.java:25)

I've attached an example of a problematic alignment (aligned to hg38). I realize this read likely should not be aligning here, but we do occassionally get output like this from our aligner. Since it is a valid bam entry it would be nice if VarDict could handle this so we don't have to filter these out manually since this can be expensive on large bam files. Thanks!

Command: VarDict -G ./hg38.fa -b indexOutOfBoundsError.bam -N "problem" -c 1 -S 2 -E 3 -g 4 myVeryShortBed.bed
indexOutOfBoundsError.zip

Incorrect columns for variants at first position of a chromosome

Victor and Zaal;
We ran into an edge case issue with variants at the first position of a chromosome. In these cases, VarDictJava produces a truncated output file with 33 columns instead of 34, resulting in failures from teststrandbias.R when filtering and converting to VCF. The perl VarDict complains with a number of errors but produces reasonable output and continues.

I put together a reproducible test case here that shows the problem:

https://s3.amazonaws.com/chapmanb/az/first-pos-variant-testcase.tar.gz

If you edit the run.sh script to point to a local copy of hg19 it should work for VarDict and fail on VarDictJava. Please let me know if I can provide any other information that would help.

Configuring amplicon mode

Hi,
I'm struggling to get my head around the combination of settings required for running in amplicon mode. I've worked out that using -a int:float, a regions file needs to be a BED8, and must be tumour-only.

Based on a small custom BAM file around PTEN gene, i've tried a number of regions file, including whole chrom, whole gene, known exons, actual amplicon locations (this is from a HaloPlex design). Each time I need to use a large int value in the -a to see any variants.
If I run the whole chrom or whole gene, and -a 1000000:0.95, then i find 44 variants.
If I make a new BED file that spans 40bp around each of these known variants, and use -a 10:0.95, or -a 30:0.95, I find none of them.

So at this stage the only way i can see variants is to use an absurdly large int value for -a. From the documentation "...if the edges are less than int bp to the amplicon": does this mean the variant must be within 10bp of the end of the amplicon by default?

This image shows the exons, the haloplex amplicons, the variants, and the bed file surrounding these variants.

any help would be greatly appreciated

cheers,
Mark

astrazeneca-ngs / vardictjava Goto Github PK

vardictjava's Introduction

VarDictJava

Introduction

Requirements

Getting started

Getting source code

Compiling

Distribution Package Structure

Third-Party Libraries

Single sample mode

Paired variant calling

Amplicon based calling

Running Tests

Integration testing

User extension of testcases

Test cases file format

Test coverage Report

Thin-FASTA Format

Program Workflow

The main workflow

CIGAR Preprocessing (Initial Realignment)

Variants

Variant Description String

Variant Filtering

Bias flag explanation

Variant classification in paired(somatic) analysis

Program Options

VarDictJava options

Important var2vcf_valid.pl options

Important var2vcf_paired.pl options

Output columns

Simple mode:

Amplicon mode

Somatic mode

Input Files

BED File – Regions

FASTA File - Reference Genome

Errors and warnings

License

Contributors

vardictjava's People

Contributors

Stargazers

Watchers

Forkers

vardictjava's Issues

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007fb87dcdfc0c, pid=41165, tid=140429174187776

JRE version: Java(TM) SE Runtime Environment (7.0_45-b18) (build 1.7.0_45-b18)

Java VM: Java HotSpot(TM) 64-Bit Server VM (24.45-b08 mixed mode linux-amd64 compressed oops)

Problematic frame:

V [libjvm.so+0x80ec0c] os::PlatformEvent::unpark()+0x1c

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/gpfs01/home/kpbh388/tests/VarDictJava/build/install/VarDict/bin/hs_err_pid41165.log

If you would like to submit a bug report, please visit:

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

Recommend Projects

Recommend Topics

Recommend Org