samtools / bcftools Goto Github PK

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html

Home Page: http://samtools.github.io/bcftools/

License: Other

Shell 1.17% C 87.24% Makefile 0.52% Perl 9.54% Python 0.73% M4 0.68% Batchfile 0.12%

bcftools's Introduction

samtools

This is the official development repository for samtools.

The original samtools package has been split into three separate but tightly coordinated projects:

htslib: C-library for handling high-throughput sequencing data
samtools: mpileup and other tools for handling SAM, BAM, CRAM
bcftools: calling and other tools for handling VCF, BCF

Building Samtools

See INSTALL for complete details. Release tarballs contain generated files that have not been committed to this repository, so building the code from a Git repository requires extra steps:

autoheader            # Build config.h.in (this may generate a warning about
                      # AC_CONFIG_SUBDIRS - please ignore it).
autoconf -Wno-syntax  # Generate the configure script
./configure           # Needed for choosing optional functionality
make
make install

By default, this will build against an HTSlib source tree in ../htslib. You can alter this to a source tree elsewhere or to a previously-installed HTSlib by configuring with --with-htslib=DIR.

Citing

Please cite this paper when using SAMtools for your publications.

Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li
GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

@article{10.1093/gigascience/giab008,
    author = {Danecek, Petr and Bonfield, James K and Liddle, Jennifer and Marshall, John and Ohan, Valeriu and Pollard, Martin O and Whitwham, Andrew and Keane, Thomas and McCarthy, Shane A and Davies, Robert M and Li, Heng},
    title = "{Twelve years of SAMtools and BCFtools}",
    journal = {GigaScience},
    volume = {10},
    number = {2},
    year = {2021},
    month = {02},
    abstract = "{SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods.The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines.Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed \\&gt;1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.}",
    issn = {2047-217X},
    doi = {10.1093/gigascience/giab008},
    url = {https://doi.org/10.1093/gigascience/giab008},
    note = {giab008},
    eprint = {https://academic.oup.com/gigascience/article-pdf/10/2/giab008/36332246/giab008.pdf},
}

bcftools's People

Contributors

Stargazers

Watchers

Forkers

mcshane thomasleyer hjanime debian wtsi-hgi modi2020 samstudio8 singerma stschiff tjjjohnson kgururaj monkollek joelmartin winni2k ntthong theboocock litswu apastore tommycarstensen jherrero kontakter yeung678 curvaceous pd3 idiv-biodiversity avilella aertslab andrewjesaitis tbshirey mp15 johanneskoester mcol nordhuang genomicsplc kmholton theoryno3 neocruiser humanlongevity cyenyxe adamspargo dlaehnemann picsaver dkj ces mmesbahu kdm9 dvklopfenstein pauruihu jkbonfield muhuaw ramabit xtmgah alenzhao mh11 mrduck2320 davidroberson koelling jenniferliddle daviesrob jaredo yunzhengluoyi zzygyx9119 phuwanat brentp krooijers jessehoff kamilsjaron arielyuan arun-sub nsoranzo nieder eugenebragin hou garrettjstevens martinjvickers eyherabh tw7649116 wordless-chuanjun skerker shubhamsaini mwchristiansen graphanalytics bmpvieira-forks genomicsiter alexdavv liu-genomics olest jmarshall sunbymoon lesyamazurevich apozomt zhouhufeng bioinfotools cnaibon yifangt caot xingsiyuan clufox stekaz dayedepps

bcftools's Issues

annotate segfault

I'm unclear how to use bcftools annotate correctly. Specifically, how to construct the --columns argument to get the desired output. Potentially a few examples in the docs :)

My basic use cases.

Add IDs
Add specific INFO tag
Add all INFO tags
The above but while ignoring REF/ALT matching

A few attempts.

bcftools annotate -a dbsnp_138.hg19.bcf.gz -c INFO/CLINSIG  tmp_target.bcf.gz > tmp_anno.vcf
Segmentation fault (core dumped)

bcftools annotate -a dbsnp_138.hg19.bcf.gz -c CLINSIG  tmp_target.bcf.gz > tmp_anno.vcf
Segmentation fault (core dumped)

bcftools annotate -a dbsnp_138.hg19.bcf.gz -c CHROM,POS,-,-,-,-,-,-,INFO/CLINSIG  tmp_target.bcf.gz > tmp_anno.vcf
Segmentation fault (core dumped)

The dbsnp is from the GATK bundle.

https://drive.google.com/file/d/0B5pgT9nmbwfSRXJWNTlFVDJwbWs/edit?usp=sharing

HOB,ICB annotation for non-ref HET and multiallelic sites

These calculations do not take into account non-ref HETs and multiallelic sites.

bcftools tabix segfaults with unsupported -p option argument

$ bgzip norm.vcf
$ bcftools tabix -p gzvcf norm.vcf.gz 
Segmentation fault

$ bgzip norm.vcf
$ bcftools tabix -f -p randomcrap norm.vcf.gz 
Segmentation fault

vcf -> bcf : incomplete output

Converting a large vcf (dbsnp from gatk bundle) to bcf is running into a problem. The output only goes as far as chr2 with no error and exit status 0.

# convert a vcf to bcf
# the vcf.gz is 1.1GB and the bcf.gz is 176MB
bcftools view -O b dbsnp_138.hg19.vcf.gz > dbsnp_138.hg19.bcf.gz

# check the tail of the vcf via view
bcftools view  dbsnp_138.hg19.vcf.gz | tail -n 2 | cut -f -3
chr2    66823380        rs114241628
chr2    66823410        rs369988271

# more conventionally
zcat  dbsnp_138.hg19.vcf.gz | tail -n 2 | cut -f -3  
chrY    59363407        rs376709892
chrY    59363442        rs371324725

# check the tail of the bcf via view
bcftools view  dbsnp_138.hg19.vcf.gz | tail -n 2 | cut -f -3
chr2    66823380        rs114241628
chr2    66823410        rs369988271

Sorry about spamming the issues!

bcftools subset corrupts some annotations and is slow

Using awk to extract column for a single sample:
AC=5;AN=1724;DP=227802;DP4=109696,111215,3204,1373;HWE=1.000000e+00;MDV=190;MQ=41;NEGATIVE_TRAIN_SITE;PV4=5.4e-168,1,0,1;QBD=4.430391e-03;RPB=-2.692433e+01;VDB=4.838643e-01;VQSLOD=-9.116e+00;culprit=ReadPosRankSum GT:PL:DP:DV:SP:GQ 0/1:255,0,255:370:190:8:99

Vs using bcftools subset -I -S:
AC=5;AN=1724;DP=227802;DP4=109696,111215,3204,1373;HWE=1;MDV=190;MQ=41;NEGATIVE_TRAIN_SITE;PV4=0,1,0,1;QBD=0.00443039;RPB=-26.9243;VDB=0.483864;VQSLOD=-9.116;culprit=ReadPosRankSum GT:PL:DP:DV:SP:GQ 0/1:255,0,255:370:190:8:99

I can appreciate that bcftools subset normalises some values (e.g. going from HWE=1.000000e+00 to HWE=1 seems sensible enough), but changing a P-value annotation from a low P-value to 0 is somewhat less acceptable (e.g. PV4=5.4e-168 becoming PV4=0). I'm also slightly dubious that the subset operation drops some precision during the reformatting (e.g. for RPB=-2.692433e+01 becoming RPB=-26.9243), although if that were well documented it might be ok (it is just not something I expected of a simple subset operation).

Also, bcftools subset seems to be an order of magnitude slower than awk at this operation.

License of bcftools.

Hello,

thanks to your help on the htslib, I could prepare a draft Debian package for bcftools, that builds on a system-wide-installed htslib. I am currently applying the following patch.

diff --git a/Makefile b/Makefile
index 5139d1a..61c165b 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,6 @@ all: $(PROG)

 # Adjust $(HTSDIR) to point to your top-level htslib directory
 HTSDIR = ../htslib
-include $(HTSDIR)/htslib.mk
 HTSLIB = $(HTSDIR)/libhts.a

 CC=                    gcc
@@ -51,8 +50,8 @@ force:
 test: $(PROG)
                ./test/test.pl

-main.o: version.h $(HTSDIR)/version.h bcftools.h
-vcfcall.o: vcfcall.c call.h mcall.c prob1.h $(HTSDIR)/htslib/kfunc.h $(HTSDIR)/htslib/vcf.h
+main.o: version.h bcftools.h
+vcfcall.o: vcfcall.c call.h mcall.c prob1.h $(HTSDIR)/kfunc.h $(HTSDIR)/vcf.h
 mcall.o ccall.o: call.h vcmp.h
 vcffilter.o: filter.h
 vcfsubset.o: filter.h
@@ -73,4 +72,4 @@ install: $(PROG)
 cleanlocal:
                rm -fr gmon.out *.o a.out *.dSYM *~ $(PROG) version.h

-clean:cleanlocal clean-htslib
+clean:cleanlocal

On amd64, the packaging scripts then pass HTSDIR="/usr/include/htslib" and HTSLIB=/usr/lib/x86_64-linux-gnu/libhts.so to build on the dynamic library.

Back to my original question, I can not upload the package to Debian without being sure of its license, and currently only kmin.c and kmin.h have some information. Can you clarify the general license of bcftools ?

Cheers,

Charles

bcftools isec [E::bcf_hdr_read] invalid BCF2 magic string

I am getting this error when running bcftools isec (develop branch of htslib and latest version of bcftools):

bcftools isec -f PASS -r chr20 -p bwa_gatk_2x100.isec f1.vcf.gz f2.vcf.gz
[E::bcf_hdr_read] invalid BCF2 magic string
Segmentation fault

Any ideas?

AF2 tag not documented in vcf output

When using the bcftools view -1 flag...
The AF2 tag (in the INFO field when doing contrast/association between groups) is not documented anywhere. It should have meta-information line (##INFO=<ID=AF2,...) in the vcf output. Adding it to the manpage would be nice too.

Going from the code:
ccall.c line 124-125 says that AF2 = 1-em[4], 1-em[6]
em.c line 139 says em[5..6] is "group1 freq, group2 freq"

If I understand correctly, those freqs are for the reference allele, but I'm not 100% on that. Anyway, it just needs to be properly documented.

bcftools tabix segfaults on vcf.gz input that is actually not gzipped

bcftools tabix segfaults where tabix used to give a helpful warning.

Demonstrative test case, using bcftools and htslib tag 0.2.0-rc3:

$ cp test/norm.vcf norm.vcf.gz
$ bcftools tabix norm.vcf.gz
Segmentation fault
$ tabix norm.vcf.gz
[tabix] was bgzip used to compress this file? norm.vcf.gz

I've run this on Linux x86_64.

bcftools query usage reports itself as "vcfquery"

Minor issue: on "Usage:" line and in "Examples:" the "bcftools query" command reports itself as "vcfquery":

$ bcftools query
About:   Extracts fields from VCF/BCF file and prints them in user-defined format
Usage:   vcfquery [options] <file.vcf.gz>
Options:
    -a, --annots <list>               alias for -f '%CHROM\t%POS\t%MASK\t%REF\t%ALT\t%TYPE\t' + tab-separated <list> of tags
    -c, --collapse <string>           collapse lines with duplicate positions for <snps|indels|both|any|some>
    -f, --format <string>             learn by example, see below
    -H, --print-header                print header
    -l, --list-columns                list columns
    -p, --positions <file>            list positions in tab-delimited tabix indexed file <chr,pos> or <chr,from,to>, 1-based, inclusive
    -r, --region <reg|file>           output from the given regions only
    -s, --samples <list|file>         samples to include: comma-separated list or one name per line in a file
    -v, --vcf-list <file>             process multiple VCFs listed in the file
Expressions:
        %CHROM          The CHROM column (similarly also other columns, such as POS, ID, QUAL, etc.)
        %INFO/TAG       Any tag in the INFO column
        %TYPE           Variant type (REF, SNP, MNP, INDEL, OTHER)
        %MASK           Indicates presence of the site in other files (with multiple files)
        %TAG{INT}       Curly brackets to subscript vectors (0-based)
        []              The brackets loop over all samples
        %GT             Genotype (e.g. 0/1)
        %TGT            Translated genotype (e.g. C/A)
        %LINE           Prints the whole line
        %SAMPLE         Sample name
Examples:
        vcfquery -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz

Make fails, not ready yet?

A make command lead to a failure

gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib main.c -o main.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib vcfview.c -o vcfview.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib bcfidx.c -o bcfidx.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib tabix.c -o tabix.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib vcfcheck.c -o vcfcheck.o
vcfcheck.c: In function 'do_sample_stats':
vcfcheck.c:594: error: 'bcf_int8_vector_end' undeclared (first use in this function)
vcfcheck.c:594: error: (Each undeclared identifier is reported only once
vcfcheck.c:594: error: for each function it appears in.)
vcfcheck.c:594: error: 'bcf_int8_missing' undeclared (first use in this function)
vcfcheck.c:595: error: 'bcf_int16_vector_end' undeclared (first use in this function)
vcfcheck.c:595: error: 'bcf_int16_missing' undeclared (first use in this function)
vcfcheck.c:596: error: 'bcf_int32_vector_end' undeclared (first use in this function)
vcfcheck.c:596: error: 'bcf_int32_missing' undeclared (first use in this function)
vcfcheck.c: In function 'check_vcf':
vcfcheck.c:683: warning: implicit declaration of function 'bcf_sr_has_line'
vcfcheck.c: In function 'main_vcfcheck':
vcfcheck.c:981: warning: assignment makes integer from pointer without a cast
make: *** [vcfcheck.o] Error 1

merge silently changes duplicate sample names

bcftools merge will silently prepend the file number to a sample name if it has already been seen. Should fail on duplicate sample names unless a --force flag specified.

norm segfaults when `len(REF)>1` and `ALT=.`

Here is a line that could be added to test/norm.vcf to see the problem:

20 59 . AG . 999 PASS . GT:PL:DP 0/0:0,23,255:4

bcftools concat --allow-overlaps is not recognized

When I run this command:
bcftools concat --allow-overlaps
I get this error message:
concat: unrecognized option '--allow-overlaps'
However, using the -a option instead results in no error.

Here is my version of bcftools:
$ bcftools -v
bcftools 0.2.0-rc7-48-g94877a1
Using htslib 0.2.0-rc7-23-gbb4721c

segfault on view

I'm parsing ESP data by MAF and got a segfault. Initially MAF was a String so I converted it to a Float but no luck. The field is an array.

##INFO=<ID=MAF,Number=3,Type=Float,Description="Minor Allele Frequency in percent in the order of EA,AA,All">  
bcftools view -i "MAF[0]<0.01" tmp.vcf > tmp2.vcf

A few hundred rows.
https://drive.google.com/file/d/0B5pgT9nmbwfSV2s0dkxxd1pNa2s/edit?usp=sharing

Can't recalculate AC without AN

I was trying to re-calculate the INFO field for a VCF after dropping a sample. There seems to be a problem with the INFO field has an AC value but not an AN value. I get the "bcftools: vcfutils.c:43: bcf_calc_ac: Assertion `an>=nac' failed". I would expect the utility to either calculate an AN value, or not check for the assertion when AN values aren't present.

Here is a sample VCF:

---- test.vcf ----

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GD,Number=1,Type=Integer,Description="Genotype Depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLEA SAMPLEB
20  60081   .   G   A   100 PASS    DP=19;NS=2;AC=1;AF=0.25  GT:GD:GQ    0/0:5:44    0/1:14:71
20  60177   .   A   C   100 PASS    DP=19;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:5:44    0/0:9:56
20  60215   .   G   A   100 PASS    DP=19;NS=2;AC=2;AF=0.50  GT:GD:GQ    1/1:9:56    0/0:10:59
20  60292   .   C   T   100 PASS    DP=18;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:11:62   0/0:6:47
20  60309   .   G   T   100 PASS    DP=18;NS=2;AC=3;AF=0.75  GT:GD:GQ    1/1:12:65   0/1:9:56
20  60324   .   C   T   100 PASS    DP=18;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:11:62   0/0:9:56
20  60427   .   G   A   100 PASS    DP=18;NS=2;AC=4;AF=1.00  GT:GD:GQ    1/1:9:56    1/1:5:44
20  60552   .   G   A   100 PASS    DP=19;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:5:44    0/0:9:56
20  60578   .   C   T   100 PASS    DP=19;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:4:41    0/0:4:41

and here is the command I tried to run:

bcftools view -G -Ov -otest.subset.sites.vcf -sSAMPLEB test.vcf

bcftools top-level usage output

The bcftools usage output (i.e. when running "bcftools" by itself) has a few issues:

the "call" command says '(former "view"; this version is broken)' but this is somewhat unclear as it might suggest that the call command is broken. If what is intended is to convey that the "bcftools view" command can no longer be used for calling, perhaps this should be documented under "view" and not under "call"
there is what appears to be an unnecessary separation between tabix, index commands and the rest of the commands which come under the heading "-- VCF/BCF tools:". All these commands now apply to vcf/bcf
the "som" command is currently listed as '(broken)'
there is inconsistency between 'former "view"' on the call line and 'former vcfcheck' on the stats line (one uses quotes and the other does not)

This is what I currently see for this usage:

$ bcftools 

Version: 0.2.0-rc3:0.2.0-rc3
Usage:   bcftools <command> <argument>
Commands:
        tabix           tabix for BGZF'd BED, GFF, SAM, VCF and more
        index           index BCF

 -- VCF/BCF tools:
        call            SNP/indel calling (former "view"; this version is broken)
        filter          filter VCF files using fixed thresholds
        gtcheck         tool for detecting swaps and contaminations
        isec            intersections of VCF files
        merge           merge VCF files
        norm            normalize indels
        query           transform VCF into user-defined formats
        som             filter using Self-Organized Maps (broken)
        stats           produce VCF stats (former vcfcheck)
        subset          subset and filter vcf and bcf
        view            VCF<->BCF conversion

norm not normalising SNPs when `len(REF)==len(ALT)>1`

Here is a line that could be added to test/norm.vcf to see the issue:

20 80 . CACAG CACAT 999 PASS . GT:PL:DP 0/1:255,0,255:13

Would expect it to normalise to:

20 84 . G T 999 PASS . GT:PL:DP 0/1:255,0,255:13

vcfutils.pl vcf2fq

Is the vcfutils.pl supposed to work with the new format specification of samtools/bcftools? I tried using the vcfutils.pl script included in the develop branch of bcftools to extract a fastq sequence from my bcf file but it fails giving the error:

Use of uninitialized value $q in numeric lt (<) at /usr/local/bin/vcfutils.pl line 508, <> line 103.
Use of uninitialized value $q in addition (+) at /usr/local/bin/vcfutils.pl line 518, <> line 103.

Using a bcf file generated using samtools 0.2 / bcftools 0.2 I tried to call this using:

bcftools view my.raw.snps.bcf | vcfutils.pl vcf2fq - > ./my.raw.fastq

The original bcf file was generated using the command

samtools mpileup -6EBSDugp -f ./my_ref.fasta my.sorted.bam | bcftools call -vm -O b - > ./my.raw.snps.bcf

Am I doing something wrong or is there a bug in vcfutils.pl?

isec stdout

Hi again!

Is isec meant to be generating results to stdout? I'm getting unexpected binary output when using -w

Also, using -w 1 I see 0000.vcf, 0001.vcf and 0002.vcf but they only contain the headers.

# works as expected
bcftools isec -O z -p rare_snps tmp_target.vcf dbsnp_common.vcf.gz

# tmux explodes
bcftools isec -O z -w 1 -p rare_snps tmp_target.vcf dbsnp_common.vcf.gz

# capture stdout but results are wrong
bcftools isec -O z -w 1 -p rare_snps tmp_target.vcf dbsnp_common.vcf.gz > whatsthis.txt

samtools fixmate loses a read

My 2 read file dropped to a 1 read file after running through fixmate. I haven't tested, but perhaps it's simply missing the last read in a file?

Trivial example:

jkb@seq3a[work/samtools...] cat overlap50.sam
@HD     VN:1.0  SO:coordinate
@SQ     SN:17   LN:81195210     M5:351f64d4f4f9ddd45b35336ad97aa6de     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37       SP:Human
@RG     ID:ERR162872    LB:3815246      SM:HG00100      PI:349  CN:SC   PL:ILLUMINA     DS:SRP001294
foo2    99      17      478     60      100M    =       569     190     TAGGAGAGAGAAATGAAGACATATGTCCACACAAAAACCTGTTCATTGCAGCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGG   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872
foo2    147     17      528     60      100M    =       569     190     GCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGGGGAACTCATCAACAACAAACTTGTGGTTTACCCACACAATGGAAGACCAC   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872
jkb@seq3a[work/samtools...] ./samtools view -S -b overlap50.sam|./samtools sort -n -o - - |./samtools fixmate - - | ./samtools view -h -
@HD     VN:1.0  SO:queryname
@SQ     SN:17   LN:81195210     M5:351f64d4f4f9ddd45b35336ad97aa6de     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37       SP:Human
@RG     ID:ERR162872    LB:3815246      SM:HG00100      PI:349  CN:SC   PL:ILLUMINA     DS:SRP001294
foo2    99      17      478     60      100M    =       528     150     TAGGAGAGAGAAATGAAGACATATGTCCACACAAAAACCTGTTCATTGCAGCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGG   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872  MQ:i:60 ct:Z:1F100M-50T2R100M
jkb@seq3a[work/samtools...] ./samtools view -S -b overlap50.sam|./samtools sort -n -o - - |./samtools-0.1.19 fixmate - - | ./samtools view -h -
[bam_header_read] EOF marker is absent. The input is probably truncated.
@HD     VN:1.0  SO:queryname
@SQ     SN:17   LN:81195210     M5:351f64d4f4f9ddd45b35336ad97aa6de     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37       SP:Human
@RG     ID:ERR162872    LB:3815246      SM:HG00100      PI:349  CN:SC   PL:ILLUMINA     DS:SRP001294
foo2    99      17      478     60      100M    =       528     150     TAGGAGAGAGAAATGAAGACATATGTCCACACAAAAACCTGTTCATTGCAGCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGG   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872  CT:Z:1F100M-50T2R100M
foo2    147     17      528     60      100M    =       478     -150    GCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGGGGAACTCATCAACAACAAACTTGTGGTTTACCCACACAATGGAAGACCAC   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872

bcftools subset -r on an empty index file runs for a long time (forever?) and consumes a lot (unlimited?) memory

test case:

$ touch empty_file empty_file.tbi
$ bcftools subset -r empty_file norm.vcf

Expected behaviour: error message about a bad index file
Actual behaviour: bcftools allocates a very large amount of memory (several GB at least before I killed it)

bcftools subset docs don't make -r and -t option functionality clear

From the current usage docs:

   -r, --regions <reg|file>       same as -t but index-jumps rather than streams to a region (requires indexed VCF/BCF)
   -t, --targets <reg|file>       restrict to positions in tab-delimited tabix indexed file <chr,pos> or <chr,from,to>, 1-based, inclusive

It makes it seem like the -r functionality is the same as -t, except that -r requres an indexed VCF/BCF and will be more efficient in seeking to those regions.

In actuality, -r appears to include all variants overlapping with the -r regions, while -t includes only variants that start in the given interval.

Also, the docs make it seem like -t and -r might be mutually exclusive, but in actuality it probably makes a lot of sense to use -t with -r if you have an indexed file but want the -t functionality.

Finally, the method for specifying <reg> as opposed to <file> is not specified and is different from the file format so I think it is important to list it. I think the definition of <reg> is something like:

<chr>:<from>-<to>[,<chr>:<from>-<to>]*

bcftools query doesn't give a sensible error if multiple file arguments are given

"bcftools query" shows in its usage that it takes a single VCF/BCF file as an argument following the options.

However, it does not give a usage error if you pass it multiple files (rather, it seems to try to open and read them, but not as additional VCF/BCF files).

$ cp norm.vcf norm2.vcf
$ bcftools query -f '%CHROM\t%POS\n' norm.vcf norm2.vcf
Index required, expected .vcf.gz or .bcf file: norm.vcf
Failed to open or the file not indexed: norm.vcf

Expected: some kind of error indicating an invalid argument and a usage message
Actual: an error regarding an index being required for an uncompressed VCF file

$ bgzip norm.vcf
$ bgzip norm2.vcf
$ tabix norm.vcf.gz
$ tabix norm2.vcf.gz
$ bcftools query -f '%CHROM\t%POS\n' norm.vcf.gz norm2.vcf.gz && echo success
success

Expected: some kind of error indicating an invalid argument and a usage message
Actual: successful exit, but no output

$ bcftools query -f '%CHROM\t%POS\n' norm.vcf.gz nonexistant.vcf.gz 
[E::hts_open] fail to open file 'nonexistant.vcf.gz'
Failed to open or the file not indexed: nonexistant.vcf.gz

Expected: some kind of error indicating an invalid argument and a usage message
Actual: error opening a file that shouldn't have been opened in the first place

How variants are filtered by vcfutils.pl varFilter

I wonder how variants are filtered by vcfutils.pl varFilter. Using -p option will print the filtered VCF entry with a single letter tag at the beginning of the line.

$ vcfutils.pl varFilter
Usage:   vcfutils.pl varFilter [options] <in.vcf>
         -p        print filtered variants

As explained: "The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option."
(http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ)

From the vcfutils.pl script, I think the letter is one of UQdDaGgPMS. Could anyone explain/document what those UQdDaGgPMS tags mean?

sub varFilter_aux {
  my ($first, $is_print) = @_;
  if ($first->[1] == 0) {
    print join("\t", @$first[3 .. @$first-1]), "\n";
  } elsif ($is_print) {
    print STDERR join("\t", substr("UQdDaGgPMS", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
  }
}

Consensus Calling - vcfutils.pl

This used to be part of vcfutils.pl’s vcf2fq and was documented on the website unfortunately it disappeared when we separated out the projects. We should probably reinstate this.

[norm] returns wrong result

20:69503:GTGGAC,GTGGACACAC,GTGGACAC,GTGGACACACAC,GTGG,GTGGACACACACAC,ATGGACACACAC

is normalized to

20:69506:GAC,GACACAC,GACAC,GACACACAC,G,GACACACACAC,GACACACAC

This is wrong, the last allele is ATGGACACACAC and should not be truncated to GACACACAC

bcftools view sample_name issues

From: http://sourceforge.net/mailarchive/message.php?msg_id=31835966

A couple of minor bugs which have cropped up recently for me.

When supplying a list of sample names with -s to view a subset, if it encounters a name not found in the header it silently passes over it. (In fact it does this even there is only one entry in the list, and outputs a vcf with no samples.) I think at the very least it should raise a warning (e.g. at htslib/vcf.c:1838, if I read the code correctly), and preferably an error. Otherwise a typo or other error in the sample name will have downstream consequences which may be difficult to debug.
Sometimes a vcf sample name may correspond to a genuine filename (i.e. such that stat returns 0 at bcftools/vcfview.c:53). In that case, if presented with -s foo, vcftools will erroneously try and read a list of sample names from the file foo. Particularly problematic is that this behaviour will depend on where (i.e. what directory) the program is run.

To avoid this, I think it's probably necessary to distinguish with separate option flags whether samples are specified as a list or in a file. Alternatively, default to treating the argument as a sample name if it is found in the vcf header.

(Both are useful options, so I wouldn't drop either. This is/was a problem for vcf-subset also.)

Regards,
Aylwyn

-t works in plot-vcfstats, but not --title

This is minor, but the --title flag appears to have not been added.

main::error('Unknown parameter or non-existent file "--title". Run -h for ...') called at /usr/local/bin/plot-vcfstats line 213
main::parse_params() called at /usr/local/bin/plot-vcfstats line 22

I'm getting an error when trying to plot regarding log-scaling

plot-vcfstats aln_mem.txt -p plots/
Parsing vcfstats output: aln_mem.txt
Plotting graphs: python plots/plot.py
Traceback (most recent call last):
File "plots/plot.py", line 121, in
ax1.set_xscale('log')
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 2493, in set_xscale
self.autoscale_view(scaley=False)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 1860, in autoscale_view
x0, x1 = xlocator.view_limits(x0, x1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/ticker.py", line 1345, in view_limits
"Data has no positive values, and therefore can not be log-scaled.")
ValueError: Data has no positive values, and therefore can not be log-scaled.
The command exited with non-zero status 256:
python plots/plot.py
at /usr/local/bin/plot-vcfstats line 57.
main::error('The command exited with non-zero status 256:\x{a}\x{9}python plots/pl...') called at /usr/local/bin/plot-vcfstats line 245
main::plot('HASH(0x7f98eb001e28)') called at /usr/local/bin/plot-vcfstats line 46

Any ideas what might be happening?

--use-header option for view (Feature Request)

It would be helpful to be able to easily add a new header via something like bcftools view --use-header.

My use case is constant fixing of improper INFO specifications on very large vcfs. Currently, I print the header and vcf body to separate files, manually clean up the header, and then cat them back together.

bcftools fails to index VCF with extra trailing new line

Hi,
I tried indexing the bgzipped example VCF from the bioperl project found at http://www.well.ox.ac.uk/~winni/outgoing/example.vcf.gz and I get the following error:

$ bcftools index example.vcf.gz
[E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used?

[E::hts_idx_push] unsorted positions

When I remove the trailing new line the error disapears. Is this a bug?

feature request: option to print basic stats in human readable form

Maybe there is already an easy way to do this, but I am having a hard time figuring out how to get basic statistics about my bcf files. In particular it would be nice if there was an option for bcftools stats to just print out the number of samples and the number of sites in a bcf to stdout.

Filter on FORMAT fields

Hi all,
I am trying to use BCFTOOLS (version: 0.2.0-rc8-10-g7a1041c) to filter a bcf file but I am running into a couple of problems and I am not sure if I am mis-specifying my filter expressions or if there's a couple of bugs to be ironed out.

This was posted as a question on Seqanswers here

Filtering on string based FORMAT fields

e.g. trying to filter based on the genotype filed (FORMAT/GT) I use:

bcftools filter -i 'FORMAT/GT=="1/1"' ./mysnps.bcf | less

I return the text of the vcf header (i.e. it does not throw an error which to me indicates that bcftools was able to parse the filter expression) but zero variants. I can't get the string comparison to work.

Filtering on array subscripted FORMAT fields:

When I try to use a subscripted FORMAT field (e.g. PL) like this:

bcftools filter -i 'FORMAT/PL[0]==1' ./mysnps.bcf | less

I get the error:

Error: Arrays must be subscripted, e.g. PL[0]

Are array subscripted filters available for FORMAT fields?

Other filters work just fine, e.g.

bcftools filter -i "%QUAL>30" ./mysnps.bcf | less # this filter works fine
bcftools_0.2 filter -i "FORMAT/SP==1" JEL423.raw.snps.3.bcf | less # using a FORMAT field here works

FYI here are the relevant parts of my bcf header showing the INFO and FORMAT fields and their types which are included in the file I am filtering:

##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version=3>
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=0.2.0-rc8-10-g7a1041c+htslib-0.2.0-rc8-6-gd49dfa6
##bcftools_callCommand=call -vm -O b -
##bcftools_viewVersion=0.2.0-rc8-10-g7a1041c+htslib-0.2.0-rc8-6-gd49dfa6

segfault on --private

On devel I'm seeing a segfault on bcftools view --private

$ bcftools view -x private_segfault.vcf > private_segfault2.vcf                                                                                                                                               
bcftools: vcf.c:1027: bcf_write: Assertion `(h)->n[2]==v->n_sample' failed.
Aborted (core dumped)

https://drive.google.com/file/d/0B5pgT9nmbwfSdlgxLWNLV0l6Skk/edit?usp=sharing

Unclear which branch of htslib to use (should htslib be added as a git submodule?)

The header of bcftools on the github page says:

To compile, the bcftools+calling branch of htslib is needed: git clone --branch=bcftools+calling git://github.com/samtools/htslib.git htslib

But when I tried to do compile against that htslib branch, there were undefined references. I did find that http://samtools.github.io/bcftools/ suggests the develop branch instead, (git clone --branch=develop git://github.com/samtools/htslib.git)-- using this branch of htslib allows compilation to proceed. So at the very least, I think the text in the github short description needs to be changed.

It also might be a good idea to add htslib as a git submodule in bcftools so that a branch/version of htslib can be pinned by those doing bcftools development.

bcftools concat of files with single contig header lines

With separate chromosome BCFs that each only have a single ##contig line for their chromosome (e.g. added by default by some bcftools commands), then concat will segfault when combining them.

bcftools concat 1.bcf 2.bcf

where 1.bcf has ##contig=<ID=1,length=2147483647> and 2.bcf has ##contig=<ID=2,length=2147483647>.

The segfault comes after the last record of 1.bcf and before the first record of 2.bcf.

Region vs Target Filtering Documentation

A small documentation suggestion would be to clarify performance / use case for using -R vs -T.

I naively used -R with a sorted/tabixed (but not merged) bed file and the resulting order of the vcf was compromised. There were no issues with -T however it was substantially slower on a large vcf.

# -R (index)
real    4m54.352s
user    4m44.714s
sys     0m6.064s

# -T (streaming)
real    17m17.298s
user    17m3.484s
sys     0m9.653s

bcftools concat --allow-overlaps does not work as expected

When I call bcftools concat with two equivalent options on the VCF file found at
http://www.well.ox.ac.uk/~winni/outgoing/example_noTrailingNL.vcf.gz
I do not get the same result.

bcftools concat --allow-overlaps example_noTrailingNL.vcf.gz example_noTrailingNL.vcf.gz
bcftools concat -a example_noTrailingNL.vcf.gz example_noTrailingNL.vcf.gz

bcftools merge wont finish

I am testing "bcftools merge" for the first time on bgzipped vcf.gz files. I fire this command and bcftools is stuck not producing anything out, with the output file being of size 0. Am I doing something wrong?

find /my/dir/ -maxdepth 1 -name 'chr.vcf.gz' | sort | xargs bcftools merge -o z > all.genome.vcf.gz

Thanks

merge silently ignores a sample with bad tabix index

bcftools merge believes if tabix says the source vcf file does not contain sites. This can be seen as 'feature' but checking it would avoid many problems downstream.

bcftools filter for genotype fields

Giulio Genovese requested:

bcftools filter is capable to filter variant sites, but it would be nice if it was able to filter genotype calls. GATK VariantFiltration/-G_filter/-G_filterName (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_filters_VariantFiltration.html) is sort of capable to do so, but it can only tag genotype calls that satisfy a certain JEXL expression. It would be great if bcftools filter was able to do something similar, and in particular it would be great if bcftools filter was able to change to missing those genotype calls that satisfy a given filter expression.

-R / -T complement (Feature Request)

It would be nice to have the option to filter out regions instead of keeping them (a la "grep -v").

I gather this can be done via bcftools isec however it's a bit cumbersome for pipelines due to the output being written to a subdirectory and not simply stdout.

Is there a way to only remove all FORMAT tags except for GT?

record truncation

In an earlier post I had a problem with bcftools view --private and it was tracked down to incorrect number of columns for some rows.

Digging a bit deeper it looks like the line truncation was introduced during bcftools norm, specifically the conversion to bcf and not vcf.

# normalization to bcf and vcf
bcftools norm -O b -f ucsc.hg19.upper.fa  tmp_target.vcf > tmp_norm.bcf.gz
bcftools norm -O z -f ucsc.hg19.upper.fa  tmp_target.vcf > tmp_norm.vcf.gz

# viewing only has a problem with the bcf
bcftools view tmp_norm.bcf.gz > tmp_norm2.vcf 
Segmentation fault (core dumped)
bcftools view mp_norm.vcf.gz > tmp_norm2.vcf

https://drive.google.com/file/d/0B5pgT9nmbwfSRXJWNTlFVDJwbWs/edit?usp=sharing

How to get started using htslib

I would like to use htslib to subset out haplotypes from an indexed bcf file from within my own c++ code. I am at a bit of a loss on where to start to get myself familiar with the htslib API. Could I get some pointers on which files in the htslib directory I might use as templates for my own code? Thanks.

bcftools call gives empty results

I have posted an issue on biostars.org in which bcftools call after samtools mpileup gives just a VCF header with no variant calls. Would really appreciate it if the developers can follow it up and find what the problem might be. Thanks! Here is the link http://www.biostars.org/p/96425/#96520

Multi-allelic SNP calling

I have been trying to figure out if I understand this version of bcftools correctly. Does it now call multi-allelic SNPs (using the -m option)? The samtools manual page still notes that that is not the case with the samtools pipeline and I have heard the same from other researchers as well. Is this a new update (and welcomed) update to the version of bcftools?

Thanks,
Nate

bcftools query is very slow with large numbers of samples.

I have a VCF file containing 95k samples. Even when all variant lines are removed from the VCF, a simple query takes over 3 minutes:

$ time bcftools query --list-samples test.vcf.gz | wc -l
95708

real    3m16.188s
user    3m8.571s
sys     0m6.156s

The linux "perf top" command indicates all time is being spent in bcf_hdr_sync

A sample VCF can be found at https://www.dropbox.com/s/0heoazk6n8zciqo/test.vcf.gz

I am using bcftools/htslib 0.2.0-rc8.

An older version of htscmd (from May 2013) I have runs in less than a second:

$ time htscmd vcfquery --list-columns test.vcf.gz | wc -l
95708

real    0m0.101s
user    0m0.084s
sys     0m0.020s

Is there a fast way to split multiallelics into multiple rows?

I would like to split multiallelic SNPs into multiple rows so that each alternate allele gets its own row and "looks like" a biallelic SNP. Is there a fast way to do this with bcftools?