Coder Social home page Coder Social logo

samtools / bcftools Goto Github PK

View Code? Open in Web Editor NEW
622.0 622.0 240.0 19.86 MB

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html

Home Page: http://samtools.github.io/bcftools/

License: Other

Shell 1.17% C 87.24% Makefile 0.52% Perl 9.54% Python 0.73% M4 0.68% Batchfile 0.12%

bcftools's Introduction

samtools

Build Status Build status Github All Releases

This is the official development repository for samtools.

The original samtools package has been split into three separate but tightly coordinated projects:

  • htslib: C-library for handling high-throughput sequencing data
  • samtools: mpileup and other tools for handling SAM, BAM, CRAM
  • bcftools: calling and other tools for handling VCF, BCF

See also http://github.com/samtools/

Building Samtools

See INSTALL for complete details. Release tarballs contain generated files that have not been committed to this repository, so building the code from a Git repository requires extra steps:

autoheader            # Build config.h.in (this may generate a warning about
                      # AC_CONFIG_SUBDIRS - please ignore it).
autoconf -Wno-syntax  # Generate the configure script
./configure           # Needed for choosing optional functionality
make
make install

By default, this will build against an HTSlib source tree in ../htslib. You can alter this to a source tree elsewhere or to a previously-installed HTSlib by configuring with --with-htslib=DIR.

Citing

Please cite this paper when using SAMtools for your publications.

Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li
GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

@article{10.1093/gigascience/giab008,
    author = {Danecek, Petr and Bonfield, James K and Liddle, Jennifer and Marshall, John and Ohan, Valeriu and Pollard, Martin O and Whitwham, Andrew and Keane, Thomas and McCarthy, Shane A and Davies, Robert M and Li, Heng},
    title = "{Twelve years of SAMtools and BCFtools}",
    journal = {GigaScience},
    volume = {10},
    number = {2},
    year = {2021},
    month = {02},
    abstract = "{SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods.The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines.Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed \\>1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.}",
    issn = {2047-217X},
    doi = {10.1093/gigascience/giab008},
    url = {https://doi.org/10.1093/gigascience/giab008},
    note = {giab008},
    eprint = {https://academic.oup.com/gigascience/article-pdf/10/2/giab008/36332246/giab008.pdf},
}

bcftools's People

Contributors

athos avatar daviesrob avatar dlaehnemann avatar emollier avatar eyherabh avatar freeseek avatar garrettjstevens avatar ginggs avatar jenniferliddle avatar jherrero avatar jkbonfield avatar jmarshall avatar jrandall avatar junaruga avatar kdm9 avatar lh3 avatar lindenb avatar mcshane avatar mp15 avatar nicolaasuni avatar nsoranzo avatar pd3 avatar pjagiello avatar pvanheus avatar spikyclip avatar tobiasrausch avatar travc avatar valeriuo avatar whitwham avatar winni2k avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

bcftools's Issues

annotate segfault

I'm unclear how to use bcftools annotate correctly. Specifically, how to construct the --columns argument to get the desired output. Potentially a few examples in the docs :)

My basic use cases.

  1. Add IDs
  2. Add specific INFO tag
  3. Add all INFO tags
  4. The above but while ignoring REF/ALT matching

A few attempts.

bcftools annotate -a dbsnp_138.hg19.bcf.gz -c INFO/CLINSIG  tmp_target.bcf.gz > tmp_anno.vcf
Segmentation fault (core dumped)

bcftools annotate -a dbsnp_138.hg19.bcf.gz -c CLINSIG  tmp_target.bcf.gz > tmp_anno.vcf
Segmentation fault (core dumped)

bcftools annotate -a dbsnp_138.hg19.bcf.gz -c CHROM,POS,-,-,-,-,-,-,INFO/CLINSIG  tmp_target.bcf.gz > tmp_anno.vcf
Segmentation fault (core dumped)

The dbsnp is from the GATK bundle.

https://drive.google.com/file/d/0B5pgT9nmbwfSRXJWNTlFVDJwbWs/edit?usp=sharing

vcf -> bcf : incomplete output

Converting a large vcf (dbsnp from gatk bundle) to bcf is running into a problem. The output only goes as far as chr2 with no error and exit status 0.

# convert a vcf to bcf
# the vcf.gz is 1.1GB and the bcf.gz is 176MB
bcftools view -O b dbsnp_138.hg19.vcf.gz > dbsnp_138.hg19.bcf.gz

# check the tail of the vcf via view
bcftools view  dbsnp_138.hg19.vcf.gz | tail -n 2 | cut -f -3
chr2    66823380        rs114241628
chr2    66823410        rs369988271

# more conventionally
zcat  dbsnp_138.hg19.vcf.gz | tail -n 2 | cut -f -3  
chrY    59363407        rs376709892
chrY    59363442        rs371324725

# check the tail of the bcf via view
bcftools view  dbsnp_138.hg19.vcf.gz | tail -n 2 | cut -f -3
chr2    66823380        rs114241628
chr2    66823410        rs369988271

Sorry about spamming the issues!

bcftools subset corrupts some annotations and is slow

Using awk to extract column for a single sample:
AC=5;AN=1724;DP=227802;DP4=109696,111215,3204,1373;HWE=1.000000e+00;MDV=190;MQ=41;NEGATIVE_TRAIN_SITE;PV4=5.4e-168,1,0,1;QBD=4.430391e-03;RPB=-2.692433e+01;VDB=4.838643e-01;VQSLOD=-9.116e+00;culprit=ReadPosRankSum GT:PL:DP:DV:SP:GQ 0/1:255,0,255:370:190:8:99

Vs using bcftools subset -I -S:
AC=5;AN=1724;DP=227802;DP4=109696,111215,3204,1373;HWE=1;MDV=190;MQ=41;NEGATIVE_TRAIN_SITE;PV4=0,1,0,1;QBD=0.00443039;RPB=-26.9243;VDB=0.483864;VQSLOD=-9.116;culprit=ReadPosRankSum GT:PL:DP:DV:SP:GQ 0/1:255,0,255:370:190:8:99

I can appreciate that bcftools subset normalises some values (e.g. going from HWE=1.000000e+00 to HWE=1 seems sensible enough), but changing a P-value annotation from a low P-value to 0 is somewhat less acceptable (e.g. PV4=5.4e-168 becoming PV4=0). I'm also slightly dubious that the subset operation drops some precision during the reformatting (e.g. for RPB=-2.692433e+01 becoming RPB=-26.9243), although if that were well documented it might be ok (it is just not something I expected of a simple subset operation).

Also, bcftools subset seems to be an order of magnitude slower than awk at this operation.

License of bcftools.

Hello,

thanks to your help on the htslib, I could prepare a draft Debian package for bcftools, that builds on a system-wide-installed htslib. I am currently applying the following patch.

diff --git a/Makefile b/Makefile
index 5139d1a..61c165b 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,6 @@ all: $(PROG)

 # Adjust $(HTSDIR) to point to your top-level htslib directory
 HTSDIR = ../htslib
-include $(HTSDIR)/htslib.mk
 HTSLIB = $(HTSDIR)/libhts.a

 CC=                    gcc
@@ -51,8 +50,8 @@ force:
 test: $(PROG)
                ./test/test.pl

-main.o: version.h $(HTSDIR)/version.h bcftools.h
-vcfcall.o: vcfcall.c call.h mcall.c prob1.h $(HTSDIR)/htslib/kfunc.h $(HTSDIR)/htslib/vcf.h
+main.o: version.h bcftools.h
+vcfcall.o: vcfcall.c call.h mcall.c prob1.h $(HTSDIR)/kfunc.h $(HTSDIR)/vcf.h
 mcall.o ccall.o: call.h vcmp.h
 vcffilter.o: filter.h
 vcfsubset.o: filter.h
@@ -73,4 +72,4 @@ install: $(PROG)
 cleanlocal:
                rm -fr gmon.out *.o a.out *.dSYM *~ $(PROG) version.h

-clean:cleanlocal clean-htslib
+clean:cleanlocal

On amd64, the packaging scripts then pass HTSDIR="/usr/include/htslib" and HTSLIB=/usr/lib/x86_64-linux-gnu/libhts.so to build on the dynamic library.

Back to my original question, I can not upload the package to Debian without being sure of its license, and currently only kmin.c and kmin.h have some information. Can you clarify the general license of bcftools ?

Cheers,

Charles

bcftools isec [E::bcf_hdr_read] invalid BCF2 magic string

I am getting this error when running bcftools isec (develop branch of htslib and latest version of bcftools):

bcftools isec -f PASS -r chr20 -p bwa_gatk_2x100.isec f1.vcf.gz f2.vcf.gz
[E::bcf_hdr_read] invalid BCF2 magic string
Segmentation fault

Any ideas?

AF2 tag not documented in vcf output

When using the bcftools view -1 flag...
The AF2 tag (in the INFO field when doing contrast/association between groups) is not documented anywhere. It should have meta-information line (##INFO=<ID=AF2,...) in the vcf output. Adding it to the manpage would be nice too.

Going from the code:
ccall.c line 124-125 says that AF2 = 1-em[4], 1-em[6]
em.c line 139 says em[5..6] is "group1 freq, group2 freq"

If I understand correctly, those freqs are for the reference allele, but I'm not 100% on that. Anyway, it just needs to be properly documented.

bcftools tabix segfaults on vcf.gz input that is actually not gzipped

bcftools tabix segfaults where tabix used to give a helpful warning.

Demonstrative test case, using bcftools and htslib tag 0.2.0-rc3:

$ cp test/norm.vcf norm.vcf.gz
$ bcftools tabix norm.vcf.gz
Segmentation fault
$ tabix norm.vcf.gz
[tabix] was bgzip used to compress this file? norm.vcf.gz

I've run this on Linux x86_64.

bcftools query usage reports itself as "vcfquery"

Minor issue: on "Usage:" line and in "Examples:" the "bcftools query" command reports itself as "vcfquery":

$ bcftools query
About:   Extracts fields from VCF/BCF file and prints them in user-defined format
Usage:   vcfquery [options] <file.vcf.gz>
Options:
    -a, --annots <list>               alias for -f '%CHROM\t%POS\t%MASK\t%REF\t%ALT\t%TYPE\t' + tab-separated <list> of tags
    -c, --collapse <string>           collapse lines with duplicate positions for <snps|indels|both|any|some>
    -f, --format <string>             learn by example, see below
    -H, --print-header                print header
    -l, --list-columns                list columns
    -p, --positions <file>            list positions in tab-delimited tabix indexed file <chr,pos> or <chr,from,to>, 1-based, inclusive
    -r, --region <reg|file>           output from the given regions only
    -s, --samples <list|file>         samples to include: comma-separated list or one name per line in a file
    -v, --vcf-list <file>             process multiple VCFs listed in the file
Expressions:
        %CHROM          The CHROM column (similarly also other columns, such as POS, ID, QUAL, etc.)
        %INFO/TAG       Any tag in the INFO column
        %TYPE           Variant type (REF, SNP, MNP, INDEL, OTHER)
        %MASK           Indicates presence of the site in other files (with multiple files)
        %TAG{INT}       Curly brackets to subscript vectors (0-based)
        []              The brackets loop over all samples
        %GT             Genotype (e.g. 0/1)
        %TGT            Translated genotype (e.g. C/A)
        %LINE           Prints the whole line
        %SAMPLE         Sample name
Examples:
        vcfquery -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz

Make fails, not ready yet?

A make command lead to a failure

gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib main.c -o main.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib vcfview.c -o vcfview.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib bcfidx.c -o bcfidx.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib tabix.c -o tabix.o
gcc -c -g -Wall -Wc++-compat -O2 -I. -I../htslib vcfcheck.c -o vcfcheck.o
vcfcheck.c: In function 'do_sample_stats':
vcfcheck.c:594: error: 'bcf_int8_vector_end' undeclared (first use in this function)
vcfcheck.c:594: error: (Each undeclared identifier is reported only once
vcfcheck.c:594: error: for each function it appears in.)
vcfcheck.c:594: error: 'bcf_int8_missing' undeclared (first use in this function)
vcfcheck.c:595: error: 'bcf_int16_vector_end' undeclared (first use in this function)
vcfcheck.c:595: error: 'bcf_int16_missing' undeclared (first use in this function)
vcfcheck.c:596: error: 'bcf_int32_vector_end' undeclared (first use in this function)
vcfcheck.c:596: error: 'bcf_int32_missing' undeclared (first use in this function)
vcfcheck.c: In function 'check_vcf':
vcfcheck.c:683: warning: implicit declaration of function 'bcf_sr_has_line'
vcfcheck.c: In function 'main_vcfcheck':
vcfcheck.c:981: warning: assignment makes integer from pointer without a cast
make: *** [vcfcheck.o] Error 1

bcftools concat --allow-overlaps is not recognized

When I run this command:
bcftools concat --allow-overlaps
I get this error message:
concat: unrecognized option '--allow-overlaps'
However, using the -a option instead results in no error.

Here is my version of bcftools:
$ bcftools -v
bcftools 0.2.0-rc7-48-g94877a1
Using htslib 0.2.0-rc7-23-gbb4721c

Can't recalculate AC without AN

I was trying to re-calculate the INFO field for a VCF after dropping a sample. There seems to be a problem with the INFO field has an AC value but not an AN value. I get the "bcftools: vcfutils.c:43: bcf_calc_ac: Assertion `an>=nac' failed". I would expect the utility to either calculate an AN value, or not check for the assertion when AN values aren't present.

Here is a sample VCF:

---- test.vcf ----

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GD,Number=1,Type=Integer,Description="Genotype Depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLEA SAMPLEB
20  60081   .   G   A   100 PASS    DP=19;NS=2;AC=1;AF=0.25  GT:GD:GQ    0/0:5:44    0/1:14:71
20  60177   .   A   C   100 PASS    DP=19;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:5:44    0/0:9:56
20  60215   .   G   A   100 PASS    DP=19;NS=2;AC=2;AF=0.50  GT:GD:GQ    1/1:9:56    0/0:10:59
20  60292   .   C   T   100 PASS    DP=18;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:11:62   0/0:6:47
20  60309   .   G   T   100 PASS    DP=18;NS=2;AC=3;AF=0.75  GT:GD:GQ    1/1:12:65   0/1:9:56
20  60324   .   C   T   100 PASS    DP=18;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:11:62   0/0:9:56
20  60427   .   G   A   100 PASS    DP=18;NS=2;AC=4;AF=1.00  GT:GD:GQ    1/1:9:56    1/1:5:44
20  60552   .   G   A   100 PASS    DP=19;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:5:44    0/0:9:56
20  60578   .   C   T   100 PASS    DP=19;NS=2;AC=0;AF=0.00  GT:GD:GQ    0/0:4:41    0/0:4:41

and here is the command I tried to run:

bcftools view -G -Ov -otest.subset.sites.vcf -sSAMPLEB test.vcf

bcftools top-level usage output

The bcftools usage output (i.e. when running "bcftools" by itself) has a few issues:

  • the "call" command says '(former "view"; this version is broken)' but this is somewhat unclear as it might suggest that the call command is broken. If what is intended is to convey that the "bcftools view" command can no longer be used for calling, perhaps this should be documented under "view" and not under "call"
  • there is what appears to be an unnecessary separation between tabix, index commands and the rest of the commands which come under the heading "-- VCF/BCF tools:". All these commands now apply to vcf/bcf
  • the "som" command is currently listed as '(broken)'
  • there is inconsistency between 'former "view"' on the call line and 'former vcfcheck' on the stats line (one uses quotes and the other does not)

This is what I currently see for this usage:

$ bcftools 

Version: 0.2.0-rc3:0.2.0-rc3
Usage:   bcftools <command> <argument>
Commands:
        tabix           tabix for BGZF'd BED, GFF, SAM, VCF and more
        index           index BCF

 -- VCF/BCF tools:
        call            SNP/indel calling (former "view"; this version is broken)
        filter          filter VCF files using fixed thresholds
        gtcheck         tool for detecting swaps and contaminations
        isec            intersections of VCF files
        merge           merge VCF files
        norm            normalize indels
        query           transform VCF into user-defined formats
        som             filter using Self-Organized Maps (broken)
        stats           produce VCF stats (former vcfcheck)
        subset          subset and filter vcf and bcf
        view            VCF<->BCF conversion

vcfutils.pl vcf2fq

Is the vcfutils.pl supposed to work with the new format specification of samtools/bcftools? I tried using the vcfutils.pl script included in the develop branch of bcftools to extract a fastq sequence from my bcf file but it fails giving the error:

Use of uninitialized value $q in numeric lt (<) at /usr/local/bin/vcfutils.pl line 508, <> line 103.
Use of uninitialized value $q in addition (+) at /usr/local/bin/vcfutils.pl line 518, <> line 103.

Using a bcf file generated using samtools 0.2 / bcftools 0.2 I tried to call this using:

bcftools view my.raw.snps.bcf | vcfutils.pl vcf2fq - > ./my.raw.fastq 

The original bcf file was generated using the command

samtools mpileup -6EBSDugp -f ./my_ref.fasta my.sorted.bam | bcftools call -vm -O b - > ./my.raw.snps.bcf

Am I doing something wrong or is there a bug in vcfutils.pl?

isec stdout

Hi again!

Is isec meant to be generating results to stdout? I'm getting unexpected binary output when using -w

Also, using -w 1 I see 0000.vcf, 0001.vcf and 0002.vcf but they only contain the headers.

# works as expected
bcftools isec -O z -p rare_snps tmp_target.vcf dbsnp_common.vcf.gz

# tmux explodes
bcftools isec -O z -w 1 -p rare_snps tmp_target.vcf dbsnp_common.vcf.gz

# capture stdout but results are wrong
bcftools isec -O z -w 1 -p rare_snps tmp_target.vcf dbsnp_common.vcf.gz > whatsthis.txt

samtools fixmate loses a read

My 2 read file dropped to a 1 read file after running through fixmate. I haven't tested, but perhaps it's simply missing the last read in a file?

Trivial example:

jkb@seq3a[work/samtools...] cat overlap50.sam
@HD     VN:1.0  SO:coordinate
@SQ     SN:17   LN:81195210     M5:351f64d4f4f9ddd45b35336ad97aa6de     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37       SP:Human
@RG     ID:ERR162872    LB:3815246      SM:HG00100      PI:349  CN:SC   PL:ILLUMINA     DS:SRP001294
foo2    99      17      478     60      100M    =       569     190     TAGGAGAGAGAAATGAAGACATATGTCCACACAAAAACCTGTTCATTGCAGCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGG   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872
foo2    147     17      528     60      100M    =       569     190     GCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGGGGAACTCATCAACAACAAACTTGTGGTTTACCCACACAATGGAAGACCAC   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872
jkb@seq3a[work/samtools...] ./samtools view -S -b overlap50.sam|./samtools sort -n -o - - |./samtools fixmate - - | ./samtools view -h -
@HD     VN:1.0  SO:queryname
@SQ     SN:17   LN:81195210     M5:351f64d4f4f9ddd45b35336ad97aa6de     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37       SP:Human
@RG     ID:ERR162872    LB:3815246      SM:HG00100      PI:349  CN:SC   PL:ILLUMINA     DS:SRP001294
foo2    99      17      478     60      100M    =       528     150     TAGGAGAGAGAAATGAAGACATATGTCCACACAAAAACCTGTTCATTGCAGCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGG   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872  MQ:i:60 ct:Z:1F100M-50T2R100M
jkb@seq3a[work/samtools...] ./samtools view -S -b overlap50.sam|./samtools sort -n -o - - |./samtools-0.1.19 fixmate - - | ./samtools view -h -
[bam_header_read] EOF marker is absent. The input is probably truncated.
@HD     VN:1.0  SO:queryname
@SQ     SN:17   LN:81195210     M5:351f64d4f4f9ddd45b35336ad97aa6de     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37       SP:Human
@RG     ID:ERR162872    LB:3815246      SM:HG00100      PI:349  CN:SC   PL:ILLUMINA     DS:SRP001294
foo2    99      17      478     60      100M    =       528     150     TAGGAGAGAGAAATGAAGACATATGTCCACACAAAAACCTGTTCATTGCAGCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGG   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872  CT:Z:1F100M-50T2R100M
foo2    147     17      528     60      100M    =       478     -150    GCTTTCTACCATCACCAAAAATTGCAAACAACCACACGCCCTTCAACTGGGGAACTCATCAACAACAAACTTGTGGTTTACCCACACAATGGAAGACCAC   BCG>EG9GGICC@;HHILIHIICFEKJ<9JI9I<3J7I=EKBDF?A2FHIK2BFFJJ<F2CHFIGJJCGJJF;83HC/KE?II9IH0>D(J5E@F697?@     RG:Z:ERR162872

bcftools subset docs don't make -r and -t option functionality clear

From the current usage docs:

   -r, --regions <reg|file>       same as -t but index-jumps rather than streams to a region (requires indexed VCF/BCF)
   -t, --targets <reg|file>       restrict to positions in tab-delimited tabix indexed file <chr,pos> or <chr,from,to>, 1-based, inclusive

It makes it seem like the -r functionality is the same as -t, except that -r requres an indexed VCF/BCF and will be more efficient in seeking to those regions.

In actuality, -r appears to include all variants overlapping with the -r regions, while -t includes only variants that start in the given interval.

Also, the docs make it seem like -t and -r might be mutually exclusive, but in actuality it probably makes a lot of sense to use -t with -r if you have an indexed file but want the -t functionality.

Finally, the method for specifying <reg> as opposed to <file> is not specified and is different from the file format so I think it is important to list it. I think the definition of <reg> is something like:

<chr>:<from>-<to>[,<chr>:<from>-<to>]*

bcftools query doesn't give a sensible error if multiple file arguments are given

"bcftools query" shows in its usage that it takes a single VCF/BCF file as an argument following the options.

However, it does not give a usage error if you pass it multiple files (rather, it seems to try to open and read them, but not as additional VCF/BCF files).

$ cp norm.vcf norm2.vcf
$ bcftools query -f '%CHROM\t%POS\n' norm.vcf norm2.vcf
Index required, expected .vcf.gz or .bcf file: norm.vcf
Failed to open or the file not indexed: norm.vcf

Expected: some kind of error indicating an invalid argument and a usage message
Actual: an error regarding an index being required for an uncompressed VCF file

$ bgzip norm.vcf
$ bgzip norm2.vcf
$ tabix norm.vcf.gz
$ tabix norm2.vcf.gz
$ bcftools query -f '%CHROM\t%POS\n' norm.vcf.gz norm2.vcf.gz && echo success
success

Expected: some kind of error indicating an invalid argument and a usage message
Actual: successful exit, but no output

$ bcftools query -f '%CHROM\t%POS\n' norm.vcf.gz nonexistant.vcf.gz 
[E::hts_open] fail to open file 'nonexistant.vcf.gz'
Failed to open or the file not indexed: nonexistant.vcf.gz

Expected: some kind of error indicating an invalid argument and a usage message
Actual: error opening a file that shouldn't have been opened in the first place

How variants are filtered by vcfutils.pl varFilter

I wonder how variants are filtered by vcfutils.pl varFilter. Using -p option will print the filtered VCF entry with a single letter tag at the beginning of the line.

$ vcfutils.pl varFilter
Usage:   vcfutils.pl varFilter [options] <in.vcf>
         -p        print filtered variants

As explained: "The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option."
(http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ)

From the vcfutils.pl script, I think the letter is one of UQdDaGgPMS. Could anyone explain/document what those UQdDaGgPMS tags mean?

sub varFilter_aux {
  my ($first, $is_print) = @_;
  if ($first->[1] == 0) {
    print join("\t", @$first[3 .. @$first-1]), "\n";
  } elsif ($is_print) {
    print STDERR join("\t", substr("UQdDaGgPMS", $first->[1], 1), @$first[3 .. @$first-1]), "\n";
  }
}

Consensus Calling - vcfutils.pl

This used to be part of vcfutils.pl’s vcf2fq and was documented on the website unfortunately it disappeared when we separated out the projects. We should probably reinstate this.

[norm] returns wrong result

20:69503:GTGGAC,GTGGACACAC,GTGGACAC,GTGGACACACAC,GTGG,GTGGACACACACAC,ATGGACACACAC

is normalized to

20:69506:GAC,GACACAC,GACAC,GACACACAC,G,GACACACACAC,GACACACAC

This is wrong, the last allele is ATGGACACACAC and should not be truncated to GACACACAC

bcftools view sample_name issues

From: http://sourceforge.net/mailarchive/message.php?msg_id=31835966

A couple of minor bugs which have cropped up recently for me.

  1. When supplying a list of sample names with -s to view a subset, if it encounters a name not found in the header it silently passes over it. (In fact it does this even there is only one entry in the list, and outputs a vcf with no samples.) I think at the very least it should raise a warning (e.g. at htslib/vcf.c:1838, if I read the code correctly), and preferably an error. Otherwise a typo or other error in the sample name will have downstream consequences which may be difficult to debug.
  2. Sometimes a vcf sample name may correspond to a genuine filename (i.e. such that stat returns 0 at bcftools/vcfview.c:53). In that case, if presented with -s foo, vcftools will erroneously try and read a list of sample names from the file foo. Particularly problematic is that this behaviour will depend on where (i.e. what directory) the program is run.

To avoid this, I think it's probably necessary to distinguish with separate option flags whether samples are specified as a list or in a file. Alternatively, default to treating the argument as a sample name if it is found in the vcf header.

(Both are useful options, so I wouldn't drop either. This is/was a problem for vcf-subset also.)

Regards,
Aylwyn

-t works in plot-vcfstats, but not --title

This is minor, but the --title flag appears to have not been added.

main::error('Unknown parameter or non-existent file "--title". Run -h for ...') called at /usr/local/bin/plot-vcfstats line 213
main::parse_params() called at /usr/local/bin/plot-vcfstats line 22

I'm getting an error when trying to plot regarding log-scaling

plot-vcfstats aln_mem.txt -p plots/
Parsing vcfstats output: aln_mem.txt
Plotting graphs: python plots/plot.py
Traceback (most recent call last):
File "plots/plot.py", line 121, in
ax1.set_xscale('log')
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 2493, in set_xscale
self.autoscale_view(scaley=False)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/axes.py", line 1860, in autoscale_view
x0, x1 = xlocator.view_limits(x0, x1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/matplotlib/ticker.py", line 1345, in view_limits
"Data has no positive values, and therefore can not be log-scaled.")
ValueError: Data has no positive values, and therefore can not be log-scaled.
The command exited with non-zero status 256:
python plots/plot.py
at /usr/local/bin/plot-vcfstats line 57.
main::error('The command exited with non-zero status 256:\x{a}\x{9}python plots/pl...') called at /usr/local/bin/plot-vcfstats line 245
main::plot('HASH(0x7f98eb001e28)') called at /usr/local/bin/plot-vcfstats line 46

Any ideas what might be happening?

--use-header option for view (Feature Request)

It would be helpful to be able to easily add a new header via something like bcftools view --use-header.

My use case is constant fixing of improper INFO specifications on very large vcfs. Currently, I print the header and vcf body to separate files, manually clean up the header, and then cat them back together.

feature request: option to print basic stats in human readable form

Maybe there is already an easy way to do this, but I am having a hard time figuring out how to get basic statistics about my bcf files. In particular it would be nice if there was an option for bcftools stats to just print out the number of samples and the number of sites in a bcf to stdout.

Filter on FORMAT fields

Hi all,
I am trying to use BCFTOOLS (version: 0.2.0-rc8-10-g7a1041c) to filter a bcf file but I am running into a couple of problems and I am not sure if I am mis-specifying my filter expressions or if there's a couple of bugs to be ironed out.

This was posted as a question on Seqanswers here

Filtering on string based FORMAT fields

e.g. trying to filter based on the genotype filed (FORMAT/GT) I use:

bcftools filter -i 'FORMAT/GT=="1/1"' ./mysnps.bcf | less

I return the text of the vcf header (i.e. it does not throw an error which to me indicates that bcftools was able to parse the filter expression) but zero variants. I can't get the string comparison to work.

Filtering on array subscripted FORMAT fields:

When I try to use a subscripted FORMAT field (e.g. PL) like this:

bcftools filter -i 'FORMAT/PL[0]==1' ./mysnps.bcf | less

I get the error:

Error: Arrays must be subscripted, e.g. PL[0]

Are array subscripted filters available for FORMAT fields?

Other filters work just fine, e.g.

bcftools filter -i "%QUAL>30" ./mysnps.bcf | less # this filter works fine
bcftools_0.2 filter -i "FORMAT/SP==1" JEL423.raw.snps.3.bcf | less # using a FORMAT field here works

FYI here are the relevant parts of my bcf header showing the INFO and FORMAT fields and their types which are included in the file I am filtering:

##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version=3>
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=0.2.0-rc8-10-g7a1041c+htslib-0.2.0-rc8-6-gd49dfa6
##bcftools_callCommand=call -vm -O b -
##bcftools_viewVersion=0.2.0-rc8-10-g7a1041c+htslib-0.2.0-rc8-6-gd49dfa6

Unclear which branch of htslib to use (should htslib be added as a git submodule?)

The header of bcftools on the github page says:

To compile, the bcftools+calling branch of htslib is needed: git clone --branch=bcftools+calling git://github.com/samtools/htslib.git htslib

But when I tried to do compile against that htslib branch, there were undefined references. I did find that http://samtools.github.io/bcftools/ suggests the develop branch instead, (git clone --branch=develop git://github.com/samtools/htslib.git)-- using this branch of htslib allows compilation to proceed. So at the very least, I think the text in the github short description needs to be changed.

It also might be a good idea to add htslib as a git submodule in bcftools so that a branch/version of htslib can be pinned by those doing bcftools development.

bcftools concat of files with single contig header lines

With separate chromosome BCFs that each only have a single ##contig line for their chromosome (e.g. added by default by some bcftools commands), then concat will segfault when combining them.

bcftools concat 1.bcf 2.bcf

where 1.bcf has ##contig=<ID=1,length=2147483647> and 2.bcf has ##contig=<ID=2,length=2147483647>.

The segfault comes after the last record of 1.bcf and before the first record of 2.bcf.

Region vs Target Filtering Documentation

A small documentation suggestion would be to clarify performance / use case for using -R vs -T.

I naively used -R with a sorted/tabixed (but not merged) bed file and the resulting order of the vcf was compromised. There were no issues with -T however it was substantially slower on a large vcf.

# -R (index)
real    4m54.352s
user    4m44.714s
sys     0m6.064s

# -T (streaming)
real    17m17.298s
user    17m3.484s
sys     0m9.653s

bcftools merge wont finish

I am testing "bcftools merge" for the first time on bgzipped vcf.gz files. I fire this command and bcftools is stuck not producing anything out, with the output file being of size 0. Am I doing something wrong?

find /my/dir/ -maxdepth 1 -name 'chr.vcf.gz' | sort | xargs bcftools merge -o z > all.genome.vcf.gz

Thanks

bcftools filter for genotype fields

Giulio Genovese requested:

bcftools filter is capable to filter variant sites, but it would be nice if it was able to filter genotype calls. GATK VariantFiltration/-G_filter/-G_filterName (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_filters_VariantFiltration.html) is sort of capable to do so, but it can only tag genotype calls that satisfy a certain JEXL expression. It would be great if bcftools filter was able to do something similar, and in particular it would be great if bcftools filter was able to change to missing those genotype calls that satisfy a given filter expression.

-R / -T complement (Feature Request)

It would be nice to have the option to filter out regions instead of keeping them (a la "grep -v").

I gather this can be done via bcftools isec however it's a bit cumbersome for pipelines due to the output being written to a subdirectory and not simply stdout.

record truncation

In an earlier post I had a problem with bcftools view --private and it was tracked down to incorrect number of columns for some rows.

Digging a bit deeper it looks like the line truncation was introduced during bcftools norm, specifically the conversion to bcf and not vcf.

# normalization to bcf and vcf
bcftools norm -O b -f ucsc.hg19.upper.fa  tmp_target.vcf > tmp_norm.bcf.gz
bcftools norm -O z -f ucsc.hg19.upper.fa  tmp_target.vcf > tmp_norm.vcf.gz

# viewing only has a problem with the bcf
bcftools view tmp_norm.bcf.gz > tmp_norm2.vcf 
Segmentation fault (core dumped)
bcftools view mp_norm.vcf.gz > tmp_norm2.vcf 

https://drive.google.com/file/d/0B5pgT9nmbwfSRXJWNTlFVDJwbWs/edit?usp=sharing

How to get started using htslib

I would like to use htslib to subset out haplotypes from an indexed bcf file from within my own c++ code. I am at a bit of a loss on where to start to get myself familiar with the htslib API. Could I get some pointers on which files in the htslib directory I might use as templates for my own code? Thanks.

Multi-allelic SNP calling

I have been trying to figure out if I understand this version of bcftools correctly. Does it now call multi-allelic SNPs (using the -m option)? The samtools manual page still notes that that is not the case with the samtools pipeline and I have heard the same from other researchers as well. Is this a new update (and welcomed) update to the version of bcftools?

Thanks,
Nate

bcftools query is *very* slow with large numbers of samples.

I have a VCF file containing 95k samples. Even when all variant lines are removed from the VCF, a simple query takes over 3 minutes:

$ time bcftools query --list-samples test.vcf.gz | wc -l
95708

real    3m16.188s
user    3m8.571s
sys     0m6.156s

The linux "perf top" command indicates all time is being spent in bcf_hdr_sync

A sample VCF can be found at https://www.dropbox.com/s/0heoazk6n8zciqo/test.vcf.gz

I am using bcftools/htslib 0.2.0-rc8.

An older version of htscmd (from May 2013) I have runs in less than a second:

$ time htscmd vcfquery --list-columns test.vcf.gz | wc -l
95708

real    0m0.101s
user    0m0.084s
sys     0m0.020s

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.