illumina / hap.py
Haplotype VCF comparison tools
License: Other
At
hap.py/src/c++/lib/variant/VariantReader.cpp
Line 735 in 6c907ce
`ad` is allocated with `adcount` elements, whereas the function in
hap.py/src/c++/lib/tools/BCFHelpers.cpp
Line 532 in 6c907ce
writes `values.size()` elements into `ad` (`values` is a vector holding the elements of the AD entry of a sample), causing a buffer overflow.
This could be solved either by exiting on an incorrect AD field, by truncating the interpreted AD fields so that they have the correct length, or by making the allocated size of `ad` match the size of the actual AD fields (I haven't actually looked through the code to see how this code path is used and what would be appropriate).
This bug can be shown to happen with the following VCF (I've exaggerated the number of AD fields to make the problem happen). It should segfault on the next `delete []` statement due to heap corruption:
##fileformat=VCFv4.1
##contig=<ID=chr1,length=10000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr1 2 . AAT T 1 PASS . GT:AD 1/1:1,1,1,1,2,3,2,4,1,3,5,5,3,2,12,1,3,3,1
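A minimal sketch of the safer behavior, in Python for illustration (the function and names are hypothetical, not hap.py's actual code): clamp the number of AD values copied to the allocated count instead of trusting the record.

```python
def read_ad(values, adcount):
    """Copy at most `adcount` AD values into a fixed-size buffer.

    `values` plays the role of the vector read from the sample's AD
    entry; `adcount` is the allocated size. Extra values are truncated
    (with a warning) instead of being written past the end of `ad`.
    """
    ad = [0] * adcount  # fixed-size allocation, mimicking VariantReader.cpp
    if len(values) > adcount:
        print("[W] too many AD fields: got %d, expected at most %d"
              % (len(values), adcount))
    for i, v in enumerate(values[:adcount]):  # clamp instead of overflow
        ad[i] = v
    return ad
```

With the exaggerated record above (19 AD values against a buffer sized for the declared alleles), an unclamped copy loop writes past the end of the buffer; the clamped version truncates and warns.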
It seems there is a poorly documented dependency for hap.py:
[ 89%] Linking CXX executable ../../../bin/vcfhdr2json
Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_intervalbuffer.cpp.o
Linking CXX executable ../../../bin/fastainfo
[ 90%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_intervallist.cpp.o
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/fastainfo] Error 1
make[1]: *** [src/c++/main/CMakeFiles/fastainfo.dir/all] Error 2
make[2]: *** [bin/vcfhdr2json] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: *** [src/c++/main/CMakeFiles/vcfhdr2json.dir/all] Error 2
[ 92%] [ 92%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_main.cpp.o
Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_locationinfo.cpp.o
[ 94%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_popcount.cpp.o
Linking CXX executable ../../../bin/gvcf2bed
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/gvcf2bed] Error 1
make[1]: *** [src/c++/main/CMakeFiles/gvcf2bed.dir/all] Error 2
[ 95%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_refvar.cpp.o
[ 96%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_stringutil.cpp.o
Linking CXX executable ../../../bin/dipenum
[ 97%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_variantprocessing.cpp.o
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/dipenum] Error 1
make[1]: *** [src/c++/main/CMakeFiles/dipenum.dir/all] Error 2
[ 98%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_variants.cpp.o
Linking CXX executable ../../../bin/preprocess
Linking CXX executable ../../../bin/hapenum
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/preprocess] Error 1
make[1]: *** [src/c++/main/CMakeFiles/preprocess.dir/all] Error 2
[100%] Building CXX object src/c++/test/CMakeFiles/test_haplotypes.dir/test_variantstatistics.cpp.o
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/hapenum] Error 1
make[1]: *** [src/c++/main/CMakeFiles/hapenum.dir/all] Error 2
Linking CXX executable ../../../bin/hapcmp
Linking CXX executable ../../../bin/blocksplit
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/hapcmp] Error 1
make[1]: *** [src/c++/main/CMakeFiles/hapcmp.dir/all] Error 2
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/blocksplit] Error 1
make[1]: *** [src/c++/main/CMakeFiles/blocksplit.dir/all] Error 2
Linking CXX executable ../../../bin/vcfcheck
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/vcfcheck] Error 1
make[1]: *** [src/c++/main/CMakeFiles/vcfcheck.dir/all] Error 2
Linking CXX executable ../../../bin/alleles
Linking CXX executable ../../../bin/validatevcf
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/alleles] Error 1
make[1]: *** [src/c++/main/CMakeFiles/alleles.dir/all] Error 2
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/validatevcf] Error 1
make[1]: *** [src/c++/main/CMakeFiles/validatevcf.dir/all] Error 2
Linking CXX executable ../../../bin/multimerge
Linking CXX executable ../../../bin/xcmp
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/multimerge] Error 1
make[1]: *** [src/c++/main/CMakeFiles/multimerge.dir/all] Error 2
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/xcmp] Error 1
make[1]: *** [src/c++/main/CMakeFiles/xcmp.dir/all] Error 2
Linking CXX executable ../../../bin/roc
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/roc] Error 1
make[1]: *** [src/c++/main/CMakeFiles/roc.dir/all] Error 2
Linking CXX executable ../../../bin/scmp
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/scmp] Error 1
make[1]: *** [src/c++/main/CMakeFiles/scmp.dir/all] Error 2
Linking CXX executable ../../../bin/quantify
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/quantify] Error 1
make[1]: *** [src/c++/main/CMakeFiles/quantify.dir/all] Error 2
Linking CXX executable ../../../bin/test_haplotypes
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/test_haplotypes] Error 1
make[1]: *** [src/c++/test/CMakeFiles/test_haplotypes.dir/all] Error 2
make: *** [all] Error 2
Traceback (most recent call last):
File "install.py", line 298, in <module>
main()
File "install.py", line 283, in main
build_haplotypes(source_dir, build_dir, args)
File "install.py", line 157, in build_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /tmp/buildn0adKc && make -j24' returned non-zero exit status 2
I've tried to compile it under CentOS Linux release 7.2.1511.
Does anyone know this problem? Thanks in advance!
In our deep dive into hap.py with DeepVariant, we've noticed that hap.py includes in its confident calls for comparison any multi-base event (e.g., a deletion) that merely overlaps the confident regions, rather than requiring that event to be fully contained within them. Internally we require an event to be fully contained within the confident regions to be considered confident. We wanted to double-check here (a) that hap.py intends this behavior, and (b) whether overlap (rather than containment) is the recommended approach according to the GA4GH benchmarking recommendations.
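For reference, the two candidate rules can be stated precisely. This is a generic sketch with half-open intervals, not hap.py's implementation:

```python
def overlaps(event, region):
    """True if half-open intervals [event) and [region) share at least one base."""
    return event[0] < region[1] and region[0] < event[1]

def contained(event, region):
    """True if half-open interval [event) lies entirely within [region)."""
    return region[0] <= event[0] and event[1] <= region[1]

# Example: a 10 bp deletion whose last 7 bases extend past the end of a
# confident region. The "overlap" rule includes it; "containment" excludes it.
deletion = (100, 110)
confident = (50, 103)
```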
Occasionally variant callers will produce output that is incoherent, but hap.py sometimes does not detect this type of error and can produce incorrect true positives. For instance, given input like:
Truth VCF:
11 34133117 TTCCA TACCA,T 1/2
Query VCF:
11 34133117 TTCCA T 0/1
11 34133118 T A 0/1
hap.py will proceed normally and emit a 'TP' result. However, the variants in the query VCF make conflicting assertions about the state of position 34133118 (the first variant asserts that it's ref + deleted, while the second asserts that it's ref + 'A'; they can't both be correct).
It seems that hap.py should produce at minimum an FP/FN result for the above situation, or, even better, an error value of some sort indicating that the input was invalid.
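One way to surface such inputs is to flag query records whose reference spans overlap, and then subject those to a phase-aware consistency check. A generic sketch (hypothetical helper names, not hap.py code):

```python
def ref_span(pos, ref):
    """Half-open reference interval covered by a VCF record."""
    return (pos, pos + len(ref))

def overlapping_pairs(records):
    """Return adjacent pairs of (pos, ref, alt) records whose reference
    spans overlap.

    Overlapping spans are not automatically inconsistent (two het calls
    may sit on opposite haplotypes), but they are exactly the cases that
    need a consistency check instead of being silently accepted.
    Only adjacent records after sorting are compared, which is enough
    for this example.
    """
    out = []
    recs = sorted(records)
    for a, b in zip(recs, recs[1:]):
        sa, sb = ref_span(a[0], a[1]), ref_span(b[0], b[1])
        if sa[0] < sb[1] and sb[0] < sa[1]:
            out.append((a, b))
    return out

# The query records from the example: the deletion's span covers the SNP.
query = [(34133117, "TTCCA", "T"), (34133118, "T", "A")]
```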
How do I provide a sample name on the hap.py command line so it can run vcfeval on a multi-sample VCF?
Here is the error I get currently:
2017-05-11 21:35:33,701 ERROR Error running rtg tools / vcfeval. Return code was 1, output: / Error: No sample name provided but calls is a multi-sample VCF.
2017-05-11 21:35:33,702 ERROR Traceback (most recent call last):
2017-05-11 21:35:33,702 ERROR File "/opt/hap.py/bin/hap.py", line 417, in
2017-05-11 21:35:33,702 ERROR main()
2017-05-11 21:35:33,702 ERROR File "/opt/hap.py/bin/hap.py", line 394, in main
2017-05-11 21:35:33,702 ERROR tempfiles += Haplo.vcfeval.runVCFEval(args.vcf1, args.vcf2, output_name, args)
2017-05-11 21:35:33,702 ERROR File "/opt/hap.py/lib/python27/Haplo/vcfeval.py", line 136, in runVCFEval
2017-05-11 21:35:33,703 ERROR raise Exception("Error running rtg tools / vcfeval. Return code was %i, output: %s / %s \n" % (rc, o, e))
2017-05-11 21:35:33,703 ERROR Exception: Error running rtg tools / vcfeval. Return code was 1, output: / Error: No sample name provided but calls is a multi-sample VCF.
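As a workaround, the calls VCF can be cut down to a single sample column before running hap.py. A stdlib-only sketch (the helper is hypothetical, not part of hap.py); `bcftools view -s SAMPLE calls.vcf.gz` achieves the same thing:

```python
def subset_sample(vcf_lines, sample):
    """Reduce a multi-sample VCF (given as text lines) to one sample column.

    Keeps all meta-information lines, and keeps columns 1-9
    (CHROM..FORMAT) plus only the named sample's column.
    """
    keep = None
    out = []
    for line in vcf_lines:
        if line.startswith("##"):
            out.append(line)
        elif line.startswith("#CHROM"):
            cols = line.rstrip("\n").split("\t")
            keep = cols.index(sample)  # raises ValueError if sample absent
            out.append("\t".join(cols[:9] + [sample]))
        else:
            cols = line.rstrip("\n").split("\t")
            out.append("\t".join(cols[:9] + [cols[keep]]))
    return out
```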
Are there any plans to support symbolic SVs, or SVs in breakend notation? Properly comparing such variants requires haplotype reconstruction, as equivalent representations can be quite different. For example, the CHM1 and CHM13 truth sets from Huddleston 2016 report duplication events as INS, whereas most short-read SV callers will report the same event as a DUP. In the case of tandem duplication expansions, the nominal variant positions can be non-overlapping (even hundreds of base pairs apart) but still result in the same haplotype, and thus are just different representations of the same variant.
I'm trying to install hap.py on my laptop, and things went OK until the unit tests ran.
I ran this command:
python install.py ~/hap.py-install
This seems to be the relevant part of the message:
chrS:12-12 12-12:A . 0/1 0/1
chrS:13-13 13-13:C . 0/1 0/1
chrS:14-14 14-14:A 14-14:C 0/1 0/2 2/1
Unexpected variant calls: chrS:14-14 14-14:A 14-14:C 0/1 0/2 2/1
Assertion failed: (false), function test_method, file /Users/dchurch/Documents/source/hap.py/src/c++/test/test_allelesplit.cpp, line 197.
unknown location:0: fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/Users/dchurch/Documents/source/hap.py/src/c++/test/test_align.cpp:298: last checkpoint
and
*** 1 failure detected in test suite "Master Test Suite"
Boost Unit tests FAILED!
Traceback (most recent call last):
File "install.py", line 315, in <module>
main()
File "install.py", line 311, in main
test_haplotypes(source_dir, python_shebang, args)
File "install.py", line 191, in test_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /Users/dchurch/hap.py-install && /Users/dchurch/Documents/source/hap.py/src/sh/run_tests.sh' returned non-zero exit status 1
Any help would be appreciated.
Thanks for providing this in an image on Docker Hub. Would it be possible to create another build with the --with-rtgtools option and tag it with :rtgtools or some such?
Thanks for a great tool - can I also use this for benchmarking haploid genomes?
Thank you,
Hi, this is the first time I've used hap.py, and I'm using it in an SGE environment. Surprisingly, even with small variant files (about 40k variant positions from WES) as input, it uses a very large amount of memory (up to nearly 50 GB). I'm not sure if I've used the right arguments.
python hap.py ref.snp.vcf.gz sample.snp.vcf.gz -o output -f region.bed -r hg19.fasta --force-interactive
And the SGE arguments are:
-l num_proc=4,vf=10G
Any solution to this issue? Thanks.
Hi hap.py authors:
I think I may have run into a bug in hap.py, but I'm not 100% certain. Here's the example:
query:
20 57777379 . C CT 0 . . GT 0/1
20 57777394 . TA T 0 . . GT 0/1
20 57777395 . A T 0 . . GT 0/1
truth:
20 57777395 rs3899824 A T 50 PASS . GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|1:639:13,200:12,179:293:1/1:.:HOMVAR
ref:
CCTTTTTTTTTTTTTTTAAAAAAAAAAACACATCTAAT starting at 57777378
hap.py output
20 57777379 . C CT 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:i1_5:INDEL:het:0
20 57777394 . TA T 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:d1_5:INDEL:het:0
20 57777395 . A T 50 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 1/1:FN:am:tv:SNP:homalt:. 0/1:FP:am:tv:SNP:het:0
I know this representation isn't ideal, but what is happening here is:
-- the query has 0/1 for A/T @ 57777395, which is 50% of the right answer for the truth hom-alt.
-- the query has a het C=>CT followed by a het TA=>T, which in effect replaces one of the As in the run of Ts with one more T. This has the same effect on the haplotype as the A=>T SNP, and since both are het this should produce two matching haplotypes.
My view is that hap.py should emit no FPs and no FNs here, as these two representations are in fact identical. Seems like a bug to me.
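The claimed equivalence can be checked mechanically by applying each call set to the reference fragment quoted above (a quick sketch; `apply_variants` is a hypothetical helper taking 1-based VCF positions):

```python
REF = "CCTTTTTTTTTTTTTTTAAAAAAAAAAACACATCTAAT"  # starts at 57777378
START = 57777378

def apply_variants(ref, start, variants):
    """Apply (pos, ref_allele, alt_allele) edits to `ref`, right to left,
    so that earlier coordinates remain valid."""
    seq = ref
    for pos, r, a in sorted(variants, reverse=True):
        i = pos - start
        assert seq[i:i + len(r)] == r, "REF mismatch at %d" % pos
        seq = seq[:i] + a + seq[i + len(r):]
    return seq

# Query haplotype carrying the het insertion and het deletion
hap_query = apply_variants(REF, START, [(57777379, "C", "CT"),
                                        (57777394, "TA", "T")])
# Truth haplotype carrying the hom-alt SNP
hap_truth = apply_variants(REF, START, [(57777395, "A", "T")])
assert hap_query == hap_truth  # identical haplotype sequences
```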
Here's how I run hap.py:
docker run -it -v "${ROOT}:${ROOT}" -v "${OUTPUT_DIR}:${OUTPUT_DIR}" pkrusche/hap.py /opt/hap.py/bin/hap.py "${TRUTH_VCF}" "${NEW_LABELER_VCF}" --preprocess-truth --no-adjust-conf-regions -f "${TRUTH_BED}" -r "${REF%.gz}" -o "${OUTPUT_DIR}/new_labeler.happy.output" -l ${REGION}
hap.py relies on importing some libraries from a relative path. Unfortunately this path is calculated relative to the invoked path of hap.py, and not the actual location of the file, causing errors when using a symlink to hap.py. This can be fixed with this trivial change:
Before (hap.py:37
):
scriptDir = os.path.abspath(os.path.dirname(__file__))
After:
scriptDir = os.path.abspath(os.path.dirname(os.path.realpath(__file__)))
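A quick demonstration of the difference (a sketch using a temporary directory and a symlink; assumes the platform supports symlinks):

```python
import os
import tempfile

d = tempfile.mkdtemp()
real = os.path.join(d, "real")
os.mkdir(real)
target = os.path.join(real, "hap.py")
open(target, "w").close()
link = os.path.join(d, "hap.py.link")
os.symlink(target, link)

# Without realpath: the symlink's own directory (wrong place for imports)
without = os.path.abspath(os.path.dirname(link))
# With realpath: the directory of the actual file
with_rp = os.path.abspath(os.path.dirname(os.path.realpath(link)))
print(without)
print(with_rp)
```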
Reference // This is from the reference file cowpox.fa
gi_30844336_ref_NC_003663.2_:27650..27749
CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGTTGACATCTGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA
Truth // ACATC is replaced by TACGAATTG
gi_30844336_ref_NC_003663.2_ 27700 . ACATC TACGAATTG 200 . . GT 1/1
CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGTTG ACATC TGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // reference (cowpox.fa)
CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGTTG TACGAATTG TGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // truth (mutatedcowpox.vcf)
BWA // variant calling by BWA
gi_30844336_ref_NC_003663.2_ 27698 . TG TGTACG 306.478 . AB=0.809524;ABP=20.4855;AC=1;AF=0.5;AN=2;AO=17;CIGAR=1M4I1M;DP=21;DPB=55;DPRA=0;EPP=3.13803;EPPR=5.18177;GTI=0;LEN=4;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=2.76283;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=479;QR=123;RO=4;RPL=5;RPP=9.26925;RPPR=11.6962;RPR=12;RUN=1;SAF=13;SAP=13.3567;SAR=4;SRF=1;SRP=5.18177;SRR=3;TYPE=ins GT:DP:RO:QR:AO:QA:GL 0/1:21:4:123:17:479:-37.0656,0,-5.05395
gi_30844336_ref_NC_003663.2_ 27701 . CATC ATTG 421.462 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=17;CIGAR=2X1M1X;DP=17;DPB=17;DPRA=0;EPP=4.1599;EPPR=0;GTI=0;LEN=4;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=26.5627;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=505;QR=0;RO=0;RPL=6;RPP=6.20364;RPPR=0;RPR=11;RUN=1;SAF=13;SAP=13.3567;SAR=4;SRF=0;SRP=0;SRR=0;TYPE=complex GT:DP:RO:QR:AO:QA:GL 1/1:17:0:0:17:505:-45.7407,-5.11751,0
CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGT TG A CATC TGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // reference
CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGT TGTACG A ATTG TGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // BWA
Truth CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGTTG TACGAATTG TGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // reference
BWA CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGT TGTACG A ATTG TGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // BWA
Truth CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGTTGTACGAATTGTGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // reference
BWA CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGTTGTACGAATTGTGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA // BWA
HAP.PY
gi_30844336_ref_NC_003663.2_ 27698 . T TGTAC 306.478 . BS=27698 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:i1_5:INDEL:het:306.478
gi_30844336_ref_NC_003663.2_ 27700 . ACATC TACGAATTG 200 . BS=27698 GT:BD:BK:BI:BVT:BLT:QQ 1/1:FN:lm:c1_5,tv:INDEL:homalt:. ./.:.:.:.:NOCALL:nocall:0
gi_30844336_ref_NC_003663.2_ 27701 . C A 421.462 . BS=27698 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 1/1:FP:lm:tv:SNP:homalt:421.462
gi_30844336_ref_NC_003663.2_ 27702 . A T 421.462 . BS=27698 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 1/1:FP:lm:tv:SNP:homalt:421.462
gi_30844336_ref_NC_003663.2_ 27704 . C G 421.462 . BS=27698 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 1/1:FP:lm:tv:SNP:homalt:421.462
// The variants called are equivalent, but hap.py counted them as a false negative. I hope this issue can be fixed.
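The claimed equivalence can be verified by splicing both call sets into the reference fragment (a quick sketch; VCF positions converted to 0-based indices relative to the fragment start at 27650):

```python
# Reference fragment gi_30844336_ref_NC_003663.2_:27650..27749 (cowpox.fa)
REF = ("CGTTGATTGCAATGGCTATACATGTATATCCGTTATTTGATCTAATGTTG"
       "ACATCTGAACCGGATTCTAGCAGTAAAGACACTAGAGATTGTTTATTATA")

# Truth: 27700 ACATC -> TACGAATTG  (index 27700 - 27650 = 50)
truth = REF[:50] + "TACGAATTG" + REF[55:]

# Caller: 27698 TG -> TGTACG (index 48) and 27701 CATC -> ATTG (index 51),
# spliced right to left so earlier indices stay valid
called = REF[:48] + "TGTACG" + REF[50:51] + "ATTG" + REF[55:]

# Sanity-check the REF alleles, then compare the resulting haplotypes
assert REF[50:55] == "ACATC" and REF[48:50] == "TG" and REF[51:55] == "CATC"
assert truth == called  # same haplotype, two representations
```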
Running hap.py produces several warnings and errors. The positions always involve variants with at least one '*' in the ALT column. It seems hap.py does not handle asterisks in the ALT column. Is hap.py compatible with VCFv4.2 or later?
Error and warnings below
Best wishes
Martin
2018-02-27 12:15:39,660 WARNING [W] too many AD fields at chrX:111680917 max_ad = 2 retrieved: 3
2018-02-27 12:15:39,661 WARNING [W] too many AD fields at chrX:117985447 max_ad = 2 retrieved: 3
2018-02-27 12:15:41,130 WARNING [W] too many AD fields at chrX:119640487 max_ad = 2 retrieved: 3
2018-02-27 12:15:41,131 WARNING [W] too many AD fields at chrX:119640489 max_ad = 2 retrieved: 3
2018-02-27 12:15:41,302 ERROR One of the preprocess jobs failed
2018-02-27 12:15:41,302 ERROR Traceback (most recent call last):
2018-02-27 12:15:41,302 ERROR File "/usr/local/bin/hap.py", line 511, in
2018-02-27 12:15:41,303 ERROR main()
2018-02-27 12:15:41,303 ERROR File "/usr/local/bin/hap.py", line 366, in main
2018-02-27 12:15:41,303 ERROR "QUERY")
2018-02-27 12:15:41,303 ERROR File "/usr/local/bin/pre.py", line 203, in preprocess
2018-02-27 12:15:41,303 ERROR haploid_x=gender == "male")
2018-02-27 12:15:41,303 ERROR File "/usr/local/lib/python27/Haplo/partialcredit.py", line 213, in partialCredit
2018-02-27 12:15:41,304 ERROR raise Exception("One of the preprocess jobs failed")
2018-02-27 12:15:41,304 ERROR Exception: One of the preprocess jobs failed
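A small stdlib-only sketch (hypothetical helper, not hap.py code) that lists the records carrying the VCFv4.2 spanning-deletion allele, so they can be inspected or filtered out before running hap.py:

```python
def spanning_deletion_records(vcf_lines):
    """Yield (CHROM, POS) for records whose ALT includes the '*' allele.

    '*' is the VCFv4.2 spanning-deletion allele, which is what the
    warnings above point at; listing the affected records makes it easy
    to inspect or pre-filter them.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if "*" in cols[4].split(","):
            yield cols[0], int(cols[1])
```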
Hi,
I'm getting the following error messages:
2017-06-01 14:09:41,816 ERROR Input file NA12878.diploidSV.vcf.gz} does not exist.
2017-06-01 14:09:41,817 ERROR Traceback (most recent call last):
2017-06-01 14:09:41,817 ERROR File "/opt/hap.py/bin/hap.py", line 460, in
2017-06-01 14:09:41,817 ERROR main()
2017-06-01 14:09:41,817 ERROR File "/opt/hap.py/bin/hap.py", line 224, in main
2017-06-01 14:09:41,817 ERROR raise Exception("Input file %s does not exist." % args.vcf2)
2017-06-01 14:09:41,817 ERROR Exception: Input file NA12878.diploidSV.vcf.gz} does not exist.
The commands I'm running are:
#!/bin/bash -x
HAPPY=/opt/hap.py/bin/hap.py
export HGREF=/home/neuro/Public/RefSeqIndexAllPrograms/GATK/hg19_1stM_unmask_ran_all.fa
WORKDIR=/path/to/NA12878
TRUTH=${WORKDIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz
QUERY=${WORKDIR}/NA12878.diploidSV.vcf.gz
${HAPPY} ${TRUTH} ${QUERY} -o test -r ${HGREF}
The header for both files:
zcat NA12878.diploidSV.vcf.gz | head
##fileformat=VCFv4.1
##fileDate=20170412
##source=GenerateSVCandidates 0.29.6
##reference=file:///home/neuro/Public/RefSeqIndexAllPrograms/GATK/hg19_1stM_unmask_ran_all.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
zcat HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz | head
##fileformat=VCFv4.2
##fileDate=20160824
##CL=vcffilter -i filtered-phase-transfer.vcf.gz -o - --javascript "ensureFormatHeader("##FORMAT=<ID=PS,Number=1,Type=String,Description=\"Phase set for GT\">"); function record() {if(INTEGRATION.GT=="1/1") { INTEGRATION.IPS="."; INTEGRATION.PS="HOMVAR"; INTEGRATION.GT="1|1";} else {if((INTEGRATION.GT=="0/1" || INTEGRATION.GT=="1/2" || INTEGRATION.GT=="2/1" || INTEGRATION.GT=="1/0") ) {if(INTEGRATION.IPS.length>1) {INTEGRATION.PS=INTEGRATION.IPS; INTEGRATION.GT=INTEGRATION.IGT;} else {INTEGRATION.PS=".";};} else { if((INTEGRATION.IPS.length<2)) { INTEGRATION.IPS=".";} INTEGRATION.PS="PATMAT";};};}"
##RUN-ID=99570df1-3bbc-46df-b8e8-0839daa980e3
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
The files are definitely in the working directory, so I'm not quite sure why it can't find the query file. Any help would be much appreciated.
Cheers,
Thuong
Hi, I simulated reads with variants on A. thaliana. Then I ran a variant caller, and when I went to benchmark the calls I got the error below. The golden VCF produced by the simulator does not have samples, because the simulator created the variants itself, but hap.py does not like that:
2018-06-11 11:40:13,580 ERROR CalledProcessError: Command 'vcfcheck /gatk/required_files/A.tha_variant_01_golden.vcf --check-bcf-errors 1' returned non-zero exit status 1
Is there any way I can modify the .vcf or pass an option to hap.py to make it work? Thank you very much.
This is a part of the golden.vcf:
##fileformat=VCFv4.1
##reference=/A.tha.nuclear.fa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=VMX,Number=1,Type=String,Description="SNP is Missense in these Read Frames">
##INFO=<ID=VNX,Number=1,Type=String,Description="SNP is Nonsense in these Read Frames">
##INFO=<ID=VFX,Number=1,Type=String,Description="Indel Causes Frameshift">
##INFO=<ID=WP,Number=A,Type=Integer,Description="Ploidy indicator">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CNV,Description="Copy number variable region">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INV-TRANS,Description="Inverted translocation">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 513 . C T . PASS WP=1/0
1 542 . ATG A . PASS WP=0/0
1 573 . C T . PASS WP=1/0
1 643 . G A . PASS WP=0/0
1 659 . A G . PASS WP=0/1
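One option is to add a sample column to the golden VCF before passing it to hap.py. The following is a sketch of that (the helper is hypothetical, and it assumes, which may not hold for every simulator, that the WP INFO field encodes the genotype, e.g. WP=1/0; records without WP get a missing GT):

```python
import re

def add_sample_column(vcf_lines, sample="SIMULATED"):
    """Add a FORMAT column and one sample column to a sample-less VCF,
    deriving GT from the simulator's WP INFO field (an assumption)."""
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            out.append(line)
        elif line.startswith("#CHROM"):
            # Declare GT before extending the header line
            out.append('##FORMAT=<ID=GT,Number=1,Type=String,'
                       'Description="Genotype">')
            out.append(line + "\tFORMAT\t" + sample)
        else:
            info = line.split("\t")[7]
            m = re.search(r"WP=([\d|/]+)", info)
            gt = m.group(1) if m else "./."
            out.append(line + "\tGT\t" + gt)
    return out
```

Whether WP=0/0 records should instead be dropped entirely is worth checking against the simulator's documentation.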
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/quantify] Error 1
make[1]: *** [src/c++/main/CMakeFiles/quantify.dir/all] Error 2
make: *** [all] Error 2
Traceback (most recent call last):
File "install.py", line 332, in
main()
File "install.py", line 317, in main
build_haplotypes(source_dir, build_dir, args)
File "install.py", line 186, in build_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /tmp/buildPXIbnx && make -j64' returned non-zero exit status 2
Kindly help in resolving this...
I experienced this error:
[W] Found a variant with more 4 > 2 (max) alt alleles. These become no-calls.
log:
2018-02-14 09:30:18,372 ERROR Command 'quantify /tmp/hap.py.result.jryogM.vcf.gz -o /mnt/yard2/ian/structural_variants/CHM1_CHM13_analysis/HG2.happy.roc.tsv -r /mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa --threads 8 --output-vtc 0 --output-rocs 1 --type xcmp --qq IQQ --qq-header QUAL --roc-delta 0.500000 --clean-info 1 --fix-chr-regions 0 -v /mnt/yard2/ian/structural_variants/CHM1_CHM13_analysis/HG2.happy.vcf.gz --roc-regions '*'' returned non-zero exit status -11
2018-02-14 09:30:18,372 ERROR Traceback (most recent call last):
2018-02-14 09:30:18,372 ERROR File "build/bin/hap.py", line 511, in <module>
2018-02-14 09:30:18,373 ERROR main()
2018-02-14 09:30:18,373 ERROR File "build/bin/hap.py", line 496, in main
2018-02-14 09:30:18,373 ERROR qfy.quantify(args)
2018-02-14 09:30:18,373 ERROR File "/mnt/home/ian/hap.py/build/bin/qfy.py", line 129, in quantify
2018-02-14 09:30:18,374 ERROR strat_fixchr=args.strat_fixchr)
2018-02-14 09:30:18,374 ERROR File "/mnt/home/ian/hap.py/build/lib/python27/Haplo/quantify.py", line 178, in run_quantify
2018-02-14 09:30:18,375 ERROR subprocess.check_call(run_str, shell=True, stdout=tfo, stderr=tfe)
2018-02-14 09:30:18,375 ERROR File "/mnt/home/ian/anaconda2/lib/python2.7/subprocess.py", line 186, in check_call
2018-02-14 09:30:18,377 ERROR raise CalledProcessError(retcode, cmd)
2018-02-14 09:30:18,377 ERROR CalledProcessError: Command 'quantify /tmp/hap.py.result.jryogM.vcf.gz -o /mnt/yard2/ian/structural_variants/CHM1_CHM13_analysis/HG2.happy.roc.tsv -r /mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa --threads 8 --output-vtc 0 --output-rocs 1 --type xcmp --qq IQQ --qq-header QUAL --roc-delta 0.500000 --clean-info 1 --fix-chr-regions 0 -v /mnt/yard2/ian/structural_variants/CHM1_CHM13_analysis/HG2.happy.vcf.gz --roc-regions '*'' returned non-zero exit status -11
stdout:
bespin1 Wed Feb 14 09:27 hap.py $build/bin/hap.py \
> /mnt/yard2/ian/structural_variants/CHM1_CHM13_analysis/svanalyzer_union_171212_v0.5.0_annotated.with_chr.vcf \
> /mnt/yard2/ian/structural_variants/longranger2.0-vcfs/HG2_combined.sorted.vcf.gz \
> -r /mnt/opt/refdata_new/hg19-2.0.0/fasta/genome.fa \
> -o /mnt/yard2/ian/structural_variants/CHM1_CHM13_analysis/HG2.happy \
> --roc QUAL -V \
> --logfile /mnt/yard2/ian/structural_variants/CHM1_CHM13_analysis/happy.log \
> --threads 8
[W] overlapping records at chr1:1285401 for sample 1
[W] variant at chr1:79708114 has more than one base of reference padding
[W] Variants that have >1 base of reference padding: 142
[W] Variants that overlap on the reference allele: 2765
[I] Total VCF records: 74216
[I] Non-reference VCF records: 66671
[W] Symbolic / SV ALT alleles at chr1:649703
[W] overlapping records at chr1:829171 for sample 0
[W] variant at chr1:20164759 has more than one base of reference padding
[W] Variants that have >1 base of reference padding: 253
[W] Variants that overlap on the reference allele: 291
[W] Variants that have symbolic ALT alleles: 6177
[I] Total VCF records: 32009
[I] Non-reference VCF records: 31880
Any idea what is going on? Since /tmp/hap.py.result.jryogM.vcf.gz is a temporary file, I can't check on it after the run. Is there a way to retain intermediates?
I'm running hap.py like this:
hap.py patient_3_strelka.vcf.gz patient_3_varscan.vcf.gz --reference /b37/human_g1k_v37.fasta -o ${NAME}_str_var
My input vcfs are sorted, bgzipped, and tabix indexed.
I'm running it on a 64 core server with 500GB RAM. It's running CentOS Linux release 7.3.1611.
Gcc is version 4.9.2. I upgraded from 4.8 this morning; after upgrading gcc I uninstalled and re-installed hap.py using Bioconda.
It runs for a while and then produces the following error messages:
2017-11-17 14:18:42,736 ERROR Command 'quantify /tmp/4487.1.all.q/hap.py.result.WAYuiy.vcf.gz -o _str_var.roc.tsv -r /b37/human_g1k_v37.fasta --threads 64 --output-vtc 0 --output-rocs 1 --type xcmp --qq IQQ --qq-header QUAL --roc-delta 0.500000 --clean-info 1 --fix-chr-regions 0 -v _str_var.vcf.gz --roc-regions '*'' returned non-zero exit status 1
2017-11-17 14:18:42,736 ERROR Traceback (most recent call last):
2017-11-17 14:18:42,736 ERROR File "/home/wid/miniconda3/envs/patrick/bin/hap.py", line 460, in
2017-11-17 14:18:42,737 ERROR main()
2017-11-17 14:18:42,737 ERROR File "/home/wid/miniconda3/envs/patrick/bin/hap.py", line 445, in main
2017-11-17 14:18:42,737 ERROR qfy.quantify(args)
2017-11-17 14:18:42,737 ERROR File "/home/wid/miniconda3/envs/patrick/bin/qfy.py", line 129, in quantify
2017-11-17 14:18:42,737 ERROR strat_fixchr=args.strat_fixchr)
2017-11-17 14:18:42,737 ERROR File "/home/wid/miniconda3/envs/patrick/lib/python27/Haplo/quantify.py", line 177, in run_quantify
2017-11-17 14:18:42,738 ERROR subprocess.check_call(run_str, shell=True, stdout=tfo, stderr=tfe)
2017-11-17 14:18:42,738 ERROR File "/home/wid/miniconda3/envs/patrick/lib/python2.7/subprocess.py", line 186, in check_call
2017-11-17 14:18:42,738 ERROR raise CalledProcessError(retcode, cmd)
2017-11-17 14:18:42,739 ERROR CalledProcessError: Command 'quantify /tmp/4487.1.all.q/hap.py.result.WAYuiy.vcf.gz -o _str_var.roc.tsv -r /data/Phil/ref_phil/GATK_resource/b37/human_g1k_v37.fasta --threads 64 --output-vtc 0 --output-rocs 1 --type xcmp --qq IQQ --qq-header QUAL --roc-delta 0.500000 --clean-info 1 --fix-chr-regions 0 -v _str_var.vcf.gz --roc-regions '*'' returned non-zero exit status 1
This seems to have something to do with the ROCs. I added the --no.roc flag to the command, but it failed the same way.
I have hap.py v0.3.10 installed on Red Hat Enterprise Linux Server (release 7.2.1511), but it takes ~10 hours to complete even the example dataset suggested here. With gVCFs, it runs into several errors (the major ones being "Exception when running <function preprocessWrapper at 0x23a6cf8>" and "IOError: [Errno 23] Too many open files in system").
Looking at the log output from running the example dataset, preprocess consumes ~0.6 hours and xcmp consumes >8 hours. The log file is attached here, and the 'time taken' lines are shown below. The hap.py documentation says it should take <15 minutes to finish the job for gVCFs, but even the example dataset takes far longer to complete. Any idea where the problem arises? I have also tried hap.py v0.3.9, and it had the same problem.
2017-08-23 11:49:01,434 INFO blocksplit for chr21 -- time taken 1.23
2017-08-23 12:24:37,917 INFO preprocess for chr21:36330398-38937043 -- time taken 2136.40
2017-08-23 12:24:43,907 INFO preprocess for chr21:26860115-28424230 -- time taken 2142.38
2017-08-23 12:24:46,338 INFO preprocess for chr21:43622442-44642556 -- time taken 2144.65
2017-08-23 12:24:48,196 INFO preprocess for chr21:38937044-41083006 -- time taken 2146.59
2017-08-23 12:24:50,242 INFO preprocess for chr21:23383456-25351788 -- time taken 2148.72
2017-08-23 12:24:52,227 INFO preprocess for chr21:16812555-18980058 -- time taken 2150.71
2017-08-23 12:24:52,309 INFO preprocess for chr21:46758486-2147483647 -- time taken 2150.63
2017-08-23 12:24:54,070 INFO preprocess for chr21:41083007-42084376 -- time taken 2152.39
2017-08-23 12:24:54,710 INFO preprocess for chr21:21943268-23383455 -- time taken 2153.19
2017-08-23 12:24:54,789 INFO preprocess for chr21:32571313-34319568 -- time taken 2153.11
2017-08-23 12:24:54,874 INFO preprocess for chr21:44642557-46758485 -- time taken 2153.19
2017-08-23 12:24:54,933 INFO preprocess for chr21:28424231-30006304 -- time taken 2153.41
2017-08-23 12:24:55,033 INFO preprocess for chr21:34319569-36330397 -- time taken 2153.36
2017-08-23 12:24:55,060 INFO preprocess for chr21:42084377-43622441 -- time taken 2153.38
2017-08-23 12:24:55,087 INFO preprocess for chr21:25351789-26860114 -- time taken 2153.57
2017-08-23 12:24:55,103 INFO preprocess for chr21:30006305-32571312 -- time taken 2153.59
2017-08-23 12:24:55,384 INFO preprocess for chr21:1-9605451 -- time taken 2153.86
2017-08-23 12:24:55,666 INFO preprocess for chr21:18980059-21943267 -- time taken 2154.14
2017-08-23 12:24:55,770 INFO preprocess for chr21:11198117-16812554 -- time taken 2154.25
2017-08-23 12:24:56,185 INFO preprocess for chr21:10044887-10657837 -- time taken 2154.66
2017-08-23 12:24:56,865 INFO preprocess for chr21:9605452-10044886 -- time taken 2155.34
2017-08-23 12:24:58,042 INFO preprocess for chr21:10657838-11198116 -- time taken 2156.52
2017-08-23 12:25:06,894 INFO preprocess for /tmp/query.pppjVPot.vcf.gz -- time taken 2170.67
2017-08-23 12:25:08,313 INFO blocksplit for chr21 -- time taken 1.37
2017-08-23 20:56:50,889 INFO xcmp for chunk chr21:43203367-43990133 -- time taken 30702.43
2017-08-23 20:57:52,969 INFO xcmp for chunk chr21:9955614-10086540 -- time taken 30764.49
2017-08-23 20:58:39,561 INFO xcmp for chunk chr21:47554728-2147483647 -- time taken 30811.09
2017-08-23 20:58:40,135 INFO xcmp for chunk chr21:18700617-19485076 -- time taken 30811.65
2017-08-23 20:58:47,228 INFO xcmp for chunk chr21:11141681-14362815 -- time taken 30818.74
2017-08-23 20:58:56,355 INFO xcmp for chunk chr21:19485077-20200412 -- time taken 30827.87
2017-08-23 20:59:02,967 INFO xcmp for chunk chr21:15659651-16522182 -- time taken 30834.48
2017-08-23 20:59:20,403 INFO xcmp for chunk chr21:41601933-42337467 -- time taken 30851.92
2017-08-23 20:59:22,483 INFO xcmp for chunk chr21:24204974-25028671 -- time taken 30853.98
2017-08-23 20:59:26,725 INFO xcmp for chunk chr21:35268047-36463156 -- time taken 30858.27
2017-08-23 20:59:38,237 INFO xcmp for chunk chr21:15268599-15659650 -- time taken 30869.75
2017-08-23 20:59:39,899 INFO xcmp for chunk chr21:17778067-18700616 -- time taken 30871.41
2017-08-23 20:59:44,890 INFO xcmp for chunk chr21:46922203-47554727 -- time taken 30876.42
2017-08-23 20:59:47,690 INFO xcmp for chunk chr21:34242631-35268046 -- time taken 30879.23
2017-08-23 20:59:49,259 INFO xcmp for chunk chr21:28700929-29512110 -- time taken 30880.77
2017-08-23 20:59:55,445 INFO xcmp for chunk chr21:9849286-9903489 -- time taken 30886.95
2017-08-23 20:59:58,195 INFO xcmp for chunk chr21:10191719-10637115 -- time taken 30889.72
2017-08-23 21:00:01,130 INFO xcmp for chunk chr21:9903490-9955613 -- time taken 30892.64
2017-08-23 21:00:05,232 INFO xcmp for chunk chr21:16522183-17778066 -- time taken 30896.74
2017-08-23 21:00:05,295 INFO xcmp for chunk chr21:11101337-11141680 -- time taken 30896.81
2017-08-23 21:00:06,175 INFO xcmp for chunk chr21:10795606-10854430 -- time taken 30897.68
2017-08-23 21:00:07,350 INFO xcmp for chunk chr21:36463157-37557574 -- time taken 30898.91
2017-08-23 21:00:09,297 INFO xcmp for chunk chr21:26685873-27755314 -- time taken 30900.80
2017-08-23 21:00:09,297 INFO xcmp for chunk chr21:45567792-46267390 -- time taken 30900.82
2017-08-23 21:00:09,671 INFO xcmp for chunk chr21:20200413-20929776 -- time taken 30901.20
2017-08-23 21:00:11,343 INFO xcmp for chunk chr21:9477179-9568735 -- time taken 30902.86
2017-08-23 21:00:11,477 INFO xcmp for chunk chr21:1-9477178 -- time taken 30902.99
2017-08-23 21:00:11,744 INFO xcmp for chunk chr21:25795687-26685872 -- time taken 30903.26
2017-08-23 21:00:12,442 INFO xcmp for chunk chr21:10762268-10795605 -- time taken 30903.97
2017-08-23 21:00:12,568 INFO xcmp for chunk chr21:21758584-22405450 -- time taken 30904.08
2017-08-23 21:00:13,251 INFO xcmp for chunk chr21:38362658-39058109 -- time taken 30904.79
2017-08-23 21:00:14,191 INFO xcmp for chunk chr21:43990134-44509792 -- time taken 30905.71
2017-08-23 21:00:14,291 INFO xcmp for chunk chr21:42337468-43203366 -- time taken 30905.85
2017-08-23 21:00:15,321 INFO xcmp for chunk chr21:30489401-32023024 -- time taken 30906.87
2017-08-23 21:00:15,618 INFO xcmp for chunk chr21:20929777-21758583 -- time taken 30907.13
2017-08-23 21:00:15,773 INFO xcmp for chunk chr21:10854431-11053798 -- time taken 30907.30
2017-08-23 21:00:15,869 INFO xcmp for chunk chr21:39058110-39964967 -- time taken 30907.39
2017-08-23 21:00:16,056 INFO xcmp for chunk chr21:40929749-41601932 -- time taken 30907.61
2017-08-23 21:00:16,127 INFO xcmp for chunk chr21:9568736-9701279 -- time taken 30907.65
2017-08-23 21:00:16,360 INFO xcmp for chunk chr21:29512111-30489400 -- time taken 30907.91
2017-08-23 21:00:16,899 INFO xcmp for chunk chr21:22405451-23197628 -- time taken 30908.41
2017-08-23 21:00:17,223 INFO xcmp for chunk chr21:27755315-28700928 -- time taken 30908.73
2017-08-23 21:00:17,746 INFO xcmp for chunk chr21:10086541-10191718 -- time taken 30909.26
2017-08-23 21:00:17,747 INFO xcmp for chunk chr21:10637116-10727483 -- time taken 30909.27
2017-08-23 21:00:17,898 INFO xcmp for chunk chr21:39964968-40929748 -- time taken 30909.45
2017-08-23 21:00:17,983 INFO xcmp for chunk chr21:32023025-33039875 -- time taken 30909.53
2017-08-23 21:00:18,052 INFO xcmp for chunk chr21:25028672-25795686 -- time taken 30909.57
2017-08-23 21:00:18,153 INFO xcmp for chunk chr21:9701280-9849285 -- time taken 30909.68
2017-08-23 21:00:18,260 INFO xcmp for chunk chr21:37557575-38362657 -- time taken 30909.78
2017-08-23 21:00:18,273 INFO xcmp for chunk chr21:46267391-46922202 -- time taken 30909.82
2017-08-23 21:00:18,330 INFO xcmp for chunk chr21:14362816-14798906 -- time taken 30909.84
2017-08-23 21:00:18,339 INFO xcmp for chunk chr21:14798907-15268598 -- time taken 30909.85
2017-08-23 21:00:18,406 INFO xcmp for chunk chr21:10727484-10762267 -- time taken 30909.95
2017-08-23 21:00:18,448 INFO xcmp for chunk chr21:33039876-34242630 -- time taken 30909.99
2017-08-23 21:00:18,458 INFO xcmp for chunk chr21:44509793-45567791 -- time taken 30910.01
2017-08-23 21:00:19,136 INFO xcmp for chunk chr21:23197629-24204973 -- time taken 30910.65
2017-08-23 21:00:19,276 INFO xcmp for chunk chr21:11053799-11101336 -- time taken 30910.79
Log file:
NA12878_example_happy.txt
hap.py fails for VCFs due to the presence of star ALT alleles, and I am still trying to figure out how best to deal with this. In the process, I found a bug in error reporting, where variant positions appear to use zero-indexing instead of one-indexing. For example, in the error log snippet below, 1:15273 should actually be 1:15274. I'm using hap.py v0.3.10.
2018-05-15 14:58:27,833 ERROR Preprocess command preprocess /tmp/tmpkHm0Sh.vcf.gz:* -o /tmp/input.akWCP_.prep.vcf.gz -V 1 -L 0 -r ref_genome/grch37/human_g1k_v37.fasta failed. Outputs are here /tmp/stdout3Yq7WQ.log / /tmp/stderrimbr59.log
2018-05-15 14:58:27,834 ERROR [W] too many AD fields at 1:15273 max_ad = 2 retrieved: 3
2018-05-15 14:58:27,834 ERROR [W] too many AD fields at 1:20226 max_ad = 2 retrieved: 3
2018-05-15 14:58:27,836 ERROR [W] too many AD fields at 1:54711 max_ad = 2 retrieved: 4
2018-05-15 14:58:27,836 ERROR [W] too many AD fields at 1:54723 max_ad = 2 retrieved: 3
2018-05-15 14:58:27,836 ERROR [W] too many AD fields at 1:82132 max_ad = 2 retrieved: 4
2018-05-15 14:58:27,841 ERROR Exception when running <function preprocessWrapper at 0x2aaad1e11cf8>:
2018-05-15 14:58:27,841 ERROR ------------------------------------------------------------
2018-05-15 14:58:27,841 ERROR Traceback (most recent call last):
2018-05-15 14:58:27,842 ERROR File "/dddd/software/happy-0.3.10/lib/python27/Tools/parallel.py", line 72, in parMapper
2018-05-15 14:58:27,851 ERROR return arg[1]['fun'](arg[0], *arg[1]['args'], **arg[1]['kwargs'])
2018-05-15 14:58:27,851 ERROR File "/dddd/software/happy-0.3.10/lib/python27/Haplo/partialcredit.py", line 67, in preprocessWrapper
2018-05-15 14:58:27,884 ERROR subprocess.check_call(to_run, shell=True, stdout=tfo, stderr=tfe)
2018-05-15 14:58:27,884 ERROR File "/gpfs/gpfs1/software/Python-2.7.11/lib/python2.7/subprocess.py", line 540, in check_call
2018-05-15 14:58:27,897 ERROR raise CalledProcessError(retcode, cmd)
2018-05-15 14:58:27,897 ERROR CalledProcessError: Command 'preprocess /tmp/tmpkHm0Sh.vcf.gz:* -o /tmp/input.akWCP_.prep.vcf.gz -V 1 -L 0 -r ref_genome/grch37/human_g1k_v37.fasta' returned non-zero exit status -6
2018-05-15 14:58:27,898 ERROR ------------------------------------------------------------
2018-05-15 14:58:27,909 ERROR One of the preprocess jobs failed
2018-05-15 14:58:27,909 ERROR Traceback (most recent call last):
2018-05-15 14:58:27,909 ERROR File "/dddd/software/happy-0.3.10/bin/hap.py", line 511, in <module>
2018-05-15 14:58:27,917 ERROR main()
2018-05-15 14:58:27,917 ERROR File "/dddd/software/happy-0.3.10/bin/hap.py", line 366, in main
2018-05-15 14:58:27,918 ERROR "QUERY")
2018-05-15 14:58:27,918 ERROR File "/dddd/software/happy-0.3.10/bin/pre.py", line 203, in preprocess
2018-05-15 14:58:27,925 ERROR haploid_x=gender == "male")
2018-05-15 14:58:27,926 ERROR File "/dddd/software/happy-0.3.10/lib/python27/Haplo/partialcredit.py", line 214, in partialCredit
2018-05-15 14:58:27,926 ERROR raise Exception("One of the preprocess jobs failed")
2018-05-15 14:58:27,926 ERROR Exception: One of the preprocess jobs failed
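Until star alleles are handled natively, one workaround is to drop records whose ALT column contains the spanning-deletion allele `*` before passing the VCF to hap.py. A minimal pure-Python sketch of that filter (the column layout follows the VCF spec; streaming through bcftools would be the usual production route):

```python
def drop_star_allele_records(vcf_lines):
    """Yield VCF lines, skipping records where any ALT allele is '*'.

    Header lines (starting with '#') pass through unchanged.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")  # ALT is the 5th column
        if "*" in alts:
            continue  # spanning-deletion allele: skip the whole record
        yield line

# Example: the second record carries a '*' ALT and is removed.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t15274\t.\tGA\tG\t50\tPASS\t.\n",
    "1\t15274\t.\tG\tA,*\t50\tPASS\t.\n",
]
kept = list(drop_star_allele_records(records))
```

Note that dropping whole records also discards the non-star alleles at those sites, so this is a lossy workaround rather than a fix.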
Trying to install on a CentOS 7.2 machine using
sudo python install.py ~/hap.py-install --with-rtgtools
I got this error:
Linking CXX executable ../../../bin/multimerge
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/multimerge] Error 1
make[1]: *** [src/c++/main/CMakeFiles/multimerge.dir/all] Error 2
Linking CXX executable ../../../bin/alleles
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/alleles] Error 1
make[1]: *** [src/c++/main/CMakeFiles/alleles.dir/all] Error 2
Linking CXX executable ../../../bin/test_haplotypes
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/test_haplotypes] Error 1
make[1]: *** [src/c++/test/CMakeFiles/test_haplotypes.dir/all] Error 2
Linking CXX executable ../../../bin/scmp
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/scmp] Error 1
make[1]: *** [src/c++/main/CMakeFiles/scmp.dir/all] Error 2
Linking CXX executable ../../../bin/quantify
/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
make[2]: *** [bin/quantify] Error 1
make[1]: *** [src/c++/main/CMakeFiles/quantify.dir/all] Error 2
make: *** [all] Error 2
Traceback (most recent call last):
File "install.py", line 298, in <module>
main()
File "install.py", line 283, in main
build_haplotypes(source_dir, build_dir, args)
File "install.py", line 157, in build_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /tmp/buildIMEVcD && make -j32' returned non-zero exit status 2
install.py failed during the Multimerge test.
Source version: current (commit d51d111)
#./install.py --python system /usr/hpc-bio/hap.py
...
...
...
Running Multimerge test (1)
Adding file '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../data/merge1.vcf.gz' / sample 'NA12877'
Adding file '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../data/merge2.vcf.gz' / sample 'NA12878'
Writing '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../data/merge1.vcf.gz:NA12877' as sample 'NA12877'
Writing '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../data/merge2.vcf.gz:NA12878' as sample 'NA12878'
Multimerge test SUCCEEDED.
Running Multimerge data import test
Adding file '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../data/import_errors.vcf.gz' / sample 'NA12877'
Writing '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../data/import_errors.vcf.gz:NA12877' as sample 'NA12877'
[W] Found a variant with more 3 > 2 (max) alt alleles. These become no-calls.
Multimerge test SUCCEEDED.
Running Multimerge test (2)
Output is in /tmp/multimerge.nKc2BbJ3L8.vcf
Adding file '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../../example/multimerge/hap_alleles_1.vcf.gz' / sample ''
Adding file '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../../example/multimerge/hap_alleles_2.vcf.gz' / sample ''
Writing '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../../example/multimerge/hap_alleles_1.vcf.gz:' as sample 'hap_alleles_1.vcf'
Writing '/nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../../example/multimerge/hap_alleles_2.vcf.gz:' as sample 'hap_alleles_2.vcf'
[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String
Contig chr1 is not known
35a35,40
> chr1 19472002 . GACACACACACACACACACAC G 0 . . GT 0/1 0/1
> chr1 19472022 . C CAC 0 . . GT ./. 0/1
> chr1 19472022 . C CAC 0 . . GT 0/1 ./.
> chr1 31755330 . GTATCTATC G 0 . . GT 0/1 0/1
> chr1 31755334 . CTATC C 0 . . GT ./. 0/1
> chr1 31755334 . CTATC C 0 . . GT 0/1 ./.
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=hg19
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##INFO=<ID=END,Number=.,Type=Integer,Description="SV end position">
##INFO=<ID=IMPORT_FAIL,Number=.,Type=Flag,Description="Flag to identify variants that could not be imported.">
##FORMAT=<ID=AGT,Number=1,Type=String,Description="Genotypes at ambiguous locations">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=A,Type=Integer,Description="Allele Depths">
##FORMAT=<ID=ADO,Number=.,Type=Integer,Description="Summed depth of non-called alleles.">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT hap_alleles_1.vcf hap_alleles_2.vcf
Multimerge test (2) FAILED. diff -I^# /tmp/multimerge.nKc2BbJ3L8.vcf /nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/../../example/multimerge/hap_alleles_leftshifted.vcf
Multimerge tests FAILED!
Traceback (most recent call last):
File "./install.py", line 334, in <module>
main()
File "./install.py", line 330, in main
test_haplotypes(source_dir, python_shebang, args)
File "./install.py", line 199, in test_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /usr/hpc-bio/hap.py && /nfs/git/dna/benchmarking-tools/tools/hap.py/src/sh/run_tests.sh' returned non-zero exit status 1
Hi Happy authors,
We've been using hap.py for our internal DeepVariant evaluations and recently did a deep dive into some of the FPs and FNs, and discovered what looks like a bug in hap.py. Here's the situation (these are using the b37 truth from GIAB for HG002, if that's useful):
Candidate variant call:
20 16392574 . C CGTGT 0 . . GT 1/0
truth:
20 16392574 . CGTGTGTGT CGTGTGTGTGTGT,C 50 PASS . GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|2:259:0,33,47:2,43,61:297:2|1:.:PATMAT
happy vcf output:
20 16392574 . CGTGTGTGT C 50 . BS=16392574;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 0/1:FN:lm:d6_15:INDEL:het:. ./.:.:.:.:NOCALL:nocall:0
20 16392574 . C CGTGT 0 . BS=16392574;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:i1_5:INDEL:het:0
20 16392582 . T TGTGT 50 . BS=16392574;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 0/1:FN:lm:i1_5:INDEL:het:. ./.:.:.:.:NOCALL:nocall:0
In this situation we have a multi-allelic het-alt in truth and a candidate variant that contains only one of the two alleles (C=>CGTGT is the same as CGTGTGTGT=>CGTGTGTGTGTGT, since we can remove the suffix GTGTGTGT to produce C=>CGTGT). However, in the hap.py VCF output you can see it reports one FN (for the deletion allele), one FP for the hom-ref assignment of the C=>CGTGT allele, and one extra, incorrect FN allele T=>TGTGT. This last allele is actually the same as C=>CGTGT, but it has been placed at an unusual position (16392582 instead of 16392574) because the reference in this region is a long run of GT dinucleotides.
It's unclear to us whether this is just a problem with the VCF emitter, but if it affects the actual TP/FP/FN calculations then it's definitely much more serious.
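For reference, the allele identity described above can be checked by trimming the shared suffix of REF and ALT (the suffix-trimming half of standard variant normalisation; full left-alignment through the GT repeat would additionally need the reference sequence). Under that sketch, the truth insertion allele reduces to exactly the candidate call:

```python
def trim_shared_suffix(pos, ref, alt):
    """Remove the common suffix of REF and ALT, keeping at least one base each.

    This is only the suffix-trimming half of normalisation; left-shifting the
    T=>TGTGT representation back to 16392574 would also require the reference.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return pos, ref, alt

# Truth allele 1 and the candidate call denote the same insertion:
truth = trim_shared_suffix(16392574, "CGTGTGTGT", "CGTGTGTGTGTGT")
candidate = (16392574, "C", "CGTGT")
```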
All the best,
Mark
When neither the -f nor the -T option is specified on the command line, i.e. no BED file is provided as input, hap.py filters out all variants from the truth VCF file.
This should not be the case; we should be able to compare two VCF files without needing any region definition.
Here is what the summary looks like:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 0 0 0 906937 0 906937 0 0.0 NaN 1.0 . NaN . 1.806528
INDEL PASS 0 0 0 906937 0 906937 0 0.0 NaN 1.0 . NaN . 1.806528
SNP ALL 0 0 0 3820642 0 3820642 0 0.0 NaN 1.0 . 2.025957 . 1.592023
SNP PASS 0 0 0 3820642 0 3820642 0 0.0 NaN 1.0 . 2.025957 . 1.592023
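The all-UNK summary above follows mechanically from the metric definitions: with an empty evaluation region every query call is UNK, so nothing is assessed. A sketch of the arithmetic (these formulas match the standard definitions; treat them as illustrative, not hap.py's exact code path):

```python
def summary_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recall, precision and Frac_NA from hap.py-style summary counts."""
    recall = truth_tp / (truth_tp + truth_fn) if truth_tp + truth_fn else 0.0
    assessed = query_total - query_unk            # query calls inside regions
    precision = (assessed - query_fp) / assessed if assessed else float("nan")
    frac_na = query_unk / query_total if query_total else 0.0
    return recall, precision, frac_na

# The INDEL row above: everything UNK -> recall 0, precision NaN, Frac_NA 1
recall, precision, frac_na = summary_metrics(0, 0, 906937, 0, 906937)
```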
While I understand that a copy of hg19 with 'chr' prefixed to the chromosome names is required for passing the integration tests, it seems that chromosome names are required to start with 'chr' even for general analysis. For instance, when running hap.py such as:
~/hap.py-install/bin/hap.py test_input.vcf.gz test_calls.vcf.gz -r ~/resources/GRCh38.p2/GRCh38.p2.fa -o happytest
I get an error like:
...
[fai_fetch_seq] The sequence "chr10" not found
...
There's no contig called 'chr10' in the VCFs or in the supplied reference FASTA; is hap.py perhaps accidentally looking for contigs that start with 'chr' even when it's not performing internal testing?
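One way to narrow this down is to compare the contig names declared in the VCF header against the FASTA index (.fai) before running hap.py. A stdlib sketch (`check_contigs` is a hypothetical helper, not part of hap.py):

```python
import re

def check_contigs(vcf_header_lines, fai_lines):
    """Return VCF header contigs that are absent from the FASTA .fai index."""
    vcf_contigs = set()
    for line in vcf_header_lines:
        m = re.match(r"##contig=<ID=([^,>]+)", line)
        if m:
            vcf_contigs.add(m.group(1))
    # .fai is tab-separated; the first column is the sequence name
    fasta_contigs = {line.split("\t")[0] for line in fai_lines if line.strip()}
    return sorted(vcf_contigs - fasta_contigs)

# A 'chr'-prefixed VCF contig against an unprefixed reference shows up here:
missing = check_contigs(["##contig=<ID=chr10,length=133797422>"],
                        ["10\t133797422\t52\t60\t61\n"])
```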
Hi, I am not able to install hap.py. I ran:
git clone https://github.com/sequencing/hap.py
mkdir hap.py-build
cd hap.py-build
# Run CMake
cmake ../hap.py
# Build
make
make then fails with the error below. Can anyone help me with the installation?
jsoncpp.cpp:(.text.ZN4Json6Reader15StructuredErrorC2EOS1[ZN4Json6Reader15StructuredErrorC5EOS1]+0x45): undefined reference to `std::__cxx11::basic_string<char, std::char_traits, std::allocator >::basic_string(std::__cxx11::basic_string<char, std::char_traits, std::allocator >&&)'
collect2: error: ld returned 1 exit status
make[2]: *** [src/c++/main/CMakeFiles/xcmp.dir/build.make:111: bin/xcmp] Error 1
make[1]: *** [CMakeFiles/Makefile2:304: src/c++/main/CMakeFiles/xcmp.dir/all] Error 2
make: *** [Makefile:128: all] Error 2
Hi there
pre.py may have a small regex issue:
The default arg --fixchr only transforms chromosome names in the body of a VCF file (the CHROM column); it does not transform chromosome names in the contig lines of the header.
The command pre.py uses is:
bcftools view [my.vcf.gz] | perl -pe 's/^([0-9XYM])/chr$1/' | perl -pe s/chrMT/chrM/ | bcftools view -o /tmp/tmp.vcf.gz -O z
This works fine for the body, but it ignores the header lines: the contig info still uses the b37 chromosome format (as below). This causes errors when I use a BED file in hg19 chr format (e.g. tabix index creation fails).
eg:
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001
chr1 239339 rs184451216 A G 50 PASS platforms=1;platformnames=10X;datasets=1;datasetnames=10XChromium;callsets=1;c
chr1 239482 rs201702841 G T 50 PASS platforms=2;platformnames=10X,Illumina;datasets=2;datasetnames=10XChromium,HiS
chr1 240147 rs202156508 C T 50 PASS platforms=2;platformnames=10X,Illumina;datasets=2;datasetnames=10XChromium,HiS
chr1 240898 rs14799093
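Applying the same substitution to the header would close this gap. A minimal sketch that mirrors the perl one-liners above but also rewrites `##contig=<ID=...>` lines (illustrative only, not the patched pre.py):

```python
import re

def fixchr_line(line):
    """Mirror pre.py's body renaming, but cover ##contig header lines too.

    Body rule (as in the perl one-liners): s/^([0-9XYM])/chr$1/, then
    chrMT -> chrM. Header rule: the same renaming applied to the contig ID.
    """
    if line.startswith("##contig=<ID="):
        line = re.sub(r"(##contig=<ID=)([0-9XYM])", r"\1chr\2", line)
    elif not line.startswith("#"):
        line = re.sub(r"^([0-9XYM])", r"chr\1", line)
    return line.replace("chrMT", "chrM")

fixed = [fixchr_line(l) for l in
         ["##contig=<ID=1,length=249250621>",
          "##contig=<ID=MT,length=16569>",
          "1\t239339\trs184451216\tA\tG\t50\tPASS\t."]]
```

(`bcftools annotate --rename-chrs` with a name-mapping file is another way to rename both header and body consistently.)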
The default Ubuntu-based Docker image is built using the default Python libraries from Ubuntu's repo, which for the pysam library are very outdated. This breaks the Tools/bamstats.py module, which calls pysam.idxstats(). In the outdated version this function returns a list instead of the string that the module expects, leading to an AttributeError.
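A defensive workaround, independent of the pinned pysam version, is to normalise the return value before parsing it; older pysam releases returned a list of lines where newer ones return a single string. `normalize_idxstats` is a hypothetical helper for illustration, not the actual bamstats.py code:

```python
def normalize_idxstats(result):
    """Coerce pysam.idxstats()-style output to a single string.

    Older pysam versions returned a list of lines; newer ones return a str.
    """
    if isinstance(result, bytes):
        result = result.decode()
    if isinstance(result, (list, tuple)):
        return "".join(line if isinstance(line, str) else line.decode()
                       for line in result)
    return result

# Both historical shapes parse identically afterwards:
as_list = normalize_idxstats(["chr1\t1000\t42\t0\n", "*\t0\t0\t7\n"])
as_str = normalize_idxstats("chr1\t1000\t42\t0\n*\t0\t0\t7\n")
```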
I am trying to run som.py using the Docker file provided here. I keep getting bcftools errors that appear to be related to masking in the reference .fa file. Running vcfcheck, I get the same sequences between the VCF and reference file, but one is upper case and one is lower case, so it says they don't match. I think this may be due to the version of bcftools.
Here is the bcftools error:
2018-05-07 21:18:27,692 ERROR Error running BCFTOOLS; please check if your file has issues using vcfcheck. Return code was 255, output: / [E::hts_idx_push] unsorted positions on sequence #1: 36937065 followed by 36937059
index: failed to create index for "/tmp/tmpv2p9Tv/normalized_truth.vcf.gz"
2018-05-07 21:18:27,692 ERROR Traceback (most recent call last):
2018-05-07 21:18:27,692 ERROR File "/opt/hap.py/bin/som.py", line 779, in
2018-05-07 21:18:27,693 ERROR main()
2018-05-07 21:18:27,693 ERROR File "/opt/hap.py/bin/som.py", line 276, in main
2018-05-07 21:18:27,693 ERROR args.ref)
2018-05-07 21:18:27,693 ERROR File "/opt/hap.py/lib/python27/Tools/bcftools.py", line 220, in preprocessVCF
2018-05-07 21:18:27,693 ERROR runBcftools("index", "-t", output)
2018-05-07 21:18:27,693 ERROR File "/opt/hap.py/lib/python27/Tools/bcftools.py", line 50, in runBcftools
2018-05-07 21:18:27,693 ERROR ". Return code was %i, output: %s / %s \n" % (rc, o, e))
2018-05-07 21:18:27,693 ERROR Exception: Error running BCFTOOLS; please check if your file has issues using vcfcheck. Return code was 255, output: / [E::hts_idx_push] unsorted positions on sequence #1: 36937065 followed by 36937059index: failed to create index for "/tmp/tmpv2p9Tv/normalized_truth.vcf.gz"
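The indexing failure itself comes from out-of-order positions (36937065 followed by 36937059), which is separate from the case question below: tabix/bcftools indexing requires records sorted by position within each contig. A quick stdlib check for that condition (a sketch; `bcftools sort` would be the usual fix):

```python
def first_unsorted(vcf_lines):
    """Return (chrom, prev_pos, pos) at the first out-of-order record, else None."""
    last = {}  # chrom -> last POS seen on that contig
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        if chrom in last and pos < last[chrom]:
            return (chrom, last[chrom], pos)
        last[chrom] = pos
    return None

# Reproduces the positions from the error message above:
hit = first_unsorted(["1\t36937065\t.\tA\tG\t.\t.\t.",
                      "1\t36937059\t.\tC\tT\t.\t.\t."])
```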
Here is the output of vcfcheck using the reference .fa and vcf:
mismatched reference G should be g at chr1:115252609
mismatched reference C should be c at chr13:28589495
mismatched reference C should be c at chr13:28589509
mismatched reference AAG should be Aag at chr13:28602226
mismatched reference C should be c at chr13:28611669
mismatched reference A should be a at chr13:28611710
mismatched reference CTTT should be cttt at chr17:7578711
mismatched reference A should be a at chr17:7578837
mismatched reference AGCAGAGGCTTAAGGAGGAGGAAGAAGACAAGAAACGCAAAGAGGAGGAGGAG should be AGCAGAGGCTTAAGGAGGAGGAAGAAGACAAGAAACGCAAAgaggaggaggag at chr19:13054564
mismatched reference C should be c at chr19:33792631
mismatched reference T should be t at chr20:51296340
mismatched reference G should be g at chr21:36206932
mismatched reference T should be t at chr21:43606997
mismatched reference TAAA should be Taaa at chr7:148504906
mismatched reference A should be a at chr7:148508481
mismatched reference C should be c at chr7:148525904
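If soft-masked (lower-case) reference bases are the only difference, a case-insensitive comparison turns every entry above into a match. A sketch of that check (illustrative; whether vcfcheck should fold case is a separate design question):

```python
def ref_matches(vcf_ref, fasta_ref, ignore_case=True):
    """Compare a VCF REF allele with the corresponding reference slice.

    Soft-masking encodes repeats as lower case without changing the base,
    so folding case treats 'Aag' and 'AAG' as the same sequence.
    """
    if ignore_case:
        return vcf_ref.upper() == fasta_ref.upper()
    return vcf_ref == fasta_ref

strict = ref_matches("AAG", "Aag", ignore_case=False)  # the reported mismatch
folded = ref_matches("AAG", "Aag")                     # matches when folded
```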
Chr prefixed truth test successful
Chr prefix detection tests SUCCEEDED!
2017-04-18 13:00:40,482 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-04-18 13:00:40,483 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Variant decomposition test Hap.py v0.3.7 from /home/mcgaugheyd/hap.py-install/bin
2017-04-18 13:00:41,274 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-04-18 13:00:41,275 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-04-18 13:00:41,518 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-04-18 13:00:41,519 ERROR coercing to Unicode: need string or buffer, NoneType found
2017-04-18 13:00:41,519 ERROR Traceback (most recent call last):
2017-04-18 13:00:41,519 ERROR File "/home/mcgaugheyd/hap.py-install/bin/hap.py", line 460, in <module>
2017-04-18 13:00:41,519 ERROR main()
2017-04-18 13:00:41,519 ERROR File "/home/mcgaugheyd/hap.py-install/bin/hap.py", line 199, in main
2017-04-18 13:00:41,519 ERROR if not os.path.exists(args.ref):
2017-04-18 13:00:41,519 ERROR File "/home/mcgaugheyd/anaconda2/lib/python2.7/genericpath.py", line 26, in exists
2017-04-18 13:00:41,519 ERROR os.stat(path)
2017-04-18 13:00:41,520 ERROR TypeError: coercing to Unicode: need string or buffer, NoneType found
hap.py failed!
Variant decomposition test FAILED!
Traceback (most recent call last):
File "install.py", line 334, in <module>
main()
File "install.py", line 330, in main
test_haplotypes(source_dir, python_shebang, args)
File "install.py", line 199, in test_haplotypes
subprocess.check_call(to_run, shell=True)
File "/home/mcgaugheyd/anaconda2/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /home/mcgaugheyd/hap.py-install && /usr/local/bin/hap.py/src/sh/run_tests.sh' returned non-zero exit status 1
The install fails as shown below; do you have any idea what is wrong?
The dependencies are all installed and I ran
sudo -H python install.py ~/hap.py-install
I could later install with the 'make' method, but I do not quite understand the purpose of the initial, failed install step.
Thanks for your feedback
...
Skipping Locations.INDEL -- not present on both sides
Chr prefixed truth test successful
Chr prefix detection tests SUCCEEDED!
2017-10-29 10:25:42,256 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-10-29 10:25:42,257 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Variant decomposition test Hap.py v0.3.10 from /home/luna.kuleuven.be/u0002316/hap.py-install/bin
2017-10-29 10:25:42,827 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-10-29 10:25:42,828 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-10-29 10:25:42,975 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-10-29 10:25:42,976 ERROR Please specify a valid reference path using -r.
2017-10-29 10:25:42,976 ERROR Traceback (most recent call last):
2017-10-29 10:25:42,976 ERROR File "/home/luna.kuleuven.be/u0002316/hap.py-install/bin/hap.py", line 511, in <module>
2017-10-29 10:25:42,976 ERROR main()
2017-10-29 10:25:42,976 ERROR File "/home/luna.kuleuven.be/u0002316/hap.py-install/bin/hap.py", line 211, in main
2017-10-29 10:25:42,976 ERROR raise Exception("Please specify a valid reference path using -r.")
2017-10-29 10:25:42,976 ERROR Exception: Please specify a valid reference path using -r.
hap.py failed!
Variant decomposition test FAILED!
Traceback (most recent call last):
File "install.py", line 298, in <module>
main()
File "install.py", line 294, in main
test_haplotypes(source_dir, python_shebang, args)
File "install.py", line 170, in test_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /home/luna.kuleuven.be/u0002316/hap.py-install && /opt/biotools/hap.py/src/sh/run_tests.sh' returned non-zero exit status 1
I downloaded one of the vcf files for NA12877 from a Public Dataset in Basespace (see screenshot).
https://basespace.illumina.com/analyses/18540672?preview=False&projectId=16095081
Ran it against hap.py and got the values below:
i=NA12877-rep1_S1.genome.vcf.gz && sudo docker run -it -v $PWD:/data/input -v $GROUP/PlatinumGenomes/:/data/PlatinumGenomes pkrusche/hap.py /opt/hap.py/bin/hap.py /data/PlatinumGenomes/hg19/8.0.1/NA12877/NA12877.vcf.gz /data/input/$i -o /data/test
HAPPY
Benchmarking Summary:
TRUTH.TOTAL QUERY.TOTAL METRIC.Recall METRIC.Precision METRIC.Frac_NA
Locations.INDEL 591048 622986 0.833987 0.781263 0
Locations.SNP 3592710 3226918 0.889341 0.989879 0
Are these numbers correct? I was expecting SNP recall to be on the order of 0.98 and SNP precision to be around 0.998.
I'm trying to run a comparison of the GIAB HG001 reference VCF against some local data, and am coming up with consistent crashes in hap.py. Unfortunately, the error messages don't point towards a specific cause.
Are there any instructions for how to isolate the root cause of problems like this? Is there a way of turning on debugging messages, or really any message that could point to the cause of this error?
2017-01-23 16:45:14,544 ERROR ------------------------------------------------------------
2017-01-23 16:45:14,544 ERROR Traceback (most recent call last):
2017-01-23 16:45:14,544 ERROR File "/opt/hap.py/lib/python27/Tools/parallel.py", line 71, in parMapper
2017-01-23 16:45:14,544 ERROR return arg[1]['fun'](arg[0], *arg[1]['args'], **arg[1]['kwargs'])
2017-01-23 16:45:14,544 ERROR File "/opt/hap.py/lib/python27/Haplo/partialcredit.py", line 64, in preprocessWrapper
2017-01-23 16:45:14,545 ERROR subprocess.check_call(to_run, shell=True, stdout=tfo, stderr=tfe)
2017-01-23 16:45:14,545 ERROR File "/opt/python27/lib/python2.7/subprocess.py", line 541, in check_call
2017-01-23 16:45:14,545 ERROR raise CalledProcessError(retcode, cmd)
2017-01-23 16:45:14,546 ERROR CalledProcessError: Command 'preprocess /tmp/tmpVFXClX.vcf.gz:* -l 4:46200230-49670140 -o /tmp/input.4:46200230-496701409qciL6.prep.vcf.gz -V 1 -L 0 -r /ssd/reference/fda/nih.GRCh37.fasta' returned non-zero exit status 1
2017-01-23 16:45:14,546 ERROR ------------------------------------------------------------
2017-01-23 16:45:33,072 ERROR One of the preprocess jobs failed
2017-01-23 16:45:33,072 ERROR Traceback (most recent call last):
2017-01-23 16:45:33,072 ERROR File "/opt/local/bin/hap.py", line 419, in <module>
2017-01-23 16:45:33,072 ERROR main()
2017-01-23 16:45:33,073 ERROR File "/opt/local/bin/hap.py", line 306, in main
2017-01-23 16:45:33,073 ERROR args.gender) # same gender as truth above
2017-01-23 16:45:33,073 ERROR File "/opt/hap.py/bin/pre.py", line 202, in preprocess
2017-01-23 16:45:33,073 ERROR haploid_x=gender == "male")
2017-01-23 16:45:33,073 ERROR File "/opt/hap.py/lib/python27/Haplo/partialcredit.py", line 201, in partialCredit
2017-01-23 16:45:33,073 ERROR raise Exception("One of the preprocess jobs failed")
2017-01-23 16:45:33,073 ERROR Exception: One of the preprocess jobs failed
Does hap.py compare gVCFs of various forms? Aside from the standard form on page 24 of the gVCF spec, where a symbolic ALT allele <*> indicates a hom-ref block, GATK uses a different symbolic ALT allele, <NON_REF>. Some gVCFs mark no-call blocks rather than hom-ref blocks, and some mark both. Does hap.py accept all these forms of gVCFs? Thanks a lot.
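For context, the block styles the question refers to can be told apart from the ALT field and the genotype alone. A rough classifier (the tokens follow the VCF spec and the GATK convention; how hap.py internally treats each form is exactly what the question asks):

```python
def classify_gvcf_record(alt, gt):
    """Loosely classify a gVCF record by its ALT field and genotype."""
    symbolic = {"<*>", "<NON_REF>"}  # spec-style and GATK-style ref blocks
    alts = [a for a in alt.split(",") if a != "."]
    if not alts or all(a in symbolic for a in alts):
        return "no-call block" if "." in gt else "hom-ref block"
    return "variant record"

kinds = [classify_gvcf_record("<*>", "0/0"),        # spec-style hom-ref block
         classify_gvcf_record("<NON_REF>", "0/0"),  # GATK-style hom-ref block
         classify_gvcf_record("<NON_REF>", "./."),  # no-call block
         classify_gvcf_record("T,<NON_REF>", "0/1")]
```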
Hi,
Is it possible to use hap.py to make allele-only comparisons (i.e. can the genotype check be disabled)? I know som.py does this, but I'd like to keep hap.py's haplotype resolution, only without the genotype checking.
vcf-eval appears to support this with the --squash-ploidy option.
Thanks.
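For reference, vcfeval's --squash-ploidy behaves roughly like collapsing each genotype to its set of called alleles before matching, so 0/1 and 1/1 become equivalent. A sketch of that collapsing step (an illustration of the idea, not hap.py or vcfeval code):

```python
def squash_ploidy(gt):
    """Reduce a genotype string to its set of non-ref called allele indices.

    '0/1' and '1/1' both squash to {1}, so an allele-only comparison treats
    them as a match; missing alleles ('.') are ignored.
    """
    alleles = gt.replace("|", "/").split("/")
    return {int(a) for a in alleles if a not in (".", "0")}

match = squash_ploidy("0/1") == squash_ploidy("1/1")     # genotype ignored
mismatch = squash_ploidy("0/1") == squash_ploidy("0/2")  # different alleles
```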
Is there a URL/pointer to a prebuilt Docker image of hap.py with hg19?
And given an input vcf.gz/vcf.gz.tbi pair of files, how should one run hap.py with such a prebuilt Docker image?
Thanks
I was trying to use hap.py with Google DeepVariant, with the following command.
docker run -it -v `pwd`:/data pkrusche/hap.py /opt/hap.py/bin/hap.py \
/data/quickstart-testdata/test_nist.b37_chr20_100kbp_at_10mb.vcf.gz \
"/data/${FINAL_OUTPUT_VCF}" \
-f /data/quickstart-testdata/test_nist.b37_chr20_100kbp_at_10mb.bed \
-r "/data/${REF}" \
-o "/data/${OUTPUT_DIR}/happy.output" \
--engine=vcfeval \
-l chr20:10000000-10010000
However, I get the following error:
docker: Error response from daemon: error while creating mount source path '/homes/pyk12/Desktop/DeepVariant': mkdir /homes: read-only file system.
I am not sure whether I have installed Docker or hap.py incorrectly, or have made some other mistake. I have tested Docker with docker run hello-world
and it runs fine. Do you have any idea what the problem is?
I am using Ubuntu 16.04 and I used docker pull pkrusche/hap.py
to get the image. Following is the output of docker info.
Containers: 39
Running: 0
Paused: 0
Stopped: 39
Images: 41
Server Version: 17.06.2-ce
Storage Driver: aufs
Root Dir: /var/snap/docker/common/var-lib-docker/aufs
Backing Filesystem: xfs
Dirs: 139
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-119-generic
Operating System: Ubuntu Core 16
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.55 GiB
Docker Root Dir: /var/snap/docker/common/var-lib-docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 16
Goroutines: 26
System Time: 2018-04-25T16:14:55.308825278+01:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
I got these two errors while trying to run:
python2 hap.py/bin/hap.py
example/happy/PG_NA12878_chr21.vcf.gz
example/happy/NA12878_chr21.vcf.gz
-f example/happy/PG_Conf_chr21.bed.gz
-o test
I'm using hg19 from UCSC; could that be the issue?
2018-01-19 12:33:41,959 WARNING [DEBUG] Error at hap.py/src/c++/lib/diploidgraphs/DiploidReference.cpp:287
2018-01-19 12:33:41,959 WARNING Cannot find matching haplotype pairs at chr21:37213761-37213821
2018-01-19 12:33:41,959 WARNING xcmp(_Z10error_stopPKciS0_z+0x10b)[0x542420]
2018-01-19 12:33:41,959 WARNING xcmp(_ZN10haplotypes16DiploidReference9setRegionEPKcllRKNSt7__cxx114listIN7variant8VariantsESaIS6_EEEi+0x964)[0x55b97c]
2018-01-19 12:33:41,959 WARNING xcmp(_ZN10haplotypes14DiploidCompare9setRegionEPKcllRKNSt7__cxx114listIN7variant8VariantsESaIS6_EEEii+0x28b)[0x5563d5]
2018-01-19 12:33:41,959 WARNING xcmp[0x53cf8e]
2018-01-19 12:33:41,959 WARNING xcmp(main+0x2458)[0x54007f]
2018-01-19 12:33:41,959 WARNING /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4000842830]
2018-01-19 12:33:41,959 WARNING xcmp(_start+0x29)[0x53c399]
2018-01-19 12:33:41,959 WARNING
2018-01-19 12:33:41,959 WARNING [DEBUG END]
2018-01-19 12:33:41,959 WARNING
2018-01-19 12:33:41,959 WARNING [DEBUG] Error at hap.py/src/c++/lib/diploidgraphs/DiploidReference.cpp:181
2018-01-19 12:33:41,960 WARNING Too many het nodes (21) at chr21:37868998-37869302
2018-01-19 12:33:41,960 WARNING xcmp(_Z10error_stopPKciS0_z+0x10b)[0x542420]
2018-01-19 12:33:41,960 WARNING xcmp(_ZN10haplotypes16DiploidReference9setRegionEPKcllRKNSt7__cxx114listIN7variant8VariantsESaIS6_EEEi+0x2a3)[0x55b2bb]
2018-01-19 12:33:41,960 WARNING xcmp(_ZN10haplotypes14DiploidCompare9setRegionEPKcllRKNSt7__cxx114listIN7variant8VariantsESaIS6_EEEii+0x28b)[0x5563d5]
2018-01-19 12:33:41,960 WARNING xcmp[0x53cf8e]
2018-01-19 12:33:41,960 WARNING xcmp(main+0x2458)[0x54007f]
2018-01-19 12:33:41,960 WARNING /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4000842830]
2018-01-19 12:33:41,960 WARNING xcmp(_start+0x29)[0x53c399]
2018-01-19 12:33:41,960 WARNING
2018-01-19 12:33:41,960 WARNING [DEBUG END]
I have a file (attached, repro.vcf.gz) which will consistently cause a segmentation fault in preprocess when running pre.py.
The problem can be reproduced with the latest pkrusche/hap.py Docker image by placing the attached repro.vcf.gz file, along with hg19.fa and its .fai index, in the current working directory and running the following command:
$ docker run -it -v `pwd`:/data pkrusche/hap.py /opt/hap.py/bin/pre.py -r /data/hg19.fa /data/repro.vcf.gz /data/out.vcf.gz
Output:
2018-01-24 10:35:39,664 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2018-01-24 10:35:39,670 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
[I] Total VCF records: 39000
[I] Non-reference VCF records: 39
2018-01-24 10:35:58,838 ERROR Preprocess command preprocess /tmp/tmpZtDJn6.vcf.gz:* -l chr16:1-2147483647 -o /tmp/input.chr16:1-2147483647AV1SRT.prep.vcf.gz -V 1 -L 1 -r /data/hg19.fa failed. Outputs are here /tmp/stdout5XGXK7.log / /tmp/stderryoI6H2.log
2018-01-24 10:35:58,839 ERROR Segmentation fault (core dumped)
2018-01-24 10:35:58,839 ERROR Exception when running <function preprocessWrapper at 0x7f7be157e9b0>:
2018-01-24 10:35:58,839 ERROR ------------------------------------------------------------
2018-01-24 10:35:58,839 ERROR Traceback (most recent call last):
2018-01-24 10:35:58,839 ERROR File "/opt/hap.py/lib/python27/Tools/parallel.py", line 72, in parMapper
2018-01-24 10:35:58,840 ERROR return arg[1]['fun'](arg[0], *arg[1]['args'], **arg[1]['kwargs'])
2018-01-24 10:35:58,840 ERROR File "/opt/hap.py/lib/python27/Haplo/partialcredit.py", line 67, in preprocessWrapper
2018-01-24 10:35:58,840 ERROR subprocess.check_call(to_run, shell=True, stdout=tfo, stderr=tfe)
2018-01-24 10:35:58,840 ERROR File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2018-01-24 10:35:58,841 ERROR raise CalledProcessError(retcode, cmd)
2018-01-24 10:35:58,841 ERROR CalledProcessError: Command 'preprocess /tmp/tmpZtDJn6.vcf.gz:* -l chr16:1-2147483647 -o /tmp/input.chr16:1-2147483647AV1SRT.prep.vcf.gz -V 1 -L 1 -r /data/hg19.fa' returned non-zero exit status 139
2018-01-24 10:35:58,841 ERROR ------------------------------------------------------------
Traceback (most recent call last):
File "/opt/hap.py/bin/pre.py", line 395, in <module>
main()
File "/opt/hap.py/bin/pre.py", line 391, in main
preprocessWrapper(args)
File "/opt/hap.py/bin/pre.py", line 241, in preprocessWrapper
args.somatic_allele_conversion)
File "/opt/hap.py/bin/pre.py", line 203, in preprocess
haploid_x=gender == "male")
File "/opt/hap.py/lib/python27/Haplo/partialcredit.py", line 214, in partialCredit
raise Exception("One of the preprocess jobs failed")
Exception: One of the preprocess jobs failed
The file is about as small as it can be while still consistently reproducing the issue; if I remove some lines from the end, it only segfaults occasionally.
Running the preprocess program in GDB (a locally compiled binary, no longer using the Docker image) indicates that it is crashing in htslib code:
$ gdb --args preprocess repro.vcf.gz:* -o output.vcf.gz -V 1 -L 1 -r hg19.fa
GNU gdb (GDB) Red Hat Enterprise Linux 7.8.2-38.el7
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from preprocess...done.
(gdb) run
Starting program: /opt/hap.py/bin/preprocess repro.vcf.gz:\* -o output.vcf.gz -V 1 -L 1 -r hg19.fa
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00000000005504d8 in vcf_parse_format (s=<error reading variable: Cannot access memory at address 0x7fffff7fef88>, h=<error reading variable: Cannot access memory at address 0x7fffff7fef80>,
v=<error reading variable: Cannot access memory at address 0x7fffff7fef78>, p=<error reading variable: Cannot access memory at address 0x7fffff7fef70>,
q=<error reading variable: Cannot access memory at address 0x7fffff7fef68>) at vcf.c:1883
1883 vcf.c: No such file or directory.
(gdb) where 6
#0 0x00000000005504d8 in vcf_parse_format (s=<error reading variable: Cannot access memory at address 0x7fffff7fef88>, h=<error reading variable: Cannot access memory at address 0x7fffff7fef80>,
v=<error reading variable: Cannot access memory at address 0x7fffff7fef78>, p=<error reading variable: Cannot access memory at address 0x7fffff7fef70>,
q=<error reading variable: Cannot access memory at address 0x7fffff7fef68>) at vcf.c:1883
#1 0x0000000000552dd1 in vcf_parse (s=0x8a5fd0, h=0x8a5150, v=0x8b7660) at vcf.c:2377
#2 0x0000000000540274 in _reader_fill_buffer (files=0x8a5f70, reader=0x8a6b00) at synced_bcf_reader.c:527
#3 0x0000000000540577 in _reader_next_line (files=0x8a5f70) at synced_bcf_reader.c:591
#4 0x00000000005407f3 in bcf_sr_next_line (files=0x8a5f70) at synced_bcf_reader.c:633
#5 0x00000000004ae219 in variant::VariantReader::advance() ()
(More stack frames follow...)
I have a question related to calculating local match statistics. The paper suggests:
TP.LM = QUERY.TP + FP.GT + FP.AL
FN.LM = TRUTH.FN - FP.GT - FP.AL
FP.LM = QUERY.FP - FP.GT - FP.AL
It seems that the FP.AL are counted relative to the query. If I try to use those numbers relative to the truth set (for example, to calculate recall with local matching), I get numbers that don't make sense, simply because the local matches aren't 1-to-1. Perhaps this is simply a feature request, but is there a way to get FN.AL counts? Or am I interpreting something wrong?
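For reference, the quoted formulas translate directly into a small sketch (`local_match_counts` is illustrative, not hap.py code):

```python
def local_match_counts(query_tp, query_fp, truth_fn, fp_gt, fp_al):
    """Derive local-match (LM) counts from hap.py's genotype-level
    counts, following the three formulas quoted above."""
    tp_lm = query_tp + fp_gt + fp_al
    fn_lm = truth_fn - fp_gt - fp_al
    fp_lm = query_fp - fp_gt - fp_al
    recall_lm = float(tp_lm) / (tp_lm + fn_lm) if (tp_lm + fn_lm) else 0.0
    precision_lm = float(tp_lm) / (tp_lm + fp_lm) if (tp_lm + fp_lm) else 0.0
    return tp_lm, fn_lm, fp_lm, recall_lm, precision_lm
```

As the issue points out, because FP.GT and FP.AL are counted against the query, the derived FN.LM (and hence the recall) is only an approximation when local matches are not 1-to-1.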
Downloaded hg19/8.0.1/NA12878/NA12878.vcf.gz and its index file from the Platinum Genomes FTP site. Running on a small test.vcf.gz file like so:
sudo docker run -it -v $PWD:/data/input -v $GROUP/PlatinumGenomes/:/data/PlatinumGenomes pkrusche/hap.py /opt/hap.py/bin/hap.py /data/PlatinumGenomes/hg19/8.0.1/NA12878/NA12878.vcf.gz /data/input/test.vcf.gz -o /data/test
2015-12-02 10:16:54,779 WARNING Reference sequence check failed! Please ensure that truth and query VCF use the same reference sequence as hap.py. XCMP may fail if this is not the case, and the results will not be accurate.
2015-12-02 10:16:54,779 WARNING No calls for location chr2 in query!
2015-12-02 10:16:54,779 WARNING No calls for location chr3 in query!
2015-12-02 10:16:54,779 WARNING No calls for location chr4 in query!
2015-12-02 10:16:54,779 WARNING No calls for location chr5 in query!
2015-12-02 10:16:54,779 WARNING No calls for location chr6 in query!
2015-12-02 10:16:54,779 WARNING No calls for location chr7 in query!
2015-12-02 10:16:54,779 WARNING No calls for location chr8 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr9 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr10 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr11 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr12 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr13 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr14 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr15 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr16 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr17 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr18 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr19 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr20 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr21 in query!
2015-12-02 10:16:54,780 WARNING No calls for location chr22 in query!
2015-12-02 10:16:56,399 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:56,453 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:56,548 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:56,632 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:56,685 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:56,800 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,102 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,400 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,400 WARNING [W::vcf_parse] INFO 'CS' is not defined in the header, assuming Type=String
2015-12-02 10:16:57,400 WARNING [W::vcf_parse] INFO 'Context' is not defined in the header, assuming Type=String
2015-12-02 10:16:57,400 WARNING [W::vcf_parse] INFO 'DP' is not defined in the header, assuming Type=String
2015-12-02 10:16:57,400 WARNING [W::vcf_parse] INFO 'MQ0' is not defined in the header, assuming Type=String
2015-12-02 10:16:57,400 WARNING [W::vcf_parse] INFO 'NS' is not defined in the header, assuming Type=String
2015-12-02 10:16:57,400 WARNING [W::vcf_parse] INFO 'REF' is not defined in the header, assuming Type=String
2015-12-02 10:16:57,400 WARNING [W::vcf_parse] INFO 'SB' is not defined in the header, assuming Type=String
2015-12-02 10:16:57,643 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,747 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,828 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,942 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,992 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:57,992 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,074 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,329 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,345 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,377 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,451 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,508 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,622 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,626 WARNING Could not parse the header line: "##vcfProcessLog=<InputVCF=<>, InputVCFSource=<>, InputVCFVer=<1.1>, InputVCFParam=<> InputVCFgeneAnno=<>>"
2015-12-02 10:16:58,627 ERROR Cannot read sample names from truth input file
2015-12-02 10:16:58,627 ERROR Traceback (most recent call last):
2015-12-02 10:16:58,627 ERROR File "/opt/hap.py/bin/hap.py", line 804, in <module>
2015-12-02 10:16:58,627 ERROR main()
2015-12-02 10:16:58,628 ERROR File "/opt/hap.py/bin/hap.py", line 575, in main
2015-12-02 10:16:58,628 ERROR raise Exception("Cannot read sample names from truth input file")
2015-12-02 10:16:58,628 ERROR Exception: Cannot read sample names from truth input file
My script:
export HGREF=/mypath/hs37d5.fasta
hap.py /path/ctrl.vcf /path/query.vcf -o happy.out -r /path/hs37d5.fasta
g++ version gcc/gcc-7.2.0/bin/g++
Python 2.7.13 :: Anaconda custom (64-bit)
but I still got these errors:
2017-12-28 11:39:56,825 ERROR [stderr] regex_error
2017-12-28 11:39:57,010 ERROR Command 'quantify /tmp/hap.py.result.kQ8D10.vcf.gz -o happy.out12268.roc.tsv -r hs37d5.fasta --threads 32 --output-vtc 0 --output-rocs 1 --type xcmp --qq IQQ --qq-header QUAL --roc-delta 0.500000 --clean-info 1 --fix-chr-regions 0 -v happy.out12268.vcf.gz --roc-regions ''' returned non-zero exit status 1
2017-12-28 11:39:57,010 ERROR Traceback (most recent call last):
2017-12-28 11:39:57,010 ERROR File "/nextcode/sge_software/hap/hapv0.3.10/bin/hap.py", line 511, in
2017-12-28 11:39:57,021 ERROR main()
2017-12-28 11:39:57,022 ERROR File "/nextcode/sge_software/hap/hapv0.3.10/bin/hap.py", line 496, in main
2017-12-28 11:39:57,022 ERROR qfy.quantify(args)
2017-12-28 11:39:57,022 ERROR File "/nextcode/sge_software/hap/hapv0.3.10/bin/qfy.py", line 129, in quantify
2017-12-28 11:39:57,023 ERROR strat_fixchr=args.strat_fixchr)
2017-12-28 11:39:57,023 ERROR File "/nextcode/sge_software/hap/hapv0.3.10/lib/python27/Haplo/quantify.py", line 177, in run_quantify
2017-12-28 11:39:57,027 ERROR subprocess.check_call(run_str, shell=True, stdout=tfo, stderr=tfe)
2017-12-28 11:39:57,027 ERROR File "/nextcode/sge_software/anaconda2/lib/python2.7/subprocess.py", line 186, in check_call
2017-12-28 11:39:57,050 ERROR raise CalledProcessError(retcode, cmd)
2017-12-28 11:39:57,051 ERROR CalledProcessError: Command 'quantify /tmp/hap.py.result.kQ8D10.vcf.gz -o happy.out12268.roc.tsv -r hs37d5.fasta --threads 32 --output-vtc 0 --output-rocs 1 --type xcmp --qq IQQ --qq-header QUAL --roc-delta 0.500000 --clean-info 1 --fix-chr-regions 0 -v happy.out12268.vcf.gz --roc-regions ''' returned non-zero exit status 1
[...]
*** 2391 failures detected in test suite "Master Test Suite"
Boost Unit tests FAILED!
Traceback (most recent call last):
File "install.py", line 315, in <module>
main()
File "install.py", line 311, in main
test_haplotypes(source_dir, python_shebang, args)
File "install.py", line 191, in test_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /opt/hap.py && /opt/hap.py-source/src/sh/run_tests.sh' returned non-zero exit status 1
The command '/bin/sh -c HG19=/opt/hap.py-data/hg19.fa python install.py /opt/hap.py' returned a non-zero code: 1
Any ideas?
Hi,
I have installed hap.py through Bioconda and tried to run the program according to the README:
hap.py hap.py/example/happy/PG_NA12878_chr21.vcf.gz hap.py/example/happy/NA12878_chr21.vcf.gz -r /mnt/hds/proj/bioinfo/MIP_ANALYSIS/references_4.0/Homo_sapiens.GRCh37.d5.fasta -f hap.py/example/happy/PG_Conf_chr21.bed.gz -o test
I got the following error:
2017-09-04 14:10:07,886 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-09-04 14:10:07,897 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
[W] overlapping records at chr21:10993857 for sample 0
[W] Symbolic / SV ALT alleles at chr21:15847469
[W] Variants that overlap on the reference allele: 144
[W] Variants that have symbolic ALT alleles: 14
[I] Total VCF records: 65402
[I] Non-reference VCF records: 65402
Contig chr21 is not known
2017-09-04 14:10:21,531 ERROR Command 'gvcf2bed /tmp/truth.ppCDsZ5w.vcf.gz -r /mnt/hds/proj/bioinfo/MIP_ANALYSIS/references_4.0/Homo_sapiens.GRCh37.d5.fasta -o /tmp/tmpqpBTVv.bed -T hap.py/example/happy/PG_Conf_chr21.bed.gz' returned non-zero exit status 1
2017-09-04 14:10:21,531 ERROR Traceback (most recent call last):
2017-09-04 14:10:21,531 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/bin/hap.py", line 460, in <module>
2017-09-04 14:10:21,532 ERROR main()
2017-09-04 14:10:21,532 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/bin/hap.py", line 282, in main
2017-09-04 14:10:21,532 ERROR conf_temp = Haplo.gvcf2bed.gvcf2bed(args.vcf1, args.ref, args.fp_bedfile, args.scratch_prefix)
2017-09-04 14:10:21,532 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/lib/python27/Haplo/gvcf2bed.py", line 38, in gvcf2bed
2017-09-04 14:10:21,544 ERROR subprocess.check_call(cmdline, shell=True)
2017-09-04 14:10:21,545 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/lib/python2.7/subprocess.py", line 186, in check_call
2017-09-04 14:10:21,561 ERROR raise CalledProcessError(retcode, cmd)
2017-09-04 14:10:21,561 ERROR CalledProcessError: Command 'gvcf2bed /tmp/truth.ppCDsZ5w.vcf.gz -r /mnt/hds/proj/bioinfo/MIP_ANALYSIS/references_4.0/Homo_sapiens.GRCh37.d5.fasta -o /tmp/tmpqpBTVv.bed -T hap.py/example/happy/PG_Conf_chr21.bed.gz' returned non-zero exit status 1
I downloaded gvcf2bed and tried to run the same command again and got a new error.
2017-09-04 14:25:11,100 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2017-09-04 14:25:11,104 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
[W] overlapping records at chr21:10993857 for sample 0
[W] Symbolic / SV ALT alleles at chr21:15847469
[W] Variants that overlap on the reference allele: 144
[W] Variants that have symbolic ALT alleles: 14
[I] Total VCF records: 65402
[I] Non-reference VCF records: 65402
usage: gvcf2bed [-h] -I INPUT -O OUTPUT [-s SAMPLE] [-q QUALITY]
[-nq NON_VARIANT_QUALITY] [-b]
gvcf2bed: error: argument -I/--input is required
2017-09-04 14:25:12,072 ERROR Command 'gvcf2bed /tmp/truth.ppfH1zZb.vcf.gz -r /mnt/hds/proj/bioinfo/MIP_ANALYSIS/references_4.0/Homo_sapiens.GRCh37.d5.fasta -o /tmp/tmpCeSiNo.bed -T hap.py/example/happy/PG_Conf_chr21.bed.gz' returned non-zero exit status 2
2017-09-04 14:25:12,072 ERROR Traceback (most recent call last):
2017-09-04 14:25:12,072 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/bin/hap.py", line 460, in <module>
2017-09-04 14:25:12,072 ERROR main()
2017-09-04 14:25:12,073 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/bin/hap.py", line 282, in main
2017-09-04 14:25:12,073 ERROR conf_temp = Haplo.gvcf2bed.gvcf2bed(args.vcf1, args.ref, args.fp_bedfile, args.scratch_prefix)
2017-09-04 14:25:12,073 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/lib/python27/Haplo/gvcf2bed.py", line 38, in gvcf2bed
2017-09-04 14:25:12,073 ERROR subprocess.check_call(cmdline, shell=True)
2017-09-04 14:25:12,073 ERROR File "/mnt/hds/proj/bioinfo/SERVER/miniconda/envs/rnaseq/lib/python2.7/subprocess.py", line 186, in check_call
2017-09-04 14:25:12,074 ERROR raise CalledProcessError(retcode, cmd)
2017-09-04 14:25:12,074 ERROR CalledProcessError: Command 'gvcf2bed /tmp/truth.ppfH1zZb.vcf.gz -r /mnt/hds/proj/bioinfo/MIP_ANALYSIS/references_4.0/Homo_sapiens.GRCh37.d5.fasta -o /tmp/tmpCeSiNo.bed -T hap.py/example/happy/PG_Conf_chr21.bed.gz' returned non-zero exit status 2
It seems like this gvcf2bed only supports Python 3, but I was unable to install hap.py under Python 3. This probably isn't causing the error, but which Python version do you recommend for running hap.py?
Hi, I am comparing two VCF files and get the following errors:
[W] overlapping records at 1:6799149 for sample 0
[W] Variants that overlap on the reference allele: 1346
[I] Total VCF records: 4029532
[I] Non-reference VCF records: 4029532
[W] overlapping records at 1:8149316 for sample 0
[W] Variants that overlap on the reference allele: 464
[I] Total VCF records: 4003155
[I] Non-reference VCF records: 4003155
2018-01-10 12:31:59,960 WARNING No calls for location MT in query!
2018-01-10 12:34:22,279 WARNING [W] too many AD fields at 1:49573365 max_ad = 6 retrieved: 7
2018-01-10 12:34:22,279 WARNING *** Error in `xcmp': free(): invalid next size (fast): 0x0000000002c96e90 ***
2018-01-10 12:34:22,280 WARNING ======= Backtrace: =========
2018-01-10 12:34:22,280 WARNING /usr/lib64/libc.so.6(+0x7d023)[0x7efdc5c05023]
2018-01-10 12:34:22,280 WARNING xcmp(_ZN7variant13VariantReader7advanceEv+0x211f)[0x56eac3]
2018-01-10 12:34:22,280 WARNING xcmp(main+0x2222)[0x533839]
2018-01-10 12:34:22,280 WARNING /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7efdc5ba9b15]
2018-01-10 12:34:22,280 WARNING xcmp[0x52fd41]
2018-01-10 12:34:22,280 WARNING ======= Memory map: ========
2018-01-10 12:34:22,280 WARNING 00400000-007a0000 r-xp 00000000 00:25 20026347337 tools/hap.py-build/bin/xcmp
...
...
2018-01-10 12:47:17,996 WARNING ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
2018-01-10 12:47:17,996 ERROR Exception when running <function xcmpWrapper at 0x7fa633357668>:
2018-01-10 12:47:17,996 ERROR ------------------------------------------------------------
2018-01-10 12:47:17,997 ERROR Traceback (most recent call last):
2018-01-10 12:47:17,997 ERROR File "tools/hap.py-build/lib/python27/Tools/parallel.py", line 71, in parMapper
2018-01-10 12:47:17,997 ERROR return arg[1]['fun'](arg[0], *arg[1]['args'], **arg[1]['kwargs'])
2018-01-10 12:47:17,997 ERROR File "tools/hap.py-build/lib/python27/Haplo/xcmp.py", line 69, in xcmpWrapper
2018-01-10 12:47:17,997 ERROR subprocess.check_call(to_run, shell=True, stdout=tfo, stderr=tfe)
2018-01-10 12:47:17,997 ERROR File "/home/cc439/anaconda2/envs/hornet/lib/python2.7/subprocess.py", line 186, in check_call
2018-01-10 12:47:17,997 ERROR raise CalledProcessError(retcode, cmd)
2018-01-10 12:47:17,997 ERROR CalledProcessError: Command 'xcmp /tmp/truth.ppooeVs5.vcf.gz /tmp/query.ppR_g6QK.vcf.gz -l 3:46861889-55529583 -o /tmp/result.3:46861889-55529583yNJudr.bcf -r refdata-b37-2.1.0/fasta/genome.fa -f 0 -n 16768 --expand-hapblocks 30 --window 50 --no-hapcmp 0 --qq QUAL' returned non-zero exit status -6
2018-01-10 12:47:17,997 ERROR ------------------------------------------------------------
Any possible reasons for this?
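The "too many AD fields" warning matches the AD buffer overflow described at the top of this page: BCFHelpers writes more AD values than VariantReader allocated. Until that is fixed in the C++ code, a hedged workaround is to truncate each sample's AD entry to the expected length (REF plus one value per ALT) before running hap.py; `truncate_ad` below is a hypothetical pre-filter, not part of hap.py:

```python
def truncate_ad(vcf_line):
    """Trim each sample's AD entry to at most 1 + number of ALT
    alleles (one depth for REF plus one per ALT), dropping excess
    values that would overflow hap.py's fixed-size AD buffer."""
    fields = vcf_line.rstrip("\n").split("\t")
    if vcf_line.startswith("#") or len(fields) < 10:
        return vcf_line  # headers / lines without samples pass through
    n_alleles = 1 + len(fields[4].split(","))  # REF + ALTs
    fmt = fields[8].split(":")
    if "AD" not in fmt:
        return vcf_line
    ad_idx = fmt.index("AD")
    for i in range(9, len(fields)):
        sample = fields[i].split(":")
        if ad_idx < len(sample):
            ad = sample[ad_idx].split(",")
            sample[ad_idx] = ",".join(ad[:n_alleles])
            fields[i] = ":".join(sample)
    return "\t".join(fields)
```

Applied to the exaggerated record from the bug report above, the 19-element AD list would be cut down to the two values the single-ALT site can legally carry.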
Hi,
I'm running qfy.py on a VCF produced by vcfeval in ga4gh mode. The run takes around two hours, after which it fails. The command line and the stderr are as follows:
python hap.py/bin/qfy.py --report-prefix combined --write-vcf --write-counts --verbose --type ga4gh --threads 8 --stratification stratification.tsv --roc-delta 20 --roc QQ --reference human_g1k_v37_decoy.fasta combined.vcf.gz
2018-03-12T17:39:59.736617253Z 2018-03-12 17:39:59,089 ERROR [stderr] Added region file 'truseq.bed' as 'truseq' (214115 intervals)
2018-03-12T17:39:59.800729266Z 2018-03-12 17:39:59,666 ERROR [stderr] Killed
2018-03-12T17:39:59.800761058Z 2018-03-12 17:39:59,728 ERROR Command 'quantify combined.vcf.gz -o combined.roc.tsv -r human_g1k_v37_decoy.fasta --threads 8 --output-vtc 0 --output-rocs 1 --type ga4gh --qq QQ --roc-delta 20.000000 --clean-info 1 --fix-chr-regions 0 -v combined.vcf.gz -R 'truseq:truseq.bed' --roc-regions '*'' returned non-zero exit status 137
2018-03-12T17:39:59.801256247Z 2018-03-12 17:39:59,750 ERROR Traceback (most recent call last):
2018-03-12T17:39:59.801259753Z 2018-03-12 17:39:59,761 ERROR File "/opt/hap.py/bin/qfy.py", line 378, in <module>
2018-03-12T17:39:59.872980577Z 2018-03-12 17:39:59,865 ERROR main()
2018-03-12T17:39:59.873145682Z 2018-03-12 17:39:59,872 ERROR File "/opt/hap.py/bin/qfy.py", line 374, in main
2018-03-12T17:39:59.873228664Z 2018-03-12 17:39:59,873 ERROR quantify(args)
2018-03-12T17:39:59.890464487Z 2018-03-12 17:39:59,873 ERROR File "/opt/hap.py/bin/qfy.py", line 129, in quantify
2018-03-12T17:39:59.890505413Z 2018-03-12 17:39:59,873 ERROR strat_fixchr=args.strat_fixchr)
2018-03-12T17:39:59.890519038Z 2018-03-12 17:39:59,874 ERROR File "/opt/hap.py/lib/python27/Haplo/quantify.py", line 178, in run_quantify
2018-03-12T17:39:59.911044306Z 2018-03-12 17:39:59,910 ERROR subprocess.check_call(run_str, shell=True, stdout=tfo, stderr=tfe)
2018-03-12T17:39:59.911065951Z 2018-03-12 17:39:59,910 ERROR File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2018-03-12T17:39:59.951891424Z 2018-03-12 17:39:59,951 ERROR raise CalledProcessError(retcode, cmd)
2018-03-12T17:39:59.965796603Z 2018-03-12 17:39:59,964 ERROR CalledProcessError: Command 'quantify combined.vcf.gz -o combined.roc.tsv -r human_g1k_v37_decoy.fasta --threads 8 --output-vtc 0 --output-rocs 1 --type ga4gh --qq QQ --roc-delta 20.000000 --clean-info 1 --fix-chr-regions 0 -v combined.vcf.gz -R 'truseq:truseq.bed' --roc-regions '*'' returned non-zero exit status 137
I'd be happy to share the input files if that would help.
P.S. I truncated the file paths from the stderr just for simplicity's sake.
Thanks
I see in the documentation for som.py that it outputs confidence intervals for recall and precision using the Jeffreys method for binomial proportions. These are calculated by the script https://github.com/Illumina/hap.py/blob/master/src/python/Tools/ci.py.
As far as I can see, there isn't an option to output confidence intervals from hap.py (please correct me if I'm wrong!). In future versions, would it be possible to output this information to the summary.csv file?
In the meantime, would the above Tools/ci.py script be suitable for calculating confidence intervals for the recall and precision output by hap.py, or would you recommend a different method?
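For context, the Jeffreys interval for x successes in n trials is the central (1 − α) credible interval of a Beta(x + 1/2, n − x + 1/2) posterior. The sketch below is self-contained (it is not the Tools/ci.py implementation) and uses a simple numerical Beta quantile, assuming 0 < x < n so the density vanishes at both endpoints:

```python
import math

def _beta_pdf(t, a, b, log_beta):
    # Beta(a, b) density at t, computed in log space for stability
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_beta)

def _beta_ppf(q, a, b, steps=1500):
    """Quantile of Beta(a, b) via bisection on a numerically
    integrated CDF (adequate when a, b > 1, i.e. 1 <= x <= n - 1)."""
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    def cdf(x):
        if x <= 0.0:
            return 0.0
        h = x / steps
        return min(1.0, h * sum(_beta_pdf(i * h, a, b, log_beta)
                                for i in range(1, steps)))
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def jeffreys_interval(x, n, alpha=0.05):
    """Jeffreys (1 - alpha) interval for a binomial proportion:
    the alpha/2 and 1 - alpha/2 quantiles of Beta(x + 1/2, n - x + 1/2)."""
    a, b = x + 0.5, n - x + 0.5
    return _beta_ppf(alpha / 2, a, b), _beta_ppf(1 - alpha / 2, a, b)
```

A common refinement, which a production implementation should add, pins the lower bound to 0 when x = 0 and the upper bound to 1 when x = n.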
I tried to install on Ubuntu 16.04 and installed all the dependencies from the Dockerfile,
with numexpr 2.6.2, pandas 0.20.3 and numpy 1.13.1.
The install failed during the tests; the relevant log follows. Is it because som.py uses the deprecated
method sort, which is no longer supported by pandas?
/usr/bin/python /home/jianfeng.qian/hap.py-install/bin/som.py /home/jianfeng.qian/graph_geno/hap.py/src/sh/../../example/sompy/PG_admix_truth_snvs.vcf.gz /home/jianfeng.qian/graph_geno/hap.py/src/sh/../../example/sompy/strelka_admix_snvs.vcf.gz -o /tmp/sompy.L2qFyDafe7 -P --feature-table hcc.strelka.indel
2017-07-14 02:58:03,424 WARNING Missing feature I.QSI_NT
2017-07-14 02:58:03,429 WARNING Missing feature I.RC
2017-07-14 02:58:03,430 WARNING Missing feature I.IC
2017-07-14 02:58:03,430 WARNING Missing feature I.IHP
2017-07-14 02:58:03,430 WARNING Missing feature S.1.BCN50
2017-07-14 02:58:03,430 WARNING Missing feature S.2.BCN50
2017-07-14 02:58:03,430 WARNING Missing feature S.1.FDP50
2017-07-14 02:58:03,431 WARNING Missing feature S.2.FDP50
2017-07-14 02:58:03,431 WARNING Missing feature S.1.TAR
2017-07-14 02:58:03,431 WARNING Missing feature S.2.TAR
2017-07-14 02:58:03,431 WARNING Missing feature S.1.TIR
2017-07-14 02:58:03,431 WARNING Missing feature S.2.TIR
2017-07-14 02:58:03,432 WARNING Missing feature S.1.TOR
2017-07-14 02:58:03,432 WARNING Missing feature S.2.TOR
2017-07-14 02:58:17,032 ERROR 'DataFrame' object has no attribute 'sort'
2017-07-14 02:58:17,038 ERROR Traceback (most recent call last):
2017-07-14 02:58:17,038 ERROR File "/home/jianfeng.qian/hap.py-install/bin/som.py", line 779, in <module>
2017-07-14 02:58:17,038 ERROR main()
2017-07-14 02:58:17,038 ERROR File "/home/jianfeng.qian/hap.py-install/bin/som.py", line 500, in main
2017-07-14 02:58:17,039 ERROR tps.sort(["CHROM", "POS"], inplace=True)
2017-07-14 02:58:17,039 ERROR File "/usr/local/lib/python2.7/dist-packages/pandas-0.20.3-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 3081, in __getattr__
2017-07-14 02:58:17,041 ERROR return object.__getattribute__(self, name)
2017-07-14 02:58:17,041 ERROR AttributeError: 'DataFrame' object has no attribute 'sort'
som.py failed!
Som.py test FAILED!
Traceback (most recent call last):
File "install.py", line 325, in <module>
main()
File "install.py", line 321, in main
test_haplotypes(source_dir, python_shebang, args)
File "install.py", line 190, in test_haplotypes
subprocess.check_call(to_run, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cd /home/jianfeng.qian/hap.py-install && /home/jianfeng.qian/graph_geno/hap.py/src/sh/run_tests.sh' returned non-zero exit status 1
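The traceback above points at a pandas API change: `DataFrame.sort()` was deprecated in pandas 0.17 and removed in 0.20, so the call in som.py raises `AttributeError` under pandas 0.20.3. A minimal sketch of the modern equivalent (illustrative data, not the actual som.py frame):

```python
# DataFrame.sort() no longer exists in pandas >= 0.20;
# sort_values() is the replacement for sorting by columns.
import pandas as pd

tps = pd.DataFrame({"CHROM": ["chr2", "chr1", "chr1"], "POS": [5, 9, 3]})

# Old (fails on pandas >= 0.20): tps.sort(["CHROM", "POS"], inplace=True)
tps.sort_values(["CHROM", "POS"], inplace=True)
```

So the fix would be to change `tps.sort(["CHROM", "POS"], inplace=True)` at som.py line 500 to use `sort_values`, or to pin an older pandas version.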