Comments (12)
Did this work in the end? I do think we added support for removing the '*' allele (for assessment on VCF 4.2 files, only the initial deletion, and not the downstream '*' alleles, would need to be considered).
from hap.py.
Hi Peter,
We extracted one sample from a multi-sample VCF with GATK SelectVariants. In my opinion, a '*' allele should not occur in a single-sample VCF. Am I wrong on this point? Is there a better tool? With '*' it still doesn't work.
Greetings from Munich
Martin
from hap.py.
To my understanding, the '*' allele indicates an upstream deletion as the "other" ALT allele, e.g. for hemizygous SNPs, so it can also occur in single-sample files. Empty ALTs are not legal, I think.
Can you perhaps share a problematic VCF record?
Another thing to try is to switch off preprocessing; this will be standard in future versions:
--no-leftshift --no-decompose --gender=none
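For illustration, this is roughly what a record with a spanning deletion looks like and how one could detect it; the field layout follows the VCF specification, and has_spanning_deletion is a hypothetical helper, not part of hap.py:

```python
# Minimal sketch: detect VCF records whose ALT column contains the
# spanning-deletion allele '*' (introduced in VCF 4.2).
def has_spanning_deletion(vcf_line):
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # column 5 is ALT, comma-separated
    return "*" in alts

# Illustrative record: a SNP whose "other" allele is an upstream deletion.
record = "1\t37070742\t.\tA\tG,*\t50\tPASS\t.\tGT\t1/2"
print(has_spanning_deletion(record))  # True
```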
from hap.py.
The VCF is about 2.5 GB in size; I will try to extract one small chromosome. Another question: is hap.py able to deal with the NCBI "NC_000001" chromosome nomenclature?
from hap.py.
A few records that cause the problem / have the alleles in question are probably sufficient.
W.r.t. the chromosomes, can you show an example as well? W.r.t. chromosome names and locations, the only requirement hap.py has is that the VCF and BED chromosome names match the reference fasta file.
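That matching requirement can be sketched as a simple set check; contigs_match is a hypothetical helper for illustration, not a hap.py function:

```python
# Sketch of the consistency rule: every contig name used in the VCF/BED
# must also be a sequence name in the reference FASTA.
def contigs_match(vcf_contigs, fasta_contigs):
    return set(vcf_contigs) <= set(fasta_contigs)

print(contigs_match(["NC_000001"], ["chr1", "chr2"]))  # False: names differ
print(contigs_match(["chr1"], ["chr1", "chr2"]))       # True
```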
from hap.py.
With a few changes (removing-unused-alternates & excludeNonVariants applied to the query dataset) and your options, I finally made it to another error:
2018-03-05 09:05:48,503 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
[W] overlapping records at NC_000001:37070742 for sample 0
[W] Variants that overlap on the reference allele: 93
[I] Total VCF records: 3619471
[I] Non-reference VCF records: 3619471
[W] overlapping records at NC_000001:3403037 for sample 0
[W] Variants that overlap on the reference allele: 954
[I] Total VCF records: 2423857
[I] Non-reference VCF records: 2415113
2018-03-05 09:07:10,877 ERROR 'utf8' codec can't decode byte 0xe4 in position 4: invalid continuation byte
2018-03-05 09:07:10,877 ERROR Traceback (most recent call last):
2018-03-05 09:07:10,878 ERROR File "/usr/local/bin/hap.py", line 511, in <module>
2018-03-05 09:07:10,878 ERROR main()
2018-03-05 09:07:10,878 ERROR File "/usr/local/bin/hap.py", line 366, in main
2018-03-05 09:07:10,879 ERROR "QUERY")
2018-03-05 09:07:10,879 ERROR File "/usr/local/bin/pre.py", line 131, in preprocess
2018-03-05 09:07:11,163 ERROR h = vcfextract.extractHeadersJSON(vcf_input)
2018-03-05 09:07:11,163 ERROR File "/usr/local/lib/python27/Tools/vcfextract.py", line 229, in extractHeadersJSON
2018-03-05 09:07:11,167 ERROR vfh = json.load(open(tf.name))
2018-03-05 09:07:11,167 ERROR File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
2018-03-05 09:07:11,171 ERROR **kw)
2018-03-05 09:07:11,171 ERROR File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
2018-03-05 09:07:11,171 ERROR return _default_decoder.decode(s)
2018-03-05 09:07:11,171 ERROR File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
2018-03-05 09:07:11,172 ERROR obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2018-03-05 09:07:11,172 ERROR File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
2018-03-05 09:07:11,173 ERROR obj, end = self.scan_once(s, idx)
2018-03-05 09:07:11,173 ERROR UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 4: invalid continuation byte
from hap.py.
Here is the query VCF, only chr22 (removed-unused-alternates & excludedNonVariants)
new_giab_vars.zip
from hap.py.
One problem with your file appears to be that it includes an umlaut (the 'ä' in 'März', where some tool pasted a timestamp into the header) in a non-UTF-8 encoding (it appears to be Latin-1; it's the header from SelectVariants). This is probably the result of different encoding settings for bash and for Java.
Also, hap.py does not translate contig names. The truth and query files must have the same set of contig names, so in order to compare this file to GIAB on hg38, the contig names and positions in your query need to be on the same reference as the ones in the truth set (i.e. something like chr22:...).
from hap.py.
Here is a command line to make the file usable. It's a bit of a hack chromosome-name-wise; it would probably be better to use a reference with the right names for variant calling:
cat ~/Downloads/new_giab_vars.vcf | perl -pe 's/NC_0+/chr/' | grep -v SelectVariants | grep -v GATKCommandLine | grep -v 'contig=<ID=N' | bcftools annotate -x INFO,^FORMAT/GT | awk '$0~"^#" { print $0; next } { print $0 | "LC_ALL=C sort -k1,1 -k2,2n" }' | bcftools view -o new_giab_translated.vcf.gz -O z
bcftools index new_giab_translated.vcf.gz
This removes the offending headers and translates the chromosome names. The VCF file also needs to be sorted; this pipeline does that as well.
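The name translation in that pipeline is the perl substitution s/NC_0+/chr/; sketched in Python (translate_contig is an illustrative helper), it collapses the NCBI "NC_0..." prefix into a UCSC-style "chr" name:

```python
import re

# Same substitution as the perl one-liner: replace "NC_" plus the run of
# leading zeros with "chr", e.g. NC_000022 -> chr22.
def translate_contig(name):
    return re.sub(r"NC_0+", "chr", name)

print(translate_contig("NC_000022"))  # chr22
print(translate_contig("NC_000001"))  # chr1
```

Note this simple regex only works for unversioned accessions; a name like "NC_000022.11" would become "chr22.11", which is one reason calling variants against a reference with the desired names is the cleaner option.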
from hap.py.
This seems to work; this is what I get:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 6051 407 5644 1234 293 528 159 0.067262 0.584986 0.427877 0.120651 NaN NaN 2.500309 0.583548
INDEL PASS 6051 403 5648 1030 276 345 154 0.066601 0.597080 0.334951 0.119834 NaN NaN 2.500309 0.527489
SNP ALL 34462 13645 20817 32353 7839 10869 5445 0.395943 0.635124 0.335950 0.487792 2.429894 2.246512 2.240218 1.588876
SNP PASS 34462 13309 21153 27301 7644 6348 5303 0.386193 0.635184 0.232519 0.480339 2.429894 2.371294 2.240218 1.482088
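As a sanity check on the table above, the headline metrics can be reproduced from the counts in the SNP ALL row; the formulas below follow the standard definitions hap.py uses for its summary output:

```python
# Counts from the SNP ALL row of the table above.
truth_tp, truth_fn = 13645, 20817
query_total, query_fp, query_unk = 32353, 7839, 10869

query_tp = query_total - query_fp - query_unk      # assessed query calls that matched
recall = truth_tp / float(truth_tp + truth_fn)     # TRUTH.TP / TRUTH.TOTAL
precision = query_tp / float(query_tp + query_fp)  # QUERY.TP / (QUERY.TP + QUERY.FP)
frac_na = query_unk / float(query_total)           # QUERY.UNK / QUERY.TOTAL
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 6))     # 0.395943
print(round(precision, 6))  # 0.635124
print(round(frac_na, 6))    # 0.33595
print(round(f1, 6))         # 0.487792
```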
from hap.py.
It works pretty well at first sight. It seems March, or as we say, März, is not a good month for genome analysis ;-). Thank you.
from hap.py.
Happy it worked!
I think the umlauts should not normally break things; it's just that the GATK binaries apparently use a different encoding from your shell. They are Java-based, so any wrapper scripts you have for them might contain encoding switches that make them use something other than UTF-8. You can specify the encoding for the JVM that runs GATK: https://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
from hap.py.