Coder Social home page Coder Social logo

Comments (12)

pkrusche avatar pkrusche commented on August 24, 2024

Did this work in the end? I do think that we added support for removing the * allele (for assessment on VCF 4.2 files, only the initial deletion and not the downstream * alleles would need to be considered).

from hap.py.

mziegler12 avatar mziegler12 commented on August 24, 2024

Hi Peter,

we extracted one sample from a multi VCF with GATK SelectVariants. In my opinion, in a single sample vcf a '' should not occur. Am i wrong at this point? Is there a better tool? With '' it still doesn´t work.

Greetings from Munich
Martin

from hap.py.

pkrusche avatar pkrusche commented on August 24, 2024

To my understanding the * allele indicates an upstream deletion as the "other" ALT allele e.g. for hemizygous SNPs, so they can also occur in single samples. Empty ALTs are not legal I think.

Can you perhaps share a problematic VCF record?

Another thing to try is to switch off preprocessing, this will be standard in future versions:

--no-leftshift --no-decompose --gender=none

from hap.py.

mziegler12 avatar mziegler12 commented on August 24, 2024

The VCF is about 2.5Gb in size. I try to extract one small chromosome. Another question is: Is Hap.py able to deal with NCBI "NC_000001" nomenclature of chromosomes?

from hap.py.

pkrusche avatar pkrusche commented on August 24, 2024

A few records that cause the problem / have the alleles in question are probably sufficient.

W.r.t. the chromosomes, can you show an example also? W.r.t. chromosome names and locations, the only requirement hap.py has is that the VCF and BED chromosome names match the reference fasta file.

from hap.py.

mziegler12 avatar mziegler12 commented on August 24, 2024

With a few changes (removing-unused-alternates & excludeNonVariants from query dataset) and your options, i finaly made it to another ERROR:

2018-03-05 09:05:48,503 WARNING  No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
[W] overlapping records at NC_000001:37070742 for sample 0
[W] Variants that overlap on the reference allele: 93
[I] Total VCF records:         3619471
[I] Non-reference VCF records: 3619471
[W] overlapping records at NC_000001:3403037 for sample 0
[W] Variants that overlap on the reference allele: 954
[I] Total VCF records:         2423857
[I] Non-reference VCF records: 2415113
2018-03-05 09:07:10,877 ERROR    'utf8' codec can't decode byte 0xe4 in position 4: invalid continuation byte
2018-03-05 09:07:10,877 ERROR    Traceback (most recent call last):
2018-03-05 09:07:10,878 ERROR      File "/usr/local/bin/hap.py", line 511, in <module>
2018-03-05 09:07:10,878 ERROR        main()
2018-03-05 09:07:10,878 ERROR      File "/usr/local/bin/hap.py", line 366, in main
2018-03-05 09:07:10,879 ERROR        "QUERY")
2018-03-05 09:07:10,879 ERROR      File "/usr/local/bin/pre.py", line 131, in preprocess
2018-03-05 09:07:11,163 ERROR        h = vcfextract.extractHeadersJSON(vcf_input)
2018-03-05 09:07:11,163 ERROR      File "/usr/local/lib/python27/Tools/vcfextract.py", line 229, in extractHeadersJSON
2018-03-05 09:07:11,167 ERROR        vfh = json.load(open(tf.name))
2018-03-05 09:07:11,167 ERROR      File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
2018-03-05 09:07:11,171 ERROR        **kw)
2018-03-05 09:07:11,171 ERROR      File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
2018-03-05 09:07:11,171 ERROR        return _default_decoder.decode(s)
2018-03-05 09:07:11,171 ERROR      File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
2018-03-05 09:07:11,172 ERROR        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2018-03-05 09:07:11,172 ERROR      File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
2018-03-05 09:07:11,173 ERROR        obj, end = self.scan_once(s, idx)
2018-03-05 09:07:11,173 ERROR    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 4: invalid continuation byte
´´´

from hap.py.

mziegler12 avatar mziegler12 commented on August 24, 2024

Here is the query VCF, only chr22 (removed-unused-alternates & excludedNonVariants)
new_giab_vars.zip

from hap.py.

pkrusche avatar pkrusche commented on August 24, 2024

One problem with your file appears to be that it includes an umlaut (ä in März where some tool pasted a timestamp into the header) in a non-utf8 encoding (appears to be Latin1, it's the header from SelectVariants). This is probably the result of different encoding settings for bash and for Java.

Also, hap.py does not translate contig names. The truth and query file must have the same set of contig names, so in order to compare this file to GIAB on hg38, the contig names and positions in your query need to be on the same reference as the ones in the truthset (i.e. something like chr22:...).

from hap.py.

pkrusche avatar pkrusche commented on August 24, 2024

Here is a command line to make the file usable. It's a bit of a hack chromosome-name wise, it would probably better to use a reference with the right names for variant calling:

cat ~/Downloads/new_giab_vars.vcf | perl -pe 's/NC_0+/chr/' | grep -v SelectVariants | grep -v GATKCommandLine | grep -v 'contig=<ID=N' | bcftools annotate -x INFO,^FORMAT/GT | awk '$0~"^#" { print $0; next } { print $0 | "LC_ALL=C sort -k1,1 -k2,2n" }' | bcftools view -o new_giab_translated.vcf.gz -O z
bcftools index new_giab_translated.vcf.gz

This removes the offending headers and translates the chromosome names. Also the VCF file needs to be sorted, this does that as well.

from hap.py.

pkrusche avatar pkrusche commented on August 24, 2024

This seems to work, this is what I get:

Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         6051       407      5644         1234       293        528    159       0.067262          0.584986        0.427877         0.120651                     NaN                     NaN                   2.500309                   0.583548
INDEL   PASS         6051       403      5648         1030       276        345    154       0.066601          0.597080        0.334951         0.119834                     NaN                     NaN                   2.500309                   0.527489
  SNP    ALL        34462     13645     20817        32353      7839      10869   5445       0.395943          0.635124        0.335950         0.487792                2.429894                2.246512                   2.240218                   1.588876
  SNP   PASS        34462     13309     21153        27301      7644       6348   5303       0.386193          0.635184        0.232519         0.480339                2.429894                2.371294                   2.240218                   1.482088

from hap.py.

mziegler12 avatar mziegler12 commented on August 24, 2024

It works pretty well on the first sight. It seems, march or as we say März is not a good month for genome analysis ;-). Thank you.

from hap.py.

pkrusche avatar pkrusche commented on August 24, 2024

Happy it worked!

I think the umlauts should not break things normally, it's just that the GATK binaries apparently use a different encoding from your shell. They are Java-based, so any wrapper scripts you may have for it might contain encoding switches that make it use something not UTF-8. You can specify this for the JVM that runs GATK: https://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding

from hap.py.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.