Comments (1)
@BrianLohman I am also parsing the annotated VCF from hap.py because I found it confusing to have a position with FN
in the TRUTH field but with FP
in the QUERY field. I figured out how add up the counts from my parsing script to have my data match the STDOUT, but I'm still getting different values for QUERY.TOTAL
and TRUTH.TOTAL
which doesn't make sense to me. Perhaps you or someone else sees something I am missing?
For example, here is my STDOUT:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 525370 519748 5622 543056 1386 0 1169 144 0.989299 0.997448 0.0 0.993357 NaN NaN 1.528472 1.744304
INDEL PASS 525370 519748 5622 543056 1386 0 1169 144 0.989299 0.997448 0.0 0.993357 NaN NaN 1.528472 1.744304
SNP ALL 3365341 3337928 27413 3341865 2187 0 1188 245 0.991854 0.999346 0.0 0.995586 2.100079 2.099354 1.581145 1.573484
SNP PASS 3365341 3337928 27413 3341865 2187 0 1188 245 0.991854 0.999346 0.0 0.995586 2.100079 2.099354 1.581145 1.573484
SUM
TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP
3890711 3857676 66035 3884921 3573
Rather than simply counting values in the TRUTH or QUERY column, I joined the column values with a _
. As I read in the file, I added the TRUTH_QUERY combo as a key in a Python dictionary the first time and set the value = 0. And then any time after that I +1 to the dictionary value.
And here is what my parsing script returns:
TotalTruthLoci,4048333
._FP,1270
._N,0
._TP,24560
FN_.,30733
FN_FP,2296
FN_N,0
FN_TP,6
N_.,0
N_N,0
TP_.,887
TP_FP,7
TP_N,0
TP_TP,3856782
UNK_UNK,0
TotalTestLoci,3916541
If I then do the following:
TotalTP=(TP_FP + TP_. + TP_TP)
TotalFN=(FN_. + FN_FP + FN_TP)
TotalFP=(._FP + FN_FP + TP_FP)
My output matches the hap.py STDOUT:
TotalTP = 3857676
TotalFN = 30035
TotalFP = 3753
For the math to add up, I have to exclude all ._TP
patterns, which mostly seem to be repeats of the same variant also labeled TP_.
. The redudancy occurs with both SNPs and INDELs.
1 418660 . . NOCALL nocall . TP 0/1 INDEL het d1_5
1 418660 TP 0/1 INDEL het d1_5 . . NOCALL nocall .
But there are also similar redundancies with FN_FP
, so I'm not sure why these are not excluded either:
1 796491 . . NOCALL nocall . FP 0/1 INDEL het d6_15
KEY: FN_._INDEL_NOCALL VALUE:69926
1 796491 FN 0/1 INDEL het d6_15 . . NOCALL nocall .
Where my script currently breaks is that I end up with 31,620 more QUERY.TOTAL
variants than the STDOUT. Even subtracting the 24,560 ._TP
variants leaves me still with 7,060 more QUERY.TOTAL than STDOUT
. I also have 157,622 more variants for TRUTH.TOTAL
.
Let me know if you have any thoughts. My processing module is more complex than what you shared, as it's a customized intermediate step in a much larger pipeline. However, if you'd be interested in using it, I could try to make a portable helper script, once I know the math is correct.
from hap.py.
Related Issues (20)
- WES TI TV ratio HOT 2
- Confused with the interpretation when having TRUTH.FN and QUERY.FP HOT 2
- pre.py -> vcfcheck fails due to call with invalid genotype (non-existent allele)
- CMake Error at CMakeLists.txt:32 (message): Building external dependencies has failed
- Error running BCFTOOLS :Argument list too long
- Integrating vcfdist as a comparison engine into hap.py
- missing reference HOT 1
- VCF format issues with --write-vcf, - FORMAT field inconsistencies HOT 1
- Trying to print sequence when we mean contig name
- error code 1
- Docker Implementation: Several Error Messages Related to "preprocess" HOT 1
- ROC and PR curve HOT 2
- Docker fails to build for both bases
- Incorrect number of FORMAT/AD values on scmp-distance engine
- Link provide in email is broken HOT 1
- [E::bgzf_uncompress] inflate failed: invalid distance too far back HOT 1
- Docker build failed
- How is the false positive rate calculated in som.py stats?
- Can't find reference HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hap.py.