Comments (4)
My preference would be to remove them.
from nanovarbench.
I'm happy to leave them in, but I think the tricky issue might be how to assess these variants.
Using your second example, if the reference sequence is CGTAGT but the correct variant-applied sequence is CGTAACAACACTTGGAAT, a variant caller could represent this in different ways.
It could do it in two variants, like you showed above:
chromosome 3 60f6a01c T TAACAACACTTGG
chromosome 5 a1f621d4 G A
Or it could do it in one variant:
chromosome 3 ab49ec54 TAG TAACAACACTTGGAA
Both are correct, so if you're assessing a variant caller, you'd need to accept both answers.
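To make the equivalence concrete, here is a minimal Python sketch that applies both representations to the reference and compares the results. The `apply_variants` helper is hypothetical (it is not part of bcftools or any benchmark tooling) and assumes 1-based positions and non-overlapping variants.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants; pos is 1-based."""
    pieces = []
    cursor = 0  # 0-based index of the next reference base to copy
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError(f"variant at {pos} overlaps the previous one")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:start])  # untouched reference bases
        pieces.append(alt)                      # substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # tail after the last variant
    return "".join(pieces)


reference = "CGTAGT"
two_variants = [(3, "T", "TAACAACACTTGG"), (5, "G", "A")]
one_variant = [(3, "TAG", "TAACAACACTTGGAA")]

seq_two = apply_variants(reference, two_variants)
seq_one = apply_variants(reference, one_variant)
print(seq_two == seq_one)
```

Comparing the resulting sequences, rather than the records themselves, is what lets both answers count as correct.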
This problem applies not just to 'overlapping' variants but to any close variants. And it could get even more complex with groups of close variants. E.g. if you had 3 variants near each other, you could group them in four ways: a,b,c or ab,c or a,bc or abc. The number of possible groupings grows quickly with more nearby variants: if only adjacent variants are merged, n variants can be grouped in 2^(n-1) ways.
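As a rough illustration of how the groupings multiply, the sketch below enumerates every way to merge adjacent variants. It assumes only neighbouring variants can be merged, and `contiguous_groupings` is a hypothetical helper written for this example.

```python
from itertools import product


def contiguous_groupings(variants):
    """Yield every way to merge adjacent variants into groups."""
    if not variants:
        return
    # Each gap between neighbouring variants is either a join or a split.
    for joins in product([False, True], repeat=len(variants) - 1):
        groups = [[variants[0]]]
        for variant, join in zip(variants[1:], joins):
            if join:
                groups[-1].append(variant)  # merge into the current group
            else:
                groups.append([variant])    # start a new group
        yield groups


groupings = list(contiguous_groupings(["a", "b", "c"]))
for groups in groupings:
    print(",".join("".join(group) for group in groups))
```

Each of the n-1 gaps is an independent join-or-split choice, so the count doubles with every extra nearby variant.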
So while I don't think you need to filter out variants because they are 'overlapping', you might want to filter out too-close variants, if that makes assessment easier.
I think the complication you've mentioned here is actually more in the realm of variant assessment. vcfdist (described in this paper) handles this by standardising the variant representation between the truth and query VCFs (see Figures 1b and 3d and Suppl. Fig. 2 for good illustrations).
> So while I don't think you need to filter out variants because they are 'overlapping', you might want to filter out too-close variants, if that makes assessment easier.
I'd like to keep close variants in there as this will likely be a good separating factor between the variant callers. And we don't want to make it too easy for them 😁
However, I am beginning to think that excluding these valid overlapping variants might be a good idea. I've just found some more complicated examples where bcftools +remove-overlaps doesn't think they're overlaps, but bcftools consensus does.
chromosome_2 1561582 0cb9eac6 A ATTTCTTTTGATAAGAAAGTATTAAGTG . PASS . GT 1/1
chromosome_2 1561582 4bc9851a A AT . PASS . GT 1/1
So essentially, these can be removed with bcftools norm -d indels, indicating they're kind of the same variant.
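For spotting such records before running any tool, a minimal sketch might group indels by site and flag sites that carry more than one. This is only an approximation of the idea, not bcftools' actual logic; `same_site_indels` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def is_indel(ref, alt):
    """Treat any length-changing allele as an indel (a simplification)."""
    return len(ref) != len(alt)


def same_site_indels(records):
    """Return {(chrom, pos): [ids]} for sites holding more than one indel."""
    by_site = defaultdict(list)
    for chrom, pos, variant_id, ref, alt in records:
        if is_indel(ref, alt):
            by_site[(chrom, pos)].append(variant_id)
    return {site: ids for site, ids in by_site.items() if len(ids) > 1}


records = [
    ("chromosome_2", 1561582, "0cb9eac6", "A", "ATTTCTTTTGATAAGAAAGTATTAAGTG"),
    ("chromosome_2", 1561582, "4bc9851a", "A", "AT"),
]
clashes = same_site_indels(records)
print(clashes)
```

Both records above are insertions anchored at the same base, which is why they clash when applied to a consensus sequence.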
This changes my mind: I now think we should remove all overlaps of this type.