Comments (6)
I'm not sure I have a full explanation for why this is happening, but one thing that I would recommend is dropping the base qualities on the Ns. If an N comes off of a sequencing machine, the base is usually given a base quality of #
, whereas you've given them a near-perfect base quality. If this still occurs with more realistic base qualities, it would also help to have the full alignments to look at (the path
field in the JSON).
from vg.
With one N, you still get full-length alignments from gapless extension, because the full-length bonus is larger than the penalty for a single mismatch. With two or more Ns, the trailing Ns get trimmed, and Giraffe does tail alignment with Dozeu. What does Dozeu do when the read and/or the graph contain Ns? (Gapless extension masks all non-ACGT characters in the read with X, assuming that the graph does not contain character X.)
from vg.
I believe that Dozeu will treat all Ns as 0 score mismatches (even if both read and reference are N). The full length bonus is not applied during dynamic programming, but it is applied when choosing the optimal traceback location and in the final score.
from vg.
I think I know what is happening. Let's concentrate on the read with two trailing Ns.
There are three equally good alignments in the graph, with the problematic one being path >45>47>48>49>50
in the following subgraph:
H VN:Z:1.1 RS:Z:GRCh38
S 44686345 AGGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCAT
S 44686346 A
S 44686347 G
S 44686348 ACACTGTA
S 44686349 C
S 44686350 TCCAGCCTGGGTGACAGAGCGAAATTCCAGCAAAGAAAGGGAGAGAGGAAGGGAGAGAGGGAGTAGGGGAGAGAGAGAGAGAGAG
L 44686345 + 44686346 + 0M
L 44686345 + 44686347 + 0M
L 44686346 + 44686348 + 0M
L 44686347 + 44686348 + 0M
L 44686348 + 44686349 + 0M
L 44686349 + 44686350 + 0M
W GRCh38 0 chr5 176090011 176090154 >44686345>44686346>44686348>44686349>44686350 WT:i:51
W unknown 1 chr5 0 57 >44686345>44686347>44686348 WT:i:1
W unknown 2 chr5 0 143 >44686345>44686347>44686348>44686349>44686350 WT:i:36
W unknown 3 chr5 0 85 >44686350 WT:i:1
The first (internal) N corresponds to node 49. Haplotype unknown 1
ends at node 48, because it would have a rare allele at the position of node 49. Meanwhile, haplotype unknown 2
would be the correct alignment. We get one right-maximal extension with no mismatches over unknown 1
and another with 10 additional matches, 4 additional mismatches (all Ns), and a full-length bonus over unknown 2
. Because the extension over unknown 1
scores higher, we select it as the best extension.
The best extension has an unaligned 14 bp tail, which we align using Dozeu. Because Dozeu does not use the mismatch penalty for Ns, this alignment has a higher score than the equivalent alignments elsewhere in the graph.
from vg.
A couple of questions:
-
Do the reads automatically get trimmed if there are at least 2 Ns. Or is it that in this example the 1-N read just happens to be small enough to get the full-length bonus?
-
Do the Ns receive a penalty during the right-maximal extension?
-
Is there anyway we can tell Dozeu to use a non-0 penalty for Ns?
from vg.
Giraffe uses two alignment algorithms. First it tries to find an alignment without indels. If it can't find a full-length alignment with a few mismatches, it trims the partial alignments it found, as mismatches near the ends of an alignment often indicate the presence of indels. Then it aligns the tails of the trimmed alignment while allowing indels.
In the first (gapless extension) algorithm, Ns are mismatches with a normal mismatch penalty against everything. In the second algorithm (Dozeu), Ns are zero-cost mismatches against everything. This is effectively a rare bug that requires a complex fix, because the above behaviors are inherent to the way the alignment algorithms have been implemented.
from vg.
Related Issues (20)
- Some questions about the <vg call> commond output HOT 7
- vg call crash with "ERROR: Signal 6 occurred. VG has crashed." HOT 8
- I have a question while using vg. HOT 6
- I have a question while using vg HOT 1
- Variant Calling from HPRC Pangenome HOT 3
- ERROR: Tag "transcript_id" not found in attributes (line 145). HOT 3
- Release vg v1.57.0
- vg mpmap mapping reads to pan-transcriptom too slow HOT 8
- Hello, I want to get some information from Snarl file HOT 1
- a problem about vg autoidex (v1.56.0) HOT 1
- Merge of different chromosome graph files HOT 2
- Program stuck at [IndexRegistry]: Chunking VCF(s) for days HOT 2
- vg alignments not reporting split reads HOT 1
- Mapping paired end reads with vg giraffe: "Falling back on single-end mapping" HOT 9
- Autoindex should parse tabix-indexed monolithic VCFs in parallel
- vg pack error HOT 3
- VCF file empty when calling SV on ONT data HOT 9
- vg map errors HOT 5
- Genotyping SVs in a minigraph-cactus graph yields many similar alleles in output vcf HOT 1
- How to align both long and paired end short reads using vg HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from vg.