Coder Social home page Coder Social logo

Comments (6)

jeizenga avatar jeizenga commented on June 30, 2024

I'm not sure I have a full explanation for why this is happening, but one thing that I would recommend is dropping the base qualities on the Ns. If an N comes off of a sequencing machine, the base is usually given a base quality of #, whereas you've given them a near-perfect base quality. If this still occurs with more realistic base qualities, it would also help to have the full alignments to look at (the path field in the JSON).

from vg.

jltsiren avatar jltsiren commented on June 30, 2024

With one N, you still get full-length alignments from gapless extension, because the full-length bonus is larger than the penalty for a single mismatch. With two or more Ns, the trailing Ns get trimmed, and Giraffe does tail alignment with Dozeu. What does Dozeu do when the read and/or the graph contain Ns? (Gapless extension masks all non-ACGT characters in the read with X, assuming that the graph does not contain character X.)

from vg.

jeizenga avatar jeizenga commented on June 30, 2024

I believe that Dozeu will treat all Ns as 0 score mismatches (even if both read and reference are N). The full length bonus is not applied during dynamic programming, but it is applied when choosing the optimal traceback location and in the final score.

from vg.

jltsiren avatar jltsiren commented on June 30, 2024

I think I know what is happening. Let's concentrate on the read with two trailing Ns.

There are three equally good alignments in the graph, with the problematic one being path >45>47>48>49>50 in the following subgraph:

H	VN:Z:1.1	RS:Z:GRCh38
S	44686345	AGGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCAT
S	44686346	A
S	44686347	G
S	44686348	ACACTGTA
S	44686349	C
S	44686350	TCCAGCCTGGGTGACAGAGCGAAATTCCAGCAAAGAAAGGGAGAGAGGAAGGGAGAGAGGGAGTAGGGGAGAGAGAGAGAGAGAG
L	44686345	+	44686346	+	0M
L	44686345	+	44686347	+	0M
L	44686346	+	44686348	+	0M
L	44686347	+	44686348	+	0M
L	44686348	+	44686349	+	0M
L	44686349	+	44686350	+	0M
W	GRCh38	0	chr5	176090011	176090154	>44686345>44686346>44686348>44686349>44686350	WT:i:51
W	unknown	1	chr5	0	57	>44686345>44686347>44686348	WT:i:1
W	unknown	2	chr5	0	143	>44686345>44686347>44686348>44686349>44686350	WT:i:36
W	unknown	3	chr5	0	85	>44686350	WT:i:1

The first (internal) N corresponds to node 49. Haplotype unknown 1 ends at node 48, because it would have a rare allele at the position of node 49. Meanwhile, haplotype unknown 2 would be the correct alignment. We get one right-maximal extension with no mismatches over unknown 1 and another with 10 additional matches, 4 additional mismatches (all Ns), and a full-length bonus over unknown 2. Because the extension over unknown 1 scores higher, we select it as the best extension.

The best extension has an unaligned 14 bp tail, which we align using Dozeu. Because Dozeu does not use the mismatch penalty for Ns, this alignment has a higher score than the equivalent alignments elsewhere in the graph.

from vg.

alshai avatar alshai commented on June 30, 2024

A couple of questions:

  1. Do the reads automatically get trimmed if there are at least 2 Ns. Or is it that in this example the 1-N read just happens to be small enough to get the full-length bonus?

  2. Do the Ns receive a penalty during the right-maximal extension?

  3. Is there anyway we can tell Dozeu to use a non-0 penalty for Ns?

from vg.

jltsiren avatar jltsiren commented on June 30, 2024

Giraffe uses two alignment algorithms. First it tries to find an alignment without indels. If it can't find a full-length alignment with a few mismatches, it trims the partial alignments it found, as mismatches near the ends of an alignment often indicate the presence of indels. Then it aligns the tails of the trimmed alignment while allowing indels.

In the first (gapless extension) algorithm, Ns are mismatches with a normal mismatch penalty against everything. In the second algorithm (Dozeu), Ns are zero-cost mismatches against everything. This is effectively a rare bug that requires a complex fix, because the above behaviors are inherent to the way the alignment algorithms have been implemented.

from vg.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.