Coder Social home page Coder Social logo

Comments (4)

iljungr avatar iljungr commented on May 29, 2024

Some questions:

Running muscle on chr1/first_chrom (with gapopen=-1) and adding numbering to the result yields:

chr1
12345 678
ATGCA--TGC
first_chrom
--GCATTTGC
12345678

mapAtoB('chr1', 6) should return a deterministic answer (perhaps 4)

What is meant by deterministic? If it only means "produces the same answer on multiple runs with exactly the same input" then I would take whatever muscle maps it to, which in this case gives the answer 6. (I'm assuming muscle itself is deterministic.) If "deterministic" instead means "can easily be explained", then this wouldn't be sufficient since I don't know why muscle aligned things this way, rather than putting the gap after the T.

If we want to do anything other than take whatever answer muscle gave us, it will be complicated, and in principle would require us to do aligning ourselves. We'd essentially be asking if it could it have aligned it some other way. If all that's there is one base aligned to a run of the same base then we could handle that as a special case, but, what if you have something like this:
1234 5678
GGAT----ATGG
GGATATATATGG
123456789012
We could map 3 to either 3 or 7, since the aligner could have put the gap before or after the initial AT with the same score. The ambiguity exists even though position 3 itself is not next to a gap.

For now, in cases where muscle aligns a base to a non-gap base, I will map to that base.

mapAtoB('chr1',1) should ... maybe raise an IndexError or .. maybe return a negative number or something (we can figure that out). Actually, I think IndexError might be better.

mapBtoA('first_chrom', 4) or 5 or 6 should all return the same answer

If we take the muscle output as gospel and don't try to do any aligning ourselves, then mapAtoB('chr1',1) and mapBtoA('first_chrom', 4) are both cases of a base in one string aligned to a gap in the other, so is there any reason they should return different answers (IndexError in one case and not the other)?

An answer that would provide the maximum information to the caller would be to have bases that map to gaps return fractions indicating which two bases they are between and how far along the gap they are. So in this case:
mapAtoB('chr1',1) -> 1/3
mapBtoA('first_chrom', 4) -> 5 1/3
mapBtoA('first_chrom', 5) -> 5 2/3

from viral-ngs.

dpark01 avatar dpark01 commented on May 29, 2024

A few thoughts... The fractional value is interesting, but probably not the most useful answer based on how these things would be used. Perhaps the best answer would actually be an integer range in one direction (AtoB in your second example) and everything collapsing to one integer in the other direction (BtoA in your second example). Because the caller is basically asking the question "what is the equivalent position / sequence / allele on the other end of this mapping at this position?"

I think "deterministic" should just be whatever muscle maps to when it gives a clear answer (because I think it's deterministic enough in these scenarios).

As for your other questions, it depends on how you're actually counting the positions. I am assuming muscle is not giving you numbered positions in three coordinate spaces, I'm assuming it's just giving you the alignment as a FASTA file, and you're doing the counting yourself. As such, here's kind of what I envisioned. Calling mapAtoB('chr1',x) would walk down the aligned sequences counting up "x" non-gap characters in sequence A. You can't just index into the x'th position of the sequence, you have to actually walk from the beginning and count up the non-gaps yourself and stop when your count reaches x. Then you look over into sequence B at that position in the alignment and proceed back to the beginning and count up how many non-gaps you see in sequence B. That's the position you return as your answer.

As such, for your first example sequence, mapAtoB('chr1',1) would return zero, which is an invalid number (it should be an integer from 1 to length(B)). Also, I'm asking the thing to map a position that doesn't actually exist in the target genome--a "gap" is something that goes in between two things, and in this case, there's only one thing. The other side doesn't exist, the dashes are actually just denoting the end of the sequence. So IndexError seems sensible. Or you could return zero or length(B)+1 (if you fall off the other side) but it's the same thing in the end--a coordinate that you can't index into B and ask the sequence for.

True gaps are different because if A has an insertion relative to B, and it has several positions that all map to the same coordinate, it's useful for the user to know what the corresponding position and sequence is on the other side of that alignment. Conceivably, you can take the output of these mappers and build a SNP/indel caller off of it if you wanted to.

Another way to do this (instead of the manually counting up and down, which seems slow for long genomes) is to convert the gapped alignment to a CIGAR representation and do careful arithmetic on the results.. I might be able to find code somewhere that tries to use that representation for coordinate mapping. A quick google suggests that Perl code might exist publicly for such things, but not Python. Personally, I'd not worry about performance in this implementation--we can optimize that later if need be.

from viral-ngs.

iljungr avatar iljungr commented on May 29, 2024

Perhaps the best answer would actually be an integer range in one direction (AtoB in your second example) and everything collapsing to one integer in the other direction (BtoA in your second example).

There is ambiguity in how to define the range. For example:
A-T
ACT
Do we think of the A becoming AC, or do we think of the T becoming CT? In the former case, AtoB(1)->[1,2], AtoB(2)->3, and BtoA(2)->1, whereas in the latter case AtoB(1)->1, AtoB(2)->[2,3] and BtoA(2)->2.

There is no non-arbitrary way to resolve this, since in reality the C is not part of either the A or the T.

I can resolve by arbitrarily attaching inserts to the 5' base, though it might get us in trouble if two samples have reverse-complementary consensus sequences.

from viral-ngs.

dpark01 avatar dpark01 commented on May 29, 2024

So, there's no right way, but there are conventions that have evolved over time. The current convention favored by those who use VCF as a standard format for describing indels is to anchor on an initial shared base and leave it at that. So in your example, we'd say the two alleles are "A" and "AC". This differs from the V-Phaser convention which is to say "" and "C". There are reasons, though, that most of the field has leaned the way it did. So using that convention, the A->B mappings would be 1 -> [1,2], 2 -> 3. And the B->A mappings would be 1 -> 1, 2 -> 1, 3 -> 2.

from viral-ngs.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.