I'm testing skani to compute the pairwise ANI of all


Identical sequences have pairwise AF lower than 100% (skani issue, CLOSED)

apcamargo commented on August 19, 2024

Identical sequences have pairwise AF lower than 100%

from skani.

Comments (4)

bluenote-1577 commented on August 19, 2024

Hi Antonio,

Thanks for bringing this issue up. This is an issue that I can fix, but I may need some testing. Let me explain:

The way skani calculates AF is by calculating the length of "chains" of k-mers, i.e. sequences of k-mers. These k-mers are subsampled and may not start at the beginning of the sequence. So what you're seeing is that the distance between the first and last k-mer is 86% of your genome.

Lowering -c helps because there are more k-mers, so the first k-mer starts closer to the ends of your genome.
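To make the effect concrete, here is a minimal sketch of the end-gap problem. It is not skani's code: it assumes k-mers are kept independently with probability 1/c (a simplification of the real subsampling), and the names `chain_af` and `mean_af`, the 2 kb genome length, and k = 15 are all chosen just for illustration.

```python
import random

def chain_af(genome_len, k, c, rng):
    """AF of one simulated self-comparison: span from first to last kept k-mer."""
    # Assumption: each k-mer start position is kept with probability 1/c.
    kept = [i for i in range(genome_len - k + 1) if rng.random() < 1.0 / c]
    if not kept:
        return 0.0  # no k-mers survived subsampling, so nothing is "aligned"
    # The chain covers the first kept k-mer through the end of the last one.
    span = kept[-1] + k - kept[0]
    return span / genome_len

def mean_af(genome_len, k, c, trials=300, seed=0):
    """Average the simulated AF over several random subsamplings."""
    rng = random.Random(seed)
    return sum(chain_af(genome_len, k, c, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    for c in (125, 30):
        print(f"c={c}: mean AF of an identical 2 kb pair ~ {mean_af(2000, 15, c):.3f}")
```

In this toy model the unsampled stretch at each end is roughly c bases on average, so a smaller c leaves smaller gaps and a higher AF, and the relative loss shrinks as the genome gets longer.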

As for the -m, this won't change the AF at all since it is just for filtering genomes. We don't allow c > m, which is what the message is trying to say. I'll make that more clear and push a change.

I'll implement a fix for this issue soon, but I'll need to do some testing. In practice, skani currently underestimates AF for identical, reasonably sized genomes (> 20 kb) by ~1% or so, so this is mostly an issue for small genomes.

Jim


bluenote-1577 commented on August 19, 2024

Actually, looking through my code, I implemented a workaround to this issue a while back... but there was a bug and it wasn't applied correctly. I've pushed an approximate fix for this issue and it should give much more reasonable results now.

The fix I implemented was simply adding "c" bases to the ends of the chains to account for the missing k-mers near the ends of the chains. This isn't perfect, but works on average. It still isn't a great fix for small sequences and doesn't guarantee 100% AF for identical sequences.
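A rough sketch of what such a padding correction could look like, under stated assumptions: the function name `padded_af` is hypothetical, the padding is taken to be c bases at each end of the chain, and the result is capped at the genome length. The actual skani implementation may differ on all three points.

```python
def padded_af(first_pos, last_pos, k, c, genome_len):
    """Chain AF after padding both ends by c bases (illustrative only)."""
    # Raw chain length: first kept k-mer start through the end of the last k-mer.
    span = last_pos + k - first_pos
    # Assumption: pad c bases per end, and never exceed the genome length.
    padded = min(genome_len, span + 2 * c)
    return padded / genome_len

if __name__ == "__main__":
    # Chain that nearly spans a 2 kb genome: padding pushes it to the cap.
    print(padded_af(100, 1880, 15, 125, 2000))
    # Chain with larger end gaps: padding helps but does not reach 100%.
    print(padded_af(300, 1600, 15, 125, 2000))
```

The second call shows why this is only correct on average: when the real end gaps are larger than c, the padded AF still falls short of 100% even for identical sequences.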

I'll leave this issue up while I implement a better solution that guarantees 100% AF for identical sequences.


apcamargo commented on August 19, 2024

Thanks for the explanation, @bluenote-1577! I'll try out the latest commit and let you know if I have any issues.


bluenote-1577 commented on August 19, 2024

Coming back to this issue: since 0.1.0, the AF is 100% for identical genomes most of the time, but this isn't technically guaranteed. I think this is okay. Even when it's not 100% it will be very close, and skani is an approximate method after all.

If someone has issues with this or it fails spectacularly in some cases, feel free to reference this issue or open it up again.
