
Comments (4)

bluenote-1577 commented on August 19, 2024

Hi Antonio,

Thanks for bringing this issue up. I can fix it, but the fix may need some testing. Let me explain:

skani calculates the aligned fraction (AF) from the length of "chains" of k-mers, i.e. sequences of k-mers. These k-mers are subsampled, so the first one may not sit at the very beginning of the sequence and the last one may not reach its end. What you're seeing is that the distance between the first and last k-mer is 86% of your genome.

Lowering -c helps because more k-mers are sampled, so the first and last k-mers land closer to the ends of your genome.
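To make the effect concrete, here is a minimal Python sketch of why a span measured between subsampled k-mers falls short of the full genome length, and why a lower `-c` narrows the gap. This is not skani's code: the hash (`zlib.crc32`), the threshold-based sampling rule, the function names, and the toy genome are all assumptions chosen for illustration.

```python
import random
import zlib


def sampled_positions(seq, k, c):
    """Start positions of the k-mers kept by a toy 1-in-c subsampling rule.

    Assumption: a k-mer is kept when its 32-bit CRC falls below 2**32 // c,
    so a smaller c keeps a superset of the k-mers kept by a larger c.
    """
    threshold = 2**32 // c
    return [
        i
        for i in range(len(seq) - k + 1)
        if zlib.crc32(seq[i:i + k].encode()) < threshold
    ]


def span_af(seq, k, c):
    """Fraction of seq covered from the first to the last sampled k-mer."""
    positions = sampled_positions(seq, k, c)
    if not positions:
        return 0.0  # nothing sampled, so nothing is covered
    span = positions[-1] + k - positions[0]
    return span / len(seq)


# Toy genome compared against itself: the true AF is 100%.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(3000))

for c in (200, 125, 30, 1):
    print(f"c={c:>3}  span-based AF = {span_af(genome, 15, c):.3f}")
```

With `c = 1` every k-mer is kept and the span covers the whole sequence. As `c` grows, the first and last sampled k-mers drift away from the ends, and on a short sequence those uncovered ends are a noticeable share of the total length.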

As for -m, it won't change the AF at all since it is only used for filtering genomes. We don't allow c > m, which is what the message is trying to say. I'll make that clearer and push a change.

I'll implement a fix for this issue soon, but I'll need to do some testing. In practice, skani currently underestimates AF for identical, reasonably sized genomes (> 20 kb) by ~1% or so, so this is mostly an issue for small genomes.

Jim

from skani.

bluenote-1577 commented on August 19, 2024

Actually, looking through my code, I implemented a workaround to this issue a while back... but there was a bug and it wasn't applied correctly. I've pushed an approximate fix for this issue and it should give much more reasonable results now.

The fix I implemented was simply adding "c" bases to the ends of the chains to account for the missing k-mers near the ends of the chains. This isn't perfect, but works on average. It still isn't a great fix for small sequences and doesn't guarantee 100% AF for identical sequences.
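The padding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual skani implementation: the function names are hypothetical, "c" bases are assumed to be added at each end of the chain, and the padded length is assumed to be capped at the sequence length.

```python
def raw_af(first_pos, last_pos, k, genome_len):
    """AF from the span between the first and last sampled k-mer."""
    return (last_pos + k - first_pos) / genome_len


def padded_af(first_pos, last_pos, k, c, genome_len):
    """AF after padding the chain to account for unsampled ends.

    Assumptions (not confirmed by the thread): c bases are added at each
    end of the chain, and the result is capped at the sequence length.
    """
    span = last_pos + k - first_pos
    padded = span + 2 * c
    return min(padded, genome_len) / genome_len


# Hypothetical chain on a 1000-base sequence compared with itself.
print(raw_af(70, 900, 15, 1000))
print(padded_af(70, 900, 15, 50, 1000))

# A chain whose ends are much further than c from the sequence ends
# still falls short of 100% after padding.
print(padded_af(300, 700, 15, 50, 1000))
```

If k-mers are kept at a rate of roughly 1 in c, the gap before the first sampled k-mer is about c bases on average, which is why a fixed pad of c works on average but cannot guarantee 100% for any particular pair of identical sequences.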

I'll leave this issue up while I implement a better solution that guarantees 100% AF for identical sequences.


apcamargo commented on August 19, 2024

Thanks for the explanation, @bluenote-1577! I'll try out the latest commit and let you know if I have any issues.


bluenote-1577 commented on August 19, 2024

Coming back to this issue: since 0.1.0, the AF is 100% for identical genomes most of the time, but that isn't technically guaranteed. I think this is okay. Even when it's not 100%, it'll be very, very close, and we are an approximate method after all.

If someone has issues with this or it fails spectacularly in some cases, feel free to reference or reopen this issue.

