kmers's Issues

Internal storage order

So after trying to write some code interfacing kmers with some k-mers parsed using needletail, I realized we are storing the k-mers in the opposite order. In kmers, the leftmost nucleotide is stored in the highest-order bits; in needletail and some of the other (C++) libraries I have used in the past, it's the opposite.

Are there any strong arguments for one of these schemes over the other (cc @natir, @Daniel-Liu-c0deb0t, @luizirber)? Is this something we should also consider making part of the policy of how the k-mers are constructed?
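
To make the two layouts concrete, here is a minimal sketch for k <= 32 in a single u64. The function names are hypothetical and are not taken from kmers or needletail; the inputs are assumed to be already-encoded 2-bit codes.

```rust
/// Pack 2-bit codes so the leftmost base lands in the highest-order bits
/// (the layout the issue attributes to kmers). Assumes codes.len() <= 32.
fn pack_left_high(codes: &[u8]) -> u64 {
    codes
        .iter()
        .fold(0u64, |acc, &c| (acc << 2) | u64::from(c & 0b11))
}

/// Pack 2-bit codes so the leftmost base lands in the lowest-order bits
/// (the opposite layout). Assumes codes.len() <= 32.
fn pack_left_low(codes: &[u8]) -> u64 {
    codes
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &c)| acc | (u64::from(c & 0b11) << (2 * i)))
}

/// Convert between the two layouts by reversing the order of the k 2-bit groups.
fn swap_order(mut x: u64, k: usize) -> u64 {
    let mut out = 0u64;
    for _ in 0..k {
        out = (out << 2) | (x & 0b11);
        x >>= 2;
    }
    out
}
```

One property worth weighing: with the leftmost base in the high bits, comparing two packed k-mers of the same length as integers gives the same result as comparing their code sequences left to right.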

Dealing with the (temporary) lack of constant computations in const generics

So, it turns out that currently (as of Rust 1.51), with MVP const generics, we are not allowed to do even simple computations with generic parameters. For example, it would be desirable to have something like this to automatically compute the size of the storage array from the value of K provided to the type. However, such a capability is gated behind a feature flag on the nightly branch of rustc. We should determine how we want to handle this from a design perspective. I see a few options:

  1. Go with > 1 const generic parameter, to avoid having to do const-generic arithmetic on the receiving end. Once that is available in stable, we can of course simplify the interface. This is a little bit onerous, since now we have to e.g. provide a helper (const) function or macro or some such so that the user doesn't have to think about the number of words that should be used for storage, which is, anyway, error prone.
  2. Throw caution to the wind and require nightly rustc with the feature gate to allow ourselves to do the arithmetic we want with const generic parameters.
  3. Some other and much more clever solution I've not yet considered.

I'd appreciate others' thoughts on this.
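
A rough sketch of what option 1 could look like on stable Rust. The type Kmer, the helper words_for, and the parameter W are all hypothetical names used for illustration, not the library's actual interface.

```rust
/// Number of 64-bit words needed to hold `k` bases at 2 bits per base.
pub const fn words_for(k: usize) -> usize {
    (2 * k + 63) / 64
}

/// Hypothetical k-mer type that carries both K and the word count W,
/// because `[u64; words_for(K)]` cannot be written with MVP const generics.
pub struct Kmer<const K: usize, const W: usize> {
    data: [u64; W],
}

impl<const K: usize, const W: usize> Kmer<K, W> {
    /// Create an all-zero k-mer, panicking if W is inconsistent with K.
    pub fn new() -> Self {
        assert_eq!(W, words_for(K), "W must equal words_for(K)");
        Kmer { data: [0u64; W] }
    }

    /// Number of storage words actually allocated.
    pub fn num_words(&self) -> usize {
        self.data.len()
    }
}
```

A caller would write something like `Kmer::<33, { words_for(33) }>::new()`. The const fn call is allowed there because its argument is a concrete value rather than a generic parameter. The consistency check in this sketch happens at run time, which is part of why this option is error prone.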

library name and logo?

As a fundamental start point for any good software project, we need to decide on the name and generate a logo. I kept the repo name simple, but I was imagining we'd write it out as something like kme-rs or kme.rs or something cute like that ;P. I just wanted to open discussion on this very important topic here to invite opinions and thoughts.

Encoding Type

I'm assuming we'd be using 2 bits per bp. There are a couple of common encodings, but I want to suggest:

A -> 00
C -> 01
T -> 10
G -> 11

There are a couple of benefits:

  1. These are the 2nd and 3rd lowest bits of the ASCII encoding of the corresponding bases, i.e. (byte >> 1) & 0b11, so conversion from byte strings would be easy.
  2. Complement by XORing with ...101010 (the pattern 10 repeated once per base).
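
Both points can be checked with a small sketch. The function names are made up for illustration, the k-mer is assumed to fit in one u64 (k <= 32), and the leftmost base is assumed to go in the highest-order bits.

```rust
/// Map an ASCII base to its 2-bit code by taking bits 1 and 2 of the byte:
/// A -> 00, C -> 01, T -> 10, G -> 11. There is no validation here; any
/// other byte silently maps to one of the four codes.
fn encode_base(b: u8) -> u8 {
    (b >> 1) & 0b11
}

/// Pack an ASCII sequence of at most 32 bases into a u64, leftmost base
/// in the highest-order bits.
fn pack(seq: &[u8]) -> u64 {
    seq.iter()
        .fold(0u64, |acc, &b| (acc << 2) | u64::from(encode_base(b)))
}

/// Complement every base of a packed k-mer (without reversing it) by
/// XORing with the pattern 10 repeated k times.
fn complement(x: u64, k: usize) -> u64 {
    let low_bits = if k >= 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    x ^ (0xAAAA_AAAA_AAAA_AAAA & low_bits)
}
```

Note that complement here only swaps A with T and C with G in place; a reverse complement would additionally need the order of the 2-bit groups reversed.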

On a slightly unrelated note, I've worked on some sequence manipulation stuff that uses SIMD (e.g., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling k-mers that are relatively short.

Roadmap / TODO

This issue will provide a roadmap for the library, along with specific tasks (TODOs). Ideally we should break these tasks into short and long term tasks and, as the library becomes more mature, tie individual tasks to specific release candidates.

This no longer compiles on Windows.

Hi Team!
Great software and I have a lot of fun with it! But after my last cargo clean I am no longer able to compile kmers on Windows 11. I know most bioinformaticians are not working on Windows, and neither am I, BUT we also have a Windows-specific VR application that I use quite a lot (CellexalVR). We developed that ourselves and it is really cool to work with single-cell data in that tool. Just if you want to try. Anyhow, I therefore work on Windows and it would be really really cool if I could exclude this "simple-sds" lib from "my" kmers ;-)

Meaning, could you PLEASE add a key somewhere before you included simple-sds (2 weeks ago) so that I can depend on kmers as it was before this lib was added? That would be very helpful! I do not assume simple-sds will support Windows in the short term, or will it?

Thank you!

Discussion / possible features

Hi there!
I'd like to discuss the state of this project to see if it would be suitable for larger Rust libraries.

First I'd like to say that I really like the design of this library, especially the separation of encoding and Kmer, and the fact that it scales with large K using bitfields.

Still, I think that it would be nice to support more elementary operations to make the library more convenient, and that further optimizations could be explored.
In no particular order, I'm thinking of the following features

  • navigational methods computing the predecessor/successor of a k-mer by prepending/appending a given base
  • an iterator producing k-mers from an iterator of u8
  • a method to compute the canonical version of a k-mer / to test if a k-mer is canonical
  • optimizing some operations with SIMD if we use multiple integers to store a k-mer

Do you think these features are appropriate for this library, or would you prefer to keep it simple?
I've already implemented some of these features in a less generic way (see this module), so I could try to adapt the code to this library.
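
For discussion, a minimal sketch of the first three features on a single u64 (k <= 32). It assumes the A=00, C=01, T=10, G=11 encoding and leftmost-base-in-high-bits layout; all names are hypothetical and not part of the library. "Canonical" is taken here as the numerically smaller of a k-mer and its reverse complement, which is only one possible convention.

```rust
/// 2-bit code of an ASCII base (A=00, C=01, T=10, G=11); no validation.
fn encode_base(b: u8) -> u64 {
    u64::from((b >> 1) & 0b11)
}

/// Pack at most 32 ASCII bases, leftmost base in the highest-order bits.
fn pack(seq: &[u8]) -> u64 {
    seq.iter().fold(0u64, |acc, &b| (acc << 2) | encode_base(b))
}

/// Mask covering the low 2*k bits.
fn low_mask(k: usize) -> u64 {
    if k >= 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 }
}

/// Successor: drop the leftmost base and append `base` on the right.
fn successor(x: u64, k: usize, base: u8) -> u64 {
    ((x << 2) | encode_base(base)) & low_mask(k)
}

/// Predecessor: drop the rightmost base and prepend `base` on the left (k >= 1).
fn predecessor(x: u64, k: usize, base: u8) -> u64 {
    (x >> 2) | (encode_base(base) << (2 * (k - 1)))
}

/// Reverse complement: reverse the 2-bit groups and flip each with XOR 10.
fn reverse_complement(mut x: u64, k: usize) -> u64 {
    let mut out = 0u64;
    for _ in 0..k {
        out = (out << 2) | ((x & 0b11) ^ 0b10);
        x >>= 2;
    }
    out
}

/// Canonical form under the "numerically smaller" convention.
fn canonical(x: u64, k: usize) -> u64 {
    x.min(reverse_complement(x, k))
}

/// A k-mer is canonical if it equals its canonical form.
fn is_canonical(x: u64, k: usize) -> bool {
    x == canonical(x, k)
}
```

An iterator over the k-mers of a byte sequence could then be built by packing the first k bases and calling successor once per remaining base.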

Likely stupid question -> kmer u64 to kmer [u8]

I have created u64 representations of a kmer like this:

for kmer in needletail::kmer::Kmers::new(seq, 32) {
     // pack each 32-base slice into its u64 representation
     let km = Kmer::from(kmer).into_u64();
}

Is there a function to reverse that? To get a k-mer representation back from this u64?
I am quite sure there is and I am absolutely sure I will not find it ;-)
Please help!

I tried this, but it fails with an utf8 error:

std::str::from_utf8( &km.to_le_bytes()).unwrap().to_string()

I am sure it can be done and it should be simple. But I am not getting it ;-)
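
The from_utf8 attempt fails because the u64 holds packed 2-bit codes, not ASCII text, so its raw bytes are generally not valid UTF-8. Below is a hand-rolled decoding sketch, not a function from the kmers crate. It assumes the A=00, C=01, T=10, G=11 encoding and that the leftmost base sits in the highest-order bits; both assumptions should be checked against the crate before relying on this.

```rust
/// Decode a packed k-mer (k <= 32) back into a string of bases.
/// Assumes 2 bits per base, codes A=0, C=1, T=2, G=3, and the leftmost
/// base stored in the highest-order bits of the low 2*k bits.
fn decode(x: u64, k: usize) -> String {
    const BASES: [char; 4] = ['A', 'C', 'T', 'G'];
    (0..k)
        .map(|i| {
            // Shift the i-th base (counting from the left) down to the low bits.
            let shift = 2 * (k - 1 - i);
            BASES[((x >> shift) & 0b11) as usize]
        })
        .collect()
}
```

If the crate's actual encoding or bit order differs, only the BASES table or the shift computation would need to change.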

Let user choose padding

Hi,

Currently the library stores each k-mer in an array of u64. It would be nice to let the user choose the element type of the array. I work with very short k-mers, and using a u64 to store each one has a significant cost.
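
One possible shape for this, as a sketch only: the trait Word and the type PackedKmer are hypothetical names, not the library's existing types.

```rust
use std::mem::size_of_val;

/// Hypothetical trait naming the unsigned integers usable as k-mer storage.
pub trait Word: Copy + Default {
    /// How many 2-bit bases fit in one word of this type.
    const BASES: usize;
}

impl Word for u8 {
    const BASES: usize = 4;
}
impl Word for u16 {
    const BASES: usize = 8;
}
impl Word for u32 {
    const BASES: usize = 16;
}
impl Word for u64 {
    const BASES: usize = 32;
}

/// Hypothetical k-mer storage generic over the word type T and word count W.
pub struct PackedKmer<T: Word, const W: usize> {
    words: [T; W],
}

impl<T: Word, const W: usize> PackedKmer<T, W> {
    /// Create zeroed storage.
    pub fn new() -> Self {
        PackedKmer { words: [T::default(); W] }
    }

    /// Maximum number of bases this storage can hold.
    pub fn capacity() -> usize {
        T::BASES * W
    }

    /// Bytes occupied by the storage array.
    pub fn storage_bytes(&self) -> usize {
        size_of_val(&self.words)
    }
}
```

With this layout a k-mer of up to 4 bases could use `PackedKmer<u8, 1>` and occupy 1 byte instead of 8.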
