combine-lab / kmers
A bit-packed k-mer representation (and relevant utilities) for Rust
License: BSD 3-Clause "New" or "Revised" License
So after trying to write some code interfacing kmers with some k-mers parsed using needletail, I realized we are storing the k-mers in the opposite order. In kmers the leftmost nucleotide is stored in the highest-order bits; in needletail and some of the other (C++) libraries I have used in the past, it's the opposite.
Are there any strong arguments for one of these schemes over the other (cc @natir, @Daniel-Liu-c0deb0t, @luizirber)? Is this something we should also consider making part of the policy of how the k-mers are constructed?
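To make the two conventions concrete, here is a minimal sketch that packs the same sequence both ways into a single u64. The 2-bit code (A=0, C=1, G=2, T=3) and the function names are assumptions chosen for illustration; they are not taken from either library.

```rust
/// Hypothetical 2-bit code (A=0, C=1, G=2, T=3); an assumption for illustration.
fn code(base: u8) -> u64 {
    match base {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        _ => panic!("unsupported base"),
    }
}

/// Leftmost nucleotide ends up in the highest-order occupied bits.
/// Assumes `seq.len() <= 32`.
fn pack_left_high(seq: &[u8]) -> u64 {
    seq.iter().fold(0, |word, &b| (word << 2) | code(b))
}

/// Leftmost nucleotide ends up in the lowest-order bits.
/// Assumes `seq.len() <= 32`.
fn pack_left_low(seq: &[u8]) -> u64 {
    seq.iter()
        .enumerate()
        .fold(0, |word, (i, &b)| word | (code(b) << (2 * i)))
}
```

With this particular code, "ACGT" packs to 0b00_01_10_11 under the first scheme and 0b11_10_01_00 under the second. One practical difference: when the leftmost base sits in the high bits and the codes follow alphabetical order, comparing two equal-length k-mers as integers agrees with comparing them as strings.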
So, it turns out that currently (as of Rust 1.51) with MVP const generics, we are not allowed to do simple computations with generic parameters. For example, it would be desirable to have something like this to automatically compute the size of the storage array we want to use based on the value of K provided to the type. However, such a capability is gated behind a feature flag on the nightly branch of rustc. We should determine how we want to handle this from a design perspective. I see a few options:
I'd appreciate others' thoughts on this.
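One possible workaround on stable Rust, sketched here as an illustration and not necessarily one of the options meant above, is to pass the word count as a second const parameter and check it against K at compile time. The names PackedKmer, words_for, and SIZE_OK are hypothetical, and the sketch assumes a toolchain newer than 1.51, since panicking in const contexts was stabilized in Rust 1.57.

```rust
/// Number of 64-bit words needed for `k` bases at 2 bits per base.
const fn words_for(k: usize) -> usize {
    (2 * k + 63) / 64
}

/// Hypothetical k-mer type: `K` is the k-mer length, `W` the storage word count.
struct PackedKmer<const K: usize, const W: usize> {
    data: [u64; W],
}

impl<const K: usize, const W: usize> PackedKmer<K, W> {
    /// Evaluated per instantiation; a mismatch between K and W fails the build.
    const SIZE_OK: () = assert!(W == words_for(K), "W must equal words_for(K)");

    fn new() -> Self {
        // Referencing the constant forces the check for this (K, W) pair.
        let () = Self::SIZE_OK;
        PackedKmer { data: [0; W] }
    }

    fn num_words(&self) -> usize {
        self.data.len()
    }
}
```

For example, PackedKmer::<33, 2>::new() compiles, while PackedKmer::<33, 1>::new() is rejected at compile time. The cost of this design is that callers must spell out W themselves.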
As a fundamental starting point for any good software project, we need to decide on the name and generate a logo. I kept the repo name simple, but I was imagining we'd write it out as something like kme-rs or kme.rs or something cute like that ;P. I just wanted to open discussion on this very important topic here to invite opinions and thoughts.
I'm assuming we'd be using 2 bits per bp. There are a couple of common encodings, but I want to suggest:
A -> 00
C -> 01
T -> 10
G -> 11
There's a couple of benefits:
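As a sketch of the kind of property this mapping offers (these may or may not be the benefits meant above), two things can be checked directly: the code falls out of the ASCII byte with a shift and a mask, and complementing a base is a single XOR. The function names are hypothetical, and encode does no validation, so bytes other than A, C, G, T (in either case) silently map to some code.

```rust
/// Encode an ASCII base under A=00, C=01, T=10, G=11.
/// Bits 1 and 2 of the ASCII byte already hold this code.
/// No validation: any other byte (e.g. b'N') maps to an arbitrary code.
fn encode(base: u8) -> u8 {
    (base >> 1) & 0b11
}

/// Complement under this mapping: A <-> T and C <-> G differ only in the high bit.
fn complement(code: u8) -> u8 {
    code ^ 0b10
}

/// Decode a 2-bit code back to an uppercase ASCII base.
fn decode(code: u8) -> u8 {
    b"ACTG"[(code & 0b11) as usize]
}
```

For instance, decode(complement(encode(b'C'))) yields b'G', and lowercase input encodes to the same codes as uppercase.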
On a slightly unrelated note, I've worked on some sequence manipulation stuff that uses SIMD (e.g., here, and here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling k-mers that are relatively short.
This issue will provide a roadmap for the library, along with specific tasks (TODOs). Ideally we should break these tasks into short and long term tasks and, as the library becomes more mature, tie individual tasks to specific release candidates.
Hi Team!
Great software, and I have a lot of fun with it! But after my last cargo clean I am no longer able to compile kmers on Windows 11. I know most bioinformaticians are not working on Windows, and neither am I, BUT we also have a Windows-specific VR application that I use quite a lot (CellexalVR). We developed it ourselves, and it is really cool to work with single-cell data in that tool, just in case you want to try. Anyhow, I therefore work on Windows, and it would be really, really cool if I could exclude this "simple-sds" lib from "my" kmers ;-)
Meaning, could you PLEASE add a key somewhere before you included simple-sds (2 weeks ago) so that I can depend on kmers as it was before this lib was added? That would be very helpful! I assume simple-sds will not support Windows in the short term, or will it?
Thank you!
Hi there!
I'd like to discuss the state of this project to see if it would be suitable for larger Rust libraries.
First I'd like to say that I really like the design of this library, especially the separation of encoding and Kmer, and the fact that it scales with large K using bitfields.
Still, I think that it would be nice to support more elementary operations to make the library more convenient, and that further optimizations could be explored.
In no particular order, I'm thinking of the following features
Do you think these features are appropriate for this library, or would you prefer to keep it simple?
I've already implemented some of these features in a less generic way (see this module), so I could try to adapt the code to this library.
I have created u64 representations of a kmer like this:
for kmer in needletail::kmer::Kmers::new(seq, 32) {
    let km = Kmer::from(kmer).into_u64();
}
Is there a function to reverse that, i.e. to get a k-mer representation back from this u64?
I am quite sure there is and I am absolutely sure I will not find it ;-)
Please help!
I tried this, but it fails with a UTF-8 error:
std::str::from_utf8( &km.to_le_bytes()).unwrap().to_string()
I am sure it can be done and it should be simple. But I am not getting it ;-)
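The from_utf8 attempt cannot work in general because the bytes of the packed integer hold 2-bit codes, not ASCII characters. A minimal sketch of a manual decoder follows; it does not use the kmers API (whether the crate offers an equivalent is not confirmed here). It assumes the leftmost base sits in the highest-order bits of the 2k bits in use, and it takes the code-to-letter table as a parameter because the actual encoding is also an assumption. The name decode_kmer is hypothetical.

```rust
/// Decode a 2-bit-packed k-mer of length `k` from `word`.
/// Assumption: the leftmost base occupies the highest-order of the 2*k used bits.
/// `alphabet[c]` is the ASCII letter for 2-bit code `c`; pass the table that
/// matches the encoding actually used to build `word`.
fn decode_kmer(word: u64, k: usize, alphabet: [u8; 4]) -> String {
    assert!((1..=32).contains(&k), "k must be between 1 and 32");
    (0..k)
        .map(|i| {
            // Base i (counting from the left) sits 2*(k-1-i) bits above bit 0.
            let shift = 2 * (k - 1 - i);
            alphabet[((word >> shift) & 0b11) as usize] as char
        })
        .collect()
}
```

For example, decode_kmer(0b00_01_10_11, 4, *b"ACGT") returns "ACGT". If the library stores the leftmost base in the low bits instead, the shift becomes 2 * i.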
Hi,
Currently, the library stores k-mers in an array of u64.
It would be nice to let the user choose the array's element type. I work with very short k-mers, and using a u64 to store each one has a significant cost.
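A rough sketch of what a user-chosen storage type could look like, using a wrapper that is generic over its storage word. SmallKmer and from_codes are hypothetical names, not part of the library, and the input is assumed to be already 2-bit encoded.

```rust
use std::mem::size_of;
use std::ops::{BitOr, Shl};

/// Hypothetical k-mer wrapper generic over its storage word `W`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SmallKmer<W> {
    bits: W,
}

impl<W> SmallKmer<W>
where
    W: Copy + From<u8> + Shl<u32, Output = W> + BitOr<Output = W>,
{
    /// Pack 2-bit codes (values 0..=3), leftmost code in the highest-order bits.
    fn from_codes(codes: &[u8]) -> Self {
        assert!(
            2 * codes.len() <= 8 * size_of::<W>(),
            "k-mer too long for the chosen storage type"
        );
        let bits = codes
            .iter()
            .fold(W::from(0u8), |acc, &c| (acc << 2u32) | W::from(c & 0b11));
        SmallKmer { bits }
    }
}
```

With this layout, SmallKmer<u16> holds up to 8 bases in 2 bytes, whereas SmallKmer<u64> takes 8 bytes regardless of k, which is the overhead described above for very short k-mers.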