cbl's Introduction

Conway-Bromage-Lyndon

A Rust library providing fully dynamic sets of k-mers with high locality.

The data structure is described in the paper "Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets". Please cite it if you use this library.

It supports the following operations:

  • inserting a single k-mer (with insert), or every k-mer from a sequence (with insert_seq)
  • deleting a single k-mer (with remove), or every k-mer from a sequence (with remove_seq)
  • testing membership of a single k-mer (with contains), or of every k-mer from a sequence (with contains_seq)
  • iterating over the k-mers stored in the set (with iter)
  • union / intersection / difference of two sets (with | / & / -)
  • (de)serialization with serde
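
To make the semantics of these operations concrete, here is a minimal sketch that uses a plain `BTreeSet<u64>` from the standard library as a stand-in for a k-mer set. It does not use the CBL API and has none of CBL's locality properties; `encode` and `kmer_set` are hypothetical helpers, and the 2-bit encoding (A=0, C=1, G=2, T=3) is an assumption made for illustration.

```rust
use std::collections::BTreeSet;

/// Encode a k-mer over {A, C, G, T} using 2 bits per base (hypothetical helper).
/// Returns None if the k-mer contains any other character.
/// Intended for k <= 32 so that the result fits in a u64.
fn encode(kmer: &[u8]) -> Option<u64> {
    kmer.iter().try_fold(0u64, |acc, &base| {
        let code = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None,
        };
        Some((acc << 2) | code)
    })
}

/// Collect every valid k-mer of `seq` into an ordered set.
/// This plays the role that `insert_seq` plays for a CBL set.
fn kmer_set(seq: &[u8], k: usize) -> BTreeSet<u64> {
    assert!((1..=32).contains(&k), "k must be between 1 and 32");
    // Windows containing a non-ACGT character are skipped.
    seq.windows(k).filter_map(encode).collect()
}
```

For example, `kmer_set(b"ACGTA", 3)` and `kmer_set(b"CGTAC", 3)` share the 3-mers CGT and GTA, so for those two sets `&a | &b` has four elements, `&a & &b` has two, and `&a - &b` contains only ACG. The same `|`, `&` and `-` operators are the ones listed above for CBL sets.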

Requirements

Rust nightly 1.77+

If you have not installed Rust yet, please visit rustup.rs to install it. This library uses some nightly features of the Rust compiler (version 1.77+). You can install the latest nightly version with

rustup install nightly

If you don't want to use the +nightly flag every time you run cargo, you can set it as default with

rustup default nightly

Additional headers for Linux

This library uses C++ bindings for the sux library and tiered vectors. Depending on your configuration, some headers used for the bindings might be missing. In that case, please install the following packages:

Ubuntu

sudo apt install -y libstdc++-12-dev libclang-dev

Fedora

sudo dnf install -y clang15-devel

Using the library

You can add CBL in an existing Rust project with

cargo +nightly add --git https://github.com/imartayan/CBL.git

or by adding the following dependency in your Cargo.toml

cbl = { git = "https://github.com/imartayan/CBL.git" }

If the build fails, try installing the additional headers listed under "Additional headers for Linux".

Choosing the right parameters

The CBL struct takes two main parameters as constants:

  • an integer K specifying the size of the k-mers
  • an integer type T (e.g. u32, u64, u128) that must be large enough to store both a k-mer and its number of bits together

Therefore T should be large enough to store $2k + \lg(2k)$ bits. In particular, since primitive integers cannot store more than 128 bits, this means that K must be ≤ 59.
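
As a rough aid for picking T, the bound above can be evaluated in a few lines of Rust. This is only a sketch: `required_bits` and `smallest_type` are hypothetical helpers that are not part of the library, and rounding $\lg(2k)$ up to the next integer is an assumption. The documented limit K ≤ 59 is applied explicitly and takes precedence over the estimate.

```rust
/// Estimated number of bits needed for a given k: 2k bits for the k-mer itself
/// plus ceil(lg(2k)) extra bits (rounding up is an assumption of this sketch).
/// Intended for small k only.
fn required_bits(k: u32) -> u32 {
    assert!(k >= 1, "k must be positive");
    let kmer_bits = 2 * k;
    // For n >= 1, next_power_of_two().trailing_zeros() equals ceil(log2(n)).
    let extra_bits = kmer_bits.next_power_of_two().trailing_zeros();
    kmer_bits + extra_bits
}

/// Name of the smallest type among u32, u64 and u128 that can hold
/// `required_bits(k)`, or None if k exceeds the documented limit.
fn smallest_type(k: u32) -> Option<&'static str> {
    // The README states that K must be <= 59.
    if k > 59 {
        return None;
    }
    match required_bits(k) {
        0..=32 => Some("u32"),
        33..=64 => Some("u64"),
        65..=128 => Some("u128"),
        _ => None,
    }
}
```

With these assumptions, `required_bits(25)` is 56, so `u64` is enough for K = 25, while `required_bits(59)` is 125 and needs `u128`.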

Additionally, you can specify a third (optional) parameter PREFIX_BITS which determines the size of the underlying bitvector. Changing this parameter affects the space usage and the query time of the data structure, see the paper for more details.

Example usage

use cbl::CBL;
use needletail::parse_fastx_file;
use std::env::args;

// define the parameters K and T
const K: usize = 25;
type T = u64; // T must be large enough to store 2K + lg(2K) bits

fn main() {
    let args: Vec<String> = args().collect();
    let input_filename = args.get(1).expect("No argument given");

    // create a CBL index with parameters K and T
    let mut cbl = CBL::<K, T>::new();

    let mut reader = parse_fastx_file(input_filename).unwrap();
    // for each sequence of the FASTA/Q file
    while let Some(record) = reader.next() {
        let seqrec = record.expect("Invalid record");

        // insert each k-mer of the sequence in the index
        cbl.insert_seq(&seqrec.seq());
    }
}

Building from source

You can clone the repository and its submodules with

git clone --recursive https://github.com/imartayan/CBL.git

If you did not use the --recursive flag, make sure to load the submodules with

git submodule update --init --recursive

Running the binaries

You can compile the binaries with

cargo +nightly build --release --examples

If the build fails, try installing the additional headers listed under "Additional headers for Linux".

By default, the binaries are compiled with a fixed K equal to 25. You can compile them with a different K as follows

K=59 cargo +nightly build --release --examples

Note that K values ≥ 60 are not supported by this library.

Similarly, PREFIX_BITS is equal to 24 by default and you can change it with

K=59 PREFIX_BITS=28 cargo +nightly build --release --examples

Note that PREFIX_BITS values ≥ 29 are not supported by this library.

Once compiled, the main binary will be located at target/release/examples/cbl. It supports the following commands:

Usage: cbl <COMMAND>

Commands:
  build        Build an index containing the k-mers of a FASTA/Q file
  count        Count the k-mers contained in an index
  list         List the k-mers contained in an index
  query        Query an index for every k-mer contained in a FASTA/Q file
  insert       Add the k-mers of a FASTA/Q file to an index
  remove       Remove the k-mers of a FASTA/Q file from an index
  merge        Compute the union of two indexes
  inter        Compute the intersection of two indexes
  diff         Compute the difference of two indexes
  sym-diff     Compute the symmetric difference of two indexes
  repartition  Show the repartition of the k-mers in the data structure
  help         Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Running the tests

You can run all the tests with

cargo +nightly test --lib

Building the documentation

You can build the documentation of the library and open it in your browser with

cargo +nightly doc --lib --no-deps --open

Citation

Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. Martayan, I., Cazaux, B., Limasset, A., and Marchet, C. https://doi.org/10.1093/bioinformatics/btae217

@article{cbl,
  title   = {{Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets}},
  author  = {Martayan, Igor and Cazaux, Bastien and Limasset, Antoine and Marchet, Camille},
  journal = {Bioinformatics},
  volume  = {40},
  number  = {Supplement_1},
  pages   = {i48-i57},
  year    = {2024},
  month   = {06},
  issn    = {1367-4811},
  doi     = {10.1093/bioinformatics/btae217},
  url     = {https://doi.org/10.1093/bioinformatics/btae217},
  eprint  = {https://academic.oup.com/bioinformatics/article-pdf/40/Supplement\_1/i48/58354678/btae217.pdf}
}

cbl's Issues

Discussion / question

Hi @imartayan,

Very exciting work; congratulations on this! This isn't an issue per-se, but rather a discussion point that I wanted to raise (I raised a similar one in the ggcat repo a while back).

While I understand the desire to make our own lives (as developers) as easy as possible, I wonder if you might be able to enumerate what specific nightly features are required by CBL, and what prevents it from building on the latest stable rust (or at least beta).

Jon Gjengset — one of my favorite Rustaceans — has an excellent talk (relevant part linked here) about the tradeoffs of relying on nightly features and why it may, much of the time, just not be worthwhile. In particular, I'm curious what would be required to build on stable (or beta), and what particular features are being used. Features that are slated for stabilization mean it's just a matter of time — few release cycles — until those are on stable. But some nightly features may never make it to stable, or be removed or abandoned (or be unsound 😱), and may be worth replacing with something else, or a stable crate that emulates their behavior.

Anyway, I just wanted to kick off this discussion with you to get your thoughts and feedback. Congrats again!

--Rob

Compilation issues on linux and OS X

I'm excited to try this library; it looks great.

However, I've gotten stuck compiling on two machines. If you have any advice, that would be great.

Fedora 38

Running cargo +nightly build throws an issue in autocxx-bindgen:

  thread 'main' panicked at /home/jlees/.cargo/registry/src/index.crates.io-6f17d22bba15001f/autocxx-bindgen-0.65.1/ir/context.rs:1997:26:
  Non floating-type complex? Type(_Complex _Float16, kind: Complex, cconv: 100, decl: Cursor( kind: NoDeclFound, loc: builtin definitions, usr: None), canon: Cursor( kind: NoDeclFound, loc: builtin definitions, usr: None)), Type(_Float16, kind: Float16, cconv: 100, decl: Cursor( kind: NoDeclFound, loc: builtin definitions, usr: None), canon: Cursor( kind: NoDeclFound, loc: builtin definitions, usr: None))

This is with libstdc++-13

OS X

I wasn't sure if OS X is supported or not, but on an M1 mac I get:

CXXFLAGS = Some("-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -fmessage-length=0)
  running: env -u IPHONEOS_DEPLOYMENT_TARGET "x86_64-apple-darwin13.4.0-clang++" "-O1" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-2" "-fno-omit-frame-pointer" "--target=arm64-apple-darwin" "-march=core2" "-mtune=haswell" "-mssse3" "-ftree-vectorize" "-fPIC" "-fstack-protector-strong" "-O2" "-pipe" "-stdlib=libc++" "-fvisibility-inlines-hidden" "-fmessage-length=0" "-o" "/Users/jlees/Documents/EBI/SKA project/CBL/target/debug/build/link-cplusplus-352f80a998321554/out/669417d7ccbf6cde-dummy.o" "-c" "/Users/jlees/Documents/EBI/SKA project/CBL/target/debug/build/link-cplusplus-352f80a998321554/out/dummy.cc"
  cargo:warning=x86_64-apple-darwin13.4: error: unsupported argument 'core2' to option '-march='

I couldn't work out if it was possible to remove the march option (which is in both CFLAGS and CXXFLAGS). I tried removing it from build.rs but it made no difference

Reverse complements

Dear @imartayan , very nice work on data structures for k-mers!
I'm currently working on some experiments that include CBL. However, I realized that in some cases reverse complements are probably not handled correctly: when all queries were supposed to be positive, only about half of them were reported as positive by CBL.

I've created the following setup to reproduce the issue: I compiled CBL (with default K and PREFIX_BITS) and built an index on the S. pneumoniae pangenome:

$ ../CBL/target/release/examples/cbl build -o spneumo_pangenome_k32.fa.cbl spneumo_pangenome_k32.fa

Then I created the following query file:

$ cat query.k_25.fa
>q1
CTTTATAGTCTGAAAAAAGGTAACC
>q2 = reverse complement of q1
GGTTACCTTTTTTCAGACTATAAAG

and ran CBL query:

$ ../CBL/target/release/examples/cbl query spneumo_pangenome_k32.fa.cbl query.k_25.fa
Reading the index stored in spneumo_pangenome_k32.fa.cbl
Querying the 25-mers contained in query.k_25.fa
# queries: 2
# positive queries: 1 (50.00%)

I would appreciate it if you could look into this issue.
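
The relationship between the two query sequences can be checked independently of CBL. The following is a minimal standalone sketch: `reverse_complement` and `canonical` are illustrative helpers and not part of the CBL API, and nothing here says how CBL itself treats reverse complements. Picking the lexicographically smaller of a k-mer and its reverse complement is one common convention for a canonical form, used here as an assumption.

```rust
/// Reverse complement of an uppercase DNA sequence.
/// Bases other than A/C/G/T are mapped to N.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&base| match base {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Canonical form of a k-mer: the lexicographically smaller of the k-mer
/// and its reverse complement, so both strands map to the same value.
fn canonical(kmer: &[u8]) -> Vec<u8> {
    let rc = reverse_complement(kmer);
    if rc.as_slice() < kmer {
        rc
    } else {
        kmer.to_vec()
    }
}
```

Applied to the queries above, `reverse_complement(b"CTTTATAGTCTGAAAAAAGGTAACC")` yields `GGTTACCTTTTTTCAGACTATAAAG`, and `canonical` returns the same sequence for q1 and q2.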
