Coder Social home page Coder Social logo

bio-seq's Introduction

Docs.rs CI status

bio-seq

Bit-packed and well-typed biological sequences

Add bio-seq to Cargo.toml:

[dependencies]
bio-seq = "0.12"

Example: Iterating over the kmers for a sequence:

use bio_seq::prelude::*;

let seq = dna!("ATACGATCGATCGATCGATCCGT");

// iterate over the 8-mers of the reverse complement
for kmer in seq.revcomp().kmers::<8>() {
    println!("{}", kmer);
}

// ACGGATCG
// CGGATCGA
// GGATCGAT
// GATCGATC
// ATCGATCG
// ...

Example: The 4-bit encoding of IUPAC nucleotide ambiguity codes naturally represent a set of bases for each position (0001: A, 1111: N, 0000: *, ...):

use bio_seq::prelude::*;

let seq = iupac!("AGCTNNCAGTCGACGTATGTA");
let pattern = iupac!("AYG");

for slice in seq.windows(pattern.len()) {
    if pattern.contains(slice) {
        println!("{} matches pattern", slice);
    }
}

// ACG matches pattern
// ATG matches pattern

The goal of this crate is to make handling biological sequence data safe and convenient. The TranslationTable trait implements genetic coding:

// This is a debruijn sequence of all possible 3-mers:
let seq: Seq<Dna> =
    dna!("AATTTGTGGGTTCGTCTGCGGCTCCGCCCTTAGTACTATGAGGACGATCAGCACCATAAGAACAAA");
let aminos: Seq<Amino> = Seq::from_iter(seq.windows(3).map(|codon| translation::STANDARD.to_amino(codon)));
assert_eq!(
    aminos,
    amino!("NIFLCVWGGVFSRVSLCARGALSPRAPPLL*SVYTLYM*ERGDTRDISQSAHTPHI*KRENTQK")
);

Contents

  • Codec: Coding/Decoding schemes for the characters of a biological sequence
  • Seq: A sequence of encoded characters
  • Kmer: A fixed size sequence of length K
  • Derivable codecs: This crate offers utilities for defining your own bit-level encodings
  • Safe conversion between sequences

Codecs

The Codec trait describes the coding/decoding process for the characters of a biological sequence. This trait can be derived procedurally. There are four built-in codecs:

codec::Dna

Using the lexicographically ordered 2-bit representation

codec::Iupac

IUPAC nucleotide ambiguity codes are represented with 4 bits. This supports membership resolution with bitwise operations. Logical or is the union:

assert_eq!(iupac!("AS-GYTNA") | iupac!("ANTGCAT-"), iupac!("ANTGYWNA"));

Logical and is the intersection of two iupac sequences:

assert_eq!(iupac!("ACGTSWKM") & iupac!("WKMSTNNA"), iupac!("A----WKA"));

codec::Text

utf-8 strings that are read directly from common plain-text file formats can be treated as sequences. Additional logic can be defined to ensure that 'a' == 'A' and for handling 'N'.

codec::Amino

Amino acid sequences are represented with 6 bits. The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA.

Strings of encoded characters are packed into Seq. Slicing, chunking, and windowing return SeqSlice. Seq<A: Codec> and &SeqSlice<A: Codec> are analogous to String and &str. As with the standard string types, these are stored on the heap. Kmers are generally stored on the stack, implementing Copy. All data is stored little-endian. This effects the order that sequences map to the integers ("colexicographic" order):

for i in 0..=15 {
    println!("{}: {}", i, Kmer::<Dna, 5>::from(i));
}
0: AAAAA
1: CAAAA
2: GAAAA
3: TAAAA
4: ACAAA
5: CCAAA
6: GCAAA
7: TCAAA
8: AGAAA
9: CGAAA
10: GGAAA
11: TGAAA
12: ATAAA
13: CTAAA
14: GTAAA
15: TTAAA

kmers are short sequences of length k that can fit into a register (e.g. usize, or SIMD vector) and generally implement Copy. these are implemented with const generics and k is fixed at compile time.

Efficient encodings

For encodings with a dense mapping between characters and integers a lookup table can be indexed in constant time by treating kmers directly as usize:

fn kmer_histogram<C: Codec, const K: usize>(seq: &SeqSlice<C>) -> Vec<usize> {
    // For dna::Dna our histogram will need 4^4
    // bins to count every possible 4-mer.
    let mut histo = vec![0; 1 << (C::WIDTH * K as u8)];

    for kmer in seq.kmers::<K>() {
        histo[usize::from(kmer)] += 1;
    }

    histo
}

This example builds a histogram of kmer occurences.

Sketching

Hashing

The Hash trait is implemented for Kmers

Canonical Kmers

Depending on the application, it may be permissible to superimpose the forward and reverse complements of a kmer:

k = kmer!("ACGTGACGT");
let canonical = k ^ k.revcomp(); // TODO: implement ReverseComplement for Kmer

Kmer minimisers

The 2-bit representation of nucleotides is ordered A < C < G < T. Sequences and kmers are stored little-endian and are ordered "colexicographically". This means that AAAA < CAAA < GAAA < ... < AAAC < ... < TTTT

fn minimise(seq: Seq<Dna>) -> Option<Kmer::<Dna, 8>> {
    seq.kmers().min()
}

Example: Hashing minimiser of canonical Kmers

for ckmer in seq.window(8).map(|kmer| hash(kmer ^ kmer.revcomp())) {
    // TODO: example
    ...
}

Derivable codecs

Sequence coding/decoding is derived from the variant names and discriminants of enum types:

use bio_seq_derive::Codec;
use bio_seq::codec::Codec;

#[derive(Clone, Copy, Debug, PartialEq, Codec)]
#[repr(u8)]
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

A #[width(n)] attribute specifies how many bits the encoding requires per symbol. The maximum supported is 8. If this attribute isn't specified then the optimal width will be chosen.

#[alt(...,)] and #[display('x')] attributes can be used to define alternative representations or display the item with a special character. Here is the definition for the stop codon in codec::Amino:

pub enum Amino {
    #[display('*')] // print the stop codon as a '*'
    #[alt(0b001011, 0b100011)] // TGA, TAG
    X = 0b000011, // TAA (stop)

Sequence conversions

Translation table traits

Translation tables provide methods for translating codons into amino acids:

pub trait TranslationTable<A: Codec, B: Codec> {
    fn to_amino(&self, codon: &SeqSlice<A>) -> B;
    fn to_codon(&self, amino: B) -> Result<Seq<A>, TranslationError>;
}

/// A partial translation table where not all triples of characters map to amino acids
pub trait PartialTranslationTable<A: Codec, B: Codec> {
    fn try_to_amino(&self, codon: &SeqSlice<A>) -> Result<B, TranslationError>;
    fn try_to_codon(&self, amino: B) -> Result<Seq<A>, TranslationError>;
}

The standard genetic code is provided as a translation::STANDARD constant:

use crate::prelude::*;
use crate::translation::STANDARD;
use crate::translation::TranslationTable;

let seq: Seq<Dna> =
    dna!("AATTTGTGGGTTCGTCTGCGGCTCCGCCCTTAGTACTATGAGGACGATCAGCACCATAAGAACAAA");

let aminos: Seq<Amino> = seq
    .windows(3)
    .map(|codon| STANDARD.to_amino(&codon))
    .collect::<Seq<Amino>>();

assert_eq!(
    aminos,
    amino!("NIFLCVWGGVFSRVSLCARGALSPRAPPLL*SVYTLYM*ERGDTRDISQSAHTPHI*KRENTQK")
);

Custom translation tables

Instantiate a translation table from a type that implements Into<HashMap<Seq<A>, B>>:

let codon_mapping: [(Seq<Dna>, Amino); 6] = [
    (dna!("AAA"), Amino::A),
    (dna!("ATG"), Amino::A),
    (dna!("CCC"), Amino::C),
    (dna!("GGG"), Amino::E),
    (dna!("TTT"), Amino::D),
    (dna!("TTA"), Amino::F),
];

let table = CodonTable::from_map(codon_mapping);

let seq: Seq<Dna> = dna!("AAACCCGGGTTTTTATTAATG");
let mut amino_seq: Seq<Amino> = Seq::new();

for codon in seq.chunks(3) {
    amino_seq.push(table.try_to_amino(codon).unwrap());
}

assert_eq!(amino_seq, amino!("ACEDFFA"));

Implementing the TranslationTable trait directly:

struct Mitochondria;

impl TranslationTable<Dna, Amino> for Mitochondria {
    fn to_amino(&self, codon: &SeqSlice<Dna>) -> Amino {
        if *codon == dna!("AGA") {
            Amino::X
        } else if *codon == dna!("AGG") {
            Amino::X
        } else if *codon == dna!("ATA") {
            Amino::M
        } else if *codon == dna!("TGA") {
            Amino::W
        } else {
            Amino::unsafe_from_bits(Into::<u8>::into(codon))
        }
    }

    fn to_codon(&self, _amino: Amino) -> Result<Seq<Dna>, TranslationError> {
        unimplemented!()
    }
}

bio-seq's People

Contributors

jeff-k avatar szabgab avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

jianshu93 szabgab

bio-seq's Issues

Why different use statements for 4-letter DNA and IUPAC DNA

The "use" statements seem to work differently for four-letter DNA and IUPAC, see the following two examples. I don't understand why the "use" statement patter differs. Is there something I've misunderstood?

For DNA:

use bio_seq::*;
use bio_seq::codec::Dna;
fn main() {
    let input_string = "AAAA";
    let dna_seq = Seq::<Dna>::from_str(input_string).unwrap();
    println!("dna_seq: {}", dna_seq);
}

For IUPAC:

use bio_seq::*;
use bio_seq::codec::iupac::Iupac;
fn main() {
    let input_string = "AAAA";
    let iupac_seq = Seq::<Iupac>::from_str(input_string).unwrap();
    println!("iupac_seq: {}", iupac_seq);
}

Note the difference in particular between:
use bio_seq::codec::Dna;
and
use bio_seq::codec::iupac::Iupac;

Add support for multiple genetic codes

I've been looking all over for a Rust library that supports multiple genetic codes. This library seems to be the best translation library, so I hope this is on your radar. Thanks!

amino acid kmer utility

Hello Jeff,

It's really nice that you have this bit by bit encoding of sequences for DNA. You said on todo list that amino acid support will also be available. I think we need at least 8 bit right instead of 2 or 4 for AGCT, and IUAC dna (16) because there are 20 amino acid and more for IUAC rules. I am doing amino acid kmer counting for a huge database of bacterial genomes in amino acid format and I cannot find any bit by bit encoding and decoding of amino acid sequences except this one. Will amino acid kmer available soon? I will be very happy to work on this with you or whatever. email is: [email protected]

Many thanks,

Jianshu

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.