Coder Social home page Coder Social logo

tomarrell / rust-elias-fano Goto Github PK

View Code? Open in Web Editor NEW
28.0 4.0 5.0 26 KB

Elias-Fano encoding implementation in Rust

License: MIT License

Rust 100.00%
algorithm elias-fano encoding inverted-index search-algorithm compression rust hacktoberfest

rust-elias-fano's Introduction

Elias-Fano, in Rust

Build Status Rust Docs

Elias-Fano encoding in Rust.

The Elias-Fano encoding scheme is a quasi-succinct compression method for monotonic integers using gap compression on a Bitset. It allows for decompression of a bit at any position in O(1) time complexity.

Being quasi-succinct, it is therefore almost as good as the best theoretical possible compression as determined by the Shannon-Hartley theorem.

This implementation is based largely on one written in Go by Antonio Mallia, which can be found at his repository amallia/go-ef.

Todo:

  • Tests
  • Example usage
  • Benchmarks, comparison with other implementations

Installation

Add the following line to your Cargo.toml:

[dependencies]
+ elias-fano = "1.1.0"

Example Usage

extern crate elias_fano;

use elias_fano::EliasFano;

fn main() {
    let sorted_array = [0, 3, 40, 1000];
    let size = sorted_array.len();

    let mut ef = EliasFano::new(sorted_array[size - 1], size as u64);

    ef.compress(sorted_array.iter());

    println!("{}", ef.value()); // 1

    match ef.next() {
        Ok(val) => println!("Retrieved value: {}", val), // 3
        Err(error) => println!("Err: {}", error),        // Out of bounds
    }

    let _ = ef.next();
    println!("{}", ef.value()); // 40

    ef.reset();
    println!("{}", ef.value()); // 0

    let _ = ef.visit(3);
    println!("{}", ef.value()); // 1000
}

License

MIT licensed, see LICENSE for more details.

rust-elias-fano's People

Contributors

jackyzhen avatar tomarrell avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

rust-elias-fano's Issues

Add pointer index

Pointer array is used to retrieve child edges, one of the two primitive retrieval operations of the edge database.
Pointer array will store two parallel int arrays:

  1. Elias-fano compression of the source vertexes in ascending order
  2. An Elias-fano compression of the offset into the partition to retreive the first source edge of the corresponding vertex.

Benchmark and tests

This is a ๐Ÿ”ฅ project . Some suggestions:

  • Some tests and test data to demonstrate how this lib is used.
  • Some benchmark tests to show performance.
  • Some benchmark notes against implementations in other languages as part of README to show case rust ๐Ÿš€ ๐Ÿš€๐Ÿš€

Skip to a specific value

Recently Elias Fano has become more popular to represent posting lists in search engine.

In order to intersect two positng lists efficiently, it is then required to have an efficient implementation of skipping to a specific document.

In tantivy, the interface is as follows.

One of the main benefits of Elias Fano is that it makes it easy to skip rapidly by counting zeroes and ones in the unary encoded high bits bit sequence.

Is it possible to implement a rank operation?

Hello,

Without iterating through all the values from the start, would it be possible to implement a rank operation that returns the number of values smaller than or equal to some value?

I'm not sure if Elias-Fano makes this kind of operation easy, so alternatively I'm wondering if a next_greater_than_or_equal (as described here) would be much trouble to implement for this crate?

Skip a givne number of elements

This is similar but different from #5.

There are some use case for skipping a given number of elements.
Once again, Elias Fano makes it possible by counting zeroes and ones in the unary encoded sequence.

One use case was met in tantivy when encoding positions. Given a term, term positions for all of the documents are concatenated and encoded as an increasing sequence.
Accessing the positions in the n-th document is made possible by suming all of the term frequency in
document 1 to (n-1)th and skipping the same amount of values in the elias fano encoded positions.

Refactor get_next_set

"get_next_set's implementation is probably suboptimal.

You loop over the bitset until you reach a bit set to one. Granted the density of this bitset is high as per your choice of number of bits. Still it is difficult to predict, so that a lot of branch prediction error.

Instead you can use u64::trailing_zeros which compiles to a single instructions : https://github.com/tantivy-search/tantivy/blob/78673172d001a0c4c7c73cdd3d9923fc43fc0312/src/common/bitset.rs#L87" - fulmicoton

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.