Coder Social home page Coder Social logo

webgraph-ans-rs's Introduction

webgraph-ans-rs

Web graphs are fundamental structures that represent the very complex nature of a specific portion of the World Wide Web used in a variety of applications such as search engines. In practical terms, a web graph is a mathematical abstraction in which there is a node for each web page and an arc from node ๐‘ฅ to node ๐‘ฆ if the page associated with the first node contains a hyperlink to the page associated with the second. It's easy to guess that the size of these enormous structures makes the traditional way of storing them obsolete.

One of the greatest frameworks built with the goal of compressing web graphs is WebGraph, a framework that, beyond offering various tools that can be used to operate un such structures, exploits the properties of web graphs (locality and similarity), as well as some other ideas tailored for the context, to compress them in an efficient format called BVGraph.

This project aims to improve the records of the mentioned frameworks, which have been standing for almost two decades, by adding another layer of compression by means of Asymmetrical Numeral Systems (ANS) over the first layer of compression performed by WebGraph.

Compressor

Since the BVGraph format is composed of 9 models (Outdegree, ReferenceOffset, BlockCount, Blocks, IntervalCount, IntervalStart, IntervalLen, FirstResidual, Residual), the model used by the compressor is going to be switched on the fly among the nine built for each specific component of the compression format. Moreover, to overcome the problem of dealing with enormous alphabets (we set as maximum symbol $2^{48} - 1$), the symbol folding technique introduced by Moffat and Petri in their work is implemented.

In general, the coder uses a 32-bit state and 16-bit renormalization step with some other constraints discussed below.

The compressor's interval for each model is defined as:

$$I = [M * K, M * K * B)$$

where:

  1. $M$ is the sum (power of two) of all approximated symbols frequencies for a specific model.
  2. $K = 2^{16} - M$
  3. $B = 2^{16}$

All this is done to guarantee that the shared interval between all models is $[2^{16}, 2^{32})$, a guarantee needed to make the compressor able to switch models on the fly.

PS: This implementation assumes that the most frequent symbols are the smallest positive integers.

Binaries

The binary that can be used to recompress a .graph is bvcomp:

$ cargo build --release --bin bvcomp
$ ./target/release/bvcomp <path_to_graph> <output_dir> <new_graph_name> <compression_params>

For example:

$ ./target/release/bvcomp tests/data/cnr-2000/cnr-2000 ans-cnr-2000

recompresses with standard compression parameters the cnr-2000.graph file in the tests/data/cnr-2000/ directory and save the new compressed graph in the current directory with the name ans-cnr-2000.graph.

PS: graphs can be found here.

webgraph-ans-rs's People

Contributors

ciminilorenzo avatar vigna avatar

Watchers

 avatar  avatar

webgraph-ans-rs's Issues

Improve decoding/encoding performances

The following while loop executes, one at time, the code included into the brackets, that is extracting one bit from the state that needs to be shrunk. Modify the function in such a way it does all it in one single operation.

pub fn encode(&mut self, symbols: &[u64]) -> Result<u64, StreamingRANSError> {
        if symbols.is_empty() {
            return Err(StreamingRANSError::EmptyInput);
        }
        let mut state = self.state_allowed_interval.start;
        symbols
            .iter()
            .for_each(|symbol| {
                while state >= self.model.get_symbol_frequency(*symbol) * 2 {
                    // extract the least significant bit
                    let bit = state & 1;
                    // append the lsb we have just extracted
                    self.shrunken_bits.push(bit == 1);
                    // new state is gonna be `floor(state:2)`
                    state = state.div(2);
                }
                state = self.encode_step(state, symbol);
            });
        Ok(state)
    }

How big the interval dimension can be?

By looking at the code released by Jeff Johnson, the equivalent of my interval_dimension field is named probBits.
It has the following comment:

// What the ANS probability accuracy is; all symbols have quantized
// probabilities of 1/2^probBits.
// 9, 10, 11 are only valid values. When in doubt, use 10 (e.g., all symbol
// probabilities are one of {1/1024, 2/1024, ..., 1023/1024, 1024/1024})
int probBits;

with the following default dimension:

// Default number of probability quantization bits to use, if an alternative is
// not specified
constexpr int kANSDefaultProbBits = 10;

If one wants to use the same strategy, the minimum probability representable would be: 1/1024 = 0,0009765622.

At the moment, the actual implementation follows the same criteria:

  1. one builder builds the AnsModel with the default value, that is 10;
  2. one builder allows to give as input the value, that must be in the range [9,11].

The normalized frequencies are calculated as follows:

  ordered_symbols_frequencies
        .iter_mut()
        .map(|frequency| {
            // !! Since floor() is used, if the symbol's probability is less than 
            // 0,0009765622, the corresponding approximated frequency would be 0.
            (*frequency as f64 / total_frequency as f64 * precision).floor() as u64
        })
        .collect()

Regarding Francesco's code, is takes the precision as input without checks on the value.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.