quickwit-oss / bitpacking

SIMD algorithms for integer compression via bitpacking. This crate is a port of a C library called simdcomp.

License: MIT License

Fast Bitpacking algorithms


This crate is a Rust port of Daniel Lemire's simdcomp C library.

It makes it possible to compress/decompress:

  • sequences of small integers
  • sequences of increasing integers

⭐ It is fast. Expect more than 4 billion integers per second.

How to compile?

bitpacking compiles on stable Rust, but requires Rust ≥ 1.27.

Just add this to your Cargo.toml:

bitpacking = "0.5"

For some bitpacking flavors and on some platforms, the bitpacking crate may benefit from a specific SIMD instruction set.

In that case, the crate always ships an alternative scalar implementation as well, and falls back to it at runtime if the instructions are unavailable.

In other words, you do not need to configure anything. Your program will run correctly, and at the fastest speed available for your CPU.

Documentation

Reference documentation

What is bitpacking?

Traditional compression schemes like LZ4 are not really suited to compressing sequences of small integers efficiently. Instead, there are different families of solutions to this problem.

One of the most straightforward and efficient ones is bitpacking:

  • Integers are first grouped into blocks of constant size (e.g. 128 when using the SSE2 implementation).
  • If it is not known implicitly, compute the minimum number of bits b that makes it possible to represent all of the integers. In other words, the smallest b such that all integers in the block are strictly smaller than 2^b.
  • The bitpacked representation is then some variation of the concatenation of the integers restricted to their least significant b bits.

For instance, assume a block size of 4 and the values 4, 9, 3, 2. The highest value in the block is 9, so b = 4. All values are then encoded over 4 bits as follows.

original number binary representation
4 0100
9 1001
3 0011
2 0010

As a result, each integer of this block only requires 4 bits.
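The scheme above can be sketched in plain Rust. This is a scalar illustration of the idea only (the crate's actual bit layout differs per bitpacker):

```rust
// Scalar sketch of bitpacking: concatenate the least significant `b`
// bits of each integer into a little-endian bit stream.

fn num_bits(block: &[u32]) -> u8 {
    // Smallest b such that every value is strictly smaller than 2^b.
    block.iter().map(|&v| 32 - v.leading_zeros() as u8).max().unwrap_or(0)
}

fn pack(block: &[u32], b: u8) -> Vec<u8> {
    let mut out = vec![0u8; (block.len() * b as usize + 7) / 8];
    for (i, &v) in block.iter().enumerate() {
        let bit_pos = i * b as usize;
        for j in 0..b as usize {
            if v >> j & 1 == 1 {
                out[(bit_pos + j) / 8] |= 1 << ((bit_pos + j) % 8);
            }
        }
    }
    out
}

fn unpack(bytes: &[u8], b: u8, n: usize) -> Vec<u32> {
    (0..n)
        .map(|i| {
            let bit_pos = i * b as usize;
            (0..b as usize).fold(0u32, |acc, j| {
                let bit = (bytes[(bit_pos + j) / 8] >> ((bit_pos + j) % 8)) & 1;
                acc | ((bit as u32) << j)
            })
        })
        .collect()
}

fn main() {
    let block = [4u32, 9, 3, 2];
    let b = num_bits(&block);
    assert_eq!(b, 4);
    let packed = pack(&block, b);
    assert_eq!(packed.len(), 2); // 4 values * 4 bits = 16 bits = 2 bytes
    assert_eq!(unpack(&packed, b, block.len()), block);
}
```

The block of 4 values that took 16 bytes as raw u32s is stored in 2 bytes.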

Choosing between BitPacker1x, BitPacker4x and BitPacker8x.

⚠️ BitPacker1x, BitPacker4x, and BitPacker8x produce different formats, and are incompatible with one another.

BitPacker4x and BitPacker8x are designed specifically to leverage SSE3 and AVX2 instructions respectively.

The crate safely falls back at runtime to a scalar implementation of these formats if the instruction sets are not available on the running CPU.

👌 I recommend using BitPacker4x if you are in doubt.

BitPacker1x

BitPacker1x is what you would expect from a bitpacker. The integer representations over b bits are simply concatenated one after the other. One block must contain 32 integers.

BitPacker4x

BitPacker4x bits ordering works in layers of 4 integers. This gives an opportunity to leverage SSE3 instructions to encode and decode the stream. One block must contain 128 integers.

BitPacker8x

BitPacker8x bits ordering works in layers of 8 integers. This gives an opportunity to leverage AVX2 instructions to encode and decode the stream. One block must contain 256 integers.

Compressing small integers

extern crate bitpacking;

use bitpacking::{BitPacker4x, BitPacker};

// `my_data` must contain exactly `BitPacker4x::BLOCK_LEN` (128) integers.
let my_data: Vec<u32> = (0..BitPacker4x::BLOCK_LEN as u32).collect();

// Detects if `SSE3` is available on the current CPU
// and uses the best available implementation accordingly.
let bitpacker = BitPacker4x::new();

// Computes the number of bits used for each integer in the block.
let num_bits: u8 = bitpacker.num_bits(&my_data);

// The compressed block takes exactly `num_bits * BitPacker4x::BLOCK_LEN / 8` bytes.
// But it is ok for the output buffer to be larger than this.
let mut compressed = vec![0u8; 4 * BitPacker4x::BLOCK_LEN];

// `compress` returns the compressed length in bytes.
let compressed_len = bitpacker.compress(&my_data, &mut compressed[..], num_bits);
assert_eq!((num_bits as usize) * BitPacker4x::BLOCK_LEN / 8, compressed_len);

// Decompressing
let mut decompressed = vec![0u32; BitPacker4x::BLOCK_LEN];
bitpacker.decompress(&compressed[..compressed_len], &mut decompressed[..], num_bits);

assert_eq!(&my_data, &decompressed);
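For sequences of increasing integers, the crate's sorted variants (num_bits_sorted, compress_sorted, decompress_sorted) conceptually delta-encode the block before packing it. A scalar sketch of that delta step, not the crate's actual format:

```rust
// Delta step used for sorted sequences (scalar sketch): store differences
// between consecutive values, which are small when the input is increasing.

// `sorted` must be non-decreasing and every value must be >= `initial`.
fn deltas(initial: u32, sorted: &[u32]) -> Vec<u32> {
    let mut prev = initial;
    sorted
        .iter()
        .map(|&v| {
            let d = v - prev;
            prev = v;
            d
        })
        .collect()
}

// Inverse of `deltas`: running sum starting from `initial`.
fn undeltas(initial: u32, deltas: &[u32]) -> Vec<u32> {
    let mut acc = initial;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}

fn main() {
    let sorted = [3u32, 7, 8, 15];
    let d = deltas(0, &sorted);
    assert_eq!(d, vec![3, 4, 1, 7]); // deltas fit in 3 bits instead of 4
    assert_eq!(undeltas(0, &d), sorted);
}
```

Because the deltas are much smaller than the original values, the bit width b shrinks accordingly.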

Benchmark

The following benchmarks have been run on one thread on my laptop's CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz.

You can get accurate figures on your hardware by running the following command.

cargo bench

BitPacker1x

operation throughput
compress 1.4 billion int/s
compress_delta 1.0 billion int/s
decompress 1.8 billion int/s
decompress_delta 1.4 billion int/s

BitPacker4x (assuming SSE3 instructions are available)

operation throughput
compress 5.3 billion int/s
compress_delta 2.8 billion int/s
decompress 5.5 billion int/s
decompress_delta 5 billion int/s

BitPacker8x (assuming AVX2 instructions are available)

operation throughput
compress 7 billion int/s
compress_delta 600 million int/s
decompress 6.5 billion int/s
decompress_delta 5.6 billion int/s

Other crates you might want to check out

  • stream-vbyte: a Stream-VByte implementation
  • mayda: another crate implementing the same algorithms

bitpacking's People

Contributors: bestouff, dependabot-preview[bot], dependabot-support, fulmicoton, mccullocht, pseitz, saroh, tafia, trinity-1686a

bitpacking's Issues

Example for compressing/decompressing an arbitrary number of u32 int values

I have been able to compress and decompress using BitPacker4x, and the compression looks quite good on sorted u32 values. I have a large number of u32 values to compress (up to a few thousand), and to compress them I looped through chunks of 128 u32 values. I need to store the final compressed result, and I did this by appending each chunk to a result Vec<u8> which I can then store. This works fine.

Now I need to take the result Vec<u8> and decompress each chunk. At this point I no longer have the compressed_len for each chunk of bytes. Is there a way to do this so that I can decompress each compressed chunk and rebuild the original Vec<u32>? The original C++ library from Daniel Lemire does support an arbitrary number of int values; that would be really nice here too. But if you can guide me on how to use it properly, that will be very helpful.
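One common answer (a sketch of a framing scheme, not an API provided by the crate) is to prefix each compressed block with its num_bits byte. Since a full block compresses to exactly num_bits * BLOCK_LEN / 8 bytes, the reader can recover every chunk boundary:

```rust
// Self-describing stream sketch (an assumption, not a crate API):
// each block is stored as [num_bits: u8][num_bits * BLOCK_LEN / 8 bytes].
const BLOCK_LEN: usize = 128; // matches BitPacker4x::BLOCK_LEN

fn append_block(out: &mut Vec<u8>, num_bits: u8, compressed: &[u8]) {
    assert_eq!(compressed.len(), num_bits as usize * BLOCK_LEN / 8);
    out.push(num_bits);
    out.extend_from_slice(compressed);
}

// Iterate over the (num_bits, compressed bytes) pairs in the stream.
fn read_blocks(stream: &[u8]) -> Vec<(u8, &[u8])> {
    let mut blocks = Vec::new();
    let mut pos = 0;
    while pos < stream.len() {
        let num_bits = stream[pos];
        let len = num_bits as usize * BLOCK_LEN / 8;
        blocks.push((num_bits, &stream[pos + 1..pos + 1 + len]));
        pos += 1 + len;
    }
    blocks
}

fn main() {
    let mut stream = Vec::new();
    append_block(&mut stream, 4, &[0u8; 4 * BLOCK_LEN / 8]);
    append_block(&mut stream, 7, &[0u8; 7 * BLOCK_LEN / 8]);
    let blocks = read_blocks(&stream);
    assert_eq!(blocks.len(), 2);
    assert_eq!(blocks[0].0, 4);
    assert_eq!(blocks[1].1.len(), 7 * BLOCK_LEN / 8);
}
```

Each (num_bits, bytes) pair can then be fed to decompress (or decompress_sorted with the previous block's last value as initial) to rebuild the original Vec<u32>. The one-byte header costs less than 0.1 bits per integer.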

BitPacker4x en/decoding problem

I have made a playground test that shows what I have done so far to en/decode a list of ordered u32s. The list of integers in this example is not en/decoded correctly: between the original numbers 19984 and 21086, the latter changed into 20062, a difference of exactly 1024 (2^10).

Where do you think I could have made a mistake, is it related to #23 (comment) ?

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=a11e9600e3c3f8e1d4420527668673d2

(Unfortunately, bitpacking and zerocopy are not supported by the playground; you will have to copy/paste it. Sorry for the inconvenience.)

Preparing for rust 1.27

Rust 1.27 will include support for simd instructions.
bitpacking will compile on stable so we can remove all simd files and README explanations.

Delta for AVX2

As suggested by `` on reddit, compute_delta, as well as its integration on AVX2, would be more efficient if it relied on the `_mm256_alignr_epi8` instruction.

Copy-pasting the incomplete snippet from reddit.


// Note: the snippet as posted used `_mm256_permute2x128_i256`;
// the actual AVX2 intrinsic is `_mm256_permute2x128_si256`.
fn compute_delta(curr: DataType, prev: DataType) -> DataType {
    unsafe {
        let shifted = _mm256_alignr_epi8(curr, _mm256_permute2x128_si256(prev, curr, 0x21), 12);
        _mm256_sub_epi32(curr, shifted)
    }
}

This crate vs tantivy-bitpacker

I noticed that there is a separate crate under the Tantivy project that also does bitpacking. So far as I can tell, the other crate does 64 bit compression and does not use SIMD, and the Tantivy project uses both crates. Is this correct?

Can I ask about the future of this crate? Will you be consolidating the two? What's the plan?

Also, it appears that the master branch is at version 0.8.3, but crates.io has 0.8.4. Is there some other repo somewhere I've missed?

Have a light scalar implementation

The current implementations generate large binaries because they have one specialized implementation for each bitwidth, and do loop unrolling.

Add a flag-enabled implementation that uses a more compact scalar implementation. This would be useful for web assembly for instance.

Compilation require heavy amount of memory

Building bitpacking has a peak memory usage of 1.9 GiB on my system, and has caused issues in CI and for some users building on lower-end hardware (see Plume-org/Plume#632).
This seems to be directly related to heavy usage of macros for code expansion. Once the code is expanded, I measured a bit more than 330k lines of code, distributed approximately as 1/5 bitpacker1x, 2/5 bitpacker4x, and 2/5 bitpacker8x.
Commenting out mod bitpacker8x; reduced peak memory usage to 1.2 GiB, so it might be interesting to have features to enable/disable individual bitpackers depending on what is required. For instance, Tantivy seems to use only 4x, so it is probably possible to go below 1 GiB by just disabling unneeded bitpackers.

BitPacker4x::num_bits_sorted SIGSEGV on empty arrays

When called with an empty array, num_bits_sorted triggers a segfault. It is not specified anywhere that the array must not be empty (or did I miss it?).

fn num_bits_sorted(&self, initial: u32, decompressed: &[u32]) -> u8 {
     unsafe { scalar::UnsafeBitPackerImpl::num_bits_sorted(initial, decompressed) }
}

If you want to easily reproduce the bug, just create an empty vec![] and pass it to BitPacker4x::num_bits_sorted.

Help / Question: is this applicable to parquet's RLE encoding?

Hi,

I am considering reusing this awesome crate to read parquet files. Parquet uses an RLE encoding that seems to be equivalent to using BitPacker1x.

However, I am not 100% sure, and would really appreciate any guidance here.

Parquet accepts blocks larger / smaller than 32 items. Do you handle that in tantivy? E.g. do you loop over chunks of 32 items? I imagine that some operations at the boundaries are needed. The example I have is:

    #[test]
    fn test_rle() {
        // Test data: 0-7 with bit width 3
        // 0: 000
        // 1: 001
        // 2: 010
        // 3: 011
        // 4: 100
        // 5: 101
        // 6: 110
        // 7: 111
        let num_bits = 3;
        let length = 8;
        // encoded: 0b10001000u8, 0b11000110, 0b11111010
        // we need to pad it with zeros.
        let data = vec![0b10001000u8, 0b11000110, 0b11111010, 0, 0, 0, 0, 0, 0, 0, 0, 0];

        let bitpacker = BitPacker1x::new();
        let mut decompressed = vec![0u32; BitPacker1x::BLOCK_LEN];
        bitpacker.decompress(&data, &mut decompressed[..], num_bits);
        decompressed.truncate(length); // truncate to recover it.

        assert_eq!(decompressed, vec![0, 1, 2, 3, 4, 5, 6, 7]);
    }

I wonder how to generalize it for data whose length is larger than 12 bytes. Essentially, I was hoping to have

fn decode(values: &[u8], length: u32, num_bits: u8) -> Vec<u32>

in which `values` does not have to be a multiple of 12 bytes. Any pointers in this direction?
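One possible shape for such a decode function, sketched with a stand-in block decoder (an assumption, not a crate API): pad the last partial block, decompress block by block, then truncate to the real length:

```rust
// Decode an arbitrary number of values with a fixed-size block decoder.
// `decompress_block` is a stand-in for decoding one BLOCK_LEN-value block,
// e.g. a call to `BitPacker1x::decompress` in the real crate.
const BLOCK_LEN: usize = 32; // matches BitPacker1x::BLOCK_LEN

fn decode(
    values: &[u8],
    length: usize,
    num_bits: u8,
    decompress_block: impl Fn(&[u8], u8) -> Vec<u32>,
) -> Vec<u32> {
    let block_bytes = num_bits as usize * BLOCK_LEN / 8;
    let total_blocks = (length + BLOCK_LEN - 1) / BLOCK_LEN;
    // Pad the input so the last partial block is full.
    let mut padded = values.to_vec();
    padded.resize(total_blocks * block_bytes, 0);
    let mut out = Vec::with_capacity(total_blocks * BLOCK_LEN);
    for chunk in padded.chunks(block_bytes) {
        out.extend(decompress_block(chunk, num_bits));
    }
    out.truncate(length); // drop the padding values
    out
}

fn main() {
    // Dummy 1-bit decoder for demonstration: each byte holds 8 flags, LSB first.
    let bit_decoder = |bytes: &[u8], _b: u8| -> Vec<u32> {
        bytes
            .iter()
            .flat_map(|&byte| (0..8).map(move |i| (byte >> i & 1) as u32))
            .collect()
    };
    let decoded = decode(&[0b0000_0101u8], 3, 1, bit_decoder);
    assert_eq!(decoded, vec![1, 0, 1]);
}
```

The same pad-then-truncate pattern applies to encoding: fill the last block with zeros (or with the last value, for sorted data) before compressing, and record the true length alongside the stream.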

64 bit version?

I'm using BitPacker4x, and it appears to compress only 32 bit integers. Is there a way to compress an array of 64 bit integers?
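The crate indeed only handles u32 blocks. One workaround sketch (not a crate feature) is to split each u64 into its low and high 32-bit halves and bitpack the two u32 streams separately; when values are small, the high stream packs down to almost nothing:

```rust
// Split each u64 into (low 32 bits, high 32 bits) so that the two u32
// streams can each be fed to a 32-bit bitpacker separately.
fn split_u64(values: &[u64]) -> (Vec<u32>, Vec<u32>) {
    values.iter().map(|&v| (v as u32, (v >> 32) as u32)).unzip()
}

// Recombine the two halves after decompression.
fn join_u64(lo: &[u32], hi: &[u32]) -> Vec<u64> {
    lo.iter()
        .zip(hi)
        .map(|(&l, &h)| (h as u64) << 32 | l as u64)
        .collect()
}

fn main() {
    let values = [7u64, (1u64 << 32) + 4, u64::MAX];
    let (lo, hi) = split_u64(&values);
    assert_eq!((lo[1], hi[1]), (4, 1));
    assert_eq!(join_u64(&lo, &hi), values);
}
```

For delta-compressed sorted u64s this split is less direct, since a carry can cross the 32-bit boundary; splitting after computing 64-bit deltas is one way around that.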

please use runtime dispatch via target_feature

It would be great if this crate didn't require using any compilation flags at all, and instead configured itself automatically at runtime based on what the running CPU supports. This process is documented in the std::arch module docs.

Some other nits:

  • The README has no link to docs.
  • The README does not make it clear that using this crate requires Rust nightly.
  • It would be great if the API of this crate were much simpler. I don't think a trait is needed at all. You could just provide the current trait's methods as either top-level functions or methods of a single type. With runtime dispatch, you could fall back to a scalar implementation, which would let you support Rust stable pretty easily, I think.

Alternative data types for output?

This crate is super useful, thanks for writing/maintaining it!

I don't know if this is feasible, but it would be nice if this crate could output to different types beyond u32. I needed to output to u8 to save space, and I ended up rewriting a lot of stuff for my purposes (I am only decompressing and always have the same width, so I could simplify things).

Having done so, I think it would be possible to add another level of macro magic here and output to u8 and u16 when those types are needed. I'm not too familiar with programming Rust macros, though; it might create a huge binary if it wasn't done carefully.

Just thought I'd put it out there as a potential enhancement, thanks again for the crate.

BitPacker4x incorrectly encodes bit width 32 sorted blocks

Easiest way to explain this is through a repro:

#[test]
fn test_delta_bit_width_32() {
    use bitpacking::BitPacker4x;
    let values = vec![i32::max_value() as u32 + 1; BitPacker4x::BLOCK_LEN];
    let bit_packer = BitPacker4x::new();
    let bit_width = bit_packer.num_bits_sorted(0, &values);
    assert_eq!(bit_width, 32);

    let mut block = vec![0u8; BitPacker4x::compressed_block_size(bit_width)];
    bit_packer.compress_sorted(0, &values, &mut block, bit_width);

    let mut decoded_values = vec![0x10101010; BitPacker4x::BLOCK_LEN];
    bit_packer.decompress_sorted(0, &block, &mut decoded_values, bit_width);

    assert_eq!(values, decoded_values);
}

fails with:

thread 'page::bitpacked::tests::test_delta_bit_width_32' panicked at 'assertion failed: `(left == right)`
  left: `[2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648, 2147483648]`,
 right: `[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`', data/src/page/bitpacked.rs:296:9

'unstable' Cargo feature does not exist

I was curious what benchmarks I would get, guided by the README...

❯ cargo +nightly bench --features unstable
error: Package `bitpacking v0.8.3 (/tmp/bitpacking)` does not have the feature `unstable`

Is this the current way:

cargo +nightly bench

Remove simd bitpacking implementations for high bit width

For sorted data, having a super fast implementation is not necessary above bitwidth = 8; we can fall back to the scalar implementation.
For unsorted data, I think we can also restrict ourselves to bitwidth <= 8, although this is a trade-off.
