Sorry for the delay. I'm not sure I understand the question. Of course you can use the same probabilistic model to independently compress/decompress several messages. And yes, in this case you only have to keep the model in memory once, since compression and decompression don't consume the model; they only need a reference to it. Admittedly, this is a bit obscured by the generic nature of the API: for example, the method `AnsCoder::encode_iid_symbols_reverse` takes a generic argument `model` whose type has to implement `EncoderModel`, so it may indeed seem like you'd have to provide a fresh entropy model every time. But there's a blanket implementation of `EncoderModel` for any reference `&M` where `M` implements `EncoderModel`, so you only need a single owned `EncoderModel` and can hand out as many shared references to it as you like. (Some small entropy models also implement `Copy`; for those, it's usually more performant to pass them by value.)
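To illustrate the blanket-implementation idiom in isolation, here is a generic sketch. This is not `constriction`'s actual source code; the trait and toy model below are simplified stand-ins invented for illustration:

```rust
// Generic sketch of the blanket-implementation idiom (NOT constriction's
// actual code; trait and model are simplified stand-ins).
trait EncoderModel {
    fn prob(&self, symbol: usize) -> f64;
}

// Blanket impl: a shared reference to any `EncoderModel` is itself an
// `EncoderModel`, so value-taking APIs also accept `&model`.
impl<'a, M: EncoderModel + ?Sized> EncoderModel for &'a M {
    fn prob(&self, symbol: usize) -> f64 {
        (**self).prob(symbol)
    }
}

// A toy model: uniform over `size` symbols.
struct Uniform {
    size: usize,
}

impl EncoderModel for Uniform {
    fn prob(&self, _symbol: usize) -> f64 {
        1.0 / self.size as f64
    }
}

// An API that takes the model by value, analogous to `encode_symbol`:
fn encode_with(model: impl EncoderModel, symbol: usize) -> f64 {
    model.prob(symbol)
}

fn main() {
    let model = Uniform { size: 4 }; // a single owned model ...
    // ... passed to the value-taking API any number of times via `&`:
    let p1 = encode_with(&model, 0);
    let p2 = encode_with(&model, 1);
    assert_eq!(p1, 0.25);
    assert_eq!(p2, 0.25);
    println!("ok");
}
```

Because the blanket impl forwards every trait method through the reference, the owned model is never moved or consumed, which is exactly why one model can serve arbitrarily many encode/decode calls.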
I'm attaching an example of a full compression/decompression round trip below. But in brief, if I understand correctly what you're trying to achieve, then your struct for the compressed representation of `Index` should probably look something like this:

```rust
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message (see full example below).
}
```
And there's nothing that holds you back from encoding or decoding each entry of `doc` independently, using the shared entropy model `probs` and the shared `alphabet` (see the full round-trip example below).
> From what I've seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.
I'm not sure I understand. Of course you have to provide the probabilities any time you encode or decode a symbol (in fact, you have to provide the entire entropy model, not just the probability of the specific symbol you're currently encoding or decoding). That's not a limitation of `constriction`; it's a fundamental theoretical limitation of source coding: one cannot losslessly compress data without a probabilistic model of the data source ("source coding theorem").
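For a concrete sense of what that theorem means here, the sketch below (a helper made up for this illustration, not part of `constriction`'s API) computes the Shannon entropy of the categorical model used in the full example; by the source coding theorem, no lossless code can use fewer bits per symbol on average under that model:

```rust
/// Shannon entropy (in bits per symbol) of a categorical distribution
/// given unnormalized counts. (Helper made up for this illustration.)
fn entropy_bits(counts: &[f64]) -> f64 {
    let total: f64 = counts.iter().sum();
    counts
        .iter()
        .filter(|&&c| c > 0.0)
        .map(|&c| {
            let p = c / total;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // The counts from the full example below (last entry is the EOF token):
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.];
    let h = entropy_bits(&counts);
    // No lossless code can average fewer than `h` bits per symbol under
    // this model; a well-tuned ANS coder gets very close to this bound.
    println!("lower bound: {:.3} bits/symbol", h);
    assert!(h > 3.0 && h < 4.0);
}
```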
Full Example
```rust
use std::collections::HashMap;

use constriction::{
    backends::Cursor,
    stream::{
        model::DefaultContiguousCategoricalEntropyModel, stack::DefaultAnsCoder, Decode, Encode,
    },
    UnwrapInfallible,
};

#[derive(Debug, PartialEq, Eq)]
struct UncompressedIndex {
    doc: Vec<String>,
}

#[derive(Debug)]
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message.
}

impl UncompressedIndex {
    fn compress(
        &self,
        probs: DefaultContiguousCategoricalEntropyModel,
        alphabet: Vec<char>,
    ) -> CompressedIndex {
        let inverse_alphabet = alphabet
            .iter()
            .enumerate()
            .map(|(index, &character)| (character, index))
            .collect::<HashMap<_, _>>();
        let doc = self
            .doc
            .iter()
            .map(|message| {
                let mut coder = DefaultAnsCoder::new();
                // Start with a special EOF symbol so that `CompressedIndex::decompress`
                // knows when to terminate:
                coder.encode_symbol(alphabet.len(), &probs).unwrap();
                // Then encode the message, character by character, in reverse order:
                for character in message.chars().rev() {
                    let char_index = *inverse_alphabet.get(&character).unwrap();
                    coder.encode_symbol(char_index, &probs).unwrap();
                }
                coder.into_compressed().unwrap_infallible()
            })
            .collect();

        CompressedIndex {
            doc,
            probs,
            alphabet,
        }
    }
}

impl CompressedIndex {
    fn decompress(&self) -> UncompressedIndex {
        let doc = self
            .doc
            .iter()
            .map(|data| {
                let mut coder =
                    DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();
                core::iter::from_fn(|| {
                    let symbol_id = coder.decode_symbol(&self.probs).unwrap();
                    // Returns `None` if `symbol_id` is the EOF token, which terminates the iterator:
                    self.alphabet.get(symbol_id).copied()
                })
                .collect()
            })
            .collect();

        UncompressedIndex { doc }
    }
}

#[test]
fn round_trip() {
    let uncompressed = UncompressedIndex {
        doc: vec!["Hello, World!".to_string(), "Goodbye.".to_string()],
    };
    let alphabet = vec![
        'H', 'e', 'l', 'o', ',', ' ', 'W', 'r', 'd', '!', 'G', 'b', 'y', '.',
    ];
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.]; // The last entry is for the EOF token.
    let probs =
        DefaultContiguousCategoricalEntropyModel::from_floating_point_probabilities(&counts)
            .unwrap();

    let compressed = uncompressed.compress(probs, alphabet);
    let reconstructed = compressed.decompress();
    assert_eq!(uncompressed, reconstructed);
}
```
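As an aside, the reason `compress` encodes each message in reverse (and pushes the EOF symbol first) is that ANS coders operate as a stack: the last symbol encoded is the first one decoded. The same ordering logic can be seen with a plain `Vec` used as a stack (a toy analogy, not `constriction` code):

```rust
fn main() {
    // ANS coders are last-in-first-out, so the EOF marker goes in first,
    // then the characters in reverse; decoding then pops them back out in
    // forward order. The same logic with a plain Vec as the stack:
    let mut stack = Vec::new();
    stack.push('#'); // EOF marker first ...
    for ch in "abc".chars().rev() {
        stack.push(ch); // ... then the message, in reverse
    }
    let decoded: String = std::iter::from_fn(|| stack.pop())
        .take_while(|&ch| ch != '#') // stop at the EOF marker
        .collect();
    assert_eq!(decoded, "abc");
    println!("{decoded}");
}
```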