Sorry for the delay. I'm not sure I understand the question. Of course you can use the same probabilistic model to independently compress/decompress several messages. And yes, in this case you only have to keep the model in memory once, since compression and decompression don't consume the model; they only need a reference to it. Admittedly, this is a bit obscured by the generic nature of the API: for example, the method `AnsCoder::encode_iid_symbols_reverse` takes a generic argument `model` whose type has to implement `EncoderModel`, so it may indeed seem like you'd have to provide a fresh entropy model every time. But there's a blanket implementation of `EncoderModel` for any reference `&M` where `M` implements `EncoderModel`, so you only need a single owned `EncoderModel` and can hand out as many shared references to it as you like. (Some small entropy models also implement `Copy`; for those, it's usually more performant to pass them by value.)
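To illustrate the blanket-implementation idiom in isolation, here is a generic sketch. This is not `constriction`'s actual source code; the trait and toy model below are simplified stand-ins invented for illustration:

```rust
// Generic sketch of the blanket-implementation idiom (NOT constriction's
// actual code; trait and model are simplified stand-ins).
trait EncoderModel {
    fn prob(&self, symbol: usize) -> f64;
}

// Blanket impl: a shared reference to any `EncoderModel` is itself an
// `EncoderModel`, so value-taking APIs also accept `&model`.
impl<'a, M: EncoderModel + ?Sized> EncoderModel for &'a M {
    fn prob(&self, symbol: usize) -> f64 {
        (**self).prob(symbol)
    }
}

// A toy model: uniform over `size` symbols.
struct Uniform {
    size: usize,
}

impl EncoderModel for Uniform {
    fn prob(&self, _symbol: usize) -> f64 {
        1.0 / self.size as f64
    }
}

// An API that takes the model by value, analogous to `encode_symbol`:
fn encode_with(model: impl EncoderModel, symbol: usize) -> f64 {
    model.prob(symbol)
}

fn main() {
    let model = Uniform { size: 4 }; // a single owned model ...
    // ... passed to the value-taking API any number of times via `&`:
    let p1 = encode_with(&model, 0);
    let p2 = encode_with(&model, 1);
    assert_eq!(p1, 0.25);
    assert_eq!(p2, 0.25);
    println!("ok");
}
```

Because the blanket impl forwards every trait method through the reference, the owned model is never moved or consumed, which is exactly why one model can serve arbitrarily many encode/decode calls.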
I'm attaching an example of a full compression/decompression round trip below. But in brief, if I understand correctly what you're trying to achieve, then your struct for the compressed representation of `Index` should probably look something like this:

```rust
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message (see full example below).
}
```
And there's nothing that holds you back from encoding or decoding each entry of `doc` independently, using the shared entropy model `probs` and the shared `alphabet` (see the full round-trip example below).
> From what I've seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.
I'm not sure I understand. Of course you have to provide the probabilities any time you encode or decode a symbol (in fact, you have to provide the entire entropy model, not just the probability of the specific symbol you're currently encoding or decoding). That's not a limitation of `constriction`; it's a fundamental theoretical limitation of source coding: one cannot losslessly compress data without a probabilistic model of the data source ("source coding theorem").
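For a concrete sense of what that theorem means here, the sketch below (a helper made up for this illustration, not part of `constriction`'s API) computes the Shannon entropy of the categorical model used in the full example; by the source coding theorem, no lossless code can use fewer bits per symbol on average under that model:

```rust
/// Shannon entropy (in bits per symbol) of a categorical distribution
/// given unnormalized counts. (Helper made up for this illustration.)
fn entropy_bits(counts: &[f64]) -> f64 {
    let total: f64 = counts.iter().sum();
    counts
        .iter()
        .filter(|&&c| c > 0.0)
        .map(|&c| {
            let p = c / total;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // The counts from the full example below (last entry is the EOF token):
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.];
    let h = entropy_bits(&counts);
    // No lossless code can average fewer than `h` bits per symbol under
    // this model; a well-tuned ANS coder gets very close to this bound.
    println!("lower bound: {:.3} bits/symbol", h);
    assert!(h > 3.0 && h < 4.0);
}
```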
Full Example
```rust
use std::collections::HashMap;

use constriction::{
    backends::Cursor,
    stream::{
        model::DefaultContiguousCategoricalEntropyModel, stack::DefaultAnsCoder, Decode, Encode,
    },
    UnwrapInfallible,
};

#[derive(Debug, PartialEq, Eq)]
struct UncompressedIndex {
    doc: Vec<String>,
}

#[derive(Debug)]
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message.
}

impl UncompressedIndex {
    fn compress(
        &self,
        probs: DefaultContiguousCategoricalEntropyModel,
        alphabet: Vec<char>,
    ) -> CompressedIndex {
        let inverse_alphabet = alphabet
            .iter()
            .enumerate()
            .map(|(index, &character)| (character, index))
            .collect::<HashMap<_, _>>();
        let doc = self
            .doc
            .iter()
            .map(|message| {
                let mut coder = DefaultAnsCoder::new();
                // Start with a special EOF symbol so that `CompressedIndex::decompress`
                // knows when to terminate:
                coder.encode_symbol(alphabet.len(), &probs).unwrap();
                // Then encode the message, character by character, in reverse order:
                for character in message.chars().rev() {
                    let char_index = *inverse_alphabet.get(&character).unwrap();
                    coder.encode_symbol(char_index, &probs).unwrap();
                }
                coder.into_compressed().unwrap_infallible()
            })
            .collect();

        CompressedIndex {
            doc,
            probs,
            alphabet,
        }
    }
}

impl CompressedIndex {
    fn decompress(&self) -> UncompressedIndex {
        let doc = self
            .doc
            .iter()
            .map(|data| {
                let mut coder =
                    DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();
                core::iter::from_fn(|| {
                    let symbol_id = coder.decode_symbol(&self.probs).unwrap();
                    // Returns `None` if `symbol_id` is the EOF token, which terminates the iterator:
                    self.alphabet.get(symbol_id).copied()
                })
                .collect()
            })
            .collect();

        UncompressedIndex { doc }
    }
}

#[test]
fn round_trip() {
    let uncompressed = UncompressedIndex {
        doc: vec!["Hello, World!".to_string(), "Goodbye.".to_string()],
    };
    let alphabet = vec![
        'H', 'e', 'l', 'o', ',', ' ', 'W', 'r', 'd', '!', 'G', 'b', 'y', '.',
    ];
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.]; // The last entry is for the EOF token.
    let probs =
        DefaultContiguousCategoricalEntropyModel::from_floating_point_probabilities(&counts)
            .unwrap();

    let compressed = uncompressed.compress(probs, alphabet);
    let reconstructed = compressed.decompress();
    assert_eq!(uncompressed, reconstructed);
}
```
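As an aside, the reason `compress` encodes each message in reverse (and pushes the EOF symbol first) is that ANS coders operate as a stack: the last symbol encoded is the first one decoded. The same ordering logic can be seen with a plain `Vec` used as a stack (a toy analogy, not `constriction` code):

```rust
fn main() {
    // ANS coders are last-in-first-out, so the EOF marker goes in first,
    // then the characters in reverse; decoding then pops them back out in
    // forward order. The same logic with a plain Vec as the stack:
    let mut stack = Vec::new();
    stack.push('#'); // EOF marker first ...
    for ch in "abc".chars().rev() {
        stack.push(ch); // ... then the message, in reverse
    }
    let decoded: String = std::iter::from_fn(|| stack.pop())
        .take_while(|&ch| ch != '#') // stop at the EOF marker
        .collect();
    assert_eq!(decoded, "abc");
    println!("{decoded}");
}
```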