nlfiedler / fastcdc-rs

FastCDC implementation in Rust

Home Page: https://crates.io/crates/fastcdc
License: MIT License
If we had a Slack channel or a Gitter room as an instant-messaging tool, we could communicate freely and promptly.
Hi there!
First, thank you very much for this awesome crate; I learned a lot reading the code, and the documentation is top-tier!
I would like to know if you are interested in adding SuperCDC (https://www.computer.org/csdl/proceedings-article/iccd/2022/618600a170/1JeFWRlp7gs) to this crate?
I've only glanced over the paper, but it reports throughput improvements of roughly 1.5x to 8-10x over FastCDC 2020.
If you are interested, please let me know how I can help.
I need to figure out how to use fastcdc::v2020::StreamCDC to obtain the same effect as ronomon's FastCDC::with_eof.
My goal is to rewrite the current puzzlefs implementation (based on fastcdc 1.0.5) which uses a custom streaming implementation: https://github.com/anuvu/puzzlefs/blob/master/builder/src/fastcdc_fs.rs
It would also address project-machine/puzzlefs#15
I have the following error when building:

```console
error[E0107]: missing generics for struct `fastcdc::v2020::StreamCDC`
   --> puzzlefs-lib/src/builder.rs:111:18
    |
111 |     mut chunker: StreamCDC,
    |                  ^^^^^^^^^ expected 1 generic argument
```
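A minimal sketch of one way to satisfy the missing generic argument, assuming the reader is a plain file (the actual reader type in puzzlefs will differ):

```rust
use fastcdc::v2020::StreamCDC;
use std::fs::File;

// Once StreamCDC is generic over R: Read, the signature must name the
// concrete reader type (or itself be generic over R).
fn process(chunker: StreamCDC<File>) {
    for result in chunker {
        let chunk = result.expect("chunking failed");
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}
```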
We need to chunk files stored in OSS in the cloud. The files may be large, and the chunking server may not have enough memory to hold a whole file during the chunking process, so we want to chunk the OSS input stream that the chunking process opens. The chunking process uses fastcdc-rs as a library to cut the stream into chunks, but the chunking logic lives inside fastcdc-rs.
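A minimal sketch of the streaming approach, assuming the v2020 StreamCDC with a local file standing in for the OSS stream and illustrative chunk sizes; only one chunk's worth of data is held in memory at a time:

```rust
use fastcdc::v2020::StreamCDC;
use std::fs::File;

fn main() -> std::io::Result<()> {
    // stand-in for the reader opened against the OSS object
    let source = File::open("large-object.bin")?;
    let chunker = StreamCDC::new(source, 4096, 16384, 65535);
    for result in chunker {
        let chunk = result.expect("chunking failed");
        // chunk.data holds the bytes of this chunk only
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
    Ok(())
}
```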
One limitation I'm running into is identifying chunks by a unique identifier without having to rehash the contents. I'm wondering if it would be useful to expose the hash value of the chunk in the Chunk struct? The cut function could potentially return (u32, usize) and then add that to the return of the iterator.

Thanks for the crate, it works well!
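As an illustration of the proposal (not the crate's current definition), the chunk descriptor could carry the hash computed at the cut point:

```rust
// Hypothetical Chunk exposing the gear hash so callers can identify
// chunks without rehashing the contents.
pub struct Chunk {
    pub hash: u32,     // the (u32, usize) pair's hash half, from cut
    pub offset: usize, // starting byte position within the source
    pub length: usize, // the pair's usize half: bytes in this chunk
}
```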
Hi there!

I looked into the code of v2020::<FastCDC as Iterator>::size_hint(), which returns the estimated total number of chunks. But the documentation for Iterator::size_hint() says:

size_hint() returns the bounds on the remaining length of the iterator.

So it may be better to return the estimated number of remaining chunks:
```rust
fn size_hint(&self) -> (usize, Option<usize>) {
    // NOTE: This intentionally returns the upper bound for both `size_hint`
    // values, as the upper bound doesn't actually seem to get used by `std`
    // and using the actual lower bound is practically guaranteed to require
    // a second capacity growth.
    let upper_bound = self.remaining / self.min_size;
    (upper_bound, Some(upper_bound))
}
```
I'd be happy to make a PR if it is reasonable. ❤️
I am pretty sure there exists research for a CDC algorithm named QuickCDC from at least 4 years ago (jrobhoward/quickcdc is seemingly based on it but references other research). However, I cannot find that paper anywhere, and the only results that turn up now are from 2021, which is too recent to be the same thing. If you have a link, please add a comment here. Thank you.
The connection between the FastCDC struct, the Iterator, and the type of its Item is not clear at all; it needs some explanation to tie it together. Add a code example to the FastCDC struct to show the basic operation.
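Something like this minimal sketch could work, assuming the v2020 module and illustrative chunk sizes (4 KB min, 16 KB average, 64 KB max):

```rust
use fastcdc::v2020::FastCDC;

fn main() {
    let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
    // FastCDC itself implements Iterator; each Item describes one chunk.
    let chunker = FastCDC::new(&contents, 4096, 16384, 65535);
    for chunk in chunker {
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}
```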
It seems like it would be a small optimization to supply Iterator::size_hint() to cut down on allocations when using collect(). Given the bounds on chunk sizes, this is straightforward to calculate: the number of chunks lies in [source.len() / max_size, source.len() / min_size].
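As a sketch, assuming the iterator tracks the remaining byte count (the field names are illustrative):

```rust
fn size_hint(&self) -> (usize, Option<usize>) {
    // fewest chunks when every chunk reaches max_size;
    // most chunks when every chunk stops at min_size
    let lower = self.remaining / self.max_size;
    let upper = self.remaining / self.min_size;
    (lower, Some(upper))
}
```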
One other consideration: from my little bit of digging into Vec's FromIterator and reading https://internals.rust-lang.org/t/is-size-hint-1-ever-used/8187/2, it doesn't seem like the upper bound is ever used, so my preference would be to "incorrectly" supply the upper bound as both the lower bound and the upper bound, and avoid the essentially guaranteed additional allocation if we start with the lower bound for capacity.

If this change seems reasonable, I'd be happy to make a PR.
A public hash field was added to the Chunk struct. This breaks code, since it is a change to the public interface. According to semver (https://semver.org/#summary), such changes should not go into a patch version.
This implementation copies another implementation that uses a right-shift in its Gear rollsum. This has problems I reported here.
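For context, a sketch of the two update rules as I understand them; the left-shift form is the one in the FastCDC paper, the right-shift variant is what the copied implementation uses, and the table name and signatures here are illustrative:

```rust
// Left-shift Gear update, as in the FastCDC paper: a byte's contribution
// shifts toward the high bits and ages out after 64 updates.
fn gear_left(hash: u64, byte: u8, gear: &[u64; 256]) -> u64 {
    (hash << 1).wrapping_add(gear[byte as usize])
}

// Right-shift variant: a byte's contribution drifts toward the low bits
// instead, which changes which fingerprint bits carry window history.
fn gear_right(hash: u64, byte: u8, gear: &[u64; 256]) -> u64 {
    (hash >> 1).wrapping_add(gear[byte as usize])
}
```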
In testing, we've observed that some chunks experience hash collisions. Is there any way to avoid this?
Hey!

The StreamCDC struct currently contains a Box<dyn Read>, which requires an additional heap allocation and potentially prevents some compiler optimisations by obscuring the reader type.

Is there a particular reason for this Box that isn't immediately obvious, or would you accept a change to parameterise StreamCDC over an R: Read, such that monomorphisation can occur?

Thanks for this crate! :)
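A sketch of the proposed parameterisation; the field names are assumptions for illustration, not the crate's actual definition:

```rust
use std::io::Read;

// Hypothetical generic StreamCDC: the reader is monomorphised per concrete
// type rather than boxed behind dyn Read.
pub struct StreamCDC<R: Read> {
    source: R, // previously Box<dyn Read>
    min_size: u32,
    avg_size: u32,
    max_size: u32,
}

impl<R: Read> StreamCDC<R> {
    pub fn new(source: R, min_size: u32, avg_size: u32, max_size: u32) -> Self {
        StreamCDC { source, min_size, avg_size, max_size }
    }
}
```

Callers that still want type erasure could instantiate StreamCDC<Box<dyn Read>>, so the generic form strictly generalises the boxed one.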
It would be great if the crate could take a (possibly async) stream of bytes, do its own buffering, and return a stream of chunks. Or perhaps something more general, like a struct implementing Read.
Hello! Thank you for your effort in maintaining this crate!
I'm currently using the ronomon version of the FastCDC algorithm and have been looking to switch over to v2020. I am having issues since the maximum consts for MIN, AVERAGE, and MAX have been lowered significantly, and the assertions confirming their sizes panic before I can recover: panicked at 'assertion failed: avg_size <= AVERAGE_MAX'
Here is the AVERAGE_MAX const per algorithm:

```rust
// ronomon
pub const AVERAGE_MAX: usize = 268_435_456;
// v2020
pub const AVERAGE_MAX: u32 = 4_194_304;
```
Any reason the average max can only be set to 4 MB? Are these values specified in the whitepaper? It's quite limiting, and the number of chunks generated from one file would increase dramatically if I switched algorithms.
Cheers,
ciehanski
In my benchmarks, I have not been able to reproduce the experimental result in the paper that the "rolling two bytes each time" optimization improves performance by 30-40%.
One benchmark is to use the example in this repo:
```console
hyperfine \
    --warmup 2 \
    --parameter-list version v2016,v2020 \
    'cargo run --example {version} -- --size 16384 test/fixtures/SekienAkashita.jpg'
```
hyperfine reports a ~50 ms duration on my machine for both versions, with a 1.00 +/- 0.02x performance difference.
I also ran a different benchmark on one of my own datasets, with 2 GB of cache artifacts generated by a bazel build. In my experimental setup I was using 8 threads, using fastcdc-rs (comparing v2016 vs v2020) to chunk all 2 GB of files in parallel: just reading the files and chunking them, and throwing away the result, nothing else.
With 8 threads enabled I found that v2020 was only 1.01x faster than v2016. With 1 thread enabled, v2020 is 1.07x faster. Two observations about these results:
Maybe the README could mention this data point as a caveat, to help manage expectations? Something like: "the authors claim a 30-40% performance improvement, but one benchmark using this Rust implementation showed a more modest improvement (roughly a 1% improvement for multi-threaded chunking, or a 7% improvement for single-threaded chunking)."
Here is my benchmark as a gist in case anyone would like to test with their own dataset. Copy it to src/main.rs in a clone of this repo, then run it like:

```console
cargo add walkdir crossbeam-channel && cargo build --release && hyperfine --warmup 2 --parameter-list version 2016,2020 './target/release/fastcdc DATA_DIR'
```

Replace DATA_DIR with your custom data directory, preferably one with many files exceeding the min chunk size.
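For the general shape of such a benchmark, here is a rough sketch (my reconstruction, not the gist itself), assuming the v2020 module, eight worker threads, and illustrative chunk sizes:

```rust
use fastcdc::v2020::FastCDC;
use std::thread;

fn main() {
    let dir = std::env::args().nth(1).expect("usage: fastcdc DATA_DIR");
    let (tx, rx) = crossbeam_channel::unbounded::<std::path::PathBuf>();

    // worker threads: read each file and chunk it, discarding the result
    let workers: Vec<_> = (0..8)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                while let Ok(path) = rx.recv() {
                    if let Ok(contents) = std::fs::read(&path) {
                        for _chunk in FastCDC::new(&contents, 4096, 16384, 65535) {}
                    }
                }
            })
        })
        .collect();

    // walk the data directory and feed file paths to the workers
    for entry in walkdir::WalkDir::new(dir).into_iter().filter_map(|e| e.ok()) {
        if entry.file_type().is_file() {
            tx.send(entry.into_path()).unwrap();
        }
    }
    drop(tx); // close the channel so the workers drain and exit

    for worker in workers {
        worker.join().unwrap();
    }
}
```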
The newly introduced AsyncStreamCDC needs unit tests, API documentation, and a working example. The first question is how to supply an appropriate AsyncRead to the chunker, since tokio's AsyncRead is not compatible.
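One hedged sketch of bridging the trait mismatch, assuming the chunker is written against the futures AsyncRead: tokio-util's compat adapter (behind its "compat" feature) wraps a tokio reader so it implements the futures trait. The constructor and as_stream usage follow my reading of the crate docs and may not match exactly:

```rust
use fastcdc::v2020::AsyncStreamCDC;
use futures::stream::StreamExt;
use tokio_util::compat::TokioAsyncReadCompatExt;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let file = tokio::fs::File::open("test/fixtures/SekienAkashita.jpg").await?;
    // .compat() adapts tokio::io::AsyncRead to futures::io::AsyncRead
    let mut chunker = AsyncStreamCDC::new(file.compat(), 4096, 16384, 65535);
    let stream = chunker.as_stream();
    let chunks: Vec<_> = stream.collect().await;
    for result in chunks {
        let chunk = result.expect("chunking failed");
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
    Ok(())
}
```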