nlfiedler / fastcdc-rs

FastCDC implementation in Rust

Home Page: https://crates.io/crates/fastcdc
License: MIT License
If we had a Slack channel or a Gitter room as an instant-messaging tool, we could communicate freely and promptly.
Hi there!
First, thank you very much for this awesome crate; I learned a lot reading the code, and the documentation is top-tier!
I would like to know if you are interested in adding SuperCDC (https://www.computer.org/csdl/proceedings-article/iccd/2022/618600a170/1JeFWRlp7gs) to this crate?
I've only glanced over the paper, but it reports throughput improvements of roughly 1.5x to 8-10x over FastCDC 2020.
If you are interested, please let me know how I can help.
I need to figure out how to use fastcdc::v2020::StreamCDC to obtain the same effect as ronomon's FastCDC::with_eof.
My goal is to rewrite the current puzzlefs implementation (based on fastcdc 1.0.5) which uses a custom streaming implementation: https://github.com/anuvu/puzzlefs/blob/master/builder/src/fastcdc_fs.rs
It would also address project-machine/puzzlefs#15
I have the following error when building:

```console
error[E0107]: missing generics for struct `fastcdc::v2020::StreamCDC`
   --> puzzlefs-lib/src/builder.rs:111:18
    |
111 |     mut chunker: StreamCDC,
    |                  ^^^^^^^^^ expected 1 generic argument
```
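A minimal sketch of one way to satisfy the missing generic argument, assuming the reader is a plain file (the actual reader type in puzzlefs will differ):

```rust
use fastcdc::v2020::StreamCDC;
use std::fs::File;

// Once StreamCDC is generic over R: Read, the signature must name the
// concrete reader type (or itself be generic over R).
fn process(chunker: StreamCDC<File>) {
    for result in chunker {
        let chunk = result.expect("chunking failed");
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}
```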
We need to chunk files stored in OSS in the cloud. The files may be large, and the chunking server may not have enough memory to hold a whole file during the chunking process, so we want to chunk the OSS input stream that the chunking process opens. The chunking process uses fastcdc-rs as a library to cut the stream into chunks, but the chunking logic lives inside fastcdc-rs.
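A minimal sketch of the streaming approach, assuming the v2020 StreamCDC with a local file standing in for the OSS stream and illustrative chunk sizes; only one chunk's worth of data is held in memory at a time:

```rust
use fastcdc::v2020::StreamCDC;
use std::fs::File;

fn main() -> std::io::Result<()> {
    // stand-in for the reader opened against the OSS object
    let source = File::open("large-object.bin")?;
    let chunker = StreamCDC::new(source, 4096, 16384, 65535);
    for result in chunker {
        let chunk = result.expect("chunking failed");
        // chunk.data holds the bytes of this chunk only
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
    Ok(())
}
```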
One limitation I'm running into is identifying chunks by a unique identifier without having to rehash the contents. I'm wondering if it would be useful to expose the hash value of the chunk in the Chunk struct? The cut function could potentially return (u32, usize) and then add that to the return of the iterator.

Thanks for the crate, it works well!
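As an illustration of the proposal (not the crate's current definition), the chunk descriptor could carry the hash computed at the cut point:

```rust
// Hypothetical Chunk exposing the gear hash so callers can identify
// chunks without rehashing the contents.
pub struct Chunk {
    pub hash: u32,     // the (u32, usize) pair's hash half, from cut
    pub offset: usize, // starting byte position within the source
    pub length: usize, // the pair's usize half: bytes in this chunk
}
```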
Hi there!

I looked into the code of v2020::<FastCDC as Iterator>::size_hint(), which returns the estimated total number of chunks. But the documentation for Iterator::size_hint() says:

size_hint() returns the bounds on the remaining length of the iterator.

So it may be better to return the estimated number of remaining chunks:
```rust
fn size_hint(&self) -> (usize, Option<usize>) {
    // NOTE: This intentionally returns the upper bound for both `size_hint`
    // values, as the upper bound doesn't actually seem to get used by `std`
    // and using the actual lower bound is practically guaranteed to require
    // a second capacity growth.
    let upper_bound = self.remaining / self.min_size;
    (upper_bound, Some(upper_bound))
}
```
I'd be happy to make a PR if it is reasonable. ❤️
I am pretty sure there exists research for a CDC algorithm named QuickCDC from at least 4 years ago (jrobhoward/quickcdc is seemingly based on it but references other research). However, I cannot find that paper anywhere, and the only results that turn up now are from 2021, which is too recent to be the same thing. If you have a link, please add a comment here. Thank you.
The connection between the FastCDC struct, the Iterator, and the type of its Item is not clear at all; it needs some explanation to tie it together. Add a code example to the FastCDC struct to show the basic operation.
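Something like this minimal sketch could work, assuming the v2020 module and illustrative chunk sizes (4 KB min, 16 KB average, 64 KB max):

```rust
use fastcdc::v2020::FastCDC;

fn main() {
    let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
    // FastCDC itself implements Iterator; each Item describes one chunk.
    let chunker = FastCDC::new(&contents, 4096, 16384, 65535);
    for chunk in chunker {
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}
```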
It seems like it would be a small optimization to supply Iterator::size_hint() to cut down on allocations when using collect(). Given the bounds on chunk sizes, this is straightforward to calculate: the number of chunks lies in [source.len() / max_size, source.len() / min_size].
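As a sketch, assuming the iterator tracks the remaining byte count (the field names are illustrative):

```rust
fn size_hint(&self) -> (usize, Option<usize>) {
    // fewest chunks when every chunk reaches max_size;
    // most chunks when every chunk stops at min_size
    let lower = self.remaining / self.max_size;
    let upper = self.remaining / self.min_size;
    (lower, Some(upper))
}
```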
One other consideration: from my little bit of digging into Vec's FromIterator and reading https://internals.rust-lang.org/t/is-size-hint-1-ever-used/8187/2, it doesn't seem like the upper bound is ever used, so my preference would be to "incorrectly" supply the upper bound as both the lower bound and the upper bound, and avoid the essentially guaranteed additional allocation if we start with the lower bound for capacity.

If this change seems reasonable, I'd be happy to make a PR.
A public hash field was added to the Chunk struct. This breaks code, since it is a change to the public interface. According to semver (https://semver.org/#summary), such changes should not go into a patch version.
This implementation copies another implementation that uses a right-shift in its Gear rollsum. This has problems I reported here.
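For context, a sketch of the two update rules as I understand them; the left-shift form is the one in the FastCDC paper, the right-shift variant is what the copied implementation uses, and the table name and signatures here are illustrative:

```rust
// Left-shift Gear update, as in the FastCDC paper: a byte's contribution
// shifts toward the high bits and ages out after 64 updates.
fn gear_left(hash: u64, byte: u8, gear: &[u64; 256]) -> u64 {
    (hash << 1).wrapping_add(gear[byte as usize])
}

// Right-shift variant: a byte's contribution drifts toward the low bits
// instead, which changes which fingerprint bits carry window history.
fn gear_right(hash: u64, byte: u8, gear: &[u64; 256]) -> u64 {
    (hash >> 1).wrapping_add(gear[byte as usize])
}
```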
In testing, we've observed that some chunks experience hash collisions. Is there any way to avoid this?
Hey!

The StreamCDC struct currently contains a Box<dyn Read>, which requires an additional heap allocation and potentially prevents some compiler optimisations by obscuring the reader type.

Is there a particular reason for this Box that isn't immediately obvious, or would you accept a change to parameterise StreamCDC over an R: Read, such that monomorphisation can occur?

Thanks for this crate! :)
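A sketch of the proposed parameterisation; the field names are assumptions for illustration, not the crate's actual definition:

```rust
use std::io::Read;

// Hypothetical generic StreamCDC: the reader is monomorphised per concrete
// type rather than boxed behind dyn Read.
pub struct StreamCDC<R: Read> {
    source: R, // previously Box<dyn Read>
    min_size: u32,
    avg_size: u32,
    max_size: u32,
}

impl<R: Read> StreamCDC<R> {
    pub fn new(source: R, min_size: u32, avg_size: u32, max_size: u32) -> Self {
        StreamCDC { source, min_size, avg_size, max_size }
    }
}
```

Callers that still want type erasure could instantiate StreamCDC<Box<dyn Read>>, so the generic form strictly generalises the boxed one.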
It would be great if the crate could take a (possibly async) stream of bytes, do its own buffering, and return a stream of chunks. Or perhaps something more general, like a struct implementing Read.
Hello! Thank you for your effort in maintaining this crate!
I'm currently using the ronomon version of the FastCDC algorithm and have been looking to switch over to v2020. I am having issues since the maximum consts for MIN, AVERAGE, and MAX have been lowered significantly, and the assertions confirming their sizes panic before I can recover: panicked at 'assertion failed: avg_size <= AVERAGE_MAX'
Here is the AVERAGE_MAX const per algorithm:

```rust
// ronomon
pub const AVERAGE_MAX: usize = 268_435_456;
// v2020
pub const AVERAGE_MAX: u32 = 4_194_304;
```
Any reason the average max can only be set to 4 MB? Are these values specified in the whitepaper? It's quite limiting, and the number of chunks generated from one file would increase dramatically if I switched algorithms.
Cheers,
ciehanski
In my benchmarks, I have not been able to reproduce the experimental result in the paper that the "rolling two bytes each time" optimization improves performance by 30-40%.
One benchmark is to use the example in this repo:
```console
hyperfine \
    --warmup 2 \
    --parameter-list version v2016,v2020 \
    'cargo run --example {version} -- --size 16384 test/fixtures/SekienAkashita.jpg'
```
hyperfine reports a ~50 ms duration on my machine for both versions, with a 1.00 +/- 0.02x performance difference.
I also ran a different benchmark on one of my own datasets, with 2 GB of cache artifacts generated by a bazel build. In my experimental setup I was using 8 threads, using fastcdc-rs (comparing v2016 vs v2020) to chunk all 2 GB of files in parallel: just reading the files and chunking them, and throwing away the result, nothing else.
With 8 threads enabled I found that v2020 was only 1.01x faster than v2016. With 1 thread enabled, v2020 is 1.07x faster. Two observations about these results:
Maybe the README could mention this data point as a caveat, to help manage expectations? Something like: "the authors claim a 30-40% performance improvement, but one benchmark using this Rust implementation showed a more modest improvement (roughly a 1% improvement for multi-threaded chunking, or a 7% improvement for single-threaded chunking)."
Here is my benchmark as a gist in case anyone would like to test with their own dataset. Copy it to src/main.rs in a clone of this repo, then run it like:

```console
cargo add walkdir crossbeam-channel && cargo build --release && hyperfine --warmup 2 --parameter-list version 2016,2020 './target/release/fastcdc DATA_DIR'
```

Replace DATA_DIR with your custom data directory, preferably one with many files exceeding the min chunk size.
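For the general shape of such a benchmark, here is a rough sketch (my reconstruction, not the gist itself), assuming the v2020 module, eight worker threads, and illustrative chunk sizes:

```rust
use fastcdc::v2020::FastCDC;
use std::thread;

fn main() {
    let dir = std::env::args().nth(1).expect("usage: fastcdc DATA_DIR");
    let (tx, rx) = crossbeam_channel::unbounded::<std::path::PathBuf>();

    // worker threads: read each file and chunk it, discarding the result
    let workers: Vec<_> = (0..8)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                while let Ok(path) = rx.recv() {
                    if let Ok(contents) = std::fs::read(&path) {
                        for _chunk in FastCDC::new(&contents, 4096, 16384, 65535) {}
                    }
                }
            })
        })
        .collect();

    // walk the data directory and feed file paths to the workers
    for entry in walkdir::WalkDir::new(dir).into_iter().filter_map(|e| e.ok()) {
        if entry.file_type().is_file() {
            tx.send(entry.into_path()).unwrap();
        }
    }
    drop(tx); // close the channel so the workers drain and exit

    for worker in workers {
        worker.join().unwrap();
    }
}
```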
The newly introduced AsyncStreamCDC needs unit tests, API documentation, and a working example. The first question is how to supply an appropriate AsyncRead to the chunker, since tokio's AsyncRead is not compatible.
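One hedged sketch of bridging the trait mismatch, assuming the chunker is written against the futures AsyncRead: tokio-util's compat adapter (behind its "compat" feature) wraps a tokio reader so it implements the futures trait. The constructor and as_stream usage follow my reading of the crate docs and may not match exactly:

```rust
use fastcdc::v2020::AsyncStreamCDC;
use futures::stream::StreamExt;
use tokio_util::compat::TokioAsyncReadCompatExt;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let file = tokio::fs::File::open("test/fixtures/SekienAkashita.jpg").await?;
    // .compat() adapts tokio::io::AsyncRead to futures::io::AsyncRead
    let mut chunker = AsyncStreamCDC::new(file.compat(), 4096, 16384, 65535);
    let stream = chunker.as_stream();
    let chunks: Vec<_> = stream.collect().await;
    for result in chunks {
        let chunk = result.expect("chunking failed");
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
    Ok(())
}
```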