
fastcdc-rs's People

Contributors

ariel-miculas, blackbeam, cdata, ciehanski, dristic, flokli, jorickert, joshtriplett, lokyinzhao, maobaolong, mzr, nagy, nlfiedler, rickvanprim, smoozilla, swatinem, titusz


fastcdc-rs's Issues

Version 3.1.0 breaks semver

I have the following error when building:

error[E0107]: missing generics for struct `fastcdc::v2020::StreamCDC`
   --> puzzlefs-lib/src/builder.rs:111:18
    |
111 |     mut chunker: StreamCDC,
    |                  ^^^^^^^^^ expected 1 generic argument

Semver link: https://semver.org/#summary
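
For context, a minimal sketch of the downstream change this forces (assuming the 3.1.0 API, in which StreamCDC gained a type parameter for its reader; the function below is a hypothetical caller, not code from this crate):

    use std::io::Read;
    use fastcdc::v2020::StreamCDC;

    // Under 3.0.x this compiled as `chunker: StreamCDC`; under 3.1.0 the
    // struct is generic over its reader, so callers must name the parameter.
    fn build<R: Read>(chunker: StreamCDC<R>) {
        for result in chunker {
            let chunk = result.expect("chunking failed");
            println!("chunk at offset {} ({} bytes)", chunk.offset, chunk.length);
        }
    }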

Req: consume bytes as a stream to avoid a big input buffer

background

We need to chunk files stored in OSS (cloud object storage). The files may be large, and the chunking server may not have enough memory to hold a whole file during processing. We would therefore like to chunk the OSS input stream directly: the chunking process opens the stream and uses fastcdc-rs as a library to cut it into chunks, but the chunking logic lives inside fastcdc-rs, which takes the full buffer as input.

request

  • If fastcdc-rs supplied a way to consume bytes as a stream, we could avoid the large buffer (see the sketch after this list)
  • Expose this library through JNI or JNR so that Java programs can also use it
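
A sketch of the kind of API being requested, modeled on the StreamCDC type that later versions of the crate provide (assuming fastcdc 3.x; the file path and chunk sizes are illustrative):

    use std::fs::File;
    use fastcdc::v2020::StreamCDC;

    fn main() -> std::io::Result<()> {
        // Open the large input as a plain Read; only one chunk's worth of
        // data is buffered at a time instead of the whole file.
        let source = File::open("large-input.bin")?; // hypothetical path
        let chunker = StreamCDC::new(source, 4096, 16384, 65535);
        for result in chunker {
            let chunk = result.expect("failed to read chunk");
            println!("offset={} length={}", chunk.offset, chunk.length);
        }
        Ok(())
    }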

Provide hash in chunk struct

One limitation I'm running into is identifying chunks by a unique identifier without having to rehash their contents. Would it be useful to expose the hash value of the chunk in the Chunk struct? The cut function could potentially return (u32, usize), the hash along with the cut point, and the iterator could then include it in its output.
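
A hypothetical sketch of the shape being proposed (field names are illustrative, not the crate's current definition):

    // Hypothetical: expose the rolling hash computed at the cut point so
    // callers don't need to rehash the chunk contents to identify it.
    pub struct Chunk {
        pub hash: u32,     // value of the rolling hash when the cut was made
        pub offset: usize, // starting byte position within the source
        pub length: usize, // number of bytes in the chunk
    }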

Thanks for the crate it works well!

Maybe size_hint() should return the estimated remaining chunk number

Hi there! 👋

I looked into the code of v2020::<FastCDC as Iterator>::size_hint(), which returns the estimated total number of chunks. But the documentation for Iterator::size_hint() says:

size_hint() returns the bounds on the remaining length of the iterator.

So it may be better to return the estimated number of remaining chunks:

    fn size_hint(&self) -> (usize, Option<usize>) {
        // NOTE: This intentionally returns the upper bound for both `size_hint`
        // values, as the upper bound doesn't actually seem to get used by `std`
        // and using the actual lower bound is practically guaranteed to require
        // a second capacity growth.
        let upper_bound = self.remaining / self.min_size;
        (upper_bound, Some(upper_bound))
    }

I'd be happy to make a PR if this seems reasonable. ❤️

Anyone have a link to the original QuickCDC paper?

I am pretty sure there exists research for a CDC algorithm named QuickCDC from at least 4 years ago (jrobhoward/quickcdc is seemingly based on it but references other research). However, I cannot find that paper anywhere, and the only results that turn up now are from 2021, which is too recent to be the same thing. If you have a link, please add a comment here. Thank you.

doc: make it clear the iterator returns Chunks

The connection between the FastCDC struct, its Iterator implementation, and the type of its Item is not clear at all; some explanation is needed to tie them together. Add a code example to the FastCDC struct docs showing the basic operation.
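
Something along these lines could work (a sketch assuming the v2020 module; the fixture path and sizes mirror the examples elsewhere in this repo):

    use fastcdc::v2020::FastCDC;

    // The FastCDC struct is itself an Iterator whose Item describes one
    // chunk (its offset and length within the source slice).
    let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
    let chunker = FastCDC::new(&contents, 4096, 16384, 65535);
    for chunk in chunker {
        println!("offset={} length={}", chunk.offset, chunk.length);
    }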

FastCDC Iterator size_hint

Supplying Iterator::size_hint() looks like a small optimization that would cut down on allocations when using collect(). Given the bounds on chunk sizes, it is straightforward to calculate: the number of chunks lies in [source.len() / max_size, source.len() / min_size].

One other consideration: from a little digging into Vec's FromIterator and reading https://internals.rust-lang.org/t/is-size-hint-1-ever-used/8187/2, the upper bound doesn't seem to ever be used, so my preference would be to "incorrectly" supply the upper bound as both the lower and the upper bound, avoiding the essentially guaranteed extra allocation we'd get by starting from the true lower bound for capacity.

If this change seems reasonable, I'd be happy to make a PR 🙂

Hash Collisions

In testing, we've observed hash collisions: distinct chunks can end up with the same hash value. Is there any way to avoid this?

Purpose of Box in StreamCDC struct

Hey!

The StreamCDC struct currently contains a Box<dyn Read>, which requires an additional heap allocation and potentially prevents some compiler optimisations by obscuring the reader type.

Is there a particular reason for this Box that isn't immediately obvious, or would you accept a change to parameterise StreamCDC over an R: Read, such that monomorphisation can occur?
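
A sketch of the proposed shape (illustrative only, with most internal state elided):

    use std::io::Read;

    // Making the reader a type parameter lets the compiler monomorphise
    // per concrete reader and inline `read` calls, instead of dispatching
    // through the vtable of a Box<dyn Read>.
    pub struct StreamCDC<R: Read> {
        source: R,
        min_size: u32,
        avg_size: u32,
        max_size: u32,
        // ... remaining chunker state elided ...
    }

    impl<R: Read> StreamCDC<R> {
        pub fn new(source: R, min_size: u32, avg_size: u32, max_size: u32) -> Self {
            StreamCDC { source, min_size, avg_size, max_size }
        }
    }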

Thanks for this crate! :)

[feature] Support stream of bytes

Would be great if the crate could take a (possibly async) stream of bytes, do its own buffering, and return a stream of chunks. Or perhaps something more general, like accepting a struct implementing Read.

Extremely low MIN, AVG, MAX on v2020

Hello! Thank you for your effort in maintaining this crate!

I'm currently using the ronomon version of the FastCDC algorithm and have been looking to switch over to v2020. I'm running into issues because the maximum allowed values for the MIN, AVERAGE, and MAX size constants have been lowered significantly, and the assertions checking them panic before I can recover: panicked at 'assertion failed: avg_size <= AVERAGE_MAX'

Here is the AVERAGE_MAX const per algorithm:

    Ronomon: pub const AVERAGE_MAX: usize = 268_435_456;
    v2020:   pub const AVERAGE_MAX: u32 = 4_194_304;

Is there any reason the average max can only be set to 4 MB? Are these values specified in the whitepaper? It's quite limiting, and the number of chunks generated from a single file would increase dramatically if I switched algorithms.

Cheers,
ciehanski

v2020 version is not 30-40% faster than v2016 version?

In my benchmarks, I have not been able to reproduce the experimental result in the paper that the "rolling two bytes each time" optimization improves performance by 30-40%.

One benchmark is to use the example in this repo:

hyperfine \
  --warmup 2 \
  --parameter-list version v2016,v2020 \
  'cargo run --example {version} -- --size 16384 test/fixtures/SekienAkashita.jpg'

hyperfine reports ~50ms duration on my machine for both versions, with 1.00 +/- 0.02x performance difference.

I also ran a different benchmark on one of my own datasets, with 2 GB of cache artifacts generated by a bazel build. In my experimental setup I was using 8 threads, and using fastcdc-rs (comparing v2016 vs v2020) to chunk all 2GB of files in parallel (just reading the files and chunking them, and throwing away the result - nothing else).

With 8 threads enabled I found that v2020 was only 1.01X faster than v2016. With 1 thread enabled I see that v2020 is 1.07X faster. Two observations about these results:

  • The best case (7% faster single-threaded performance) is much less than 30-40%
  • Multi-threaded performance (in which I saw only 1% improvement) is probably more relevant for a real-world data deduplication system

Maybe the README could mention this data point as a caveat, to help manage expectations? Something like "the authors claim a 30-40% performance improvement, but one benchmark using this rust implementation showed a more modest improvement (roughly a 1% improvement for multi-threaded chunking, or a 7% improvement for single-threaded chunking)."

Here is my benchmark as a gist in case anyone would like to test with their own dataset. Copy it to src/main.rs in a clone of this repo, then run it like cargo add walkdir crossbeam-channel && cargo build --release && hyperfine --warmup 2 --parameter-list version 2016,2020 './target/release/fastcdc DATA_DIR' -- replace DATA_DIR with your custom data directory, preferably one with many files exceeding the min chunk size.

Test, example, API docs for AsyncStreamCDC

The newly introduced AsyncStreamCDC needs unit tests, API documentation, and a working example. The first question is how to supply an appropriate AsyncRead to the chunker since tokio is not compatible.
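
For the example, one possible shape under the tokio feature (a sketch only; it assumes AsyncStreamCDC is constructed like its blocking counterpart and exposes chunks via as_stream(), which should be checked against the actual API):

    use fastcdc::v2020::AsyncStreamCDC;
    use futures_util::{pin_mut, stream::StreamExt};

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        // tokio::fs::File implements tokio::io::AsyncRead, which the
        // chunker accepts when the crate is built with the `tokio` feature.
        let source = tokio::fs::File::open("test/fixtures/SekienAkashita.jpg").await?;
        let mut chunker = AsyncStreamCDC::new(source, 4096, 16384, 65535);
        let stream = chunker.as_stream();
        pin_mut!(stream);
        while let Some(result) = stream.next().await {
            let chunk = result.expect("chunking failed");
            println!("offset={} length={}", chunk.offset, chunk.length);
        }
        Ok(())
    }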
