
Comments (4)

chrschng commented on June 15, 2024

Nice catch, thanks! That's a bug indeed. Another issue with this: the input can never end exactly on a chunk boundary that is > cfg_.min_size. The gear hash is never stored; the chunks are identified based on their BLAKE3 hash, so there is no danger of corruption when fixing this. The only mildly negative effect is that for cdc_stream, the cache will be invalidated because all chunk boundaries change, but it's not a big deal.

Feel free to send me a pull request. If not, now that I look at the code again after a long time, I see a couple of other issues that I'd like to address, so I could also fix that bug myself.

Note that you need to accept the Google Contributor License Agreement before we can approve your contribution:
https://cla.developers.google.com/


dbaarda commented on June 15, 2024

I have a few other things related to the fastcdc/gear implementation that I'll send a pull request for soon, but this particular bug/fix is tiny so I'll probably send a separate fix for it first.

The other stuff is related to my research https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst, which showed that the "normalization" FastCDC does is actually worse than simple exponential chunking, and I see that you normalize it even more than FastCDC did.

It turns out the deduplication wins FastCDC reported for normalized chunking were an artefact of a smaller average chunk size. Any "normalizing" of the chunk-size distribution actually hurts deduplication because it makes the chunk boundaries more dependent on the previous chunk boundary, which messes with the re-synchronisation after insert/delete deltas. If you use the same average chunk size and choose your min/max settings right, a simple exponential chunker gets better deduplication.
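
For clarity, by "simple exponential chunker" I mean a gear-hash chunker with a single fixed mask that is only checked after min_size, plus a hard cut at max_size. A minimal sketch of the idea (the function name and gear table here are made up for illustration; this is not the cdc-file-transfer code):

    #include <cstddef>
    #include <cstdint>

    // Hypothetical 256-entry table of random 64-bit values.
    extern const uint64_t kGearTable[256];

    // One fixed mask, applied only past min_size, with a forced cut at
    // max_size. Chunk lengths beyond min_size then follow a geometric
    // (discrete exponential) distribution with mean 1/p, where
    // p = 2^-popcount(mask) is the per-byte probability of a boundary.
    size_t FindChunkBoundary(const uint8_t* data, size_t size,
                             size_t min_size, size_t max_size, uint64_t mask) {
      if (size <= min_size) return size;
      if (size > max_size) size = max_size;
      uint64_t hash = 0;
      for (size_t i = min_size; i < size; ++i) {
        hash = (hash << 1) + kGearTable[data[i]];
        if ((hash & mask) == 0) return i + 1;  // content-defined boundary
      }
      return size;  // forced boundary at max_size (or end of input)
    }

Everything else (splitting the stream, hashing the chunks) stays the same; the only difference from a normalized chunker is that the mask never changes with position.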

I was going to file another bug related to that, but I thought I'd whip up some code and test it all first.

BTW, I also work for Google, but stumbled onto this during my holidays doing hobby non-work stuff. I dunno if I technically needed to, but I signed the CLA using my personal gmail/github account details.


chrschng commented on June 15, 2024

Interesting, I might take a look when I find some time.

Note that for our use-case, deduplication was not the only requirement we had. First and foremost, I implemented FastCDC for cdc_stream to support playing a game without uploading it to the cloud first. The goal was to utilize the bandwidth well and reduce the number of round-trips. This works best with fixed chunk sizes, as that gives predictable latencies.

My implementation of FastCDC is a compromise between predictable chunk sizes and deduplication. That's why I introduced multiple stages and bit masks to get a stronger normalization than the original algorithm proposed. The application of FastCDC in cdc_sync came as an afterthought. We never optimized the implementation for the rsync use-case, though we could probably just use fewer mask_stages for cdc_sync to reduce the normalization.
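
For context, the general shape of that staged normalization is roughly the following (a sketch only, with made-up names and just two stages; the actual cdc_stream implementation differs): require a harder mask while the chunk is still shorter than the target size, and switch to easier masks past it, which pulls the size distribution in towards avg_size.

    #include <cstddef>
    #include <cstdint>

    extern const uint64_t kGearTable[256];  // hypothetical gear table

    // Harder mask (more bits set) before the target size, easier mask after
    // it. Adding more stages with a bigger spread between the masks squeezes
    // the chunk-size distribution more tightly around avg_size.
    size_t FindNormalizedBoundary(const uint8_t* data, size_t size,
                                  size_t min_size, size_t avg_size,
                                  size_t max_size,
                                  uint64_t mask_hard, uint64_t mask_easy) {
      if (size <= min_size) return size;
      if (size > max_size) size = max_size;
      uint64_t hash = 0;
      for (size_t i = min_size; i < size; ++i) {
        hash = (hash << 1) + kGearTable[data[i]];
        const uint64_t mask = (i < avg_size) ? mask_hard : mask_easy;
        if ((hash & mask) == 0) return i + 1;  // content-defined boundary
      }
      return size;  // forced boundary at max_size
    }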


dbaarda commented on June 15, 2024

Ahh... interesting, so the normal distribution of chunk sizes was an explicit objective/requirement.

Note that in my testing I came up with an even "better chunk normalizer" that gives an almost perfect normal distribution (it's actually a Weibull distribution) and is much simpler to implement than FastCDC's "NC" normalizer. It was my first experiment to improve on FastCDC, and I was surprised when it had even worse deduplication. This led me to test a whole bunch of things that clearly showed any sort of normalization makes deduplication worse.
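
(For anyone wondering how an "almost normal" distribution ends up being a Weibull: it's the standard hazard-rate fact. If the per-byte probability of declaring a cut grows as a power of the offset x into the chunk, roughly h(x) ∝ x^(k-1), then the chunk lengths satisfy

    P(len > x) = exp(-(x/λ)^k)

i.e. a Weibull with shape k and some scale λ that sets the average chunk size; a constant per-byte probability is the k=1 case, which is the plain exponential. Shape values around 3-4 give the nearly symmetric, normal-looking curve. That's the general connection; exactly how you make the cut probability grow is a separate design choice.)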

If your requirements are tightly constrained chunk sizes, then "Regression Chunking" might be worth using. It improves deduplication when you have a small max_size, at the cost of some reduced speed benefits from cut-point-skipping. It hardly sacrifices any deduplication with max_size=2*avg_size, and even works OK with max_size=1.5*avg_size.
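
In case it's useful, the core of Regression Chunking is small. Roughly (a from-memory sketch with made-up names, not taken from any particular implementation): scan with the strongest mask as usual, but also remember the most recent position that matched each progressively weaker mask, and if max_size is reached without a strongest-mask match, fall back ("regress") to the best weaker match instead of cutting at the arbitrary max_size position.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    extern const uint64_t kGearTable[256];  // hypothetical gear table

    // 'masks' is ordered strongest (most bits) to weakest. masks[0] is the
    // normal boundary condition; the rest are fallbacks used only when no
    // masks[0] boundary is found before max_size.
    size_t FindRegressionBoundary(const uint8_t* data, size_t size,
                                  size_t min_size, size_t max_size,
                                  const std::vector<uint64_t>& masks) {
      if (size <= min_size) return size;
      if (size > max_size) size = max_size;
      std::vector<size_t> candidate(masks.size(), 0);  // 0 = no match yet
      uint64_t hash = 0;
      for (size_t i = min_size; i < size; ++i) {
        hash = (hash << 1) + kGearTable[data[i]];
        if ((hash & masks[0]) == 0) return i + 1;  // normal boundary
        for (size_t m = 1; m < masks.size(); ++m) {
          if ((hash & masks[m]) == 0) candidate[m] = i + 1;
        }
      }
      // No strongest-mask boundary before max_size: regress to the candidate
      // from the strongest weaker mask that did fire.
      for (size_t m = 1; m < masks.size(); ++m) {
        if (candidate[m] != 0) return candidate[m];
      }
      return size;  // nothing matched at any level: forced cut at max_size
    }

The cut only moves back from max_size when it has to, so most boundaries are still the normal content-defined ones.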

