Comments (4)
Nice catch, thanks! That's a bug indeed. Another issue with this: the input can never end exactly on a chunk boundary that is > cfg_.min_size. The gear hash is never stored; the chunks are identified by their BLAKE3 hash, so there is no danger of corruption when fixing this. The only mildly negative effect is that for cdc_stream, the cache will be invalidated because all chunk boundaries change, but that's not a big deal.
Feel free to send me a pull request. If not, now that I look at the code again after a long time, I see a couple of other issues that I'd like to address, so I could also fix that bug myself.
Note that you need to accept the Google Contributor's License Agreement before we can approve your contribution:
https://cla.developers.google.com/
from cdc-file-transfer.
I have a few other things related to the fastcdc/gear implementation that I'll send a pull request for soon, but this particular bug/fix is tiny so I'll probably send a separate fix for it first.
The other stuff is related to my research (https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst), which showed that the "normalization" FastCDC does is actually worse than simple exponential chunking, and I see that you normalize even more than FastCDC did.
It turns out the deduplication wins FastCDC reported for normalized chunking were an artefact of a smaller average chunk size. Any "normalizing" of the chunk-size distribution actually hurts deduplication because it makes each chunk boundary more dependent on the previous chunk boundary, which messes with the re-synchronisation after insert/delete deltas. If you use the same average chunk size and choose your min/max settings right, a simple exponential chunker gets better deduplication.
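For readers following along, here is a minimal sketch of what a "simple exponential chunker" might look like: one fixed cut condition applied at every position past min_size, so that chunk sizes beyond min_size follow a roughly geometric (exponential) distribution. The gear table, the function name `exponential_chunks`, and the parameter names are illustrative assumptions, not code from cdc-file-transfer or from the linked research.

```python
import random

MASK64 = (1 << 64) - 1
_rng = random.Random(0)  # fixed seed so the illustrative gear table is reproducible
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # hypothetical table, one 64-bit value per byte


def exponential_chunks(data: bytes, min_size: int, avg_bits: int, max_size: int):
    """Yield chunks, cutting wherever the top avg_bits bits of a gear hash are zero.

    The cut condition is the same at every position past min_size, so the
    only state carried from the previous boundary is the min_size skip and
    the max_size cap.
    """
    shift = 64 - avg_bits
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64  # gear rolling hash update
        size = i + 1 - start
        if size < min_size:
            continue  # never cut before min_size
        if (h >> shift) == 0 or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing chunk, may be shorter than min_size
```

With `avg_bits=8`, a cut point is expected about once every 256 bytes past `min_size`. A real implementation would also skip hashing most of the first `min_size` bytes; this sketch hashes them for simplicity.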
I was going to file another bug related to that, but I thought I'd whip up some code and test it all first.
BTW, I also work for Google, but stumbled onto this during my holidays doing hobby non-work stuff. I dunno if I technically needed to but I signed the CLA using my personal gmail/github account details.
Interesting, I might take a look when I find some time.
Note that for our use-case, deduplication was not the only requirement that we had. First and foremost, I implemented FastCDC for cdc_stream to support playing a game without uploading it to the cloud first. The goal was to utilize the bandwidth well and reduce the number of round-trips. This works best with fixed chunk sizes as that gives predictable latencies.
My implementation of FastCDC is a compromise between predictable chunk sizes and deduplication. That's why I introduced multiple stages and bit masks to get a stronger normalization than the original algorithm proposed. The application of FastCDC in cdc_sync came as an afterthought. We never optimized the implementation for the rsync use-case, though we probably could just use fewer mask_stages for cdc_sync to reduce the normalization.
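To make the idea of "multiple stages and bit masks" concrete, here is a rough sketch of how a staged cut condition could be chosen from the current chunk size: the number of hash bits that must be zero starts high (cuts are unlikely) and drops stage by stage (cuts become likely), which squeezes chunk sizes toward the middle. The function `stage_bits`, the equal-width segments, and the bit schedule are assumptions made for illustration. They are not the actual stage layout used by cdc-file-transfer.

```python
def stage_bits(size: int, min_size: int, max_size: int, avg_bits: int, stages: int) -> int:
    """Return how many hash bits must be zero to cut a chunk of the given size.

    Hypothetical schedule: [min_size, max_size) is split into 2 * stages
    equal segments. The first `stages` segments require avg_bits + stages
    down to avg_bits + 1 bits, the remaining ones avg_bits - 1 down to
    avg_bits - stages bits.
    """
    segments = 2 * stages
    span = max_size - min_size
    s = (size - min_size) * segments // span
    s = max(0, min(segments - 1, s))  # clamp sizes outside [min_size, max_size)
    if s < stages:
        return avg_bits + (stages - s)
    return avg_bits - (s - stages + 1)
```

With `stages=1` this reduces to a two-mask scheme in the spirit of FastCDC's normalized chunking (one harder mask, then one easier mask). Using fewer stages weakens the normalization, which is the knob mentioned above for the cdc_sync case.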
Ahh... interesting, so the normal distribution of chunk sizes was an explicit objective/requirement.
Note that in my testing I came up with an even "better chunk normalizer" that gives an almost perfect normal distribution (it's actually a Weibull distribution) and is much simpler to implement than FastCDC's "NC" normalizer. It was my first experiment to improve on FastCDC, and I was surprised when it had even worse deduplication. This led me to test a whole bunch of things that clearly showed any sort of normalization makes deduplication worse.
If your requirements are tightly constrained chunk sizes, then "Regression Chunking" might be worth using. It improves deduplication when you have a small max_size, at the cost of some reduced speed benefits from cut-point-skipping. It hardly sacrifices any deduplication with max_size=2*avg_size, and even works OK with max_size=1.5*avg_size.
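As a rough illustration of the regression idea: while scanning for a full-strength cut point, the chunker also remembers the last position that satisfied a weaker condition (fewer zero bits). If max_size is reached without a full-strength match, it regresses to the strongest remembered candidate instead of cutting blindly at max_size. The function `regression_chunks`, the gear table, and the `levels` parameter are assumptions for this sketch, not code from cdc-file-transfer.

```python
import random

MASK64 = (1 << 64) - 1
_rng = random.Random(0)  # fixed seed so the illustrative gear table is reproducible
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # hypothetical gear table


def regression_chunks(data: bytes, min_size: int, avg_bits: int, max_size: int, levels: int = 2):
    """Split data into chunks, falling back to weaker cut points at max_size.

    Requires avg_bits - levels >= 1.
    """
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_size, n)
        h = 0
        cut = None
        # fallback[k]: last position whose top (avg_bits - k) hash bits were zero
        fallback = [None] * (levels + 1)
        for i in range(start, end):
            h = ((h << 1) + GEAR[data[i]]) & MASK64
            if i + 1 - start < min_size:
                continue
            for k in range(levels + 1):
                if (h >> (64 - (avg_bits - k))) == 0:
                    if k == 0:
                        cut = i + 1  # full-strength cut point
                    else:
                        fallback[k] = i + 1  # weaker candidate, keep the latest
                    break
            if cut is not None:
                break
        if cut is None:
            if end - start == max_size:
                # Hit max_size: regress to the strongest weaker candidate,
                # or cut at max_size if there is none.
                cut = next((p for p in fallback[1:] if p is not None), end)
            else:
                cut = end  # ran out of input: emit the remainder
        chunks.append(data[start:cut])
        start = cut  # scanning restarts here, so bytes after a regressed cut are re-hashed
    return chunks
```

The re-scan after a regressed cut is where the lost cut-point-skipping speed mentioned above comes from: bytes between the fallback position and max_size get hashed twice.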