
zhaofengli commented on July 23, 2024
Deduplication Strategies

from attic.

Comments (3)

zhaofengli commented on July 23, 2024

FastCDC-based chunking has been added in e8f9f3c [1]. In this new model, NARs are backed by a sequence of content-addressed chunks in the Global Chunk Store. Newly uploaded NARs are split into chunks with FastCDC, and only new chunks are uploaded to the storage backend. NARs that existed before chunking was introduced are converted to a single chunk each.
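To make the model above concrete, here is a minimal sketch of content-defined chunking feeding a content-addressed store. It is not Attic's code and not real FastCDC: the gear-style rolling hash is a simplified stand-in, and the names `chunk`, `ChunkStore`, `upload_nar`, `assemble` and the tiny size parameters are all assumptions chosen for illustration.

```python
import hashlib
import random

# Hypothetical sizes for illustration only (Attic's real defaults are far larger).
MIN_SIZE, AVG_MASK, MAX_SIZE = 64, (1 << 8) - 1, 1024

# Fixed pseudo-random table for a gear-style rolling hash.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes):
    """Yield content-defined chunks (a simplified stand-in for FastCDC)."""
    start, h = 0, 0
    for i, b in enumerate(data):
        # The 32-bit mask means only the most recent bytes influence h,
        # so cut points depend on local content, not on absolute offsets.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & AVG_MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


class ChunkStore:
    """Toy global chunk store keyed by the SHA-256 of each chunk."""

    def __init__(self):
        self.chunks = {}

    def upload_nar(self, nar: bytes):
        """Store a NAR; return (ordered chunk digests, count of new chunks)."""
        refs, new = [], 0
        for c in chunk(nar):
            digest = hashlib.sha256(c).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = c  # only unseen chunks are "uploaded"
                new += 1
            refs.append(digest)
        return refs, new

    def assemble(self, refs):
        """Reassembly is a plain concatenation of chunks."""
        return b"".join(self.chunks[d] for d in refs)


if __name__ == "__main__":
    rng = random.Random(1)
    v1 = bytes(rng.getrandbits(8) for _ in range(50_000))
    v2 = v1[:25_000] + b"patched" + v1[25_000:]  # small change in the middle
    store = ChunkStore()
    refs1, new1 = store.upload_nar(v1)
    refs2, new2 = store.upload_nar(v2)
    print(f"v1: {len(refs1)} chunks, {new1} new")
    print(f"v2: {len(refs2)} chunks, {new2} new")
```

Because cut points are chosen from local content, every chunk that ends before the edit is byte-identical in both versions and is not uploaded again.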

This works reasonably well for huge unfree paths that are rebuilt without a version change (e.g., Zoom and VSCode), even with a very large chunk size. I have some simple numbers here and will post more later. It also works when there are some differences between versions (e.g., llvm-13.0.0-lib -> llvm-13.0.1-lib). The default average chunk size is 128 KiB, and I will add atticadm test-chunking so chunk sizes can be easily fine-tuned.

I'm leaving this issue open so other approaches can be explored.

Some relevant FAQs:

Why chunk NARs instead of individual files?

In the current design, chunking is applied to the entire uncompressed NAR file instead of individual constituent files in the NAR. Big NARs that benefit the most from chunk-based deduplication (e.g., VSCode, Zoom) often have hundreds or thousands of small files. During NAR reassembly, it's often uneconomical or impractical to fetch thousands of files to reconstruct the NAR in a scalable way. By chunking the entire NAR, it's possible to configure the average chunk size to a larger value, ignoring file boundaries and lumping small files together. This is also the approach casync has taken.
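A rough back-of-the-envelope sketch of the trade-off described above: with NAR-level chunking, the number of objects to fetch during reassembly depends on the NAR size and the configured average chunk size, not on the number of files. The NAR size and file count below are made-up numbers for illustration, and `expected_chunks` is a hypothetical helper.

```python
import math

KiB, MiB = 1024, 1024 * 1024


def expected_chunks(nar_size: int, avg_chunk_size: int) -> int:
    """Approximate number of chunks for a NAR at a given average chunk size."""
    return math.ceil(nar_size / avg_chunk_size)


if __name__ == "__main__":
    nar_size = 400 * MiB   # hypothetical large NAR
    file_count = 5000      # hypothetical number of small files inside it
    print(f"per-file chunking: at least {file_count} objects to fetch")
    for avg in (128 * KiB, 1 * MiB, 4 * MiB):
        n = expected_chunks(nar_size, avg)
        print(f"NAR-level, avg {avg // KiB} KiB: about {n} chunks")
```

Per-file chunking can never fetch fewer objects than there are non-empty files, whereas the NAR-level count shrinks as the average chunk size grows.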

You may have heard that the Tvix store protocol chunks individual files instead of the NAR. The design of Attic is driven by the desire to effectively utilize existing platforms with practical limitations [2], while looking forward to the future.

Why not just erase store path references?

At first glance, erasing store path references and storing them separately seems easy, but it's actually difficult to do in practice. It makes NAR assembly difficult (no longer a simple concatenation of chunks), and the immediate benefits from doing so alone are minimal. Many files have other differences, like a minor version upgrade or .note.gnu.build-id. The files that this approach works with (mostly small paths like configurations) don't have much overhead to begin with. Therefore I opted for the approach that does provide considerable gains and is simple to implement to start with.
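For readers unfamiliar with the idea, here is a minimal sketch of what erasing store path references could look like, and why reassembly stops being a plain concatenation. This is not something Attic implements; `erase_refs` and `restore_refs` are hypothetical helpers, and the pattern assumes references appear as `/nix/store/<32-character hash>-` in the data.

```python
import re

# Nix store hashes are 32 characters from the Nix base-32 alphabet.
STORE_REF = re.compile(rb"/nix/store/([0-9abcdfghijklmnpqrsvwxyz]{32})-")


def erase_refs(data: bytes):
    """Blank out store path hashes; return (erased data, [(offset, hash)])."""
    refs = []

    def blank(m):
        refs.append((m.start(1), m.group(1)))
        # Same length as the original, so recorded offsets stay valid.
        return b"/nix/store/" + b"0" * 32 + b"-"

    return STORE_REF.sub(blank, data), refs


def restore_refs(erased: bytes, refs):
    """The extra patching step that reassembly would need."""
    buf = bytearray(erased)
    for offset, h in refs:
        buf[offset:offset + 32] = h
    return bytes(buf)


if __name__ == "__main__":
    old = b"#!/nix/store/" + b"a" * 32 + b"-bash/bin/sh\necho hi\n"
    new = b"#!/nix/store/" + b"b" * 32 + b"-bash/bin/sh\necho hi\n"
    erased_old, refs_old = erase_refs(old)
    erased_new, refs_new = erase_refs(new)
    print(erased_old == erased_new)                  # identical after erasure
    print(restore_refs(erased_new, refs_new) == new)
```

The two inputs deduplicate fully once erased, but the server must then store the offset table and patch the bytes back in on every download.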

What happens if a chunk is corrupt/missing?

When a chunk is deleted from the database, all dependent .narinfo and .nar will become unavailable (503). However, this can be recovered from automatically when any NAR containing the chunk is uploaded.

At the moment, Attic cannot automatically detect when a chunk is corrupt or missing. Correctly distinguishing between transient and persistent failures is difficult. The atticadm utility will have the functionality to kill/delete bad chunks.
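The availability behavior described above can be sketched with a toy in-memory model. Everything here is an assumption made for illustration (the `ToyCache` class, its methods, and passing NARs in already split into chunks); only the 503 response and the recovery on re-upload come from the description above.

```python
import hashlib


class ToyCache:
    """Toy model: NARs are lists of chunk digests into a shared chunk map."""

    def __init__(self):
        self.chunks = {}  # digest -> bytes
        self.nars = {}    # NAR name -> ordered list of digests

    def upload(self, name, pieces):
        """Register a NAR given as pre-split chunks (hypothetical input shape)."""
        digests = []
        for piece in pieces:
            digest = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(digest, piece)  # re-adds a missing chunk
            digests.append(digest)
        self.nars[name] = digests

    def delete_chunk(self, piece):
        self.chunks.pop(hashlib.sha256(piece).hexdigest(), None)

    def get(self, name):
        """Return (HTTP-like status, body)."""
        digests = self.nars.get(name)
        if digests is None:
            return 404, b""
        if any(d not in self.chunks for d in digests):
            return 503, b""  # a required chunk is gone
        return 200, b"".join(self.chunks[d] for d in digests)


if __name__ == "__main__":
    cache = ToyCache()
    cache.upload("a.nar", [b"x", b"shared"])
    cache.upload("b.nar", [b"shared", b"z"])
    cache.delete_chunk(b"shared")
    print(cache.get("a.nar")[0], cache.get("b.nar")[0])
    cache.upload("c.nar", [b"shared"])  # any NAR containing the chunk
    print(cache.get("a.nar")[0], cache.get("b.nar")[0])
```

Deleting one shared chunk takes down every NAR that depends on it, and uploading any NAR that contains the chunk brings all of them back.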

Footnotes

  1. Apologies for the big code dump commit - the commit history is pretty chaotic in the private WIP branch. It basically remodelled the entire storage model.

  2. In more concrete terms, I want to use Cloudflare Workers to reassemble NARs for the sweet, sweet free egress 😃


blaggacao commented on July 23, 2024

Further references

The spongix Nix cache implementation already uses desync (which has an interesting readme!), a casync implementation in Go.

For interested parties, here is the introduction blog post for casync, which explains the algorithm: https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html

And the really deep dive is here: https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/


Stripping of store references has also been discussed in the context of nix-casync.

And the introductory blog post is also a good entrypoint from a Nix perspective.


shimunn commented on July 23, 2024

Wouldn't it be simpler to put each store reference into its own chunk? Doing so would only add complexity to the chunker but not to the assembler.
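A minimal sketch of what this suggestion could look like, assuming references appear as `/nix/store/<32-character hash>-` in the data. The `split_at_refs` helper is hypothetical and only shows the extra cut points; in practice the spans between references would still go through the regular chunker.

```python
import re

# Match only the 32-character hash part of a store path reference.
HASH = re.compile(rb"(?<=/nix/store/)[0-9abcdfghijklmnpqrsvwxyz]{32}(?=-)")


def split_at_refs(data: bytes):
    """Cut the data so every store path hash becomes a chunk of its own."""
    chunks, pos = [], 0
    for m in HASH.finditer(data):
        if m.start() > pos:
            chunks.append(data[pos:m.start()])  # content before the hash
        chunks.append(m.group(0))               # the hash itself
        pos = m.end()
    if pos < len(data):
        chunks.append(data[pos:])
    return chunks


if __name__ == "__main__":
    old = b"exec /nix/store/" + b"a" * 32 + b"-bash/bin/sh\n"
    new = b"exec /nix/store/" + b"b" * 32 + b"-bash/bin/sh\n"
    old_chunks, new_chunks = split_at_refs(old), split_at_refs(new)
    print(b"".join(new_chunks) == new)             # still plain concatenation
    print(len(set(old_chunks) & set(new_chunks)))  # chunks shared by both
```

Reassembly stays a plain concatenation, and the sketch also makes one cost visible: every reference turns into a separate 32-byte chunk.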

