Comments (3)
FastCDC-based chunking has been added in e8f9f3c [1]. In this new model, NARs are backed by a sequence of content-addressed chunks in the Global Chunk Store. Newly uploaded NARs are split into chunks with FastCDC, and only new chunks are uploaded to the storage backend. NARs that existed before chunking was introduced will be converted to have a single chunk.
For example, this works reasonably well for huge unfree paths that are rebuilt without a version change (e.g., Zoom and VSCode), even with a very large chunk size. I have some simple numbers here and will post more later. It also works even if there are some differences (e.g., llvm-13.0.0-lib -> llvm-13.0.1-lib). The default uses 128 KiB as the average chunk size, and I will add atticadm test-chunking so chunk sizes can be easily fine-tuned.
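The chunk-and-deduplicate model described above can be sketched roughly as follows. This is a simplified gear-hash chunker, not the FastCDC implementation Attic uses, and the names cdc_chunks and upload_nar, the dict standing in for the Global Chunk Store, and the small size parameters are all assumptions made for illustration.

```python
import hashlib
import random

# 256 pseudo-random 64-bit values, seeded so boundaries are reproducible.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data, min_size=2048, avg_size=8192, max_size=65536):
    """Split data at content-defined boundaries (simplified gear hash).

    avg_size must be a power of two. The sizes here are illustrative and
    far smaller than the 128 KiB average mentioned in the text.
    """
    mask = avg_size - 1
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data, start=1):
        # The low bits of h depend only on the most recent bytes, so a
        # boundary is decided by local content rather than by file offset.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start
        if length >= max_size or (length >= min_size and (h & mask) == 0):
            chunks.append(data[start:i])
            start = i
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def upload_nar(nar, store):
    """Store only chunks not already present; return (chunk ids, new count)."""
    chunk_ids = []
    new_chunks = 0
    for chunk in cdc_chunks(nar):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk  # the only data that would be uploaded
            new_chunks += 1
        chunk_ids.append(digest)
    return chunk_ids, new_chunks
```

Uploading the same bytes twice adds no new chunks the second time, and a NAR with a small edit in the middle reuses the chunks that precede the edit. Concatenating the stored chunks in the order of the returned ids reproduces the original NAR.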
I'm leaving this issue open so other approaches can be explored.
Some relevant FAQs:
Why chunk NARs instead of individual files?
In the current design, chunking is applied to the entire uncompressed NAR file instead of individual constituent files in the NAR. Big NARs that benefit the most from chunk-based deduplication (e.g., VSCode, Zoom) often have hundreds or thousands of small files. During NAR reassembly, it's often uneconomical or impractical to fetch thousands of files to reconstruct the NAR in a scalable way. By chunking the entire NAR, it's possible to configure the average chunk size to a larger value, ignoring file boundaries and lumping small files together. This is also the approach casync has taken.
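To make the trade-off concrete, here is a back-of-the-envelope estimate of how many objects must be fetched to reassemble a NAR under each strategy. The function name objects_to_fetch and the example package (5000 files of 4 KiB each) are hypothetical, and the estimate treats the average chunk size as a fixed size and ignores NAR framing overhead.

```python
import math


def objects_to_fetch(file_sizes, avg_chunk_size):
    """Rough object counts for per-file chunking vs. whole-NAR chunking."""
    # Per-file chunking: every file needs at least one object of its own.
    per_file = sum(max(1, math.ceil(size / avg_chunk_size)) for size in file_sizes)
    # Whole-NAR chunking: small files are lumped together into shared chunks.
    whole_nar = max(1, math.ceil(sum(file_sizes) / avg_chunk_size))
    return per_file, whole_nar


# Hypothetical package: 5000 files of 4 KiB, 128 KiB average chunk size.
per_file, whole_nar = objects_to_fetch([4096] * 5000, 128 * 1024)
print(per_file, whole_nar)  # 5000 157
```

Under these assumptions, per-file chunking needs one fetch per file, while whole-NAR chunking needs roughly the total size divided by the average chunk size.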
You may have heard that the Tvix store protocol chunks individual files instead of the NAR. The design of Attic is driven by the desire to effectively utilize existing platforms with practical limitations [2], while looking forward to the future.
Why not just erase store path references?
At first glance, erasing store path references and storing them separately seems easy, but it's actually difficult to do in practice. It makes NAR assembly difficult (no longer a simple concatenation of chunks), and the immediate benefits from doing so alone are minimal. Many files have other differences, like a minor version upgrade or .note.gnu.build-id. The files that this approach works with (mostly small paths like configurations) don't have much overhead to begin with. Therefore, I opted for the approach that does provide considerable gains and is simple to implement to start with.
What happens if a chunk is corrupt/missing?
When a chunk is deleted from the database, all dependent .narinfo and .nar files become unavailable (503). However, this is recovered from automatically when any NAR containing the chunk is uploaded.
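The availability behaviour described above can be modelled with a small sketch. The functions put_nar and get_nar and the dict-based store are hypothetical stand-ins, not Attic's API, and the chunks are passed in pre-split to keep the example short.

```python
import hashlib


def put_nar(chunks, store):
    """Record a NAR as a list of chunk ids, storing any chunk not yet present."""
    ids = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        ids.append(digest)
    return ids


def get_nar(ids, store):
    """Return (200, nar_bytes), or (503, b"") if any referenced chunk is gone."""
    if any(chunk_id not in store for chunk_id in ids):
        return 503, b""
    return 200, b"".join(store[chunk_id] for chunk_id in ids)


store = {}
shared = b"shared-chunk"
nar_a = put_nar([b"a-head", shared], store)
nar_b = put_nar([shared, b"b-tail"], store)

# Deleting the shared chunk makes every NAR that references it unavailable.
del store[hashlib.sha256(shared).hexdigest()]
status_a, _ = get_nar(nar_a, store)
status_b, _ = get_nar(nar_b, store)

# Re-uploading any NAR containing that chunk restores all of them.
put_nar([shared, b"b-tail"], store)
```

In this model, both NARs return 503 after the deletion, and both become available again once either of them is uploaded a second time.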
At the moment, Attic cannot automatically detect when a chunk is corrupt or missing. Correctly distinguishing between transient and persistent failures is difficult. The atticadm utility will have the functionality to kill/delete bad chunks.
Footnotes
Further references
The spongix nix cache implementation already uses desync (which has an interesting readme!), a casync implementation in Go.
For interested parties, here is the introduction blog post for casync, which explains the algorithm: https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
And the really deep dive is here: https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
Stripping of store references has also been discussed in the context of nix-casync.
And the introductory blog post is also a good entrypoint from a Nix perspective.
Wouldn't it be simpler to put each store reference into its own chunk? Doing so would only add complexity to the chunker but not to the assembler.