Comments (4)
Okay, I'm going to use this issue to collect info / random thoughts as I go through the code and hopefully this will spark further discussion before summarizing this in the docs and / or coming up with plans for future improvements.
- `mkdwarfs` has a section on internal operation that's useful for understanding where it might be using memory. There's also a high-level sequence diagram.
- As mentioned in the `mksquashfs` issue, things like keeping track of hardlinks and de-duplicating files require a global view of all files. Basically, most features that significantly differentiate DwarFS from other read-only file systems require that global view (e.g. similarity ordering). So a core question is whether you'd like to be able to turn off all these features in order to create a DwarFS image with billions of files, or whether you'd still want these features to somehow work. Personally, I'd be in favour of the latter.
- In its current shape and form, it'll be really hard to create a DwarFS image whose unpacked metadata does not fit into memory. Once the metadata is packed and written to disk, it can be memory-mapped without ever unpacking it, so accessing a file system image should not run into memory issues. Assuming that all memory-consuming features were disabled, we'd still have to build huge lists of inodes / directory entries / chunks and keep them in memory. These would consume at least 64 bytes per file, assuming 32-bit numeric values (as is currently the case). So for 1 billion files, you'd be looking at at least 64 GB of RAM just for the metadata (and that's really just a lower bound after a quick glance at the data structures).
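To make the arithmetic explicit (the 64 bytes/file figure is the rough per-file estimate from the point above, not a measured number):

```python
# Back-of-envelope lower bound for unpacked metadata, using the
# ~64 bytes/file estimate from the discussion above (not measured).
BYTES_PER_FILE = 64
num_files = 1_000_000_000  # one billion files

total = num_files * BYTES_PER_FILE
print(f"{total:,} bytes = {total / 10**9:.0f} GB = {total / 2**30:.1f} GiB")
# → 64,000,000,000 bytes = 64 GB = 59.6 GiB
```

Note that this counts only the bare metadata lists; hash tables for deduplication, similarity ordering state, etc. would come on top.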
- Is the ability to pack a billion files a hard requirement? If so, why? With all advanced features disabled, there's likely not a huge benefit (in terms of space) in having a single image over, say, 20 images each holding a subset of the data and merge-mounting them. (And yeah, one potential solution might be to have `mkdwarfs` somehow start a new sub-image when exceeding a certain number of input files and then transparently merge these sub-images when reading the image.)
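A crude approximation of that sub-image idea, sketched as a wrapper that splits the input file list into fixed-size batches; one `mkdwarfs` run per batch would then produce one sub-image each. The batch size, the output naming, and the placeholder where `mkdwarfs` would be invoked are all hypothetical:

```python
def batched(paths, size):
    """Split an iterable of paths into consecutive batches of at most `size`."""
    batch = []
    for p in paths:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def build_sub_images(all_files, batch_size=50_000_000):
    """Yield (image_name, file_batch) pairs, one per planned sub-image."""
    for i, batch in enumerate(batched(all_files, batch_size)):
        # Placeholder: here one would invoke mkdwarfs on just this batch,
        # producing e.g. "part-0.dwarfs", "part-1.dwarfs", ...
        yield f"part-{i}.dwarfs", batch

# With a toy batch size, 10 files split into 4 + 4 + 2:
parts = list(build_sub_images([f"f{i}" for i in range(10)], batch_size=4))
print([name for name, _ in parts])
# → ['part-0.dwarfs', 'part-1.dwarfs', 'part-2.dwarfs']
```

The interesting (and unsolved) part is the "transparently merge when reading" half, which this sketch deliberately leaves out.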
- I came across STXXL recently, which might make it possible to (optionally) move larger data structures from RAM to disk. But I have zero experience with it, and it would most certainly have an impact on performance. It might even be possible to use these types to represent the unpacked metadata.
from dwarfs.
Hi! Totally agree it'd be good to document the current state of things. I don't think it's easily possible at the moment to influence the amount of memory used per-file, though (I'd have to take a closer look to be completely sure). The memory limit option will only affect how many filesystem blocks can be queued for compression, so that isn't going to help with the memory consumed per-file. I'll get back with more info when I find some time to look into this more closely.
- Another option might be to allocate "static" and infrequently used data in disk-backed, memory-mapped files.
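That idea can be sketched with just the Python standard library: write fixed-size records into a file once, then `mmap` it so the OS pages records in on demand instead of holding the whole array in RAM. The 16-byte record layout below is purely hypothetical, not DwarFS's actual metadata format:

```python
import mmap
import struct
import tempfile

# Hypothetical fixed-size record: 64-bit offset + two 32-bit fields.
RECORD = struct.Struct("<QII")
NUM_RECORDS = 1000

with tempfile.TemporaryFile() as f:
    # Write all records sequentially to a disk-backed file...
    for i in range(NUM_RECORDS):
        f.write(RECORD.pack(i * 4096, i, i % 7))
    f.flush()
    # ...then map it and read any single record without loading
    # the whole array; the kernel pages data in and out as needed.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        offset, inode, flags = RECORD.unpack_from(m, 123 * RECORD.size)
        print(offset, inode, flags)  # → 503808 123 4
```

Random access stays cheap, and resident memory is bounded by the page cache rather than by the number of records, which is exactly the property wanted here.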
> So a core question is whether you'd like to be able to turn off all these features in order to create a DwarFS image with billions of files, or whether you'd still want these features to somehow work. Personally, I'd be in favour of the latter.
I think it would be fantastic if there were a way to get a "constant-memory operation" mode.

Sometimes you just have too many files, and it's the number of files that's the problem (e.g. causing disk seeks on reads or disk scrubs, per-request charges on cloud storage systems, and so on). Then you basically want constant-memory `tar`, but without its drawbacks: no concurrent reads (so it's slow), no index (so no fast extraction), no easy way to mount it and read the data without unpacking, and no intelligent file-dependent streaming compression.

So I think it would already be valuable to be able to disable DwarFS's other features just to gain those properties.

But in addition, it would be even cooler to be able to opt into some of DwarFS's more advanced features in a "constant-memory" mode, e.g. deduplicating against only the last N MB read, or the hashes of the last N million files. This would unlock most of DwarFS's space-saving features while still letting you set up an automatic job and know it will never run out of RAM.

So you could gradually tune between the "a few KB of memory needed" mode of `tar` and the "a few GB needed, configurable" mode of DwarFS, which would already give massive space savings.
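The "hashes of the last N files" idea can be sketched as a bounded dedup index: keep content hashes in a fixed-size FIFO, so memory stays constant no matter how many files are scanned. The window size and the use of SHA-256 here are arbitrary choices for illustration; DwarFS's actual deduplication does not work this way today:

```python
import hashlib
from collections import OrderedDict

class BoundedDedupIndex:
    """Remember content hashes of at most `capacity` recently seen files."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.seen = OrderedDict()  # hash -> first path seen with that content

    def check(self, path, data):
        """Return the earlier duplicate's path, or None if unseen in the window."""
        digest = hashlib.sha256(data).digest()
        if digest in self.seen:
            self.seen.move_to_end(digest)  # keep hot entries alive
            return self.seen[digest]
        self.seen[digest] = path
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # evict the oldest hash
        return None

idx = BoundedDedupIndex(capacity=2)
r1 = idx.check("a", b"hello")              # None: first sighting
r2 = idx.check("b", b"hello")              # "a": duplicate inside the window
idx.check("c", b"x"); idx.check("d", b"y")  # "hello" gets evicted here
r3 = idx.check("e", b"hello")              # None: fell outside the window
print(r1, r2, r3)  # → None a None
```

Memory use is bounded by `capacity` hash entries regardless of input size; the trade-off is that duplicates further apart than the window are simply missed, which is exactly the gradual tuning knob described above.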