I'm having a consistent problem with all filesystems created with …

Bug report: dwarfs fails on tiny window sizes

M-Gonzalo commented on May 14, 2024

Bug report: dwarfs fails on tiny window sizes

from dwarfs.

Comments (6)

M-Gonzalo commented on May 14, 2024

> Thanks for the report! Unfortunately I'm unable to download the sample file (getting an access denied error).

My bad. Try now?


mhx commented on May 14, 2024

> It just so happens that window size is the only one that allows full deduplication of the big files.

At some point deduplication stops being useful, though. The overhead spent on metadata for a small chunk is much bigger than simply letting a proper compression algorithm take care of the redundant data.

Building a DwarFS image from the data in your image, but with default options, yields a file that's less than half the size.


mhx commented on May 14, 2024

Thanks for the report! Unfortunately I'm unable to download the sample file (getting an access denied error).


mhx commented on May 14, 2024

Wow, you've managed to build an incredibly inefficient dwarfs image. :)

-W 4, i.e. a moving match window of size 2^4 = 16 bytes, while not forbidden by mkdwarfs, is very likely a bad idea.

In the case of your file system image, it means files are extremely fragmented:

                [385] -> (block=0, offset=19372883, size=19)
                [386] -> (block=0, offset=19388720, size=13)
                [387] -> (block=0, offset=19388468, size=16)
                [388] -> (block=0, offset=19373327, size=25)
                [389] -> (block=0, offset=19388733, size=19)
                [390] -> (block=0, offset=19388228, size=27)
                [391] -> (block=0, offset=19388752, size=24)
                [392] -> (block=0, offset=19388468, size=16)
                [393] -> (block=0, offset=19388776, size=4)
                [394] -> (block=0, offset=19367800, size=21)
                [395] -> (block=0, offset=19388780, size=15)
                [396] -> (block=0, offset=19372883, size=19)
                [397] -> (block=0, offset=19388795, size=4)
                [398] -> (block=0, offset=19368512, size=17)
                [399] -> (block=0, offset=19388799, size=11)

So each file is made up of many tiny chunks, sometimes millions of them.
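
To put a number on the fragmentation, the listing excerpt above can be parsed and summarized. This is only a minimal sketch in Python: the `parse_chunks` helper and the regular expression are illustrative assumptions about the dump format shown here, not part of any DwarFS tool.

```python
import re

# The 15 chunk entries quoted above, copied verbatim (indentation dropped).
LISTING = """\
[385] -> (block=0, offset=19372883, size=19)
[386] -> (block=0, offset=19388720, size=13)
[387] -> (block=0, offset=19388468, size=16)
[388] -> (block=0, offset=19373327, size=25)
[389] -> (block=0, offset=19388733, size=19)
[390] -> (block=0, offset=19388228, size=27)
[391] -> (block=0, offset=19388752, size=24)
[392] -> (block=0, offset=19388468, size=16)
[393] -> (block=0, offset=19388776, size=4)
[394] -> (block=0, offset=19367800, size=21)
[395] -> (block=0, offset=19388780, size=15)
[396] -> (block=0, offset=19372883, size=19)
[397] -> (block=0, offset=19388795, size=4)
[398] -> (block=0, offset=19368512, size=17)
[399] -> (block=0, offset=19388799, size=11)
"""

# Hypothetical pattern for one "[index] -> (block=, offset=, size=)" line.
CHUNK_RE = re.compile(r"\[(\d+)\] -> \(block=(\d+), offset=(\d+), size=(\d+)\)")


def parse_chunks(text):
    """Return a list of (index, block, offset, size) integer tuples."""
    return [tuple(map(int, m.groups())) for m in CHUNK_RE.finditer(text)]


chunks = parse_chunks(LISTING)
sizes = [size for _, _, _, size in chunks]
mean_size = sum(sizes) / len(sizes)

# Offsets that occur more than once indicate chunks pointing at shared data.
distinct_offsets = {offset for _, _, offset, _ in chunks}
reused = len(chunks) - len(distinct_offsets)

print(f"{len(chunks)} chunks, {sum(sizes)} bytes, mean {mean_size:.1f} bytes/chunk")
print(f"{reused} chunks reuse an offset already seen in the excerpt")
```

For this excerpt the mean chunk size is below 17 bytes, and two of the fifteen chunks point at an offset that another chunk in the excerpt already references.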

The metadata block is gigantic:

$ ./dwarfsck imdb.dwarfs 
DwarFS version 2.3 [2]
created by: libdwarfs v0.5.6-16-g7345578
created on: 2021-11-15 00:36:33
block size: 64 MiB
block count: 1
inode count: 2338
original filesystem size: 142.8 MiB
compressed block size: 5.877 MiB (27.04%)
uncompressed block size: 21.74 MiB
compressed metadata size: 19.91 MiB (31.13%)
uncompressed metadata size: 63.95 MiB
options: mtime_only
         packed_names
metadata memory usage:
               total metadata............67,055,144 bytes       28680.6 bytes/inode
    12,765,199 chunks....................67,017,295 bytes  99.9%   5.3 bytes/item
         1,329 compact_names.................13,485 bytes   0.0%  10.1 bytes/item
               |- data                       10,450 bytes   0.0%   7.9 bytes/item
               |- unpacked                   18,957 bytes  1.81x  14.3 bytes/item
               |- dict                          707 bytes   0.0%   0.5 bytes/item
               '- index                       2,328 bytes   0.0%   1.8 bytes/item
         2,338 inodes.........................9,937 bytes   0.0%   4.3 bytes/item
         2,481 dir_entries....................7,133 bytes   0.0%   2.9 bytes/item
         2,091 chunk_table....................6,273 bytes   0.0%   3.0 bytes/item
           237 directories......................711 bytes   0.0%   3.0 bytes/item
             8 compact_symlinks.................174 bytes   0.0%  21.8 bytes/item
               |- data                          165 bytes   0.0%  20.6 bytes/item
               '- index                           9 bytes   0.0%   1.1 bytes/item
             5 modes.............................10 bytes   0.0%   2.0 bytes/item
            10 symlink_table......................4 bytes   0.0%   0.4 bytes/item
             1 gids...............................2 bytes   0.0%   2.0 bytes/item
             1 uids...............................2 bytes   0.0%   2.0 bytes/item
             0 devices............................0 bytes   0.0%   0.0 bytes/item
             3 shared_files_table.................0 bytes   0.0%   0.0 bytes/item

That's about 3 times as much metadata as block data.
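
The ratio can be checked directly from the figures in the dwarfsck output. The short Python sketch below only redoes that arithmetic; it assumes that "MiB" means 2^20 bytes and that the original filesystem size divided by the chunk count is a fair approximation of the average amount of file data per chunk.

```python
MIB = 1024 * 1024  # assumption: dwarfsck reports binary mebibytes

# Figures copied from the dwarfsck output above.
compressed_block_mib = 5.877
compressed_metadata_mib = 19.91
uncompressed_block_mib = 21.74
uncompressed_metadata_mib = 63.95
original_size_mib = 142.8
chunk_count = 12_765_199
chunk_metadata_bytes = 67_017_295

# Metadata size relative to block data, compressed and uncompressed.
compressed_ratio = compressed_metadata_mib / compressed_block_mib
uncompressed_ratio = uncompressed_metadata_mib / uncompressed_block_mib

# Average file data covered by one chunk vs. metadata spent on that chunk.
data_bytes_per_chunk = original_size_mib * MIB / chunk_count
metadata_bytes_per_chunk = chunk_metadata_bytes / chunk_count

print(f"metadata / block data (compressed):   {compressed_ratio:.2f}x")
print(f"metadata / block data (uncompressed): {uncompressed_ratio:.2f}x")
print(f"file data per chunk:  {data_bytes_per_chunk:.2f} bytes")
print(f"metadata per chunk:   {metadata_bytes_per_chunk:.2f} bytes")
```

Under these assumptions each chunk covers a little under 12 bytes of file data and costs a little over 5 bytes of uncompressed metadata, which is why the chunk list accounts for 99.9% of the metadata.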

In any case, the problem is as follows:

- dwarfs receives a request to read 256k from a large file
- since the chunks are tiny, this results in tens of thousands of chunks that have to be read
- the FUSE library uses writev() to transfer these chunks to the kernel
- however, writev() only supports up to IOV_MAX buffers per call (the value reported by sysconf(_SC_IOV_MAX)), which happens to be set to 1024

The bug in dwarfs is that it doesn't actually check the return value of the fuse library call. I've fixed this locally, and I can now read a large file from your image, but the performance is predictably abysmal:

$ time cat tmp/title.ratings.async.json >/dev/null

real    12m15.731s
user    0m0.000s
sys     0m1.081s
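
The writev() limit described above can be respected by submitting the buffers in batches and checking how much each call actually wrote. The sketch below is in Python rather than the C++ that dwarfs is written in, and it is not the fix that went into dwarfs; `iov_max` and `writev_all` are hypothetical helper names. It assumes a Unix-like system where `os.writev` and `os.sysconf` are available. On Linux, a single writev() call with more buffers than the limit is expected to fail with EINVAL instead of writing anything.

```python
import os
import tempfile


def iov_max(default=1024):
    """Return the per-call buffer limit for writev(), or a fallback value."""
    try:
        value = os.sysconf("SC_IOV_MAX")
    except (AttributeError, ValueError, OSError):
        return default
    return value if value > 0 else default


def writev_all(fd, buffers, limit=None):
    """Write every buffer to fd, passing at most `limit` buffers per call.

    The return value of each os.writev() call is checked so that partial
    writes are continued instead of being silently dropped.
    """
    limit = limit or iov_max()
    pending = [memoryview(b) for b in buffers if len(b) > 0]
    total = 0
    while pending:
        batch = pending[:limit]
        written = os.writev(fd, batch)
        total += written
        # Skip buffers that were written completely.
        done = 0
        while done < len(batch) and written >= len(batch[done]):
            written -= len(batch[done])
            done += 1
        pending = pending[done:]
        # Trim a buffer that was only partially written.
        if written:
            pending[0] = pending[0][written:]
    return total


# Usage: about 22,000 tiny buffers, roughly the scale discussed above.
buffers = [bytes([i % 256]) * (i % 16 + 1) for i in range(22_000)]
with tempfile.TemporaryFile() as f:
    written = writev_all(f.fileno(), buffers)
    f.seek(0)
    assert f.read() == b"".join(buffers)
print(f"{written} bytes written, at most {iov_max()} buffers per writev() call")
```

Batching only removes the hard failure. It does nothing about the cost of assembling tens of thousands of tiny chunks per read, which is consistent with the timing shown above.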


mhx commented on May 14, 2024

Please try the commit I've just pushed to see if that fixes your problem.


M-Gonzalo commented on May 14, 2024

Yes, this is an extreme case made just for the report. It just so happens that window size is the only one that allows full deduplication of the big files. But sometimes a size of 7 is good enough to find a good balance. I'll compile the latest version and test it, and I'll get back to you with the results.

