
Comments (6)

M-Gonzalo commented on May 14, 2024

> Thanks for the report! Unfortunately I'm unable to download the sample file (getting an access denied error).

My bad. Try now?

from dwarfs.

mhx commented on May 14, 2024

> It just so happens that window size is the only one that allows full deduplication of the big files.

At some point deduplication stops being useful, though. The overhead spent on metadata for a small chunk is much bigger than simply letting a proper compression algorithm take care of the redundant data.

Building a DwarFS image from the data in your image, but with default options, yields a file that's less than half the size.


mhx commented on May 14, 2024

Thanks for the report! Unfortunately I'm unable to download the sample file (getting an access denied error).


mhx commented on May 14, 2024

Wow, you've managed to build an incredibly inefficient dwarfs image. :)

-W 4, i.e. a moving match window of size 16, while not forbidden by mkdwarfs, is very likely a bad idea.

In case of your file system image, it means files are extremely fragmented:

                [385] -> (block=0, offset=19372883, size=19)
                [386] -> (block=0, offset=19388720, size=13)
                [387] -> (block=0, offset=19388468, size=16)
                [388] -> (block=0, offset=19373327, size=25)
                [389] -> (block=0, offset=19388733, size=19)
                [390] -> (block=0, offset=19388228, size=27)
                [391] -> (block=0, offset=19388752, size=24)
                [392] -> (block=0, offset=19388468, size=16)
                [393] -> (block=0, offset=19388776, size=4)
                [394] -> (block=0, offset=19367800, size=21)
                [395] -> (block=0, offset=19388780, size=15)
                [396] -> (block=0, offset=19372883, size=19)
                [397] -> (block=0, offset=19388795, size=4)
                [398] -> (block=0, offset=19368512, size=17)
                [399] -> (block=0, offset=19388799, size=11)

So each file is made up of sometimes millions of tiny chunks.
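As a rough check on how small these chunks are, the fifteen entries listed above can be parsed and averaged. This is only a sketch: the regular expression matches the listing format quoted here, and the variable names are made up for illustration rather than taken from any dwarfs tool.

```python
import re

# Chunk entries copied verbatim from the listing above.
listing = """
[385] -> (block=0, offset=19372883, size=19)
[386] -> (block=0, offset=19388720, size=13)
[387] -> (block=0, offset=19388468, size=16)
[388] -> (block=0, offset=19373327, size=25)
[389] -> (block=0, offset=19388733, size=19)
[390] -> (block=0, offset=19388228, size=27)
[391] -> (block=0, offset=19388752, size=24)
[392] -> (block=0, offset=19388468, size=16)
[393] -> (block=0, offset=19388776, size=4)
[394] -> (block=0, offset=19367800, size=21)
[395] -> (block=0, offset=19388780, size=15)
[396] -> (block=0, offset=19372883, size=19)
[397] -> (block=0, offset=19388795, size=4)
[398] -> (block=0, offset=19368512, size=17)
[399] -> (block=0, offset=19388799, size=11)
"""

# Pull (offset, size) out of every entry.
entries = re.findall(r"offset=(\d+), size=(\d+)", listing)
offsets = [int(offset) for offset, _ in entries]
sizes = [int(size) for _, size in entries]

mean_size = sum(sizes) / len(sizes)
print(f"{len(sizes)} chunks, {sum(sizes)} bytes, mean {mean_size:.1f} bytes/chunk")
print(f"{len(set(offsets))} distinct offsets")
```

For this excerpt the mean comes out at under 17 bytes per chunk, and two of the offsets occur twice, which is the deduplication at work.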

The metadata block is gigantic:

$ ./dwarfsck imdb.dwarfs 
DwarFS version 2.3 [2]
created by: libdwarfs v0.5.6-16-g7345578
created on: 2021-11-15 00:36:33
block size: 64 MiB
block count: 1
inode count: 2338
original filesystem size: 142.8 MiB
compressed block size: 5.877 MiB (27.04%)
uncompressed block size: 21.74 MiB
compressed metadata size: 19.91 MiB (31.13%)
uncompressed metadata size: 63.95 MiB
options: mtime_only
         packed_names
metadata memory usage:
               total metadata............67,055,144 bytes       28680.6 bytes/inode
    12,765,199 chunks....................67,017,295 bytes  99.9%   5.3 bytes/item
         1,329 compact_names.................13,485 bytes   0.0%  10.1 bytes/item
               |- data                       10,450 bytes   0.0%   7.9 bytes/item
               |- unpacked                   18,957 bytes  1.81x  14.3 bytes/item
               |- dict                          707 bytes   0.0%   0.5 bytes/item
               '- index                       2,328 bytes   0.0%   1.8 bytes/item
         2,338 inodes.........................9,937 bytes   0.0%   4.3 bytes/item
         2,481 dir_entries....................7,133 bytes   0.0%   2.9 bytes/item
         2,091 chunk_table....................6,273 bytes   0.0%   3.0 bytes/item
           237 directories......................711 bytes   0.0%   3.0 bytes/item
             8 compact_symlinks.................174 bytes   0.0%  21.8 bytes/item
               |- data                          165 bytes   0.0%  20.6 bytes/item
               '- index                           9 bytes   0.0%   1.1 bytes/item
             5 modes.............................10 bytes   0.0%   2.0 bytes/item
            10 symlink_table......................4 bytes   0.0%   0.4 bytes/item
             1 gids...............................2 bytes   0.0%   2.0 bytes/item
             1 uids...............................2 bytes   0.0%   2.0 bytes/item
             0 devices............................0 bytes   0.0%   0.0 bytes/item
             3 shared_files_table.................0 bytes   0.0%   0.0 bytes/item

That's about 3 times as much metadata as block data.
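The figures reported by dwarfsck make it possible to estimate the per-chunk overhead. The sketch below only does arithmetic on numbers copied from the output above. Treating "original filesystem size" divided by the chunk count as the average chunk size is an assumption, so the result is a ballpark figure rather than an exact one.

```python
MIB = 1024 ** 2

# Values copied from the dwarfsck output above.
chunk_count = 12_765_199
chunk_metadata_bytes = 67_017_295
original_size_bytes = 142.8 * MIB
compressed_block_mib = 5.877
compressed_metadata_mib = 19.91

# Metadata needed to describe one chunk.
metadata_per_chunk = chunk_metadata_bytes / chunk_count

# Assumption: every byte of the original data is referenced by exactly one
# chunk entry, so this is only an approximate average chunk size.
avg_chunk_size = original_size_bytes / chunk_count

# How much larger the compressed metadata is than the compressed block data.
compressed_ratio = compressed_metadata_mib / compressed_block_mib

print(f"metadata per chunk: {metadata_per_chunk:.2f} bytes")
print(f"average chunk size: {avg_chunk_size:.1f} bytes")
print(f"compressed metadata / block data: {compressed_ratio:.2f}x")
```

Under these assumptions each chunk costs a bit over 5 bytes of metadata to describe roughly 12 bytes of file data, which is the kind of overhead that makes deduplication at this granularity counterproductive.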

In any case, the problem is as follows:

  • dwarfs receives a request to read 256k from a large file
  • since the chunks are tiny, this results in tens of thousands of chunks to be read
  • the FUSE library uses writev() to transfer these chunks to the kernel
  • however, writev() only supports up to IOV_MAX buffers per call (the limit reported by sysconf(_SC_IOV_MAX)), which happens to be set to 1024
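The limit in the last step can be illustrated with a small sketch that splits a long buffer list into writev() calls of at most IOV_MAX buffers each and checks every return value. This is not the code from dwarfs or the FUSE library, and it is not the fix mentioned in this thread; `writev_all` is a hypothetical helper written for illustration, using Python's `os.writev` and `os.sysconf`.

```python
import os


def writev_all(fd, buffers, iov_max=None):
    """Write all buffers to fd using at most iov_max buffers per writev() call.

    Hypothetical helper for illustration only. Returns the number of bytes written.
    """
    if iov_max is None:
        # Python exposes the sysconf name _SC_IOV_MAX as "SC_IOV_MAX".
        iov_max = os.sysconf("SC_IOV_MAX")
    if iov_max <= 0:
        # Limit not reported: fall back to 16, the POSIX minimum (_XOPEN_IOV_MAX).
        iov_max = 16

    views = [memoryview(b) for b in buffers if len(b) > 0]
    pos = 0      # index of the first buffer not yet fully written
    total = 0
    while pos < len(views):
        # Never hand more than iov_max buffers to a single call.
        written = os.writev(fd, views[pos:pos + iov_max])
        total += written
        # Check the return value: skip buffers that were written completely...
        while pos < len(views) and written >= len(views[pos]):
            written -= len(views[pos])
            pos += 1
        # ...and trim a buffer that was only partially written.
        if written:
            views[pos] = views[pos][written:]
    return total
```

A call such as `writev_all(fd, chunks)` then works for any number of buffers. As a rough estimate, with chunks averaging about a dozen bytes (142.8 MiB spread over 12,765,199 chunks), a 256 KiB read maps to around 22,000 buffers, so about 22 calls at 1024 buffers each.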

The bug in dwarfs is that it doesn't actually check the return value of the fuse library call. I've fixed this locally, and I can now read a large file from your image, but the performance is predictably abysmal:

$ time cat tmp/title.ratings.async.json >/dev/null

real    12m15.731s
user    0m0.000s
sys     0m1.081s


mhx commented on May 14, 2024

Please try the commit I've just pushed to see if that fixes your problem.


M-Gonzalo commented on May 14, 2024

Yes, this is an extreme case made just for the report. It just so happens that this window size is the only one that allows full deduplication of the big files. But sometimes a size of 7 is good enough to find a good balance. I'll compile the latest version and test it, and I'll get back to you with the results.
