schnaader / fairytale
encode.ru community archiver
License: GNU Lesser General Public License v3.0
I realize this project is all but dead right now, or at the very least halted. Still, I cling to the hope that someone will take up the mantle and start working on its core code this new year. The idea is as good as it was on day one, so with that said, let me propose another issue that needs to be addressed before releasing the first version.
All pre-processed and reordered data needs to be actually compressed before writing it to disk. We have considered some novel algorithms, but given the fact that none of them seem to be production-ready yet, I say lzma2 is as good as any, and it's battle-proven. Even without all the fancy new methods and algorithms for pre-processing, a Fairytale archive compressed with it should be about as small as a .7z, and smaller than a .zip, .zipx, maybe even a .rar.
OTOH, I noticed most of the people involved in data compression like to work with paq* algorithms, context mixing and such. It's been said that paq8px's internal structure doesn't provide all that's needed to produce the most efficient archive, so maybe this could be the next playground for neural networks, after porting the paq8px code...
Any thoughts about this?
Squashfs is a read-only filesystem that is frequently used to transparently compress whole operating systems on live portable media, to distribute software in the Snap and AppImage formats, and to efficiently store large multimedia archives. It divides the data into rather small blocks and then compresses them with one of six algorithms: gzip, lzo, lzma, xz, lz4 or zstd.
If Fairytale were able to recompress them, it could mean several GB of savings on a sysadmin's drive. But doing so by brute-force guessing the precise method that was used to create the file goes from extremely impractical to virtually impossible.
Luckily, there's no need for that. Squashfs stores all the options used to compress a block, so recompressing a SQS file is just a matter of decompressing the streams and copying the flags; looking the options up is O(1). Just to make sure, I had a private conversation with the author, Philip Lougher:
As I understand the docs, squashfs now stores the options used to compress every block. [...] My doubt is: does squashfs really store all compression options inside the final archive?
Yes it does.
If a user has specified non-default compression options, these are stored in the final archive.

If the user has used the default compression options, these are not stored in the archive, because no stored compression options indicates defaults were used. If the defaults were used then storing them is unnecessary, because the software knows what the defaults are.

The presence of compression options is indicated by setting the SQUASHFS_COMP_OPT bit (bit 10) in the Squashfs flags field in the Squashfs superblock.

If that is set, then the compression options are stored immediately after the superblock. The size and structure of the compression options vary depending on which compressor was used to compress the filesystem. The compressor used is stored in the "compression" field in the superblock.

If all the various compressors are enabled and compiled into Mksquashfs, then mksquashfs -info will list the compression options supported by each compressor, and the defaults (which, if used, means no compression options will be stored).

Unsquashfs -stat will report the compressor used and any non-default compression options used by the filesystem. If no compression options are reported by -stat then the default options were used.
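Based on the author's answer above, detecting whether a Squashfs image carries stored compression options is a simple superblock check. The sketch below is a hedged illustration, assuming the Squashfs 4.0 little-endian on-disk layout (compression id at byte offset 20, flags at offset 24); the struct and function names are hypothetical, not existing Fairytale code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: inspect a Squashfs 4.0 superblock to see whether non-default
// compression options were stored. Offsets follow the on-disk layout:
// u32 magic, u32 inode_count, u32 mod_time, u32 block_size, u32 frag_count,
// u16 compression_id (offset 20), u16 block_log (22), u16 flags (24).
// SQUASHFS_COMP_OPT is bit 10 of the flags field, per the quote above.
struct SqsInfo {
  uint16_t compressionId;     // 1=gzip, 2=lzma, 3=lzo, 4=xz, 5=lz4, 6=zstd
  bool hasCompressorOptions;  // options block follows the superblock
};

constexpr uint32_t SQUASHFS_MAGIC = 0x73717368u;  // "hsqs" read little-endian
constexpr uint16_t SQUASHFS_COMP_OPT = 1u << 10;

// Returns false if the buffer doesn't start with a Squashfs superblock.
// Assumes a little-endian host for brevity; real code would byte-swap.
bool readSqsInfo(const uint8_t* sb, size_t len, SqsInfo* out) {
  if (len < 26) return false;
  uint32_t magic;
  std::memcpy(&magic, sb, 4);
  if (magic != SQUASHFS_MAGIC) return false;
  uint16_t compression, flags;
  std::memcpy(&compression, sb + 20, 2);
  std::memcpy(&flags, sb + 24, 2);
  out->compressionId = compression;
  out->hasCompressorOptions = (flags & SQUASHFS_COMP_OPT) != 0;
  return true;
}
```

With this, the recompression path never has to guess: decompress the blocks, remember the compressor id and any stored options, and re-emit them verbatim when rebuilding the image.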
Dockcross ( https://github.com/dockcross/dockcross ) looks like a good way to cross-compile for the different targets (32-/64-bit Windows/Linux/OSX/ARM): it can be run on one machine and compile all the executables.
Using that, it would be possible to automate the executable creation so that they can be added to the GitHub release section faster.
When Fairytale can't correctly re-compress a valid deflate-compressed stream, it's because the stream was generated by an implementation other than the zlib library. There are a few approaches (like preflate) that rely on diff patches to reconstruct a valid stream and save at least a bit of the information.
IMHO, even when useful as a last resort, this is sub-par: the decompressed chunk it yields is effectively broken, there is still overhead to it, and CPU cycles are spent on the diff calculation.
I want to propose, instead, a different approach. Keeping in mind that there aren't that many alternative implementations of deflate in widespread use, Fairytale could optimally handle deflate recompression this way:
Fairytale can try all of them, just a few, or go straight to diff mode.
Advantages?
Currently, if a deflated file was not produced by zlib, Fairytale, Precomp, paq8 and the like will spend quite some time trying to make it work nonetheless. Using this approach, only the first few streams will confuse the program, until it finds which library is the right one. In one word: speed.
Currently, there is no re-compression whatsoever if zlib fails. And with diff-based re-compressors, not all of the information is replaced with its unpacked version. Using the new approach, more files could be completely decompressed, fed into the recursion loop, de-duplicated and so on. In one word: strength.
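The proposed dispatch can be sketched as a small trial loop. This is an illustration only: the encoder list here is hypothetical (real entries would wrap zlib, libdeflate, 7-Zip's deflate and so on, at each plausible level/strategy setting), and the function name is not existing Fairytale code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Sketch of the proposed dispatch: given the raw decompressed data and the
// original deflate stream, try each known encoder until one reproduces the
// stream bit-for-bit; otherwise signal that diff-based fallback is needed.
using Encoder = std::function<std::string(const std::string& raw)>;

// Returns the index of the first encoder that reproduces `original`,
// or -1 to request diff mode.
int findMatchingEncoder(const std::vector<Encoder>& encoders,
                        const std::string& raw,
                        const std::string& original) {
  for (size_t i = 0; i < encoders.size(); ++i) {
    if (encoders[i](raw) == original)
      return static_cast<int>(i);
  }
  return -1;  // no bit-exact match: fall back to diff mode
}
```

Once one stream in a file matches encoder i, subsequent streams can try i first, which is exactly why only the first few streams would confuse the program.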
@M-Gonzalo found an issue with a PDF file that causes the new JSON parser to continually output 1 byte detections at position 18446744073709551615, which is probably the result of a -1 getting cast to a uint64_t.
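That diagnosis checks out arithmetically: a signed -1 converted to uint64_t wraps around to 2^64 - 1, which is exactly the position being printed. A minimal demonstration (the helper name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Why the parser prints position 18446744073709551615: a signed -1
// sentinel (e.g. "not found") converted to uint64_t wraps around to
// the maximum representable value, 2^64 - 1.
uint64_t asPosition(int64_t pos) {
  return static_cast<uint64_t>(pos);  // -1 becomes 18446744073709551615
}
```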
I was able to reproduce this issue on OSX yesterday. However, when I try the same thing again, I get numerous 1 byte JSON detections, followed by an infinite repeating cycle as follows:
Possible text detection at 0, 1935 bytes
Possible text detection at 3907, 1119 bytes
Possible text detection at 37668, 22694 bytes
Possible text detection at 72364, 19153 bytes
Possible text detection at 103772, 23085 bytes
Possible text detection at 137207, 11849 bytes
Possible text detection at 751182, 1741 bytes
Looks like this FileStream can't handle a Unicode file name. Is that true?
Line 32 in 3bdb3f9
(Sorry if my English is bad, I'm not a native speaker)
Continuing the idea from issue #2, there is another compression algorithm that is becoming more popular by the day: zstd, or Zstandard, a very fast and efficient method for on-the-fly compression and decompression.
It is now included in the Linux kernel, squashfs, dpkg and apt, just to cite a few. There seems to be a trend aiming to replace good old deflate with Zstandard.
Given that it is a relatively weak algorithm (compared to lzma-class methods) and even faster than deflate, I want to propose a model to losslessly recompress streams of data packed with it. Nothing has been done in this regard yet, so it is a long shot. But this is a fairytale! So I'm opening this issue nevertheless.
There's a good example at https://github.com/deftio/travis-ci-cpp-example that shows how to do unit testing and get coverage using Travis CI and coveralls.io.
I'll have a look at it, this issue is just used as a reminder for myself.
Besides being able to compress baseline and progressive jpg images using a dedicated method, Fairytale should also be capable of re-compressing jpg images that use arithmetic coding. For that we can't count on packJPG, but we can use the Lepton library, a tool released by the Dropbox team. It can also serve as a drop-in replacement for the packJPG library, should that fail on a certain stream.
How do you compile this? I think we should provide a Makefile, or maybe even better, a CMakeLists.txt.
Right now I am trying to compile with this command:
g++ fairytale.cpp filestream.cpp hybridstream.cpp storagemanager.cpp block.cpp deduper.cpp analyser.cpp
But I get this error:
parsers/deflateparser.h: In member function 'bool DeflateParser<BruteMode>::validate(uint32_t, uint32_t, bool)':
parsers/deflateparser.h:90:49: error: call of overloaded 'max(long unsigned int, const uint32_t&)' is ambiguous
return (in>max(32ul/max(1ul, out/max(1ul, in)), uint32_t(brute)<<7));
^
In file included from stream.h:23:0,
from filestream.h:23,
from hybridstream.h:23,
from storagemanager.h:23,
from block.h:23,
from deduper.h:23,
from analyser.h:23,
from analyser.cpp:20:
common.h:145:12: note: candidate: int max(int, int)
inline int max(int a, int b) { return a<b?b:a; }
^~~
common.h:147:17: note: candidate: uint32_t max(uint32_t, uint32_t)
inline uint32_t max(uint32_t a, uint32_t b) { return a<b?b:a; }
^~~
common.h:149:15: note: candidate: size_t max(size_t, size_t)
inline size_t max(size_t a, size_t b) { return a<b?b:a; }
^~~
common.h:151:16: note: candidate: int64_t max(int64_t, int64_t)
inline int64_t max(int64_t a, int64_t b) { return a<b?b:a; }
^~~
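The error comes from mixing an `unsigned long` literal (`32ul`) with a `uint32_t` expression: none of the `max` overloads in common.h is an exact match for that pair, so the call is ambiguous. A hedged sketch of one possible fix: replace the overload set with a single template (or cast both operands to one type at the call site). The function name and the example values below are illustrative, not the actual Fairytale code.

```cpp
#include <cassert>
#include <cstdint>

// A templated max avoids the overload ambiguity: with an explicit
// template argument, mixed-type operands are converted to one type
// and exactly one instantiation applies.
template <typename T>
inline T maxT(T a, T b) { return a < b ? b : a; }

// Sketch of the expression from deflateparser.h:90 with all operands
// forced to uint64_t (in, out, brute mirror the original variables).
bool validateSketch(uint64_t in, uint64_t out, bool brute) {
  return in > maxT<uint64_t>(32 / maxT<uint64_t>(1, out / maxT<uint64_t>(1, in)),
                             uint64_t(brute) << 7);
}
```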
There are some pieces of data that exist in identical form on countless devices around the world. Some examples are license files like the GPL and re-distributable libraries such as zlib1, qt* or 7z.dll.
For some of these, having an off-line dictionary bundled in compressed form with the Fairytale distribution could help a lot. A very simple parser could recognize them and then encode each as a minuscule reference, leading to denser archives that are faster to produce.
This method has an obvious drawback: the size of said dictionary. Even so, modern archivers tend to occupy dozens of MB on disk, sometimes more than a hundred, because of their choice of graphical libraries.
This is very much open to discussion. Might be worth giving it a try.
If you happen to know a file that fits these characteristics, please mention it in the comments.
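The lookup side of this idea can be sketched in a few lines. Everything here is hypothetical: the class name, the reference-id scheme, and the use of std::hash as a stand-in for a real cryptographic hash.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Sketch of the off-line dictionary idea: well-known blobs (GPL text,
// zlib1.dll, ...) are indexed by a hash; when a scanned block matches an
// entry byte-for-byte, the archive stores only a small reference id
// instead of the data itself.
struct DictEntry {
  uint32_t id;
  std::string bytes;
};

class OfflineDictionary {
 public:
  void add(uint32_t id, std::string bytes) {
    size_t h = std::hash<std::string>{}(bytes);
    entries_.emplace(h, DictEntry{id, std::move(bytes)});
  }
  // Returns the reference id, or -1 if the block isn't in the dictionary.
  int64_t lookup(const std::string& block) const {
    auto range = entries_.equal_range(std::hash<std::string>{}(block));
    for (auto it = range.first; it != range.second; ++it)
      if (it->second.bytes == block)  // verify: hashes can collide
        return it->second.id;
    return -1;
  }

 private:
  std::unordered_multimap<size_t, DictEntry> entries_;
};
```

In practice the dictionary bytes would live compressed on disk and only the hash index would be memory-resident, which keeps the drawback mentioned above manageable.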
There are some kinds of files that are particularly difficult to pack because they are already compressed; JPEG images are among the most common of these. Luckily, open source software exists today that is capable of performing lossless re-compression on them. Using the proven packJPG library by Matthias Stirner, we can effectively compress most jpg images in a manner that is not only stronger than any general purpose algorithm, but also relatively fast.
Combining this with the already implemented parser that can find jpg images embedded inside arbitrary data, Fairytale should be capable of reaching world-class ratios on most modern data sets.
At the moment, the output file doesn't have its final format. Because this archive file format is an essential part of the project, it's open for discussion. There's a wiki page with drafts for all parts of the format and a Gitter chat room where things can be discussed.
Fairytale will eventually support lossless re-compression of jpg images using one of two methods; see issues #12 and #13. Nevertheless, both libraries are only capable of processing one image at a time.
In the real world, though, we often encounter very similar images in the same folder / data set: for example, a few camera shots of the same spot, or screenshots of the same program inside a help file. These similarities can be exploited to generate a much denser archive, but only after undoing the compression of each individual file and recognizing them as valid images. Currently we don't have the ability to do that, but a good place to start is uncmpJPG by Matthias Stirner.
The parser in its current state will be fooled by a file like this recurse.gz and enter an infinite loop.
Its decompressed form is its compressed form, so it produces a parse of unlimited depth that never ends and eventually fills the RAM or, worse, the disk.
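A guard against such fixed-point streams is cheap to sketch: stop recursing when the decompressed output is byte-identical to the input, and cap the nesting depth as a backstop. The function name and the depth cap are hypothetical, not existing Fairytale code.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a recursion guard for quine-like streams such as recurse.gz.
// depth: current recursion depth; sameBytes: the decompressed output is
// byte-identical to the input stream (a fixed point that would loop forever).
bool shouldRecurse(size_t depth, bool sameBytes) {
  constexpr size_t kMaxDepth = 16;  // hypothetical cap on nesting
  if (sameBytes) return false;      // fixed point detected: stop immediately
  return depth < kMaxDepth;         // backstop for deeply nested archives
}
```

The byte-identity check catches recurse.gz on the very first level, while the depth cap bounds memory and disk usage for merely pathological (non-quine) nestings.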
The VERBOSE flag was disabled by default because on Windows systems the console output causes a slowdown. It might be good to add a "v" command line option instead of a compiler flag, so verbosity can be controlled without needing a recompile.
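A runtime switch of that kind is a few lines of argv scanning. This is a minimal sketch, assuming a "-v" spelling; the function name is hypothetical and the real option handling would live wherever Fairytale parses its other arguments.

```cpp
#include <cassert>
#include <cstring>

// Sketch of replacing the compile-time VERBOSE flag with a runtime "-v"
// switch: scan argv once at startup and keep the result in a bool that
// the logging code checks instead of an #ifdef.
bool parseVerbose(int argc, const char* const argv[]) {
  for (int i = 1; i < argc; ++i)
    if (std::strcmp(argv[i], "-v") == 0)
      return true;
  return false;
}
```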
Currently, a plethora of media formats are compressed with a legacy algorithm, deflate, that is about 20 years old. Those formats include, among many others, legacy .zip archives, PDF documents, all files saved by the Microsoft Office and LibreOffice suites, PNG images, Java .jar applications, Android .apk apps and even hard drive images.
There are working methods for losslessly undoing said compression in order to apply stronger algorithms to the original data. The most popular implementation of deflate is the zlib library, and Precomp is an open source implementation of such a recompression method. Even so, deflate streams aren't produced only by zlib but by other programs too, and those aren't always recognized by Precomp.
Fairytale should, in due time, be capable of recompressing most if not all deflate implementations.
Lower priority but should be considered while coding the underlying framework.
Consider structuring the code with event-driven hooking points.
To clarify:
Structured code with event "hooking" points has the ability to extend its functionality in several places using an object-oriented method.
That hooking mechanism means plugins can be written with high-level code.
As a side effect, it will grow the community and also assist the parallel development of the project.
https://softwareengineering.stackexchange.com/questions/237876/what-should-plugins-use-hooks-events-or-something-else
https://cement.readthedocs.io/en/latest/dev/hooks/
https://stackoverflow.com/questions/28166754/plugin-system-with-events-or-hooks
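The pattern described in those links boils down to a registry of named events with callback lists. A minimal sketch, assuming nothing about Fairytale's actual architecture: the class, the event names and the string payload are all illustrative placeholders.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal sketch of event-driven hooking points: the core fires named
// events at well-defined places (after detection, before compression, ...)
// and plugins register callbacks for the events they care about.
class HookRegistry {
 public:
  using Callback = std::function<void(const std::string& payload)>;

  // A plugin registers a callback for a named event.
  void on(const std::string& event, Callback cb) {
    hooks_[event].push_back(std::move(cb));
  }

  // The core fires an event; every registered callback runs in order.
  void fire(const std::string& event, const std::string& payload) const {
    auto it = hooks_.find(event);
    if (it == hooks_.end()) return;  // no plugin cares about this event
    for (const auto& cb : it->second) cb(payload);
  }

 private:
  std::unordered_map<std::string, std::vector<Callback>> hooks_;
};
```

A real implementation would pass a typed event struct instead of a string and define the hook points as part of the framework's public API, so plugins can be developed in parallel with the core.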