
kanzi's Introduction

Kanzi

Kanzi is a modern, modular, expandable and efficient lossless data compressor implemented in Java.

  • modern: state-of-the-art algorithms are implemented and the built-in multi-threading takes advantage of multi-core CPUs.
  • modular: entropy codec and a combination of transforms can be provided at runtime to best match the kind of data to compress.
  • expandable: clean design with heavy use of interfaces as contracts makes integrating and expanding the code easy. No dependencies.
  • efficient: the code is optimized for efficiency (trade-off between compression ratio and speed).

Unlike most common lossless data compressors, Kanzi uses a variety of compression algorithms and, as a result, supports a wider range of compression ratios. Most mainstream compressors do not take advantage of the many cores and threads available on modern CPUs (what a waste!). Kanzi is multithreaded by design and uses several threads by default to compress blocks concurrently. It is not compatible with standard compression formats. Kanzi is a lossless data compressor, not an archiver. It uses checksums (optional but recommended) to validate data integrity but has no mechanism for data recovery. It also does not deduplicate data across files.
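The idea of compressing blocks concurrently and attaching a checksum to each block can be illustrated with a minimal sketch. It is not Kanzi's code or API: it uses the JDK's Deflater and CRC32 as stand-ins, and the class name BlockParallelSketch, the Block holder and the method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Hypothetical illustration only: JDK Deflater/CRC32 stand in for Kanzi's codecs.
final class BlockParallelSketch {
    /** One compressed block plus the CRC32 of its original bytes. */
    static final class Block {
        final byte[] compressed;
        final long checksum;

        Block(byte[] compressed, long checksum) {
            this.compressed = compressed;
            this.checksum = checksum;
        }
    }

    /** Compresses data[off, off+len) and records a checksum of the input. */
    static Block compressBlock(byte[] data, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        Deflater deflater = new Deflater();
        deflater.setInput(data, off, len);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return new Block(out.toByteArray(), crc.getValue());
    }

    /** Splits the input into fixed-size blocks and compresses them on a thread pool. */
    static List<Block> compress(byte[] data, int blockSize, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(jobs);
        try {
            List<Future<Block>> futures = new ArrayList<>();
            for (int off = 0; off < data.length; off += blockSize) {
                final int start = off;
                final int len = Math.min(blockSize, data.length - off);
                futures.add(pool.submit(() -> compressBlock(data, start, len)));
            }
            // Collect results in submission order so blocks stay in input order.
            List<Block> blocks = new ArrayList<>();
            for (Future<Block> f : futures) {
                blocks.add(f.get());
            }
            return blocks;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A decoder for such a scheme would recompute the CRC32 of each decompressed block and compare it with the stored value. This detects corruption but, as noted above, cannot repair it.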

For more details, check https://github.com/flanglet/kanzi/wiki.

See how to reuse the code here: https://github.com/flanglet/kanzi/wiki/Using-and-extending-the-code

There is a C++ implementation available here: https://github.com/flanglet/kanzi-cpp

There is a Go implementation available here: https://github.com/flanglet/kanzi-go

Why Kanzi

There are many excellent, open-source lossless data compressors available already.

While gzip is starting to show its age, zstd and brotli are open-source, standardized and used daily by millions of people. Zstd is incredibly fast and probably the best choice in many cases. There are a few scenarios where Kanzi could be a better choice:

  • gzip, lzma, brotli and zstd are all LZ based, which limits the compression ratios they can reach. Kanzi also makes use of BWT and CM, which can compress beyond what LZ can do.

  • These LZ based compressors are well suited for software distribution (one compression / many decompressions) due to their fast decompression (but low compression speed at high compression ratios). There are other scenarios where compression speed is critical: when data is generated before being compressed and consumed (one compression / one decompression) or during backups (many compressions / one decompression).

  • Kanzi has built-in customized data transforms (multimedia, utf, text, dna, ...) that can be chosen and combined at compression time to better compress specific kinds of data.

  • Kanzi can take advantage of the multiple cores of a modern CPU to improve performance.

  • It is easy to implement a new transform or entropy codec to either test an idea or improve compression ratio on specific kinds of data.

Benchmarks

Test machine:

AWS c5a.8xlarge: AMD EPYC 7R32 (32 vCPUs), 64 GB RAM

openjdk 21.0.1+12-29

Ubuntu 22.04.3 LTS

Kanzi version 2.2 Java implementation.

On this machine Kanzi can use up to 16 threads (depending on compression level). bzip3 uses 16 threads. zstd can use 2 threads for compression; the other compressors are single-threaded.

silesia.tar

Download at http://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip

| Compressor | Encoding (sec) | Decoding (sec) | Size (bytes) |
|---|---|---|---|
| Original | | | 211,957,760 |
| Kanzi -l 1 | 1.337 | 1.186 | 80,284,705 |
| lz4 1.9.5 -4 | 3.397 | 0.987 | 79,914,864 |
| Zstd 1.5.5 -2 | 0.761 | 0.286 | 69,590,245 |
| Kanzi -l 2 | 1.343 | 1.343 | 68,231,498 |
| Brotli 1.1.0 -2 | 1.749 | 2.459 | 68,044,145 |
| Gzip 1.10 -9 | 20.15 | 1.316 | 67,652,229 |
| Kanzi -l 3 | 1.906 | 1.692 | 64,916,444 |
| Zstd 1.5.5 -5 | 2.003 | 0.324 | 63,103,408 |
| Kanzi -l 4 | 2.458 | 2.521 | 60,770,201 |
| Zstd 1.5.5 -9 | 4.166 | 0.282 | 59,444,065 |
| Brotli 1.1.0 -6 | 14.53 | 4.263 | 58,552,177 |
| Zstd 1.5.5 -13 | 19.15 | 0.276 | 58,061,115 |
| Brotli 1.1.0 -9 | 70.07 | 7.149 | 56,408,353 |
| Bzip2 1.0.8 -9 | 16.94 | 6.734 | 54,572,500 |
| Kanzi -l 5 | 3.228 | 2.268 | 54,051,139 |
| Zstd 1.5.5 -19 | 92.82 | 0.302 | 52,989,654 |
| Kanzi -l 6 | 4.950 | 2.522 | 49,517,823 |
| Lzma 5.2.5 -9 | 92.6 | 3.075 | 48,744,632 |
| Kanzi -l 7 | 4.478 | 3.181 | 47,308,484 |
| bzip3 1.3.2.r4-gb2d61e8 -j 16 | 2.682 | 3.221 | 47,237,088 |
| Kanzi -l 8 | 10.67 | 11.13 | 43,247,248 |
| Kanzi -l 9 | 24.78 | 26.73 | 41,807,179 |
| zpaq 7.15 -m5 -t16 | 213.8 | 213.8 | 40,050,429 |

enwik8

Download at https://mattmahoney.net/dc/enwik8.zip

| Compressor | Encoding (sec) | Decoding (sec) | Size (bytes) |
|---|---|---|---|
| Original | | | 100,000,000 |
| Kanzi -l 1 | 1.221 | 0.684 | 43,747,730 |
| Kanzi -l 2 | 1.254 | 0.907 | 37,745,093 |
| Kanzi -l 3 | 1.093 | 0.989 | 33,839,184 |
| Kanzi -l 4 | 1.800 | 1.648 | 29,598,635 |
| Kanzi -l 5 | 2.066 | 1.740 | 26,527,955 |
| Kanzi -l 6 | 2.648 | 1.743 | 24,076,669 |
| Kanzi -l 7 | 3.742 | 1.741 | 22,817,376 |
| Kanzi -l 8 | 6.619 | 6.633 | 21,181,978 |
| Kanzi -l 9 | 17.81 | 18.23 | 20,035,133 |

Credits

Matt Mahoney, Yann Collet, Jan Ondrus, Yuta Mori, Ilya Muravyov, Neal Burns, Fabian Giesen, Jarek Duda, Ilya Grebnov

Disclaimer

Use at your own risk. Always keep a copy of your original files.

kanzi's Issues

Go BWT transform freezes on some input

Trying a BWT transform on a block from the abba file (from the gauntlet corpus) freezes.
abba-2.gz

./Kanzi -compress -input=abba-2 -transform=bwt -entropy=none -verbose=4 -overwrite -block=32000000

Kanzi 1.0 (C) 2017, Frederic Langlet
Input file name set to 'abba-2'
Output file name set to 'abba-2.knz'
Block size set to 32000000 bytes
Verbosity set to 4
Overwrite set to true
Checksum set to false
Using BWT transform (stage 1)
Using no entropy codec (stage 2)
Using 1 job
Encoding ...
{ "type":"BEFORE_TRANSFORM", "id":1, "size":3150180, "time":1496668411814}
^C

Go BWT Inverse on a particular file generates "index out of range" error

Kanzi --compress --input=bwt.bin --output=test.kanzi --transform=BWT --entropy=none --block="13000000" --force
Kanzi 1.1 (C) 2017, Frederic Langlet
Encoding ...

Encoding: 793 ms
Input size: 12078908
Output size: 12078929
Ratio: 1.000002
Throughput (KB/s): 14874

Kanzi --decompress --input=test.kanzi

Warning: the input file name does not end with the .KNZ extension
Kanzi 1.1 (C) 2017, Frederic Langlet
Decoding ...
runtime error: index out of range

bwt.bin.gz
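The ratio and throughput figures in the encoding log above can be reproduced from the reported sizes and time. The sketch below is a reconstruction from the numbers only, not Kanzi's actual reporting code: it assumes the ratio is output size divided by input size, that 1 KB means 1024 bytes, and that the throughput is truncated to an integer. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper reproducing the figures printed in the log above.
final class LogFigures {
    /** Ratio as output size over input size (assumption inferred from the log). */
    static double ratio(long inputSize, long outputSize) {
        return (double) outputSize / inputSize;
    }

    /** Throughput in KB/s, with 1 KB = 1024 bytes, truncated (assumption). */
    static long throughputKBps(long inputSize, long millis) {
        return (inputSize * 1000L) / (millis * 1024L);
    }

    public static void main(String[] args) {
        long input = 12_078_908L;   // "Input size" from the log
        long output = 12_078_929L;  // "Output size" from the log
        long millis = 793L;         // "Encoding: 793 ms"
        // prints "Ratio: 1.000002"
        System.out.println(String.format(Locale.ROOT, "Ratio: %.6f", ratio(input, output)));
        // prints "Throughput (KB/s): 14874"
        System.out.println("Throughput (KB/s): " + throughputKBps(input, millis));
    }
}
```

The output is 21 bytes larger than the input, which is consistent with a BWT-only pipeline with no entropy stage: the transform reorders bytes but does not shrink them, and the stream adds a small amount of framing.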

Go code (benchmarks in particular) is not go-gettable.

It would be nice if this

go get -d -t -v github.com/flanglet/kanzi/go/src/kanzi/benchmark

obtained the Kanzi benchmarks and all their dependencies.
However, this is the result instead:

There was an error running 'go get', stderr = github.com/flanglet/kanzi (download)
package kanzi/bitstream: unrecognized import path "kanzi/bitstream" (import path does not begin with hostname)
package kanzi/entropy: unrecognized import path "kanzi/entropy" (import path does not begin with hostname)
package kanzi/function: unrecognized import path "kanzi/function" (import path does not begin with hostname)
package kanzi/io: unrecognized import path "kanzi/io" (import path does not begin with hostname)
package kanzi/transform: unrecognized import path "kanzi/transform" (import path does not begin with hostname)

I realize this may not be fixable; it is however an issue, and it prevents inclusion of the Kanzi benchmarks in a suite of benchmarks that is run somewhat more automatically. (See https://github.com/dr2chase/bent.)

Kanzi not available on Maven Central?

Hello! I've got a Clojure project that I'd like to integrate Kanzi into. It uses Leiningen to manage dependencies, which requires it to pull from either a Maven repository (for Java projects, Maven Central is the default) or a Clojure one (for Clojure projects, Clojars is the default).

Is there a plan to get artifacts for Kanzi to be deployed into Maven Central?

Performance X Size Compress

I have tested the kanzi package with an image and it is fast; you really have done a great job.
However, the gz output is smaller than the snp output, by approximately 18 KB, and gzip is also faster. Is it possible to reduce the size of the snp format? Or is that not the objective of the snp format?
In my test, the buffer size used for .gz is the same as the one used for the snp format: 32768.

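// Copy the input stream 'in' to a gzip file through a 32 KB buffer.
// try-with-resources closes the stream so the gzip trailer is written.
try (GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(outfile))) {
    byte[] buf = new byte[32768];
    int len;
    while ((len = in.read(buf)) != -1) {
        out.write(buf, 0, len);
    }
}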

I replaced CompressedOutputStream("None", "Snappy", out) with CompressedOutputStream(out) from emory-util-io.jar version 2.1. Snappy is then faster than .gz, but the snp output is still bigger, by approximately 7 KB.

Dependency: edu.emory.mathcs.util:emory-util-io:2.1

BWTBlockCodec requires 1Gb+ of memory regardless of input or options

Running Kanzi on a 12 KB text file with the default codec allocates a buffer of at least 1 GB. See BWTBlockCodec.java:

    @Override
    public int getMaxEncodedLength(int srcLen)
    {
        return srcLen + BWT_MAX_HEADER_SIZE + BWT.maxBlockSize();
    }

BWT.maxBlockSize() returns a static final constant equal to 1 GB.
Is that amount of memory really required? The compressed stream should normally be of the same order of size as the uncompressed stream. Maybe there should be a Math.min() instead of a sum.
With Java 8 on my computer and the default memory options, the process fails with an out-of-memory exception. I need at least -Xmx4g to make it work.
Thanks
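The arithmetic behind this report can be sketched as follows. The constants are assumptions: the report only says the maximum block size is "1Gb" (taken here as 2^30 bytes), and the header size of 4 bytes is a placeholder. The bounded variant is one possible reading of the Math.min() suggestion, not the project's actual fix.

```java
// Hypothetical sketch of the buffer-size arithmetic described in the report.
final class MaxEncodedLengthSketch {
    static final int BWT_MAX_BLOCK_SIZE = 1 << 30; // assumed value for "1Gb"
    static final int BWT_MAX_HEADER_SIZE = 4;      // placeholder, real value not given

    /** The bound as quoted in the report: always above 1 GB. */
    static long reportedBound(int srcLen) {
        // long arithmetic avoids int overflow for large srcLen
        return (long) srcLen + BWT_MAX_HEADER_SIZE + BWT_MAX_BLOCK_SIZE;
    }

    /** One reading of the suggested fix: cap the payload, then add the header. */
    static long boundedAlternative(int srcLen) {
        return (long) Math.min(srcLen, BWT_MAX_BLOCK_SIZE) + BWT_MAX_HEADER_SIZE;
    }

    public static void main(String[] args) {
        int srcLen = 12 * 1024; // the 12 KB file from the report
        System.out.println(reportedBound(srcLen));      // prints 1073754116
        System.out.println(boundedAlternative(srcLen)); // prints 12292
    }
}
```

With these assumptions, a 12 KB input asks for a buffer of about 1 GB under the quoted formula, versus roughly 12 KB with the capped variant.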

"index out of range" error in Go BWT Forward transform

Kanzi --compress --input=bwt2.bin --entropy=none --transform=bwt --force

Kanzi 1.1 (C) 2017, Frederic Langlet
Encoding ...
panic: runtime error: index out of range

goroutine 5 [running]:
kanzi/transform.(*BWT).Forward(0xc42000e900, 0xc4201f2000, 0xcccc, 0xcccc, 0xc420200003, 0xcccd, 0xcccd, 0x3, 0xc420042c80, 0x40cd6d, ...)
/home/user/go/src/kanzi/transform/BWT.go:146 +0x249
kanzi/function.(*BWTBlockCodec).Forward(0xc42000c0a8, 0xc4201f2000, 0xcccc, 0xcccc, 0xc420200000, 0xccd0, 0xccd0, 0x100, 0x0, 0x101000000000000, ...)
/home/user/go/src/kanzi/function/BWTBlockCodec.go:73 +0x110
kanzi/function.(*ByteTransformSequence).Forward(0xc42000ad20, 0xc4201f2000, 0xcccc, 0xcccc, 0xc420200000, 0xccd0, 0xccd0, 0x0, 0x0, 0x0, ...)
/home/user/go/src/kanzi/function/ByteTransformSequence.go:85 +0x1eb
kanzi/io.(*EncodingTask).encode(0xc4200d82a0)
/home/user/go/src/kanzi/io/CompressedStream.go:466 +0xb97
created by kanzi/io.(*CompressedOutputStream).processBlock
/home/user/go/src/kanzi/io/CompressedStream.go:391 +0x39d

bwt2.bin.gz

Help with an example on image java!

Hi Kanzi, it seems like your code is very good and useful. However, I'm trying to use it on an image in Java and I'm having a lot of problems.
Is there an example of how to use it on an image, from reading to writing?
I would appreciate it a lot, thanks!!!
;)

extra space

The source has a lot of extra whitespace and line breaks.

"index out of range" in Go BWT transform

Running the kanzi BWT transform on a block of the "tra1" file from the Calgary test suite generates an error message in the Go implementation (not in cpp).
Thanks for all your work !

./app -compress -input=tra1-truncated -output=tra1-cpp -transform=bwt -entropy=none -overwrite
Kanzi 1.0 (C) 2017, Frederic Langlet
Encoding ...
panic: runtime error: index out of range

goroutine 5 [running]:
kanzi/transform.(*DivSufSort).ssCompare3(0xc420070980, 0x2e69, 0x45ed, 0x2, 0x3)
/home/user/go/src/kanzi/transform/DivSufSort.go:523 +0x13b
kanzi/transform.(*DivSufSort).ssMergeForward(0xc420070980, 0x3264, 0x1001, 0x1260, 0x2001, 0x2001, 0x2)
/home/user/go/src/kanzi/transform/DivSufSort.go:863 +0x10f
kanzi/transform.(*DivSufSort).ssSwapMerge(0xc420070980, 0x3264, 0x1001, 0x1260, 0x2001, 0x2001, 0x6d5, 0x2)
/home/user/go/src/kanzi/transform/DivSufSort.go:735 +0x5d2
kanzi/transform.(*DivSufSort).ssSort(0xc420070980, 0x3264, 0x0, 0x26d6, 0x2fa7, 0x2bd, 0x2, 0x620b, 0x1)
/home/user/go/src/kanzi/transform/DivSufSort.go:444 +0x1c7
kanzi/transform.(*DivSufSort).sortTypeBstar(0xc420070980, 0xc42009d800, 0x100, 0x100, 0xc4201fa000, 0x10000, 0x10000, 0x620b, 0xc420070980)
/home/user/go/src/kanzi/transform/DivSufSort.go:280 +0x8c8
kanzi/transform.(*DivSufSort).ComputeSuffixArray(0xc420070980, 0xc4201e6000, 0x620b, 0x620b, 0x620d, 0x620d, 0xc420040b00)
/home/user/go/src/kanzi/transform/DivSufSort.go:112 +0xdb
kanzi/transform.(*BWT).Forward(0xc42006a8a0, 0xc4201e6000, 0x620b, 0x620b, 0xc4201eca82, 0x620d, 0x620d, 0x3, 0xc420040c80, 0x40cd6d, ...)
/home/user/go/src/kanzi/transform/BWT.go:137 +0x12e
kanzi/function.(*BWTBlockCodec).Forward(0xc42000c0b0, 0xc4201e6000, 0x620b, 0x620b, 0xc4201eca80, 0x620f, 0x620f, 0x4396bb, 0x10, 0x10000000053d420, ...)
/home/user/go/src/kanzi/function/BWTBlockCodec.go:73 +0x110
kanzi/function.(*ByteTransformSequence).Forward(0xc42000ade0, 0xc4201e6000, 0x620b, 0x620b, 0xc4201eca80, 0x620f, 0x620f, 0x0, 0x0, 0x0, ...)
/home/user/go/src/kanzi/function/ByteTransformSequence.go:85 +0x1eb
kanzi/io.(*EncodingTask).encode(0xc4200cc380)
/home/user/go/src/kanzi/io/CompressedStream.go:466 +0xb97
created by kanzi/io.(*CompressedOutputStream).processBlock
/home/user/go/src/kanzi/io/CompressedStream.go:391 +0x39d

tra1-truncated.zip

when I tried to use the same algorithm on PDF and DOCX formats, the effect was not ideal.

For text files, I tried the TPAQ codec with the transform X86+RLT+TEXT, which worked well. However, when I used the same settings on PDF and DOCX files, the results were not ideal. How should I set the codec and transform for these formats?

HashMap<String, Object> ctx = new HashMap<>();
ctx.put("transform", "X86+RLT+TEXT");
ctx.put("codec", "TPAQ");
ctx.put("blockSize", 1024 * 1024);
ctx.put("checksum", false);
ctx.put("pool", pool); // not necessary if jobs = 1
ctx.put("jobs", 4);
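One possible factor, stated here as an assumption rather than something the report confirms: DOCX files are ZIP containers and PDF files commonly store their streams Flate-compressed, so much of their content is already compressed before any further codec sees it. The sketch below uses only the JDK's Deflater (not Kanzi) to show that a second compression pass over already-compressed bytes gains little. The class name, the vocabulary and the seed are arbitrary.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical illustration: recompressing compressed data yields little gain.
final class RecompressionSketch {
    /** Compresses the input with the JDK Deflater at its default level. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Builds pseudo-text by drawing words at random from a small vocabulary. */
    static byte[] sampleText(int words, long seed) {
        String[] vocab = {"block", "codec", "entropy", "transform", "stream", "buffer",
                          "input", "output", "ratio", "speed", "thread", "level",
                          "text", "data", "file", "size"};
        Random rnd = new Random(seed);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words; i++) {
            sb.append(vocab[rnd.nextInt(vocab.length)]).append(' ');
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] text = sampleText(50_000, 42L);
        byte[] once = deflate(text);   // plain text compresses well
        byte[] twice = deflate(once);  // compressed bytes barely shrink, or grow
        System.out.println("original:         " + text.length);
        System.out.println("compressed once:  " + once.length);
        System.out.println("compressed twice: " + twice.length);
    }
}
```

If this is the cause, changing the transform list alone is unlikely to help much; the embedded streams would have to be decompressed first, which is outside what the snippet in the report does.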

"index out of range" in BWT transform

Running kanzi on a truncated "bib" file from the Calgary test suite generates an error message.

./Kanzi -compress -input=bib-truncated -output=bib.kanzi -transform=bwt -entropy=none -overwrite

Kanzi 1.0 (C) 2017, Frederic Langlet
Encoding ...
panic: runtime error: index out of range

goroutine 5 [running]:
kanzi/transform.(*DivSufSort).ssMultiKeyIntroSort(0xc420072980, 0xb273, 0xb37, 0xe14, 0x2)
/home/user/go/src/kanzi/transform/DivSufSort.go:1261 +0x530
kanzi/transform.(*DivSufSort).ssSort(0xc420072980, 0xb273, 0xb37, 0xe14, 0x522d, 0x6046, 0x2, 0x104a0, 0x0)
/home/user/go/src/kanzi/transform/DivSufSort.go:452 +0x2ac
kanzi/transform.(*DivSufSort).sortTypeBstar(0xc420072980, 0xc42009f800, 0x100, 0x100, 0xc42020a000, 0x10000, 0x10000, 0x104a0, 0xc420072980)
/home/user/go/src/kanzi/transform/DivSufSort.go:280 +0x8c8
kanzi/transform.(*DivSufSort).ComputeSuffixArray(0xc420072980, 0xc4201e6000, 0x104a0, 0x104a0, 0x104a1, 0x104a1, 0x1b600)
/home/user/go/src/kanzi/transform/DivSufSort.go:112 +0xdb
kanzi/transform.(*BWT).Forward(0xc42006a8a0, 0xc4201e6000, 0x104a0, 0x104a0, 0xc4201f8003, 0x104a1, 0x104a1, 0x0, 0xc420040c80, 0x40cd6d, ...)
/home/user/go/src/kanzi/transform/BWT.go:137 +0x12e
kanzi/function.(*BWTBlockCodec).Forward(0xc42000c0b0, 0xc4201e6000, 0x104a0, 0x104a0, 0xc4201f8000, 0x104a4, 0x104a4, 0x4396bb, 0x10, 0x10100000053d300, ...)
/home/user/go/src/kanzi/function/BWTBlockCodec.go:73 +0x110
kanzi/function.(*ByteTransformSequence).Forward(0xc42000ade0, 0xc4201e6000, 0x104a0, 0x104a0, 0xc4201f8000, 0x104a4, 0x104a4, 0x0, 0x0, 0x0, ...)
/home/user/go/src/kanzi/function/ByteTransformSequence.go:85 +0x1eb
kanzi/io.(*EncodingTask).encode(0xc4200cc380)
/home/user/go/src/kanzi/io/CompressedStream.go:466 +0xb97
created by kanzi/io.(*CompressedOutputStream).processBlock
/home/user/go/src/kanzi/io/CompressedStream.go:391 +0x39d

bib.zip

no main manifest attribute, in kanzi-x.x.x.jar

Hi, I apologize in advance: I'm not good enough to compile this project myself and I'm not even a programmer, so I always ask my friend to compile it.
With every new version he compiles, I get this error when running it:
no main manifest attribute, in kanzi-x.x.x.jar.
My friend says this is because the following line needs to be added to META-INF\MANIFEST.MF:
Main-Class: kanzi.app.Kanzi, which must be terminated by a newline.
I always have to unzip kanzi-x.x.x.jar, add this line and zip the file back again.
Is it possible to fix this in the source code so I don't have to do this?
The second thing is that my friend said the latest version, 2.1, reports itself as 2.0.0, so the version string probably was not updated somewhere in the source code.
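Whether a jar declares an entry point can be checked programmatically with the JDK's java.util.jar classes. The sketch below builds a tiny jar in memory, with or without a Main-Class attribute, and reads the attribute back. The class and method names are hypothetical; the only value taken from the report is the main class name kanzi.app.Kanzi.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper: inspects the Main-Class attribute of a jar's manifest.
final class ManifestCheck {
    /** Builds a jar containing only a manifest; mainClass may be null. */
    static byte[] jarWithManifest(String mainClass) {
        Manifest mf = new Manifest();
        // Manifest-Version is required, otherwise the attributes are not written.
        mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (mainClass != null) {
            mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, mf)) {
            // No further entries: the manifest alone is enough for this check.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns the Main-Class value, or null when the attribute is absent. */
    static String mainClassOf(byte[] jarBytes) {
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            Manifest mf = jar.getManifest();
            return mf == null ? null : mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(mainClassOf(jarWithManifest("kanzi.app.Kanzi"))); // prints kanzi.app.Kanzi
        System.out.println(mainClassOf(jarWithManifest(null)));              // prints null
    }
}
```

As a workaround that avoids repacking the jar, the JVM can also be given the class name on the command line, for example java -cp kanzi-x.x.x.jar kanzi.app.Kanzi, assuming kanzi.app.Kanzi is indeed the entry point as the report states.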
