
ncompress's Introduction

About

This is (N)compress. It is an improved version of compress 4.1.

Compress is a fast, simple LZW file compressor. Compress does not achieve the highest compression ratio, but it is one of the fastest programs to compress data. Compress is the de facto standard in the UNIX community for compressing files.

(N)compress 4.2 introduced a special, fast compression hash algorithm. This algorithm uses more memory than the old hash table. If you don't want the faster hash table algorithm, set 'Memory free for compress' below 800000.

Starting with compress 3.0, the output format changed in a backwards incompatible way. This is not a big deal as compress 3.0 was first released in Jan 1985, while the first release of compress was available less than a year prior. There shouldn't be any need to produce files that only older versions of compress would accept.

Newer versions of compress are still able to handle the output of older versions though -- i.e. compress 3.0+ is able to decompress files produced by compress 2.0 and older.

Building

For recent systems with GNU make, you can simply run make, as the default 'GNUmakefile' will get picked up.

'build' is a menu-driven shell script for compiling, testing, and installing (N)compress. So to build and install (N)compress, all you have to do is run build. build will first test your system for default settings. The current compile settings are stored in a special file called compress.def.

For users who have problems with build, a default makefile called 'Makefile.def' is included. build is also capable of generating a Makefile with all options (option genmake).

Support

Send comments, complaints, and especially patches to https://github.com/vapier/ncompress/issues

Licensing

The ncompress code is released into the public domain. See the UNLICENSE file for more details.

Patents

All existing patents on the LZW algorithm have expired world-wide. So LZW is now patent free.

Remarks

  • Build is a Bourne shell script. On some systems it is necessary to type 'sh build'.

  • The build script uses tput for nice screen handling. If your system has no tput, that's no problem.

  • For configuration testing, build uses a lot of small C programs. During those tests, stderr is redirected to /dev/null. During the compilation of compress itself, output is NOT redirected.

  • The /bin/sh under Ultrix can't handle ${var:-str}, so use ksh for the build script.

  • The output of (N)compress 4.2+ is not exactly the same as compress 4.0 because of a different table reset point. The output of (N)compress 4.2+ is still 100% compatible with compress 4.0.

  • Some systems have performance problems with reads bigger than BUFSIZ (the read-ahead function does not work as expected). For those systems, use the default BSIZE input buffer size.

  • compress can be slower on small files (<10Kb) because of the large table reset overhead. Use cpio or tar to make one bigger file if possible; it is faster and usually also gives a better compression ratio.

  • Files compressed on a large machine with more bits than allowed by a version of compress on a smaller machine cannot be decompressed there! Use the "-b12" flag to generate a file on a large machine that can be uncompressed on a 16-bit machine.


ncompress's Issues

Test suite fails on macOS

Hello, make check fails on macOS:

./tests/runtests.sh
readlink: illegal option -- f
usage: readlink [-n] [file ...]
using compress: ./compress
Setting up test env
cp: ./README.md: No such file or directory
make: *** [check] Error 1

Unfortunately, BSD readlink (which macOS uses) doesn't support the same options that GNU readlink does.

Not an issue but thank you!

I have some very old text collections that were compressed with the Unix compress tool, and this project was the only thing I found that allowed me to access them.

Thank you!

Decompression could be optimized (would be 5x to 20x faster)

Hi,

LZW decompression could be made somewhat faster by not using the classic build-a-stack-then-reverse-it technique, but rather using references into the read buffer.

I implemented this in a GIF reader a while ago. If that's of interest, I could try working on a patch.

Cheers,

Robert.

Windows requires setmode binary for stdout

I've received an issue report that stdout mode is broken on Windows: compress -c is not usable. The problem is legacy line-feed magic inherited from MS-DOS: the system write call and the 0x0A byte are the worst enemies, because stdout is not in binary mode by default.

We need to add setmode(fileno(stdout), O_BINARY) before using stdout.

Patch to make build reproducible

The Debian Reproducible Builds project is working on improving all Debian packages to make them build reproducibly, meaning that for a given version the same output is generated every time the code is built on a given platform.

Since ncompress puts the compile date in the version string, the build isn't reproducible. For Debian, the suggested fix is to use some other, predictable string instead of the current date. The general proposal is to use the package date, taken from the Debian changelog.

The patch below shows how I've chosen to implement this for Debian. It might or might not make sense to integrate something like this into the upstream package, but I thought it was worth documenting it. The patch below is reproducible.patch from Debian version 4.2.4.4-10.

Index: ncompress-4.2.4.4/compress42.c
===================================================================
--- ncompress-4.2.4.4.orig/compress42.c
+++ ncompress-4.2.4.4/compress42.c
@@ -1882,7 +1882,7 @@ prratio(stream, num, den)
 void
 about()
   {
-     fprintf(stderr, "Compress version: %s, compiled: %s\n", version_id, COMPILE_DATE);
+     fprintf(stderr, "%s\n", NCOMPRESS_VERSION);
      fprintf(stderr, "Compile options:\n        ");
 #if BYTEORDER == 4321 && NOALLIGN == 1
      fprintf(stderr, "USE_BYTEORDER, ");

To utilize this new code, I added 3 new shell variables in debian/rules:

# Generate a Debian-specific version string to use when compiling compress42.c
# By using the Debian version rather than a compile date, the build is reproducible.
# See also: https://wiki.debian.org/ReproducibleBuilds/TimestampsProposal
PACKAGE_VERSION=$(shell dpkg-parsechangelog --count 1 -Sversion)
PACKAGE_DATE=$(shell dpkg-parsechangelog --count 1 -Sdate)
NCOMPRESS_VERSION="\"Debian ncompress v${PACKAGE_VERSION} (${PACKAGE_DATE})\""

Then, I updated debian/rules to build the code using this command:

gcc $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o compress -DUTIME_H -DNOFUNCDEF -DNCOMPRESS_VERSION=$(NCOMPRESS_VERSION) compress42.c

With this patch in place compress -V yields:

Debian ncompress v4.2.4.4-10 (Tue, 04 Aug 2015 20:13:43 -0500)
Compile options:

        REGISTERS=2 IBUFSIZ=8192, OBUFSIZ=8192, BITS=16

Author version 4.2 (Speed improvement & source cleanup):
     Peter Jannesen  ([email protected])

Author version 4.1 (Added recursive directory compress):
     Dave Mack  ([email protected])

Authors version 4.0 (World release in 1985):
     Spencer W. Thomas, Jim McKie, Steve Davies,
     Ken Turkowski, James A. Woods, Joe Orost

Ratio implementation overflows?

Hello. I couldn't understand the ratio implementation. I think there are some overflow issues in the current implementation. Please correct me if I am wrong.

I see the main idea: we have a source and a destination.
The ratio equals source_length / destination_length = s / d.
A new ratio of (s + 2) / (d + 1) is good; (s + 1) / (d + 2) is bad.
So we want to reset when new_ratio < old_ratio.
We won't reset when (s + 2) / (d + 1) > s / d; we will reset when (s + 1) / (d + 2) < s / d.

Then a 10000-byte lag was added to source_length to get a more consolidated ratio.

I see an implementation of this algorithm in Ruby in rb-compress-lzw. I have no questions about that implementation because there is the gmp library behind it; it will never overflow.

Now I am trying to read the implementation in void compress(fdin, fdout).
I see manipulations:

  1. int bytes_out = 0; bytes_in = 0
  2. bytes_out += OBUFSIZ
  3. bytes_out += (outbits+7)>>3
  4. bytes_in += i
  5. if (rpos > rlop) bytes_in += rpos-rlop

Let's imagine a large input.
Both bytes_in and bytes_out will overflow.
When the dictionary is filled, we will use bytes_in and bytes_out to compute a wrong ratio.

I see the same problems here.

rat = (bytes_out+(outbits>>3)) >> 8;: bytes_out + (outbits >> 3) can overflow, and rat will be invalid.

How can this be fixed?
I have not yet come up with a solution =)

Makefile should respect PREFIX

This already exists on the (abandoned?) master branch, but not in the branch that is actually getting maintenance releases. It would be beneficial for packaging purposes to make this available for practical use.

While I am at it, I think it would also be nice to have the install_extra logic. (If the install target needs to continue to install everything for compatibility purposes, it could be split into e.g. make install_core install_extra?)

[Windows Compilation] Consider adding preprocessor macro for lstat and chown

I just added this to compress.c to enable compilation on Windows:

#ifndef _WIN32
	if (chown(ofname, infstat.st_uid, infstat.st_gid)) { /* Copy ownership */
		fprintf(stderr, "\nchown error (ignored) ");
		perror(ofname);
		exit_code = 1;
	}
#endif
#ifndef _WIN32
	signal(SIGHUP, (SIG_TYPE)abort_compress);
#endif

Two times (lines 453 and 465):

#ifdef _WIN32
	if (stat(tempname, &infstat) == -1) {
#else
	if (lstat(tempname, &infstat) == -1) {
#endif

I added this symbol at compilation time :

  • BYTE_ORDER=LITTLE_ENDIAN

I updated the makefile too; patchlevel.h cannot be created with this command on Windows:

patchlevel.h: version
	echo "#define VERSION \"`cat version`\"" > patchlevel.h

And ... It works

Sparse array as compressor dictionary

Hello. In 2019 we have a great amount of RAM available, so we can reach the maximum possible performance for the LZW (with clear) algorithm.

The idea is simple: we can use a sparse array instead of the double-hashing array as the dictionary.
Please imagine a big array where (code << 8) | symbol => next_code. symbol is between 0 and 255, code is between 0 and (2 ** 16) - 1, and next_code is between 257 and (2 ** 16) - 1. 33.5 MB of RAM is required for such an array.

The problem is that we have to clear the sparse array. For example, we have to clear the dictionary 2040 times when compressing the 850 MB linux-4.20.3.tar; 33.5 MB * 2040 ~ 68 GB. I solved this issue by collecting the used sparse array indexes in a separate array and clearing just those indexes.

The complexity of insert or find is still O(1). You can find docs here. Implementation is here.

I am almost certain that ncompress won't accept such a huge memory eater. I just want to inform people. Thank you.

-C option doesn't work

The README.md file says you can use -C to create files that are "compatible with compress 2.0". But as far as I can tell, you can't. All -C does is change a bit in the header, asserting that the file uses the 2.0 format. But it doesn't use that format, so the file is invalid. It can't be decompressed by ncompress, or anything else I've tried.

$ echo "aaaaaaaaaaaaaaaa" | compress -C | compress -d
insize:8 posbits:9 inbuf:61 02 0A 1C 48 (1)
uncompress: corrupt input

Expected results: Either the feature should be implemented, or -C should fail with an error message, with the README file updated accordingly.

(Using the 4.2.4 branch, but I don't think it matters.)

question: ignored errors after compression succeeds

Hi @vapier!
I recently took over maintenance of this project's RPM packages in Fedora/RHEL and wanted to follow up with you about a question one of our users had regarding $SUBJECT.

If I understand the code correctly, after finishing the compression, ncompress tries to call chmod/chown on the output file (eg. here), trying to change its mode/owner to the values the original file had. In case the operation does not succeed (I guess when the owner of the original file is different from the one running the compression?), the errors from the calls are handled (but ignored, as the compression is already done) and the exit value of the program is set to "1".

I wonder, should the exit value of the program be still set to "1", even if the errors were "ignored" and (as far as I can understand) did not really affect the result of the compression? Or is there something I am missing here?

[Question] Compress output clarification request

Hi,

I'm trying to understand the implementation details of compress a little better. Since I'm C-illiterate, I've been attempting to reverse-engineer the precise behavior by example, albeit with no luck. I'm hoping someone can clarify with the following examples. I'm considering three files, with contents a, aa, and aaa, and no trailing newline. Applying compress yields the following bits:

a.Z:     00011111 10011101 10010000 01100001 00000000
aa.Z:    00011111 10011101 10010000 01100001 11000010 00000000
aaa.Z:   00011111 10011101 10010000 01100001 00000010 00000010

I recognise the first two bytes as the header. Of the third byte, the last five bits indicate the maximum code length, whereas the first three bits appear to be irrelevant (though please correct me if I'm wrong).

Looking at a.Z, the next byte just encodes the character a, though I kind of assumed a leading zero bit as I anticipated that we'd start with 9-bit codes. The trailing zeros appeared like an EOF indicator to me, but looking at the stray 1 at the second-to-last bit in aaa.Z, that is evidently not the case.

I'm struggling to see the pattern in these outputs. Could someone enlighten me?

compression with 9 bits doesn't work

it simply doesn't work (no 8-codes block flush after the clear code?):
$ compress -cb9 /usr/bin/compress >/tmp/test.Z
$ compress -dc /tmp/test.Z
insize:7901 posbits:0 inbuf:00 0A F7 23 CC (0)
uncompress: corrupt input
$ gzip -tv /tmp/test.Z
gzip: /tmp/correct.Z: corrupt input.

and it has not been working for a long time (at least 12 years):
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=220820

the debian proposed patch (using at least 10 bits) would break the Single UNIX Specification, Version 4 (IEEE Std 1003.1) as 9 <= bits <= 14:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/compress.html

if the bug cannot be resolved, a note should be added, like the one present in the debian package readme:
"If you create files using -b9 (maxbits of 9), your file is probably corrupted, and you will only be able to recover parts of it."

ciao!

Patch to fix code warnings

Attached is code-warnings.patch from Debian package 4.2.4.4-10. It fixes a code warning in compress42.c.

KEN

# Description: Fix compiler warnings in the code.
# Author: Kenneth J. Pronovici <[email protected]>
Index: ncompress-4.2.4.4/compress42.c
===================================================================
--- ncompress-4.2.4.4.orig/compress42.c
+++ ncompress-4.2.4.4/compress42.c
@@ -1007,8 +1007,12 @@ comprexx(fileptr)

               if (infstat.st_nlink > 1 && (!force))
               {
-                 fprintf(stderr, "%s has %d other links: unchanged\n",
-                             tempname, infstat.st_nlink - 1);
+                 /* There's no portable way to print an nlink_t, because it
+                     changes size by platform.  So, I just cast to unsigned
+                     long int and hope for the best.  This is basically the
+                     StackOverflow recommendation: http://stackoverflow.com/questions/1401526 */
+                 fprintf(stderr, "%s has %ld other links: unchanged\n",
+                             tempname, (unsigned long int) infstat.st_nlink - 1);
                  exit_code = 1;
                  return;
               }

Patch to fix typo in help output

Below is patch code-typos.patch from Debian package 4.2.4.4-10. It fixes a minor typo in the help output.

KEN

# Description: Fix typos in the code.
# Author: Kenneth J. Pronovici <[email protected]>
Index: ncompress-4.2.4.4/compress42.c
===================================================================
--- ncompress-4.2.4.4.orig/compress42.c
+++ ncompress-4.2.4.4/compress42.c
@@ -877,7 +877,7 @@ Usage: %s [-dfvcVr] [-b maxbits] [file .
             If -f is not used, the user will be prompted if stdin is.\n\
             a tty, otherwise, the output file will not be overwritten.\n\
        -v   Write compression statistics.\n\
-       -V   Output vesion and compile options.\n\
+       -V   Output version and compile options.\n\
        -r   Recursive. If a filename is a directory, descend\n\
             into it and compress everything in it.\n");
