
pixz's Introduction

pixz


Pixz (pronounced pixie) is a parallel, indexing version of xz.

Repository: https://github.com/vasi/pixz

Downloads: https://github.com/vasi/pixz/releases

pixz vs xz

The existing XZ Utils provide great compression in the .xz file format, but they produce just one big block of compressed data. Pixz instead produces a collection of smaller blocks which makes random access to the original data possible. This is especially useful for large tarballs.

Differences to xz

  • pixz automatically indexes tarballs during compression (unless the -t argument is used)
  • pixz supports parallel decompression, which xz does not
  • pixz defaults to using all available CPU cores, while xz defaults to using only one core
  • pixz provides -i and -o command line options to specify the input and output files
  • pixz does not need the command line option -z (or --compress). Instead, it compresses by default, and decompresses if -d is passed.
  • pixz uses different logic to decide whether to use stdin/stdout. pixz somefile will always output to another file, while pixz with no filenames will always use stdin/stdout. There's no -c argument to explicitly request stdout.
  • Some other flags mean different things for pixz and xz, including -f, -l, -q and -t. Please read the manpages for more detail on these.

Building pixz

General help for the configure step of the build is available via:

./configure --help

Dependencies

  • pthreads
  • liblzma 4.999.9-beta-212 or later (from the xz distribution)
  • libarchive 2.8 or later
  • AsciiDoc to generate the man page

Build from Release Tarball

./configure
make
make install

You may need sudo permissions to run make install.

Build from GitHub

git clone https://github.com/vasi/pixz.git
cd pixz
./autogen.sh
./configure
make
make install

You may need sudo permissions to run make install.

Usage

Single Files

Compress a single file (no tarball, just compression), multi-core:

pixz bar bar.xz

Decompress it, multi-core:

pixz -d bar.xz bar

Tarballs

Compress and index a tarball, multi-core:

pixz foo.tar foo.tpxz

Very quickly list the contents of the compressed tarball:

pixz -l foo.tpxz

Decompress the tarball, multi-core:

pixz -d foo.tpxz foo.tar

Very quickly extract a single file, multi-core, also verifies that contents match index:

pixz -x dir/file < foo.tpxz | tar x

Create a tarball using pixz for multi-core compression:

tar -Ipixz -cf foo.tpxz foo/

Specifying Input and Output

These are equivalent (the same forms also work with -x, -d and -l):

pixz foo.tar foo.tpxz
pixz < foo.tar > foo.tpxz
pixz -i foo.tar -o foo.tpxz

Extract the files from foo.tpxz into foo.tar:

pixz -x -i foo.tpxz -o foo.tar file1 file2 ...

Compress to foo.tpxz, removing the original:

pixz foo.tar

Extract to foo.tar, removing the original:

pixz -d foo.tpxz

Other Flags

Faster, worse compression:

pixz -1 foo.tar

Better, slower compression:

pixz -9 foo.tar

Use exactly 2 threads:

pixz -p 2 foo.tar

Compress, but do not treat it as a tarball, i.e. do not index it:

pixz -t foo.tar

Decompress, but do not check that contents match index:

pixz -d -t foo.tpxz

List the xz blocks instead of files:

pixz -l -t foo.tpxz

For even more tuning flags, check the manual page:

man pixz

Comparison to other Tools

plzip

  • about equally complex and efficient
  • lzip format seems less-used
  • version 1 is theoretically indexable, I think

ChopZip

  • written in Python, much simpler
  • more flexible, supports arbitrary compression programs
  • uses streams instead of blocks, not indexable
  • splits input and then combines output, much higher disk usage

pxz

  • simpler code
  • uses OpenMP instead of pthreads
  • uses streams instead of blocks, not indexable
  • uses temporary files and does not combine them until the whole file is compressed, high disk and memory usage

pbzip2

  • not indexable
  • appears slow
  • bzip2 algorithm is non-ideal

pigz

  • not indexable

dictzip, idzip

  • not parallel

pixz's People

Contributors

clandmeter, ffontaine, kit-ty-kate, kraj, mattst88, pkern, redfoxymoon, silwol, spinus, steini2000, teran, usefulcat, vasi, vasi-stripe, wookietreiber, wsldankers



pixz's Issues

pixz deadlocks on some input files

I used xz 5.2.1 w/ -9 --threads=0 on a large input file. When I run pixz -d to decompress it, pixz appears to deadlock (and never terminates).

The backtrace for pixz is included below. I apologize that I cannot provide access to the problematic *.xz file, as it contains confidential information. I have successfully decompressed other files with pixz that were compressed with "xz -9 --threads=0", so I believe the issue is somehow data-dependent for this particular file.

Other than supplying access to the problematic *.xz file can I do anything to help get the required information for debugging?

Thanks and sorry for the vague bug report.

(gdb) thread apply all bt

Thread 6 (Thread 0x7f47827f3700 (LWP 1290)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004036db in ?? ()
#2 0x0000000000403c41 in ?? ()
#3 0x00000000004029da in ?? ()
#4 0x00007f47845680a4 in start_thread (arg=0x7f47827f3700) at pthread_create.c:309
#5 0x00007f4783ad204d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7f4781ff2700 (LWP 1291)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004036db in ?? ()
#2 0x0000000000403c41 in ?? ()
#3 0x00000000004029da in ?? ()
#4 0x00007f47845680a4 in start_thread (arg=0x7f4781ff2700) at pthread_create.c:309
#5 0x00007f4783ad204d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7f47817f1700 (LWP 1292)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004036db in ?? ()
#2 0x0000000000403c41 in ?? ()
#3 0x00000000004029da in ?? ()
#4 0x00007f47845680a4 in start_thread (arg=0x7f47817f1700) at pthread_create.c:309
#5 0x00007f4783ad204d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7f4780ff0700 (LWP 1293)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004036db in ?? ()
#2 0x0000000000403c41 in ?? ()
#3 0x00000000004029da in ?? ()
#4 0x00007f47845680a4 in start_thread (arg=0x7f4780ff0700) at pthread_create.c:309
#5 0x00007f4783ad204d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f47807ef700 (LWP 1294)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00000000004036db in ?? ()
#2 0x0000000000404379 in ?? ()
#3 0x00000000004045b0 in ?? ()
#4 0x00000000004029ba in ?? ()
#5 0x00007f47845680a4 in start_thread (arg=0x7f47807ef700) at pthread_create.c:309
#6 0x00007f4783ad204d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f4784971700 (LWP 1289)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185

#1 0x00000000004036db in ?? ()
#2 0x0000000000403ad1 in ?? ()
#3 0x0000000000404b33 in ?? ()
#4 0x00000000004026aa in ?? ()
#5 0x00007f4783a0db45 in __libc_start_main (main=0x4023b0, argc=3, argv=0x7ffce811f178, init=, fini=, rtld_fini=, stack_end=0x7ffce811f168) at libc-start.c:287
#6 0x0000000000402805 in ?? ()

Can we use the index to mount the archive?

So this index.. can we use it for neat tricks like mounting the archive (read-only) without extracting everything first?

For example to use FreeFileSync or some other dir-diff tool?

stdin and stdout

Hi vasi,
Great program, but could you please add options for processing stdin / stdout ? I need to pipe input in from a program and pipe the output to another program.
I need a fast multithreaded decompressor for lzma/xz like pigz !
Many thanks.

Warn on tarball truncation

In tarballs, 1024 zero bytes at the start of a header indicates End Of Archive. Since pixz always tries to interpret its input as tar formatted, it will erroneously truncate its input if it finds this sequence. Users who don't know better lose their data!

There are two distinct cases:

  1. The input starts with EOA, and contains more data afterwards. This is almost certainly not a tarball, and should be interpreted as non-tarball data with no warning.
  2. The input is a non-empty tarball, but contains data after EOA. This could occur due to user error (eg: concatenated tar files with 'cat') or because some program is storing useful data after EOA. There are several reasonable courses of action to take:
    • a: Truncate the file at EOA. This loses any following data, but the pixz indexing works fine.
    • b: Continue compressing data after EOA, but turn off pixz indexing. This preserves all data, but loses the advantages of indexing, like fast listing and extraction.
    • c: Continue compressing after EOA, and leave indexing on. This preserves all data. However, if the archive is decompressed with plain xz, whatever other program puts data after EOA may be confused by the output.

Case 1 should be implemented with high priority. Case 2 is less common, but perhaps there should at least be a warning until I decide what to do.
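A minimal sketch of the case-1 check, assuming the EOA marker is simply 1024 zero bytes at the start of the input (illustrative shell, not pixz's actual C code):

```shell
# Build a reference EOA marker (1024 zero bytes) and a sample input
head -c 1024 /dev/zero > eoa.bin
{ head -c 1024 /dev/zero; echo trailing-data; } > input.bin
# If the input's first 1024 bytes are all zero, it is almost certainly
# not a tarball: compress it as plain data, with no warning (case 1)
if head -c 1024 input.bin | cmp -s - eoa.bin; then
    echo "input starts with EOA: treat as non-tarball data"
fi
```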

htole64 / le64toh on old glibc

I have been using and providing production rpms for CentOS / RHEL 5-7 for a while. On 5 and 6, glibc is too old and does not provide certain macros used in src/endian.c. I am patching in the following:

# if __BYTE_ORDER == __LITTLE_ENDIAN
# define htole64(x) (x)
# define le64toh(x) (x)
# else
# define htole64(x) __bswap_64 (x)
# define le64toh(x) __bswap_64 (x)
# endif

Suggestion: use autoconf's AC_CHECK_DECLS and conditionally define the above if the macros are not found (as a macro or function) in the target environment.
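A one-line configure.ac sketch of that suggestion (hypothetical; AC_CHECK_DECLS defines HAVE_DECL_HTOLE64 / HAVE_DECL_LE64TOH as 0 or 1 in config.h, which src/endian.c could test before falling back to the macros above):

```m4
AC_CHECK_DECLS([htole64, le64toh], [], [], [[#include <endian.h>]])
```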

Compressing directly from tar and specifying compression level?

This is a documentation question. I'm posting it here as this seems to serve as a support forum of sorts for pixz.

How would one compress directly instream from tar, while specifying the compression level?

I would like to do something like this analogous operation with pxz:
"nice -19 tar Oc foobar_directory | nice -19 pxz -9 -cv - > foobar.tar.xz"

Given the current documentation for pixz it isn't clear how to specify the compression level while doing this:
(tar -Ipixz -cf foo.tpxz foo_directory # Create a tarball using pixz for multi-core compression)

  • With xz this can be solved by setting the environment variable to maximum compression as default (eg "export XZ_OPT=-9e") but it isn't clear to me if similar variables exist for pixz that I can set.
  • Or maybe is there a way to pass the -9 to pixz through the "tar -Ipixz" that I'm unaware of?

Thanks in advance for any suggestions!
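One approach that may work: GNU tar 1.27 and newer lets -I / --use-compress-program take a quoted command including arguments. This hypothetical demo uses xz so it runs anywhere; pixz accepts the same -9 flag, so 'pixz -9' should work identically here:

```shell
# Hypothetical demo directory
mkdir -p foobar_directory
echo hello > foobar_directory/file.txt
# Quote the compressor and its flags as a single -I argument
tar -I 'xz -9' -cf foobar.tar.xz foobar_directory
# On extraction, tar invokes the compress program with -d automatically
tar -I xz -xOf foobar.tar.xz foobar_directory/file.txt
```

With an older tar that only accepts a bare program name, a small wrapper script that execs `pixz -9 "$@"` serves the same purpose.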

feature request: support indices other than tar

It would be nice to be able to supply a list of positions in the decompressed file at which index entries would be created. This would allow indexing things other than tar (for instance, Wikimedia dump .xml files).
I see this as an option such as --indices=<filepos.txt> that would read byte positions from the file filepos.txt. Maybe some way to name the positions, too. Then during extraction you would also indicate the byte position (or position name) at which extraction should start.

-f option results in invalid index

When I create a tpxz archive as follows

$ pixz -9 -f 3 linux-3.7.4.tar linux-3.7.4.tpxz

then the resulting archive has a corrupt index:

$ pixz -x linux-3.7.4/README < linux-3.7.4.tpxz > /dev/null 
Index and archive differ as to next file: linux-3.7.4/README vs linux-3.7.4/

This seems to happen for all values of -f greater than 2. I've only tested this with the -9 compression level.

pixz | xz -d corrupts data

I checked whether the pixz and xz formats are compatible. It looks like they are. But on a large test file, I don't get the same data back as before. In particular, the output file is larger than the input file.

All bytes up to the end of the input file are identical; the output file simply has additional data. The additional data seems derived from the input file, not all zeroes or such.

$ ls -l foo
-rw-r--r-- 1 joern joern 4380151296 Apr 26 11:30 foo
$ cat foo | pixz -0 | xz -d > bar
$ ls -l bar
-rw-r--r-- 1 joern joern 4384062468 Apr 26 12:08 bar

pixz 1.0.2
xz (XZ Utils) 5.1.0alpha

Both from Debian, running on x86_64.

Help screen

One option pixz really needs is "--help". I'd be happy if it just printed the README file when you did --help. :)

Error encoding block when compressing /dev/urandom

When compressing data from /dev/urandom for testing, I always get this error eventually, roughly 15 seconds after starting. It doesn't reproduce with xz or pxz. I don't have the time to dig further, but it seems like a bug to me. I can reproduce it every time.

I took the latest version from the repo and used:
dd if=/dev/urandom | pixz -7 > /dev/null

I can compress without errors with -6. The option -8 raises the error too. I could upload a test file here, but it's quite big: 224 MiB. I'm using -7 because it seems a good compromise between strong compression and multi-threading.

Show progress bar with `pv` and `pixz`

Hi,
Thanks for making pixz, I use it for all my backups and it works great! I especially appreciate the indexing feature.

I would like to display a progress bar during my backups and I found a couple of suggestions in the answers to this SO question. For me, both suggestions work with gzip, pigz and xz, but not with pixz, pxz or when specifying multiple cores to xz (xz -T<num>).

I just wanted to check if you knew right off the bat of any quick modification I could make to render that command usable with pixz or if it is not possible due to how pixz differs from the other compression utilities mentioned above.

For reference, the two suggestions from SO are

SIZE=`du -sk folder-with-big-files | cut -f 1`
tar cvf - folder-with-big-files | pv -p -s ${SIZE}k | pigz -c > big-files.tar.gz

and

tar cf - /folder-with-big-files -P | pv -s $(du -sb /folder-with-big-files | awk '{print $1}') | pigz > big-files.tar.gz

Static Compiling for Scientific Linux 6.5

I'd like to compile pixz for use on a Scientific Linux 6.5 OS. It seems that the xz development library is too old to work with the current version of pixz. I get the below error:

$ git clone https://github.com/vasi/pixz.git
$ cd pixz
$ make
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o pixz.o pixz.c
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o common.o common.c
common.c: In function ‘find_file_index’:
common.c:119: error: ‘lzma_index_iter’ undeclared (first use in this function)
common.c:119: error: (Each undeclared identifier is reported only once
common.c:119: error: for each function it appears in.)
common.c:119: error: expected ‘;’ before ‘iter’
common.c:120: warning: implicit declaration of function ‘lzma_index_iter_init’
common.c:120: error: ‘iter’ undeclared (first use in this function)
common.c:122: warning: implicit declaration of function ‘lzma_index_iter_locate’
common.c: In function ‘next_index’:
common.c:329: warning: implicit declaration of function ‘lzma_index_stream_flags’
common.c:331: warning: implicit declaration of function ‘lzma_index_stream_padding’
common.c: In function ‘decode_index’:
common.c:344: error: too few arguments to function ‘lzma_index_cat’
make: *** [common.o] Error 1

I have the following versions of the xz-devel and libarchive-devel packages:

$ yum info xz-devel
Loaded plugins: refresh-packagekit, security
Installed Packages
Name        : xz-devel
Arch        : x86_64
Version     : 4.999.9
Release     : 0.3.beta.20091007git.el6
Size        : 145 k
Repo        : installed
From repo   : sl
Summary     : Devel libraries & headers for liblzma
URL         : http://tukaani.org/xz/
License     : LGPLv2+
Description : Devel libraries and headers for liblzma.


$ yum info libarchive-devel
Loaded plugins: refresh-packagekit, security
Installed Packages
Name        : libarchive-devel
Arch        : x86_64
Version     : 2.8.3
Release     : 4.el6_2
Size        : 97 k
Repo        : installed
From repo   : sl
Summary     : Development files for libarchive
URL         : http://code.google.com/p/libarchive/
License     : BSD
Description : The libarchive-devel package contains libraries and header files for
            : developing applications that use libarchive.

Is it possible to grab the xz and libarchive source (maybe a specific compatible version) and statically compile pixz?

I found a reference to static compiling pixz here: http://lists.opensuse.org/archive/opensuse-commit/2013-12/msg01104.html. Maybe I can do something similar.

pixz -l doesn't display files and folders that start with "._"

I'm creating .tar.xz files using pixz and when I list the tarball contents using pixz -l, it doesn't show files and folders that start with "._" (without quotes).

Here's an example.

$ tar -Ipixz -tf test_folder.tar.xz
test_folder.sha512sum
test_folder/
test_folder/regular_file_1
test_folder/._hidden_file_0
test_folder/.hidden_file_1
test_folder/._hidden_folder_0/
test_folder/._hidden_folder_0/file.ext
test_folder/.hidden_folder_1/
test_folder/.hidden_folder_1/file.ext
test_folder/_regular_file_0

$ pixz -l test_folder.tar.xz
test_folder.sha512sum
test_folder/
test_folder/regular_file_1
test_folder/.hidden_file_1
test_folder/._hidden_folder_0/file.ext
test_folder/.hidden_folder_1/
test_folder/.hidden_folder_1/file.ext
test_folder/_regular_file_0

The missing file and folder are:

test_folder/._hidden_file_0
test_folder/._hidden_folder_0/

I'm using the current master branch.

pixz reports block encoder error instead of memory allocation error

Compiling under the latest Cygwin x32 and attempting to compress some files resulted in the following error:

"error creating block encoder"

The command line options I used were:

pixz -9 < file.tar > file.tar.pxz

Increasing the available memory to the Cygwin program alleviated the issue. It took some head scratching.

Support other compressors

Gzip and LZ4 should be doable, since they have fields for extra data. It might be possible to also do tarballs with bzip2 and lzop, by exploiting the End Of Archive marker.

Link fails in some Linux environments

On Linux Mint 15 (which is essentially Ubuntu 13.04), the application will not link (and it gives many -Wdeprecated-declarations warnings).

I was able to fix linking by adding "-lm" to the end of LIBADD in the makefile, line 15.

LIBADD = $(THREADS) -llzma -larchive -lm

The full output of attempting to build with the existing makefile follows:

gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o pixz.o pixz.c
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o common.o common.c
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o endian.o endian.c
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o cpu.o cpu.c
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o read.o read.c
read.c: In function ‘pixz_read’:
read.c:118:9: warning: ‘archive_read_support_compression_none’ is deprecated (declared at /usr/include/archive.h:323) [-Wdeprecated-declarations]
read.c:153:3: warning: ‘archive_read_finish’ is deprecated (declared at /usr/include/archive.h:585) [-Wdeprecated-declarations]
read.c: In function ‘taste_tar’:
read.c:667:5: warning: ‘archive_read_support_compression_none’ is deprecated (declared at /usr/include/archive.h:323) [-Wdeprecated-declarations]
read.c:672:2: warning: ‘archive_read_finish’ is deprecated (declared at /usr/include/archive.h:585) [-Wdeprecated-declarations]
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o write.o write.c
write.c: In function ‘read_thread’:
write.c:138:6: warning: ‘archive_read_support_compression_none’ is deprecated (declared at /usr/include/archive.h:323) [-Wdeprecated-declarations]
write.c:162:6: warning: ‘archive_read_finish’ is deprecated (declared at /usr/include/archive.h:585) [-Wdeprecated-declarations]
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'  -c -o list.o list.c
gcc  -g -O0 -Wall  -o pixz pixz.o common.o endian.o cpu.o read.o write.o list.o -lpthread -llzma -larchive
/usr/bin/ld: common.o: undefined reference to symbol 'ceil@@GLIBC_2.2.5'
/usr/bin/ld: note: 'ceil@@GLIBC_2.2.5' is defined in DSO /lib/x86_64-linux-gnu/libm.so.6 so try adding it to the linker command line
/lib/x86_64-linux-gnu/libm.so.6: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make: *** [pixz] Error 1

Better compression options

When compressing the notmuch email corpus ( http://tesseract.cs.unb.ca/notmuch/ ), pixz has a notably worse compression ratio than xz or pxz:

2657 MB original
348 MB xz
369 MB pxz
418 MB pixz

The reason is that pxz uses a much larger block-size than pixz. Where we use dict_size * 1.0, they use dict_size * 3. If we use similar block sizes, we get similar compression ratios, though it costs us a lot of memory usage.

We should allow the user to select important compression parameters like this. Useful options include:

  • block-size as a multiple of dict_size, minimum should probably be 1.0
  • number of blocks to keep in memory, minimum of #threads + 1. Could potentially do just #threads with some trickery, but would lose some efficiency
  • xz's 'extreme' compression
  • Option to put compressed blocks in filesystem to save memory?

Related places we might want to optimize:

  • Detect the available memory, and auto-choose useful parameters
  • Be more careful about unused memory. Free blocks between when they're done and when we reuse them? Maybe just do this under memory pressure?
  • Split the last block, so we don't end up with just one thread at the end. Only important for small file-size / block-size ratio.

Byte-based random access

I have not found the support forum. If there is one feel free to point me to it.

I have a binary file with variable length records. I can compute the start position of each record. How can I generate an index of the file based on these positions? How can I extract records 1000..1007,2000..2030?

If pixz is exited mid-compression, the whole archive is useless.

If pixz is interrupted while compressing, before it finishes naturally, the whole resulting archive is useless.

For example, take the issue I ran into today. I had pixz compress a massive amount of data, but the computer crashed before it completed. I now have a 113GB pixz file that pixz refuses to extract. pixz fails with the error "Error reading stream footer" when I try to extract the file.

I know that if pixz is exited mid-compression, then the last file that pixz was compressing is lost, but pixz should be able to extract all of the files before that.

This is a major issue, as it makes pixz useless for compressing a large amount of data. It's a shame, because pixz's parallel nature speeds up this kind of workload immensely.

keep original

It would be nice to have a -k/--keep option (like bzip2). The original file is removed after compression/decompression, but that is not always the desired behaviour. Of course, the pipe method can be used, but then a filename has to be given explicitly; -k is just shorter. Please consider adding this.

Build without libarchive

If the user can't get libarchive, just stub it out. But this means the command-line behaviour will differ: they won't get indexing. Is that OK?

unable to quickly decompress from an archive created with -9 -f 3.0

With pixz 1.0.2-2 from Debian, running pixz -x files/file00309491 -i files.tpxz9 results in "Index and archive differ as to next file: files/file00309491 vs files/file00246065".

files.tpxz9 was created with pixz -9 -f 3.0 files.tar files.tpxz9 and is available at the following
magnet link:

magnet:?xl=136933688&dn=files.tpxz9&xt=urn:btih:utilil77ywy4hwmd2r5slxpwkfuy3bo2&tr=udp://tracker.openbittorrent.com:80&tr=udp://open.demonii.com:1337&tr=udp://tracker.coppersurfer.tk:6969&tr=udp://tracker.leechers-paradise.org:6969

(131MB) (sha256 = 2a1cd58008bd527e5d6b8256de796e8368019c2877ed4548afc163d4ede9e614) (more info on it below)

Freshly compiled pixz from git master (936d806) with debug enabled outputs this instead:

$ ~/git/pixz/I/bin/pixz -x files/file00309491 -i ../files.tpxz9
want: files/file00309491
read: skip 1
read: skip 2
read: skip 3
read: skip 4
read: want 5
tar want: files/file00309491
tar off = 185420288, size = 18446744073525179904
(null)
Error reading archive entry
$ 

Slow decompression of the whole tar archive and then instructing tar to output the file seems to work OK.

files.tar was created from a Wikimedia dump at https://dumps.wikimedia.org/simplewiki/20160203/simplewiki-20160203-pages-meta-current.xml.bz2 with this command line:

$ mkdir files
$ bzcat ../simplewiki-20160203-pages-meta-current.xml.bz2 | ../splitter.pl <(bzgrep -b '<page>' ../simplewiki-20160203-pages-meta-current.xml.bz2  | sed -e 's/: .*$//')
$ tar cf files.tar files

The splitter.pl script is as below:

#! /usr/bin/perl

use strict;
use warnings;

my $splitf;
open $splitf, "<", $ARGV[0];

my $cnt = 0;
my $filecnt = 0;
my $done = 0;
my $buf;

while(!$done) {
        my $mark = <$splitf>;
        if($mark) {
                my $len = read(STDIN, $buf, $mark - $cnt);
                my $outf;
                open $outf, ">", sprintf("file%08d", $filecnt++);
                print $outf $buf;
                close $outf;
                $cnt += $len;
        } else {
                $done = 1;
        }
}

file permission not preserved when compressing/decompressing

When compressing/decompressing, the new file is created according to the umask instead of with the original file's permissions. I think this is a bug (what do you think?).

Case study:

# ls -lh
total 41M
-rw------- 1 root root  40M Mar 28 01:20 backup-20130328.sql
# pixz backup-20130328.sql 
# ls -lh backup-20130328.sql.xz 
-rw-r--r-- 1 root root 2.0M Mar 28 01:26 backup-20130328.sql.xz
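Until this is addressed, one workaround (a sketch assuming GNU coreutils stat/chmod; the filenames are stand-ins for the case study above) is to record the original's mode before compressing and restore it on the output, since pixz removes the original afterwards:

```shell
# Simulate the case study: a 0600 original file
touch backup.sql && chmod 600 backup.sql
# Record the mode before compressing (GNU stat)
mode=$(stat -c %a backup.sql)
# Stand-in for the file pixz would create with the umask applied
touch backup.sql.xz
# Restore the recorded permission bits on the output
chmod "$mode" backup.sql.xz
```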

Build failure on Ubuntu 10.04

What am I missing or doing wrong?

[~/downloads/pixz-1.0.2]# make
gcc  -g -O0 -std=gnu99 -Wall -Wno-unknown-pragmas -DPIXZ_VERSION='"1.0.2"'      -c -o common.o common.c
common.c: In function ‘find_file_index’:
common.c:119: error: ‘lzma_index_iter’ undeclared (first use in this function)
common.c:119: error: (Each undeclared identifier is reported only once
common.c:119: error: for each function it appears in.)
common.c:119: error: expected ‘;’ before ‘iter’
common.c:120: warning: implicit declaration of function ‘lzma_index_iter_init’
common.c:120: error: ‘iter’ undeclared (first use in this function)
common.c:122: warning: implicit declaration of function ‘lzma_index_iter_locate’
common.c: In function ‘next_index’:
common.c:329: warning: implicit declaration of function ‘lzma_index_stream_flags’
common.c:331: warning: implicit declaration of function     ‘lzma_index_stream_padding’
common.c: In function ‘decode_index’:
common.c:344: error: too few arguments to function ‘lzma_index_cat’
make: *** [common.o] Error 1

can not seek in input: Illegal seek

When trying to use pixz with clonezilla to check image, the following command is executed:

cat /home/partimag/cust/sdb2.ext4-ptcl-img.xz.aa |
pixz -d |
LC_ALL=C partclone.chkimg -N -s -

Using a stream shows the following stderr output, which is not really user friendly and might make the user think that the integrity of the file is not OK:

"can not seek in input: Illegal seek"

This warning seems to come from the decode_index function. In any case there is no failure, and pixz behaves correctly even when the input is a stream. Is it possible to remove this warning from decode_index?

release early / often, git branching model, CONTRIBUTING.md, (semantic) versioning

You may have noticed that I have pushed the last few releases quite fast, and each included only a single hotfix. I like to release early and release often.

To support this workflow better, I would like to push for a specific git branching model; this one is a good starting point. There are other models out there, as well as refinements to the above-mentioned one, but it should serve well. I would also put this information in a CONTRIBUTING.md file in the repository, e.g. stating that new features must be contributed to the develop branch (see the model description). Contributors will see a link to that file when they start to submit a PR (see also here), so they will know about these contribution guidelines.

With the hotfix and release-early-release-often mindset, I would also like to bring in semantic versioning: hotfixes increase the patch level, new features the minor version, incompatible changes the major version, etc. (read the link for more information). Example: if we add more command line options, e.g. for #19, we should increase the minor version; if we change or remove a command line option, we should increase the major version. (Actually, thinking about it, v1.0.3 should have been 1.1.0 ;) you see what I mean.)

I would like to get your opinion on these matters, especially from @vasi. If you have any questions about the topics above, feel free to ask. These topics are all obvious to me but might not to you, so if I left something out which should have been explained or reasoned about better, let me know.

Feature Request: Support --verbose option from xz

The xz command from xz-utils provides an option called --verbose that prints information on progress, compression ratio and estimated remaining time. When stdout is redirected, it instead prints a summary of the total compression when finished. It would be great if pixz could do the same.

linux build fixes

The following modification allows pixz to build on Linux. Not tested at runtime yet. Please consider extending the Makefile with these changes.

diff --git a/Makefile b/Makefile
index 48db5c7..470cc52 100644
--- a/Makefile
+++ b/Makefile
@@ -2,7 +2,7 @@ ifneq ($(shell gcc -v 2>&1 | grep 'Apple Inc'),)
        APPLE=1
 endif

-LIBPREFIX = /Library/Fink/sl64 /opt/local
+LIBPREFIX = /usr /
 
 ifdef APPLE
 ifeq ($(CC),gcc)
 LDFLAGS += -search_paths_first
@@ -26,7 +26,7 @@ all: $(PROGS)
 	$(COMPILE) $@ $<
 
 $(PROGS): %: %.o $(COMMON)
-	$(LD) $@ $^ -llzma -larchive
+	$(LD) $@ $^ -llzma -larchive -lpthread
 
 clean:
 	rm -f *.o $(PROGS)

Build instructions mention non-existent `configure` script

The build instructions list ./configure as the first step after downloading the release tarball, but the 1.0.3 release tarball does not contain a configure script.

(I'm in the process of submitting v1.0.3 to Homebrew, but I can't do that until I can reliably determine the proper installation steps for Mac OS X.)

Might someone point me in the right direction?

Feature Request - compatible "-c" / pipe output

Hi,

I'm trying to replace pigz with pixz in my scripts, but many of them need the "-c" option, which redirects the output to a pipe.
Unfortunately this isn't supported by pixz, which makes it much harder to handle big files with it.
I've tried "-o -", which is common among certain GNU commands for redirecting to stdout, but it doesn't work.

Simplified, the command I need to work is something like this:
$ cat dummy.file | pixz -c > dummy.file.xz
(This is a very simple example of how it needs to work; I need to use it this way in much more complex situations.)

pixz doesn't check for out-of-memory

cppcheck complains as follows

[common.c:215]: (error) Common realloc mistake: 'gFileIndexBuf' nulled but not freed upon failure

I didn't follow the code outside that function, but just looking at that line it does look like that could be a problem if realloc returns NULL.

pixz 1.0.3 build failure

In the future, I'm happy to test things before tags too ;-)

Build failure on amd64

>>> Unpacking source...
>>> Unpacking v1.0.3.tar.gz to /var/tmp/portage/app-arch/pixz-1.0.3/work
>>> Source unpacked in /var/tmp/portage/app-arch/pixz-1.0.3/work
>>> Preparing source in /var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3 ...
 * Running eautoreconf in '/var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3' ...
 * Running aclocal -I m4 ...                                                                                                                                                           [ ok ]
 * Running autoconf --force ...                                                                                                                                                        [ ok ]
 * Running autoheader ...                                                                                                                                                              [ ok ]
 * Running automake --add-missing --copy --foreign --force-missing ...                                                                                                                 [ ok ]
 * Running elibtoolize in: pixz-1.0.3/
>>> Source prepared.
>>> Configuring source in /var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3 ...
 * econf: updating pixz-1.0.3/config.sub with /usr/share/gnuconfig/config.sub
 * econf: updating pixz-1.0.3/config.guess with /usr/share/gnuconfig/config.guess
./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --disable-dependency-tracking --disable-silent-rules --libdir=/usr/lib64
configure: loading site script /usr/share/config.site
checking for a BSD-compatible install... /usr/lib/portage/python3.4/ebuild-helpers/xattr/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for style of include used by make... GNU
checking for x86_64-pc-linux-gnu-gcc... x86_64-pc-linux-gnu-gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether x86_64-pc-linux-gnu-gcc accepts -g... yes
checking for x86_64-pc-linux-gnu-gcc option to accept ISO C89... none needed
checking whether x86_64-pc-linux-gnu-gcc understands -c and -o together... yes
checking dependency style of x86_64-pc-linux-gnu-gcc... none
checking for x86_64-pc-linux-gnu-gcc option to accept ISO C99... -std=gnu99
checking for x86_64-pc-linux-gnu-gcc -std=gnu99 option to accept ISO Standard C... (cached) -std=gnu99
checking for a2x... a2x
checking for ceil in -lm... yes
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking if compiler needs -Werror to reject unknown flags... no
checking for the pthreads library -lpthreads... no
checking whether pthreads work without any flags... no
checking whether pthreads work with -Kthread... no
checking whether pthreads work with -kthread... no
checking for the pthreads library -llthread... no
checking whether pthreads work with -pthread... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking if more special flags are required for pthreads... no
checking for PTHREAD_PRIO_INHERIT... yes
checking for x86_64-pc-linux-gnu-pkg-config... /usr/bin/x86_64-pc-linux-gnu-pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for LIBARCHIVE... yes
checking for LZMA... yes
checking how to run the C preprocessor... x86_64-pc-linux-gnu-gcc -std=gnu99 -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking fcntl.h usability... yes
checking fcntl.h presence... yes
checking for fcntl.h... yes
checking for stdint.h... (cached) yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for unistd.h... (cached) yes
checking for off_t... yes
checking for size_t... yes
checking for ssize_t... yes
checking for uint16_t... yes
checking for uint32_t... yes
checking for uint64_t... yes
checking for uint8_t... yes
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible realloc... yes
checking for working strtod... yes
checking for memchr... yes
checking for memmove... yes
checking for memset... yes
checking for strerror... yes
checking for strtol... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: creating test/Makefile
config.status: creating config.h
config.status: executing depfiles commands
>>> Source configured.
>>> Compiling source in /var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3 ...
make -j1 CC=x86_64-pc-linux-gnu-gcc OPT= 
make  all-recursive
make[1]: Entering directory '/var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3'
Making all in src
make[2]: Entering directory '/var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3/src'
x86_64-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I..     -pthread -Wall -Wno-unknown-pragmas -Os -mtune=nocona -pipe -frecord-gcc-switches -march=native -c -o pixz-common.o `test -f 'common.c' || echo './'`common.c
common.c: In function ‘dump_file_index’:
common.c:63:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (file_index_t *f = gFileIndex; f != NULL; f = f->next) {
     ^
common.c:63:5: note: use option -std=c99 or -std=gnu99 to compile your code
common.c: In function ‘free_file_index’:
common.c:75:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (file_index_t *f = gFileIndex; f != NULL; ) {
     ^
common.c: In function ‘stream_padding’:
common.c:278:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (off_t pad = 0; true; pad += sizeof(uint32_t)) {
  ^
common.c: In function ‘stream_footer’:
common.c:291:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int i = sizeof(ftr) / sizeof(uint32_t) - 1; i >= 0; --i) {
  ^
common.c: In function ‘queue_free’:
common.c:375:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (queue_item_t *i = q->first; i; ) {
     ^
common.c: In function ‘pipeline_create’:
common.c:478:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (size_t i = 0; i < qsize; ++i) {
     ^
common.c:485:17: error: redefinition of ‘i’
     for (size_t i = 0; i < gPLProcessCount; ++i) {
                 ^
common.c:478:17: note: previous definition of ‘i’ was here
     for (size_t i = 0; i < qsize; ++i) {
                 ^
common.c:485:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (size_t i = 0; i < gPLProcessCount; ++i) {
     ^
common.c: In function ‘pipeline_stop’:
common.c:522:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (size_t i = 0; i < gPLProcessCount; ++i)
     ^
common.c:524:17: error: redefinition of ‘i’
     for (size_t i = 0; i < gPLProcessCount; ++i) {
                 ^
common.c:522:17: note: previous definition of ‘i’ was here
     for (size_t i = 0; i < gPLProcessCount; ++i)
                 ^
common.c:524:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (size_t i = 0; i < gPLProcessCount; ++i) {
     ^
Makefile:441: recipe for target 'pixz-common.o' failed
make[2]: *** [pixz-common.o] Error 1
make[2]: Leaving directory '/var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3/src'
Makefile:376: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/var/tmp/portage/app-arch/pixz-1.0.3/work/pixz-1.0.3'
Makefile:317: recipe for target 'all' failed
make: *** [all] Error 2

distribute pixz.1 in tarball

1.0.3 introduced a build dependency on asciidoc and an XSLT processor (libxslt-tools). pixz.1 should be built at distribution time and shipped with the tarball.

unable to compress large files

I get this (from strace) when I try to compress a 4GB file on arm using pixz 1.0.2:

open("stockchroot-pwnphone-moto.img", O_RDONLY) = -1 EOVERFLOW (Value too large for defined data type)

libm underlinking

pixz-1.0.2 does not link correctly with ld.gold, because of an underlinking issue:

...
x86_64-pc-linux-gnu-gcc   -Wall -Wl,-O1 -Wl,--as-needed -o pixz pixz.o common.o endian.o cpu.o read.o write.o list.o -lpthread -llzma -larchive
common.o:common.c:function pipeline_create: error: undefined reference to 'ceil'
collect2: error: ld returned 1 exit status
make: *** [pixz] Error 1

ceil() is used in the program, but -lm is not in the linking flags. Adding -lm to LIBADD looks sufficient to fix the issue:

diff --git a/Makefile~ b/Makefile
index fe605f2..36dc27b 100644
--- a/Makefile~
+++ b/Makefile
@@ -12,7 +12,7 @@ MYCFLAGS = $(patsubst %,-I%/include,$(LIBPREFIX)) $(OPT) -std=gnu99 \
 MYLDFLAGS = $(patsubst %,-L%/lib,$(LIBPREFIX)) $(OPT) -Wall

 THREADS = -lpthread
-LIBADD = $(THREADS) -llzma -larchive
+LIBADD = $(THREADS) -llzma -larchive -lm

 CC = gcc
 COMPILE = $(CC) $(MYCFLAGS) $(CFLAGS) -c -o

Large file support on 32-bits

Reported by email. I probably need something like "-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE" on some platforms.
