sstadick / crabz
Like pigz, but rust
License: The Unlicense
Do not panic when piping the result and closing the pipe with -f bgzf.
$ crabz -p 4 -d -f bgzf test.fastq.gz | head
[2021-09-07T16:57:14Z INFO crabz] Decompressing (bgzf) with 4 threads available.
@NB501171:702:H7Y55BGXH:4:11401:19233:1053 2:N:0:CACCGCACCA+ANTGACAGTC
NNNNNNTNAAAAATGCCCTAGCCCCCTTCAGAANACAAGGCAAA
+
######/#EEAEEEE/EE//E/E////EA////#AEE/E/E/E<
@NB501171:702:H7Y55BGXH:4:11401:23376:1053 2:N:0:CACCGCACCA+ANTGACAGTC
NNNNNNTNTTATGTAACTAATGCATCTTGCCCTNATCTCTTTGC
+
######E#EEEEEEEEEEEEEE<E/AEEAE/EA#EEEEEEEE/E
@NB501171:702:H7Y55BGXH:4:11401:8502:1053 2:N:0:CACCGCACCA+AATGACAGTC
NNNNNNTNGAAGGCAGACTGCATGGCTTAATTTNAAAAATCATT
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: "SendError(..)"', ~/.cargo/registry/src/github.com-1ecc6299db9ec823/gzp-0.8.0/src/par/decompress.rs:158:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
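The panic message above points at an `unwrap()` on a failed channel send, which is what happens once the reader on the other end (here `head`) has gone away. A minimal sketch of the non-panicking alternative follows. It uses the standard library's `mpsc` channel as a stand-in, and `send_until_closed` is a hypothetical helper for illustration, not code from gzp or crabz.

```rust
use std::sync::mpsc;

/// Sends chunks until the receiver goes away.
/// Returns how many chunks were delivered.
fn send_until_closed(tx: &mpsc::Sender<Vec<u8>>, chunks: Vec<Vec<u8>>) -> usize {
    let mut delivered = 0;
    for chunk in chunks {
        // `send` fails only when the receiving side has been dropped,
        // for example because the downstream consumer exited early.
        if tx.send(chunk).is_err() {
            break; // stop quietly instead of calling `unwrap()`
        }
        delivered += 1;
    }
    delivered
}
```

Treating a closed receiver as a normal end of work lets the worker thread exit cleanly, so the process can finish without a panic when the pipe is closed early.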
pigz -n -b 64 -k -p 31 -11 sample.file
crabz -l 9 sample.file -o sample.gz
not much but still
Regular gzipped file created instead of a bgzf one, when 1 thread is requested.
crabz -p 1 -f bgzf -o test.csv.bgzf_threads_1.gz test.csv
crabz -p 2 -f bgzf -o test.csv.bgzf_threads_2.gz test.csv
$ file test.csv.bgzf_threads_1.gz test.csv.bgzf_threads_2.gz
test.csv.bgzf_threads_1.gz: gzip compressed data
test.csv.bgzf_threads_2.gz: gzip compressed data, extra field
# Try to decompress the files
$ crabz -p 1 -d -f bgzf test.csv.bgzf_threads_1.gz | wc -l
[2021-09-14T12:22:32Z INFO crabz] Decompressing (bgzf) with 1 threads available.
Error: Invalid block header: Extra field flag not set
0
$ crabz -p 1 -d -f bgzf test.csv.bgzf_threads_2.gz | wc -l
[2021-09-14T12:23:18Z INFO crabz] Decompressing (bgzf) with 1 threads available.
100000000
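The `file` output above differs only in "extra field", which matches how BGZF marks its blocks: each block is a gzip member with the FEXTRA flag set and a `BC` extra subfield holding the block size. A small sketch of a check for that marker, assuming the header layout from RFC 1952 and the BGZF description in the SAM specification; `is_bgzf` is a hypothetical helper and not part of crabz.

```rust
/// Returns true if `buf` starts with a gzip member header that carries
/// the BGZF "BC" extra subfield.
fn is_bgzf(buf: &[u8]) -> bool {
    // Fixed 10-byte gzip header plus the 2-byte XLEN field.
    if buf.len() < 12 || buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8 {
        return false;
    }
    // FLG bit 2 (0x04) is FEXTRA; a plain gzip file usually leaves it unset.
    if buf[3] & 0x04 == 0 {
        return false;
    }
    let xlen = u16::from_le_bytes([buf[10], buf[11]]) as usize;
    let end = (12 + xlen).min(buf.len());
    let mut extra = &buf[12..end];
    // Walk the subfields: SI1, SI2, SLEN (u16, little endian), payload.
    while extra.len() >= 4 {
        let slen = u16::from_le_bytes([extra[2], extra[3]]) as usize;
        if extra[0] == b'B' && extra[1] == b'C' && slen == 2 {
            return true;
        }
        if extra.len() < 4 + slen {
            break;
        }
        extra = &extra[4 + slen..];
    }
    false
}
```

Run against the first bytes of each output file, a check like this would distinguish the single-threaded result (plain gzip) from the two-threaded one (BGZF).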
I wanted to pipe the output from crabz to another program, and noticed it always adds a header to the output.
It would be nice for crabz to detect if it's not running interactively, or at least have the --quiet flag.
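One way the request could be approached, as a rough sketch: only print the informational line when no quiet flag is given and the log stream is attached to a terminal. This assumes the line is written to stderr and uses `std::io::IsTerminal` (Rust 1.70 or later). The names `should_log` and `banner_enabled` are hypothetical, and the `quiet` parameter stands for the requested `--quiet` flag.

```rust
use std::io::IsTerminal;

/// Pure decision: log only when not silenced and a terminal is attached.
fn should_log(quiet: bool, stderr_is_terminal: bool) -> bool {
    !quiet && stderr_is_terminal
}

/// Same decision, using the real state of this process's stderr.
fn banner_enabled(quiet: bool) -> bool {
    should_log(quiet, std::io::stderr().is_terminal())
}
```

Keeping the decision in a pure function makes it easy to test without a real terminal. Whether to probe stderr or stdout is a design choice left open here.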
Related to #13.
Impossible you say? See: http://www.hicomb.org/papers/HICOMB2019-07.pdf
When doing parallel compression of a gzip file, crabz uses ZBuilder which instantiates a ParCompress when num_threads > 1.
However, decompression always uses single-threaded MultiGzDecoder. Why is it not using ParDecompress when num_threads > 1?
I had a build error on CentOS 7 because the default cmake package is an older version. The steps below get crabz built and installed successfully on CentOS 7.
dnf install cmake3
alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake 10 --slave /usr/local/bin/ctest ctest /usr/bin/ctest --slave /usr/local/bin/cpack cpack /usr/bin/cpack --slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake --family cmake
alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake3 20 --slave /usr/local/bin/ctest ctest /usr/bin/ctest3 --slave /usr/local/bin/cpack cpack /usr/bin/cpack3 --slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake3 --family cmake
cargo install crabz --force
alternatives --config cmake
I have tried to install crabz with Rust-only compressor implementations using the following command:
cargo install crabz --no-default-features --features=snap_default,deflate_rust
It fails to compile with the following error:
error[E0599]: no variant or associated item named `Zlib` found for enum `Format` in the current scope
--> /home/shnatsel/.cargo/registry/src/github.com-1ecc6299db9ec823/crabz-0.7.2/src/main.rs:281:21
|
170 | enum Format {
| ----------- variant or associated item `Zlib` not found here
...
281 | Format::Zlib => ("zz", string_set!["zz", "z", "gz"]),
| ^^^^ variant or associated item not found in `Format`
For more information about this error, try `rustc --explain E0599`.
I created a tgz using specific compression level and threads with crabz:
git clone https://github.com/sstadick/crabz.git
mv crabz blah
time tar cf - blah/ | crabz --compression-level=1 --compression-threads=6 > ./blah.tgz
mkdir testcrabz
cd testcrabz/
mv ../blah.tgz .
tar zxf ./blah.tgz
I expected success when extracting a tgz created with crabz.
Instead, I got the following error output:
gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
NOTE: This error only surfaces when extracting with tar when crabz uses compression level 1.
tar extracts as expected when crabz uses compression levels 0, 2 to 9 BUT NOT 1.
See: http://www.htslib.org/doc/bgzip.html. This should not be too hard if you keep the uncompressed and compressed byte offset of each block you are processing. Right now, I have to use bgzip to create the index.
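The bookkeeping described above can be sketched as a running sum over block sizes. This is an illustration only: `block_offsets` and `locate` are hypothetical names, the input is assumed to be a list of (compressed length, uncompressed length) pairs per block, and the sketch does not write bgzip's on-disk index format.

```rust
/// Given (compressed_len, uncompressed_len) for each block, return the
/// (compressed_offset, uncompressed_offset) at which each block starts.
fn block_offsets(blocks: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut offsets = Vec::with_capacity(blocks.len());
    let (mut compressed, mut uncompressed) = (0u64, 0u64);
    for &(clen, ulen) in blocks {
        offsets.push((compressed, uncompressed));
        compressed += clen;
        uncompressed += ulen;
    }
    offsets
}

/// Find the start offsets of the last block that begins at or before the
/// uncompressed position `pos`.
fn locate(offsets: &[(u64, u64)], pos: u64) -> Option<(u64, u64)> {
    // Number of blocks whose uncompressed start is at or before `pos`.
    let idx = offsets.partition_point(|&(_, u)| u <= pos);
    idx.checked_sub(1).map(|i| offsets[i])
}
```

With such a table, a reader can seek to the compressed offset of the block returned by `locate` and decompress only from there.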
@Shnatsel I'm moving performance tracking for crabz related things to here.
Run the same benchmarks as found here: https://github.com/zlib-ng/pigzbench with the different backends.
The zlib-ng benchmarks pretty clearly indicate that zlib-ng is the way to go as a backend, which matches what I see in benchmarks. zlib and the rust backends for flate2 perform about the same.
cargo auditable install crabz --no-default-features --features=deflate_rust
results in the following error:
error[E0599]: no variant or associated item named `Snap` found for enum `Format` in the current scope
--> /home/shnatsel/.cargo/registry/src/github.com-1ecc6299db9ec823/crabz-0.8.1/src/main.rs:264:21
|
150 | enum Format {
| ----------- variant or associated item `Snap` not found for this enum
...
264 | Format::Snap => ("sz", string_set!["sz", "snappy"]),
| ^^^^ variant or associated item not found in `Format`
For more information about this error, try `rustc --explain E0599`.
Have you tried https://github.com/intel/isa-l?
It can be installed through conda and then you get igzip. It is awesome.
ISA-L has library functions that can be called. Since crabz is already dynamically choosing zlib-ng over zlib, adding ISA-L should be a possibility. The best mix in my opinion is:
Thanks for the crabz project and have a nice day!
By default the gzip header saves a NULL-terminated filename and a timestamp. However, including them makes the output non-reproducible for the same content.
Therefore pigz, gzip and igzip all feature a --no-name flag that omits the filename and sets the timestamp in the gzip header to 0.
It would be great if crabz could have the same flag for inclusion in xopen (https://github.com/pycompression/xopen). This library speeds up Python compression by piping through external programs and, since a few releases back, always creates reproducible output by default.
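To make the effect of such a flag concrete, here is a small sketch that builds only the gzip member header as laid out in RFC 1952. `gzip_header` is a hypothetical helper for illustration; it is not how crabz or any of the tools above write their headers, and it sets the OS byte to 255 (unknown) as an assumption.

```rust
/// Builds a gzip member header: magic, CM, FLG, MTIME, XFL, OS, and an
/// optional zero-terminated original filename.
fn gzip_header(name: Option<&str>, mtime: u32) -> Vec<u8> {
    // FLG bit 3 (0x08) is FNAME: set only when a filename is stored.
    let flg: u8 = if name.is_some() { 0x08 } else { 0x00 };
    let mut header = vec![0x1f, 0x8b, 8, flg];
    header.extend_from_slice(&mtime.to_le_bytes()); // MTIME, little endian
    header.push(0); // XFL: no extra flags
    header.push(255); // OS: unknown
    if let Some(n) = name {
        header.extend_from_slice(n.as_bytes());
        header.push(0); // terminating NUL
    }
    header
}
```

With no name and an MTIME of 0, the header is the same ten bytes on every run, which is the property that makes the compressed output reproducible for identical input.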