Comments (15)
Hi, this should already work! E.g.:
pixz < data.tar > compressed.tpxz
pixz -d < compressed.tpxz > output
Is it not working for you in some way?
from pixz.
PS: Not all files can be decompressed using multiple CPUs. You will only get the speed-up if the file was originally compressed with pixz or another tool that compresses by segments.
from pixz.
Thanks for the quick reply. This is a simplified version of what I am trying to do, as I currently do it with xz:
wget -qO- http://xxx/myimage.xz | xz -c -d | ntfsclone -r --overwrite /dev/sda1
I could not get pixz to do this.
Thanks for any help
from pixz.
pixz can only work on seekable compressed data. This is due to limitations of the .xz format, not anything about pixz itself. I suppose it would be possible to detect that it's un-seekable and just revert to single-CPU mode, but there's no real advantage to using pixz over xz in this case.
In the specific case of HTTP, it would theoretically be possible to make Range requests in lieu of seeking. Maybe there's a FUSE filesystem somewhere that can "mount" an HTTP request as a file? That would be really interesting!
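A minimal sketch of that Range-request idea, in Python: a file-like object that translates seek + read into HTTP Range requests, so a seek-requiring consumer could in principle fetch just the end of a remote .xz file (where the index lives) without downloading the whole archive. The class name and structure here are hypothetical, not part of pixz, and it assumes a server that honors Range headers:

```python
import urllib.request

class HTTPRangeReader:
    """Minimal file-like object that turns seek+read into HTTP Range
    requests. Sketch only: no caching, no error handling, and it
    assumes the server honors the Range header."""

    def __init__(self, url, size=None):
        self.url = url
        self.pos = 0
        self.size = size  # pass explicitly, or fetch via a HEAD request

    def seek(self, offset, whence=0):
        if whence == 0:            # from start
            self.pos = offset
        elif whence == 1:          # relative to current position
            self.pos += offset
        elif whence == 2:          # relative to end (needs known size)
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, n):
        # Request only bytes [pos, pos+n) instead of the whole file.
        req = urllib.request.Request(
            self.url,
            headers={"Range": "bytes=%d-%d" % (self.pos, self.pos + n - 1)},
        )
        data = urllib.request.urlopen(req).read()
        self.pos += len(data)
        return data
```

Seeking is pure bookkeeping here; the network is only touched on read, which is exactly the access pattern an index lookup at the end of a .xz file would need.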
from pixz.
vasi: sorry, I am somewhat confused! [All of the below is on the same hardware, a mid-range Core2Duo.]
I am currently using pigz to multithread-decompress gzip piped from wget as above. Gigabit connection; both CPUs are loaded to about 30%, but HDD write speed is actually the limiting factor.
I have tried the above using xz (5.1.1) with a 'normal' xz image, and that also works from wget, but both CPU cores only load to 40-60%, the HDD isn't limiting, and it takes about 4x longer. I don't understand what is limiting in this case, nor do I understand why both cores are loaded, as I understand xz 5.1.1 only multithreads compression, not decompression.
As it works, I assume a normal xz image doesn't need to be seekable.
Are you saying pixz creates an image that needs seeking, whereas normal xz is streamable?
Thanks!
from pixz.
Hmm, an image created with pixz streams OK using xz, decompressing through wget.
from pixz.
Ok, first I'll explain the CPU usage thing, though it's a bit extraneous. Basically, when something is single-threaded, it can only use one CPU at a time, but which CPU it's using at any moment is arbitrary; it's up to the operating system. So imagine a usage pattern like this:
Time    CPU 1   CPU 2
0ms     xz      free
10ms    free    xz
20ms    xz      free
30ms    free    xz
40ms    xz      free
etc...
Your CPU usage monitor probably only measures usage once a second, so it will see this as 500ms on CPU1, and 500ms on CPU2, and will show you 50% on each CPU.
Ok, now back to xz. Data compressed by pixz is entirely compatible with the .xz format. It can be decompressed by xz, and doesn't require seeking to decompress. You can totally do 'pixz < some.data | xz -cd > thesame.data'.
What does require seeking is multi-threaded decompression. If a .xz file (from xz or pixz) contains multiple 'segments' that can be decompressed in parallel, information about those segments is usually at the end of the file. So without seeking, those segments can't be found, and decompression has to be single-threaded.
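That single-threaded streaming path is easy to demonstrate with Python's lzma module (bindings to liblzma, the same library pixz uses): data in the .xz container decompresses fine when fed to the decoder in small forward-only chunks, with no seeking at all. A sketch, not pixz code:

```python
import io
import lzma

data = b"some repetitive payload " * 4096

# Compress into the .xz container (what both xz and pixz produce).
compressed = lzma.compress(data, format=lzma.FORMAT_XZ)

# Decompress strictly as a forward-only stream: feed small chunks to
# one decoder, never seeking. This is the single-threaded fallback.
src = io.BytesIO(compressed)
dec = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
out = bytearray()
while not dec.eof:
    chunk = src.read(512)          # tiny reads, as a pipe would give us
    out.extend(dec.decompress(chunk))

assert bytes(out) == data
```

What this can't do is hand different segments to different CPUs: without the index (or per-block sizes), there's no way to know where one block's compressed bytes end and the next begin until the decoder gets there.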
I realize now that it's theoretically possible to create a .xz file with multiple segments, and with segment information stored inline. This could in fact be both streamed and parallelized. Unfortunately, it's a fair bit more difficult to create files like this.
from pixz.
Sorry, you are quite right about CPU/threads, now that I looked at top more carefully while it was running! That was stupid of me.
I guess I will have to stick with pigz then, as it's so much faster at decompressing :(
(Faster than just gzip; now I see it's also single-threaded, but the docs say it does other things with separate threads.)
My image shrank a lot with xz: 3.5 GB (gzip) to 2.6 GB (xz). I like your inline idea; I guess you would have a unique feature, and it would speed up decompression greatly, as most people have at least 2 cores nowadays?
Do you know of anything that's currently multithreaded for decompression and compresses better than .gz?
from pixz.
Maybe threadzip? I'll see if I can modify pixz to produce files that can be streamed and decompressed in parallel, but it won't be right away.
from pixz.
Any joy updating for multithreaded decompression?
from pixz.
Alright, so I'm just documenting what needs to happen to make this work:
For parallel decompression, we have to be able to split the compressed data. There are two cases that allow this:
- An xz index is present at the end of the file. This requires seekability.
- Blocks contain a "compressed size" field. This allows streaming. Currently pixz compresses files without adding "compressed size" to blocks, but xz 5.1 alpha does the right thing.
When decompressing, we may encounter the following cases:
- An index is accessible: Seekable files produced by current pixz or xz 5.1. We should try to use the file-index if present for fast listing/extraction. Decompression should be done in parallel.
- No index is accessible, but blocks have "compressed size" fields: Streaming files produced by xz 5.1 or future pixz. We can't access the file-index, and should warn if the user requests filtered extraction. Decompression should be done in parallel.
- No index is accessible, and blocks have no "compressed size": Streaming files produced by current pixz, and all files produced by xz 5.0.x and earlier. We can't access the file-index, see above. We also can't split the compressed data, so we fall back to single-threaded decompression.
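Per the .xz file format spec, whether a block header carries the optional "compressed size" field is visible in its Block Flags byte (bit 0x40; bit 0x80 is "uncompressed size", bits 0x3C are reserved and must be zero). A rough sketch of the detection, assuming a well-formed single-stream file and looking only at the first block:

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"

def first_block_flags(xz_bytes):
    """Return the Block Flags byte of the first block of a .xz stream.
    Layout per the .xz spec: 12-byte stream header (magic + stream
    flags + CRC32), then the block header, whose first byte encodes
    its size and whose second byte is the flags:
      bit 0x40 = Compressed Size field present
      bit 0x80 = Uncompressed Size field present
      bits 0x3C = reserved, must be zero"""
    assert xz_bytes[:6] == XZ_MAGIC
    return xz_bytes[13]  # byte 12 is the header-size byte, 13 the flags

flags = first_block_flags(lzma.compress(b"x" * 100000))
print("compressed size stored:  ", bool(flags & 0x40))
print("uncompressed size stored:", bool(flags & 0x80))
```

A streaming decompressor would make exactly this check on each block header as it arrives, to decide between the parallel path and the single-threaded fallback.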
But we expect some weird/unfortunate occurrences:
- XZ files can contain multiple streams, e.g. if they're concatenated. We should attempt to find all indexes and combine them. This is incompatible with a pixz file-index, so we should ensure we don't attempt to use one.
- Some input may be partially parallelizable. Maybe two xz files were concatenated, one from xz 5.1 and one from xz 5.0. We should do blocks with "compressed size" in parallel, but blocks without should be done single-threaded.
- Some or all blocks may be too large for memory. (This will especially happen with xz 5.0 and earlier, since it doesn't do any splitting at all.) We must force those blocks to be decompressed single-threaded. Currently all decompression is in-memory, so we have to ensure we have a more stream-oriented way available as well.
- We might encounter an index that disagrees with "compressed size". This is definitely an error in the input, so we should at least warn the user. We could either exit with error, or attempt to continue by choosing either the index or "compressed size".
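The multiple-streams case is easy to reproduce. With Python's liblzma bindings, for instance, a one-decoder-per-stream loop handles it: each decoder stops at its stream's footer and hands the surplus bytes to the next via unused_data. A sketch, not the pixz implementation:

```python
import lzma

# Two independent .xz streams glued together, as `cat a.xz b.xz` would do.
blob = (lzma.compress(b"first stream " * 1000)
        + lzma.compress(b"second stream " * 1000))

out = bytearray()
remaining = blob
while remaining:
    # Each decoder handles exactly one stream, then reports the rest.
    dec = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    out.extend(dec.decompress(remaining))
    remaining = dec.unused_data  # bytes past this stream's footer

assert bytes(out) == b"first stream " * 1000 + b"second stream " * 1000
```

Each stream here has its own index at its own end, which is why a seeking decompressor would have to walk backwards through all of them to combine the indexes.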
The implementation plan:
- In the compressor, add the "compressed size" field to blocks. This is done by writing the block header after compressing the block contents.
- Support dynamic decompression block sizes, since while streaming we can't precalculate the necessary block size.
- Support using "compressed size" as the decompression block size, triggered manually.
- Support complete absence of the index, and use that to trigger "compressed size".
- Add a streaming mode for decompression, triggered manually. Instead of reading large chunks of data and passing them to the decode threads, the read thread will in this mode do the decoding itself, in small chunks. When it accumulates enough decompressed output, it will send the output directly to the writer thread, and continue the same decompression instead of starting anew. When it reaches the end of the block, it has to make sure it keeps any leftover input data around.
- Trigger streaming-mode on a block-by-block basis, when "compressed size" and the index are both not available, or a block is over some size threshold.
- Support the presence of multiple streams, including combining multiple indexes.
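The first step amounts to buffering each block's compressed bytes, then emitting a header that records their exact length ahead of the payload. A toy illustration of that framing (my own trivial format, just to show the ordering; real .xz block headers are more involved):

```python
import io
import struct

def write_block(out, compressed_payload):
    """Write one length-prefixed block. Because the payload is fully
    compressed *before* the header is emitted, the header can carry the
    exact compressed size, which is what makes streaming splits possible."""
    out.write(struct.pack("<Q", len(compressed_payload)))  # 8-byte size field
    out.write(compressed_payload)

def read_block(src):
    """Read one block back using only the size field: no index, no seeking."""
    (size,) = struct.unpack("<Q", src.read(8))
    return src.read(size)

buf = io.BytesIO()
write_block(buf, b"block-one-bytes")
write_block(buf, b"block-two")
buf.seek(0)
assert read_block(buf) == b"block-one-bytes"
assert read_block(buf) == b"block-two"
```

A reader that knows each block's compressed size can slice the input at block boundaries as it streams past, and hand each slice to a different thread.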
from pixz.
I've done parts 1, 2, and 3. Gonna re-order the other ones, and implement "streaming mode" next.
Unfortunately it's difficult, because in stream mode we only find out we're at the end of a block when liblzma tells us it's done. So we'll probably have read a bit too far, and have data left over. We need every part of the decompressor to deal with arbitrary amounts of initial data that may or may not already have been read. Ugh :(
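This is the same problem Python's LZMADecompressor exposes as unused_data: the decoder only learns it hit end-of-block after it has consumed past it, so every stage has to accept leftover input and pass its own surplus on. A minimal carry-over loop, using whole .xz streams in place of blocks for illustration:

```python
import io
import lzma

def decode_stream(leftover, src, chunk_size=512):
    """Decode one .xz stream from `leftover` followed by `src`, reading
    in small chunks. Returns (decoded_bytes, new_leftover): the decoder
    usually over-reads past its own end, and that surplus must seed the
    next stage."""
    dec = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    out = bytearray()
    data = leftover
    while not dec.eof:
        out.extend(dec.decompress(data))
        if dec.eof:
            break
        data = src.read(chunk_size)
    return bytes(out), dec.unused_data

blob = lzma.compress(b"A" * 5000) + lzma.compress(b"B" * 5000)
src = io.BytesIO(blob)
part1, leftover = decode_stream(b"", src)       # over-reads into stream two
part2, leftover = decode_stream(leftover, src)  # starts from the surplus
assert (part1, part2) == (b"A" * 5000, b"B" * 5000)
```

Threading the leftover through every call site like this is exactly the "ugh" part: each consumer of the input has to cope with an arbitrary amount of already-read data.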
from pixz.
Part 1, writing compressed/uncompressed size into block headers, is committed to master. Any archives you create with pixz should now be streamable when, eventually, I finish the streaming work.
The current progress on streaming is in branch 'stream'. Be aware that this is a temporary branch, I may do amends and rebases.
from pixz.
Ok, branch 'stream' has this feature implemented! :D https://github.com/vasi/pixz/tree/stream
A lot of changes to the codebase were involved, so I would hugely appreciate testing. Let me know how it goes!
from pixz.
A user helped with testing. Merged into master.
from pixz.