stdin and stdout (pixz) · CLOSED · 15 comments

vasi commented:
stdin and stdout

Comments (15)

vasi commented:

Hi, this should already work! E.g.:

pixz < data.tar > compressed.tpxz
pixz -d < compressed.xz > output

Is it not working for you in some way?

vasi commented:

PS: Not all files can be decompressed using multiple CPUs. You will only get the speed-up if it was originally compressed with pixz or another tool that compresses by segments.

richud commented:

Thanks for the quick reply. This is a simplified version of what I am trying to do, as I do it with xz:

wget -qO- http://xxx/myimage.xz | xz -c -d | ntfsclone -r --overwrite /dev/sda1

I could not get pixz to do this.
Thanks for any help

vasi commented:

pixz can only work on seekable compressed data. This is due to limitations of the .xz format, not anything about pixz itself. I suppose it would be possible to detect that the input is unseekable and just revert to single-CPU mode, but there's no real advantage to using pixz over xz in that case.

In the specific case of HTTP, it would theoretically be possible to make Range requests in lieu of seeking. Maybe there's a FUSE filesystem somewhere that can "mount" an HTTP request as a file? That would be really interesting!
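
For illustration, HTTP byte-range requests are the primitive such a "mount" would be built on. A rough sketch with curl (the URL is just a placeholder, and the server has to support ranges):

curl -s -r 0-1023 http://example.com/myimage.tpxz -o head_chunk    # read bytes 0-1023, like a seek+read
curl -s -r -65536 http://example.com/myimage.tpxz -o tail_chunk    # read the last 64 KiB, roughly where an xz index would be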

richud commented:

vasi: sorry, I am somewhat confused! [All of the below is on the same hardware, a mid-range Core 2 Duo.]

I am currently using pigz to multithread-decompress gzip piped from wget, as above. Gigabit connection; both CPUs are loaded to about 30%, but HDD write speed is actually the limiting factor.
I have tried the above using xz (5.1.1) with a 'normal' xz image, and that also works from wget, but both CPU cores only load to 40-60%, the HDD isn't limiting, and it takes about 4x longer. I don't understand what is limiting in this case, nor do I understand why both cores are loaded, as I understand xz 5.1.1 is only multithreaded for compressing, not decompressing.
Since it works, I assume a normal xz image doesn't need to be seekable.

Are you saying pixz creates an image that needs seeking, whereas normal xz is streamable?

Thanks!

richud commented:

Hmm, an image created with pixz streams OK using xz to decompress through wget.

vasi commented:

Ok, first I'll explain the CPU usage thing, though it's a bit extraneous. Basically, when something is single-threaded, it means it can only use one CPU at a time, but which CPU it's using at any moment could be arbitrary; that's up to the operating system. So imagine a usage pattern like this:

Time    CPU 1    CPU 2
0       xz       free
10ms    free     xz
20ms    xz       free
30ms    free     xz
40ms    xz       free
etc...

Your CPU usage monitor probably only measures usage once a second, so it will see this as 500ms on CPU1, and 500ms on CPU2, and will show you 50% on each CPU.

Ok, now back to xz. Data compressed by pixz is entirely compatible with the .xz format. It can be decompressed by xz, and doesn't require seeking to decompress. You can totally do 'pixz < some.data | xz -cd > thesame.data'.
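
To check that concretely, a round trip like this should come back bit-identical (file names are placeholders):

pixz < some.data > some.data.xz            # parallel compression with pixz
xz -cd < some.data.xz | cmp - some.data    # plain xz decompresses it from a pipe; cmp exits 0 on a match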

What does require seeking is multi-threaded decompression. If a .xz file (from xz or pixz) contains multiple 'segments' that can be decompressed in parallel, information about those segments is usually at the end of the file. So without seeking, those segments can't be found, and decompression has to be single-threaded.
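
You can see the seeking requirement with xz's own --list mode, which reads exactly that end-of-file information (assuming an xz new enough to have --list):

xz --list some.data.xz          # works: xz seeks to the index at the end of the file
cat some.data.xz | xz --list    # fails: the index can't be reached through a pipe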

I realize now that it's theoretically possible to create a .xz file with multiple segments, and with segment information stored inline. This could in fact be both streamed and parallelized. Unfortunately, it's a fair bit more difficult to create files like this.

richud commented:

Sorry, you are quite right about CPUs/threads, now that I've looked at top more carefully while it was running! That was stupid of me.

I guess I will have to stick with pigz then, as it's so much faster at decompressing :(
(Faster than plain gzip, that is. Now I see it's also single-threaded for decompression, but the docs say it does other things in separate threads.)

My image shrank a lot with xz: 3.5 GB (gzip) down to 2.6 GB (xz). I like your inline idea; I guess you would have a unique feature and would speed up decompression greatly, as most people have at least 2 cores nowadays?

Do you know of anything that's currently multithreaded for decompression and compresses better than .gz?

vasi commented:

Maybe threadzip? I'll see if I can modify pixz to produce files that can be streamed and decompressed in parallel, but it won't be right away.

richud commented:

Any joy updating for multithreaded decompression?

vasi commented:

Alright, so I'm just documenting what needs to happen to make this work:

For parallel decompression, we have to be able to split the compressed data. There are two cases that allow this:

  • If the file is seekable, and an xz index is present at the end of the file.
  • If blocks contain a "compressed size" field. This would allow streaming. Currently pixz compresses files without adding "compressed size" to blocks, but the xz 5.1 alphas do the right thing (see the sketch below).
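
As a sketch of that second case, a new enough xz (the 5.1 alphas and later, where --block-size exists) can already produce such multi-block files. File names are placeholders:

xz --block-size=16MiB < big.img > big.img.xz    # independent 16 MiB blocks, with sizes stored in each block header
xz --list --verbose big.img.xz                  # per-block layout, read via the index (seekable file only)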

When decompressing, we may encounter the following cases:

  • An index is accessible: Seekable files produced by current pixz or xz 5.1. We should try to use the file-index if present for fast listing/extraction. Decompression should be done in parallel.
  • No index is accessible, but blocks have "compressed size" fields: Streaming files produced by xz 5.1 or future pixz. We can't access the file-index, and should warn if the user requests filtered extraction. Decompression should be done in parallel.
  • No index is accessible, and blocks have no "compressed size": Streaming files produced by current pixz, and all files produced by xz 5.0.x and earlier. We can't access the file-index, see above. We also can't split the compressed data, so we fall back to single-threaded decompression.

But we expect some weird/unfortunate occurrences:

  • XZ files can contain multiple streams, e.g. if they're concatenated. We should attempt to find all indexes and combine them. This is incompatible with a pixz file-index, so we should ensure we don't attempt to use one.
  • Some input may be partially parallelizable. Maybe two xz files were concatenated, one from xz 5.1 and one from xz 5.0. We should do blocks with "compressed size" in parallel, but blocks without should be done single-threaded.
  • Some or all blocks may be too large for memory. (This will especially happen with xz 5.0 and earlier, since it doesn't do any splitting at all.) We must force those blocks to be decompressed single-threaded. Currently all decompression is in-memory, so we have to ensure we have a more stream-oriented way available as well.
  • We might encounter an index that disagrees with "compressed size". This is definitely an error in the input, so we should at least warn the user. We could either exit with error, or attempt to continue by choosing either the index or "compressed size".

The implementation plan:

  1. In the compressor, add the "compressed size" field to blocks. This is done by writing the block header after compressing the block contents.
  2. Support dynamic decompression block sizes, since while streaming we can't precalculate the necessary block size.
  3. Support using "compressed size" as the decompression block size, triggered manually.
  4. Support complete absence of the index, and use that to trigger "compressed size".
  5. Add a streaming mode for decompression, triggered manually. Instead of reading large chunks of data and passing them to the decode threads, the read thread will in this mode do the decoding itself, in small chunks. When it accumulates enough decompressed output, it will send the output directly to the writer thread, and continue the same decompression instead of starting anew. When it reaches the end of the block, it has to make sure it keeps any leftover input data around.
  6. Trigger streaming mode on a block-by-block basis, when neither "compressed size" nor the index is available, or a block is over some size threshold.
  7. Support the presence of multiple streams, including combining multiple indexes.
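
Once all of that is in place, the kind of pipeline that started this thread should parallelize end to end, i.e. something like richud's original command (URL and device are placeholders from his example):

wget -qO- http://xxx/myimage.xz | pixz -d | ntfsclone -r --overwrite /dev/sda1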

vasi commented:

I've done parts 1, 2 and 3. Gonna re-order the other ones and implement "streaming mode" next.

Unfortunately it's difficult, because in stream mode we only find out we're at the end of a block when liblzma tells us it's done. So we'll probably have read a bit too far, and have data left over. We need every part of the decompressor to deal with arbitrary amounts of initial data that may or may not already have been read. Ugh :(

vasi commented:

Part 1, writing compressed/uncompressed size into block headers, is committed to master. Any archives you create with pixz should now be streamable when, eventually, I finish the streaming work.

The current progress on streaming is in branch 'stream'. Be aware that this is a temporary branch; I may do amends and rebases.

vasi commented:

Ok, branch 'stream' has this feature implemented! :D https://github.com/vasi/pixz/tree/stream

A lot of changes to the codebase were involved, so I would hugely appreciate testing. Let me know how it goes!
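
If it helps, a minimal smoke test for the new mode could look like this (assuming a pixz binary built from the 'stream' branch; file names are placeholders):

pixz < testfile > testfile.xz            # compress; block headers now carry sizes
cat testfile.xz | pixz -d > roundtrip    # decompress through a pipe, so no seeking is possible
cmp testfile roundtrip && echo ok        # ok means the streamed round trip matched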

vasi commented:

A user helped with testing. Merged into master.
