Hmm, that's a pretty weird use-case, compressing random data should yield no improvement. But it's still a bug, so I'll see if I can reproduce. Can I get the details of the computer on which you experience these bugs? OS, OS version, version of pixz, number of CPUs, and amount of RAM are the important bits.
from pixz.
Also, does this happen with just /dev/urandom? Other regular files? Other dev files? What if you copy some data from urandom to a regular file, and then compress that?
A quick test on my Mac can't reproduce the problem, I can try in Linux or FreeBSD or something later.
I have successfully reproduced the problem, while saving a copy of the input, with:
dd if=/dev/urandom | tee out | pixz -7 > /dev/null
I have finally reduced it to a 40 MiB file (it was initially 200 MiB), and the problem can be reproduced at will. You can find my computer specs at the end. I have modified the code so it dumps core on the error; I have a 323 MiB core uncompressed (183 MiB with gzip -9). What should I test or do with it? Are there any variables or anything I can inspect? I can upload it somewhere too.
If I had to guess, I would say there is some nasty multi-threading issue somewhere, some kind of concurrency interaction. My idea is to add some random sleeps in the part where it crashes and see if it happens less often. What do you think?
It seems it is exclusively thread 0 that crashes every time, because I see in the stack trace (thnum=0):
#5 0x000000000040683e in encode_thread (thnum=0) at write.c:305
Is thnum a constant like a CPU id, always assigned in the same order? Something like the first CPU gets 0, the second gets 1, and so on?
I haven't tested with many files other than regular files. I would say this problem arises because the entropy of urandom is very high, if not maximal. I could test with video, images, and compressed files later (if that could help).
My specs: https://gist.github.com/4479221
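The entropy point above is easy to check with a short sketch. This uses Python's stdlib lzma module rather than pixz itself, purely as a hypothetical illustration: high-entropy bytes from os.urandom are essentially incompressible, so xz-format output ends up larger than the input due to container and chunk overhead.

```python
import lzma
import os

# Hypothetical demonstration (stdlib lzma, not pixz): random bytes are
# incompressible, so the xz-compressed output is *larger* than the input.
data = os.urandom(1 << 20)  # 1 MiB of high-entropy data
compressed = lzma.compress(data, preset=7)
assert len(compressed) > len(data)
```

This is exactly why an output buffer sized to the input alone is not enough for incompressible data.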
Hmm, the core dump won't help me unless I have the debug binaries. Could you get a backtrace for me? If you don't know how to do that, I can give you instructions :)
Also, if you can upload your 'out' file that causes the crash somewhere, that would be great.
Here is the backtrace: https://gist.github.com/4479415
Please note that I added an assert in die() so the program dumps core.
Edit: It seems thnum is sometimes not 0; it was 1 a few times.
If this can help, here is a dump of some variables in the scope of die(). I don't know, but I hope this may give you some insight into the problem: https://gist.github.com/4479481
Oooh, this is interesting, thanks. It's definitely not a concurrency problem. It looks like somehow we're not allocating enough space for the output, but that really shouldn't happen.
I would really appreciate it if you could upload the data that causes the crash.
I am unsure where I can upload my file. Any ideas? Anyway, you could generate a >50 MiB file from /dev/urandom on your machine and it should trigger the problem as well; there doesn't seem to be anything special about my file, it is just high-entropy data.
It seems my last experiment confirms that high-entropy files cause the problem. I took the Big Buck Bunny movie and compressed it twice, and it crashed. I did:
wget http://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4
pixz -7 BigBuckBunny_320x180.mp4
pixz -7 BigBuckBunny_320x180.mp4.xz << Crashed after writing 12 bytes of output
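The double-compression experiment above can be modeled with a short sketch, again using Python's stdlib lzma as a stand-in for pixz: the first pass shrinks ordinary data, but its output is high-entropy, so a second pass cannot shrink it further.

```python
import lzma

# Hypothetical model of compressing a file twice (stdlib lzma, not pixz):
# compressed output is high-entropy, so recompressing it yields no gain.
original = b"The quick brown fox jumps over the lazy dog. " * 5000
once = lzma.compress(original, preset=7)    # first pass: shrinks
twice = lzma.compress(once, preset=7)       # second pass: no improvement
assert len(once) < len(original)
assert len(twice) >= len(once)
```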
Yay, I can reproduce the bug now! Thanks :) I'll see what I can figure out.
Whew, that was a doozy of a bug! Thanks so much for your help finding the bug and tracking it down. In my limited testing, incompressible input now works ok, and normal compressible input continues to work. It would be great if you could test a bit too, and let me know whether it's working for you.
Here's a detailed explanation of what went wrong. Normally, each pixz compression thread works like this pseudocode:
setup_compression();
while (get_input_block()) {
    allocate_output_space(lzma_block_buffer_bound(input_size()));
    do_compression();
}
cleanup_compression();
Two important notes about this:
- We only do setup once, not for every block. This saves time!
- We know how much space to allocate using a function lzma_block_buffer_bound() from the liblzma API.
It turns out that lzma_block_buffer_bound() works correctly only in very particular conditions. It assumes that when the output gets too big, compression switches to a special "incompressible data" mode. Unfortunately, this special mode only activates when we do compression with a particular function. More unfortunately, using that function doesn't allow us to do the only-setup-once technique.
So I've emailed the liblzma author, asking him to fix this situation. And in the meantime, I've implemented the special mode in pixz. It's ugly to have such low-level code there, but it wasn't too hard to implement.
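A simplified model of that "incompressible data" fallback can be sketched as follows. This is not pixz's actual code (pixz implements the mode at the LZMA2 chunk level); it is a hypothetical illustration using Python's stdlib lzma, with made-up 1-byte markers: if compressing a block does not make it smaller, store the raw bytes instead, so the output size stays bounded.

```python
import lzma
import os

# Made-up markers for this sketch (not part of pixz or the xz format).
RAW, XZ = b"\x00", b"\x01"

def encode_block(block: bytes) -> bytes:
    """Compress a block, falling back to storing it raw if that is smaller."""
    compressed = lzma.compress(block, preset=7)
    if len(compressed) < len(block):
        return XZ + compressed
    return RAW + block  # incompressible: store as-is, output size is bounded

def decode_block(payload: bytes) -> bytes:
    marker, body = payload[:1], payload[1:]
    return body if marker == RAW else lzma.decompress(body)

# Round-trip both an incompressible and a compressible block.
random_block = os.urandom(65536)      # incompressible
text_block = b"hello " * 10000        # highly compressible
for b in (random_block, text_block):
    assert decode_block(encode_block(b)) == b
```

With this fallback, the encoded size of any block is at most the block size plus one marker byte, so a buffer sized from the input can never overflow.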
Sorry for my wrong guess about multi-threading; I hope it didn't make you lose time. I'm very glad I was of some help to you and pixz. I have tried compressing it more than 5 times and the problem seems to be fixed. I compressed with pixz -7, then decompressed with unxz, and checked the md5sum at the end (correct). I tested with /dev/urandom too and all seems fine now. Good job.
Your code looks involved, so I cannot check its correctness. I would need to study liblzma and your program too, sorry.
No need to be sorry, you did great. Closing :)