Comments (19)
Thanks for the report @KrzysFR, I'll look into it.
Traces seem to indicate that the problem was created during a call to the Huffman decoder, specifically the part which build the decoding table.
Thanks to your excellent traces, one strange behavior can be spotted right away :
HUF_fillDTableX4()
:
rankValOrigin 0x0000006f7edd9880
HUF_fillDTableX4Level2()
:
rankValOrigin 0x00007ff95ea037fd {Inside clr.dll!
Both pointer addresses should be equal. So, something must have overwritten it.
There is always a risk that this behavior is specific to .NET linking. Nonetheless I'll try to reproduce it within my C-only dev environment. As you provided the sample, it should prove straightforward.
from zstd.
I added extern __declspec(dllexport)
before the methods I need in the .h headers. From the .NET side I'm using LoadLibrary
to get the dll in my process, and then a combination of GetProcAddress(..)
and Marshal.GetDelegateForFunctionPointer(..)
for each method I want to call. I've been using the same technique successfully with zstd 0.1.x as well as lz4 and various other native C libraries in the past.
Also, the value of rankValOrigin seems to point inside the CLR stack (which would be expected since the CLR is the one calling ZSTD_decompress). But since the content of the stack is garbage, the values of all the locals should probably not be trusted...
from zstd.
I'm afraid I don't find a good way to help here.
I tried to reproduce the issue, using your samples, and a C dev environment on Linux, with 2 compilers, gcc and clang.
None of them is able to create this error.
Compressing original_fail.bin
works, and the result can be properly uncompressed, regenerating the exact original file. The compressed file is also different from the one in your package.
Attempting to decompress the downloaded sample original_fail.bin.zst
does indeed fails, but the decompressor gracefully exits, with an input corrupted
error message. Neither valgrind
nor -fsanitize=address
could catch any error, nor any write/read outside of any memory area, nor any pointer going wild.
So it seems the issue is, for the time being, only triggered within your dev environment. Which makes it more difficult to analyze.
from zstd.
Also noteworthy :
I note that compressed files are different, between the one generated by zstd
command line utility and your program.
I also note that even original_pass.bin.zst
cannot be decompressed by the zstd
command line utility.
So it seems the compressed format your are generating with your program is different, and possibly not 0.2 compliant. Just an hypothesis.
from zstd.
The only difference I can see, is that I build everything (zstd.exe and zstd_x64.dll) with Visual Studio 2013 targetting msvcr120, while you are using gcc and clang on linux.
Could the different compiler and platform have an impact on the generated compressed output? Maybe there is some optimization done by msvcc somewhere that changes things up? Note that in your repo you have a solution that targets VS2012 and msvcr110, so it could also be a recent change in msvcc combine with a recent change in 0.2 that produces bad output... Maybe also different settings between Debug and Release build...
If you have access to Visual Studio 2013 (or +) on Windows, I can try to put together all the code necessary to run my test in a very simple console application, if you want.
from zstd.
Yes, I do have access to Visual Studio 2013.
I also tested the VS2012 solution, but could not find any difference with gcc/clang behavior.
I believed VS2012 to be representative of VS2013, but maybe not ...
from zstd.
Quick update: if I build the zstd lib in Debug x64, everything works fine and all vectors compress and decompress properly. Definetly looks like there is some optimization done by msvcc that breaks things in Release...
from zstd.
Update: After going through most of the build settings, here's what I'm getting: all other settings equal, if I disable optimization (/Od
) it works. If I use any other optimization settings (/O1
, /O2
, /Ox
) then it fails.
With the /O2
and a new set of settings (copied from another lz4 vc project), the failure mode is now different than before: Instead of a crash, I'm now getting an infinite loop inside HUF_decompress4X4_usingDTable
, in the for loop that calls the HUF_DECODE_SYMBOLX4_*
macros.
When I dump the values of the op1
to op4
each iteration, they do not change at all, and so for
keeps going (neither endSignal nor op4 change).
If I dump the content of the dt
table right before the for, a whole lot of them have all HUF_DEltX4
fields to 0 (including length
which explains while the pointers do not move) and only a few here and there have non-0 values.
Now one thing I don't understand: If I compute the hash of the compressed data in Debug (where decompression succeeds), they are IDENTICAL to the hash of the same vectors compressed in Release (where decompression fails), which means that it probably does not come from the compression step itself.
I'm not sure what I can do next, because right now I can only "debug" using good old fprintf
(I can't debug or step into code from the .NET side into the native library: I only can take memory dumps (not ideal to see the thing working) or add fprintf everywhere...
Any idea?
from zstd.
It seems you have been going as far as you could given your test tools.
It's not easy to help remotely either. As an example, I recently had to change a test within zstd.c
because it was removed by clang
at highest optimization settings. clang
probably believed the test was useless, which was incorrect : it was in fact designed to detect broken input.
This is the kind of thing that can happen with some broken optimizer. They tend to be fixed, but it takes some time. And analyzing them requires direct access to the broken config.
If you need a work around right now, I would suggest to hardcode within huff0.c
the usage of HUF_decompress4X2
(instead of the dynamic selector implemented within HUF_decompress
). It will evade the problematic section of code.
from zstd.
On my side, I've been comparing VS2012 and VS2013 output.
Strangely, compressing original_fail.bin
with zstd
compiled with VS2012 doesn't give the same compressed result as zstd
compiled with VS2013 :
10/26/2015 11:32 PM 177,876 fail2012.compressed
10/26/2015 11:34 PM 177,858 fail2013.compressed
I also note that the 2012 version can decompress both files, and regenerate original_fail.bin
appropriately. The 2013 release version fails on both files.
But zstd2013
in debug mode successfully decompress both files.
This is in line with your own experiments. Just weird, and damn difficult to debug ...
from zstd.
Step-tracing into the 2013 release version,
the problem seems located at line 1032 within huff0.c
:
for (consumed=minBits; consumed <= memLog-minBits; consumed++)
for (w=1; w<=maxW; w++)
rankVal[consumed][w] = rankVal[0][w] >> consumed;
This section is supposed to build rankVal
, a 2D table.
The debug version works fine, but the release version seems to do strange things. At the end of this section, only the first line of the table seems properly filled, everything else remaining uninitialized to zero.
Obviously, it has nasty consequences on the rest of the decoding table build process ...
_[Edit]_ : this is confirmed with printf
traces : the content of rankVal
remains 0
in release mode. So that's where the problem seems to start.
from zstd.
Ok so If I hardcode calls into HUF_decompress4X2
it works in all cases, and if I instead call HUF_decompress4X4
(or 4X6) then if fails in ALL cases. Probably the vectors that failed earlier were the one that use 4X4 due to your timing heuristic (which seems to deterministically select X2 or X4), not the content itself.
Second, I compared HUF_readDTableX2
and HUF_readDTableX4
to look for some error somewhere, and I found out that if I put an fprintf
inside the following loop in HUF_readDTableX4
then I immediately get an ZSTD_error_corruption_detected
instead of an infinite loop (without the fprintf).
for (consumed=minBits; consumed <= memLog-minBits; consumed++)
for (w = 1; w <= maxW; w++)
{
rankVal[consumed][w] = rankVal[0][w] >> consumed;
//fprintf(stdout, "rankVal[%d][%d]=%d\r\n", consumed, w, rankVal[consumed][w]); // uncomment this line change the behaviour with /O2
}
So:
- Debug: everything always compress and decompress fine
- Release /Od: same as debug works fine
- Release /O2 without printf: infinite loop later in the code
- Release /O2 WITH printf: immediate
ZSTD_error_corruption_detected
... 😑
from zstd.
If I replace the faulty section with :
for (consumed=minBits; consumed <= memLog-minBits; consumed++)
for (w = 1; w <= maxW; w++)
{
rankVal[consumed][w] = rankVal[0][w] >> consumed;
printf("[%u][%u] : %u \n", consumed, w, (U32)(rankVal[consumed][w]));
}
which basically forces the inner loop to do something (printf
) on top of filling the table, then the table is properly filled, and decoding process works properly.
So it seems that rankVal[consumed][w] = rankVal[0][w] >> consumed;
is optimized away when it is alone. The step-trace within Visual behaves strangely, as if w>maxW
after the first attempt, skipping all other loops (neither w
nor maxW
are available value within Visual debugger when tracing the release binary. So these values are likely optimized inside registers). Looks like a compiler bug to me.
from zstd.
I tested using Visual Studio 2015 and the same behavior happens.
Right now, I've added a #define in huff0.h that allows me to choose either the auto select or force calling 4X2, 4X4 or 4X6 (thouh the last two fail currently). With only 4X2 I was able to build a version that works fine and seems to have correct performance.
Looking more into the codegen issue, I've enabled the /FAcs
compiler option (as described in https://randomascii.wordpress.com/2013/10/14/how-to-report-a-vc-code-gen-bug/ ) that produces .cod
files with the assembler produced by the C compiler.
Looking at huff0.cod
, I see this (Release x64, the one that fails).
; 1032 : for (consumed=minBits; consumed <= memLog-minBits; consumed++)
001f0 45 8b dc mov r11d, r12d
001f3 44 2b d8 sub r11d, eax
001f6 44 3b d3 cmp r10d, ebx
001f9 72 43 jb SHORT $LN4@HUF_readDT
001fb 0f 1f 44 00 00 npad 5
$LL6@HUF_readDT:
; 1033 : for (w=1; w<=maxW; w++)
00200 41 3b c3 cmp eax, r11d
00203 77 32 ja SHORT $LN5@HUF_readDT
00205 44 8b c3 mov r8d, ebx
00208 8b c8 mov ecx, eax
0020a 4c 8d 4d f0 lea r9, QWORD PTR rankVal$[rbp-256]
0020e 4f 8d 0c 81 lea r9, QWORD PTR [r9+r8*4]
00212 48 6b d1 11 imul rdx, rcx, 17
00216 49 03 d0 add rdx, r8
00219 4c 8d 45 f0 lea r8, QWORD PTR rankVal$[rbp-256]
0021d 4d 8d 04 90 lea r8, QWORD PTR [r8+rdx*4]
$LL3@HUF_readDT:
; 1034 : rankVal[consumed][w] = rankVal[0][w] >> consumed;
00221 41 8b 11 mov edx, DWORD PTR [r9]
00224 8b c8 mov ecx, eax
00226 ff c0 inc eax
00228 d3 ea shr edx, cl
0022a 4d 8d 40 44 lea r8, QWORD PTR [r8+68]
0022e 41 89 50 bc mov DWORD PTR [r8-68], edx
00232 41 3b c3 cmp eax, r11d
00235 76 ea jbe SHORT $LL3@HUF_readDT
$LN5@HUF_readDT:
; 1031 : }
; 1032 : for (consumed=minBits; consumed <= memLog-minBits; consumed++)
00237 ff c3 inc ebx
00239 41 3b da cmp ebx, r10d
0023c 76 c2 jbe SHORT $LL6@HUF_readDT
$LN4@HUF_readDT:
; 1035 : }
There is indeed some code generated by the compiler for the look. My asm is rusty but it does look like it is doing the right thing, it's not as optimized as I would expect.
Either the code is actually writing stuff in the wrong place, or it gets optimized away layer (during linking?)
from zstd.
I got it to work!
It's definitely easier when you can see the assembly code generated while you are poking at the thing. I tried hoisting both rankVal[0]
and rankVal[consumed]
outside the loops (they do not change at all), like this:
/* Build rankVal */
{
const U32 minBits = tableLog+1 - maxW;
U32 nextRankVal = 0;
U32 w, consumed;
const int rescale = (memLog-tableLog) - 1; /* tableLog <= memLog */
U32* rankVal0 = rankVal[0];
for (w=1; w<=maxW; w++)
{
U32 current = nextRankVal;
nextRankVal += rankStats[w] << (w+rescale);
rankVal0[w] = current;
}
for (consumed = minBits; consumed <= memLog - minBits; consumed++)
{
U32* rankValPtr = rankVal[consumed];
for (w = 1; w <= maxW; w++)
{
rankValPtr[w] = rankVal0[w] >> consumed;
}
}
}
The assembly code changed a lot, it looks there is some loop unrolling done (which I would expect with /O2) EDIT: oops I mistakenly pasted the code twice 💦
; 1033 : for (consumed = minBits; consumed <= memLog - minBits; consumed++)
001d9 45 8b cf mov r9d, r15d
001dc 45 2b cb sub r9d, r11d
001df 45 3b d9 cmp r11d, r9d
001e2 0f 87 d0 00 00
00 ja $LN4@HUF_readDT
001e8 41 8b cb mov ecx, r11d
001eb 48 8d 5d f0 lea rbx, QWORD PTR rankVal$[rbp-256]
001ef 48 6b c1 44 imul rax, rcx, 68 ; 00000044H
001f3 48 03 d8 add rbx, rax
001f6 48 6b f9 bc imul rdi, rcx, -68 ; ffffffffffffffbcH
001fa 66 0f 1f 44 00
00 npad 6
$LL6@HUF_readDT:
; 1034 : {
; 1035 : U32* rankValPtr = rankVal[consumed];
; 1036 : for (w = 1; w <= maxW; w++)
00200 ba 01 00 00 00 mov edx, 1
00205 66 41 0f 6e cb movd xmm1, r11d
0020a 44 3b d2 cmp r10d, edx
0020d 0f 82 91 00 00
00 jb $LN5@HUF_readDT
00213 41 83 fa 08 cmp r10d, 8
00217 72 62 jb SHORT $LN45@HUF_readDT
; 1037 : {
; 1038 : rankValPtr[w] = rankVal0[w] >> consumed;
00219 41 8b c2 mov eax, r10d
0021c 48 8d 4d f0 lea rcx, QWORD PTR rankVal$[rbp-256]
00220 48 8d 0c 81 lea rcx, QWORD PTR [rcx+rax*4]
00224 4c 8d 04 83 lea r8, QWORD PTR [rbx+rax*4]
00228 48 8d 43 04 lea rax, QWORD PTR [rbx+4]
0022c 48 3b c1 cmp rax, rcx
0022f 77 09 ja SHORT $LN46@HUF_readDT
00231 48 8d 45 f4 lea rax, QWORD PTR rankVal$[rbp-252]
00235 4c 3b c0 cmp r8, rax
00238 73 41 jae SHORT $LN45@HUF_readDT
$LN46@HUF_readDT:
0023a 41 8b c2 mov eax, r10d
0023d 45 8b c2 mov r8d, r10d
00240 83 e0 07 and eax, 7
00243 44 2b c0 sub r8d, eax
00246 66 66 0f 1f 84
00 00 00 00 00 npad 10
$LL3@HUF_readDT:
00250 8b c2 mov eax, edx
00252 f3 0f 6f 44 85
f0 movdqu xmm0, XMMWORD PTR rankVal$[rbp+rax*4-256]
00258 66 0f d2 c1 psrld xmm0, xmm1
0025c f3 0f 7f 04 83 movdqu XMMWORD PTR [rbx+rax*4], xmm0
00261 8d 42 04 lea eax, DWORD PTR [rdx+4]
00264 83 c2 08 add edx, 8
00267 f3 0f 6f 44 85
f0 movdqu xmm0, XMMWORD PTR rankVal$[rbp+rax*4-256]
0026d 66 0f d2 c1 psrld xmm0, xmm1
00271 f3 0f 7f 04 83 movdqu XMMWORD PTR [rbx+rax*4], xmm0
00276 41 3b d0 cmp edx, r8d
00279 76 d5 jbe SHORT $LL3@HUF_readDT
$LN45@HUF_readDT:
; 1034 : {
; 1035 : U32* rankValPtr = rankVal[consumed];
; 1036 : for (w = 1; w <= maxW; w++)
0027b 41 3b d2 cmp edx, r10d
0027e 77 24 ja SHORT $LN5@HUF_readDT
00280 45 8b c2 mov r8d, r10d
00283 8b c2 mov eax, edx
00285 44 2b c2 sub r8d, edx
00288 48 8d 04 83 lea rax, QWORD PTR [rbx+rax*4]
0028c 41 ff c0 inc r8d
0028f 90 npad 1
$LL44@HUF_readDT:
; 1037 : {
; 1038 : rankValPtr[w] = rankVal0[w] >> consumed;
00290 8b 14 07 mov edx, DWORD PTR [rdi+rax]
00293 41 8b cb mov ecx, r11d
00296 48 8d 40 04 lea rax, QWORD PTR [rax+4]
0029a d3 ea shr edx, cl
0029c 89 50 fc mov DWORD PTR [rax-4], edx
0029f 49 ff c8 dec r8
002a2 75 ec jne SHORT $LL44@HUF_readDT
$LN5@HUF_readDT:
; 1032 : }
; 1033 : for (consumed = minBits; consumed <= memLog - minBits; consumed++)
002a4 41 ff c3 inc r11d
002a7 48 83 c3 44 add rbx, 68 ; 00000044H
002ab 48 83 ef 44 sub rdi, 68 ; 00000044H
002af 45 3b d9 cmp r11d, r9d
002b2 0f 86 48 ff ff
ff jbe $LL6@HUF_readDT
$LN4@HUF_readDT:
from zstd.
Looking at the original ASM code for the inner loop (the one that fails):
$LL3@HUF_readDT:
; 1034 : rankVal[consumed][w] = rankVal[0][w] >> consumed;
00221 41 8b 11 mov edx, DWORD PTR [r9]
00224 8b c8 mov ecx, eax
00226 ff c0 inc eax
00228 d3 ea shr edx, cl
0022a 4d 8d 40 44 lea r8, QWORD PTR [r8+68]
0022e 41 89 50 bc mov DWORD PTR [r8-68], edx
00232 41 3b c3 cmp eax, r11d
00235 76 ea jbe SHORT $LL3@HUF_readDT
The loop index w
is stored in the register eax
, the address of rankVal[0][w]
is stored in r9
and r8
is used to compute the offset in rankVal[consumed][w]
. The only problem is that r9
is not updated anywhere in the loop! It is only computed once per outer loop iteration, so basically it is always storing rankVal[0][1] >> consumed
everywhere in the table...
Also, it seems that the right shift is done using eax
which is the loop index (w
) and NOT consumed
? So it is computing something like rankVal[0][1] >> w
?
So it's not a case of the compiler optimizing away the loops, it is more like the compiler generating some funny code which fills the table with mostly zeroes....
from zstd.
Thanks, that's a great investigation.
You nailed it. The work around works perfectly, and I could check it generates negligible speed difference for other compilers / OS.
So it's a fix. "dev" branch has been updated with it.
Rgds
from zstd.
Confirmed that it works with the dev branch.
from zstd.
Fix ported to release 0.2.2
from zstd.
Related Issues (20)
- Add library and cli flags for file format with embedded dictionary
- Question about ZSTD protocole HOT 2
- Building on MacOS 13 and targeting MacOS 11 and SDK 11.3 (or any other MacOS version) does not work HOT 2
- Integrating the library with an external thread pool HOT 2
- Is it safe to move compression and decompression contexts between threads? HOT 1
- ZDICT_trainFromBuffer_cover is not thread safe HOT 17
- zstd compression output differens with the same options between 1.5.5 and 1.5.6 HOT 5
- Warning message for `zstd -v --train` is missing line breaks
- How to accelerate the process of dictionary training in zstd? HOT 5
- tests/cli-tests/cltools/zstdless.sh fails with newer version of less HOT 3
- Please promote thread pools from experimental to stable HOT 1
- The CMake build script breaks check_ipo_supported
- Dynamic decompression HOT 3
- Change `dictionary_compression.c` example to use API for dictionary creation
- Enable weak symbol support for Risc-V? HOT 1
- Possibly missing check for truncated initial states in Huffman weight block HOT 4
- Poor compressor behavior on interleaved data HOT 2
- zstd 1.5.5+ has worse performance on Graviton2 nodes than v1.4.4 HOT 4
- [Not a bug] Dictionary building strategy HOT 7
- CLI: Hang bomb with with crafted circular symbolic link causes "zstd -d -r -f" to infinitely loop. "pigz -d-r -f" skips symbolic links with non compressed suffix
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from zstd.