tl;dr: this is a codegen issue with VS2013 in Release x64, see di

Thanks for the report <a class="user-mention notranslate" data-hovercard-type="user" d

"Stack cookie instrumentation code detected a stack-based buffer overrun" in HUF_fillDTableX4Level2,about facebook/zstd

Comments (19)

Cyan4973 commented on July 27, 2024

Thanks for the report @KrzysFR, I'll look into it.

Traces seem to indicate that the problem was created during a call to the Huffman decoder, specifically the part which build the decoding table.

Thanks to your excellent traces, one strange behavior can be spotted right away :

HUF_fillDTableX4() :

rankValOrigin       0x0000006f7edd9880

HUF_fillDTableX4Level2() :

rankValOrigin       0x00007ff95ea037fd {Inside clr.dll!

Both pointer addresses should be equal. So, something must have overwritten it.

There is always a risk that this behavior is specific to .NET linking. Nonetheless I'll try to reproduce it within my C-only dev environment. As you provided the sample, it should prove straightforward.

from zstd.

KrzysFR commented on July 27, 2024

I added extern __declspec(dllexport) before the methods I need in the .h headers. From the .NET side I'm using LoadLibrary to get the dll in my process, and then a combination of GetProcAddress(..) and Marshal.GetDelegateForFunctionPointer(..) for each method I want to call. I've been using the same technique successfully with zstd 0.1.x as well as lz4 and various other native C libraries in the past.

Also, the value of rankValOrigin seems to point inside the CLR stack (which would be expected since the CLR is the one calling ZSTD_decompress). But since the content of the stack is garbage, the values of all the locals should probably not be trusted...

from zstd.

Cyan4973 commented on July 27, 2024

I'm afraid I don't find a good way to help here.

I tried to reproduce the issue, using your samples, and a C dev environment on Linux, with 2 compilers, gcc and clang.
None of them is able to create this error.

Compressing original_fail.bin works, and the result can be properly uncompressed, regenerating the exact original file. The compressed file is also different from the one in your package.

Attempting to decompress the downloaded sample original_fail.bin.zst does indeed fails, but the decompressor gracefully exits, with an input corrupted error message. Neither valgrind nor -fsanitize=address could catch any error, nor any write/read outside of any memory area, nor any pointer going wild.

So it seems the issue is, for the time being, only triggered within your dev environment. Which makes it more difficult to analyze.

from zstd.

Cyan4973 commented on July 27, 2024

Also noteworthy :

I note that compressed files are different, between the one generated by zstd command line utility and your program.

I also note that even original_pass.bin.zst cannot be decompressed by the zstd command line utility.

So it seems the compressed format your are generating with your program is different, and possibly not 0.2 compliant. Just an hypothesis.

from zstd.

KrzysFR commented on July 27, 2024

The only difference I can see, is that I build everything (zstd.exe and zstd_x64.dll) with Visual Studio 2013 targetting msvcr120, while you are using gcc and clang on linux.

Could the different compiler and platform have an impact on the generated compressed output? Maybe there is some optimization done by msvcc somewhere that changes things up? Note that in your repo you have a solution that targets VS2012 and msvcr110, so it could also be a recent change in msvcc combine with a recent change in 0.2 that produces bad output... Maybe also different settings between Debug and Release build...

If you have access to Visual Studio 2013 (or +) on Windows, I can try to put together all the code necessary to run my test in a very simple console application, if you want.

from zstd.

Cyan4973 commented on July 27, 2024

Yes, I do have access to Visual Studio 2013.

I also tested the VS2012 solution, but could not find any difference with gcc/clang behavior.
I believed VS2012 to be representative of VS2013, but maybe not ...

from zstd.

KrzysFR commented on July 27, 2024

Quick update: if I build the zstd lib in Debug x64, everything works fine and all vectors compress and decompress properly. Definetly looks like there is some optimization done by msvcc that breaks things in Release...

from zstd.

KrzysFR commented on July 27, 2024

Update: After going through most of the build settings, here's what I'm getting: all other settings equal, if I disable optimization (/Od) it works. If I use any other optimization settings (/O1, /O2, /Ox) then it fails.

With the /O2 and a new set of settings (copied from another lz4 vc project), the failure mode is now different than before: Instead of a crash, I'm now getting an infinite loop inside HUF_decompress4X4_usingDTable, in the for loop that calls the HUF_DECODE_SYMBOLX4_* macros.

When I dump the values of the op1 to op4 each iteration, they do not change at all, and so for keeps going (neither endSignal nor op4 change).

If I dump the content of the dt table right before the for, a whole lot of them have all HUF_DEltX4 fields to 0 (including length which explains while the pointers do not move) and only a few here and there have non-0 values.

Now one thing I don't understand: If I compute the hash of the compressed data in Debug (where decompression succeeds), they are IDENTICAL to the hash of the same vectors compressed in Release (where decompression fails), which means that it probably does not come from the compression step itself.

I'm not sure what I can do next, because right now I can only "debug" using good old fprintf (I can't debug or step into code from the .NET side into the native library: I only can take memory dumps (not ideal to see the thing working) or add fprintf everywhere...

Any idea?

from zstd.

Cyan4973 commented on July 27, 2024

It seems you have been going as far as you could given your test tools.

It's not easy to help remotely either. As an example, I recently had to change a test within zstd.c because it was removed by clang at highest optimization settings. clang probably believed the test was useless, which was incorrect : it was in fact designed to detect broken input.
This is the kind of thing that can happen with some broken optimizer. They tend to be fixed, but it takes some time. And analyzing them requires direct access to the broken config.

If you need a work around right now, I would suggest to hardcode within huff0.c the usage of HUF_decompress4X2 (instead of the dynamic selector implemented within HUF_decompress). It will evade the problematic section of code.

from zstd.

Cyan4973 commented on July 27, 2024

On my side, I've been comparing VS2012 and VS2013 output.
Strangely, compressing original_fail.bin with zstd compiled with VS2012 doesn't give the same compressed result as zstd compiled with VS2013 :

10/26/2015  11:32 PM           177,876 fail2012.compressed
10/26/2015  11:34 PM           177,858 fail2013.compressed

I also note that the 2012 version can decompress both files, and regenerate original_fail.bin appropriately. The 2013 release version fails on both files.

But zstd2013 in debug mode successfully decompress both files.

This is in line with your own experiments. Just weird, and damn difficult to debug ...

from zstd.

Cyan4973 commented on July 27, 2024

Step-tracing into the 2013 release version,
the problem seems located at line 1032 within huff0.c :

        for (consumed=minBits; consumed <= memLog-minBits; consumed++)
            for (w=1; w<=maxW; w++)
                rankVal[consumed][w] = rankVal[0][w] >> consumed;

This section is supposed to build rankVal, a 2D table.
The debug version works fine, but the release version seems to do strange things. At the end of this section, only the first line of the table seems properly filled, everything else remaining uninitialized to zero.

Obviously, it has nasty consequences on the rest of the decoding table build process ...

_[Edit]_ : this is confirmed with printf traces : the content of rankVal remains 0 in release mode. So that's where the problem seems to start.

from zstd.

KrzysFR commented on July 27, 2024

Ok so If I hardcode calls into HUF_decompress4X2 it works in all cases, and if I instead call HUF_decompress4X4 (or 4X6) then if fails in ALL cases. Probably the vectors that failed earlier were the one that use 4X4 due to your timing heuristic (which seems to deterministically select X2 or X4), not the content itself.

Second, I compared HUF_readDTableX2 and HUF_readDTableX4 to look for some error somewhere, and I found out that if I put an fprintf inside the following loop in HUF_readDTableX4 then I immediately get an ZSTD_error_corruption_detected instead of an infinite loop (without the fprintf).

        for (consumed=minBits; consumed <= memLog-minBits; consumed++)
            for (w = 1; w <= maxW; w++)
            {
                rankVal[consumed][w] = rankVal[0][w] >> consumed;
                //fprintf(stdout, "rankVal[%d][%d]=%d\r\n", consumed, w, rankVal[consumed][w]); // uncomment this line change the behaviour with /O2
            }

So:

Debug: everything always compress and decompress fine
Release /Od: same as debug works fine
Release /O2 without printf: infinite loop later in the code
Release /O2 WITH printf: immediate ZSTD_error_corruption_detected

... 😑

from zstd.

Cyan4973 commented on July 27, 2024

If I replace the faulty section with :

        for (consumed=minBits; consumed <= memLog-minBits; consumed++)
        for (w = 1; w <= maxW; w++)
        {
            rankVal[consumed][w] = rankVal[0][w] >> consumed;
            printf("[%u][%u] : %u \n", consumed, w, (U32)(rankVal[consumed][w]));
        }

which basically forces the inner loop to do something (printf) on top of filling the table, then the table is properly filled, and decoding process works properly.

So it seems that rankVal[consumed][w] = rankVal[0][w] >> consumed; is optimized away when it is alone. The step-trace within Visual behaves strangely, as if w>maxW after the first attempt, skipping all other loops (neither w nor maxW are available value within Visual debugger when tracing the release binary. So these values are likely optimized inside registers). Looks like a compiler bug to me.

from zstd.

KrzysFR commented on July 27, 2024

I tested using Visual Studio 2015 and the same behavior happens.

Right now, I've added a #define in huff0.h that allows me to choose either the auto select or force calling 4X2, 4X4 or 4X6 (thouh the last two fail currently). With only 4X2 I was able to build a version that works fine and seems to have correct performance.

Looking more into the codegen issue, I've enabled the /FAcs compiler option (as described in https://randomascii.wordpress.com/2013/10/14/how-to-report-a-vc-code-gen-bug/ ) that produces .cod files with the assembler produced by the C compiler.

Looking at huff0.cod, I see this (Release x64, the one that fails).

; 1032 :         for (consumed=minBits; consumed <= memLog-minBits; consumed++)

  001f0 45 8b dc     mov     r11d, r12d
  001f3 44 2b d8     sub     r11d, eax
  001f6 44 3b d3     cmp     r10d, ebx
  001f9 72 43        jb      SHORT $LN4@HUF_readDT
  001fb 0f 1f 44 00 00   npad    5
$LL6@HUF_readDT:

; 1033 :             for (w=1; w<=maxW; w++)

  00200 41 3b c3     cmp     eax, r11d
  00203 77 32        ja      SHORT $LN5@HUF_readDT
  00205 44 8b c3     mov     r8d, ebx
  00208 8b c8        mov     ecx, eax
  0020a 4c 8d 4d f0  lea     r9, QWORD PTR rankVal$[rbp-256]
  0020e 4f 8d 0c 81  lea     r9, QWORD PTR [r9+r8*4]
  00212 48 6b d1 11  imul    rdx, rcx, 17
  00216 49 03 d0     add     rdx, r8
  00219 4c 8d 45 f0  lea     r8, QWORD PTR rankVal$[rbp-256]
  0021d 4d 8d 04 90  lea     r8, QWORD PTR [r8+rdx*4]
$LL3@HUF_readDT:

; 1034 :                 rankVal[consumed][w] = rankVal[0][w] >> consumed;

  00221 41 8b 11     mov     edx, DWORD PTR [r9]
  00224 8b c8        mov     ecx, eax
  00226 ff c0        inc     eax
  00228 d3 ea        shr     edx, cl
  0022a 4d 8d 40 44  lea     r8, QWORD PTR [r8+68]
  0022e 41 89 50 bc  mov     DWORD PTR [r8-68], edx
  00232 41 3b c3     cmp     eax, r11d
  00235 76 ea        jbe     SHORT $LL3@HUF_readDT
$LN5@HUF_readDT:

; 1031 :         }
; 1032 :         for (consumed=minBits; consumed <= memLog-minBits; consumed++)

  00237 ff c3        inc     ebx
  00239 41 3b da     cmp     ebx, r10d
  0023c 76 c2        jbe     SHORT $LL6@HUF_readDT
$LN4@HUF_readDT:

; 1035 :     }

There is indeed some code generated by the compiler for the look. My asm is rusty but it does look like it is doing the right thing, it's not as optimized as I would expect.

Either the code is actually writing stuff in the wrong place, or it gets optimized away layer (during linking?)

from zstd.

KrzysFR commented on July 27, 2024

I got it to work!

It's definitely easier when you can see the assembly code generated while you are poking at the thing. I tried hoisting both rankVal[0] and rankVal[consumed] outside the loops (they do not change at all), like this:

    /* Build rankVal */
    {
        const U32 minBits = tableLog+1 - maxW;
        U32 nextRankVal = 0;
        U32 w, consumed;
        const int rescale = (memLog-tableLog) - 1;   /* tableLog <= memLog */
        U32* rankVal0 = rankVal[0];
        for (w=1; w<=maxW; w++)
        {
            U32 current = nextRankVal;
            nextRankVal += rankStats[w] << (w+rescale);
            rankVal0[w] = current;
        }
        for (consumed = minBits; consumed <= memLog - minBits; consumed++)
        {
            U32* rankValPtr = rankVal[consumed];
            for (w = 1; w <= maxW; w++)
            {
                rankValPtr[w] = rankVal0[w] >> consumed;
            }
        }
    }

The assembly code changed a lot, ~~it looks there is some loop unrolling done (which I would expect with /O2)~~ EDIT: oops I mistakenly pasted the code twice 💦

; 1033 :         for (consumed = minBits; consumed <= memLog - minBits; consumed++)

  001d9 45 8b cf     mov     r9d, r15d
  001dc 45 2b cb     sub     r9d, r11d
  001df 45 3b d9     cmp     r11d, r9d
  001e2 0f 87 d0 00 00
    00       ja  $LN4@HUF_readDT
  001e8 41 8b cb     mov     ecx, r11d
  001eb 48 8d 5d f0  lea     rbx, QWORD PTR rankVal$[rbp-256]
  001ef 48 6b c1 44  imul    rax, rcx, 68       ; 00000044H
  001f3 48 03 d8     add     rbx, rax
  001f6 48 6b f9 bc  imul    rdi, rcx, -68      ; ffffffffffffffbcH
  001fa 66 0f 1f 44 00
    00       npad    6
$LL6@HUF_readDT:

; 1034 :         {
; 1035 :             U32* rankValPtr = rankVal[consumed];
; 1036 :             for (w = 1; w <= maxW; w++)

  00200 ba 01 00 00 00   mov     edx, 1
  00205 66 41 0f 6e cb   movd    xmm1, r11d
  0020a 44 3b d2     cmp     r10d, edx
  0020d 0f 82 91 00 00
    00       jb  $LN5@HUF_readDT
  00213 41 83 fa 08  cmp     r10d, 8
  00217 72 62        jb  SHORT $LN45@HUF_readDT

; 1037 :             {
; 1038 :                 rankValPtr[w] = rankVal0[w] >> consumed;

  00219 41 8b c2     mov     eax, r10d
  0021c 48 8d 4d f0  lea     rcx, QWORD PTR rankVal$[rbp-256]
  00220 48 8d 0c 81  lea     rcx, QWORD PTR [rcx+rax*4]
  00224 4c 8d 04 83  lea     r8, QWORD PTR [rbx+rax*4]
  00228 48 8d 43 04  lea     rax, QWORD PTR [rbx+4]
  0022c 48 3b c1     cmp     rax, rcx
  0022f 77 09        ja  SHORT $LN46@HUF_readDT
  00231 48 8d 45 f4  lea     rax, QWORD PTR rankVal$[rbp-252]
  00235 4c 3b c0     cmp     r8, rax
  00238 73 41        jae     SHORT $LN45@HUF_readDT
$LN46@HUF_readDT:
  0023a 41 8b c2     mov     eax, r10d
  0023d 45 8b c2     mov     r8d, r10d
  00240 83 e0 07     and     eax, 7
  00243 44 2b c0     sub     r8d, eax
  00246 66 66 0f 1f 84
    00 00 00 00 00   npad    10
$LL3@HUF_readDT:
  00250 8b c2        mov     eax, edx
  00252 f3 0f 6f 44 85
    f0       movdqu  xmm0, XMMWORD PTR rankVal$[rbp+rax*4-256]
  00258 66 0f d2 c1  psrld   xmm0, xmm1
  0025c f3 0f 7f 04 83   movdqu  XMMWORD PTR [rbx+rax*4], xmm0
  00261 8d 42 04     lea     eax, DWORD PTR [rdx+4]
  00264 83 c2 08     add     edx, 8
  00267 f3 0f 6f 44 85
    f0       movdqu  xmm0, XMMWORD PTR rankVal$[rbp+rax*4-256]
  0026d 66 0f d2 c1  psrld   xmm0, xmm1
  00271 f3 0f 7f 04 83   movdqu  XMMWORD PTR [rbx+rax*4], xmm0
  00276 41 3b d0     cmp     edx, r8d
  00279 76 d5        jbe     SHORT $LL3@HUF_readDT
$LN45@HUF_readDT:

; 1034 :         {
; 1035 :             U32* rankValPtr = rankVal[consumed];
; 1036 :             for (w = 1; w <= maxW; w++)

  0027b 41 3b d2     cmp     edx, r10d
  0027e 77 24        ja  SHORT $LN5@HUF_readDT
  00280 45 8b c2     mov     r8d, r10d
  00283 8b c2        mov     eax, edx
  00285 44 2b c2     sub     r8d, edx
  00288 48 8d 04 83  lea     rax, QWORD PTR [rbx+rax*4]
  0028c 41 ff c0     inc     r8d
  0028f 90       npad    1
$LL44@HUF_readDT:

; 1037 :             {
; 1038 :                 rankValPtr[w] = rankVal0[w] >> consumed;

  00290 8b 14 07     mov     edx, DWORD PTR [rdi+rax]
  00293 41 8b cb     mov     ecx, r11d
  00296 48 8d 40 04  lea     rax, QWORD PTR [rax+4]
  0029a d3 ea        shr     edx, cl
  0029c 89 50 fc     mov     DWORD PTR [rax-4], edx
  0029f 49 ff c8     dec     r8
  002a2 75 ec        jne     SHORT $LL44@HUF_readDT
$LN5@HUF_readDT:

; 1032 :         }
; 1033 :         for (consumed = minBits; consumed <= memLog - minBits; consumed++)

  002a4 41 ff c3     inc     r11d
  002a7 48 83 c3 44  add     rbx, 68            ; 00000044H
  002ab 48 83 ef 44  sub     rdi, 68            ; 00000044H
  002af 45 3b d9     cmp     r11d, r9d
  002b2 0f 86 48 ff ff
    ff       jbe     $LL6@HUF_readDT
$LN4@HUF_readDT:

from zstd.

KrzysFR commented on July 27, 2024

Looking at the original ASM code for the inner loop (the one that fails):

$LL3@HUF_readDT:
; 1034 :                 rankVal[consumed][w] = rankVal[0][w] >> consumed;
  00221 41 8b 11     mov     edx, DWORD PTR [r9]
  00224 8b c8        mov     ecx, eax
  00226 ff c0        inc     eax
  00228 d3 ea        shr     edx, cl
  0022a 4d 8d 40 44  lea     r8, QWORD PTR [r8+68]
  0022e 41 89 50 bc  mov     DWORD PTR [r8-68], edx
  00232 41 3b c3     cmp     eax, r11d
  00235 76 ea        jbe     SHORT $LL3@HUF_readDT

The loop index w is stored in the register eax, the address of rankVal[0][w] is stored in r9 and r8 is used to compute the offset in rankVal[consumed][w]. The only problem is that r9 is not updated anywhere in the loop! It is only computed once per outer loop iteration, so basically it is always storing rankVal[0][1] >> consumed everywhere in the table...

Also, it seems that the right shift is done using eax which is the loop index (w) and NOT consumed ? So it is computing something like rankVal[0][1] >> w ?

So it's not a case of the compiler optimizing away the loops, it is more like the compiler generating some funny code which fills the table with mostly zeroes....

from zstd.

Cyan4973 commented on July 27, 2024

Thanks, that's a great investigation.

You nailed it. The work around works perfectly, and I could check it generates negligible speed difference for other compilers / OS.
So it's a fix. "dev" branch has been updated with it.
Rgds

from zstd.

KrzysFR commented on July 27, 2024

Confirmed that it works with the dev branch.

from zstd.

Cyan4973 commented on July 27, 2024

Fix ported to release 0.2.2

from zstd.

"Stack cookie instrumentation code detected a stack-based buffer overrun" in HUF_fillDTableX4Level2 about zstd HOT 19 CLOSED

Comments (19)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent