Comments (9)
There are 2 potential culprits:
- One or some of the last fixes made.
- Notably, I had to fix some 32-bit-specific issues during the last few days, and did not re-test speed.
- I could try to track them one by one to see whether any of them introduces a regression.
- Instruction loop alignment.
There is a big issue with GCC: it doesn't pay enough attention to instruction alignment, notably for hot loops. This happens to make a huge speed difference for Huff0 and FSE on the latest Intel Core CPUs. As a consequence, decompression speed can vary a lot, just by virtue of "where the code is", that is, almost randomly after each code change, including changes outside of zstd_decompress.c.
I opened an issue about it on GCC's tracker, and although they acknowledge the issue, they don't seem inclined to do anything about it.
This is a problem, as it makes tracking speed changes in the decompression algorithm a total nightmare: there is no way to know whether an observed speed difference is related to the latest code change or to some random alignment property.
As a consequence, I'm considering moving from GCC to another compiler free of such instability. I still have to test and select one, though.
from zstd.
The above results are obtained using 1 core of an Intel Core i5-4300U, Windows 10 64-bit (MinGW-w64 compilation under gcc 4.8.3 with -O3 -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math flags).
Moving to another compiler doesn't solve the problem, because other people will use GCC (and zstd is open-source). The decompression function is not so big, so I think you can try to find what changed between v0.4 and v0.3.6, which worked fine with GCC.
I will. Understand this is a never-ending effort, though: any change anywhere in the code, even a totally unrelated one, can trigger such a 20% decompression speed difference. Some of these changes are valid fixes, so they must be done; I can't simply undo them. But compensating for their indirect consequence on decompression speed requires creative work-arounds, basically strange constructions whose only purpose is to "push" instructions in a way which happens to be good for hot-loop alignment. A real nightmare. I need something more stable.
It does seem @inikep's results show better compression size; could that be an influence on speed?
GCC: it doesn't pay attention to instruction alignment well enough, notably for hot loops.... I opened an issue about it on GCC, and although they acknowledge the issue, they seem to not care to do anything about it.
I can understand GCC's position somewhat. I hand-edited ASM for a DSP years ago, reordering instructions and removing redundant opcodes. But I found the (then new) Core2Duo made much of my work unneeded, since the CPU did much of it for me.
I've heard that the Intel compilers make the fastest executables.
https://software.intel.com/en-us/c-compilers
It does seem @inikep's results show better compression size; could that be an influence on speed?
It could explain a 1-3% difference, but not 20%.
I've been wasting quite some more time on this today, and the worst-case scenario is predictably happening.
I can make a few code changes resulting in better decompression speed... for a given combination of compiler version, parameters and target. But then switch to another system, and it turns out to be worse.
In fact, I suspect that the changes themselves don't contribute much to speed, but it's impossible to measure, since the impact of instruction alignment dwarfs any other potential effect. That means there is no stable solution: speed gains or losses come down to random luck. This is a total nightmare.
I can't imagine a worse situation to be in. A solution is required to stabilize this, but gcc declined to investigate.
As a happier example, when compiling just huff0, a solution has been found, using -falign-loops=32. It's better for hot loops, but worse for everything else. Unfortunately, what worked for a simple library doesn't work for zstd, since it's too large, so the global effect is negative.
I can imagine 2 other possibilities, in neither of which I have expertise yet:
- Add a few assembler directives (only for the x64 target) to manually achieve hot-loop instruction alignment (and only for those loops).
- Use a PGO-assisted build, in the hope that it will understand which parts of the code are hot, and apply clever instruction alignment accordingly.
I can understand GCC's position somewhat.
Well, on my side, I don't understand how a known performance-stability problem with a swing of up to 20% can be considered uninteresting. My understanding is that compiler developers chase micro-gains even when that introduces serious complexity, such as automatic vectorization. 20% cannot be qualified as a micro-gain, and alignment is not expected to be hugely complex (for a compiler like gcc). But this is nonetheless not worth investigating? There is something I don't understand here.
The latest update in the "dev" branch tries to improve the situation.
It is still a bit random, but seems overall a bit more positive than negative.
If your program experienced a -20% performance effect with 0.4, this update is likely to produce an improvement.
On a related note, I can now confirm that PGO-assisted builds are rather more stable performance-wise. They generally end up at the top of the possible performance range, though the exact outcome varies from one attempt to another, due to different measurements during the benchmark phase.
It's also only a partial solution: the makefile can be modified to create a PGO build, and I even hope to transfer this capability to the dll build (which I haven't done yet).
But this will not help for people integrating directly from source. Maybe x64 assembler directives would help there, but I have no idea how to do it.
Anyway, if it at least makes performance stable, that's still good news: I should be able to make new modifications with less worry that they will lead to random results. Thus, it will be possible to tell whether a modification was neutral, beneficial or detrimental, a much-wanted ability.
, I don't understand how a known performance stability problem with a swing of up to 20% can be considered uninteresting.
Yes, unstable compilation would prevent applying the 'scientific method' to speed testing.
Proposed change merged into master.
[edit] merged into release v0.4.1
Issue seems solved in 0.4.1.
Some other leads proposed by @nemequ.
Discussion at : http://encode.ru/threads/2119-Zstandard?p=45797&viewfull=1#post45797