Comments (9)
There are 2 potential culprits:
- One or some of the last fixes made.
- Notably, I had to fix some 32-bit-specific issues during the last few days, and did not re-test speed.
- I could try to track them one by one to see whether any of them introduces a regression.
- Instruction loop alignment.
There is a big issue with GCC: it doesn't pay enough attention to instruction alignment, notably for hot loops. This happens to make a huge speed difference for Huff0 and FSE on the latest Intel Core CPUs. As a consequence, decompression speed can vary a lot, just by virtue of "where the code is", that is, almost randomly after each code change, including changes outside of zstd_decompress.c.
I opened an issue about it on GCC's tracker, and although they acknowledge the issue, they don't seem inclined to do anything about it.
This is a problem, as it makes tracking speed changes in the decompression algorithm a total nightmare: there is no way to know whether an observed speed difference is related to the latest code change or to some random alignment property.
As a consequence, I'm considering moving from GCC to another compiler free of such instability. I still have to test and select one, though.
from zstd.
The above results are obtained using 1 core of an Intel Core i5-4300U, Windows 10 64-bit (MinGW-w64 compilation under gcc 4.8.3 with -O3 -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math flags).
Moving to another compiler doesn't solve the problem, because other people will use GCC (and zstd is open-source). The decompression function is not so big, so I think you can try to find what changed between v0.4 and v0.3.6, which worked fine with GCC.
I will. Understand this is a never-ending effort, though: any change anywhere in the code, even a totally unrelated one, can trigger such a 20% decompression speed difference. Some of these changes are valid fixes, so they must be done; I can't simply undo them. But compensating for their indirect consequence on decompression speed requires creative work-arounds, basically strange constructions whose only purpose is to "push" instructions in a way which happens to be good for hot-loop alignment. A real nightmare. I need something more stable.
It does seem @inikep's results show better compression size; could that be an influence on speed?
GCC: it doesn't pay attention to instruction alignment well enough, notably for hot loops.... I opened an issue about it on GCC, and although they acknowledge the issue, they seem to not care to do anything about it.
I can understand GCC's position somewhat. I hand-edited ASM for a DSP years ago, reordering instructions and removing redundant opcodes. But I found the (then new) Core2Duo made much of my work unneeded, since the CPU did much of it for me.
I've heard that the Intel compilers make the fastest executables.
https://software.intel.com/en-us/c-compilers
It does seem @inikep's results show better compression size; could that be an influence on speed?
It could explain a 1-3% difference, but not 20%.
I've been wasting quite some more time on this today, and the worst-case scenario is predictably happening.
I can make a few code changes resulting in better decompression speed... for a given combination of compiler version, parameters and target. But then switch to another system, and it turns out to be worse.
In fact, I suspect that the changes themselves don't contribute much to speed, but it's impossible to measure, since the impact of instruction alignment dwarfs any other potential effect. That means there is no stable solution: speed gains or losses come down to random luck. This is a total nightmare.
I can't imagine a worse situation to be in. A solution is required to stabilize this, but gcc declined to investigate.
As a happier example, when compiling just huff0, a solution has been found, using -falign-loops=32. It's better for hot loops, but worse for everything else. Unfortunately, what worked for a simple library doesn't work for zstd, since it's too large, so the global effect is negative.
I can imagine 2 other possibilities, in neither of which I have expertise yet:
- Add a few assembler directives (only for the x64 target) to manually achieve hot-loop instruction alignment (and only for those loops).
- Use a PGO-assisted build, in the hope that it will understand which parts of the code are hot, and apply clever instruction alignment accordingly.
I can understand GCC's position somewhat.
Well, on my side, I don't understand how a known performance-stability problem with a swing of up to 20% can be considered uninteresting. My understanding is that compiler developers chase micro-gains even when that introduces serious complexity, such as automatic vectorization. 20% cannot be qualified as a micro-gain, and alignment is not expected to be hugely complex (for a compiler like gcc). But this is nonetheless not worth investigating? There is something I don't understand here.
The latest update in the "dev" branch tries to improve the situation.
It is still a bit random, but seems overall a bit more positive than negative.
If your program experienced a -20% performance effect with 0.4, this update is likely to produce an improvement.
On a related note, I can now confirm that PGO-assisted builds are rather more stable performance-wise. They generally end up at the top of the possible performance range, though the exact outcome varies from one attempt to another, due to different measurements during the benchmark phase.
It's also only a partial solution: the makefile can be modified to create a PGO build, and I even hope to transfer this capability to the dll build (which I haven't done yet).
But this will not help for people integrating directly from source. Maybe x64 assembler directives would help there, but I have no idea how to do it.
Anyway, if it at least makes performance stable, that's still good news: I should be able to make new modifications with less worry that they will lead to random results. Thus, it will be possible to tell whether a modification was neutral, beneficial or detrimental, a much-wanted ability.
, I don't understand how a known performance stability problem with a swing of up to 20% can be considered uninteresting.
Yes, unstable compilation would prevent applying the 'scientific method' to speed testing.
Proposed change merged into master.
[edit] merged into release v0.4.1
Issue seems solved in 0.4.1.
Some other leads proposed by @nemequ.
Discussion at : http://encode.ru/threads/2119-Zstandard?p=45797&viewfull=1#post45797