Comments (34)

ivalylo commented on May 2, 2024

OK, first let me explain how the parallel compression works in my project. I have a big chunk of data. The data is divided into blocks, sized according to the total data size and the number of CPU threads; let's say 4 MB each. Each thread then takes a block of memory and starts compressing it. When it's done, it enters a global mutex and writes to the file the unique ID of the compressor, the unique ID of the block, and then the compressed data itself. Then it unlocks the mutex and takes another block. A thread can optionally flush its dictionary or continue as normal, so the compression ratio doesn't suffer too much. On decompression, I create the same number of decompressors and feed them the proper blocks of memory in the proper order. It's not perfect, but I don't want to allocate a lot of memory either.
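A minimal sketch of the record framing described above; the helper name, the header layout and the one-shot-per-block compression are illustrative only (the dictionary-carryover part is omitted):

/* Sketch: each worker compresses its block, then serializes only the file
   write under a global mutex. dst must hold at least
   ZSTD_compressBound(src_size) bytes. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <zstd.h>

static void write_compressed_block(FILE *out, pthread_mutex_t *out_lock,
                                   uint32_t compressor_id, uint32_t block_id,
                                   const void *src, size_t src_size,
                                   ZSTD_CCtx *cctx, void *dst, size_t dst_cap)
{
    /* Compress outside the lock so only the write is serialized. */
    size_t csize = ZSTD_compressCCtx(cctx, dst, dst_cap, src, src_size, 3);
    if (ZSTD_isError(csize)) return;   /* real code would report the error */

    uint32_t csize32 = (uint32_t)csize;
    pthread_mutex_lock(out_lock);
    fwrite(&compressor_id, sizeof compressor_id, 1, out);
    fwrite(&block_id, sizeof block_id, 1, out);
    fwrite(&csize32, sizeof csize32, 1, out);
    fwrite(dst, 1, csize, out);
    pthread_mutex_unlock(out_lock);
}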

So here are some results (including file I/O) for compression:

zstd (default compression):
1 thread: 80 MB/s
2 threads: 115 MB/s
4 threads: 117 MB/s
12 threads: 113 MB/s

zlib (default compression):
1 thread: 15 MB/s
2 threads: 34 MB/s
4 threads: 53 MB/s
12 threads: 80 MB/s

lzma (lowest compression):
1 thread: 7.5 MB/s
2 threads: 17 MB/s
4 threads: 22.3 MB/s
12 threads: 38.5 MB/s

Cyan4973 commented on May 2, 2024

Planned, yes, but no date set yet.

ivalylo commented on May 2, 2024

I tried zstd with MT compression in my framework, which already works with zlib, lzma and lzham. The result is that zstd doesn't scale well enough with the number of threads. I don't know why; it may just be memory-bound. Even increasing the compression level doesn't seem to improve the situation. I think 2/4 threads would be fine though. I can make tests and put some numbers if you are interested.

Cyan4973 commented on May 2, 2024

I can make tests and put some numbers if you are interested.

Sure, it would be interesting.
Note that there are very different trade-offs to find, depending on whether you are working on a lot of small files in parallel or on a single very large one.
Scaling should work fine for the first case, not for the second one (well, at least not yet).

zstd doesn't scale well enough with the number of threads

If by "scale", you mean "speed" then it's possible to test some hypothesis.
It's very important that each CPU works into its own cache, thus avoiding contention into main memory or shared cached. This can be controlled more directly, by using _advanced() versions, within zstd_static.h (hence still in experimental status). It allows direct manipulation of ZSTD_parameters.
2 things to work on :

  • Limit size of compression tables (compression only): pay attention to hashLog. Limit it to 12 for L1 cache, 15 for L2 cache, or more if you have some larger dedicated L3 cache per core (avoid shared cache).
  • Limit size of backward searches (compression and decompression): pay attention to windowLog. Typically set it to 18 for L2 cache.

Note that in both cases, these limitations will likely hurt compression ratio for large files. But they should make the algorithm scale with the number of threads (assuming nb threads <= physical cores; hyperthreading will likely ruin scaling, as zstd is typically able to keep an entire core busy, leaving little room for a second thread on the same core).
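For reference, here is a minimal sketch of capping those two parameters. It uses today's stable advanced API (ZSTD_CCtx_setParameter with ZSTD_c_hashLog / ZSTD_c_windowLog) rather than the 0.5.x-era ZSTD_parameters struct from zstd_static.h discussed above, but the knobs are the same:

#include <zstd.h>

/* Sketch: cap table and window sizes so each worker stays within its own
   cache. Values follow the guidance above: hashLog 12 (L1) or 15 (L2),
   windowLog 18 (L2). Error handling is omitted for brevity. */
size_t compress_cache_friendly(void *dst, size_t dst_cap,
                               const void *src, size_t src_size)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_hashLog, 15);    /* compression tables */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 18);  /* backward search window */
    size_t r = ZSTD_compress2(cctx, dst, dst_cap, src, src_size);
    ZSTD_freeCCtx(cctx);
    return r;   /* check with ZSTD_isError() in real code */
}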

Sidenote: I would be surprised if lzma is free of such issues. It's expected to have equivalent MT drawbacks.

Now, if by "scale" you mean "compression ratio", it's a whole different story.

Cyan4973 commented on May 2, 2024

OK @ivalylo, it's clearer.

The only reason I can imagine to explain this "wall" of performance
is that you may have reached the speed limit of the write operation.
Quoting:

When it's done, it enters a global mutex and writes to the file the unique ID of the compressor, the unique ID of the block, and then the compressed data itself. Then it unlocks the mutex and takes another block.

This did not happen for the other algorithms because they never reached that limit.

Try compressing into /dev/null, or use any other method to remove the write speed limit from the equation (or cheat and don't write anything to disk, just for the test).
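A minimal way to take the disk out of the measurement is to compress into a scratch buffer and simply discard it; a sketch (single-shot, single-threaded, timing code illustrative only):

#include <stdlib.h>
#include <time.h>
#include <zstd.h>

/* Sketch: measure pure compression throughput (MB/s) by compressing into a
   scratch buffer that is thrown away, so disk speed never enters the picture. */
double compress_mbps(const void *src, size_t src_size, int level)
{
    size_t cap = ZSTD_compressBound(src_size);
    void *scratch = malloc(cap);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t r = ZSTD_compress(scratch, cap, src, src_size, level);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(scratch);
    if (ZSTD_isError(r)) return -1.0;
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (src_size / 1e6) / secs;
}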

If that's the correct hypothesis, you can try increasing the compression level to get a better ratio, but the speed will likely remain capped at about the same maximum.

If being I/O-bound is not the explanation, then I have no other idea.
There is no good reason I know of for zstd to scale worse than lzma.
It would require code investigation.

ivalylo commented on May 2, 2024

Yes, you are probably right. I was actually compressing with the lowest compression level. I did new tests on an SSD with level 6 compression, and it looks as expected:

1 thread: 30 MB/s
2 threads: 61 MB/s
4 threads: 92 MB/s
12 threads: 130 MB/s

For compression level 5:
2 threads: 80 MB/s
4 threads: 140 MB/s
12 threads: 160 MB/s

For compression level 3:
2 threads: 135 MB/s
4 threads: 190 MB/s
12 threads: 190 MB/s

zlib with compression level 2:
4 threads: 145 MB/s
12 threads: 180 MB/s

I was probably a bit disappointed that I got similar results with zlib on 12 threads, but zstd can do the same on 4 threads, so it's still pretty good :). Also, zlib tends to compress my data a bit better (it's mainly floating point), but I haven't done much testing with various files yet.

Cyan4973 commented on May 2, 2024

The type of content plays an important role.
If you know you are going to compress a pure table of floats, you could benefit from a pre-filter such as Blosc, for example.
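As an illustration, a shuffle pre-filter pass could look roughly like this (a sketch using the c-blosc C API; the buffer handling is simplified, and zstd support requires a Blosc build that includes it):

#include <blosc.h>

/* Sketch: byte-shuffle a float array before compression, so that bytes of
   equal significance end up next to each other and compress much better.
   dst_size should be at least the input size plus BLOSC_MAX_OVERHEAD. */
int compress_floats(const float *src, size_t nfloats, void *dst, size_t dst_size)
{
    blosc_init();
    blosc_set_compressor("zstd");              /* needs a Blosc build with zstd */
    int csize = blosc_compress(5,              /* compression level */
                               BLOSC_SHUFFLE,  /* byte-level shuffle pre-filter */
                               sizeof(float),  /* typesize drives the shuffle */
                               nfloats * sizeof(float),
                               src, dst, dst_size);
    blosc_destroy();
    return csize;   /* <= 0 means no compression happened or an error occurred */
}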

Zstd will also feature higher compression levels soon, but mind the speed ...

ivalylo commented on May 2, 2024

Ok, great, will keep an eye on this.

FrancescAlted commented on May 2, 2024

I have just implemented support for zstd in c-blosc2, and I am seeing the same patterns as @ivalylo, in that zstd cannot get large speed-ups in my multi-threaded code. Here is an example:

[Plots: zstd-shuffle-suite8-compr, zstd-shuffle-suite8-decompr]

There one can see that for small compression levels in Blosc2 (that is, for blocksizes <= 256 KB), which correspond to dots 2, 3 and 4 in the plots (dot 1 is no compression), we get the best speedups (as Yann predicted), whereas for larger blocksizes (reaching L3, which is a shared cache) there is almost no gain in performance.

The trend when using zlib in Blosc is a bit different:

[Plots: zlib-shuffle-suite8-decompr, zlib-shuffle-suite8-compr]

But given that zlib is quite a bit more CPU-bound than zstd, this is expected.

Finally, and mainly for completeness, here is what you can expect from a compressor that is mainly memory-bound (like LZ4):

[Plots: lz4-shuffle-suite8-compr, lz4-shuffle-suite8-decompr]

That is, using more cores brings quite a nice acceleration across all the compression ratios. Note, however, that LZ4 is designed for speed and not for high compression ratios.

At any rate, I am quite impressed by the speed of zstd so far, and even more by its high compression ratios. Perhaps I could still improve the performance by creating zstd contexts inside blosc contexts, or by using dictionaries, but that will come later.

My benchmarks were made on a Xeon E3-1240 v3 @ 3.40GHz with 8 physical cores and using synthetic data (https://github.com/Blosc/c-blosc2/blob/master/bench/bench.c).

Cyan4973 commented on May 2, 2024

Indeed @FrancescAlted, stronger compression modes use too much memory to ramp up linearly on multi-core. To ensure such a ramp-up, one would have to either select faster modes, or go into detailed parameter mode to keep table sizes below the per-core cache limit (typically 256 KB).
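As a rough rule of thumb: zstd's hash table uses 4-byte (U32) entries, so the hash table alone takes 4 × 2^hashLog bytes (the chain or binary-tree tables of the stronger modes add more on top):

hashLog = 12  ->  4 B x 2^12 =  16 KB   (fits a typical 32 KB L1)
hashLog = 15  ->  4 B x 2^15 = 128 KB   (fits a 256 KB L2)
hashLog = 16  ->  4 B x 2^16 = 256 KB   (already fills the per-core L2)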

That being said, the compression ratios achieved in your tests look very impressive.
Your transform is doing an extremely good job of preparing the data for the later compression round.

FrancescAlted commented on May 2, 2024

Regarding the high compression ratios, yes, the benchmark is meant to make binary data compression shine via the shuffle filter. For example, if we deactivate the shuffle filter we get:

$ bench/bench zstd noshuffle single 8
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib,zstd
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
  Zstd: 0.5.0
Using compressor: zstd
Using shuffle type: noshuffle
Running suite: single
--> 8, 2097152, 8, 19, zstd, noshuffle
********************** Run info ******************************
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes     Type size: 8 bytes
Working set: 256.0 MB           Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):            255.7 us, 7820.7 MB/s
memcpy(read):             167.5 us, 11936.9 MB/s
Compression level: 0
comp(write):      176.3 us, 11343.3 MB/s          Final bytes: 2097168  Ratio: 1.00
decomp(read):     115.3 us, 17346.0 MB/s          OK
Compression level: 1
comp(write):     1231.9 us, 1623.5 MB/s   Final bytes: 1589840  Ratio: 1.32
decomp(read):    1130.6 us, 1769.0 MB/s   OK
Compression level: 2
comp(write):     2543.8 us, 786.2 MB/s    Final bytes: 1589846  Ratio: 1.32
decomp(read):    1111.3 us, 1799.8 MB/s   OK
Compression level: 3
comp(write):     2060.0 us, 970.9 MB/s    Final bytes: 1589846  Ratio: 1.32
decomp(read):    1115.4 us, 1793.0 MB/s   OK
Compression level: 4
comp(write):     5285.9 us, 378.4 MB/s    Final bytes: 1586484  Ratio: 1.32
decomp(read):     923.2 us, 2166.4 MB/s   OK
Compression level: 5
comp(write):     3758.1 us, 532.2 MB/s    Final bytes: 1586484  Ratio: 1.32
decomp(read):     988.5 us, 2023.3 MB/s   OK
Compression level: 6
comp(write):     5599.3 us, 357.2 MB/s    Final bytes: 1584810  Ratio: 1.32
decomp(read):    1412.1 us, 1416.3 MB/s   OK
Compression level: 7
comp(write):     7649.7 us, 261.4 MB/s    Final bytes: 1583901  Ratio: 1.32
decomp(read):    2446.9 us, 817.4 MB/s    OK
Compression level: 8
comp(write):     7848.1 us, 254.8 MB/s    Final bytes: 1583901  Ratio: 1.32
decomp(read):    2447.9 us, 817.0 MB/s    OK
Compression level: 9
comp(write):     8187.9 us, 244.3 MB/s    Final bytes: 1583901  Ratio: 1.32
decomp(read):    2447.5 us, 817.2 MB/s    OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:      22.6 s, 746.5 MB/s

so, quite a low compression ratio. Interestingly, in this case zlib gets somewhat better compression ratios:

$ bench/bench zlib noshuffle single 8   
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib,zstd
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
  Zstd: 0.5.0
Using compressor: zlib
Using shuffle type: noshuffle
Running suite: single
--> 8, 2097152, 8, 19, zlib, noshuffle
********************** Run info ******************************
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes     Type size: 8 bytes
Working set: 256.0 MB           Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):            254.8 us, 7847.9 MB/s
memcpy(read):             172.8 us, 11570.9 MB/s
Compression level: 0
comp(write):      175.9 us, 11371.4 MB/s          Final bytes: 2097168  Ratio: 1.00
decomp(read):     116.2 us, 17214.9 MB/s          OK
Compression level: 1
comp(write):     9011.9 us, 221.9 MB/s    Final bytes: 1313887  Ratio: 1.60
decomp(read):    2182.9 us, 916.2 MB/s    OK
Compression level: 2
comp(write):     10401.2 us, 192.3 MB/s   Final bytes: 1392833  Ratio: 1.51
decomp(read):    2138.2 us, 935.3 MB/s    OK
Compression level: 3
comp(write):     10699.1 us, 186.9 MB/s   Final bytes: 1491236  Ratio: 1.41
decomp(read):    2170.2 us, 921.6 MB/s    OK
Compression level: 4
comp(write):     15154.4 us, 132.0 MB/s   Final bytes: 1318214  Ratio: 1.59
decomp(read):    1930.3 us, 1036.1 MB/s   OK
Compression level: 5
comp(write):     18609.0 us, 107.5 MB/s   Final bytes: 1318382  Ratio: 1.59
decomp(read):    2158.4 us, 926.6 MB/s    OK
Compression level: 6
comp(write):     37161.9 us, 53.8 MB/s    Final bytes: 1317011  Ratio: 1.59
decomp(read):    2648.8 us, 755.1 MB/s    OK
Compression level: 7
comp(write):     113481.1 us, 17.6 MB/s   Final bytes: 1314598  Ratio: 1.60
decomp(read):    6803.3 us, 294.0 MB/s    OK
Compression level: 8
comp(write):     113468.5 us, 17.6 MB/s   Final bytes: 1314598  Ratio: 1.60
decomp(read):    6787.7 us, 294.7 MB/s    OK
Compression level: 9
comp(write):     113491.2 us, 17.6 MB/s   Final bytes: 1314598  Ratio: 1.60
decomp(read):    6788.4 us, 294.6 MB/s    OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:     182.7 s, 92.5 MB/s

so this scenario would be a good example of where zstd still has some room for optimization.

But where zstd really shows its muscle is in combination with the bitshuffle filter (like shuffle, but the shuffling happens at the bit level rather than the byte level):

$ bench/bench zstd bitshuffle single 8         
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib,zstd
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
  Zstd: 0.5.0
Using compressor: zstd
Using shuffle type: bitshuffle
Running suite: single
--> 8, 2097152, 8, 19, zstd, bitshuffle
********************** Run info ******************************
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes     Type size: 8 bytes
Working set: 256.0 MB           Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):            264.8 us, 7553.1 MB/s
memcpy(read):             173.0 us, 11558.1 MB/s
Compression level: 0
comp(write):      175.2 us, 11416.5 MB/s          Final bytes: 2097168  Ratio: 1.00
decomp(read):     115.4 us, 17329.1 MB/s          OK
Compression level: 1
comp(write):      807.1 us, 2478.1 MB/s   Final bytes: 29560  Ratio: 70.95
decomp(read):     664.8 us, 3008.3 MB/s   OK
Compression level: 2
comp(write):     1462.4 us, 1367.6 MB/s   Final bytes: 20000  Ratio: 104.86
decomp(read):     552.5 us, 3620.0 MB/s   OK
Compression level: 3
comp(write):     1037.0 us, 1928.7 MB/s   Final bytes: 15880  Ratio: 132.06
decomp(read):     457.7 us, 4369.9 MB/s   OK
Compression level: 4
comp(write):     3625.4 us, 551.7 MB/s    Final bytes: 13240  Ratio: 158.40
decomp(read):     562.3 us, 3556.7 MB/s   OK
Compression level: 5
comp(write):     2632.2 us, 759.8 MB/s    Final bytes: 11856  Ratio: 176.89
decomp(read):     592.2 us, 3377.1 MB/s   OK
Compression level: 6
comp(write):     4901.6 us, 408.0 MB/s    Final bytes: 8310  Ratio: 252.36
decomp(read):     794.8 us, 2516.2 MB/s   OK
Compression level: 7
comp(write):     3246.5 us, 616.1 MB/s    Final bytes: 4725  Ratio: 443.84
decomp(read):     906.4 us, 2206.6 MB/s   OK
Compression level: 8
comp(write):     3622.4 us, 552.1 MB/s    Final bytes: 4723  Ratio: 444.03
decomp(read):     904.3 us, 2211.5 MB/s   OK
Compression level: 9
comp(write):     4955.2 us, 403.6 MB/s    Final bytes: 4598  Ratio: 456.10
decomp(read):     903.9 us, 2212.7 MB/s   OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:      12.8 s, 1317.1 MB/s

which is pretty amazing :)

FrancescAlted commented on May 2, 2024

After seeing that zstd 0.5.1 is out now, I have updated c-blosc2 accordingly and, out of curiosity, re-ran some benchmarks. The most important differences that I have seen are:

with zstd 0.5.0:

$ bench/bench zstd bitshuffle single 8
[snip]
Compression level: 7
comp(write):     3243.8 us, 616.6 MB/s    Final bytes: 4725  Ratio: 443.84
decomp(read):    1028.5 us, 1944.6 MB/s   OK
Compression level: 8
comp(write):     3614.4 us, 553.3 MB/s    Final bytes: 4723  Ratio: 444.03
decomp(read):    1025.5 us, 1950.2 MB/s   OK
Compression level: 9
comp(write):     4932.4 us, 405.5 MB/s    Final bytes: 4598  Ratio: 456.10
decomp(read):    1024.6 us, 1951.9 MB/s   OK

with zstd 0.5.1:

$ bench/bench zstd bitshuffle single 8
[snip]
Compression level: 7
comp(write):     7591.1 us, 263.5 MB/s    Final bytes: 4654  Ratio: 450.61
decomp(read):     992.6 us, 2014.9 MB/s   OK
Compression level: 8
comp(write):     6019.3 us, 332.3 MB/s    Final bytes: 5515  Ratio: 380.26
decomp(read):    1022.5 us, 1956.0 MB/s   OK
Compression level: 9
comp(write):     5126.9 us, 390.1 MB/s    Final bytes: 3230  Ratio: 649.27
decomp(read):     969.0 us, 2064.0 MB/s   OK

You can see that, while decompression speeds are mostly unchanged, compression speeds are up to 2x slower for compression levels 7 and 8 (mapped to 13 and 15 in zstd). Also, for compression level 9 (mapped to 21 in zstd) the compression ratio increased by 1.5x (yeah, hard to believe) while keeping compression/decompression speeds in the same range.

When using the shuffle filter I got slightly different results.

With zstd 0.5.0:

Compression level: 7
comp(write):     3174.8 us, 630.0 MB/s    Final bytes: 7514  Ratio: 279.10
decomp(read):     365.8 us, 5467.2 MB/s   OK
Compression level: 8
comp(write):     3823.8 us, 523.0 MB/s    Final bytes: 7514  Ratio: 279.10
decomp(read):     366.7 us, 5453.9 MB/s   OK
Compression level: 9
comp(write):     6291.4 us, 317.9 MB/s    Final bytes: 6874  Ratio: 305.08
decomp(read):     357.2 us, 5598.5 MB/s   OK

with zstd 0.5.1:

Compression level: 7
comp(write):     8694.2 us, 230.0 MB/s    Final bytes: 6810  Ratio: 307.95
decomp(read):     438.4 us, 4561.7 MB/s   OK
Compression level: 8
comp(write):     11185.4 us, 178.8 MB/s   Final bytes: 6882  Ratio: 304.73
decomp(read):     357.5 us, 5594.7 MB/s   OK
Compression level: 9
comp(write):     12195.1 us, 164.0 MB/s   Final bytes: 6878  Ratio: 304.91
decomp(read):     361.1 us, 5539.1 MB/s   OK

Here, decompression speeds and compression ratios are similar, but compression speeds for the high levels are up to 3x slower.

I thought it would be a good idea to report this back. It also reminds me that it would be nice to set up a suite of speed tests and keep historical records so that these kinds of regressions can be detected. One solution could be airspeed velocity. I have plans to use this for my own packages, but you may be interested in something similar as well.

Cyan4973 commented on May 2, 2024

Thanks @FrancescAlted, these are very good points.

With the introduction of the new optimal parsing schemes, there was a need to rebuild compression level parameter tables.

The most important and most tested table is the first (default) one, applicable to data sets > 256 KB.
But the parameters selected for this scenario (large files) are not good enough for smaller data sets. Consequently, a few more tables are generated, to better fit different trade-offs:

  • > 256 KB; default for file compression

  • 128-256 KB
  • 16-128 KB; default for dictionary compression
  • <= 16 KB

The selection of the right table happens at the beginning, provided the API is able to properly detect the size of the data to compress. That's not always possible though, for example in streaming applications, where there is no hint of the final size to compress. In such cases, the default tables apply.

The reference corpus used to generate the parameters for the tables < 256 KB changed between v0.5.0 and v0.5.1. As a consequence, the "optimal" settings for each compression parameter changed too. They could prove better or worse, depending on the data to compress.

There is no "universally better" parameter set : it features better in some cases, worse for others. So the reference corpus used to calibrate compression levels parameters can make quite a difference.

In your case, my understanding is that you use blocks of 256 KB, so it probably relies on table 2.
Here, level 13 went from:

{  0, 18, 17, 16,  9,  4, ZSTD_lazy    },  /* old level 13 */
{  0, 18, 19, 17,  7,  4,  4, ZSTD_btlazy2 },  /* new level 13.*/ 
{  0, 18, 17, 17,  4,  4,  4, ZSTD_lazy    },  /* new level  7, closest to old level 13 */

Level 15 went from:

{  0, 18, 17, 17,  9,  4, ZSTD_lazy2   },  /* old level 15 */
{  0, 18, 19, 19,  8,  4, 48, ZSTD_opt_bt  },  /* new level 15.*/  
{  0, 18, 17, 17,  7,  4,  4, ZSTD_lazy2   },  /* new level 11, closest to old level 15 */  

In general, the new compression levels in this table are supposed to be stronger and slower. But in many circumstances, "stronger" will be barely noticeable, while "slower" will be. That's because on smaller blocks it's difficult to get more compression by just throwing CPU power into the fray: "pretty close to optimal" is reached much faster.

Anyway, setting up a speed test to detect regressions at each version (or even better, at each commit) would be greatly beneficial to the project. If airspeed velocity can help do it, I'm all for it!

Sidenote: applications that manipulate compression parameters directly (using ZSTD_compress_advanced()) are not impacted by the pre-generated tables.

FrancescAlted commented on May 2, 2024

Thanks for the nice explanation of the internal parametrization of zstd. But I must say that Blosc uses different block sizes for different compression levels. Actually, I was seeing the compression slowdown for blocksizes > 256 KB (in particular, level 7 -> 512 KB, level 8 -> 1 MB and level 9 -> 4 MB; see https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc.c#L925 for details).

Cyan4973 commented on May 2, 2024

Ah, then it's more surprising.
Considering that blocks > 256 KB should use table 1, levels have remained much more stable there. See the details below:

    { 22, 21, 22,  5,  5, ZSTD_lazy2   },  /* old level 13 */ 
    {  0, 22, 21, 22,  5,  5,  4, ZSTD_lazy2   },  /* new level 13 */

    { 23, 23, 23,  5,  5, ZSTD_lazy2   },  /* old level 15 */  
    {  0, 23, 23, 23,  5,  5,  4, ZSTD_lazy2   },  /* new level 15 */

Almost identical. So any significant change in compression speed is unexpected.

I'm currently unable to reproduce this effect: in my tests, old (0.5.0) and new (0.5.1) levels 13-15 have about the same speed (within the error margin).

Cyan4973 commented on May 2, 2024

The only reason I can think of for 0.5.0 and 0.5.1 to behave so differently is if they are provided blocks <= 256 KB.

I've been looking at https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc.c#L925 , as suggested in your post.

I spotted some very small differences that could result in block sizes being different from what was intended:

compute_blocksize() checks for:
if (context->compcode == BLOSC_ZSTD) { blocksize *= 8; }

But initialization happens within initialize_context_compression(), and there we see:

#if defined(HAVE_ZSTD)
    case BLOSC_ZSTD:
      compcode = BLOSC_ZSTD_FORMAT;
      context->dest[1] = BLOSC_ZSTD_VERSION_FORMAT;      /* zstd format version */
      break;
#endif /*  HAVE_ZSTD */

So now we have two different definitions: BLOSC_ZSTD_FORMAT at initialization, and BLOSC_ZSTD at the later runtime check.

Are they the same? Let's look at their definitions in blosc.h:

(...)
#define BLOSC_ZSTD           5
(...)
#define BLOSC_ZSTD_LIB       4
(...)
#define BLOSC_ZSTD_FORMAT     BLOSC_ZSTD_LIB
(...)

So they are different. I guess it could have an impact within compute_blocksize().

FrancescAlted commented on May 2, 2024

Well, my internal notation was a bit confusing, sorry about that. The compcode is the chosen compressor, and the compformat is the format in which the compressor writes its output. I introduced this duality because LZ4 and LZ4HC are different codecs, but both write the info in the same format (indeed, LZ4 can decompress an LZ4HC stream). It is unfortunate that I used compcode where I should have used compformat. I tried to address this in Blosc/c-blosc2@0309587.

Also, there is something that I don't understand in Blosc2: compression levels >= 7 for zstd always end up with a 2 MB blocksize (verified via printf), rather than ramping up more progressively towards 2 MB (I have to debug this more). But anyway, it is for block sizes of 2 MB that I am seeing the regression in compression speed between zstd 0.5.0 and 0.5.1 (I have double-checked just now, so it is completely reproducible in my setup).

Cyan4973 commented on May 2, 2024

there is something that I don't understand in Blosc2: compression levels >= 7 for zstd always end up with a 2 MB blocksize (verified via printf), rather than ramping up more progressively towards 2 MB (I have to debug this more).

I believe it comes from this code:

   else if (clevel <= 6) {
      blocksize *= 2;
    }
    else if (clevel < 9) {
      blocksize *= 8;
    }
    else {
      blocksize *= 16;
    }

If I understand your code correctly, at the start of this chain of else if, blocksize == 32 KB * 8 = 256 KB.
So, following the code:
clevel == 6 => blocksize == 256 KB * 2 = 512 KB
clevel == 7 => blocksize == 256 KB * 8 = 2 MB
clevel == 8 => blocksize == 256 KB * 8 = 2 MB
clevel == 9 => blocksize == 256 KB * 16 = 4 MB

Maybe you wanted to introduce a 1 MB block size at clevel == 7?

Sidenote: maybe a switch () { case ... } would prove easier to read and maintain.
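Something along these lines, for instance (a sketch that only mirrors the quoted fragment, not necessarily what ended up in the actual commit; clevels below 6 are assumed to be resolved before this point):

    switch (clevel) {
      case 6:
        blocksize *= 2;    /* 256 KB -> 512 KB */
        break;
      case 7:
      case 8:
        blocksize *= 8;    /* 256 KB -> 2 MB */
        break;
      default:             /* clevel 9 */
        blocksize *= 16;   /* 256 KB -> 4 MB */
    }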

But anyway, it is for block sizes of 2 MB that I am seeing the regression in compression speed between zstd 0.5.0 and 0.5.1 (I have double-checked just now, so it is completely reproducible in my setup).

This one is indeed very strange.
I would need a way to reproduce it on my side.
How would you recommend doing it?

FrancescAlted commented on May 2, 2024

Reproducing my setup is very easy. Just compile c-blosc2 via cmake and then run:

$ bench/bench zstd [bit]shuffle single

That should be pretty much it.

Cyan4973 commented on May 2, 2024

I've been able to reproduce the results with v0.5.1,
but I'm struggling to do the same with v0.5.0, which would be nice to have for comparison.

It seems it's not enough to just swap the library files, due to the integration of zdict, which only exists in 0.5.1 and also depends on the newer version of zstd.

FrancescAlted commented on May 2, 2024

Ah, you can go back to 0.5.0 by just going back in history before 0.5.1 was included. Use this command:

$ git checkout 0064efb

and recompile.

Cyan4973 commented on May 2, 2024

just going back in history before 0.5.1 was included

OK, but then it is important that no other change has been made, as it could also impact the result.

FrancescAlted commented on May 2, 2024

I think the only important change is this one: Blosc/c-blosc2@79a4079#diff-497387f34956a83d83645aa6212df413L515, and it only affects the highest compression level (9).

Cyan4973 commented on May 2, 2024

Thanks. I can reproduce the results, similar to yours.

Somehow, they do not make sense to me.
In particular, it's not only the speed that is different, but also the compression ratio.
This does not look okay. In my tests, both versions produce identical compression ratios at levels -13 and -15. Something fishy is happening.

FrancescAlted commented on May 2, 2024

From what you are saying this is weird, yes; but well, at least it is reproducible :P. Anyway, I just converted the big if statement into a switch: Blosc/c-blosc2@a981dc7. Much better 👍

Cyan4973 commented on May 2, 2024

I've been doing some more tests, with a bit more logging.

It appears that something is handing ZSTD_compress() block sizes different from what compute_blocksize() produces.

Here are a few traces:

cLevel : 11 ;   srcSize : 65536
cLevel : 13 ;   srcSize : 262144
cLevel : 15 ;   srcSize : 262144
cLevel : 21 ;   srcSize : 262144

FrancescAlted commented on May 2, 2024

Ah, then you are right and Blosc is passing smaller blocksizes to the compressors than I realized. Tomorrow I'll have a look. Thanks for the investigation!

FrancescAlted commented on May 2, 2024

So, I had a look, and yeah, I forgot the fact that blosc splits each block into N sub-blocks before passing the data to the compressor, where N is the typesize. As bench.c has a default typesize of 8, the data sub-blocks were arriving at 1/8 of the blocksize. In addition, bench.c has a buffer size of 2 MB by default, which means that the sub-blocks that arrived at Zstd were in all cases <= 256 KB, and this is why bench.c was sensitive to the transition between Zstd 0.5.0 and 0.5.1. Mystery solved.
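To make the arithmetic explicit (all numbers come from the runs described in this thread):

sub-block size = Blosc blocksize / typesize
default bench run: blocksize 2 MB, typesize 8  ->  2 MB / 8 = 256 KB per sub-block, at most
run below        : blocksize 4 MB (clevel 9), typesize 4    ->  4 MB / 4 = 1 MB
                   blocksize 2 MB (clevels 7-8), typesize 4 ->  2 MB / 4 = 512 KB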

Out of curiosity, I ran another benchmark with 1) a bigger buffer size (2 MB -> 4 MB) and 2) a smaller typesize (8 bytes -> 4 bytes). With that, the sub-block size that should arrive at Zstd is 1 MB for Blosc's compression level 9, and 512 KB for levels 7 and 8. Here are the results for Zstd 0.5.0:

$ bench/bench zstd bitshuffle single 4 4194304 4
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib,zstd
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
  Zstd: 0.5.0
Using compressor: zstd
Using shuffle type: bitshuffle
Running suite: single
--> 4, 4194304, 4, 19, zstd, bitshuffle
********************** Run info ******************************
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 4194304 bytes     Type size: 4 bytes
Working set: 256.0 MB           Number of threads: 4
********************** Running benchmarks *********************
memcpy(write):            561.7 us, 7121.9 MB/s
memcpy(read):             411.1 us, 9730.0 MB/s
Compression level: 0
comp(write):      377.2 us, 10604.3 MB/s          Final bytes: 4194320  Ratio: 1.00
decomp(read):     291.7 us, 13712.5 MB/s          OK
Compression level: 1
comp(write):     1425.8 us, 2805.5 MB/s   Final bytes: 42880  Ratio: 97.81
decomp(read):    1434.6 us, 2788.3 MB/s   OK
Compression level: 2
comp(write):     5560.4 us, 719.4 MB/s    Final bytes: 42688  Ratio: 98.25
decomp(read):    1439.0 us, 2779.7 MB/s   OK
Compression level: 3
comp(write):     7634.2 us, 524.0 MB/s    Final bytes: 40432  Ratio: 103.74
decomp(read):    1390.1 us, 2877.4 MB/s   OK
Compression level: 4
comp(write):     5298.1 us, 755.0 MB/s    Final bytes: 23760  Ratio: 176.53
decomp(read):    1072.8 us, 3728.4 MB/s   OK
Compression level: 5
comp(write):     3857.8 us, 1036.9 MB/s   Final bytes: 18840  Ratio: 222.63
decomp(read):     675.0 us, 5925.7 MB/s   OK
Compression level: 6
comp(write):     4763.4 us, 839.7 MB/s    Final bytes: 13804  Ratio: 303.85
decomp(read):     862.8 us, 4636.0 MB/s   OK
Compression level: 7
comp(write):     29094.7 us, 137.5 MB/s   Final bytes: 9126  Ratio: 459.60
decomp(read):    2277.3 us, 1756.4 MB/s   OK
Compression level: 8
comp(write):     50543.8 us, 79.1 MB/s    Final bytes: 9126  Ratio: 459.60
decomp(read):    2259.3 us, 1770.5 MB/s   OK
Compression level: 9
comp(write):     151322.3 us, 26.4 MB/s   Final bytes: 5344  Ratio: 784.86
decomp(read):    2061.3 us, 1940.5 MB/s   OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:      52.8 s, 320.3 MB/s

whereas for 0.5.1:

$ bench/bench zstd bitshuffle single 4 4194304 4
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,snappy,zlib,zstd
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: 1.7.0
  Snappy: 1.1.1
  Zlib: 1.2.8
  Zstd: 0.5.1
Using compressor: zstd
Using shuffle type: bitshuffle
Running suite: single
--> 4, 4194304, 4, 19, zstd, bitshuffle
********************** Run info ******************************
Blosc version: 2.0.0a2 ($Date:: 2016-01-08 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 4194304 bytes     Type size: 4 bytes
Working set: 256.0 MB           Number of threads: 4
********************** Running benchmarks *********************
memcpy(write):            561.4 us, 7125.5 MB/s
memcpy(read):             410.9 us, 9735.1 MB/s
Compression level: 0
comp(write):      353.1 us, 11328.1 MB/s          Final bytes: 4194320  Ratio: 1.00
decomp(read):     302.4 us, 13225.4 MB/s          OK
Compression level: 1
comp(write):     1568.3 us, 2550.5 MB/s   Final bytes: 42880  Ratio: 97.81
decomp(read):    1436.2 us, 2785.2 MB/s   OK
Compression level: 2
comp(write):     2640.2 us, 1515.1 MB/s   Final bytes: 34416  Ratio: 121.87
decomp(read):    1363.2 us, 2934.2 MB/s   OK
Compression level: 3
comp(write):     7748.6 us, 516.2 MB/s    Final bytes: 37392  Ratio: 112.17
decomp(read):    1118.8 us, 3575.2 MB/s   OK
Compression level: 4
comp(write):     5763.9 us, 694.0 MB/s    Final bytes: 23760  Ratio: 176.53
decomp(read):     699.5 us, 5718.5 MB/s   OK
Compression level: 5
comp(write):     4454.1 us, 898.1 MB/s    Final bytes: 18840  Ratio: 222.63
decomp(read):     680.0 us, 5882.5 MB/s   OK
Compression level: 6
comp(write):     5567.6 us, 718.4 MB/s    Final bytes: 13804  Ratio: 303.85
decomp(read):     863.2 us, 4634.1 MB/s   OK
Compression level: 7
comp(write):     27624.4 us, 144.8 MB/s   Final bytes: 9126  Ratio: 459.60
decomp(read):    2270.0 us, 1762.1 MB/s   OK
Compression level: 8
comp(write):     50007.0 us, 80.0 MB/s    Final bytes: 9126  Ratio: 459.60
decomp(read):    2346.7 us, 1704.5 MB/s   OK
Compression level: 9
comp(write):     167821.7 us, 23.8 MB/s   Final bytes: 3217  Ratio: 1303.79
decomp(read):    1741.1 us, 2297.4 MB/s   OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:      55.2 s, 306.1 MB/s

We see here a much more uniform behaviour, but for 1 MB sub-blocks it is apparent that 0.5.1 compresses much better in this specific case. This is perhaps because Blosc activates compression level 21 in Zstd 0.5.1, whereas it activates just 20 for Zstd 0.5.0, but it might also be down to better tuning in Zstd 0.5.1.

Also, for high compression levels, Blosc's tuning for Zstd makes multi-threading useless because the blocksize tends to be equal to the buffer size (see clevel == 9 above). Maybe I could make the threads work on data at the sub-block level. Hmm, food for thought.

At any rate, it is nice to see that Zstd does not show big regressions in this scenario (buffers of 4 MB and typesizes of 4 bytes), and that the previous ones I reported (mainly for sub-blocks of 256 KB) were kind of 'special'.

Cyan4973 commented on May 2, 2024

0.5.0 level 20 and 0.5.1 level 21 are completely different, so it's no surprise they generate different results.

The newer high compression levels (>=18) are expected to always compress better.

FrancescAlted commented on May 2, 2024

I see. Thanks for your time!

FrancescAlted commented on May 2, 2024

Just a follow-up on this. I recently implemented context calls for zstd inside blosc2 and I am getting quite nice speed-ups (up to 1.4x). I also avoid splitting blocks when using zstd (actually for any codec except blosclz and snappy), and that proved to be a good thing.
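"Context calls" here means keeping one long-lived ZSTD_CCtx per worker instead of paying for a fresh context on every call; a minimal sketch (the worker bookkeeping and names are hypothetical):

#include <zstd.h>

/* Sketch: each worker thread keeps its own long-lived compression context,
   so allocations and internal table setup are amortized over many blocks. */
typedef struct {
    ZSTD_CCtx *cctx;
} worker_ctx;

static size_t worker_compress(worker_ctx *w, void *dst, size_t dst_cap,
                              const void *src, size_t src_size, int level)
{
    if (w->cctx == NULL)
        w->cctx = ZSTD_createCCtx();   /* created once, reused for every block */
    return ZSTD_compressCCtx(w->cctx, dst, dst_cap, src, src_size, level);
}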

Here are the trends for decompressing when using multiple threads on an 8-core processor:

[Plot: zstd-shuffle-suite8-8mb-decompr]

where it can be seen that we get an advantage for most of the compression levels (for nthreads > 4 there is no further advantage; I suppose this has to do with the architecture of my processor).

And here for compressing:

[Plot: zstd-shuffle-suite8-8mb-compr]

In this case we can see advantages only for the small compression levels (i.e., when the block size is <= 32 KB, which is the L1 size of my cores).

Next step should be supporting zstd dictionaries inside blosc2.

Cyan4973 commented on May 2, 2024

👍

Cyan4973 commented on May 2, 2024

There is now contrib/pzstd which covers this topic.

Cyan4973 commented on May 2, 2024

Multithreaded compression is now implemented in v1.1.0, in contrib/pzstd.
