
Comments (30)

harshavardhana commented on August 19, 2024

Perchance, have you run speed comparisons vs. the other common implementation of blake2b?

No, there are no intentions of doing that, @goarchit - happy to see some results from the community, if any.

from blake2b-simd.

goarchit commented on August 19, 2024

One data point: hashing a 5.8GB tarball on a Jetway (SSE2 only, quad-core Celeron, SSD-based - hopefully the low end of PCs these days, at least CPU-wise):

Minio: /* First run - nothing in OS cache */
real 1m11.658s
user 1m8.976s
sys 0m2.680s

Codahale:
real 1m8.054s
user 1m6.611s
sys 0m1.954s

Minio: /* Second run - sanity check, just in case OS caching played a role */
real 1m10.302s
user 1m8.865s
sys 0m1.976s

Code used:

package main

import (
        "fmt"
        "io"
        "os"

        "github.com/minio/blake2b-simd"
        // "github.com/codahale/blake2" // swap in to test the codahale implementation
)

// ComputeBlake2B streams the file through the hash, so multi-GB files
// are handled without being read fully into memory.
func ComputeBlake2B(filePath string) ([]byte, error) {
        file, err := os.Open(filePath)
        if err != nil {
                return nil, err
        }
        defer file.Close()

        // hash := blake2.NewBlake2B() // codahale variant
        hash, err := blake2b.New(nil)
        if err != nil {
                return nil, fmt.Errorf("Blake2B error: %v", err)
        }
        if _, err := io.Copy(hash, file); err != nil {
                return nil, err
        }

        return hash.Sum(nil), nil
}

func main() {
        if b, err := ComputeBlake2B("/home/everything.tar"); err != nil {
                fmt.Printf("Err: %v\n", err)
        } else {
                fmt.Printf("/home/everything.tar Blake2b checksum is: %x\n", b)
        }
}


goarchit commented on August 19, 2024

Sorry about the formatting, not sure why pasting the code between ticks (e.g. "insert code") didn't work.


fwessels commented on August 19, 2024

Thanks for the benchmark, looks pretty comparable between the two.

Using a more powerful CPU with AVX or AVX2 would roughly double the hashing speed.


goarchit commented on August 19, 2024

Ran a similar test (6.2GB tarball) on a sample of the other extreme: a dual-socket Xeon with a combined 28 cores and 56 threads, a large L3 cache, AVX, and AVX2. Again off a (likely faster) SSD.

Minio:
real 0m16.623s
user 0m13.060s
sys 0m2.640s

Codahale:
real 0m29.193s
user 0m24.800s
sys 0m2.980s

Minio (sanity check):
real 0m16.443s
user 0m13.170s
sys 0m2.560s

That server was moderately busy, so I idled it and reran:

Minio:
real 0m10.744s
user 0m9.480s
sys 0m1.270s

Codahale:
real 0m17.705s
user 0m16.670s
sys 0m1.040s

Bottom line: Minio is approximately the same speed on low-end CPUs and uses ~40% fewer resources on a high-end CPU system. Nice job! Pretty clear which codebase our project will be using, shy of any surprises benchmarking some other algos.

update: Just for grins I timed the golang/x/crypto/Blake2B package:
real 2m19.175s
user 2m18.290s
sys 0m0.990s


harshavardhana commented on August 19, 2024

update: Just for grins I timed the golang/x/crypto/Blake2B package:
real 2m19.175s
user 2m18.290s
sys 0m0.990s

This seems odd - in fact crypto/blake2b is usually faster than the minio blake2b implementation - are you using the latest commit from https://github.com/golang/crypto/tree/master/blake2b ?

// cc @aead


harshavardhana commented on August 19, 2024

Did some tests locally - golang.org/x/crypto/blake2b is giving results quite close to minio/blake2b-simd, or better.


goarchit commented on August 19, 2024

No... I pulled from the path referenced...

Built using github.com/golang/crypto/blake2b (after a go get), idled the box to recreate the original environment, and got a time of:
real 2m3.855s
user 2m2.150s
sys 0m1.780s

Very similar to what I previously reported.


aead commented on August 19, 2024

@goarchit
Can you provide the benchmark results of go test -bench=Benchmark on your machine(s)?

On my i7-6500U (linux/amd64) the codahale implementation is about 400 MB/s slower than x/crypto/blake2b - which is not surprising, because codahale relies on the "official" C implementation (using "just" SSE4.1) vs. the AVX2 code of x/crypto.

Did some tests locally - golang.org/x/crypto/blake2b is giving results quite close to minio/blake2b-simd, or better.

Maybe there are some other side effects, but on my machines I see the same as you.


goarchit commented on August 19, 2024

Per your request:

BenchmarkWrite128-56 1000000 2030 ns/op 63.05 MB/s
BenchmarkWrite1K-56 100000 16031 ns/op 63.87 MB/s
BenchmarkSum128-56 1000000 2038 ns/op 62.79 MB/s
BenchmarkSum1K-56 100000 16517 ns/op 61.99 MB/s
PASS
ok github.com/golang/crypto/blake2b 7.705s

And just to be complete, the last CPU's output from cat /proc/cpuinfo:

processor : 55
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
stepping : 1
microcode : 0xb00001d
cpu MHz : 2401.000
cache size : 35840 KB
physical id : 1
siblings : 28
core id : 14
cpu cores : 14
apicid : 61
initial apicid : 61
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
bugs :
bogomips : 4800.57
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:


harshavardhana commented on August 19, 2024

BenchmarkWrite128-56 1000000 2030 ns/op 63.05 MB/s
BenchmarkWrite1K-56 100000 16031 ns/op 63.87 MB/s
BenchmarkSum128-56 1000000 2038 ns/op 62.79 MB/s
BenchmarkSum1K-56 100000 16517 ns/op 61.99 MB/s

what is the Go version you are using @goarchit ?


aead commented on August 19, 2024

BenchmarkWrite128-56 1000000 2030 ns/op 63.05 MB/s
BenchmarkWrite1K-56 100000 16031 ns/op 63.87 MB/s
BenchmarkSum128-56 1000000 2038 ns/op 62.79 MB/s
BenchmarkSum1K-56 100000 16517 ns/op 61.99 MB/s

The performance is surprisingly low - that's unexpected. On this kind of machine I would expect values >= 800 MB/s...
The CPU supports AVX2, so the AVX2 code is probably used (as long as the Go version is >= 1.7).

Unfortunately I don't have access to this kind of CPU - so I cannot reproduce or test this on my own.
Maybe (this is a guess) the LOAD_MSG_AVX2 macro causes some trouble (maybe the Xeon has problems with the AVX2-SSE mixing code), but as mentioned, I cannot test this...

@goarchit
Can you replace line 10 in blake2bAVX2_amd64.s with useAVX2 = supportsAVX2() && false
and run the benchmark again?


goarchit commented on August 19, 2024

@harshavardhana go version go1.7.4 linux/amd64

@aead Presume you meant blake2bAVX2_amd64.go... similar results:
BenchmarkWrite128-56 1000000 2028 ns/op 63.10 MB/s
BenchmarkWrite1K-56 100000 15764 ns/op 64.96 MB/s
BenchmarkSum128-56 1000000 2047 ns/op 62.53 MB/s
BenchmarkSum1K-56 100000 15836 ns/op 64.66 MB/s
PASS
ok github.com/golang/crypto/blake2b 7.616s

@aead E-mail me and I can work out access for you to the box. It's normally busy doing some CPU mining and other stuff, but I can manually idle it for you while you test. We would just need to coordinate and keep the windows reasonable (an hour or two at a time, not days or weeks).


harshavardhana commented on August 19, 2024

Was able to reproduce this problem on our Xeon hardware

$ go test -bench=Benchmark
BenchmarkWrite128-12              500000              2243 ns/op          57.05 MB/s
BenchmarkWrite1K-12               100000             16225 ns/op          63.11 MB/s
BenchmarkSum128-12               1000000              2106 ns/op          60.77 MB/s
BenchmarkSum1K-12                 100000             16287 ns/op          62.87 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.885s

v/s blake2b-simd

BenchmarkComparisonMD5-12                   1000           1741000 ns/op         602.28 MB/s
BenchmarkComparisonSHA1-12                  1000           1393112 ns/op         752.69 MB/s
BenchmarkComparisonSHA256-12                 500           3006689 ns/op         348.75 MB/s
BenchmarkComparisonSHA512-12                 500           3440423 ns/op         304.78 MB/s
BenchmarkComparisonBlake2B-12               1000           1229740 ns/op         852.68 MB/s
BenchmarkSize64-12                       3000000               654 ns/op          97.80 MB/s
BenchmarkSize128-12                      3000000               511 ns/op         250.45 MB/s
BenchmarkSize1K-12                       1000000              2014 ns/op         508.25 MB/s
BenchmarkSize8K-12                        100000             10483 ns/op         781.43 MB/s
BenchmarkSize32K-12                        50000             35959 ns/op         911.25 MB/s
BenchmarkSize128K-12                       10000            143302 ns/op         914.65 MB/s

Looks like a bug @aead


aead commented on August 19, 2024

@harshavardhana
Switching to the AVX code doesn't change anything - so my guess about the AVX2-SSE switching seems to be wrong. Maybe you can run the benchmark for the different SIMD instruction sets (SSE4.1, AVX, AVX2)?
I'll take a deeper look into the blake2b-simd assembly and search for significant differences in memory/cache accesses, etc.

Maybe creating an issue at golang is a good idea?!

@goarchit
Okay, now I'm really drawing a blank...
Thanks for the offer, I'll do some investigation on this and come back to you! 😃


harshavardhana commented on August 19, 2024

Plain blake2b without any optimizations using golang.org/x/crypto/blake2b

$ go test -bench .
BenchmarkWrite128-12             3000000               351 ns/op         364.08 MB/s
BenchmarkWrite1K-12               500000              2436 ns/op         420.20 MB/s
BenchmarkSum128-12               5000000               344 ns/op         371.54 MB/s
BenchmarkSum1K-12                 500000              2467 ns/op         414.91 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.115s

With SSE4

$ go test -bench .
SSE4
BenchmarkWrite128-12             5000000               236 ns/op         540.17 MB/s
BenchmarkWrite1K-12              1000000              1528 ns/op         669.91 MB/s
BenchmarkSum128-12               5000000               354 ns/op         360.71 MB/s
BenchmarkSum1K-12                1000000              1686 ns/op         607.12 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.842s

With AVX

$ go test -bench .
AVX
BenchmarkWrite128-12             3000000               333 ns/op         383.37 MB/s
BenchmarkWrite1K-12               500000              2436 ns/op         420.35 MB/s
BenchmarkSum128-12               5000000               344 ns/op         371.52 MB/s
BenchmarkSum1K-12                 500000              2592 ns/op         394.96 MB/s
PASS
ok      golang.org/x/crypto/blake2b     6.091s

With AVX2

$ go test -bench .
AVX2
BenchmarkWrite128-12              500000              2084 ns/op          61.40 MB/s
BenchmarkWrite1K-12               100000             16227 ns/op          63.10 MB/s
BenchmarkSum128-12               1000000              2107 ns/op          60.74 MB/s
BenchmarkSum1K-12                 100000             16312 ns/op          62.77 MB/s
PASS
ok      golang.org/x/crypto/blake2b     7.647s


aead commented on August 19, 2024

Did I miss something? One time you get < 70 MB/s for AVX2, and the next time you get > 350 MB/s?!


harshavardhana commented on August 19, 2024

Did I miss something? One time you get < 70 MB/s for AVX2, and the next time you get > 350 MB/s?!

No, I edited it - it was a copy-paste problem :-)


aead commented on August 19, 2024

Oh okay, no problem :-)


harshavardhana commented on August 19, 2024

Oh okay, no problem :-)

Pinged you on Gitter - send me your SSH key and I can give you access to the Xeon machines we have. Thanks @goarchit for the offer.


aead commented on August 19, 2024

@harshavardhana
Okay, I get the same low performance for AVX2 but I see:

AVX
BenchmarkWrite128-12    	10000000	       216 ns/op	 589.97 MB/s
BenchmarkWrite1K-12     	 1000000	      1532 ns/op	 668.30 MB/s
BenchmarkSum128-12      	10000000	       245 ns/op	 520.60 MB/s
BenchmarkSum1K-12       	 1000000	      1562 ns/op	 655.39 MB/s
PASS
ok  	golang.org/x/crypto/blake2b	8.239s

If that's the "normal" AVX performance, the AVX2 code is the problem...


aead commented on August 19, 2024

Replacing PINSRQ with VPINSRQ and adding VZEROUPPER doesn't change anything - I'll look at this again tomorrow...


fwessels commented on August 19, 2024

A performance drop like this can very well be related to unintended saving of the upper half of the AVX registers - so my suggestion would indeed be to see whether VZEROUPPER etc. can help (or, even better, whether it can be avoided altogether).


aead commented on August 19, 2024

Okay, I found the problem: SSE-AVX transition penalties in the LOAD_MSG_AVX2 macro.
The critical instructions are PINSRQ and MOVQ - I replaced them with VPINSRQ and VMOVQ and got:

AVX2
BenchmarkWrite128-12    	 5000000	       263 ns/op	 485.06 MB/s
BenchmarkWrite1K-12     	 1000000	      1617 ns/op	 633.24 MB/s
BenchmarkSum128-12      	 5000000	       292 ns/op	 438.23 MB/s
BenchmarkSum1K-12       	 1000000	      1707 ns/op	 599.73 MB/s

I'll replace them and optimize the code to get the "old" performance back...
This affects the AVX code as well...


harshavardhana commented on August 19, 2024

Thanks @aead - this looks good.


fwessels commented on August 19, 2024

Nice - you have to be really careful to use the correct "flavour" of the instruction...


aead commented on August 19, 2024

So, I modified the AVX and AVX2 code to use only AVX instructions and got this on the Xeon machine:

BenchmarkWrite128-12    	 5000000	       228 ns/op	 561.39 MB/s
BenchmarkWrite1K-12     	 1000000	      1131 ns/op	 904.78 MB/s
BenchmarkSum128-12      	10000000	       222 ns/op	 575.70 MB/s
BenchmarkSum1K-12       	 1000000	      1198 ns/op	 854.54 MB/s

I submitted the code for review and will send a PR to minio as soon as the code passes the review process.

@fwessels
I'm also a little bit surprised that the MOVOU (SSE) instruction in compressAvx_amd64.s doesn't lead to performance problems similar to the MOVQ instruction. But maybe the XMM,m128 moves are not affected by the register-state saving?!


fwessels commented on August 19, 2024

@aead Good point - at least it doesn't on my MacBook Pro (which has AVX but lacks AVX2).
On the server where we have AVX2 I never actually benchmarked the AVX version (it would have been slower anyway), so perhaps it would have shown similar behaviour.


aead commented on August 19, 2024

@fwessels
I tested the AVX implementation (blake2b-simd) on the Xeon and got the expected performance - so MOVOU (for example) doesn't seem to lead to SSE-AVX transitions... Nice to know :-)


fwessels commented on August 19, 2024

@aead Yes, that is good to know (otherwise I'd probably have noticed it - if you get it wrong there are typically significant penalties...)

