Coder Social home page Coder Social logo

hashcompare's Introduction

Comparison of Hashing Techniques for Bit Rot detection

Introduction

At minio we use a hash algorithm in order to detect bit rot for data blocks that are stored on disk. Thus far we have been using the Blake2b algorithm but recent developments have led us to do new research for a faster algorithm for bit rot protection.

Initially we looked into SipHash as well as Poly1305. Obviously Poly1305 is not a "traditional" hash function but since it has nice performance properties and could potentially fit our use case (if used in a particular way), we decided to test it out as well.

As a last algorithm we have analysed HighwayHash which has been finalized as of January 2018. It can produce checksums in 64, 128, or 256 bit sizes and significantly outperforms other hashing techniques allowing for speeds over 10GB/sec on a single core (as compared to roughly 1 GB/sec on a single core for Blake2b). We have developed a Golang implementation with accelerations for both Intel and ARM.

Bit rot simulation

Due to the fact that we are interested specifically in bit rot detection, we tried to simulate bit rot by generating messages of varying lengths and by flipping (XOR-ring) a single bit across the full length of the message and computing the hash for each of these permutations. In addition we also run patterns of 2 up to 8 adjacent bits across each byte of each input message.

All hashes are sorted and then the smallest difference is searched for between two consecutive hashes. The number of leading zero bits that this smallest difference has then gives us a measure of how close (or how far apart depending on your point of view) we are to a collision with respect to the size (in number of bits) for the respective hash function.

Results

The following table shows an overview for the number of leading zero bits for the different permutations.

Permutations HighwayHash (64) HighwayHash (128) HighwayHash (256) Blake2b (512) Poly1305 (128) SipHash (128)
9216 27 28 28 26 28 29
18432 31 28 30 28 31 28
36864 33 35 31 33 31 31
73728 32 32 31 30 33 31
147456 33 34 36 36 33 34
294912 37 39 38 36 38 35
589824 41 38 37 37 40 39
1179648 37 39 40 41 40 41
2359296 47 43 42 44 41 45
4718592 44 44 45 48 45 45
9437184 48 45 47 46 47 46
18874368 47 48 48 46 48 51
37748736 53 53 49 50 48 51
75497472 52 53 52 51 53 54
150994944 56 53 56 54 54 53

Graphically represented this gives the following overview:

Hash Comparison Overview

Comparison to earlier results

In earlier research with respect to Blake2b we found that for the YFCC100m dataset (Yahoo Flickr Creative Commons 100M) we obtained a closest collision of 52 equal leading bits (at 100 million scale). Thus it is nice to know that for a different type of dataset the results are similar. Note that these were much larger objects and overall took several (combined) weeks of computing power.

Conclusion

While both SipHash and Poly1305 produce good results that could qualify them for bit rot detection, given the fact that they do not outperform HighwayHash means that there is no incentive for switching.

As can be concluded from the results presented above, HighwayHash produces qualitatively similar results as compared to Blake2b. Given the huge performance difference it thus makes sense to switch. That leaves the question which size of output to use.

Clearly the 64 bits version should not be used as it (at 100 million scale) gets "close" (within 10 bits) of the maximum value of 64 bits.

Although the finalization of HighwayHash-128 is slightly faster as compared to HighwayHash-256, that should not be the primary reason for making the decision.

Detailed Results

Below are the detailed measurements as executed on dual Skylake CPUs (Xeon Gold 6152 with 22 Core @ 2.1 GHz).

h i g h w a y h a s h 6 4
---Permutations |--- Zero bits | Duration
-----------9216 |---------- 27 | 20.381302ms
----------18432 |---------- 31 | 27.571612ms
----------36864 |---------- 33 | 56.052397ms
----------73728 |---------- 32 | 104.601131ms
---------147456 |---------- 33 | 193.915563ms
---------294912 |---------- 37 | 326.372434ms
---------589824 |---------- 41 | 548.609056ms
--------1179648 |---------- 37 | 1.03530732s
--------2359296 |---------- 47 | 3.544088757s
--------4718592 |---------- 44 | 5.984269238s
--------9437184 |---------- 48 | 13.655630501s
-------18874368 |---------- 47 | 37.100825307s
-------37748736 |---------- 53 | 2m1.893114127s
-------75497472 |---------- 52 | 19m7.856193177s
------150994944 |---------- 56 | 2h27m53.158081067s
h i g h w a y h a s h 1 2 8
---Permutations |--- Zero bits | Duration
-----------9216 |---------- 28 | 12.780649ms
----------18432 |---------- 28 | 30.06254ms
----------36864 |---------- 35 | 68.784714ms
----------73728 |---------- 32 | 105.522995ms
---------147456 |---------- 34 | 226.163161ms
---------294912 |---------- 39 | 420.283355ms
---------589824 |---------- 38 | 663.253149ms
--------1179648 |---------- 39 | 1.432212157s
--------2359296 |---------- 43 | 2.781799607s
--------4718592 |---------- 44 | 6.316052298s
--------9437184 |---------- 45 | 17.407566111s
-------18874368 |---------- 48 | 41.393221202s
-------37748736 |---------- 53 | 2m11.951301187s
-------75497472 |---------- 53 | 19m41.172739809s
------150994944 |---------- 53 | 2h28m44.846577785s
h i g h w a y h a s h 2 5 6
---Permutations |--- Zero bits | Duration
-----------9216 |---------- 28 | 42.702422ms
----------18432 |---------- 30 | 73.954159ms
----------36864 |---------- 31 | 121.332344ms
----------73728 |---------- 31 | 222.526971ms
---------147456 |---------- 36 | 425.226972ms
---------294912 |---------- 38 | 840.404466ms
---------589824 |---------- 37 | 1.544777895s
--------1179648 |---------- 40 | 2.721589884s
--------2359296 |---------- 42 | 4.783848222s
--------4718592 |---------- 45 | 9.457394958s
--------9437184 |---------- 47 | 20.228981718s
-------18874368 |---------- 48 | 51.531855485s
-------37748736 |---------- 49 | 2m26.319665407s
-------75497472 |---------- 52 | 19m53.693452686s
------150994944 |---------- 56 | 2h23m55.834082563s
b l a k e 2 b
---Permutations |--- Zero bits | Duration
-----------9216 |---------- 26 | 61.13945ms
----------18432 |---------- 28 | 111.499262ms
----------36864 |---------- 33 | 189.275791ms
----------73728 |---------- 30 | 331.913372ms
---------147456 |---------- 36 | 692.136065ms
---------294912 |---------- 36 | 1.271843096s
---------589824 |---------- 37 | 2.44415668s
--------1179648 |---------- 41 | 4.454261956s
--------2359296 |---------- 44 | 10.528152825s
--------4718592 |---------- 48 | 28.531043005s
--------9437184 |---------- 46 | 1m28.637419899s
-------18874368 |---------- 46 | 5m7.323075536s
-------37748736 |---------- 50 | 18m59.189679076s
-------75497472 |---------- 51 | 1h13m26.851782323s
------150994944 |---------- 54 | 4h48m46.614104626s
p o l y 1 3 0 5
---Permutations |--- Zero bits | Duration
-----------9216 |---------- 28 | 22.267656ms
----------18432 |---------- 31 | 40.13137ms
----------36864 |---------- 31 | 76.420881ms
----------73728 |---------- 33 | 145.6595ms
---------147456 |---------- 33 | 238.326942ms
---------294912 |---------- 38 | 433.122249ms
---------589824 |---------- 40 | 804.927135ms
--------1179648 |---------- 40 | 1.823510133s
--------2359296 |---------- 41 | 6.213754498s
--------4718592 |---------- 45 | 9.823461941s
--------9437184 |---------- 47 | 26.058247821s
-------18874368 |---------- 48 | 1m21.038484996s
-------37748736 |---------- 48 | 4m37.082124066s
-------75497472 |---------- 53 | 24m55.351343578s
------150994944 |---------- 54 | 2h30m9.271630052s
s i p h a s h
---Permutations |--- Zero bits | Duration
-----------9216 |---------- 29 | 18.267322ms
----------18432 |---------- 28 | 37.565362ms
----------36864 |---------- 31 | 67.701024ms
----------73728 |---------- 31 | 134.268518ms
---------147456 |---------- 34 | 284.326987ms
---------294912 |---------- 35 | 336.299086ms
---------589824 |---------- 39 | 725.117538ms
--------1179648 |---------- 41 | 1.836565811s
--------2359296 |---------- 45 | 4.199563088s
--------4718592 |---------- 45 | 12.092629909s
--------9437184 |---------- 46 | 37.705131584s
-------18874368 |---------- 51 | 2m5.958026812s
-------37748736 |---------- 51 | 7m38.738874946s
-------75497472 |---------- 54 | 30m20.270643374s
------150994944 |---------- 53 | 2h35m57.478741043s

hashcompare's People

Contributors

fwessels avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

hashcompare's Issues

Checking perf on Ampere aarch64 server

Hi folks, I'm looking to test HighwayHash on our new Ampere Altra to ensure performance. I'm having some trouble running. I was wondering if you could share your invocations? I am on go version go1.11.13 linux/arm64

$ go get -u github.com/minio/highwayhash
$ go get -u github.com/aead/poly1305
$ go get -u github.com/aead/siphash
$ go get -u github.com/minio/blake2b-simd
$ git clone https://github.com/fwessels/HashCompare.git

I spliced out the AVX512 portions of benchmarks_test.go since we are on aarch64..

$ cd /home/mjm/hash/HashCompare
$ git test
# _/home/mjm/hash/HashCompare.test
2020/08/30 21:16:09 duplicate symbol github.com/minio/highwayhash.zipperMerge (types 1 and 8) in github.com/minio/highwayhash and /home/mjm/go/pkg/linux_arm64/github.com/minio/highwayhash.a(highwayhash_arm6)
FAIL    _/home/mjm/hash/HashCompare [build failed]

I've tried "go clean" as well. Help is appreciated! Thanks

mathemtical aspects

Requirements

Since we use a hash-based bitrot protection approach we need a (hash) function f mapping value of arbitrary length to a value of a fixed length. The function f must satisfy the following properties:

  • The mapping f(x) = y must be deterministic. That means that f(x) always evaluates to y and not to z at some point in time
  • The evaluation of f with k different randomly chosen inputs x1, ..., xk should differ with probability P = e-((k * (k-1)) / (2 * 2n)) where n is length of the output values of f.

The 1. requirement is kind of obvious - if f is not deterministic it is not possible to reliable verify the integrity of data. The 2. requirement is derived from the strong collision requirement. However in contrast to the general birthday problem we are not interested in "constructed" collisions. Therefore we define a requirement violation as a collisions of randomly chosen inputs. A constructed collision would imply and intelligent adversary exploiting a mathematical weakness of a given mapping f to find collisions with a probability P' > P. However bitrot protection is not designed to withstand an attacker because the adversary can simple modify the data and replace the stored hash value with the hash value of the modified data.

Random Oracle and PRF

A cryptographic secure hash function - like BLAKE2b-512 - always satisfies our requirements since it provides strong collision resistance which does not restrict the input values to be chosen randomly.
A random oracle and following a PRF (like SipHash / HighwayHash) also satisfies our requirements since a PRF can be constructed from our function f` using the following scheme:

  1. Generate a set of uniformly at random chosen keys. |ki| = |xi|
  2. yi = f(ki โŠ• xi)

This implies that SipHash, HighwayHash and BLAKE2b can be used for bitrot protection because of their mathematical properties. (Since we discarded Poly1305 anyway I don't discuss it here further)

However SipHash and HigwayHash are PRFs and require a secret key. The secret key can be securely omitted if the PRF provides strong collision resistance for a fixed (but unkown) secret key which is the case for SipHash and HighwayHash.

Choice of hash size

The following list shows the number of data modifications that can happened before the probability for an undetected bitrot exceeds 50%:

  • SipHash-64:
    For k = 232 + 229 + 228 the probability for a collision Pc = 1-e-((k * (k-1)) / (2 * 2n)) = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 5100 million computations - roughly speaking 5 billion objects with bitrot in our case.
  • SipHash-128 / HighwayHash-128:
    For k = 264 + 261 + 260 the probability for a collision Pc = 1-e-((k * (k-1)) / (2 * 2n)) = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 21905508 trillion computations - roughly speaking 21905508 trillion objects with bitrot in our case.
  • HighwayHash-256:
    For k = 2128 + 2125 + 2124 the probability for a collision Pc = 1-e-((k * (k-1)) / (2 * 2n)) = 0.506 ~= 50% That would mean that the probability for one undetected bitrot is about 50% after 4.04ร—1038 computations - This is far beyond practical computation and can be treated as impossible.

Notice that this numbers refer to all minio instances. So for example if we talk about 5 billion objects with bitrot than we mean 5 billion distributed over all minio instances all over the world.
Basically 50% chance of undetected bitrot on one minio server somewhere on the planet earth.
(Except aliens are using minio ๐Ÿ˜‰)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.