Coder Social home page Coder Social logo

hyperminhash's Introduction

hyperminhash

LogLog scaling version of MinHash by combining ideas from HyperLogLog and b-bit MinHash

Please cite: Yun William Yu & Griffin Weber. HyperMinHash: MinHash in LogLog space. (2017)
Preprint: https://arxiv.org/abs/1710.08436

Code consists of a class hyperminhash.HyperMinHash that allows:

  • __init__: specify the size and parameters of the sketch
  • update: add hashable items to sketch
  • count: estimate the count-distinct cardinality of the sketch
  • filled_buckets: return the number of buckets that have an item (mostly used for internal algorithms)
  • __add__: given two sketches, A and B, A+B merges the sketches such that every item in A or in B is now in the sum.
  • jaccard: given two sketches, A and B, returns the Jaccard index
  • serialize: returns a ByteString that can be deserialized into the original object
  • deserialize: initializes a HyperMinHash sketch based on a serialized ByteString
  • intersection: returns the intersection cardinality, Jaccard index, number of bucket matches, and union cardinality when combining two sketches.

Full details are in the Python Docstrings. Of note, the Python implementation is not fully space-optimal, as that would require bit packing. Instead, we use size uint8, uint16, uint32, uint64 types from numpy in the implementation. Note also that we depend on Python3.

To test the class, we also provide an experiments/ directory.

  • tests_full.py will regenerate data allowing recreation of Figure 6 in the paper, though it may take weeks on standard workstations. Note that you can edit the "test_reps" parameter that is passed to the hmh_test_range function within tests_full.py to a smaller number to either increase speed/decrease accuracy by running fewer repetitions, or decrease speed/increase accuracy by running additional repetitions.
  • error_plot_full.py assumes that the current directory has the output of tests_full.py, and will generate a nice matplotlib graph.

We also provide precomputed data of the type generated by tests_*.py. To use these, go to experiments_precomputed/ and run bash regen.sh.

Unit tests are in hyperminhash_tests.py, using the unittest framework.


Thanks to Seif Lotfy for providing a Golang implementation of HyperMinHash: https://github.com/axiomhq/hyperminhash

hyperminhash's People

Contributors

yunwilliamyu avatar

Stargazers

Jianshu_Zhao avatar  avatar Haodi Zhong avatar Halu avatar  avatar yanyusong avatar  avatar zhouxss avatar khigh avatar Philippe Ombredanne avatar Josh Mize avatar Grant Williams avatar  avatar  avatar Michael Bargury avatar John S. Dvorak avatar Mayank Asthana avatar Jie Zhu avatar Kevin Porter avatar Okada Haruki avatar Shi Pengcheng avatar Jaye Doepke avatar Luke Hatcher avatar Martin Steinegger avatar Kyle Napierkowski avatar Thomas Hill avatar Josh Curtis avatar Bob avatar Don Sprague avatar Yuan-Kuei Hsueh avatar  avatar Krishna Srinivasan avatar  avatar Kashif Rasul avatar jd avatar Daphne Ippolito avatar Roderick Bovee avatar Lee Bergstrand avatar Adrian Viehweger avatar peterdfields avatar Phelim Bradley avatar Dr. K. D. Murray avatar myth avatar Yue Han avatar Santiago Gepigon III avatar Luiz Irber avatar Austin Richardson avatar Nick Greenfield avatar Roye Rozov avatar Karel Břinda avatar

Watchers

Luiz Irber avatar James Cloos avatar Josh Curtis avatar  avatar paper2code - bot avatar

hyperminhash's Issues

Exception when taking intersection where both sets are empty.

I would expect the cardinality of this intersection to be approximately 0 but instead I get an exception in python:

# test.py

from hyperminhash import HyperMinHash

hmh = HyperMinHash(8, 6, 10, collision_correction="false")
hmh1 = HyperMinHash(8, 6, 10, collision_correction="false")

print("hmh intersect", hmh.intersection(hmh1))
python test.py
hyperminhash/hyperminhash.py:276: RuntimeWarning: invalid value encountered in long_scalars
  jaccard = intersect_size / union.filled_buckets()
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print("hmh intersect", hmh.intersection(hmh1))
  File "hyperminhash/hyperminhash.py", line 291, in intersection
    intersect_size = int(np.round(jaccard * union.filled_buckets()))
ValueError: cannot convert float NaN to integer

Number of minhash_bits for 1% relative error for 0.1% similarity.

From the paper I need -- minhash_bits > log(6 / (0.001 * 0.01))

In [313]: log(6/(0.001 * 0.01))                                                                                                                                           
Out[313]: 13.304684934198283

So 16 bits for minhash should be more than adequate? But when I test, the error is significantly higher. What am I missing?

seed a n_keys 1048576 mod 999 ratio 0.001 index_bits 16 minhash_bits 16
...
hll - count (      1024       1025          1) intersect          1 target          1 error 2.086
hmh - count (      1024       1025          1) intersect          1 target          1 error 1.578
hll - count (      2048       2046          2) intersect          2 target          2 error 1.823
hmh - count (      2048       2046          2) intersect          2 target          2 error 0.812
hll - count (      4096       4094          4) intersect          4 target          4 error 1.277
hmh - count (      4096       4094          4) intersect          4 target          4 error 0.738
hll - count (      8192       8196          8) intersect          8 target          8 error 0.108
hmh - count (      8192       8196          8) intersect          9 target          8 error 3.890
hll - count (     16384      16385         16) intersect         16 target         16 error 4.372
hmh - count (     16384      16385         16) intersect         17 target         16 error 3.474
hll - count (     32768      32734         32) intersect         32 target         33 error 2.649
hmh - count (     32768      32734         32) intersect         33 target         33 error 0.804
hll - count (     65536      65643         65) intersect         67 target         66 error 2.136
hmh - count (     65536      65643         65) intersect         68 target         66 error 3.868
hll - count (    131072     131252        131) intersect        127 target        131 error 2.898
hmh - count (    131072     131252        131) intersect        150 target        131 error 14.813
hll - count (    262144     263361        263) intersect        204 target        262 error 22.319
hmh - count (    262144     263361        263) intersect        278 target        262 error 6.163
hll - count (    524288     524280        525) intersect        360 target        524 error 31.415
hmh - count (    524288     524280        525) intersect        504 target        524 error 3.837
hll - count (   1048576    1056910       1051) intersect        985 target       1049 error 6.062
hmh - count (   1048576    1056910       1051) intersect       1145 target       1049 error 9.197

Test code here: https://github.com/kportertx/hyperminhash/blob/master/test3.py

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.