Coder Social home page Coder Social logo

hyperminhash's Issues

Number of minhash_bits for 1% relative error for 0.1% similarity.

From the paper I need -- minhash_bits > log(6 / (0.001 * 0.01))

In [313]: log(6/(0.001 * 0.01))                                                                                                                                           
Out[313]: 13.304684934198283

So 16 bits for minhash should be more than adequate? But when I test, the error is significantly higher. What am I missing?

seed a n_keys 1048576 mod 999 ratio 0.001 index_bits 16 minhash_bits 16
...
hll - count (      1024       1025          1) intersect          1 target          1 error 2.086
hmh - count (      1024       1025          1) intersect          1 target          1 error 1.578
hll - count (      2048       2046          2) intersect          2 target          2 error 1.823
hmh - count (      2048       2046          2) intersect          2 target          2 error 0.812
hll - count (      4096       4094          4) intersect          4 target          4 error 1.277
hmh - count (      4096       4094          4) intersect          4 target          4 error 0.738
hll - count (      8192       8196          8) intersect          8 target          8 error 0.108
hmh - count (      8192       8196          8) intersect          9 target          8 error 3.890
hll - count (     16384      16385         16) intersect         16 target         16 error 4.372
hmh - count (     16384      16385         16) intersect         17 target         16 error 3.474
hll - count (     32768      32734         32) intersect         32 target         33 error 2.649
hmh - count (     32768      32734         32) intersect         33 target         33 error 0.804
hll - count (     65536      65643         65) intersect         67 target         66 error 2.136
hmh - count (     65536      65643         65) intersect         68 target         66 error 3.868
hll - count (    131072     131252        131) intersect        127 target        131 error 2.898
hmh - count (    131072     131252        131) intersect        150 target        131 error 14.813
hll - count (    262144     263361        263) intersect        204 target        262 error 22.319
hmh - count (    262144     263361        263) intersect        278 target        262 error 6.163
hll - count (    524288     524280        525) intersect        360 target        524 error 31.415
hmh - count (    524288     524280        525) intersect        504 target        524 error 3.837
hll - count (   1048576    1056910       1051) intersect        985 target       1049 error 6.062
hmh - count (   1048576    1056910       1051) intersect       1145 target       1049 error 9.197

Test code here: https://github.com/kportertx/hyperminhash/blob/master/test3.py

Exception when taking intersection where both sets are empty.

I would expect the cardinality of this intersection to be approximately 0 but instead I get an exception in python:

# test.py

from hyperminhash import HyperMinHash

hmh = HyperMinHash(8, 6, 10, collision_correction="false")
hmh1 = HyperMinHash(8, 6, 10, collision_correction="false")

print("hmh intersect", hmh.intersection(hmh1))
python test.py
hyperminhash/hyperminhash.py:276: RuntimeWarning: invalid value encountered in long_scalars
  jaccard = intersect_size / union.filled_buckets()
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print("hmh intersect", hmh.intersection(hmh1))
  File "hyperminhash/hyperminhash.py", line 291, in intersection
    intersect_size = int(np.round(jaccard * union.filled_buckets()))
ValueError: cannot convert float NaN to integer

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.