yunwilliamyu / hyperminhash
LogLog space version of MinHash by combining ideas from HyperLogLog and b-bit MinHash
License: Creative Commons Zero v1.0 Universal
According to the paper, I need minhash_bits > log(6 / (0.001 * 0.01)):
In [313]: log(6/(0.001 * 0.01))
Out[313]: 13.304684934198283
So 16 bits for minhash should be more than adequate? But when I test, the error is significantly higher. What am I missing?
seed a n_keys 1048576 mod 999 ratio 0.001 index_bits 16 minhash_bits 16
...
hll - count ( 1024 1025 1) intersect 1 target 1 error 2.086
hmh - count ( 1024 1025 1) intersect 1 target 1 error 1.578
hll - count ( 2048 2046 2) intersect 2 target 2 error 1.823
hmh - count ( 2048 2046 2) intersect 2 target 2 error 0.812
hll - count ( 4096 4094 4) intersect 4 target 4 error 1.277
hmh - count ( 4096 4094 4) intersect 4 target 4 error 0.738
hll - count ( 8192 8196 8) intersect 8 target 8 error 0.108
hmh - count ( 8192 8196 8) intersect 9 target 8 error 3.890
hll - count ( 16384 16385 16) intersect 16 target 16 error 4.372
hmh - count ( 16384 16385 16) intersect 17 target 16 error 3.474
hll - count ( 32768 32734 32) intersect 32 target 33 error 2.649
hmh - count ( 32768 32734 32) intersect 33 target 33 error 0.804
hll - count ( 65536 65643 65) intersect 67 target 66 error 2.136
hmh - count ( 65536 65643 65) intersect 68 target 66 error 3.868
hll - count ( 131072 131252 131) intersect 127 target 131 error 2.898
hmh - count ( 131072 131252 131) intersect 150 target 131 error 14.813
hll - count ( 262144 263361 263) intersect 204 target 262 error 22.319
hmh - count ( 262144 263361 263) intersect 278 target 262 error 6.163
hll - count ( 524288 524280 525) intersect 360 target 524 error 31.415
hmh - count ( 524288 524280 525) intersect 504 target 524 error 3.837
hll - count ( 1048576 1056910 1051) intersect 985 target 1049 error 6.062
hmh - count ( 1048576 1056910 1051) intersect 1145 target 1049 error 9.197
Test code here: https://github.com/kportertx/hyperminhash/blob/master/test3.py
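One thing that may be worth ruling out is the base of the logarithm. The IPython value above is a natural log. The sketch below evaluates the same expression in both bases; whether the paper's bound is meant in base 2 is an assumption here, not something confirmed in this thread, and the names `t` and `eps` are my own labels for the two factors (reading 0.001 as the `ratio` from the run above).

```python
import math

# Assumed reading of the two factors in the denominator (labels are hypothetical):
t = 0.001    # the "ratio" used in the test run
eps = 0.01   # the second factor, presumably the error target

arg = 6 / (t * eps)

bits_natural = math.log(arg)   # what the IPython session computed
bits_base2 = math.log2(arg)    # the same bound if the log is base 2 (assumption)

print(f"natural log: {bits_natural:.3f}")
print(f"base-2 log:  {bits_base2:.3f}")
print(f"whole bits needed under the base-2 reading: {math.ceil(bits_base2)}")
```

If the base-2 reading is the right one, 16 minhash bits would fall short of the bound, which might account for part of the gap. That is only a candidate explanation.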
Line 177 in 5d88ff4
I would expect the cardinality of this intersection to be approximately 0, but instead I get an exception in Python:
# test.py
from hyperminhash import HyperMinHash

# Both sketches use the same parameters and are left empty (nothing is added).
hmh = HyperMinHash(8, 6, 10, collision_correction="false")
hmh1 = HyperMinHash(8, 6, 10, collision_correction="false")

print("hmh intersect", hmh.intersection(hmh1))
python test.py
hyperminhash/hyperminhash.py:276: RuntimeWarning: invalid value encountered in long_scalars
  jaccard = intersect_size / union.filled_buckets()
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print("hmh intersect", hmh.intersection(hmh1))
  File "hyperminhash/hyperminhash.py", line 291, in intersection
    intersect_size = int(np.round(jaccard * union.filled_buckets()))
ValueError: cannot convert float NaN to integer
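The warning and traceback suggest a 0 / 0 division: with two empty sketches the union has no filled buckets, so the Jaccard estimate becomes NaN and `int(...)` then fails. A minimal sketch of one possible guard follows. The helper names `safe_jaccard` and `to_count` are hypothetical and not part of the library, and returning 0 for an empty union is an assumption about the desired behaviour.

```python
import math


def safe_jaccard(intersect_size: int, filled_buckets: int) -> float:
    """Hypothetical guard: report a Jaccard index of 0.0 for an empty union."""
    if filled_buckets == 0:
        # No filled buckets means the ratio would be 0 / 0.
        return 0.0
    return intersect_size / filled_buckets


def to_count(value: float) -> int:
    """Hypothetical guard: int(nan) raises ValueError, so map NaN to 0 first."""
    if math.isnan(value):
        return 0
    return int(round(value))


print(safe_jaccard(0, 0))
print(to_count(float("nan")))
```

With a guard like this, the intersection of two empty sketches would come back as 0 rather than raising.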