Comments (5)
Ah, what's going on is that while the number of minhash bits is correct, the number of minhash buckets is not. I think that's actually a bug in the final space bound in the paper. The standard minhash bound of error on the order 1/sqrt(k) is actually an absolute error. So if you're using 2^16 buckets, then the minhash error for the Jaccard index (not even using hyperminhash compression) is going to be on the order of 1/256 ~ 0.004 asymptotically. The analysis in the paper focuses on the number of additional collisions from using hyperminhash compression, but the number of buckets still determines an absolute error in Jaccard index estimation.
Thanks for catching that.
from hyperminhash.
If I understand correctly, 16 index_bits should provide 0.4% absolute error when estimating the Jaccard index, correct?
from hyperminhash.
BTW, I believe you meant to use union_filled_buckets on line 257.
from hyperminhash.
Mostly yes. The 16 index_bits for standard MinHash gives a 0.004 standard deviation on the Jaccard index so long as there are no MinHash collisions. (i.e. so with 95% probability, it'll be within 0.8% absolute error, as that's 2 standard deviations).
The analysis in my paper was mostly bounding the additional error that comes from doing a floating point compression on the hash values. So you also need sufficient minhash_bits in order to prevent the floating point compression from adding to the error.
Does that make sense?
Also, thanks for the catch! Yes, I did.
from hyperminhash.
Yes, thank you. I was able to explain my remaining mysteries with the absolute vs relative error confusion. Thanks again for providing a reference implementation, it has been been very helpful to compare against.
from hyperminhash.
Related Issues (3)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hyperminhash.