Coder Social home page Coder Social logo

Comments (5)

yunwilliamyu avatar yunwilliamyu commented on September 28, 2024

Ah, what's going on is that while the number of minhash bits is correct, the number of minhash buckets is not. I think that's actually a bug in the final space bound in the paper. The standard minhash bound of error on the order 1/sqrt(k) is actually an absolute error. So if you're using 2^16 buckets, then the minhash error for the Jaccard index (not even using hyperminhash compression) is going to be on the order of 1/256 ~ 0.004 asymptotically. The analysis in the paper focuses on the number of additional collisions from using hyperminhash compression, but the number of buckets still determines an absolute error in Jaccard index estimation.

Thanks for catching that.

from hyperminhash.

kportertx avatar kportertx commented on September 28, 2024

If I understand correctly, 16 index_bits should provide 0.4% absolute error when estimating the Jaccard index, correct?

from hyperminhash.

kportertx avatar kportertx commented on September 28, 2024

BTW, I believe you meant to use union_filled_buckets on line 257.

from hyperminhash.

yunwilliamyu avatar yunwilliamyu commented on September 28, 2024

Mostly yes. The 16 index_bits for standard MinHash gives a 0.004 standard deviation on the Jaccard index so long as there are no MinHash collisions. (i.e. so with 95% probability, it'll be within 0.8% absolute error, as that's 2 standard deviations).

The analysis in my paper was mostly bounding the additional error that comes from doing a floating point compression on the hash values. So you also need sufficient minhash_bits in order to prevent the floating point compression from adding to the error.

Does that make sense?

Also, thanks for the catch! Yes, I did.

from hyperminhash.

kportertx avatar kportertx commented on September 28, 2024

Yes, thank you. I was able to explain my remaining mysteries with the absolute vs relative error confusion. Thanks again for providing a reference implementation, it has been been very helpful to compare against.

from hyperminhash.

Related Issues (3)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.