Comments (6)

kportertx commented on August 19, 2024

Sorry for jerking this back and forth. Works as expected when only one set is empty, fails when both are empty.

test2.py

from hyperminhash import HyperMinHash

hmh = HyperMinHash(8, 6, 10, collision_correction="false")
hmh1 = HyperMinHash(8, 6, 10, collision_correction="false")

n_keys = 1000000

hmh.update(("{}".format(v) for v in range(n_keys)))

print("hmh intersect", hmh.intersection(hmh1))
print("hmh1 intersect", hmh1.intersection(hmh))

$ python test2.py
hmh intersect (0.0, 0.0, 0, 1015900.7572943214)
hmh1 intersect (0.0, 0.0, 0, 1015900.7572943214)

from hyperminhash.

kportertx commented on August 19, 2024

Here is a potential way of solving this issue:
https://github.com/kportertx/hyperminhash/pull/1/files

The "_is_empty" flag could be useful in union and count as well.

yunwilliamyu commented on August 19, 2024

Ah, thanks for finding this edge case!

Instead of adding a new _is_empty flag as you suggest in the pull request, I'm personally leaning towards just checking whether filled_buckets == 0, which is equivalent to _is_empty and doesn't require any extra propagation code in methods like add.
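
As a rough illustration of that design choice, here is a toy sketch in which emptiness is derived from the bucket contents instead of being tracked in a separate flag. This is not the hyperminhash code: the `ToySketch` class, its `registers` list, and the hashing scheme are all hypothetical and exist only for this example.

```python
import hashlib


class ToySketch:
    """Toy bucketed sketch; hypothetical, not the hyperminhash implementation."""

    def __init__(self, n_buckets=16):
        # 0 marks a bucket that has never been written.
        self.registers = [0] * n_buckets

    def add(self, item):
        digest = int.from_bytes(hashlib.sha1(str(item).encode()).digest(), "big")
        bucket = digest % len(self.registers)
        # Stored values are always >= 1, so 0 stays reserved for "unfilled".
        value = (digest // len(self.registers)) % 255 + 1
        self.registers[bucket] = max(self.registers[bucket], value)

    @property
    def filled_buckets(self):
        # Count the buckets that have received at least one item.
        return sum(1 for r in self.registers if r != 0)

    def is_empty(self):
        # Derived from state, so add() and merge code need no extra bookkeeping.
        return self.filled_buckets == 0
```

Because `is_empty()` is computed from the registers, any operation that writes to them keeps it correct automatically, which is the appeal over a flag that every mutating method would have to update.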

Thanks again for bringing it to my attention; I'll implement a fix this weekend.

kportertx commented on August 19, 2024

No problem, and thank you for providing a reference implementation for your paper.

yunwilliamyu commented on August 19, 2024

I just implemented a check in the Jaccard index code to return 0 when both sets are empty.

Of course, the Jaccard index is undefined when both sets are empty, because it's 0/0. However, we simply do not want to return a NaN. Defining it as 0 in that case lets the intersection code operate as normal without any extra exception checking.

kportertx commented on August 19, 2024

After more thought on this, I think NaN may be the most appropriate response.

It may be true that two empty sets are the same, but whether it should return 0 or 1 is use-case dependent, and in most cases this should probably be an error. NaN provides the most information, since this scenario likely wasn't intended. I have opted to return float('NaN').
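
A minimal sketch of the trade-off, using exact Python sets rather than HyperMinHash sketches. The `jaccard` helper and its `empty_value` parameter are hypothetical names for illustration, not part of the library:

```python
import math


def jaccard(a, b, empty_value=float("nan")):
    """Exact Jaccard index of two Python sets.

    empty_value is returned for the 0/0 case, when both sets are empty.
    """
    union_size = len(a | b)
    if union_size == 0:
        return empty_value
    return len(a & b) / union_size


print(jaccard({1, 2, 3}, {2, 3, 4}))           # 2 shared of 4 total -> 0.5
print(jaccard(set(), {1}))                     # one empty set -> 0.0
print(math.isnan(jaccard(set(), set())))       # both empty -> True
print(jaccard(set(), set(), empty_value=0.0))  # caller opts in to 0 -> 0.0
```

Note that NaN compares unequal to everything, including itself, so callers need `math.isnan` rather than `==` to detect the both-empty case.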
