lsh-semantic-similarity's Introduction

Locality Sensitive Hashing for semantic similarity

LSH (Locality Sensitive Hashing) is primarily used to find near-duplicates within a large set of documents. It can work with Hamming distance, the Jaccard coefficient, edit distance, or other notions of distance.

You can read the following tutorials if you want to understand more about it:

Although LSH is better suited to finding duplicated documents than semantically similar ones, this approach attempts to use LSH to estimate semantic similarity between texts. To do so, the algorithm uses TF-IDF to extract each text's main tokens (alternatively, you can pre-calculate them and pass them as a parameter). It uses MinHash, which estimates Jaccard similarity, as the similarity function.
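To make the token-extraction step concrete, here is a minimal sketch of picking each text's main tokens by TF-IDF weight using only the standard library. The names `tokenize`, `main_tokens` and `top_k` are hypothetical and are not taken from this repository; the smoothed IDF formula is one common variant, assumed here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def main_tokens(docs, top_k=3):
    """Return, for each document, the set of its top_k tokens by TF-IDF weight.

    Hypothetical helper for illustration; not this repository's API.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each token appears in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        # Term frequency times a smoothed IDF (assumed variant), so tokens
        # present in every document still receive a positive weight.
        scores = {
            t: (c / len(toks)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(set(ranked[:top_k]))
    return result


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "hashing hashing hashing the",
]
print(main_tokens(docs, top_k=1)[2])
```

The resulting token sets are what a MinHash step would then compare. In practice you would probably also drop stop words before ranking, since frequent words such as "the" can still score highly in short texts.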

The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture similarity between objects: similar documents have a high probability of sharing a hash value. For a single MinHash function, the probability that two sets collide is exactly their Jaccard similarity.
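The collision property can be illustrated with a small, self-contained sketch. It builds MinHash signatures from a family of random linear hash functions, estimates Jaccard similarity as the fraction of matching signature positions, and groups signatures into bands so that only documents sharing a band become candidate pairs. All names here (`make_hash_params`, `minhash_signature`, `candidate_pairs`, and the `num_perm` and `bands` parameters) are assumptions made for this example, not this repository's implementation.

```python
import hashlib
import itertools
import random
from collections import defaultdict

_PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit token hash


def _token_hash(token):
    """Stable 32-bit hash of a token (built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:4], "big")


def make_hash_params(num_perm=128, seed=42):
    """Draw (a, b) coefficients for num_perm hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(num_perm)]


def minhash_signature(tokens, params):
    """Signature = minimum hash value per hash function. Assumes tokens is non-empty."""
    hashed = [_token_hash(t) for t in set(tokens)]
    return [min((a * x + b) % _PRIME for x in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def candidate_pairs(signatures, bands=32):
    """Return pairs of names whose signatures are identical in at least one band.

    Assumes every signature has the same length and that length >= bands.
    """
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for i in range(bands):
            buckets[(i, tuple(sig[i * rows:(i + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(itertools.combinations(sorted(names), 2))
    return pairs


params = make_hash_params()
docs = {
    "a": ["lsh", "minhash", "jaccard", "hash", "tokens", "similarity"],
    "b": ["lsh", "minhash", "jaccard", "hash", "tokens", "similarity"],
    "c": ["banana", "apple", "cherry"],
}
sigs = {name: minhash_signature(toks, params) for name, toks in docs.items()}
print(estimate_jaccard(sigs["a"], sigs["b"]), jaccard(docs["a"], docs["b"]))
print(candidate_pairs(sigs))
```

Only the pairs returned by `candidate_pairs` need a full comparison, which is where the saving in comparisons comes from. More hash functions give a tighter estimate; the split into bands and rows trades missed pairs against false candidates.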

See this tutorial to learn how to use this LSH implementation.

Run the following command to install the dependencies:

  python3 -m pip install -r requirements.txt

lsh-semantic-similarity's People

Contributors

italo-batista
