simhash-js

A JavaScript implementation of Charikar's hash for identifying similar documents.

What is Simhash?

Consider two documents A and B that differ in just a single byte.

Cryptographic hash functions such as SHA-2 or MD5 will hash the contents of these two documents into two completely different and unrelated hash values, so the Hamming distance between md5(A) and md5(B) would be large. This is by design: such functions aim to make even a one-byte change in the input flip roughly half of the output bits (the avalanche effect), which helps make collisions hard to find.
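To make this concrete, here is a minimal sketch that counts how many bits differ between the digests of two strings that differ in a single byte. It uses Node's built-in crypto module with SHA-256 (one member of the SHA-2 family); the helper names sha256Bytes and bitDifference are illustrative and are not part of simhash-js.

```javascript
const crypto = require('crypto');

// Digest a string with SHA-256 and return the raw bytes as a Buffer.
function sha256Bytes(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest();
}

// Count the bits that differ between two equal-length buffers.
function bitDifference(bufA, bufB) {
  let bits = 0;
  for (let i = 0; i < bufA.length; i++) {
    let x = bufA[i] ^ bufB[i];
    while (x !== 0) {
      x &= x - 1; // clear the lowest set bit
      bits++;
    }
  }
  return bits;
}

// Two inputs that differ in exactly one byte ("t" versus "T").
const digestA = sha256Bytes('This is a test of the Emergency Blogcast System');
const digestB = sha256Bytes('This is a tesT of the Emergency Blogcast System');

console.log(bitDifference(digestA, digestB) + ' of ' + digestA.length * 8 + ' bits differ');
```

For a well-behaved cryptographic hash, the printed count is typically close to half of the 256 output bits.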

By contrast, Simhash will hash contents of A and B to similar hash values. The Hamming distance between simhash(A) and simhash(B) would be small.
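The idea behind this behaviour can be sketched in a few lines. The following is a simplified illustration of Charikar-style simhash, not the code in this repository: it assumes whitespace-separated words as features, equal weights for every feature, a 32-bit fingerprint, and FNV-1a as the per-feature hash (applied to UTF-16 code units, which matches the byte-wise definition only for ASCII input). The function names fnv1a32 and simhash32 are hypothetical.

```javascript
// 32-bit FNV-1a over the string's character codes (an assumed feature hash).
function fnv1a32(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Simplified simhash: every word votes +1 or -1 on each of the 32 bit
// positions, and the sign of each tally becomes one fingerprint bit.
function simhash32(text) {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  const votes = new Array(32).fill(0);
  for (const token of tokens) {
    const h = fnv1a32(token);
    for (let bit = 0; bit < 32; bit++) {
      votes[bit] += ((h >>> bit) & 1) === 1 ? 1 : -1;
    }
  }
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (votes[bit] > 0) {
      fingerprint |= 1 << bit;
    }
  }
  return fingerprint >>> 0; // normalise to an unsigned 32-bit value
}

console.log(simhash32('This is a test of the Emergency Blogcast System'));
console.log(simhash32('This is a second test of the Emergency Blogcast System'));
```

Because most words are shared between the two inputs, most bit tallies are dominated by the same votes, which is why the resulting fingerprints tend to differ in only a few bits.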

Usage

var sjs = require('simhash-js');
var simhash = new sjs.SimHash();
var x = simhash.hash("This is a test of the Emergency Blogcast System");
var y = simhash.hash("This is a second test of the Emergency Blogcast System");

var s = sjs.Comparator.similarity(x, y); 

To Do

  • Implement an efficient priority queue
  • Accept a list of stop words to be removed from input prior to calculating hash

References

  • Charikar: Similarity Estimation Techniques from Rounding Algorithms, in Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, ACM Press, 2002
  • Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling, in Proceedings of the 16th International Conference on World Wide Web, ACM Press, 2007

Contributors

Sincere thanks to:

simhash-js's People

Contributors

dverstee, vkandy, xblanc33

simhash-js's Issues

License?

Hi, what's the license of this code? (Please say MIT License.) I would like to use this in one of my projects. Thanks.

npm install

Hi,

It would be awesome to publish simhash-js to npm.

Thanks

The similarity is not what I expected

TEST 1:
let a = 'the cat sat on the mat'
let b = 'the cat sat on a mat'
let h1 = simhash.hash(a)
let h2 = simhash.hash(b)
console.log(h1, h2, sjs.Comparator.similarity(h1, h2))
// ------------ result
// 687990018 690349194 0.16666666666666666

TEST 2:
let a = 'the cat sat on the mat'
let b = 'xxx xxxxxx xx xxx xxx'
let h1 = simhash.hash(a)
let h2 = simhash.hash(b)
console.log(h1, h2, sjs.Comparator.similarity(h1, h2))
// -------------- result
// 687990018 236331081 0.23529411764705882
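As an independent cross-check of the numbers reported above, the Hamming distance between the printed hashes can be computed directly. This is only a sketch: it assumes the hashes are 32-bit values, the helper names hammingDistance and hammingSimilarity are hypothetical, and it does not describe how sjs.Comparator.similarity is implemented.

```javascript
// Number of differing bits between two 32-bit integers.
function hammingDistance(a, b) {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x !== 0) {
    x = (x & (x - 1)) >>> 0; // clear the lowest set bit
    count++;
  }
  return count;
}

// Fraction of the 32 bit positions on which the two hashes agree.
function hammingSimilarity(a, b) {
  return 1 - hammingDistance(a, b) / 32;
}

// Hash values copied from TEST 1 and TEST 2 above.
console.log(hammingDistance(687990018, 690349194), hammingSimilarity(687990018, 690349194));
// 5 0.84375
console.log(hammingDistance(687990018, 236331081), hammingSimilarity(687990018, 236331081));
// 16 0.5
```

Measured this way, the two hashes from TEST 1 differ in 5 of 32 bits while those from TEST 2 differ in 16, so the fingerprints themselves look reasonable and the surprising scores appear to come from the comparison step.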

privilege problem & something seems wrong

First, there is an access problem with the method "similarity": it is declared with var, so it cannot be accessed from outside. I solved this problem, but there is a second one.

Second, calling "simhash.of(xxx)" always returns 0, so I can't get any meaningful result.
By the way, the Jenkins hash returns 0 for many inputs; I don't know whether that is correct.
