Coder Social home page Coder Social logo

vectorizedminhash's Introduction

vectorizedMinHash

A small toolkit for very efficiently comparing the similarity of large numbers of documents or other data structures that can be represented as sets. Core features:

  • Very fast construction of MinHash "fingerprints" of sets. The algorithm is inspired by the MinHash implementation in datasketch, but the core MinHash algorithm is vectorized in numpy and includes CUDA support via cupy.
  • Jaccard similarity index estimation
  • Cardinality estimation (with bias correction for much better accuracy)
  • Union operations between fingerprints

The module also includes some simple functions for quickly converting text to token or n-gram-based hashes.

Requirements

  • numpy
  • cupy and CUDA for GPU acceleration (optional)

Installation

I haven't made it pip installable yet, so you'll just have to grab the repository files for now.

How to use

Generating fingerprints

The module is built around a VectorizedMinHash hasher object. To construct fingerprint from a string using CUDA support:

from vectorizedMinHash import VectorizedMinHash,fastNGramHashes

hasher = VectorizedMinHash(n_perm=256,mirror=True)


testString = 'This is a test string'

# 1) Construct a fingerprint from n-gram features
# (fastNGramHashes converts bytes directly to n-gram ids by changing the stride of the dtype)
ngramHashes = fastNGramHashes(testString.encode('ascii'),n=4)
fingerprint = hasher.fingerprint(ngramHashes,cuda=True)

# 2) Construct a fingerprint from tokens (split on whitespace by default)
tokenHashes = tokenHashes(testString.encode('ascii'))
fingerprint = hasher.fingerprint(ngramHashes,cuda=True)

# 3) Construct a fingerprint from tokens (split on non-alphanumeric chars)
tokenHashes = tokenHashes(testString.encode('ascii'),tokenRE=rb'\w+')
fingerprint = hasher.fingerprint(tokenHashes,cuda=True)

You can also compute a fingerprint from a more generic array of integer ids (must be representable as a np.uint32 array)

# Construct a fingerprint from element ids
ids = range(10)
fingerprint = hasher.fingerprint(ids,cuda=True)

The constructor's most important parameter is n_perm, which sets the size of the fingerprints. Larger fingerprints are more accurate, but require more processing time and memory to store. mirror doubles the length of the fingerprint for a given n_perm by using the max operation as in addition to min. In my experiments this saves processing time and improves accuracy, so it is the default setting.

A fingerprint is a simple np.uint32 array. It doesn't remember what hasher made it, so be careful to only compare fingerprints made with exactly the same settings.

Merging fingerprints with union

Fingerprints can be merged with the union function. This operation is equivalent to taking the union (or concatenating all the hash values) of the input sets before fingerprinting. Behind the scenes, it just stacks the fingerprints into an 2-d array and takes the min.

from vectorizedMinHash import union

new_fingerprint = union([fingerprint0,fingerprint1,fingerprint2])

Fingerprints constructed via union are indistinguishable from other fingerprints and it is perfectly valid to use them in jaccard or cardinality estimates.

Jaccard similarity index

The Jaccard similarity index between two sets can be estimated by the fraction of equal values in the corresponding fingerprints. The module contains two helper functions: jaccard and jaccardMatrix:

from vectorizedMinHash import jaccard, jaccardMatrix

# Compute jaccard similarity index for two fingerprints:
jac = jaccard(fingerprint0,fingerprint1)

# Compute all pairwise similarities for three fingerprints
jac_matrix = jaccardMatrix([fingerprint0,fingerprint1,fingerprint2])

cardinality estimates

The cardinality of a fingerprint (that is, the number of distinct hashes used to generate it) can be estimated with the cardinality method:

n = hasher.cardinality(fingerprint)

A Note on bias correction

Accurately estimating the cardinality of a set from its fingerprint is a little tricky. The method used in Datasketch MinHash has a huge downward bias, and more accurate methods usually involve completely different hashing algorithms (which can't simultaneously compute Jaccard similarity). The implementation in this module uses a simple bias-corrected maximum likelihood estimator to significantly increase the accuracy. The only complication is that the bias correction coefficients must be empirically estimated. The module has the required code at the end of the __init__.py file. Pre-computed bias correction coefficients are stored in bias_coef.npy, and should load automatically from the project root directory.

vectorizedminhash's People

Contributors

bradhackinen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.