shao-group / lsb-learn

A learning algorithm for locality-sensitive bucketing functions

Introduction

This repo hosts source code to train locality-sensitive bucketing (LSB) functions. A bucketing function $f$ maps a length-$n$ string to a set of hash codes (instead of a single hash code). A bucketing function $f$ is said to be $(d_1, d_2)$-sensitive if, for any two length-$n$ strings $s$ and $t$, it satisfies two conditions: if the edit distance between $s$ and $t$ is at most $d_1$, then $f(s)$ and $f(t)$ share at least one hash code; and if the edit distance between $s$ and $t$ is at least $d_2$, then $f(s)$ and $f(t)$ share no hash code. LSB functions were proposed in the LSB paper, with source code in the LSB repo.
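
To make the definition concrete, the sketch below checks the two conditions by brute force on a small set of strings. The bucketing function `deletion_buckets` (all strings obtained by deleting one character) is a hypothetical toy chosen for illustration, not one of the learned functions from this repo, and the helper names are assumptions as well.

```python
from itertools import combinations, product


def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]


def deletion_buckets(s):
    """Toy bucketing function (illustrative only): every single-character deletion of s."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def violations(f, strings, d1, d2):
    """List the pairs (s, t, distance) on which f breaks the (d1, d2)-sensitive conditions."""
    bad = []
    for s, t in combinations(strings, 2):
        d = edit_distance(s, t)
        share = bool(f(s) & f(t))
        if (d <= d1 and not share) or (d >= d2 and share):
            bad.append((s, t, d))
    return bad


# All binary strings of length 4 (a tiny universe so the check is exhaustive).
strings = ["".join(p) for p in product("01", repeat=4)]
print(violations(deletion_buckets, strings, 1, 3))       # pairs violating (1, 3)
print(violations(deletion_buckets, strings, 1, 2)[:3])   # a few pairs violating (1, 2)
```

The toy function passes the $(1, 3)$ check: two equal-length strings at edit distance 1 differ by one substitution, so deleting that position from both gives a shared code, while any shared code implies an edit distance of at most 2. It fails the stricter $(1, 2)$ check, which illustrates why the gap between $d_1$ and $d_2$ matters.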

Here we develop a machine-learning framework that automatically learns $(d_1, d_2)$-LSB functions from simulated data. In brief, we use a Siamese neural network as the training framework, in which the inner model is an inception neural network that represents the function $f$. The inception network consists of layers of convolution and max-pooling units that capture substrings of various sizes as features.
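
As a rough illustration of the idea of convolution and max-pooling units over several substring sizes, here is a toy, untrained sketch in plain Python rather than a deep-learning framework. Everything in it is an assumption made for illustration: the `ACGT` alphabet, the filter widths, the random weights, and the function names. It is not the architecture defined in siacnn_models_gpu.py.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet; the README does not specify one


def one_hot(s):
    """Encode a string as a list of one-hot rows, one row per character."""
    return [[1.0 if c == a else 0.0 for a in ALPHABET] for c in s]


def make_filters(widths, per_width, seed=0):
    """Random (untrained) filters: each is a width x |ALPHABET| weight matrix."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in ALPHABET] for _ in range(w)]
            for w in widths for _ in range(per_width)]


def conv_maxpool(x, filt):
    """Slide one filter along the sequence and keep its strongest response."""
    w = len(filt)
    responses = [
        sum(filt[k][c] * x[i + k][c] for k in range(w) for c in range(len(ALPHABET)))
        for i in range(len(x) - w + 1)
    ]
    return max(responses)


def embed(s, filters):
    """Concatenate the pooled responses of all branches into one feature vector."""
    x = one_hot(s)
    return [conv_maxpool(x, f) for f in filters]


# Branches of width 2, 3 and 5 respond to substrings of those lengths.
filters = make_filters(widths=(2, 3, 5), per_width=2)

# Siamese use: both strings pass through the same filters (shared weights).
s, t = "ACGTACGTACGTACGTACGT", "ACGTACGAACGTACGTACGT"
print(embed(s, filters))
print(embed(t, filters))
```

In a Siamese setup the two feature vectors would then be compared by a loss during training; the loss and the step that turns features into hash codes are not described here, so the sketch stops at the shared feature extractor.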

Usage

  • Environment: Python version >= 3.6

  • Data simulation. The code in /simulation generates a set of random pairs of length-$n$ strings $(s,t)$ with various edit distances as needed. Given $d_1, d_2$, the training samples are tuples $(s,t,y)$, where $y = -1$ if $\mathrm{edit}(s,t) \le d_1$ and $y = 1$ if $\mathrm{edit}(s,t) \ge d_2$.

  • Model training. The code for $n = 20$ and $n = 100$ is kept in separate folders. siacnn_models_gpu.py is a function library (losses, evaluation, model structures, and hash-code generation) meant to be imported. siaincp_runner.py is the trainer for the Siamese neural network. Parameters can be changed directly in these files by following the comments. To train a model, run:

    python siaincp_runner.py

  • Testing and hash-code generation. tester.py is a quick example that tests the data in seq-n20-ED15-2.txt with the pre-trained models stored in trained models and generates the hash codes, which are stored in a file named hashcode_20k_40m_(d1,d2)s.hdf5. Run it with:

    python tester.py

  • Pre-trained models. More pre-trained models are available on Zenodo.
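
The labeling rule from the data simulation step can be sketched as follows. This is a minimal illustration, not the code in /simulation: the `ACGT` alphabet, the mutation scheme, the function names, and the choice to discard pairs whose edit distance falls strictly between $d_1$ and $d_2$ are all assumptions.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet; not stated in the README


def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def mutate(s, n_edits, rng):
    """Apply random substitutions or delete+insert pairs, so the length is preserved."""
    chars = list(s)
    for _ in range(n_edits):
        if rng.random() < 0.5:
            chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
        else:
            del chars[rng.randrange(len(chars))]
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(ALPHABET))
    return "".join(chars)


def label(s, t, d1, d2):
    """Return -1 for close pairs, 1 for far pairs, None for pairs in the gap."""
    d = edit_distance(s, t)
    if d <= d1:
        return -1
    if d >= d2:
        return 1
    return None


def simulate(n, d1, d2, num_pairs, seed=0):
    """Draw labeled (s, t, y) samples of length-n strings; gap pairs are skipped."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < num_pairs:
        s = "".join(rng.choice(ALPHABET) for _ in range(n))
        t = mutate(s, rng.randint(0, n), rng)
        y = label(s, t, d1, d2)
        if y is not None:
            samples.append((s, t, y))
    return samples


for s, t, y in simulate(n=20, d1=1, d2=3, num_pairs=5):
    print(s, t, y)
```

The label is always computed from the true edit distance rather than from the number of edits applied, since several random edits can cancel out or overlap.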

Contributors

shaomingfu, xy5180
