Coder Social home page Coder Social logo

ip-nsw's Introduction

Non-metric Similarity Graphs for Maximum Inner Product Search

This is the implementation for ip-nsw from the following paper:

S. Morozov, A. Babenko. Non-metric Similarity Graphs for Maximum Inner Product Search. Advances in Neural Information Processing Systems 32 (NIPS 2018).

and it is substantially based on https://github.com/yurymalkov/hnsw.

Requirements

  • g++ with c++11 support
  • OpenMP installed

Compilation

To download and compile the code type:

$ git clone https://github.com/stanis-morozov/ip-nsw.git
$ cd ip-nsw
$ cmake CMakeLists.txt
$ make

Dataset

We introduce to the community the new dataset Music100. The database of Music100 consists of 1,000,000 vectors of dimensionality 100 and there is a random sample of 10,000 queries of the same dimensionality.

Database Music100

Query Music100

Groundtruth: Top 1, Top 10, Top 100.

Both files are written in binary format. Vectors are stored consecutively, and numbers are represented in 4-bytes floating poing format (float in C/C++).

Example of reading database_music100.bin in C++:

size_t databaseSize = 1000 * 1000;
size_t dimension = 100;
std::vector <std::vector <float> > data(databaseSize, std::vector <float> (dimension));
std::ifstream input("database_music100.bin", std::ios::binary);
for (size_t i = 0; i < databaseSize; i++) {
    input.read((char*)data[i].data(), dimension * sizeof(float));
}
input.close();

Example of reading database_music100.bin in Python:

databaseSize = 10**6
dimension = 100
data = np.fromfile('database_music100.bin', dtype = np.float32).reshape(databaseSize, dimension)

Run experiments

Usage: main [OPTIONS]
This tool supports two modes: to construct the graph index from the database and to retrieve top K maximum inner product vectors using the constructed index. Each mode has its own set of parameters.

  --mode            "database" or "query". Use "database" for
                    constructing index and "query" for top K
                    maximum inner product retrieval

Database mode supports the following options:
  --database        Database filename. Database should be stored in binary format.
                    Vectors are written consecutive and numbers are
                    represented in 4-bytes floating poing format (float in C/C++)
  --databaseSize    Number of vectors in the database
  --dimension       Dimension of vectors
  --outputGraph     Filename for the output index graph
  --efConstruction  efConstruction parameter. Default: 1024
  --M               M parameter. Default: 32

Query mode supports the following options:
  --query           Query filename. Queries should be stored in binary format.
                    Vectors are written consecutive and numbers are
                    represented in 4-bytes floating poing format (float in C/C++)
  --querySize       Number of queries
  --dimension       Dimension of vectors
  --inputGraph      Filename for the input index graph
  --efSearch        efSearch parameter. Default: 128
  --topK            Top size for retrieval. Default: 1
  --output          Filename to print the result. Default: stdout

Our best performed index graph from the NIPS paper is (we assume that our Music100 dataset is located in the data folder):

./main --mode database --database data/database_music100.bin --databaseSize 1000000 --dimension 100 --outputGraph out_graph.hnsw --efConstruction 1024 --M 32
./main --mode query --query data/query_music100.bin --querySize 10000 --dimension 100 --inputGraph out_graph.hnsw --topK 10 --efSearch 128 --output result.txt

The above commands print the output to result.txt in the following format:

109133 121486 19537 659869 834299 626720 867593 793041 985613 780019 
981712 601975 355926 202115 294738 198925 679900 609952 19993 615818 
77873 764344 292930 982610 65402 80310 638008 446408 209176 974376 
615370 981712 768351 601975 19993 679900 521439 198925 867593 739598 
619843 567912 732469 560702 649392 34342 827097 18669 385074 690897 
473444 752040 342582 154589 881600 292375 171237 223102 370021 219806 
313256 36821 267804 507875 949307 270882 471300 70277 651590 111192 
19993 93160 361868 896032 202115 739598 58964 967143 205517 71338 
894914 943250 85319 431888 615156 428084 909129 687106 884369 942893 
480939 228514 630512 651287 197739 509842 745237 566816 881843 166116
...

where each line corresponds to the answer for the query and the numbers in the lines are zero-based IDs of vectors in the database. The Recall for the above parameters is 0.99458.

Related publication

@inproceedings{ip-nsw18,
    title={Non-metric Similarity Graphs for Maximum Inner Product Search}
    author={Stanislav Morozov and Artem Babenko},
    booktitle={Advances in Neural Information Processing Systems},
    year={2018}
}

ip-nsw's People

Contributors

stanis-morozov avatar jerry-liujie avatar

Stargazers

 avatar  avatar  avatar Hannankan avatar Daniel Baker avatar Robert Liu avatar

Watchers

Kelvin Ng avatar Xinyan avatar  avatar

Forkers

bianzheng123

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.