nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors

Home Page: https://github.com/nmslib/hnswlib

License: Apache License 2.0

CMake 1.58% C++ 76.53% Python 21.78% Makefile 0.11%

hnswlib's Introduction


Non-Metric Space Library (NMSLIB)

Important Notes

  • NMSLIB is generic yet fast; see the results of the ANN benchmarks.
  • A standalone implementation of our fastest method, HNSW, also exists as a header-only library.
  • All the documentation (including using the Python bindings and the query server, descriptions of methods and spaces, building the library, etc.) can be found on this page.
  • For generic questions/inquiries, please use the Gitter chat; the GitHub issues page is for bugs and feature requests.

Objectives

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core-library does not have any third-party dependencies. It has been gaining popularity recently. In particular, it has become a part of Amazon Elasticsearch Service.

The goal of the project is to create an effective and comprehensive toolkit for searching in generic and non-metric spaces. Even though the library contains a variety of metric-space access methods, our main focus is on generic and approximate search methods, in particular on methods for non-metric spaces. NMSLIB is possibly the first library with principled support for non-metric space searching.

NMSLIB is an extendible library, which means that it is possible to add new search methods and distance functions. NMSLIB can be used directly in C++ and Python (via Python bindings). In addition, it is also possible to build a query server, which can be used from Java (or other languages supported by Apache Thrift, version 0.12). Java has a native client, i.e., it works on many platforms without requiring a C++ library to be installed.

Authors: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak. With contributions from Ben Frederickson, Lawrence Cayton, Wei Dong, Avrelin Nikita, Dmitry Yashunin, Bob Poekert, @orgoro, @gregfriedland, Scott Gigante, Maxim Andreev, Daniel Lemire, Nathan Kurz, Alexander Ponomarenko.

Brief History

NMSLIB started as a personal project of Bilegsaikhan Naidan, who created the initial code base, the Python bindings, and participated in earlier evaluations. The most successful class of methods, neighborhood/proximity graphs, is represented by the Hierarchical Navigable Small World (HNSW) graph due to Malkov and Yashunin (see the publications below). Other notable methods include a modification of the VP-tree due to Boytsov and Naidan (2013), a Neighborhood APProximation index (NAPP) proposed by Tellez et al. (2013) and improved by David Novak, as well as a vanilla uncompressed inverted file.

Credits and Citing

If you find this library useful, feel free to cite our SISAP paper [BibTex] as well as the other papers listed at the end. One crucial contribution to cite is the fast Hierarchical Navigable Small World (HNSW) graph method [BibTex]. Please also check out the stand-alone HNSW implementation by Yury Malkov, which is released as the header-only HNSWLib library.

License

The code is released under the Apache License Version 2.0 http://www.apache.org/licenses/. Older versions of the library included additional components with different licenses (this does not apply to NMSLIB 2.x):

  • The LSHKIT, which is embedded in our library, is distributed under the GNU General Public License, see http://www.gnu.org/licenses/.
  • The k-NN graph construction algorithm NN-Descent due to Dong et al. 2011 (see the links below), which is also embedded in our library, seems to be covered by a free-to-use license, similar to Apache 2.
  • The FALCONN library's license is MIT.

Funding

Leonid Boytsov was supported by the Open Advancement of Question Answering Systems (OAQA) group and the following NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond". Bileg was supported by the iAd Center.

Related Publications

The most important related papers are listed below in chronological order:

hnswlib's People

Contributors

2ooom, alonre24, alxvth, aurora327, bli25, ctero-graham, dbespalov, dorosy-yeong, drons, dyashuni, emollier, groodt, ironyoung, jlmelville, js1010, kishorenc, louisabraham, ltla, marekhanus, moritz-h, orrorcol, piem, psobot, slice4e, stephematician, takaakifuruse, ttsugriy, vimal-mathew, yoshoku, yurymalkov


hnswlib's Issues

hnsw cannot always recall a data point already in the index?

I tested hnswlib with 10,000,000 data points in the index.
I find that if I search for a data point that is already in the index, the result may not contain that point.
Is this expected?

Below is demo code:

    import time

    import numpy as np
    import hnswlib

    print('start test hnsw ...')
    tic = time.time()
    data = np.random.randn(10000000, 128).astype(np.float32)
    print('data generate time used:', time.time() - tic)

    tic = time.time()
    nmsl = hnswlib.Index(space='l2', dim=128)
    nmsl.init_index(max_elements=10010000, ef_construction=200, M=32)
    nmsl.set_ef(600)
    nmsl.add_items(data)
    print('build index time used:', time.time() - tic)

    # query several points that are already in the index
    for idx in (1, 1000000, 2000000, 5000000, 7900000, 9900000):
        tic = time.time()
        labels, dists = nmsl.knn_query(data[idx, :], k=400)
        print('search top 400 time used:', time.time() - tic)
        print(labels)

    # query the same point with increasing k
    for k in (600, 800, 1000, 1500):
        tic = time.time()
        labels, dists = nmsl.knn_query(data[1, :], k=k)
        print('search top', k, 'time used:', time.time() - tic)
        print(labels)

Is it possible to fetch features from HBase to reduce memory usage?

@yurymalkov
Thank you for your great work. I find that HNSW stores the original features in memory, which limits its use on very large-scale datasets. HNSW memory usage is (d * 4 + x * 2 * 4) bytes per vector, of which (d * 4) bytes is the feature size. If we fetch the features from HBase or another database when adding an item to the HNSW index, we reduce the memory usage to (x * 2 * 4) bytes per vector (assuming the routing path is not long). Do you think this solution to reduce memory usage is workable?

Looking forward to your reply.
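The per-vector estimate above can be checked with plain arithmetic (a sketch; the formula and the example values d = 128, x = 32 are illustrative, taken from the issue's model rather than measured):

```python
def hnsw_bytes_per_vector(d, x):
    """Rough per-vector memory following the issue's model:
    d * 4 bytes of float32 feature data plus x * 2 * 4 bytes of
    graph links. Illustrative, not an exact accounting of hnswlib."""
    feature_bytes = d * 4      # raw float32 vector
    link_bytes = x * 2 * 4     # link lists
    return feature_bytes + link_bytes

# e.g. d=128, x=32: the feature bytes dominate, so fetching features
# from external storage would remove 512 of the 768 bytes per vector
print(hnsw_bytes_per_vector(128, 32))  # 512 + 256 = 768
```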

Does hnsw support adding items one by one in real time?

Hi yurymalkov,

Thank you for your great work; the performance shown in the paper is interesting. The Python bindings show that the index can be built with incremental construction. Does hnsw support adding items one by one, i.e., adding items to the index dynamically in real time?

How to avoid saving the whole index after add_items

We are using this lib to store about 10,000+ entries, which results in a ~1 GB index file. As we are adding new items continuously, we'd like to know whether some optimization can be done to avoid saving the whole index, which takes a long time.

Unused variable in hnswalg.h

g++ warns that end_p is unused:

hnswalg.h:520:20: warning: variable 'end_p' set but not used [-Wunused-but-set-variable]
     std::streampos end_p=input.tellg();

Wrong distance to vector at index 0 using cosine similarity

There seems to be something weird going on when calculating the distance of any vector to the vector at index 0 using cosine similarity. It is sometimes very high (around 1e21) or very low (around -1e21), which are both outside the range of cosine similarity.

Below is a toy example that shows this behavior:

In [165]: v = np.array([[1,0],[1,1],[0,1],[-1,1],[-1,0],[-1,-1],[0,-1],[1,-1]])

In [166]: index = Index(space="cosine", dim=2)

In [167]: index.init_index(max_elements=8, ef_construction=200, M=16)

In [168]: data_labels = np.arange(8)

In [169]: index.add_items(v, data_labels)

In [170]: index.knn_query(v[0],k=8)
Out[170]:
(array([[1, 7, 2, 6, 3, 5, 4, 0]], dtype=uint64),
 array([[  2.92893231e-01,   2.92893231e-01,   1.00000000e+00,
           1.00000000e+00,   1.70710683e+00,   1.70710683e+00,
           2.00000000e+00,   3.41255758e+21]], dtype=float32))

In [171]: index.knn_query(v[1],k=8)
Out[171]:
(array([[1, 2, 3, 7, 4, 6, 5, 0]], dtype=uint64),
 array([[  5.96046448e-08,   2.92893231e-01,   1.00000000e+00,
           1.00000000e+00,   1.70710683e+00,   1.70710683e+00,
           2.00000000e+00,   2.41304247e+21]], dtype=float32))

In [172]: index.knn_query(v[2],k=8)
Out[172]:
(array([[2, 1, 3, 0, 4, 5, 7, 6]], dtype=uint64),
 array([[ 0.        ,  0.29289323,  0.29289323,  1.        ,  1.        ,
          1.70710683,  1.70710683,  2.        ]], dtype=float32))

In [173]: index.knn_query(v[3],k=8)
Out[173]:
(array([[0, 3, 2, 4, 1, 5, 6, 7]], dtype=uint64),
 array([[ -2.41304247e+21,   5.96046448e-08,   2.92893231e-01,
           2.92893231e-01,   1.00000000e+00,   1.00000000e+00,
           1.70710683e+00,   2.00000000e+00]], dtype=float32))

In [174]: index.knn_query(v[4],k=8)
Out[174]:
(array([[0, 4, 3, 5, 2, 6, 1, 7]], dtype=uint64),
 array([[ -3.41255758e+21,   0.00000000e+00,   2.92893231e-01,
           2.92893231e-01,   1.00000000e+00,   1.00000000e+00,
           1.70710683e+00,   1.70710683e+00]], dtype=float32))

In [175]: index.knn_query(v[5],k=8)
Out[175]:
(array([[0, 5, 4, 6, 3, 7, 2, 1]], dtype=uint64),
 array([[ -2.41304247e+21,   5.96046448e-08,   2.92893231e-01,
           2.92893231e-01,   1.00000000e+00,   1.00000000e+00,
           1.70710683e+00,   2.00000000e+00]], dtype=float32))

In [176]: index.knn_query(v[6],k=8)
Out[176]:
(array([[6, 5, 7, 0, 4, 1, 3, 2]], dtype=uint64),
 array([[ 0.        ,  0.29289323,  0.29289323,  1.        ,  1.        ,
          1.70710683,  1.70710683,  2.        ]], dtype=float32))

In [177]: index.knn_query(v[7],k=8)
Out[177]:
(array([[7, 6, 1, 5, 2, 4, 3, 0]], dtype=uint64),
 array([[  5.96046448e-08,   2.92893231e-01,   1.00000000e+00,
           1.00000000e+00,   1.70710683e+00,   1.70710683e+00,
           2.00000000e+00,   2.41304247e+21]], dtype=float32))

All indices apart from 2 and 6 show this behavior (incidentally, those make a 90 degree angle with the vector at index 0).

I am currently at commit aca8f8b, if that's useful.

core dumped when import hnswlib

The error also occurs when running the example code of hnswlib. It seems to be correlated with the system environment, but I cannot figure it out. The same code runs well on another machine.
The error occurs when running this line:

p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

I reinstalled the Ubuntu system and hnsw, and now it core dumps on import hnswlib ...

Any hint or suggestions here?

How to change a point after building index?

I want to change a data point in the index. How can I do that?

import hnswlib
import numpy as np

# Generating sample data
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
c = np.array([0.0, 0.0, 1.0])
d = np.array([1.0, 1.0, 0.0])

# Declaring index
p = hnswlib.Index(space='l2', dim=3)

# Initing index - the maximum number of elements should be known beforehand
p.init_index(max_elements=50, ef_construction=200, M=16)

# Element insertion (can be called several times):
p.add_items(a)
p.add_items(b)
p.add_items(c)

labels, distances = p.knn_query(c, k=1)
print(labels, distances)

# change data points
p.add_items(d, 2)
p.add_items(c, 0)

labels, distances = p.knn_query(d, k=1)
print(labels, distances)

labels, distances = p.knn_query(c, k=1)
print(labels, distances)

The query result does not change after I re-add the data points under existing labels.

`ef_` race condition

There seems to be a race condition around the ef_ search parameter. If I've got two threads running searches concurrently, it's possible that thread A could try to set it to one value, and then have thread B change it to something else before it is able to execute the search.

I'd like to resolve it by adding an ef parameter to searchKnn:

std::priority_queue<std::pair<dist_t, labeltype >> searchKnn(void *query_data, size_t k, size_t ef);

and then adding an overload that has the current signature, and just passes the current value of ef_ in. This would mostly preserve the current behavior for existing code, while adding the option to override that setting for individual searches.

There would be some minor behavior change as a result of reducing the size of the window for that race condition. I assume that would be acceptable?

How to get items by ids

Hi @yurymalkov & @searchivarius,

I have used the Python version to build an index of some items with ids and stored it in a file. My question: is there a method to get the items by their assigned ids after I load the index file?

Thanks,
Shuo

Threaded add_items issue

I was looking into the Python example. In my experience, the threaded add_items gives me a different result and accuracy every time I run the script.

I think using multiple threads while adding the items is wrong here.

p.set_num_threads(4) # by default using all available cores

Moreover, when I used the cosine space the accuracy was around 50% for some index generations.
Index(space='cosine', dim=dim)

When I used a single thread the results were consistent every time.
p.set_num_threads(1)

Can someone clarify the issue?

Question about the recall performance

Hi yurymalkov,

I'm sorry to disturb you again. I have a question about the recall evaluation in sift_1b.cpp. There, the number of returned samples per query is 1; if that returned sample is among the top-K samples (top@K) of the ground truth, it counts as 1. I think another method may be more appropriate:

Retrieve K samples per query (top@K), and check whether the top-1 sample (top@1) of the ground truth is among the K retrieved samples. If so, it counts as 1.

I think this evaluation is better than the method adopted in sift_1b.cpp. Taking face retrieval as an example, we want a high recall rate: if the same face is not recalled in the top 10, we can raise the number of retrieved samples, e.g. to top@50.

What do you think of the evaluation method in sift_1b.cpp versus the method above?

Looking forward to your reply.
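The two protocols above can be written down side by side (a toy numpy sketch, not the actual code in sift_1b.cpp):

```python
import numpy as np

def recall_returned1_in_gt_topk(retrieved_topk, gt_topk):
    """sift_1b.cpp-style: does the single returned neighbor (here the
    first retrieved item) appear in the ground-truth top-K?"""
    return float(np.mean([r[0] in set(g) for r, g in zip(retrieved_topk, gt_topk)]))

def recall_gt1_in_retrieved_topk(retrieved_topk, gt_topk):
    """Proposed protocol: is the ground-truth top-1 contained among
    the K retrieved samples?"""
    return float(np.mean([g[0] in set(r) for r, g in zip(retrieved_topk, gt_topk)]))

# toy data: 2 queries, K = 3
retrieved = np.array([[5, 2, 9], [1, 4, 7]])
ground_truth = np.array([[2, 5, 9], [8, 1, 4]])
print(recall_returned1_in_gt_topk(retrieved, ground_truth))   # both queries count -> 1.0
print(recall_gt1_in_retrieved_topk(retrieved, ground_truth))  # query 2 misses id 8 -> 0.5
```

The toy data shows how the two metrics can disagree on the same result set.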

Build times

@yurymalkov Thanks for your great work.

I came from this thread after getting help from Leonid, and I was hoping I could ask about the build times.

I'd be really grateful if you could tell me:

  • How long it took for the 200M SIFT dataset to index?
  • What hardware did you use?
  • What parameters gave the best recall/index-time/search-time for you?

Thanks for this optimized version.

Problem when running hnswlib on ARM demoboard

Thank you for the great work.
For my project, I need to run hnswlib on an ARM demo board, but I ran into a problem. I think it is caused by #include <x86intrin.h> in space_ip.h, which uses SSE to reduce computation time.
Is there a better way than porting the code to the NEON instruction set to solve this problem?
Looking forward to your reply. Thank you so much.

Keeping multiple candidates while searching

Hi,
have you considered keeping multiple candidates from each layer while searching?
This code selects only one candidate from each layer before going to the next layer.

Does it make sense to keep more candidates, e.g. 2? Keeping 2 candidates instead of 1 might not cause a 100% slowdown, since the neighbors of the 2 candidates might have overlap and thus the number of distance calculations might go up by much less than 100%. E.g. with 60% overlap there will be only 40% more distance calculations.

Intel E5-2680 coredump

We're having an issue with the library core dumping on certain Intel processors, specifically the Intel E5-2680 v2 @ 2.80 GHz.

Steps to reproduce:

# python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hnswlib
>>> p = hnswlib.Index(space='l2', dim=128)
Illegal instruction (core dumped)
#

If you run it on an AWS c3 instance it will fail. On other instance types with the newer Intel E5-2686 v4 @ 2.30 GHz processor it works fine, for example on a m4 or i3 instance type.

Any ideas? The 2680 doesn't have the AVX instruction set.
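To see which SIMD flags the running CPU actually advertises, one can parse /proc/cpuinfo (a Linux-only sketch; other platforms need cpuid tooling):

```python
def cpu_has_flag(flag, cpuinfo_path='/proc/cpuinfo'):
    """Return True if the given CPU feature flag (e.g. 'avx', 'sse4_2')
    appears in the flags line of /proc/cpuinfo. Linux/x86 only."""
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith('flags'):
                    return flag in line.split(':', 1)[1].split()
    except OSError:
        pass
    return False

print(cpu_has_flag('avx'), cpu_has_flag('sse2'))
```

If the binary was built with -march=native on a newer host, running it on a CPU missing those flags produces exactly this "Illegal instruction" crash; rebuilding on (or targeting) the older CPU avoids it.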

I have a Docker image you can use to check it as well:

docker run -it quay.io/kairosinc/cpubug-test:latest

Thanks!

Using mmap on index

Is it possible to use a memory-mapped file (e.g. via mmap) to store an hnsw index? This would reduce the memory usage of a large index.

what parameters impact the recall/accuracy?

There are several parameters on the Index, e.g. space, M, ef_construction; which of them impact the accuracy of knn_query()?

I have tried different values for these parameters, but still can't figure it out.

I created an index of 50000+ images, then added 3 images of fish to the index incrementally:
step 1: add fish-1 to the index, then query with fish-1; it was not recalled.
step 2: add fish-2 to the index, then query with fish-2; fish-1 and fish-2 were recalled.
step 3: add fish-3 to the index, then query with fish-3; fish-1, fish-2, and fish-3 were recalled.
step 4: query with fish-1; no fish image was recalled.
step 5: query with fish-2 or fish-3; all 3 fishes were recalled.

Why can't fish-1 be recalled?
Thanks.

Accuracy depends on the order of the insertion

Hi, I find that changing the order of insertion alters the performance (accuracy). Is that expected? Or can I modify it so as to keep the accuracy the same, independent of the insertion order?

Thanks.

resizing incremental index

It seems that when creating an index we need to specify the number of elements it can hold. I have a few questions related to this.

  1. If we add more entries than the specified number of elements, what happens?
  2. Is it possible to auto-expand the allocated index size when it reaches a threshold, e.g. 75% of max?

Support deleting elements?

Thanks for the great implementation; the performance is impressive. I was wondering what it would take to add the ability to delete an element from the graph without degrading performance too much. Having a special flag on the label, or re-assigning the lost edges?

Compiler pedantry: unnecessary semi-colons

If running g++ in -pedantic mode, there are warnings about unnecessary semi-colons terminating the static function definitions and namespace declarations in space_ip.h and space_l2.h.

Memory leak found for python lib

I find there is a memory leak when loading the index file. Here are the code snippets.

    # python3
    import psutil
    import hnswlib

    proc = psutil.Process()
    index_map = {}
    for i in range(100):
        print('before: {} MB'.format(proc.memory_info().rss / 1024 / 1024))
        knn_index = HNSWIndex(128, 10000)  # our wrapper around the hnswlib index
        knn_index.load('resources/indexes/03fef96d.runscope_medium.1636638189768806400.idx')
        index_map['ks1'] = knn_index
        print('after: {} MB'.format(proc.memory_info().rss / 1024 / 1024))

# output
before: 14.484375 MB
after: 23.72265625 MB
before: 23.72265625 MB
after: 24.390625 MB
before: 24.390625 MB
after: 24.92578125 MB
before: 24.92578125 MB
after: 33.6875 MB
before: 33.6875 MB
after: 41.6796875 MB
before: 41.6796875 MB
after: 49.4140625 MB
before: 49.4140625 MB
after: 57.40625 MB
before: 57.40625 MB
after: 65.140625 MB
before: 65.140625 MB
...

negative cosine similarity

I get a negative distance for the item with index 0.
It should be 0, but -1.1920928955078125e-7 is returned for the item against itself.
How should I handle it?

It seems linked to #34.
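For context, the 'cosine' space returns a distance, 1 - cosine similarity, so the exact value for an item against itself is 0; a tiny negative like -1.19e-7 (one float32 ulp) is roundoff. A numpy sketch of the same effect (illustrative, not hnswlib's exact kernel):

```python
import numpy as np

v = np.random.randn(128).astype(np.float32)
u = (v / np.linalg.norm(v)).astype(np.float32)  # normalized, as the cosine space stores vectors

# self-distance in float32: exactly 0 in exact arithmetic,
# but may come out a few ulps above or below 0
dist = np.float32(1.0) - np.float32(np.dot(u, u))
print(float(dist))

# a common caller-side fix: clamp to the valid range [0, 2]
clamped = min(max(float(dist), 0.0), 2.0)
print(clamped >= 0.0)
```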

Adding elements to an index loaded in memory iteratively

Currently, from what I understand of the documentation, we would need to load an index into memory to be able to update the number of max elements and add new elements to it.

My question is: what if the index is already loaded in memory, some process runs iteratively, and we want to append new elements yielded from that process to the already loaded index and update the object? At the moment it seems we need to follow this routine: build the index, save it, load it back (increasing max elements in the load argument), add elements, save, and so on.

This adds load time to this process. I was wondering if there is a way to do this without saving and loading the index and just append to an already in memory index over and over (and save when we want).

Thank you in advance for any help on this!

command 'gcc' failed with exit status 1

from hnsw/python_bindings/
=> python setup.py install

running install
running bdist_egg
running egg_info
writing nmslib.egg-info/PKG-INFO
writing dependency_links to nmslib.egg-info/dependency_links.txt
writing requirements to nmslib.egg-info/requires.txt
writing top-level names to nmslib.egg-info/top_level.txt
reading manifest file 'nmslib.egg-info/SOURCES.txt'
writing manifest file 'nmslib.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
gcc -pthread -B /home/helllo/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/helllo/anaconda3/include/python3.6m -c /tmp/tmp663kadko.cpp -o tmp/tmp663kadko.o -std=c++14
gcc: error: unrecognized command line option ‘-std=c++14’
gcc -pthread -B /home/helllo/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/helllo/anaconda3/include/python3.6m -c /tmp/tmptfty6n9i.cpp -o tmp/tmptfty6n9i.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -B /home/helllo/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/helllo/anaconda3/include/python3.6m -c /tmp/tmpkk_m2rzy.cpp -o tmp/tmpkk_m2rzy.o -fvisibility=hidden
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
building 'nmslib' extension
gcc -pthread -B /home/helllo/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I./nmslib/similarity_search/include -I/home/helllo/anaconda3/include/python3.6m -I/home/helllo/.local/include/python3.6m -I/home/helllo/anaconda3/lib/python3.6/site-packages/numpy/core/include -I/home/helllo/anaconda3/include/python3.6m -c nmslib.cc -o build/temp.linux-x86_64-3.6/nmslib.o -O3 -march=native -fopenmp -DVERSION_INFO="1.7.3.2" -std=c++11 -fvisibility=hidden
gcc: error: nmslib.cc: No such file or directory
error: command 'gcc' failed with exit status 1

Cosine similarity is abnormally slow

Compared to l2 and ip, cosine is abnormally slow, and the indexing time grows steeply once there are more than 10000 points.
data size = 14000, time unit = 1 us:

    space    index       query
    l2       4021134     5241054
    ip       5447251     7004991
    cosine   23386483    65030850

After some experiments, I found that the technique in the denorm branch solves the problem; I hope this can be merged into the master branch:

norm = 1.0/sqrtf(norm + 1e-30f);

And I wonder why this algorithm is so sensitive to this situation. Thanks very much.

I have a core-dump back trace, what's the possible reason?

Program terminated with signal 11, Segmentation fault.

#0 0x0000000000fddfae in hnswlib::HierarchicalNSW::searchBaseLayerST(unsigned int, void*, unsigned long) () at /home/bigo/recommend/recsys_framework/thirdparty/include/hnsw/hnswalg.h:253

#1 0x0000000000fdc8e3 in hnswlib::HierarchicalNSW::searchKnn(void*, unsigned long) () at /home/bigo/recommend/recsys_framework/thirdparty/include/hnsw/hnswalg.h:704

#2 0x0000000000fdafc7 in brec::KnnIndex::knnQuery(std::vector<std::pair<float, unsigned long>, std::allocator<std::pair<float, unsigned long> > >&, std::vector<float, std::allocator > const&, unsigned long) () at /home/bigo/recommend/recsys_framework/brec/knn/knn_index.h:217

#3 0x0000000000fd90f1 in server::recsys::KnnModel::knnQuery(std::vector<std::pair<float, unsigned long>, std::allocator<std::pair<float, unsigned long> > >&, std::vector<float, std::allocator > const&, int) () at /home/bigo/recommend/recommend_recall_d/src/models/knn_model.cpp:51

Publish to PyPi

As far as I can tell hnswlib can be installed with pip by running

pip install --user git+https://github.com/nmslib/hnswlib#subdirectory=python_bindings

but it would be nice to add it as a regular dependency via PyPi. Do you have plans to publish it there?

Access to stored vector

Hi @yurymalkov & @searchivarius,

After reading the code a bit, it seems that the vector data for each element is stored in an internal memory structure. Is there a way of accessing the data after it has been indexed?

Does this memory location change when new items are added? Or could one fetch the vector data from the location where it was stored in the first place?

Thanks!


L2Space distance is sorted from big to small

@yurymalkov Hi, yurymalkov. I'm using hnsw to retrieve RootSIFT descriptors with the L2Space distance. I find the distances are sorted not from small to big but from big to small. I'm not sure I'm using it right. The code is as follows:

cv::Ptr<cv::Feature2D> detector = cv::xfeatures2d::SIFT::create();
cv::Mat im = cv::imread("123.jpg", 1);
if (im.empty()) return 0;

std::vector<cv::KeyPoint> keypoints;
cv::Mat img_descs;
detector->detect(im, keypoints);
detector->compute(im, keypoints, img_descs);

if (keypoints.size() < 5) {
    std::cout << "too few feature points: " << std::endl;
    return 0;
}

rootSift(img_descs);

for(int i = 0; i < img_descs.rows; i++) {
    
    for(int j = 0; j < vecdim; j++) {
        massb[j] = img_descs.at<float>(i, j);
    }
    std::priority_queue<std::pair<float, labeltype >> candidates = appr_alg->searchKnn(massb, 10);
    
    for(int k = 0; k < totalNum; k++) {
        if(candidates.top().first < 0.25 &&  abs(keypoints.at(i).angle - geosInfo.at(candidates.top().second).Pt.angle) < 10){
            cout << geosInfo.at(candidates.top().second).fileBaseName << ", angle diff: " << abs(keypoints.at(i).angle - geosInfo.at(candidates.top().second).Pt.angle) << ", dist: " << candidates.top().first << endl;
        }
        candidates.pop();
    }
}

To sort distances from small to big, a crude workaround is to set K in appr_alg->searchKnn(massb, K) to the database size. Could you tell me how to solve it gracefully?

Missing call to addPoint for python bindings?

@yurymalkov Awesome code; I'm really enjoying it compared to other similar libraries.

Just one remark I noticed here: it seems that the Python bindings provided are not actually adding any vectors to the index; there is no call to addPoint in the code:

if(normalize==false) {
    ParallelFor(start, rows, num_threads, [&](size_t row, size_t threadId) {
        size_t id = ids.size() ? ids.at(row) : (cur_l+row);
    });
} else{
    std::vector<float> norm_array(num_threads * dim);
    ParallelFor(start, rows, num_threads, [&](size_t row, size_t threadId) {
        // normalize vector:
        size_t start_idx = threadId * dim;
        normalize_vector((float *) items.data(row), (norm_array.data()+start_idx));

        size_t id = ids.size() ? ids.at(row) : (cur_l+row);
    });
};


Also, one question. Is the library supposed to be thread safe for adding new items? Or just for performing search?

Duplicate labels permitted?

HierarchicalNSW will allow you to add two different vectors with the same label, which seems slightly surprising to me.

I'm not sure what to suggest here, since I don't know what the intended behavior is; I just wanted to bring it up. For my purposes, I might prefer the ability to overwrite the old value. I realize that may not suit everyone's needs, though. I presume it would depend on issue #4, too.

Index file is 13G all the time.

I inserted 10k points and saved the index to disk, and found that it consumes 13 GB of space (although after gzip it is 500 MB). Is this a bug in the save/load index module?

Build index in unnormalized inner-product space with lower performance?

Hi @yurymalkov & @searchivarius,

I have a dataset I want to index with hnswlib in inner-product space (not cosine); its dimension is 100. I built an index (M = 64, efc = 5000, efs = 5000) and ran ANN search: the recall rate is about 49% at 10 ms per search.
Then I built the index with normalized vectors (M = 64, efc = 5000, efs = 5000), as in the cosine space (though I actually want ip), and computed the recall rate against a brute-force result from ip space: the recall rate is about 91% at 4 ms per search.
I am confused about the performance of the unnormalized index; have you encountered this situation, and do you have any suggestions?
BTW, I also tried a faiss::IndexIVFFlat index; its recall rate is 91% at 7 ms per search.

Thanks,
Lin

Question about old indexes warning

Hi,

I am getting the following warning that I wasn't seeing before.

Warning: loading of old indexes will be deprecated before 2019.
Please resave the index in the new format.

I am wondering:

  • Why am I getting this warning even when I am creating an index (not loading), is the creation syntax different now?
  • What exactly is the change (in creating and loading an index) in python that we should apply now? I can't seem to find any change in python documentation.

Thank you in advance!

Index thread safety

Hi,

First of all, thanks for sharing with us such a nice work of yours. It really makes my work that much easier with an efficient and fast search algorithm like this.

A question on thread safety of the hnswlib.Index() object though, as I couldn't find it mentioned explicitly anywhere in the wiki. Is the index object thread safe and can be shared between threads? I'm thinking about having multiple threads sharing this object and call its knn_query() independently.

Thanks
