matsui528 / nanopq

Pure python implementation of product quantization for nearest neighbor search

License: MIT License

Makefile 1.06% Python 98.94%

approximate-nearest-neighbor-search data-compression nearest-neighbor-search product-quantization

nanopq's Issues

How to compute codes similar to FAISS in NanoPQ

How can I compute codes similar to those produced by FAISS using NanoPQ, without using FAISS itself?

No module named 'scipy.cluster'

The error occurs when using scipy == 1.3.3.

When I downgrade scipy to version 1.2.1, it runs well.

Typo

I think this should be 8 bits, not 256! Otherwise the package is very helpful, thanks!

nanopq/nanopq/pq.py

Line 22 in 5c9e138

into 256 bits = 1 byte = uint8)

How to compute distance between PQ codes?

Not sure if this should be a feature request.

Suppose I just want to approximate the distance between two PQ codes (under the same encoder, of course). What is the most efficient way to perform this operation?
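One common answer to this kind of question is symmetric distance computation: precompute, per subspace, the squared distances between every pair of codewords, then sum M table lookups per code pair. The sketch below is not nanopq's API. It assumes only a codeword array of shape (M, Ks, Ds), here filled with random stand-in values, and the helper names `sdc_tables`, `sdc_distance`, and `decode` are hypothetical.

```python
import numpy as np

def sdc_tables(codewords):
    """Squared distances between all codeword pairs, per subspace.

    codewords: (M, Ks, Ds)  ->  tables: (M, Ks, Ks)
    """
    diff = codewords[:, :, None, :] - codewords[:, None, :, :]  # (M, Ks, Ks, Ds)
    return (diff ** 2).sum(axis=-1)

def sdc_distance(tables, code_a, code_b):
    """Approximate squared distance between two PQ codes via M table lookups."""
    m = np.arange(tables.shape[0])
    return float(tables[m, code_a, code_b].sum())

def decode(codewords, code):
    """Reconstruct a vector by concatenating the selected codeword of each subspace."""
    m = np.arange(codewords.shape[0])
    return codewords[m, code].reshape(-1)

# Stand-in codebook (random values, not a trained model): M=4, Ks=16, Ds=6
rng = np.random.default_rng(0)
codewords = rng.random((4, 16, 6))
tables = sdc_tables(codewords)

code_a = rng.integers(0, 16, size=4)
code_b = rng.integers(0, 16, size=4)

approx = sdc_distance(tables, code_a, code_b)
# The table lookup equals the squared distance between the two reconstructions.
exact_on_decoded = float(
    ((decode(codewords, code_a) - decode(codewords, code_b)) ** 2).sum()
)
```

The tables cost M * Ks * Ks floats once, after which each pair of codes needs only M lookups. The result is the squared distance between the two reconstructed vectors, so it carries the quantization error of both codes.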

OPQ prints about Reconstruction error even with verbose=False

The Optimized PQ class takes a verbose flag, which is passed to the inner PQ class.
However, it would make sense for OPQ to obey this flag in its own fit method as well, which currently prints many "Reconstruction error" messages.

Number of codewords for quantization

How can we determine the ideal value for the Ks parameter?
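There is no single ideal value, but the storage side of the trade-off can be computed directly. The sketch below uses a hypothetical helper, `pq_footprint`, and assumes float32 codewords. It reports the code size per vector and the codebook size for a few values of Ks, and says nothing about accuracy, which has to be measured on your own data.

```python
def pq_footprint(D, M, Ks):
    """Return (bits per encoded vector, codebook size in bytes).

    Assumes float32 codewords; the codebook holds M * Ks * (D // M) floats.
    """
    bits_per_index = max(1, (Ks - 1).bit_length())  # bits needed to index Ks codewords
    code_bits = M * bits_per_index
    codebook_bytes = M * Ks * (D // M) * 4
    return code_bits, codebook_bytes

# Larger Ks means longer codes and a larger codebook (and more k-means work).
for Ks in (16, 256, 4096):
    code_bits, codebook_bytes = pq_footprint(D=128, M=8, Ks=Ks)
    print(f"Ks={Ks}: {code_bits} bits per vector, {codebook_bytes} bytes of codebook")
```

Ks = 256 is a frequent choice because each index then fits exactly in one byte.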

Why does parametric init perform worse than the non-parametric one?

Hi, friend. I have two questions.

Why does parametric init perform worse than the non-parametric one, according to your unit test 'test_parametric_init'? This is inconsistent with the conclusion of the paper "Optimized Product Quantization for Approximate Nearest Neighbor Search" by Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun.

```python
def test_parametric_init(self):
    N, D, M, Ks = 100, 12, 4, 10
    X = np.random.random((N, D)).astype(np.float32)
    opq = nanopq.OPQ(M=M, Ks=Ks)
    opq.fit(X, parametric_init=False, rotation_iter=1)
    err_init = np.linalg.norm(opq.rotate(X) - opq.decode(opq.encode(X)))

    opq = nanopq.OPQ(M=M, Ks=Ks)
    opq.fit(X, parametric_init=True, rotation_iter=1)
    err = np.linalg.norm(opq.rotate(X) - opq.decode(opq.encode(X)))

    self.assertLess(err_init, err)
```

Second, computing the norm should not require rotating X: decode already rotates the codes back to the original space, at line 255 of opq.py:

self.pq.decode(codes) @ self.R.T
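The point about spaces can be checked numerically without nanopq. The sketch below assumes rotation is `X @ R` with an orthogonal `R`, mirroring the decode line quoted above, and uses small Gaussian noise as a stand-in for quantization error. It compares the error measured in the rotated space, in the original space, and across mismatched spaces.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 12
X = rng.random((N, D))

# Random orthogonal matrix from a QR decomposition (stand-in for a learned rotation).
R, _ = np.linalg.qr(rng.standard_normal((D, D)))

X_rot = X @ R                                                 # data in the rotated space
X_rot_hat = X_rot + 0.01 * rng.standard_normal(X_rot.shape)   # stand-in for PQ reconstruction
X_hat = X_rot_hat @ R.T                                       # rotated back to the original space

err_rotated = np.linalg.norm(X_rot - X_rot_hat)   # both operands in the rotated space
err_original = np.linalg.norm(X - X_hat)          # both operands in the original space
err_mixed = np.linalg.norm(X_rot - X_hat)         # operands in different spaces

print(err_rotated, err_original, err_mixed)
```

Because `R` is orthogonal, the first two errors agree. The mixed comparison also includes the displacement caused by the rotation itself, so it does not measure reconstruction quality.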

How to get the top-k closest neighbors without traversing the entire dataset?
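A standard way to avoid a full scan is an inverted index: cluster the dataset coarsely, then search only the lists whose centroids are closest to the query. The sketch below is a plain NumPy illustration, not a nanopq feature. `build_ivf` and `search` are hypothetical names, and exact distances are used where a PQ-based system would use approximate ones.

```python
import numpy as np

def build_ivf(X, n_lists, seed=0, n_iter=10):
    """Coarse k-means (Lloyd iterations) plus one inverted list of indices per centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_lists, replace=False)].copy()

    def assign_to_nearest(cents):
        return np.argmin(((X[:, None, :] - cents[None]) ** 2).sum(-1), axis=1)

    for _ in range(n_iter):
        assign = assign_to_nearest(centroids)
        for c in range(n_lists):
            members = X[assign == c]
            if len(members):                       # keep the old centroid if a list is empty
                centroids[c] = members.mean(axis=0)
    assign = assign_to_nearest(centroids)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists

def search(query, X, centroids, lists, k, n_probe):
    """Scan only the n_probe lists whose centroids are nearest to the query."""
    d_coarse = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(d_coarse)[:n_probe]
    cand = np.concatenate([lists[c] for c in probe])
    d = ((X[cand] - query) ** 2).sum(axis=1)
    return cand[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
X = rng.random((1000, 16))
query = rng.random(16)
centroids, lists = build_ivf(X, n_lists=8)

top5 = search(query, X, centroids, lists, k=5, n_probe=2)   # visits 2 of 8 lists
```

With a small `n_probe` the result is approximate, since a true neighbor may sit in a list that was not visited. Probing every list reduces to an exhaustive search.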

Turn print statements into logging

Hi!

Thanks for writing this package, it looks great!

I'd be interested in turning the print statements (with verbose=True) into logging statements. The verbose flag could then be used to control whether this logging is output to stdout (i.e., by setting the log level). Is this something you are interested in? If so, I could submit a PR.
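As a rough illustration of the proposal, the sketch below replaces a print call with a module-level logger whose level follows a verbose flag. The logger name `pq_demo` and the `fit` function are hypothetical stand-ins, not nanopq code.

```python
import io
import logging

logger = logging.getLogger("pq_demo")  # hypothetical logger name

def fit(n_iter, verbose=True):
    """Stand-in training loop: progress goes through logging instead of print."""
    logger.setLevel(logging.INFO if verbose else logging.WARNING)
    for i in range(n_iter):
        logger.info("iter: %d / %d", i + 1, n_iter)

# Capture the output in memory to show the effect of the flag.
buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(buffer))

fit(3, verbose=True)    # emits three INFO records
fit(3, verbose=False)   # emits nothing, INFO is below the WARNING threshold

lines = buffer.getvalue().splitlines()
```

An alternative design leaves the level untouched inside the library and lets the application configure it, which is the more common convention for library code.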

Fail to build read-the-docs

https://readthedocs.org/projects/nanopq/builds/

Centroid of Centroids using NanoPQ

I am looking into computing a centroid of centroids using NanoPQ. Is it possible? I have a first-level nanopq model with M=4, K=16, D=24. The codewords it produces have shape (4, 16, 6). Can this output be sent as input to a second-level nanoPQ to calculate a centroid of centroids? The reason for investigating this is to reduce processing time on large datasets.
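Mechanically, a (4, 16, 6) codeword array would first have to be flattened into a 2-D matrix of 64 six-dimensional vectors before it could serve as training data. The sketch below shows that reshape on a random stand-in array. The `can_train` check is a hypothetical helper encoding two assumed requirements for a second-level quantizer: the dimension must be divisible by its M, and there must be more training vectors than codewords.

```python
import numpy as np

M, Ks, Ds = 4, 16, 6
rng = np.random.default_rng(0)
codewords = rng.random((M, Ks, Ds)).astype(np.float32)  # stand-in for first-level codewords

# Stack the codewords of all subspaces: (4, 16, 6) -> (64, 6)
level2_input = codewords.reshape(M * Ks, Ds)

def can_train(n_vectors, dim, M2, Ks2):
    """Assumed constraints: dim divisible by M2, and more vectors than codewords."""
    return dim % M2 == 0 and Ks2 < n_vectors

print(level2_input.shape)
print(can_train(*level2_input.shape, M2=2, Ks2=16))
print(can_train(*level2_input.shape, M2=4, Ks2=16))  # 6 is not divisible by 4
```

Codewords from different subspaces describe different coordinate ranges of the original vectors, so whether clustering them together is meaningful depends on the data.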

about reconstructed

Thanks for your work.

```python
import nanopq
import numpy as np

N, Nt, D = 10000, 2000, 128
X = np.random.random((N, D)).astype(np.float32)  # 10,000 128-dim vectors to be indexed
Xt = np.random.random((Nt, D)).astype(np.float32)  # 2,000 128-dim vectors for training
query = np.random.random((D,)).astype(np.float32)  # a 128-dim query vector

pq = nanopq.PQ(M=8, Ks=256)
pq.fit(Xt, seed=123)
X_code = pq.encode(X)  # (10000, 8) with dtype=np.uint8
X_reconstructed = pq.decode(codes=X_code)

tmp = X[0]
tmp1 = X_reconstructed[0]
dis = np.sqrt(np.sum(np.square(tmp - tmp1)))
```

The dis is about 2.0+. Does that look right?
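One way to judge such a number is to compare it with a trivial baseline: the error obtained by replacing every vector with the dataset mean. For uniform random data in [0, 1), each dimension has variance 1/12, so the baseline distance is close to sqrt(D / 12). The sketch below computes it with NumPy only. It does not reproduce the PQ result, it only gives a reference scale.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10000, 128
X = rng.random((N, D)).astype(np.float32)

# Baseline "quantizer" with a single codeword: the mean vector.
mean_vec = X.mean(axis=0)
baseline = float(np.sqrt(((X - mean_vec) ** 2).sum(axis=1)).mean())

# Theoretical value for uniform data: each dimension has variance 1/12.
expected = float(np.sqrt(D / 12.0))

print(baseline, expected)
```

Both values are close to 3.27, so a reconstruction distance of about 2.0 is an improvement over the baseline. Uniform random data has no cluster structure for the codebooks to exploit, which limits how small the error can get.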

Why does parametric init still perform worse than the non-parametric one?

Hi! I like your code a lot! But I have one question: why is it that when I change the parameters in test_parametric_init to N, D, M, Ks = 100, 12, 4, 20, the test fails?

Do you have any idea? I thought the rotation matrix would work with different Ms.

Add `shubham0204/pq.rs`, a Rust implementation of `pq.py`, as a community resource in `README.md`

I wanted to learn how product quantization works, and this repository provided excellent code for understanding it. As I have been learning Rust for a few months, I decided to rewrite the pq.py script in Rust to understand each step thoroughly by implementing it myself. Here's the repository containing the Rust code: shubham0204/pq.rs.

The following steps still have to be taken to complete the project:

Complete README.md and add a small usage sample of the Rust API
Prepare a crate and upload it to crates.io

Do let me know if the repository can be included as a community resource. Like me, many other learners would like to study implementations of product quantization in languages other than Python, and a section listing implementations in other languages would be of great help. Moreover, I'm also working on a detailed blog post which will explain product quantization from first principles, with a Rust implementation.