jina-ai / annlite

⚡ A fast embedded library for approximate nearest neighbor search

License: Apache License 2.0

Python 96.90% Makefile 0.12% Dockerfile 0.12% Shell 2.86%
information-retrieval approximate-nearest-neighbor-search hnsw neural-search product-quantization cython image-search vector-quantization vector-search

annlite's People

Contributors

alaeddine-13 · bwanglzu · cristianmtr · davidbp · gusye1234 · hanxiao · hippopotamus0308 · jemmyshin · jina-bot · joanfm · numb3r3 · orangesodahub · ziniuyu


annlite's Issues

support load and save operations in HNSW

HNSW currently has no support for loading and saving, so every time we have to rebuild the whole graph from LMDB, which is very slow when the dataset is huge. A better approach is to save the HNSW graph directly and load it when initializing the indexer.

We need two APIs inside HNSW:

hnsw_indexer.load_index() and hnsw_indexer.save_index()
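
For reference, hnswlib (which backs the HNSW index) already serializes its graph to a single file, so the two methods could simply delegate to it. A minimal sketch, assuming the indexer wraps an hnswlib.Index (the class and attribute names here are illustrative, not annlite's actual internals):

import hnswlib

class HNSWIndexer:
    def __init__(self, dim: int, max_elements: int = 1_000_000):
        self._index = hnswlib.Index(space='l2', dim=dim)
        self._index.init_index(max_elements=max_elements)

    def save_index(self, path: str):
        # dumps the whole graph (links + vectors) to one file
        self._index.save_index(path)

    def load_index(self, path: str, max_elements: int = 0):
        # restores the graph directly, skipping the LMDB rebuild
        self._index.load_index(path, max_elements=max_elements)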

Deployment on Google Cloud Platform

Hi!

I am currently taking a look at jina-ai. The plan is to get a simple text-based document search going, and so far I've managed to build a simple local demo that uses the PQLiteIndexer (based on AnnLite).

from jina import Flow

flow = Flow(port=5050)
flow = (
    flow
    .add(uses=TfIdfEncoder, uses_with=dict(tfidf_fp=tfidf_fp))
    .add(uses='jinahub://PQLiteIndexer/latest', install_requirements=True, uses_with=dict(dim=dim))
)

The next step would be for me to see how I can deploy a prototype to Google Cloud Platform (GCP) and, if possible, use Cloud Run in order to keep costs to a minimum.

However, since AnnLite requires access to a local file system, I am not sure whether that's possible. I intended to use Cloud Storage, but it seems AnnLite would not support this.

What options do I have here?

Formatting header files

Hi, should we consider using clang-format or something similar to format all the cpp/h files?
If so, maybe I can open a PR for that.

Clean codebase

From today's meeting, we agreed on the following:

  • fix the pip instructions in the README
  • fix the table format (percentage, decimal)
  • use TYPE_CHECKING to guard type-only imports (see the sketch below)
  • use from_bytes and to_bytes for reading/writing binary Documents
  • an example of using Jina 3 and pqlite to achieve sharding on K8s
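
A minimal sketch of the TYPE_CHECKING pattern mentioned above (DocumentArray is just an example of an import that is only needed for type hints):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # evaluated only by static type checkers, never at runtime,
    # so the dependency adds nothing to import time
    from docarray import DocumentArray

def index(docs: 'DocumentArray') -> None:
    ...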

Persist Documents in a Database

Currently, only the HNSWPostgresIndexer supports persistence of Documents. Could we add database persistence to the PQLiteIndexer (not necessarily Postgres, any database provider would do)?

Workspace size doubles during first use (query-time init)

Hello,

I have a large dataset. My workspace is 27 GB after indexing (it stores sentence embeddings, token embeddings, and metadata).
But after the first query-time initialization, the workspace grows to 53 GB, which is very strange.

I have
annlite==0.3.5
jina==3.6.9

Any clues why this is happening?

Thanks

PQLite: update README

pqlite has introduced breaking changes, and as a result the following two use cases are now supported:

  • (Basic) For small-scale data (e.g., < 10M docs):

    1. directly use HNSW indexing without training (dtype=np.float32)
  • (Advanced) For large-scale data (e.g., > 10M docs): combine 1) Product Quantization, 2) IVF, and 3) HNSW

    1. train the VQ codec to build the IVF index
    2. train the PQ codec to compress embeddings
    3. build the IVF-HNSW index using PQ codes (dtype=np.uint8)
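
A sketch of what the two configurations could look like in the README (the import path and parameter names are assumed from the benchmark logs elsewhere in these issues, not verified against the current API):

import numpy as np
from annlite import AnnLite  # assumed import path

# Basic: plain HNSW over float32 embeddings, no training required
small_index = AnnLite(dim=128)

# Advanced: VQ-based IVF + PQ + HNSW over uint8 PQ codes
large_index = AnnLite(dim=128, n_cells=64, n_subvectors=64)
large_index.train(np.random.rand(20_480, 128).astype(np.float32))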

Add comments for py

Hi,
I'm reading the main body of annlite, and I found that some core functions lack comments, which may cause confusion (at least for me).
Maybe I can add some comments while I'm reading and open a PR for that?

annlite failed to build on M1 Mac

Python version: 3.9
MacOS version: 12.2.1
CMD used: pip install https://github.com/jina-ai/annlite/archive/refs/heads/main.zip (or pip install "docarray[full]")
Error log:

Building wheels for collected packages: annlite
Building wheel for annlite (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /opt/anaconda3/envs/jina/bin/python /opt/anaconda3/envs/jina/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmps65jnhoh
cwd: /private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-req-build-zbgplt5p
Complete output (48 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-11.1-arm64-cpython-39
creating build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/enums.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/profile.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/container.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/utils.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/helper.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/filter.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/math.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core
copying annlite/core/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/kv.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/table.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/vq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/pq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/pq_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/flat_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
running build_ext
creating var
creating var/folders
creating var/folders/8m
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include/python3.9 -c /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.cpp -o var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.o -std=c++14
building 'annlite.hnsw_bind' extension
creating build/temp.macosx-11.1-arm64-cpython-39
creating build/temp.macosx-11.1-arm64-cpython-39/bindings
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/pybind11/include -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/numpy/core/include -I./include/hnswlib -I/opt/anaconda3/envs/jina/include/python3.9 -c ./bindings/hnsw_bindings.cpp -o build/temp.macosx-11.1-arm64-cpython-39/./bindings/hnsw_bindings.o -O3 -march=native -stdlib=libc++ -mmacosx-version-min=10.7 -DVERSION_INFO="0.3.2" -std=c++14
clang: error: the clang compiler does not support '-march=native'
error: command '/usr/bin/clang' failed with exit code 1
ERROR: Failed building wheel for annlite
Failed to build annlite
ERROR: Could not build wheels for annlite which use PEP 517 and cannot be installed directly

support upload/download model to/from hubble

Since we are moving to JCloud deployment, we need to support uploading/downloading the PCA/PQ model to/from Hubble.

Thus, we need to implement these APIs:

self._projector_codec.upload(artifact='...')
self._projector_codec.download(artifact='...')

The artifact name is determined by the user and should be consistent throughout the whole pipeline. It should also be passed to jcloud.yaml.

PQLite: less data than limit

After some iterations, python examples/hnsw_benchmark.py (included in the PR) seems to fail. Can you reproduce the following?

Xtr: (124980, 128) vs Xte: (20, 128)
2021-11-23 11:42:03.020 | WARNING  | pqlite.index:train:131 - The pqlite has been trained or is not trainable. Please use ``force_retrain=True`` to retrain.
2021-11-23 11:42:46.358 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:43:30.497 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.95, 'recall': 0.95, 'train_time': 0.000240325927734375, 'index_time': 87.68162178993225, 'query_time': 0.1407299041748047, 'query_qps': 142.1162056300232, 'index_qps': 1425.384219049087, 'indexer_hyperparams': {'n_cells': 1, 'n_subvectors': 64}}
2021-11-23 11:43:30.908 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:43:31.179 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=8)
2021-11-23 11:43:33.298 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=8) with 20480 data...
2021-11-23 11:43:34.021 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:43:34.021 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/c905ae006031e55b1d8d51e87803d278
2021-11-23 11:44:19.429 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:23.197 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:24.466 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:28.833 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:30.179 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:33.951 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:35.024 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:38.036 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.99, 'recall': 0.99, 'train_time': 0.7391390800476074, 'index_time': 64.20390892028809, 'query_time': 0.2022690773010254, 'query_qps': 98.8781887319096, 'index_qps': 1946.6104494536003, 'indexer_hyperparams': {'n_cells': 8, 'n_subvectors': 64}}
2021-11-23 11:44:38.510 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:44:38.736 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:44:38.951 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:44:39.172 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:44:39.610 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:44:39.836 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:44:40.064 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:44:40.290 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:44:40.518 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=16)
2021-11-23 11:44:46.918 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=16) with 20480 data...
2021-11-23 11:44:47.653 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:44:47.653 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/75115be8393181300ec49112b88b2445
2021-11-23 11:45:30.760 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:45:46.906 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.9850000000000001, 'recall': 0.9850000000000001, 'train_time': 0.7362098693847656, 'index_time': 59.48536229133606, 'query_time': 0.28374195098876953, 'query_qps': 70.48658095958322, 'index_qps': 2101.021077889663, 'indexer_hyperparams': {'n_cells': 16, 'n_subvectors': 64}}
2021-11-23 11:45:47.490 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:45:47.970 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:45:48.488 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:45:48.952 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:45:49.400 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:45:49.650 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:45:50.114 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:45:50.552 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:45:50.987 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:45:51.448 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:45:51.888 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:45:52.329 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:45:52.773 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:45:53.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:45:53.662 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:45:54.112 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:45:54.555 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=32)
2021-11-23 11:46:10.758 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=32) with 20480 data...
2021-11-23 11:46:11.500 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:46:11.500 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/8f37c3b2ffd1c67e4c81e81f64db0eea
2021-11-23 11:47:01.267 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 1.0, 'recall': 1.0, 'train_time': 0.7422680854797363, 'index_time': 49.93944001197815, 'query_time': 0.30017614364624023, 'query_qps': 66.62754660333749, 'index_qps': 2502.6311862933007, 'indexer_hyperparams': {'n_cells': 32, 'n_subvectors': 64}}
2021-11-23 11:47:01.804 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:47:02.336 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:47:02.945 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:47:03.533 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:47:04.493 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:47:05.305 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:47:06.152 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:47:06.923 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:47:07.710 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:47:08.489 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:47:09.025 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:47:09.471 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:47:09.901 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:47:10.344 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:47:10.775 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:47:11.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:47:11.639 | INFO     | pqlite.index:clear:259 - Clear the index of cell-16
2021-11-23 11:47:12.076 | INFO     | pqlite.index:clear:259 - Clear the index of cell-17
2021-11-23 11:47:12.503 | INFO     | pqlite.index:clear:259 - Clear the index of cell-18
2021-11-23 11:47:12.932 | INFO     | pqlite.index:clear:259 - Clear the index of cell-19
2021-11-23 11:47:13.367 | INFO     | pqlite.index:clear:259 - Clear the index of cell-20
2021-11-23 11:47:13.815 | INFO     | pqlite.index:clear:259 - Clear the index of cell-21
2021-11-23 11:47:14.261 | INFO     | pqlite.index:clear:259 - Clear the index of cell-22
2021-11-23 11:47:14.700 | INFO     | pqlite.index:clear:259 - Clear the index of cell-23
2021-11-23 11:47:15.132 | INFO     | pqlite.index:clear:259 - Clear the index of cell-24
2021-11-23 11:47:15.587 | INFO     | pqlite.index:clear:259 - Clear the index of cell-25
2021-11-23 11:47:16.032 | INFO     | pqlite.index:clear:259 - Clear the index of cell-26
2021-11-23 11:47:16.472 | INFO     | pqlite.index:clear:259 - Clear the index of cell-27
2021-11-23 11:47:16.903 | INFO     | pqlite.index:clear:259 - Clear the index of cell-28
2021-11-23 11:47:17.350 | INFO     | pqlite.index:clear:259 - Clear the index of cell-29
2021-11-23 11:47:17.787 | INFO     | pqlite.index:clear:259 - Clear the index of cell-30
2021-11-23 11:47:18.227 | INFO     | pqlite.index:clear:259 - Clear the index of cell-31
2021-11-23 11:47:18.661 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=64)
2021-11-23 11:47:56.044 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=64) with 20480 data...
2021-11-23 11:47:57.003 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:47:57.004 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/e01ce8063d859fe594084b33a10515e8
2021-11-23 11:48:55.512 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
Traceback (most recent call last):
  File "examples/hnsw_benchmark.py", line 95, in <module>
    pq.search(docs, limit=top_k)
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/index.py", line 238, in search
    match_dists, match_docs = self.search_cells(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 144, in search_cells
    dists, doc_ids, cells = self.ivf_search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 107, in ivf_search
    _dists, _doc_idx = self.vec_index(cell_id).search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/core/index/hnsw/index.py", line 77, in search
    ids, dists = self._index.knn_query(query, k=limit)
RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small

Originally posted by @davidbp in #18 (comment)

Support PCA in ANNlite

To reduce memory usage, we need to implement PCA inside ANNlite. There will be two PRs for this feature:

  1. implement PCA based on scikit-learn (see the sketch below)
  2. integrate PCA with ANNlite
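
A minimal sketch of step 1, using scikit-learn's PCA to shrink embeddings before they reach the index (the shapes are illustrative):

import numpy as np
from sklearn.decomposition import PCA

# fit PCA once on a training sample of embeddings
train_embs = np.random.rand(10_000, 768).astype(np.float32)
pca = PCA(n_components=128)
pca.fit(train_embs)

# step 2 would apply this projection inside ANNlite,
# both at index time and at query time
reduced = pca.transform(train_embs)  # shape: (10_000, 128)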

Feat: Additional filters

I was just chatting with @davidbp about filtering in PQLite. For my fashion search example I'm looking at adding filters (similar to Amazon's) to pre-filter results. This work is being done in a separate branch of my repo.

At present I'm able to easily search in ranges (e.g. price, year) or above a certain threshold (e.g. rating):

filter = {
    "$and": {
        "year": {"$gte": 2011, "$lte": 2014},
        "price": {"$gte": 100, "$lte": 200},
        "rating": {"$gte": 3},
    },
}

But what would be really useful is a convenient way to express AND and OR.

Current implementation

Previously I tried something (which actually works) like:

filter = {
    "$and": {
        "year": {"$lte": 2014, "$gte": 2011},
        "price": {"$gte": 0, "$lte": 200},
    },
    "$or": {
        "baseColour": {"$eq": "Black"},
        "$or": {
            "baseColour": {"$eq": "White"},
            "$or": {
                "baseColour": {"$eq": "Blue"}
            }
        }
    }
}

But this is:

  1. Inelegant
  2. A real pain to build programmatically (i.e. by taking into account checked boxes on my frontend)

Desired implementation

Some new operators: $one_of and $all_of

filter = {
    "$and": {
        "year": {"$gte": 2011, "$lte": 2014},
        "price": {"$gte": 100, "$lte": 200},
        "rating": {"$gte": 3},
        "baseColour": {"$one_of": ['White', 'Blue', 'Black']},
        "season": {"$all_of": ['Summer', 'Spring', 'Fall']},
    },
}
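
To pin down the intended semantics, here is a plain-Python illustration (not an implementation proposal): $one_of matches when the tag value is any of the listed values, while $all_of matches when the tag (itself a list) contains every listed value.

def one_of(value, allowed):
    return value in allowed

def all_of(values, required):
    return set(required).issubset(values)

one_of('Black', ['White', 'Blue', 'Black'])       # True
all_of(['Summer', 'Spring'], ['Summer', 'Fall'])  # False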

Other thoughts

In Commsor (our community analysis tool) we use a lot of filters that are useful in the real world.

So I'd also like to propose the following operators:

  • $contains
  • $notcontains (e.g. we often want to filter out universities since we focus on enterprises, so we would say company_name $notcontains "university")

Notes

  • Rating and price aren't in the original dataset (as used by @bwanglzu in his notebook). I generated them programmatically to give us a richer dataset to play with

PQLite: benchmark with filtering

Performance benchmark experiment; something like this (a driver sketch follows the list):

  • QPS with filtering out 10% data
  • QPS with filtering out 30% data
  • QPS with filtering out 50% data
  • QPS with filtering out 80% data
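
A possible driver for this experiment, assuming the docs are tagged with a uniform x in [0, 1] so that a $gte threshold filters out a known fraction (the index construction and search signature are assumed, not verified):

import time
import numpy as np

queries = np.random.rand(100, 128).astype(np.float32)
for fraction in (0.1, 0.3, 0.5, 0.8):
    flt = {'x': {'$gte': fraction}}  # excludes ~fraction of the docs
    start = time.perf_counter()
    for q in queries:
        index.search(q, filter=flt, limit=10)  # 'index' is built beforehand
    qps = len(queries) / (time.perf_counter() - start)
    print(f'filter out {fraction:.0%}: {qps:.1f} QPS')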

PQLite: restore index from local storage

Rebuild the index (SQLite and vector index) from the local LMDB data:

  • refactor the abstract class BaseIndex
  • refactor the fit function to check whether the training is valid
  • add stat and clear APIs
  • rebuild the index from local disk (i.e., LMDB data), as sketched below:
    • restore the trained model from disk
    • rebuild the index from disk
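
A rough sketch of the restore path (every name here is hypothetical, for illustration only):

def restore(indexer, batch_size: int = 1024):
    # 1. restore the trained PQ/VQ parameters from disk
    indexer.load_model()
    # 2. replay the docs persisted in LMDB into a fresh vector index
    for batch in indexer.doc_store.batched_iterator(batch_size):
        indexer.vec_index.add(batch)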

PQLite: improve table query performance

The search bottleneck is the table SQL query:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    71                                               @line_profile
    72                                               def ivf_search(
    73                                                   self,
    74                                                   x: np.ndarray,
    75                                                   cells: np.ndarray,
    76                                                   where_clause: str = '',
    77                                                   where_params: Tuple = (),
    78                                                   limit: int = 10,
    79                                               ):
    80        15         18.0      1.2      0.0          dists = []
    81
    82        15         11.0      0.7      0.0          doc_idx = []
    83        15          7.0      0.5      0.0          cell_ids = []
    84        15          6.0      0.4      0.0          count = 0
    85        30        141.0      4.7      0.0          for cell_id in cells:
    86        15         23.0      1.5      0.0              cell_table = self.cell_table(cell_id)
    87        15      54765.0   3651.0      4.5              cell_size = cell_table.count()
    88        15         20.0      1.3      0.0              if cell_size == 0:
    89                                                           continue
    90
    91        15          6.0      0.4      0.0              indices = None
    92        15         10.0      0.7      0.0              if where_clause or (cell_table.deleted_count() > 0):
    93        15         11.0      0.7      0.0                  indices = []
    94    500030     806655.0      1.6     66.3                  for doc in cell_table.query(
    95        15          9.0      0.6      0.0                      where_clause=where_clause, where_params=where_params
    96                                                           ):
    97    500000     274113.0      0.5     22.5                      indices.append(doc['_id'])
    98
    99        15         27.0      1.8      0.0                  if len(indices) == 0:
   100                                                               continue
   101
   102        15      13655.0    910.3      1.1                  indices = np.array(indices, dtype=np.int64)
   103
   104        30      63932.0   2131.1      5.3              _dists, _doc_idx = self.vec_index(cell_id).search(
   105        15         32.0      2.1      0.0                  x, limit=min(limit, cell_size), indices=indices
   106                                                       )
   107
   108        15         22.0      1.5      0.0              if count >= limit and _dists[0] > dists[-1][-1]:
   109                                                           continue
   110
   111        15         24.0      1.6      0.0              dists.append(_dists)
   112        15          9.0      0.6      0.0              doc_idx.append(_doc_idx)
   113        15         41.0      2.7      0.0              cell_ids.extend([cell_id] * len(_dists))
   114        15         13.0      0.9      0.0              count += len(_dists)
   115
   116        15        113.0      7.5      0.0          cell_ids = np.array(cell_ids, dtype=np.int64)
   117        15         13.0      0.9      0.0          if len(dists) != 0:
   118        15        459.0     30.6      0.0              dists = np.hstack(dists)
   119        15        125.0      8.3      0.0              doc_idx = np.hstack(doc_idx)
   120
   121        15        105.0      7.0      0.0              indices = dists.argsort(axis=0)[:limit]
   122        15         28.0      1.9      0.0              dists = dists[indices]
   123        15         14.0      0.9      0.0              cell_ids = cell_ids[indices]
   124        15          9.0      0.6      0.0              doc_idx = doc_idx[indices]
   125
   126        15          6.0      0.4      0.0          doc_ids = []
   127       165        163.0      1.0      0.0          for cell_id, offset in zip(cell_ids, doc_idx):
   128       150       1750.0     11.7      0.1              doc_id = self.cell_table(cell_id).get_docid_by_offset(offset)
   129       150         94.0      0.6      0.0              doc_ids.append(doc_id)
   130        15          8.0      0.5      0.0          return dists, doc_ids, cell_ids
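
The profile shows that iterating the query cursor row by row (66.3%) and appending ids in Python (22.5%) dominate. One possible direction, assuming the cell table is backed by sqlite3, is to fetch the matching ids in a single call and convert them to NumPy directly (the helper, table, and column names below are hypothetical):

import sqlite3
import numpy as np

def query_ids(conn: sqlite3.Connection, where_clause: str, where_params: tuple):
    # one round-trip instead of a Python-level row loop
    sql = 'SELECT _id FROM cell_table'
    if where_clause:
        sql += ' WHERE ' + where_clause
    rows = conn.execute(sql, where_params).fetchall()
    return np.fromiter((r[0] for r in rows), dtype=np.int64, count=len(rows))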

Filtering using $in keyword crashes the executors and the database

I have a simple flow that looks like this:

from jina import Flow

f = (
    Flow(port_expose=8082, protocol='http', monitoring=True, port_monitoring=9090)
    .add(name='encoder', uses='jinahub+docker://CLIPEncoder')
    .add(name='processor',
         uses='jinahub+docker://PQLiteIndexer/latest',
         uses_with={
            'dim': 512,
            'columns': columns
         },
    )
)

Indexing works fine, and I can verify it using the /status endpoint, which shows the number of indexed documents. When I hit the /search endpoint, I can search and retrieve results correctly.

I also verified that filtering works by testing it with $eq. However, when I test it with $in, things go south. Not only does it not return any results, but it also seems to crash my entire database, so that I can't make calls to endpoints like /status and /search. Does anyone have any idea as to what is happening? Here is how I am structuring my filter query:

# this query searches the files with a tag 'owners' of type array which includes the given string
search_results = c.post(on="/search",
                 parameters={
                     "query": QUERIES[0],
                     "traversal_paths": '@r,c',
                     "limit": 3,
                     "filter":{"owners": {"$in": ["EGGWLJSUHT6GLWU2KIB0"]}}
                 })

[Bug] Executor from hub fails to start

I've been trying to use the example from Alex's multimodal search demo and also tested the example code for the PQLite extension on the Jina Hub page. Testing both examples, I get the following errors with jina 2.6.0, 2.6.2, and the latest version.

python ./app.py -t index -n 10
Fetching PQLiteIndexer from Jina Hub ...
DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses (raised from /home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/flatbuffers/compat.py:19)
  image_encoder@263243[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Traceback (most recent call last):
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 444, in main
    run()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "./app.py", line 94, in <module>
    main()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "./app.py", line 88, in main
    index(csv_file=CSV_FILE, max_docs=num_docs)
  File "./app.py", line 43, in index
    with flow_index:
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1132, in __enter__
    return self.start()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1179, in start
    self.enter_context(v)
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 208, in __enter__
    return self.start()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 692, in start
    self.enter_context(self.replica_set)
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 476, in __enter__
    self._peas.append(BasePea(_args).start())
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 135, in __init__
    self.runtime_cls = self._get_runtime_cls()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 427, in _get_runtime_cls
    update_runtime_cls(self.args)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/helper.py", line 106, in update_runtime_cls
    _args.uses = HubIO(_hub_args).pull()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 672, in pull
    executor, from_cache = HubIO.fetch_meta(
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/helper.py", line 323, in wrapper
    result = func(*args, **kwargs)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 588, in fetch_meta
    image_name=resp['image'],
KeyError: 'image'

PQLite: implement pq in hnsw via C++

Pros:

  • compresses the embeddings, saving memory
  • speeds up distance computation via the ADC method (sketched below)

Cons:

  • may result in degraded search quality
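
The issue targets a C++ implementation, but the ADC idea itself is compact enough to show in NumPy (the shapes and names below are mine, not annlite's):

import numpy as np

# codebooks: (n_subvectors, n_clusters, d_sub) PQ centroids
# codes:     (n_items, n_subvectors) uint8 PQ codes of the database
def adc_search(query, codebooks, codes, limit=10):
    n_sub, n_clusters, d_sub = codebooks.shape
    q_sub = query.reshape(n_sub, d_sub)
    # per-query table: squared L2 from each query subvector to each centroid
    table = ((codebooks - q_sub[:, None, :]) ** 2).sum(-1)  # (n_sub, n_clusters)
    # distance to an item is a sum of n_sub table lookups; no decompression
    dists = table[np.arange(n_sub), codes].sum(axis=1)      # (n_items,)
    order = np.argsort(dists)[:limit]
    return order, dists[order]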

PQLite: optimize product quantization index

Improve the following parts (probably in Cython), and benchmark the improvement:

  • the lookup operation of the asymmetric distance computation
  • the asymmetric distance computation table
  • add tests to the functions

The main issue with the current version is that unless a lot of data is filtered out, the code ends up being slower than cdist.

acquire train data from lmdb

Sometimes we need to train the PCA model after we have already created an indexer (for example, a memory issue appears after we have indexed thousands or even millions of documents, and we need PCA to fix it).

We need to fetch the training data from LMDB, but this becomes tricky once we move to JCloud, since we then need to fetch the data from the server instead of the local machine.

One way to solve this is to add a new client endpoint called /fetch:

data = client.post('/fetch', params={'batch_size': 1024})

For training we can use partial_train():

annlite.partial_train(data)
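
Put together, a batched training loop over the proposed endpoint could look like this (a sketch only; it assumes /fetch returns an empty result once the data is exhausted):

while True:
    data = client.post('/fetch', params={'batch_size': 1024})
    if not data:
        break  # assumed: an empty response signals no more batches
    annlite.partial_train(data)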

Indexing of long text documents is tricky

Hello,

My use case is search in long text documents.
Documents are split into chunks (let's say sentences) and each chunk has its own embedding; the root document has no embedding.
I am not able to index documents with the annlite indexer because the root document's embedding is missing; only the chunks can be indexed.
If I store documents directly in LMDB via self._index.doc_store(0).insert(root_docs), then loading the query flow throws an error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.

The 10 refers to the 5 root docs and 5 chunks together (dummy data).

Can you please help me?
Thanks

Can't be installed on Mac M1 chip

I am trying to install annlite on my MacBook with an M1 chip using pip install annlite, but I receive the following error:

clang: error: the clang compiler does not support '-march=native'
      error: command '/usr/bin/clang' failed with exit code 1

Is there any suggestion to fix it?
