jina-ai / annlite

⚡ A fast embedded library for approximate nearest neighbor search

License: Apache License 2.0

Python 96.90% Makefile 0.12% Dockerfile 0.12% Shell 2.86%
information-retrieval approximate-nearest-neighbor-search hnsw neural-search product-quantization cython image-search vector-quantization vector-search

annlite's People

Contributors

alaeddine-13 · bwanglzu · cristianmtr · davidbp · gusye1234 · hanxiao · hippopotamus0308 · jemmyshin · jina-bot · joanfm · numb3r3 · orangesodahub · ziniuyu


annlite's Issues

support load and save operations in HNSW

HNSW currently has no support for loading and saving, so every time we have to rebuild the whole graph from LMDB, which is very slow when the dataset is huge. A better approach is to save the HNSW graph directly and load it when initializing the indexer.

We need two APIs inside HNSW:

hnsw_indexer.load_index() and hnsw_indexer.save_index()
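
For reference, hnswlib (which backs the HNSW index) already serializes its graph to a single file, so the two methods could simply delegate to it. A minimal sketch, assuming the indexer wraps an hnswlib.Index (the class and attribute names here are illustrative, not annlite's actual internals):

import hnswlib

class HNSWIndexer:
    def __init__(self, dim: int, max_elements: int = 1_000_000):
        self._index = hnswlib.Index(space='l2', dim=dim)
        self._index.init_index(max_elements=max_elements)

    def save_index(self, path: str):
        # dumps the whole graph (links + vectors) to one file
        self._index.save_index(path)

    def load_index(self, path: str, max_elements: int = 0):
        # restores the graph directly, skipping the LMDB rebuild
        self._index.load_index(path, max_elements=max_elements)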

Deployment on Google Cloud Platform

Hi!

I am currently taking a look at jina-ai. The plan is to get a simple text-based document search going, and so far I've managed to build a simple local demo that uses the PQLiteIndexer (based on AnnLite).

from jina import Flow

flow = Flow(port=5050)
flow = (
    flow
    .add(uses=TfIdfEncoder, uses_with=dict(tfidf_fp=tfidf_fp))
    .add(uses='jinahub://PQLiteIndexer/latest', install_requirements=True, uses_with=dict(dim=dim))
)

The next step would be for me to see how I can deploy a prototype to Google Cloud Platform (GCP) and, if possible, use Cloud Run in order to keep costs to a minimum.

However, since AnnLite requires access to a local file system, I am not sure whether that's possible. I intended to use Cloud Storage, but it seems AnnLite would not support this.

What options do I have here?

Formatting header files

Hi, should we consider using clang-format or something similar to format all the cpp/h files?
If so, maybe I can open a PR for that.

Clean codebase

From today's meeting, we agreed on the following:

  • fix the pip instructions in the README
  • fix the table format (percentage, decimal)
  • use TYPE_CHECKING to guard type-only imports (see the sketch below)
  • use from_bytes and to_bytes for reading/writing binary Documents
  • an example of using Jina 3 and pqlite to achieve sharding on K8s
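
A minimal sketch of the TYPE_CHECKING pattern mentioned above (DocumentArray is just an example of an import that is only needed for type hints):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # evaluated only by static type checkers, never at runtime,
    # so the dependency adds nothing to import time
    from docarray import DocumentArray

def index(docs: 'DocumentArray') -> None:
    ...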

Persist Documents in a Database

Currently, only the HNSWPostgresIndexer supports persistence of Documents. Could we add database persistence to the PQLiteIndexer (not necessarily Postgres, any database provider would do)?

Workspace size doubles during first use (query-time init)

Hello,

I have a large dataset. My workspace is 27 GB after indexing (it stores sentence embeddings, token embeddings, and metadata).
But after the first query-time initialization, the workspace grows to 53 GB, which is very strange.

I have
annlite==0.3.5
jina==3.6.9

Any clues why this is happening?

Thanks

PQLite: update README

pqlite has introduced breaking changes, and as a result the following two use cases are now supported:

  • (Basic) For small-scale data (e.g., < 10M docs):

    1. directly use HNSW indexing without training (dtype=np.float32)
  • (Advanced) For large-scale data (e.g., > 10M docs): combine 1) Product Quantization, 2) IVF, and 3) HNSW

    1. train the VQ codec to build the IVF index
    2. train the PQ codec to compress embeddings
    3. build the IVF-HNSW index using PQ codes (dtype=np.uint8)
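
A sketch of what the two configurations could look like in the README (the import path and parameter names are assumed from the benchmark logs elsewhere in these issues, not verified against the current API):

import numpy as np
from annlite import AnnLite  # assumed import path

# Basic: plain HNSW over float32 embeddings, no training required
small_index = AnnLite(dim=128)

# Advanced: VQ-based IVF + PQ + HNSW over uint8 PQ codes
large_index = AnnLite(dim=128, n_cells=64, n_subvectors=64)
large_index.train(np.random.rand(20_480, 128).astype(np.float32))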

Add comments for py

Hi,
I'm reading the main body of annlite, and I found that some core functions lack comments, which may cause confusion (at least for me).
Maybe I can add some comments while I'm reading and open a PR for that?

annlite failed to build on M1 Mac

Python version: 3.9
MacOS version: 12.2.1
CMD used: pip install https://github.com/jina-ai/annlite/archive/refs/heads/main.zip (or pip install "docarray[full]")
Error log:

Building wheels for collected packages: annlite
Building wheel for annlite (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /opt/anaconda3/envs/jina/bin/python /opt/anaconda3/envs/jina/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmps65jnhoh
cwd: /private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-req-build-zbgplt5p
Complete output (48 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-11.1-arm64-cpython-39
creating build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/enums.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/profile.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/container.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/utils.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/helper.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/filter.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/math.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core
copying annlite/core/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/kv.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/table.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/vq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/pq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/pq_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/flat_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
running build_ext
creating var
creating var/folders
creating var/folders/8m
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include/python3.9 -c /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.cpp -o var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.o -std=c++14
building 'annlite.hnsw_bind' extension
creating build/temp.macosx-11.1-arm64-cpython-39
creating build/temp.macosx-11.1-arm64-cpython-39/bindings
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/pybind11/include -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/numpy/core/include -I./include/hnswlib -I/opt/anaconda3/envs/jina/include/python3.9 -c ./bindings/hnsw_bindings.cpp -o build/temp.macosx-11.1-arm64-cpython-39/./bindings/hnsw_bindings.o -O3 -march=native -stdlib=libc++ -mmacosx-version-min=10.7 -DVERSION_INFO="0.3.2" -std=c++14
clang: error: the clang compiler does not support '-march=native'
error: command '/usr/bin/clang' failed with exit code 1
ERROR: Failed building wheel for annlite
Failed to build annlite
ERROR: Could not build wheels for annlite which use PEP 517 and cannot be installed directly

support upload/download model to/from hubble

Since we are moving to JCloud deployment, we need to support uploading/downloading the PCA/PQ model to/from Hubble.

Thus, we need to implement these APIs:

self._projector_codec.upload(artifact='...')
self._projector_codec.download(artifact='...')

The artifact name is determined by the user and should be consistent throughout the whole pipeline. It should also be passed to jcloud.yaml.

PQLite: less data than limit

After some iterations, python examples/hnsw_benchmark.py (included in the PR) seems to fail. Can you reproduce the following?

Xtr: (124980, 128) vs Xte: (20, 128)
2021-11-23 11:42:03.020 | WARNING  | pqlite.index:train:131 - The pqlite has been trained or is not trainable. Please use ``force_retrain=True`` to retrain.
2021-11-23 11:42:46.358 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:43:30.497 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.95, 'recall': 0.95, 'train_time': 0.000240325927734375, 'index_time': 87.68162178993225, 'query_time': 0.1407299041748047, 'query_qps': 142.1162056300232, 'index_qps': 1425.384219049087, 'indexer_hyperparams': {'n_cells': 1, 'n_subvectors': 64}}
2021-11-23 11:43:30.908 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:43:31.179 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=8)
2021-11-23 11:43:33.298 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=8) with 20480 data...
2021-11-23 11:43:34.021 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:43:34.021 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/c905ae006031e55b1d8d51e87803d278
2021-11-23 11:44:19.429 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:23.197 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:24.466 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:28.833 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:30.179 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:33.951 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:35.024 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:38.036 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.99, 'recall': 0.99, 'train_time': 0.7391390800476074, 'index_time': 64.20390892028809, 'query_time': 0.2022690773010254, 'query_qps': 98.8781887319096, 'index_qps': 1946.6104494536003, 'indexer_hyperparams': {'n_cells': 8, 'n_subvectors': 64}}
2021-11-23 11:44:38.510 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:44:38.736 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:44:38.951 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:44:39.172 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:44:39.610 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:44:39.836 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:44:40.064 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:44:40.290 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:44:40.518 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=16)
2021-11-23 11:44:46.918 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=16) with 20480 data...
2021-11-23 11:44:47.653 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:44:47.653 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/75115be8393181300ec49112b88b2445
2021-11-23 11:45:30.760 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:45:46.906 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.9850000000000001, 'recall': 0.9850000000000001, 'train_time': 0.7362098693847656, 'index_time': 59.48536229133606, 'query_time': 0.28374195098876953, 'query_qps': 70.48658095958322, 'index_qps': 2101.021077889663, 'indexer_hyperparams': {'n_cells': 16, 'n_subvectors': 64}}
2021-11-23 11:45:47.490 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:45:47.970 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:45:48.488 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:45:48.952 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:45:49.400 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:45:49.650 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:45:50.114 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:45:50.552 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:45:50.987 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:45:51.448 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:45:51.888 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:45:52.329 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:45:52.773 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:45:53.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:45:53.662 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:45:54.112 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:45:54.555 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=32)
2021-11-23 11:46:10.758 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=32) with 20480 data...
2021-11-23 11:46:11.500 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:46:11.500 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/8f37c3b2ffd1c67e4c81e81f64db0eea
2021-11-23 11:47:01.267 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 1.0, 'recall': 1.0, 'train_time': 0.7422680854797363, 'index_time': 49.93944001197815, 'query_time': 0.30017614364624023, 'query_qps': 66.62754660333749, 'index_qps': 2502.6311862933007, 'indexer_hyperparams': {'n_cells': 32, 'n_subvectors': 64}}
2021-11-23 11:47:01.804 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:47:02.336 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:47:02.945 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:47:03.533 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:47:04.493 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:47:05.305 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:47:06.152 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:47:06.923 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:47:07.710 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:47:08.489 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:47:09.025 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:47:09.471 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:47:09.901 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:47:10.344 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:47:10.775 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:47:11.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:47:11.639 | INFO     | pqlite.index:clear:259 - Clear the index of cell-16
2021-11-23 11:47:12.076 | INFO     | pqlite.index:clear:259 - Clear the index of cell-17
2021-11-23 11:47:12.503 | INFO     | pqlite.index:clear:259 - Clear the index of cell-18
2021-11-23 11:47:12.932 | INFO     | pqlite.index:clear:259 - Clear the index of cell-19
2021-11-23 11:47:13.367 | INFO     | pqlite.index:clear:259 - Clear the index of cell-20
2021-11-23 11:47:13.815 | INFO     | pqlite.index:clear:259 - Clear the index of cell-21
2021-11-23 11:47:14.261 | INFO     | pqlite.index:clear:259 - Clear the index of cell-22
2021-11-23 11:47:14.700 | INFO     | pqlite.index:clear:259 - Clear the index of cell-23
2021-11-23 11:47:15.132 | INFO     | pqlite.index:clear:259 - Clear the index of cell-24
2021-11-23 11:47:15.587 | INFO     | pqlite.index:clear:259 - Clear the index of cell-25
2021-11-23 11:47:16.032 | INFO     | pqlite.index:clear:259 - Clear the index of cell-26
2021-11-23 11:47:16.472 | INFO     | pqlite.index:clear:259 - Clear the index of cell-27
2021-11-23 11:47:16.903 | INFO     | pqlite.index:clear:259 - Clear the index of cell-28
2021-11-23 11:47:17.350 | INFO     | pqlite.index:clear:259 - Clear the index of cell-29
2021-11-23 11:47:17.787 | INFO     | pqlite.index:clear:259 - Clear the index of cell-30
2021-11-23 11:47:18.227 | INFO     | pqlite.index:clear:259 - Clear the index of cell-31
2021-11-23 11:47:18.661 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=64)
2021-11-23 11:47:56.044 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=64) with 20480 data...
2021-11-23 11:47:57.003 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:47:57.004 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/e01ce8063d859fe594084b33a10515e8
2021-11-23 11:48:55.512 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
Traceback (most recent call last):
  File "examples/hnsw_benchmark.py", line 95, in <module>
    pq.search(docs, limit=top_k)
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/index.py", line 238, in search
    match_dists, match_docs = self.search_cells(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 144, in search_cells
    dists, doc_ids, cells = self.ivf_search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 107, in ivf_search
    _dists, _doc_idx = self.vec_index(cell_id).search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/core/index/hnsw/index.py", line 77, in search
    ids, dists = self._index.knn_query(query, k=limit)
RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small

Originally posted by @davidbp in #18 (comment)

Support PCA in ANNlite

To reduce memory usage, we need to implement PCA inside ANNlite. There will be two PRs for this feature:

  1. implement PCA based on scikit-learn (see the sketch below)
  2. integrate PCA with ANNlite
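
A minimal sketch of step 1, using scikit-learn's PCA to shrink embeddings before they reach the index (the shapes are illustrative):

import numpy as np
from sklearn.decomposition import PCA

# fit PCA once on a training sample of embeddings
train_embs = np.random.rand(10_000, 768).astype(np.float32)
pca = PCA(n_components=128)
pca.fit(train_embs)

# step 2 would apply this projection inside ANNlite,
# both at index time and at query time
reduced = pca.transform(train_embs)  # shape: (10_000, 128)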

Feat: Additional filters

I was just chatting with @davidbp about filtering in PQLite. For my fashion search example I'm looking at adding filters (similar to Amazon's) to pre-filter results. This work is being done in a separate branch of my repo.

At present I'm able to easily search in ranges (e.g. price, year) or above a certain threshold (e.g. rating):

filter = {
    "$and": {
        "year": {"$gte": 2011, "$lte": 2014},
        "price": {"$gte": 100, "$lte": 200},
        "rating": {"$gte": 3},
    },
}

But what would be really useful is a convenient way to express AND and OR.

Current implementation

Previously I tried something (which actually works) like:

filter = {
    "$and": {
        "year": {"$lte": 2014, "$gte": 2011},
        "price": {"$gte": 0, "$lte": 200},
    },
    "$or": {
        "baseColour": {"$eq": "Black"},
        "$or": {
            "baseColour": {"$eq": "White"},
            "$or": {
                "baseColour": {"$eq": "Blue"}
            }
        }
    }
}

But this is:

  1. Inelegant
  2. A real pain to build programmatically (i.e. by taking into account checked boxes on my frontend)

Desired implementation

Some new operators: $one_of and $all_of

filter = {
    "$and": {
        "year": {"$gte": 2011, "$lte": 2014},
        "price": {"$gte": 100, "$lte": 200},
        "rating": {"$gte": 3},
        "baseColour": {"$one_of": ['White', 'Blue', 'Black']},
        "season": {"$all_of": ['Summer', 'Spring', 'Fall']},
    },
}
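
To pin down the intended semantics, here is a plain-Python illustration (not an implementation proposal): $one_of matches when the tag value is any of the listed values, while $all_of matches when the tag (itself a list) contains every listed value.

def one_of(value, allowed):
    return value in allowed

def all_of(values, required):
    return set(required).issubset(values)

one_of('Black', ['White', 'Blue', 'Black'])       # True
all_of(['Summer', 'Spring'], ['Summer', 'Fall'])  # False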

Other thoughts

In Commsor (our community analysis tool) we use a lot of filters that are useful in the real world.

So I'd also like to propose the following operators:

  • $contains
  • $notcontains (e.g. we often want to filter out universities since we focus on enterprises, so we would say company_name $notcontains "university")

Notes

  • Rating and price aren't in the original dataset (as used by @bwanglzu in his notebook). I generated them programmatically to give us a richer dataset to play with

PQLite: benchmark with filtering

Performance benchmark experiment; something like this (a driver sketch follows the list):

  • QPS with filtering out 10% data
  • QPS with filtering out 30% data
  • QPS with filtering out 50% data
  • QPS with filtering out 80% data
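
A possible driver for this experiment, assuming the docs are tagged with a uniform x in [0, 1] so that a $gte threshold filters out a known fraction (the index construction and search signature are assumed, not verified):

import time
import numpy as np

queries = np.random.rand(100, 128).astype(np.float32)
for fraction in (0.1, 0.3, 0.5, 0.8):
    flt = {'x': {'$gte': fraction}}  # excludes ~fraction of the docs
    start = time.perf_counter()
    for q in queries:
        index.search(q, filter=flt, limit=10)  # 'index' is built beforehand
    qps = len(queries) / (time.perf_counter() - start)
    print(f'filter out {fraction:.0%}: {qps:.1f} QPS')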

PQLite: restore index from local storage

Rebuild the index (SQLite and vector index) from the local LMDB data:

  • refactor the abstract class BaseIndex
  • refactor the fit function to check whether the training is valid
  • add stat and clear APIs
  • rebuild the index from local disk (i.e., LMDB data), as sketched below:
    • restore the trained model from disk
    • rebuild the index from disk
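
A rough sketch of the restore path (every name here is hypothetical, for illustration only):

def restore(indexer, batch_size: int = 1024):
    # 1. restore the trained PQ/VQ parameters from disk
    indexer.load_model()
    # 2. replay the docs persisted in LMDB into a fresh vector index
    for batch in indexer.doc_store.batched_iterator(batch_size):
        indexer.vec_index.add(batch)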

PQLite: improve table query performance

The search bottleneck is the table SQL query:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    71                                               @line_profile
    72                                               def ivf_search(
    73                                                   self,
    74                                                   x: np.ndarray,
    75                                                   cells: np.ndarray,
    76                                                   where_clause: str = '',
    77                                                   where_params: Tuple = (),
    78                                                   limit: int = 10,
    79                                               ):
    80        15         18.0      1.2      0.0          dists = []
    81
    82        15         11.0      0.7      0.0          doc_idx = []
    83        15          7.0      0.5      0.0          cell_ids = []
    84        15          6.0      0.4      0.0          count = 0
    85        30        141.0      4.7      0.0          for cell_id in cells:
    86        15         23.0      1.5      0.0              cell_table = self.cell_table(cell_id)
    87        15      54765.0   3651.0      4.5              cell_size = cell_table.count()
    88        15         20.0      1.3      0.0              if cell_size == 0:
    89                                                           continue
    90
    91        15          6.0      0.4      0.0              indices = None
    92        15         10.0      0.7      0.0              if where_clause or (cell_table.deleted_count() > 0):
    93        15         11.0      0.7      0.0                  indices = []
    94    500030     806655.0      1.6     66.3                  for doc in cell_table.query(
    95        15          9.0      0.6      0.0                      where_clause=where_clause, where_params=where_params
    96                                                           ):
    97    500000     274113.0      0.5     22.5                      indices.append(doc['_id'])
    98
    99        15         27.0      1.8      0.0                  if len(indices) == 0:
   100                                                               continue
   101
   102        15      13655.0    910.3      1.1                  indices = np.array(indices, dtype=np.int64)
   103
   104        30      63932.0   2131.1      5.3              _dists, _doc_idx = self.vec_index(cell_id).search(
   105        15         32.0      2.1      0.0                  x, limit=min(limit, cell_size), indices=indices
   106                                                       )
   107
   108        15         22.0      1.5      0.0              if count >= limit and _dists[0] > dists[-1][-1]:
   109                                                           continue
   110
   111        15         24.0      1.6      0.0              dists.append(_dists)
   112        15          9.0      0.6      0.0              doc_idx.append(_doc_idx)
   113        15         41.0      2.7      0.0              cell_ids.extend([cell_id] * len(_dists))
   114        15         13.0      0.9      0.0              count += len(_dists)
   115
   116        15        113.0      7.5      0.0          cell_ids = np.array(cell_ids, dtype=np.int64)
   117        15         13.0      0.9      0.0          if len(dists) != 0:
   118        15        459.0     30.6      0.0              dists = np.hstack(dists)
   119        15        125.0      8.3      0.0              doc_idx = np.hstack(doc_idx)
   120
   121        15        105.0      7.0      0.0              indices = dists.argsort(axis=0)[:limit]
   122        15         28.0      1.9      0.0              dists = dists[indices]
   123        15         14.0      0.9      0.0              cell_ids = cell_ids[indices]
   124        15          9.0      0.6      0.0              doc_idx = doc_idx[indices]
   125
   126        15          6.0      0.4      0.0          doc_ids = []
   127       165        163.0      1.0      0.0          for cell_id, offset in zip(cell_ids, doc_idx):
   128       150       1750.0     11.7      0.1              doc_id = self.cell_table(cell_id).get_docid_by_offset(offset)
   129       150         94.0      0.6      0.0              doc_ids.append(doc_id)
   130        15          8.0      0.5      0.0          return dists, doc_ids, cell_ids
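
The profile shows that iterating the query cursor row by row (66.3%) and appending ids in Python (22.5%) dominate. One possible direction, assuming the cell table is backed by sqlite3, is to fetch the matching ids in a single call and convert them to NumPy directly (the helper, table, and column names below are hypothetical):

import sqlite3
import numpy as np

def query_ids(conn: sqlite3.Connection, where_clause: str, where_params: tuple):
    # one round-trip instead of a Python-level row loop
    sql = 'SELECT _id FROM cell_table'
    if where_clause:
        sql += ' WHERE ' + where_clause
    rows = conn.execute(sql, where_params).fetchall()
    return np.fromiter((r[0] for r in rows), dtype=np.int64, count=len(rows))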

Filtering using $in keyword crashes the executors and the database

I have a simple flow that looks like this:

from jina import Flow

f = (
    Flow(port_expose=8082, protocol='http', monitoring=True, port_monitoring=9090)
    .add(name='encoder', uses='jinahub+docker://CLIPEncoder')
    .add(name='processor',
         uses='jinahub+docker://PQLiteIndexer/latest',
         uses_with={
            'dim': 512,
            'columns': columns
         },
    )
)

Indexing works fine, and I can verify it using the /status endpoint, which shows the number of indexed documents. When I hit the /search endpoint, I can search and retrieve results correctly.

I also verified that filtering works by testing it with $eq. However, when I test it with $in, things go south. Not only does it not return any results, but it also seems to crash my entire database, so that I can't make calls to endpoints like /status and /search. Does anyone have any idea as to what is happening? Here is how I am structuring my filter query:

# this query searches the files with a tag 'owners' of type array which includes the given string
search_results = c.post(on="/search",
                 parameters={
                     "query": QUERIES[0],
                     "traversal_paths": '@r,c',
                     "limit": 3,
                     "filter":{"owners": {"$in": ["EGGWLJSUHT6GLWU2KIB0"]}}
                 })

[Bug] Executor from hub fails to start

I've been trying to use the example from Alex's multimodal search demo and also tested the example code for the PQLite extension on the Jina Hub page. Testing both examples, I get the following errors with jina 2.6.0, 2.6.2, and the latest version.

python ./app.py -t index -n 10
Fetching PQLiteIndexer from Jina Hub ...
DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses (raised from /home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/flatbuffers/compat.py:19)
  image_encoder@263243[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Traceback (most recent call last):
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 444, in main
    run()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "./app.py", line 94, in <module>
    main()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "./app.py", line 88, in main
    index(csv_file=CSV_FILE, max_docs=num_docs)
  File "./app.py", line 43, in index
    with flow_index:
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1132, in __enter__
    return self.start()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1179, in start
    self.enter_context(v)
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 208, in __enter__
    return self.start()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 692, in start
    self.enter_context(self.replica_set)
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 476, in __enter__
    self._peas.append(BasePea(_args).start())
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 135, in __init__
    self.runtime_cls = self._get_runtime_cls()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 427, in _get_runtime_cls
    update_runtime_cls(self.args)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/helper.py", line 106, in update_runtime_cls
    _args.uses = HubIO(_hub_args).pull()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 672, in pull
    executor, from_cache = HubIO.fetch_meta(
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/helper.py", line 323, in wrapper
    result = func(*args, **kwargs)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 588, in fetch_meta
    image_name=resp['image'],
KeyError: 'image'

PQLite: implement pq in hnsw via C++

Pros:

  • compresses the embeddings, saving memory
  • speeds up distance computation via the ADC method (sketched below)

Cons:

  • may result in degraded search quality
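
The issue targets a C++ implementation, but the ADC idea itself is compact enough to show in NumPy (the shapes and names below are mine, not annlite's):

import numpy as np

# codebooks: (n_subvectors, n_clusters, d_sub) PQ centroids
# codes:     (n_items, n_subvectors) uint8 PQ codes of the database
def adc_search(query, codebooks, codes, limit=10):
    n_sub, n_clusters, d_sub = codebooks.shape
    q_sub = query.reshape(n_sub, d_sub)
    # per-query table: squared L2 from each query subvector to each centroid
    table = ((codebooks - q_sub[:, None, :]) ** 2).sum(-1)  # (n_sub, n_clusters)
    # distance to an item is a sum of n_sub table lookups; no decompression
    dists = table[np.arange(n_sub), codes].sum(axis=1)      # (n_items,)
    order = np.argsort(dists)[:limit]
    return order, dists[order]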

PQLite: optimize product quantization index

Improve the following parts (probably in Cython), and benchmark the improvement:

  • the lookup operation of the asymmetric distance computation
  • the asymmetric distance computation table
  • add tests to the functions

The main issue with the current version is that unless a lot of data is filtered out, the code ends up being slower than cdist.

acquire train data from lmdb

Sometimes we need to train the PCA model after we have already created an indexer (for example, a memory issue appears after we have indexed thousands or even millions of documents, and we need PCA to fix it).

We need to fetch the training data from LMDB, but this becomes tricky once we move to JCloud, since we then need to fetch the data from the server instead of the local machine.

One way to solve this is to add a new client endpoint called /fetch:

data = client.post('/fetch', params={'batch_size': 1024})

For training we can use partial_train():

annlite.partial_train(data)
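
Put together, a batched training loop over the proposed endpoint could look like this (a sketch only; it assumes /fetch returns an empty result once the data is exhausted):

while True:
    data = client.post('/fetch', params={'batch_size': 1024})
    if not data:
        break  # assumed: an empty response signals no more batches
    annlite.partial_train(data)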

Indexing of long text documents is tricky

Hello,

My use case is search in long text documents.
Documents are split into chunks (let's say sentences) and each chunk has its own embedding; the root document has no embedding.
I am not able to index documents with the annlite indexer because the root document's embedding is missing; only the chunks can be indexed.
If I store documents directly in LMDB via self._index.doc_store(0).insert(root_docs), then loading the query flow throws an error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.

The 10 refers to the 5 root docs and 5 chunks together (dummy data).

Can you please help me?
Thanks

Can't be installed on Mac M1 chip

I am trying to install annlite on my MacBook with an M1 chip using pip install annlite, but I receive the following error:

clang: error: the clang compiler does not support '-march=native'
      error: command '/usr/bin/clang' failed with exit code 1

Is there any suggestion to fix it?
