amenra / retriv

A Python Search Engine for Humans 🥸

License: MIT License

Languages: Python 98.85%, Makefile 1.15%
Topics: bm25, information-retrieval, numba, search, search-engine, search-engine-optimization, dense-retrieval, semantic-search, hybrid-retrieval, sparse-retrieval

retriv's Introduction

🔥 News

  • [August 23, 2023] retriv 0.2.2 is out!
    This release adds experimental support for multi-field documents and filters. Please refer to the Advanced Retriever documentation.

  • [February 18, 2023] retriv 0.2.0 is out!
    This release adds support for Dense and Hybrid Retrieval. Dense Retrieval leverages the semantic similarity of the queries' and documents' vector representations, which can be computed directly by retriv or imported from other sources. Hybrid Retrieval mixes traditional retrieval, informally called Sparse Retrieval, and Dense Retrieval results to further improve retrieval effectiveness. As the library was almost completely redone, indices built with previous versions are no longer supported.

⚡️ Introduction

retriv is a user-friendly and efficient search engine implemented in Python supporting Sparse (traditional search with BM25, TF-IDF), Dense (semantic search) and Hybrid retrieval (a mix of Sparse and Dense Retrieval). It allows you to build a search engine in a single line of code.

retriv is built upon Numba for high-speed vector operations and automatic parallelization, PyTorch and Transformers for easy access and usage of Transformer-based Language Models, and Faiss for approximate nearest neighbor search. In addition, it provides automatic tuning functionalities to allow you to tune its internal components with minimal intervention.

✨ Main Features

Retrievers

Unified Search Interface

All the supported retrievers share the same search interface:

  • search: standard search functionality, what you expect from a search engine.
  • msearch: computes the results for multiple queries at once. It leverages automatic parallelization whenever possible.
  • bsearch: similar to msearch, but it automatically generates batches of queries to evaluate and allows dynamically writing the search results to disk in JSONL format. bsearch is handy for computing results for hundreds of thousands or even millions of queries without hogging your RAM. Pre-computed results can be leveraged for negative sampling during the training of Neural Models for Information Retrieval. A usage sketch follows this list.
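
A rough usage sketch (the queries format below, a list of dicts with "id" and "text" keys, and the extra bsearch arguments are assumptions; check the documentation of your installed version):

# Hedged sketch of msearch/bsearch usage; argument names are assumptions.
from retriv import SearchEngine

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
]
se = SearchEngine("msearch-demo").index(collection)

queries = [
    {"id": "q_1", "text": "witches masses"},
    {"id": "q_2", "text": "generals gathered"},
]

# Results for multiple queries at once (parallelized when possible).
results = se.msearch(queries=queries, cutoff=100)

# Batched search; path is an assumed argument for writing results to disk as JSONL.
se.bsearch(queries=queries, cutoff=100, path="results.jsonl")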

AutoTune

retriv automatically tunes the Faiss configuration for approximate nearest neighbors search by leveraging AutoFaiss to guarantee a 10 ms response time based on your available hardware. Moreover, it offers an automatic tuning functionality for BM25's parameters, which requires minimal user intervention. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one. Finally, it can automatically balance the importance of the lexical and semantic relevance scores computed by the Hybrid Retriever to maximize retrieval effectiveness. A short sketch of the tuning call follows.
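
As a rough sketch of the entry point (the autotune method and its queries/qrels parameters are referenced in the issues further down; everything else here is an assumption, not verbatim API):

# Hedged sketch: tune BM25's parameters on training queries and relevance
# judgments. `se` is a SearchEngine / SparseRetriever with an index already
# built; see the autotune issue below for the assumed queries/qrels format.
se.autotune(queries=queries, qrels=qrels)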

📚 Documentation

🔌 Requirements

python>=3.8

💾 Installation

pip install retriv

💡 Minimal Working Example

# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index").index(collection)

se.search("witches masses")

Output:

[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]
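
The dense and hybrid retrievers follow the same pattern. A hedged sketch of Dense Retrieval (the constructor arguments below mirror the dense-retriever parameters shown for HybridRetriever in the issues; exact names and defaults may differ):

# Hedged Dense Retrieval sketch; constructor argument names are assumptions.
from retriv import DenseRetriever

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
]

dr = DenseRetriever(
    index_name="dense-index",
    model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
    normalize=True,
    max_length=128,
    use_ann=False,  # exact search; avoids ANN quirks on tiny collections
)
dr = dr.index(collection)

dr.search(query="witches masses", return_docs=True, cutoff=5)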

🎁 Feature Requests

Would you like to see other features implemented? Please open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please drop me an e-mail.

📄 License

retriv is open-source software licensed under the MIT license.

retriv's People

Contributors

alex2awesome, amenra

retriv's Issues

Minimal example for Hybrid Search fails

First, I really like this project!

Respective sparse and dense examples work with minimal setup.

The issue is with the hybrid mode.

Here is the code:

from retriv import HybridRetriever

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

hr = HybridRetriever(
    # Shared params ------------------------------------------------------------
    index_name="hybrid-index",
    # Sparse retriever params --------------------------------------------------
    sr_model="bm25",
    min_df=1,
    tokenizer="whitespace",
    stemmer="english",
    stopwords="english",
    do_lowercasing=True,
    do_ampersand_normalization=True,
    do_special_chars_normalization=True,
    do_acronyms_normalization=True,
    do_punctuation_removal=True,
    # Dense retriever params ---------------------------------------------------
    dr_model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
    normalize=True,
    max_length=128,
    use_ann=True,
)

he = hr.index(collection)
he.search(
  query="witches",    # What to search for        
  return_docs=True,          # Default value, return the text of the documents
  cutoff=5,                # 100 is Default value, number of results to return
)

Error:

Building TDF matrix: 100%|██████████| 4/4 [00:01<00:00,  3.41it/s]
Building inverted index: 100%|██████████| 13/13 [00:00<00:00, 6786.90it/s]
Embedding documents: 100%|██████████| 4/4 [00:00<00:00, 206.63it/s]
Building ANN Searcher
100%|██████████| 1/1 [00:00<00:00, 20661.60it/s]
100%|██████████| 1/1 [00:00<00:00, 99.58it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /tmp/ipykernel_45461/1793453458.py:32 in <module>                                                │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_45461/1793453458.py'                        │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/hybrid_retrieve │
│ r.py:255 in search                                                                               │
│                                                                                                  │
│   252 │   │   """
│   253 │   │                                                                                      │
│   254 │   │   sparse_results = self.sparse_retriever.search(query, False, 1_000)                 │
│ ❱ 255 │   │   dense_results = self.dense_retriever.search(query, False, 1_000)                   │
│   256 │   │   hybrid_results = self.merger.fuse([sparse_results, dense_results])                 │
│   257 │   │   return (                                                                           │
│   258 │   │   │   self.prepare_results(                                                          │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/dense_retriever │
│ /dense_retriever.py:251 in search                                                                │
│                                                                                                  │
│   248 │   │   │   │   self.load_embeddings()                                                     │
│   249 │   │   │   doc_ids, scores = compute_scores(encoded_query, self.embeddings, cutoff)       │
│   250 │   │                                                                                      │
│ ❱ 251 │   │   doc_ids = self.map_internal_ids_to_original_ids(doc_ids)                           │
│   252 │   │                                                                                      │
│   253 │   │   return (                                                                           │
│   254 │   │   │   self.prepare_results(doc_ids, scores)                                          │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever. │
│ py:87 in map_internal_ids_to_original_ids                                                        │
│                                                                                                  │
│    84 │   │   return results                                                                     │
│    85 │                                                                                          │
│    86 │   def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:            │
│ ❱  87 │   │   return [self.id_mapping[doc_id] for doc_id in doc_ids]                             │
│    88 │                                                                                          │
│    89 │   def save(self):                                                                        │
│    90 │   │   raise NotImplementedError()                                                        │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever. │
│ py:87 in <listcomp>                                                                              │
│                                                                                                  │
│    84 │   │   return results                                                                     │
│    85 │                                                                                          │
│    86 │   def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:            │
│ ❱  87 │   │   return [self.id_mapping[doc_id] for doc_id in doc_ids]                             │
│    88 │                                                                                          │
│    89 │   def save(self):                                                                        │
│    90 │   │   raise NotImplementedError()                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: -1

Error while generating package metadata

× Encountered error while generating package metadata.
╰─> See below output.

(venv) celso@capri:~$ pip install retriv
Collecting retriv
  Using cached retriv-0.1.4-py3-none-any.whl (20 kB)
Requirement already satisfied: numpy in ./projects/venvs/venv/lib/python3.10/site-packages (from retriv) (1.22.4)
Collecting optuna
  Using cached optuna-3.0.5-py3-none-any.whl (348 kB)
Collecting indxr
  Using cached indxr-0.1.1-py3-none-any.whl (8.7 kB)
Collecting cyhunspell
  Using cached CyHunspell-1.3.4.tar.gz (2.7 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      Downloading https://github.com/hunspell/hunspell/archive/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz
      Extracting /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external/v1.6.2.tar.gz to /tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/external
      Traceback (most recent call last):
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 226, in pkgconfig
          raise RuntimeError(response)
      RuntimeError: /bin/sh: 1: pkg-config: not found
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/setup.py", line 46, in <module>
          hunspell_config = pkgconfig('hunspell', language='c++')
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 259, in pkgconfig
          lib_path = build_hunspell_package(os.path.join(BASE_DIR, 'external', 'hunspell-1.6.2'))
        File "/tmp/pip-install-5by9uuyz/cyhunspell_e8f37a450c4a4ce08ec1d2c733c1f0ba/find_library.py", line 189, in build_hunspell_package
          check_call(['autoreconf', '-vfi'])
        File "/usr/lib/python3.10/subprocess.py", line 364, in check_call
          retcode = call(*popenargs, **kwargs)
        File "/usr/lib/python3.10/subprocess.py", line 345, in call
          with Popen(*popenargs, **kwargs) as p:
        File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/usr/lib/python3.10/subprocess.py", line 1845, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: 'autoreconf'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(venv) celso@capri:~$ 

[BUG] Segmentation fault (core dumped)

First of all, thank you for this excellent library.

Describe the bug

Building TDF matrix: 100%|███████████████████████████████████████████████| 13905/13905 [00:34<00:00, 408.07it/s]
Building inverted index: 100%|███████████████████████████████████████| 148864/148864 [00:10<00:00, 14750.18it/s]
Batch search:   0%|                                                                   | 0/13905 [00:00<?, ?it/s]
Segmentation fault      (core dumped)

I am getting a Segmentation fault (core dumped) when using bsearch with the Sparse Retriever.

Current environment
  • CUDA:
    - GPU:
    - NVIDIA GeForce RTX 3090
    - available: True
    - version: 12.1

  • Packages:
    - absl-py: 2.0.0
    - accelerate: 0.24.1
    - aiohttp: 3.8.6
    - aiosignal: 1.3.1
    - alembic: 1.12.1
    - antlr4-python3-runtime: 4.9.3
    - appdirs: 1.4.4
    - async-timeout: 4.0.3
    - attrs: 23.1.0
    - autofaiss: 2.15.8
    - beautifulsoup4: 4.12.2
    - bleach: 6.1.0
    - cachetools: 5.3.2
    - cbor: 1.0.0
    - cbor2: 5.5.1
    - certifi: 2023.7.22
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - colorlog: 6.7.0
    - contourpy: 1.2.0
    - cramjam: 2.7.0
    - cycler: 0.12.1
    - dill: 0.3.7
    - docker-pycreds: 0.4.0
    - embedding-reader: 1.5.1
    - faiss-cpu: 1.7.4
    - fastparquet: 2023.10.1
    - filelock: 3.13.1
    - fire: 0.4.0
    - fonttools: 4.44.0
    - frozenlist: 1.4.0
    - fsspec: 2023.10.0
    - gitdb: 4.0.11
    - gitpython: 3.1.40
    - google-auth: 2.23.4
    - google-auth-oauthlib: 1.1.0
    - greenlet: 3.0.1
    - grpcio: 1.59.2
    - huggingface-hub: 0.17.3
    - hydra-core: 1.3.2
    - idna: 3.4
    - ijson: 3.2.3
    - indxr: 0.1.5
    - inscriptis: 2.3.2
    - ir-datasets: 0.5.5
    - jinja2: 3.1.2
    - joblib: 1.3.2
    - kaggle: 1.5.16
    - keybert: 0.8.3
    - kiwisolver: 1.4.5
    - krovetzstemmer: 0.8
    - lightning-utilities: 0.9.0
    - llvmlite: 0.41.1
    - lxml: 4.9.3
    - lz4: 4.3.2
    - mako: 1.3.0
    - markdown: 3.5.1
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.3
    - matplotlib: 3.8.1
    - mdurl: 0.1.2
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - multipipe: 0.1.0
    - multiprocess: 0.70.15
    - networkx: 3.2.1
    - nltk: 3.8.1
    - nmslib: 2.1.1
    - numba: 0.58.1
    - numpy: 1.26.1
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.18.1
    - nvidia-nvjitlink-cu12: 12.3.52
    - nvidia-nvtx-cu12: 12.1.105
    - oauthlib: 3.2.2
    - omegaconf: 2.3.0
    - oneliner-utils: 0.1.2
    - optuna: 3.4.0
    - orjson: 3.9.10
    - packaging: 23.2
    - pandas: 1.5.3
    - pillow: 10.1.0
    - pip: 23.3.1
    - protobuf: 4.23.4
    - psutil: 5.9.6
    - pyarrow: 12.0.1
    - pyasn1: 0.5.0
    - pyasn1-modules: 0.3.0
    - pyautocorpus: 0.1.12
    - pybind11: 2.6.1
    - pygments: 2.16.1
    - pyparsing: 3.1.1
    - pystemmer: 2.0.1
    - python-dateutil: 2.8.2
    - python-slugify: 8.0.1
    - pytorch-lightning: 2.1.1
    - pytorch-metric-learning: 2.3.0
    - pytz: 2023.3.post1
    - pyyaml: 6.0.1
    - ranx: 0.3.18
    - regex: 2023.10.3
    - requests: 2.31.0
    - requests-oauthlib: 1.3.1
    - retriv: 0.2.3
    - rich: 13.6.0
    - rsa: 4.9
    - safetensors: 0.4.0
    - scikit-learn: 1.3.2
    - scipy: 1.11.3
    - seaborn: 0.13.0
    - sentence-transformers: 2.2.2
    - sentencepiece: 0.1.99
    - sentry-sdk: 1.39.1
    - setproctitle: 1.3.3
    - setuptools: 68.2.2
    - six: 1.16.0
    - smmap: 5.0.1
    - soupsieve: 2.5
    - sqlalchemy: 2.0.23
    - sympy: 1.12
    - tabulate: 0.9.0
    - tensorboard: 2.15.1
    - tensorboard-data-server: 0.7.2
    - termcolor: 2.3.0
    - text-unidecode: 1.3
    - threadpoolctl: 3.2.0
    - tokenizers: 0.14.1
    - torch: 2.1.0
    - torchaudio: 2.1.0
    - torchmetrics: 1.2.0
    - torchvision: 0.16.0
    - tqdm: 4.66.1
    - transformers: 4.35.0
    - trec-car-tools: 2.6
    - triton: 2.1.0
    - typing-extensions: 4.8.0
    - unidecode: 1.3.7
    - unlzw3: 0.2.2
    - urllib3: 2.0.7
    - wandb: 0.16.1
    - warc3-wet: 0.2.3
    - warc3-wet-clueweb09: 0.2.5
    - webencodings: 0.5.1
    - werkzeug: 3.0.1
    - wheel: 0.41.2
    - yarl: 1.9.2
    - zlib-state: 0.1.6

  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.10.13
    - release: 5.15.0-88-generic
    - version: #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023

Cache directory not in home directory?

Dear @AmenRa,
Thanks for releasing this clean retrieval library! I was wondering if it's possible to set a custom cache directory. By default, it seems like the index is being stored in ~/.retriv?

Thank you!
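
A possible workaround (a sketch, assuming newer retriv versions expose a set_base_path helper for changing the default storage location; verify against your installed version):

# Hedged sketch: change the directory where retriv stores its indexes.
import retriv

retriv.set_base_path("/data/retriv-cache")  # hypothetical custom location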

[Feature Request] Ability to search and index documents with other metadata

Hi,
Nice choice with War Pigs in the example. :)

Been looking for a pure-python based search engine, ever since Whoosh stopped being actively developed.

Realize this library is just getting started out, but was wondering if it is possible to add the ability to search and filter by metadata as well.

For example:

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses", "album": "War pigs"},
  {"id": "doc_2", "text": "Finished with my woman", "album": "Paranoid"}
]

For example, I might want to search for all lines where the album contains the word "pigs".

Also, is the search OR by default, i.e., does it find ANY of the words in the query? Can we search with AND and other Boolean operators, as well as proximity and phrase search? Lucene has these features.

Any plan to combine knn search with the text search?

Multiprocess error triggers while trying example code

Hi AmenRa,

First of all I'd like to thank you for your efforts.
I'm trying to use retriv, but when I use the sample code you provided in the readme, I get the following error:

Building TDF matrix:   0%|                                                                                                                                                               | 0/4 [00:00<?, ?it/s]
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1268, in _count_vocab
    for doc in raw_documents:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\tqdm\std.py", line 1182, in __iter__
    for obj in iterable:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multipipe\multipipe.py", line 28, in to_generator
    with Pool(n_threads) as pool:
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 119, in Pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\mcelli\Documents\Training\Python\cosinematrix\venv\lib\site-packages\multiprocess\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
    return Pool(processes, initializer, initargs, maxtasksperchild,

By just running the example code:

# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index").index(collection)

se.search("witches masses")

Could you please help me fix this issue?
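
Not an official fix, but the error message itself points to the standard remedy on Windows (where multiprocessing uses "spawn"): guard the indexing call with an if __name__ == "__main__": block so worker processes can re-import the script safely. A sketch:

# Hedged sketch of the main-module guard suggested by the error message.
from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

def main():
    se = SearchEngine("new-index").index(collection)
    print(se.search("witches masses"))

if __name__ == "__main__":
    main()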

[Feature Request] Allow GPU for query embedding

Hi,

Really great and useful library. Thanks for making it available for everyone.

I am mostly using this for quick evaluation of search models and realized that DenseRetriever only uses the GPU for the documents when building the index, but not for the queries when running search, which makes it a bit slow for larger sets of queries.

Would you consider adding a use_gpu keyword argument to the search, msearch and bsearch methods of DenseRetriever and HybridRetriever? It looks like it could be handled similarly to the index method.

Just in case someone else is having the same issue, this problem can be avoided by directly setting the encoder device before running search, as follows:

use_gpu = True

dr = dr.index(collection, use_gpu=use_gpu)

if use_gpu:
    dr.encoder.change_device('cuda')

r = dr.bsearch(queries=queries)
dr.encoder.change_device('cpu')

Thanks!

using another ANN

@AmenRa:

Thanks for the good project.
I suggest using the Qdrant library for in-memory search as an alternative to FAISS.
I can help to implement it.

Thanks!

autotune Function Usage Example

I am looking for an example of how to structure the queries and qrels parameters of the autotune function, because I searched the repo and didn't find any example. Precisely, what should the keys and values of the queries dict be? And similarly for the qrels dict?

Thanks in advance for your help.
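
Not an authoritative answer, but since retriv evaluates configurations with ranx, the expected structures are presumably the ranx conventions: queries as a list of dicts with "id" and "text" keys, and qrels as a nested dict mapping query ids to {doc id: relevance}. A hedged sketch:

# Assumed (ranx-style) structures for autotune's queries and qrels arguments.
queries = [
    {"id": "q_1", "text": "witches masses"},
    {"id": "q_2", "text": "generals gathered"},
]

qrels = {
    "q_1": {"doc_2": 1},  # query id -> {doc id: relevance judgment}
    "q_2": {"doc_1": 1},
}

# sr.autotune(queries=queries, qrels=qrels)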

ANN_Searcher not dealing with -1 returned by faiss_index.search()

Traceback (most recent call last):
  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 251, in search
    doc_ids = self.map_internal_ids_to_original_ids(doc_ids)
  File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in map_internal_ids_to_original_ids
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
  File "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80, in <listcomp>
    return [self.id_mapping[doc_id] for doc_id in doc_ids]
KeyError: -1

Update "/lib/python3.8/site-packages/retriv/base_retriever.py", line 80 to

    return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]

would fix the problem.

Getting Out of Memory Error

Hi,

I have a dataset with around 2 million rows, and each text is no more than 20 tokens. I tried building an index using SparseRetriever:

from retriv import SparseRetriever

sr = SparseRetriever(
  index_name="bm25",
  model="bm25",
  min_df=1,
  tokenizer="whitespace",
  stemmer="english",
  stopwords="english",
  do_lowercasing=True,
  do_ampersand_normalization=True,
  do_special_chars_normalization=True,
  do_acronyms_normalization=True,
  do_punctuation_removal=False,
)
collections = [{"id": id, "text": text} for id, text in zip(ids, descs)]
sr.index(collections)

My disk space is around 14 GB and my RAM is around 96 GB with 24 processors. Is there any option to chunk the data and index it one chunk at a time?

Qrels and Run query ids do not match

Trial 0 failed with parameters: {'b': 0.37, 'k1': 9.600000000000001} because of the following error: 
AssertionError('Qrels and Run query ids do not match').

I guess this issue happens because the qrels do not contain scores for all the results present in the run.
Wouldn't it be worthwhile to filter the run dictionary to only the cases that are evaluated in the qrels?
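
A hedged sketch of that workaround, filtering the run to the query ids present in the qrels before evaluation (plain Python over the two dicts, not a retriv API):

# Keep only the run entries whose query ids also appear in the qrels.
def filter_run_to_qrels(run: dict, qrels: dict) -> dict:
    return {q_id: results for q_id, results in run.items() if q_id in qrels}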

HybridRetriever does not respect cutoff when calling sub-retrievers and the merger

In HybridRetriever.search:

        sparse_results = self.sparse_retriever.search(query, False, 1_000)
        dense_results = self.dense_retriever.search(query, False, 1_000)
        hybrid_results = self.merger.fuse([sparse_results, dense_results])

cutoff is not passed down.

Potential fix:

        sparse_results = self.sparse_retriever.search(query, False, cutoff)
        dense_results = self.dense_retriever.search(query, False, cutoff)
        hybrid_results = self.merger.fuse([sparse_results, dense_results], cutoff)

[Feature Request] Add documents to index after initializing?

Hi,

I understand that there are reasons why we only want to do indexing once, since there are corpus-level statistics that need to be calculated.

But is there any way to index a huge batch of documents, then index a few more, assuming they are from the same distribution?

Alex

HybridRetriever raises KeyError: -1 if the number of documents is less than 1_000

The cutoff of msearch for HybridRetriever is hardcoded to 1_000, which makes map_internal_ids_to_original_ids raise a KeyError when the number of documents is less than 1_000:

sparse_results = self.sparse_retriever.search(query, False, 1_000)
dense_results = self.dense_retriever.search(query, False, 1_000)

Thus, map_internal_ids_to_original_ids should be:

def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:
    return [self.id_mapping[doc_id] for doc_id in doc_ids if doc_id != -1]

[Feature Request] Use WAND Top-K Retrieval

@inproceedings{petri2013exploring,
  title={Exploring the magic of WAND},
  author={Petri, Matthias and Culpepper, J Shane and Moffat, Alistair},
  booktitle={Proceedings of the 18th Australasian Document Computing Symposium},
  pages={58--65},
  year={2013}
}

I believe that if you're using an inverted index with per-token document lists, the WAND top-k retrieval algorithm can speed up retrieval for small k on large document collections. I'm not sure whether it's relevant to this project. I once implemented it here: https://raw.githubusercontent.com/hockyy/ir-pa-2/main/bsbi.py

Input file format

Hi, I'm pretty new to this. Can you give an example of what an input file in JSONL format looks like?
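
Presumably (matching the in-memory collection format shown in the README), each line of the file is a single JSON object with an "id" and a "text" field:

{"id": "doc_1", "text": "Generals gathered in their masses"}
{"id": "doc_2", "text": "Just like witches at black masses"}
{"id": "doc_3", "text": "Evil minds that plot destruction"}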

Image search

Hi,
Thank you for a nice Elastic/Pinecone replacement 🙂
A small question (or perhaps a feature request): is it possible to use different neural networks for indexing and retrieval?
I mean, with a CLIP model one first computes vectors for the images, and then uses the second part of the same model to encode the text queries.

fsspec==2023.12.2 does not allow '**' in path

  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 175, in index
    self.index_aux(
  File "/lib/python3.8/site-packages/retriv/dense_retriever/dense_retriever.py", line 137, in index_aux
    self.ann_searcher.build()
  File "/lib/python3.8/site-packages/retriv/dense_retriever/ann_searcher.py", line 27, in build
    index, index_infos = build_index(
  File "/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 205, in build_index
    embedding_reader = EmbeddingReader(
  File "/lib/python3.8/site-packages/embedding_reader/embedding_reader.py", line 20, in __init__
    self.reader = NumpyReader(embeddings_folder)
  File "/lib/python3.8/site-packages/embedding_reader/numpy_reader.py", line 67, in __init__
    self.fs, embeddings_file_paths = get_file_list(embeddings_folder, "npy")
  File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 15, in get_file_list
    return _get_file_list(path, file_format)
  File "/lib/python3.8/site-packages/embedding_reader/get_file_list.py", line 46, in _get_file_list
    file_paths = fs.glob(glob_pattern)
  File "/lib/python3.8/site-packages/fsspec/spec.py", line 606, in glob
    pattern = glob_translate(path + ("/" if ends_with_sep else ""))
  File "/lib/python3.8/site-packages/fsspec/utils.py", line 734, in glob_translate
    raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component

This error does not occur with fsspec==2023.5.0.

[BUG] Corrupted log when using SearchEngine

Hi again,

I'm stuck with a strange behavior that, from my tests, seems to be related to the use of the SearchEngine.
I'm using a SingletonLogger that logs everything to stdout and persists that log to a file.
When the program runs, the index takes a bit of time to be built, and if I check the logfile I can correctly see everything printed up to that point. After the SearchEngine finishes building the index, the first row of the logfile becomes a series of NUL values.
Below a sample of code and of the log file.
Can anyone give me pointers to solve this?

_logger = Logger()

[code doing stuff, collecting collection mainly]

_logger.info("Building index...")
SearchEngine("new-index").index(collection, show_progress=False)
_logger.info("Index built.")

logfile.log

Doc strings

Hi, I can't help but notice that this codebase is largely missing docstrings.

This really hinders my experience as someone trying to use it. In particular, auto-completion does not work.
I know that you've written your documentation in Markdown files, which is fine, but it does not explain every function that I may want to use.

Is there an ETA on getting those?
