
Simple Haystack in-memory document store alternative that performs incremental indexing and supports SentencePiece tokenizer.

Home Page: https://guest400123064.github.io/bbm25-haystack/

License: Apache License 2.0


Better BM25 In-Memory Document Store

An in-memory document store is a great starting point for prototyping and debugging before migrating to production-grade stores like Elasticsearch. However, the original implementation of BM25 retrieval rebuilds the inverted index for the entire document store on every search. Furthermore, its tokenization is primitive: it only supports splitters based on regular expressions, which makes localization and domain adaptation challenging. This implementation is therefore a slight upgrade to the default BM25 in-memory document store: it updates the index incrementally and incorporates SentencePiece statistical sub-word tokenization.
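To make the idea of incremental indexing concrete, here is a minimal, self-contained sketch. It is not the code used by this package: the class name IncrementalBM25Index, its methods, the default hyperparameter values, and the IDF variant are all assumptions chosen for illustration, and plain whitespace splitting stands in for SentencePiece. The point is only that adding or removing a document touches that document's own postings, so nothing has to be rebuilt at query time.

```python
import math
from collections import Counter, defaultdict


class IncrementalBM25Index:
    """Toy BM25+ index (hypothetical) that is updated on write, not rebuilt per query."""

    def __init__(self, k1=1.5, b=0.75, delta=1.0):
        self.k1, self.b, self.delta = k1, b, delta
        self.postings = defaultdict(dict)  # token -> {doc_id: term frequency}
        self.doc_terms = {}                # doc_id -> Counter of its tokens
        self.doc_len = {}                  # doc_id -> number of tokens
        self.total_len = 0                 # running sum, so the average length is O(1)

    def add(self, doc_id, tokens):
        # Only the new document's tokens are touched; other postings stay as they are.
        counts = Counter(tokens)
        for token, freq in counts.items():
            self.postings[token][doc_id] = freq
        self.doc_terms[doc_id] = counts
        self.doc_len[doc_id] = len(tokens)
        self.total_len += len(tokens)

    def remove(self, doc_id):
        # Undo exactly what add() did for this document.
        for token in self.doc_terms.pop(doc_id):
            del self.postings[token][doc_id]
            if not self.postings[token]:
                del self.postings[token]
        self.total_len -= self.doc_len.pop(doc_id)

    def search(self, query_tokens, top_k=3):
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = self.total_len / n_docs
        scores = defaultdict(float)
        for token in set(query_tokens):
            docs = self.postings.get(token)
            if not docs:
                continue
            idf = math.log((n_docs + 1) / len(docs))  # one common IDF variant (assumed)
            for doc_id, freq in docs.items():
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avg_len)
                scores[doc_id] += idf * (freq * (self.k1 + 1) / (freq + norm) + self.delta)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


index = IncrementalBM25Index()
index.add("a", "there are many languages spoken today".split())
index.add("b", "elephants recognize themselves in mirrors".split())
hits = index.search("languages spoken".split())
print(hits)
```

In this sketch the cost of a write is proportional to the length of the written document, and a search only visits the postings of the query tokens.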

Installation

$ pip install bbm25-haystack

Alternatively, you can clone the repository and install it in editable mode, so that changes to the source code take effect without reinstalling:

$ git clone https://github.com/Guest400123064/bbm25-haystack.git
$ cd bbm25-haystack
$ pip install -e .

Usage

Quick Start

Below is an example of how you can build a minimal search engine with the bbm25_haystack components on their own. They are also compatible with Haystack pipelines.

from haystack import Document
from bbm25_haystack import BetterBM25DocumentStore, BetterBM25Retriever


# Create the store with default settings and write a few documents;
# the index is updated as documents are written.
document_store = BetterBM25DocumentStore()
document_store.write_documents([
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bio-luminescent waves.")
])

# Wrap the store in a retriever and run a query against it.
retriever = BetterBM25Retriever(document_store)
retriever.run(query="How many languages are spoken around the world today?")

API References

The full API reference is available in the project documentation. In a hurry? Below are the most important document store parameters you might want to explore:

  • k, b, delta - the three BM25+ hyperparameters.
  • sp_file - a path to a trained SentencePiece tokenizer .model file. The default tokenizer is copied directly from the LLaMA-2-7B-32K tokenizer, with a vocabulary size of 32,000.
  • n_grams - defaults to 1, which means text (both query and document) is tokenized into uni-grams. If set to 2, the tokenizer also augments the list of uni-grams with bi-grams, and so on. If specified as a tuple, e.g., (2, 3), the tokenizer produces only bi-grams and tri-grams, without any uni-grams.
  • haystack_filter_logic - see below.
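The n_grams behavior described above can be sketched in a few lines. The helper expand_n_grams below is hypothetical and not part of this package. It assumes that a tuple is read as an inclusive (min, max) range, which matches the (2, 3) example but is otherwise a guess, and it represents each n-gram as a tuple of tokens.

```python
def expand_n_grams(tokens, n_grams=1):
    """Hypothetical helper mirroring the n_grams description; not the package's code."""
    if isinstance(n_grams, int):
        sizes = range(1, n_grams + 1)   # e.g. 2 -> uni-grams plus bi-grams
    else:
        low, high = n_grams             # assumed to be an inclusive (min, max) range
        sizes = range(low, high + 1)    # e.g. (2, 3) -> bi-grams and tri-grams only
    return [
        tuple(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]


tokens = ["better", "bm25", "store"]
print(expand_n_grams(tokens, 2))
# [('better',), ('bm25',), ('store',), ('better', 'bm25'), ('bm25', 'store')]
print(expand_n_grams(tokens, (2, 3)))
# [('better', 'bm25'), ('bm25', 'store'), ('better', 'bm25', 'store')]
```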

The retriever parameters are largely the same as InMemoryBM25Retriever.
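For orientation, the three hyperparameters listed above appear in the commonly cited form of the BM25+ scoring function as follows. This is the textbook formulation, not a transcription of this package's code. It assumes that k corresponds to the usual k_1, and the exact IDF variant is left unspecified because implementations differ.

```latex
\operatorname{score}(Q, D) = \sum_{q \in Q} \operatorname{IDF}(q)
\left(
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
  + \delta
\right)
```

Here f(q, D) is the frequency of query term q in document D, |D| is the document length in tokens, and avgdl is the average document length in the store. k_1 controls term-frequency saturation, b controls length normalization, and delta is the lower bound that BM25+ adds for every matching term.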

Filtering Logic

The current document store uses document_matches_filter shipped with Haystack to perform filtering by default, which is the same as InMemoryDocumentStore.

However, there is also an alternative filtering logic shipped with this implementation (unstable at this point). To use this alternative logic, initialize the document store with haystack_filter_logic=False. Please find comments and implementation details in filters.py. TL;DR:

  • Any comparison involving None, i.e., a missing value, always returns False, regardless of whether the document attribute value or the filter value is missing.
  • Comparison with pandas.DataFrame is always prohibited to reduce surprises.
  • No implicit datetime conversion from string values.
  • in and not in allow any Iterable as the filter value, without the list constraint.
  • Custom comparison functions are allowed for more flexibility. Note that the inputs to a custom comparison function are NEVER checked, i.e., no missing value check, no DataFrame check, etc. Users should ensure that the input values are valid and that the return value is always a boolean. The inputs are always supplied in the order of document value, then filter value.

In this case, the negation logic needs to be reconsidered because False can now result from both the input nullity check and the actual comparison. For instance, in and not in both yield non-matching when a value is missing. But I think separating input processing from comparison makes the filtering behavior more transparent.
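The separation between input checks and comparison can be illustrated with a small stand-alone sketch. It is not the code in filters.py: the COMPARATORS table, the compare function, and the operator spellings are assumptions made for illustration, and the DataFrame and datetime rules are left out to keep it short.

```python
import operator

# Hypothetical operator table; names and spellings are illustrative only.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "in": lambda doc_value, filter_value: doc_value in filter_value,
    "not in": lambda doc_value, filter_value: doc_value not in filter_value,
}


def compare(doc_value, op, filter_value):
    """Input processing first, then the actual comparison."""
    if callable(op):
        # Custom comparison: inputs are passed through unchecked,
        # in the order (document value, filter value).
        return op(doc_value, filter_value)
    if doc_value is None or filter_value is None:
        return False  # a missing value never matches, whatever the operator
    return COMPARATORS[op](doc_value, filter_value)


meta = {"lang": "en"}
print(compare(meta.get("lang"), "in", {"en", "de"}))   # True: any iterable works
print(compare(meta.get("year"), "in", [2023]))         # False: attribute missing
print(compare(meta.get("year"), "not in", [2023]))     # False as well
```

The last two calls show why negation needs care under this logic: with a missing value, not in is not the logical negation of in, because both short-circuit to False before any comparison happens.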

Search Quality Evaluation

This repository includes a simple script for evaluating search quality on the BEIR benchmark. To use it, clone the repository (or download the script manually and place it in a folder named scripts) and install the additional dependency needed to run it.

$ pip install beir

To run the script, you may want to specify the dataset name and BM25 hyperparameters. For example:

$ python scripts/benchmark_beir.py --datasets scifact arguana --bm25-k1 1.2 --n-grams 2 --output eval.csv

It automatically downloads the benchmarking dataset to benchmarks/beir, where benchmarks is at the same level as scripts. You may also check the help page for more information.

$ python scripts/benchmark_beir.py --help

New benchmarking scripts are expected to be added in the future.

License

bbm25-haystack is distributed under the terms of the Apache-2.0 license.
