
bclavie / ragatouille

2.3K stars · 2.3K forks · 156 open issues · 12.45 MB

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.

License: Apache License 2.0

Python 100.00%

ragatouille's People

Contributors

abcbum, adharm, almonok, alxpez, bclavie, bjsi, corrius, dalehille, deven367, diegi97, eltociear, gautamr-samagra, gmartin-dev, hwchase17, jlscheerer, jonppe, joshuapurtell, mauryaland, okhat, petergoldstein, potrock, primouomo89, samibouge, shauryr, tm17-abcgen



ragatouille's Issues

Google Colab support

Hi there,

do you have any ideas or clues about the issue with Google Colab? If so, I could look into it.

Indexing failing: subcommand issues

I'm testing the Studio Ghibli sample code from the README, running on an Ubuntu 22.04.3 distribution on an EC2 machine, with Python 3.10.

The problem seems to be related to CUDA. Here is my code:

from ragatouille import RAGPretrainedModel
from ragatouille.utils import get_wikipedia_page
from ragatouille.data import CorpusProcessor
import os

os.environ['CUDA_HOME'] = '/usr/local/cuda-12.3'

RAG = RAGPretrainedModel.from_pretrained('colbert-ir/colbertv2.0')

my_documents = [get_wikipedia_page("Hayao_Miyazaki"), get_wikipedia_page("Studio_Ghibli")]
processor = CorpusProcessor()
processed_docs = processor.process_corpus(my_documents)

index_path = RAG.index(index_name="ghibli_test", collection=processed_docs)

and here is the output:


[Jan 16, 14:37:46] #> Creating directory .ragatouille/colbert/indexes/ghibli_test 


#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
[Jan 16, 14:37:49] [0] 		 #> Encoding 96 passages..
[Jan 16, 14:37:51] [0] 		 avg_doclen_est = 189.6770782470703 	 len(local_sample) = 96
[Jan 16, 14:37:51] [0] 		 Creating 2,048 partitions.
[Jan 16, 14:37:51] [0] 		 *Estimated* 18,208 embeddings.
[Jan 16, 14:37:51] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/ghibli_test/plan.json ..
WARNING clustering 17299 points to 2048 centroids: please provide at least 79872 training points
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 229, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 307, in _compute_avg_residual
    compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/codecs/residual.py", line 24, in __init__
    ResidualCodec.try_load_torch_extensions(self.use_gpu)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/codecs/residual.py", line 103, in try_load_torch_extensions
    decompress_residuals_cpp = load(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'decompress_residuals_cpp': [1/2] /usr/local/cuda-12.3/bin/nvcc  -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda-12.3/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/codecs/decompress_residuals.cu -o decompress_residuals.cuda.o 
FAILED: decompress_residuals.cuda.o 
/usr/local/cuda-12.3/bin/nvcc  -DTORCH_EXTENSION_NAME=decompress_residuals_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda-12.3/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /home/ubuntu/.local/lib/python3.10/site-packages/colbert/indexing/codecs/decompress_residuals.cu -o decompress_residuals.cuda.o 
/bin/sh: 1: /usr/local/cuda-12.3/bin/nvcc: not found
ninja: build stopped: subcommand failed.

Clustering 17299 points in 128D to 2048 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (0.05 s, search 0.04 s): objective=2913.76 imbalance=1.486 nsplit=0       
[Jan 16, 14:37:51] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

When trying to run this without setting CUDA_HOME, indexing fails as well. Looking for help on this, thanks!
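For what it's worth, the last line of the build log (/usr/local/cuda-12.3/bin/nvcc: not found) suggests CUDA_HOME points at a toolkit directory that doesn't actually contain the nvcc binary, which ColBERT needs to JIT-compile its torch extensions. A minimal sanity check, just a debugging sketch on my side, would be:

import os
import shutil
import subprocess

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
nvcc = os.path.join(cuda_home, "bin", "nvcc")

print("CUDA_HOME =", cuda_home)
print("nvcc at CUDA_HOME:", os.path.isfile(nvcc))
print("nvcc on PATH:", shutil.which("nvcc"))

if os.path.isfile(nvcc):
    # ColBERT's extension build invokes exactly this compiler
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)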

DistributedDataParallel issue on .train()

I understand this may not be a RAGatouille issue, but I can't seem to get a simple training example to work. I keep running into the following within trainer.train(...):

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.

Any ideas? I'm on an Apple M3 Max chip; here is the script I'm running with Python 3.10:

from ragatouille import RAGTrainer
from ragatouille.utils import get_wikipedia_page

if __name__ == "__main__":

    pairs = [
        ("Who won the premier league in 1976?", "Liverpool won the premier league in 1976."),
        ("Who was the manager for the premier league winners in 1976?", "Bob Paisley was the manager for the premier league winners in 1976."),
        ("Who was the premier league runner up in 1988-89?", "Liverpool was the premier league runner up in 1988-89."),
        ("Who has the most premier league titles?", "Manchester United has the most premier league titles."),
    ]

    my_full_corpus = [get_wikipedia_page("List_of_English_football_champions")]

    trainer = RAGTrainer(model_name="MyFineTunedColBERT", pretrained_model_name="colbert-ir/colbertv2.0") # In this example, we run fine-tuning

    trainer.prepare_training_data(raw_data=pairs, data_out_path="./data/", all_documents=my_full_corpus)

    trainer.train(batch_size=32) # Train with the default hyperparams
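My current guess: there is no CUDA device on Apple Silicon, so the model parameters stay on CPU while the ColBERT trainer still wraps the model in DistributedDataParallel with GPU device_ids, which is exactly the mismatch the error describes. A quick check that makes this visible (just a sketch, not a fix):

import torch

# On an M3 Max this prints False: there is no CUDA backend on Apple Silicon,
# so any code path that assumes device_ids=[0] refers to a GPU will fail.
print("CUDA available:", torch.cuda.is_available())
# MPS exists, but the ColBERT training path appears to target CUDA only.
print("MPS available:", torch.backends.mps.is_available())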

README Indexing fails on two GPUs

I'm not sure if the problem is related to Colab; I also get an error using Jupyter locally on my Ubuntu server.
The basic README.md example doesn't work and the cell never finishes executing.

Here's the code and stacktrace if that helps:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
my_documents = [
    "This is a great excerpt from my wealth of documents",
    "Once upon a time, there was a great document"
]

index_path = RAG.index(index_name="my_index", collection=my_documents)

It outputs the following:

[Jan 06, 10:41:35] #> Creating directory .ragatouille/colbert/indexes/my_index 


#> Starting...
#> Starting...
nranks = 2 	 num_gpus = 2 	 device=1
[Jan 06, 10:41:38] [1] 		 #> Encoding 0 passages..
nranks = 2 	 num_gpus = 2 	 device=0
[Jan 06, 10:41:38] [0] 		 #> Encoding 2 passages..
 File "/home/np/miniconda3/envs/np-ml/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 101, in setup
    avg_doclen_est = self._sample_embeddings(sampled_pids)
  File "/home/np/miniconda3/envs/np-ml/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 140, in _sample_embeddings
    self.num_sample_embs = torch.tensor([local_sample_embs.size(0)]).cuda()
AttributeError: 'NoneType' object has no attribute 'size'

Originally posted by @timothepearce in #14 (comment)
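A workaround that may apply here (an assumption, based on rank 1 receiving 0 passages and its local sample ending up None): pin the process to a single GPU before anything touches torch, so only one rank is launched.

import os

# Must be set before torch / ragatouille are imported,
# otherwise both GPUs are visible and ColBERT launches two ranks.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
my_documents = [
    "This is a great excerpt from my wealth of documents",
    "Once upon a time, there was a great document"
]
index_path = RAG.index(index_name="my_index", collection=my_documents)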

Qdrant Support?

Qdrant supports sparse vectors.

Is it possible to get COLBERTv2 supported with Qdrant as a storage layer?

Support exporting index to HuggingFace Hub

Indexing is time-consuming, and oftentimes people would like to be able to easily share pre-built indexes for various common datasets, both for general-domain applications (Wikipedia, code documentation...) and for evaluation purposes.

A simple way to support this would be to add a util function that exports the full index folder to the Hugging Face Hub, effectively exporting both the ColBERT config and the compressed vectors, allowing you to do something like RAGPretrainedModel.from_prebuilt_index("EXAMPLE_USER/Wikipedia") and immediately begin querying the index.
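A rough sketch of what that util could look like, using huggingface_hub's upload_folder (the helper name and repo id are hypothetical; from_prebuilt_index does not exist yet):

from huggingface_hub import HfApi

def push_index_to_hub(index_path: str, repo_id: str):
    """Hypothetical helper: upload a full ColBERT index folder
    (config + compressed vectors) to a Hugging Face model repo."""
    api = HfApi()
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=index_path, repo_id=repo_id, repo_type="model")

# e.g. push_index_to_hub(".ragatouille/colbert/indexes/my_index", "EXAMPLE_USER/Wikipedia")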

Error when Getting Started

Not sure what I am doing wrong.

!pip install git+https://github.com/bclavie/RAGatouille.git

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 from ragatouille import RAGPretrainedModel
      3 RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/ragatouille/__init__.py:2
      1 __version__ = "0.0.4a4"
----> 2 from .RAGPretrainedModel import RAGPretrainedModel
      3 from .RAGTrainer import RAGTrainer
      5 __all__ = ["RAGPretrainedModel", "RAGTrainer"]

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/ragatouille/RAGPretrainedModel.py:7
      4 from langchain.retrievers.document_compressors.base import BaseDocumentCompressor
      5 from langchain_core.retrievers import BaseRetriever
----> 7 from ragatouille.data.corpus_processor import CorpusProcessor
      8 from ragatouille.data.preprocessors import llama_index_sentence_splitter
      9 from ragatouille.integrations import (
     10     RAGatouilleLangChainCompressor,
     11     RAGatouilleLangChainRetriever,
     12 )

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/ragatouille/data/__init__.py:1
----> 1 from .corpus_processor import CorpusProcessor
      2 from .preprocessors import llama_index_sentence_splitter
      3 from .training_data_processor import TrainingDataProcessor

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/ragatouille/data/corpus_processor.py:3
      1 from typing import Callable, Optional, Union
----> 3 from ragatouille.data.preprocessors import llama_index_sentence_splitter
      6 class CorpusProcessor:
      7     def __init__(
      8         self,
      9         document_splitter_fn: Optional[Callable] = llama_index_sentence_splitter,
     10         preprocessing_fn: Optional[Union[Callable, list[Callable]]] = None,
     11     ):

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/ragatouille/data/preprocessors.py:1
----> 1 from llama_index import Document
      2 from llama_index.text_splitter import SentenceSplitter
      5 def llama_index_sentence_splitter(documents: list[str], chunk_size=256):

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/__init__.py:21
     17 from llama_index.embeddings import OpenAIEmbedding
     19 # indices
     20 # loading
---> 21 from llama_index.indices import (
     22     ComposableGraph,
     23     DocumentSummaryIndex,
     24     GPTDocumentSummaryIndex,
     25     GPTKeywordTableIndex,
     26     GPTKnowledgeGraphIndex,
     27     GPTListIndex,
     28     GPTRAKEKeywordTableIndex,
     29     GPTSimpleKeywordTableIndex,
     30     GPTTreeIndex,
     31     GPTVectorStoreIndex,
     32     KeywordTableIndex,
     33     KnowledgeGraphIndex,
     34     ListIndex,
     35     RAKEKeywordTableIndex,
     36     SimpleKeywordTableIndex,
     37     SummaryIndex,
     38     TreeIndex,
     39     VectorStoreIndex,
     40     load_graph_from_storage,
     41     load_index_from_storage,
     42     load_indices_from_storage,
     43 )
     45 # structured
     46 from llama_index.indices.common.struct_store.base import SQLDocumentContextBuilder

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/indices/__init__.py:44
     39 from llama_index.indices.struct_store.sql import (
     40     GPTSQLStructStoreIndex,
     41     SQLStructStoreIndex,
     42 )
     43 from llama_index.indices.tree.base import GPTTreeIndex, TreeIndex
---> 44 from llama_index.indices.vector_store import GPTVectorStoreIndex, VectorStoreIndex
     46 __all__ = [
     47     "load_graph_from_storage",
     48     "load_index_from_storage",
   (...)
     78     "GPTEmptyIndex",
     79 ]

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/indices/vector_store/__init__.py:4
      1 """Vector-store based data structures."""
      3 from llama_index.indices.vector_store.base import GPTVectorStoreIndex, VectorStoreIndex
----> 4 from llama_index.indices.vector_store.retrievers import (
      5     VectorIndexAutoRetriever,
      6     VectorIndexRetriever,
      7 )
      9 __all__ = [
     10     "VectorStoreIndex",
     11     "VectorIndexRetriever",
   (...)
     14     "GPTVectorStoreIndex",
     15 ]

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/__init__.py:4
      1 from llama_index.indices.vector_store.retrievers.retriever import (  # noqa: I001
      2     VectorIndexRetriever,
      3 )
----> 4 from llama_index.indices.vector_store.retrievers.auto_retriever import (
      5     VectorIndexAutoRetriever,
      6 )
      8 __all__ = [
      9     "VectorIndexRetriever",
     10     "VectorIndexAutoRetriever",
     11 ]

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/auto_retriever/__init__.py:1
----> 1 from llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever import (
      2     VectorIndexAutoRetriever,
      3 )
      5 __all__ = [
      6     "VectorIndexAutoRetriever",
      7 ]

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/auto_retriever/auto_retriever.py:11
      9 from llama_index.indices.vector_store.base import VectorStoreIndex
     10 from llama_index.indices.vector_store.retrievers import VectorIndexRetriever
---> 11 from llama_index.indices.vector_store.retrievers.auto_retriever.output_parser import (
     12     VectorStoreQueryOutputParser,
     13 )
     14 from llama_index.indices.vector_store.retrievers.auto_retriever.prompts import (
     15     DEFAULT_VECTOR_STORE_QUERY_PROMPT_TMPL,
     16 )
     17 from llama_index.output_parsers.base import OutputParserException, StructuredOutput

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/auto_retriever/output_parser.py:3
      1 from typing import Any
----> 3 from llama_index.output_parsers.base import StructuredOutput
      4 from llama_index.output_parsers.utils import parse_json_markdown
      5 from llama_index.types import BaseOutputParser

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/output_parsers/__init__.py:3
      1 """Output parsers."""
----> 3 from llama_index.output_parsers.guardrails import GuardrailsOutputParser
      4 from llama_index.output_parsers.langchain import LangchainOutputParser
      5 from llama_index.output_parsers.pydantic import PydanticOutputParser

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/llama_index/output_parsers/guardrails.py:9
      6 from deprecated import deprecated
      8 try:
----> 9     from guardrails import Guard
     10 except ImportError:
     11     Guard = None

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/guardrails/__init__.py:3
      1 # Set up __init__.py so that users can do from guardrails import Response, Schema, etc.
----> 3 from guardrails.guard import Guard
      4 from guardrails.llm_providers import PromptCallable
      5 from guardrails.prompt import Instructions, Prompt

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/guardrails/guard.py:7
      3 from typing import Callable, Dict, Optional, Tuple, Union
      5 from eliot import start_action, to_file
----> 7 from guardrails.llm_providers import PromptCallable, get_llm_ask
      8 from guardrails.prompt import Instructions, Prompt
      9 from guardrails.rail import Rail

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/guardrails/llm_providers.py:16
     12 except ImportError:
     13     MANIFEST = False
     15 OPENAI_RETRYABLE_ERRORS = [
---> 16     openai.error.APIConnectionError,
     17     openai.error.APIError,
     18     openai.error.TryAgain,
     19     openai.error.Timeout,
     20     openai.error.RateLimitError,
     21     openai.error.ServiceUnavailableError,
     22 ]
     23 RETRYABLE_ERRORS = OPENAI_RETRYABLE_ERRORS
     26 class PromptCallableException(Exception):

AttributeError: module 'openai' has no attribute 'error'

I also tried with the version on PyPI and had the same error.


OS: MacOS 14.1.1
Python: 3.10.9
Environment: VirtualEnv
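For context, the openai.error module was removed in openai 1.0; the transitive guardrails dependency here still imports it, so the import chain blows up before RAGatouille even loads. Assuming an openai 1.x install, pinning it back (or upgrading guardrails/llama_index) should sidestep this:

!pip install "openai<1.0"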

add_to_index() one/few documents throws error

Hello!

Great work on the tool so far! Really loving it! I have a question; I apologize if it has already been answered. I am trying to add a single document to an index using the add_to_index() function. However, I get the following error:

RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t*, const faiss::Index*, faiss::Index&, const float*) at /project/faiss/faiss/Clustering.cpp:275: Error: 'nx >= k' failed: Number of training points (11) should be at least as large as number of clusters (32)

I looked more into the source code and found that:
(colbert.py: add_to_index())

current_len = len(searcher.collection)
new_doc_len = len(new_documents)
new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]}
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in self.pid_docid_map
]

if new_docid_metadata_map is not None:
    self.docid_metadata_map = self.docid_metadata_map or {}
    self.docid_metadata_map.update(new_docid_metadata_map)

if current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05:
    self.index(
        [doc["content"] for doc in new_documents_with_ids],
        {
            pid: doc["document_id"]
            for pid, doc in enumerate(new_documents_with_ids)
        },
        docid_metadata_map=self.docid_metadata_map,
        index_name=self.index_name,
        max_document_length=self.config.doc_maxlen,
        overwrite="force_silent_overwrite",
    )

If current_len + new_doc_len is under 5,000 (or new_doc_len is more than 5% of current_len), it re-indexes rather than using IndexUpdater, which might be more efficient. But in that re-indexing step, I suspect that only the new documents are being indexed (in this case just one), and that this is why it throws the error above. I could be wrong; can you confirm this?

Thank you!

Documentation for max supported document length

Testing with larger documents gives this error:

RuntimeError: The expanded size of the tensor (812) must match the existing size (512) at non-singleton dimension 1.  Target sizes: [64, 812].  Tensor sizes: [1, 512]

What is the largest document length supported? It looks like splitting with chunk_size=800 works but chunk_size=1000 fails.
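For context, the underlying BERT encoder is capped at 512 position embeddings, which is presumably the 512 in the error above: chunks that tokenize past that limit can't be encoded. A sketch that stays within bounds, following the indexing API used elsewhere in these issues (the index name and document content are placeholders):

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(
    collection=["a very long document ..."],  # placeholder content
    index_name="length_test",                 # hypothetical index name
    max_document_length=256,  # must stay well under the encoder's 512-token cap
    split_documents=True,     # let RAGatouille chunk oversized documents itself
)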

Init with existing index in non-default location not working

Hello,

I am getting an error trying to init RAGatouille from an existing index at /mnt/index (within a Docker container).

The relevant part of my code is

        RAG = RAGPretrainedModel.from_index(f"/mnt/index/.ragatouille/colbert/indexes/{INDEX_NAME}/")
        retriever = RAG.as_langchain_retriever(index_name=INDEX_NAME)

This runs OK; the error below occurs when my LangChain chain is invoked.

Here is the log:

Loading searcher for index grt_ragatouille_colbertv20 for the first time... This may take a few seconds
2024-01-13 22:21:19 - [Errno 2] No such file or directory: '.ragatouille/colbert/indexes/grt_ragatouille_colbertv20/plan.json'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/colbert/infra/config/base_config.py", line 94, in load_from_index
    loaded_config, _ = cls.from_path(metadata_path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/colbert/infra/config/base_config.py", line 44, in from_path
    with open(name) as f:
         ^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.ragatouille/colbert/indexes/grt_ragatouille_colbertv20/metadata.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/chainlit/utils.py", line 39, in wrapper
    return await user_function(**params_values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/app.py", line 143, in main
    res = await chain.acall(message.content, callbacks=[cb])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 413, in acall
    return await self.ainvoke(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 209, in ainvoke
    raise e
  File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 203, in ainvoke
    await self._acall(inputs, run_manager=run_manager)
  File "/usr/local/lib/python3.11/site-packages/langchain/chains/conversational_retrieval/base.py", line 207, in _acall
    docs = await self._aget_docs(new_question, inputs, run_manager=_run_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain/chains/conversational_retrieval/base.py", line 330, in _aget_docs
    docs = await self.retriever.aget_relevant_documents(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_core/retrievers.py", line 281, in aget_relevant_documents
    raise e
  File "/usr/local/lib/python3.11/site-packages/langchain_core/retrievers.py", line 274, in aget_relevant_documents
    result = await self._aget_relevant_documents(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_core/retrievers.py", line 166, in _aget_relevant_documents
    return await run_in_executor(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain_core/runnables/config.py", line 490, in run_in_executor
    return await asyncio.get_running_loop().run_in_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/futures.py", line 287, in __await__
    yield self  # This tells Task to wait for completion.
    ^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 339, in __wakeup
    future.result()
  File "/usr/local/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 206, in _get_relevant_documents
    docs = self.model.search(query, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 184, in search
    return self.model.search(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 240, in search
    self._load_searcher(index_name=index_name, force_fast=force_fast)
  File "/usr/local/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 204, in _load_searcher
    self.searcher = Searcher(
                    ^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/colbert/searcher.py", line 33, in __init__
    self.index_config = ColBERTConfig.load_from_index(self.index)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/colbert/infra/config/base_config.py", line 97, in load_from_index
    loaded_config, _ = cls.from_path(metadata_path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/colbert/infra/config/base_config.py", line 44, in from_path
    with open(name) as f:
         ^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.ragatouille/colbert/indexes/grt_ragatouille_colbertv20/plan.json'

Looks like `_load_searcher` in `colbert.py` passes neither an `index_root` nor a `config` when creating a `Searcher`, but this was just a very quick assessment of the code.
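One workaround worth trying, assuming the stored metadata keeps the relative root `.ragatouille/` and the searcher resolves it against the current working directory (an assumption on my part, not confirmed):

import os

from ragatouille import RAGPretrainedModel

# Hypothetical workaround: make the relative ".ragatouille/" root resolve
# to the mounted index location before the searcher is loaded.
os.chdir("/mnt/index")

RAG = RAGPretrainedModel.from_index(f"/mnt/index/.ragatouille/colbert/indexes/{INDEX_NAME}/")
retriever = RAG.as_langchain_retriever(index_name=INDEX_NAME)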

Runtime Error: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

I attempted the following code:

import requests

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

RAG.index(
    collection= ["This is a test", "Really it's just a test"],
    index_name="Test",
    split_documents=False,
)

When I run this I get the following error:

[Jan 04, 18:44:26] #> Note: Output directory .ragatouille/colbert/indexes/Test already exists

[Jan 04, 18:44:28] #> Note: Output directory .ragatouille/colbert/indexes/Test already exists

Traceback (most recent call last):
File "", line 1, in
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/jponline77/ragatouille/RAG_2.py", line 6, in
RAG.index(
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/ragatouille/RAGPretrainedModel.py", line 117, in index
return self.model.index(
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/ragatouille/models/colbert.py", line 166, in index
self.indexer.index(
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/colbert/indexer.py", line 78, in index
self.__launch(collection)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/colbert/indexer.py", line 83, in __launch
manager = mp.Manager()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/context.py", line 57, in Manager
m.start()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/managers.py", line 562, in start
self._process.start()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
File "/home/jponline77/ragatouille/RAG_2.py", line 6, in
RAG.index(
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/ragatouille/RAGPretrainedModel.py", line 117, in index
return self.model.index(
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/ragatouille/models/colbert.py", line 166, in index
self.indexer.index(
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/colbert/indexer.py", line 78, in index
self.__launch(collection)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages/colbert/indexer.py", line 83, in __launch
manager = mp.Manager()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/context.py", line 57, in Manager
m.start()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/managers.py", line 566, in start
self._address = reader.recv()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/jponline77/miniconda3/envs/rag/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

The following is my torch version:

Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /home/jponline77/miniconda3/envs/rag/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: RAGatouille, sentence-transformers, torchaudio, torchvision

The following is my cuda version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
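The traceback itself spells out the fix: with the spawn start method, the indexing call must live under a main guard so that child processes can import the module without re-running it. A minimal rework of the snippet above (same API, just guarded):

from ragatouille import RAGPretrainedModel

if __name__ == "__main__":
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    RAG.index(
        collection=["This is a test", "Really it's just a test"],
        index_name="Test",
        split_documents=False,
    )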

Improve Testing

Testing is currently very sparse. It essentially just ensures model loading works properly (not tested in all cases yet) and reproduces the notebooks as end-to-end tests to make sure a new version doesn't break indexing/searching or alter results.

This is an ongoing issue, with some objectives being:

  1. Improve unit test coverage
  2. Test every component of the data processing pipeline to ensure the training pipeline can be grown without breaking anything
  3. Test model loading in a variety of circumstances
  4. Just about anything else you can think of: it should be tested

Any contributions of even a single test would be very welcome!

UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value

When training, I sometimes get the error in the title. Here is the full error:

#> Starting...
nranks = 1       num_gpus = 1    device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "nbits": 2,
    "kmeans_niters": 4,
    "resume": false,
    "similarity": "cosine",
    "bsize": 32,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 500000,
    "save_every": 0,
    "warmup": 0,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": true,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "HBOColbert",
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 256,
    "mask_punctuation": true,
    "checkpoint": "bert-base-german-cased",
    "triples": "german\/train_data_0\/triples.train.colbert.jsonl",
    "collection": "german\/train_data_0\/corpus.train.colbert.tsv",
    "queries": "german\/train_data_0\/queries.train.colbert.tsv",
    "index_name": null,
    "overwrite": false,
    "root": ".ragatouille\/",
    "experiment": "colbert",
    "index_root": null,
    "name": "2024-01\/07\/23.03.43",
    "rank": 0,
    "nranks": 1,
    "amp": true,
    "gpus": 1
}
Using config.bsize = 32 (per process) and config.accumsteps = 1
[Jan 07, 23:04:30] #> Loading the queries from german/train_data_0/queries.train.colbert.tsv ...
[Jan 07, 23:04:30] #> Got 80 queries. All QIDs are unique.

[Jan 07, 23:04:30] #> Loading collection...
0M 
Some weights of HF_ColBERT were not initialized from the model checkpoint at bert-base-german-cased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
#> LR will use 0 warmup steps and linear decay over 500000 steps.
[Jan 07, 23:04:32] #> Done with all triples!
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/training.py", line 146, in train
    ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)
                                                               ^^^^^^^^^
UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value

This happens sometimes when I am using the trainer.train function:

trainer.train(batch_size=32,
            nbits=2, # How many bits will the trained model use when compressing indexes
            maxsteps=500000, # Maximum steps hard stop
            use_ib_negatives=True, # Use in-batch negative to calculate loss
            dim=128, # How many dimensions per embedding. 128 is the default and works well.
            learning_rate=5e-6, # Learning rate, small values ([3e-6,3e-5] work best if the base model is BERT-like, 5e-6 is often the sweet spot)
            doc_maxlen=256, # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
            use_relu=False, # Disable ReLU -- doesn't improve performance
            warmup_steps="auto", # Defaults to 10%
        )

This is part of the function where the issue lies (/colbert/training/training.py):

for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
        if (warmup_bert is not None) and warmup_bert <= batch_idx:
            set_bert_grad(colbert, True)
            warmup_bert = None

        this_batch_loss = 0.0

        for batch in BatchSteps:
            with amp.context():
                try:
                    queries, passages, target_scores = batch
                    encoding = [queries, passages]
                except:
                    encoding, target_scores = batch
                    encoding = [encoding.to(DEVICE)]

                scores = colbert(*encoding)

                if config.use_ib_negatives:
                    scores, ib_loss = scores

                scores = scores.view(-1, config.nway)

                if len(target_scores) and not config.ignore_scores:
                    target_scores = torch.tensor(target_scores).view(-1, config.nway).to(DEVICE)
                    target_scores = target_scores * config.distillation_alpha
                    target_scores = torch.nn.functional.log_softmax(target_scores, dim=-1)

                    log_scores = torch.nn.functional.log_softmax(scores, dim=-1)
                    loss = torch.nn.KLDivLoss(reduction='batchmean', log_target=True)(log_scores, target_scores)
                else:
                    loss = nn.CrossEntropyLoss()(scores, labels[:scores.size(0)])

                if config.use_ib_negatives:
                    if config.rank < 1:
                        print('\t\t\t\t', loss.item(), ib_loss.item())

                    loss += ib_loss

                loss = loss / config.accumsteps

            if config.rank < 1:
                print_progress(scores)

            amp.backward(loss)

            this_batch_loss += loss.item()

        train_loss = this_batch_loss if train_loss is None else train_loss
        train_loss = train_loss_mu * train_loss + (1 - train_loss_mu) * this_batch_loss

        amp.step(colbert, optimizer, scheduler)

        if config.rank < 1:
            print_message(batch_idx, train_loss)
            manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None)

    if config.rank < 1:
        print_message("#> Done with all triples!")
        ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)

        return ckpt_path  # TODO: This should validate and return the best checkpoint, not just the last one.
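Looking at the snippet, my guess is that when the reader yields no batches at all (the log above jumps straight to "Done with all triples!", which suggests there were fewer triples than one batch), the loop body never runs and batch_idx is never bound before the final manage_checkpoints call. A defensive sketch, purely illustrative and not the upstream fix:

# illustration only: bind batch_idx before the loop so the post-loop
# checkpoint call cannot hit an unbound local when the reader is empty
batch_idx = start_batch_idx - 1
for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
    ...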

Thanks in advance for any ideas on how to fix this issue :)

Indexing cannot be completed (on Windows)

I'm testing
01-basic_indexing_and_search.ipynb
on a Windows 10 PC, in Cursor IDE, using Python 3.11.6

Cell:
RAG.index(collection=[full_document], index_name="Miyazaki", max_document_length=180, split_documents=True)
cannot be completed even after almost an hour!

[Jan 05, 10:46:21] #> Creating directory .ragatouille/colbert\indexes/Miyazaki 
#> Starting...

is shown; I restarted the kernel after an hour.

The previous cell, which prints the length of full_document, worked properly.

Error when running the indexing example

I am trying to run the Miyazaki sample code, and it fails in index_path = RAG.index(index_name="my_index", collection=my_documents) with this error message:

Traceback (most recent call last):
File "", line 1, in
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
prepare(preparation_data)
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 291, in run_path
File "", line 98, in _run_module_code
File "", line 88, in _run_code
File "/mnt/c/Ubuntu/RAGatouille/RAGatouille/RAGatouille.py", line 13, in
index_path = RAG.index(index_name="my_index", collection=my_documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 125, in index
return self.model.index(
^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 204, in index
self.indexer.index(
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/colbert/indexer.py", line 78, in index
self.__launch(collection)
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/colbert/indexer.py", line 83, in __launch
manager = mp.Manager()
^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/context.py", line 57, in Manager
m.start()
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/managers.py", line 563, in start
self._process.start()
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
_check_not_importing_main()
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

Traceback (most recent call last):
File "/mnt/c/Ubuntu/RAGatouille/RAGatouille/RAGatouille.py", line 13, in
index_path = RAG.index(index_name="my_index", collection=my_documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 125, in index
return self.model.index(
^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 204, in index
self.indexer.index(
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/colbert/indexer.py", line 78, in index
self.__launch(collection)
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/site-packages/colbert/indexer.py", line 83, in __launch
manager = mp.Manager()
^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/context.py", line 57, in Manager
m.start()
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/managers.py", line 567, in start
self._address = reader.recv()
^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/home/wsl/anaconda3/envs/RAGatouille/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
raise EOFError
EOFError

trainer.prepare_training_data() does not turn 3rd column of triplets into indices

I am currently trying to train a model on my own knowledge corpus. However, instead of the triplets (from the triples.train.colbert.jsonl that gets created when running the function) having three numbers, the third column is plain text in my case, e.g.:

[61,434,"text"]

I am not sure what I did wrong; I cross-checked the different data types and formats against the 2nd and 3rd example notebooks.

trainer.prepare_training_data(
        raw_data = pairs,
        data_out_path="./data/",
        all_documents = chunked_documents,
        num_new_negatives = 32,
        mine_hard_negatives= True,
        )

pairs[0] in my use case would be

('Welche Bauarten und Bauprodukte sind von der Regelung betroffen?', 'die sich für einen Verwendungszweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirken, c) Verfahren für die Feststellung der Leistung eines Bauproduktes im Hinblick auf Merk- male, die sich für einen Verwendungszweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirken, d) zulässige oder unzulässige besondere Verwendungszwecke, e) die Festlegung von Klassen und Stufen in Bezug auf bestimmte Verwendungszwecke, f) die für einen bestimmten Verwendungszweck anzugebende oder erforderliche und anzugebende Leistung in Bezug auf ein Merkmal, das sich für einen Verwendungs- zweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirkt, soweit vorgesehen in Klassen und Stufen, 4. die Bauarten und die Bauprodukte,')

and pairs has a length of 64, while chunked_documents comes from this call at the beginning:

corpus_processor.process_corpus(full_documents, chunk_size=256)

Hope someone has a clue on why the last column of the triplet did not turn into numbers when executing the prepare_training_data function. Thanks in advance.

Update 1:

Debugged a bit and came to this function in training_data_processor.py. Apparently, when you have only one positive passage, it puts the negative passage's text in the third column of the triplet instead of a number. I would like to know why.

def _make_individual_triplets(self, query, positives, negatives):
        """Create the training data in ColBERT(v1) format from raw lists of triplets"""
        triplets = []
        q = self.query_map[query]
        print("q")
        print(q)

        random.seed(42)
        if len(positives) > 1:
            all_pos_texts = [p for p in positives]
            max_triplets_per_query = 20
            negs_per_positive = max(1, max_triplets_per_query // len(all_pos_texts))
            initial_triplets_count = 0
            
            for pos in all_pos_texts:
                p = self.passage_map[pos]
                chosen_negs = random.sample(
                    negatives, min(len(negatives), negs_per_positive)
                )
                for neg in chosen_negs:
                    print("neg")
                    print(neg)
                    n = self.passage_map[neg]
                    print("n")
                    print(n)
                    initial_triplets_count += 1
                    triplets.append([q, p, n])

            extra_triplets_needed = max_triplets_per_query - initial_triplets_count
            while extra_triplets_needed > 0:
                p = self.passage_map[random.choice(all_pos_texts)]
                n = self.passage_map[random.choice(negatives)]
                triplets.append([q, p, n])
                extra_triplets_needed -= 1
        else:
            p = self.passage_map[positives[0]]
            for n in negatives:
                triplets.append([q, p, n])

        return triplets
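If I read this right, the bug is in the else branch: each raw negative string n is appended directly instead of being mapped through self.passage_map first. The fix I would expect (an assumption, not a confirmed patch) would be:

else:
    p = self.passage_map[positives[0]]
    for neg in negatives:
        n = self.passage_map[neg]  # map the negative's text to its passage id
        triplets.append([q, p, n])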

Update 2:

Actually, the same problem is present in the 3rd example notebook, if I understand it right. If you were to run the following after the last cell:

trainer.prepare_training_data(
        raw_data = pairs,
        all_documents = documents,
        num_new_negatives = 10,
        mine_hard_negatives= True,
        )

and then train it with:

from pathlib import Path
trainer.data_dir=Path("./data/")
trainer.train(batch_size=8,
              nbits=4, # How many bits will the trained model use when compressing indexes
              maxsteps=500000, # Maximum steps hard stop
              use_ib_negatives=True, # Use in-batch negative to calculate loss
              dim=128, # How many dimensions per embedding. 128 is the default and works well.
              learning_rate=5e-6, # Learning rate, small values ([3e-6,3e-5] work best if the base model is BERT-like, 5e-6 is often the sweet spot)
              doc_maxlen=256, # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
              use_relu=False, # Disable ReLU -- doesn't improve performance
              warmup_steps="auto", # Defaults to 10%
             )

you would get the same error, namely that the third column of the triplets is a str and not an index:

#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "nbits": 4,
    "kmeans_niters": 20,
    "resume": false,
    "similarity": "cosine",
    "bsize": 8,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 500000,
    "save_every": 8,
    "warmup": 8,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
...
[Jan 07, 01:09:02] #> Got 64 queries. All QIDs are unique.

[Jan 07, 01:09:02] #> Loading collection...
0M 
Output is truncated.
/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
#> LR will use 8 warmup steps and linear decay over 500000 steps.
[89, 'Artists from Pixar and Aardman Studios signed a tribute stating, "You\'re our inspiration, Miyazaki-san!" He has also been cited as inspiration for video game designers including Shigeru Miyamoto on The Legend of Zelda and Hironobu Sakaguchi on Final Fantasy, as well as the television series Avatar: The Last Airbender, and the video game Ori and the Blind Forest (2015).Studio Ghibli has searched for some time for Miyazaki and Suzuki\'s successor to lead the studio; Kondō, the director of Whisper of the Heart, was initially considered, but died from a sudden heart attack in 1998. Some candidates were considered by 2023—including Miyazaki\'s son Goro, who declined—but the studio was not able to find a successor.']
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/training.py", line 87, in train
    for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 57, in __next__
    passages = [self.collection[pid] for pid in pids]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 57, in <listcomp>
    passages = [self.collection[pid] for pid in pids]
                ~~~~~~~~~~~~~~~^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/data/collection.py", line 25, in __getitem__
    return self.data[item]
           ~~~~~~~~~^^^^^^
TypeError: list indices must be integers or slices, not str

Tomorrow I will look at example notebook 2 again as a guide, as training worked there (the third triplet column contains indices).

IndexError: list index out of range after running a search on an already-created index.

I ran through the first example notebook:
https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb

After the last indexing run, I reset the kernel and the notebook. Then I tried the following lines:

from ragatouille import RAGPretrainedModel

path_to_index = ".ragatouille/colbert/indexes/Miyazaki/"
RAG = RAGPretrainedModel.from_index(path_to_index)
k=3
all_results = RAG.search(query=["What is the highest grossing movie from Studio Ghibli?", "What is the name of the most recent Studio Ghibli movie?"], k=k)
all_results

And it gave me the following:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[3], line 6
      4 RAG = RAGPretrainedModel.from_index(path_to_index)
      5 k=3
----> 6 all_results = RAG.search(query=["What is the highest grossing movie from Studio Ghibli?", "What is the name of the most recent Studio Ghibli movie?"], k=k)
      7 all_results

File ~/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py:179, in RAGPretrainedModel.search(self, query, index_name, k, force_fast, zero_index_ranks, **kwargs)
    153 def search(
    154     self,
    155     query: Union[str, list[str]],
   (...)
    160     **kwargs,
    161 ):
    162     """Query an index.
    163
    164     Parameters:
   (...)
    177     ```
    178     """
--> 179     return self.model.search(
    180         query=query,
    181         index_name=index_name,
    182         k=k,
...
     23 def __getitem__(self, item):
     24     # TODO: Load from disk the first time this is called. Unless self.data is already not None.
---> 25     return self.data[item]

IndexError: list index out of range

Thanks in advance :)

EDIT: If this is still a known TODO, then never mind.

Failure in faiss for short document?

if __name__ == "__main__":
    from ragatouille import RAGPretrainedModel
    from time import time

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    RAG.index(
        collection=["This is a test."],
        document_ids=["test_document"],
        index_name=f"test_index_{time()}",
        split_documents=False,
    )

    results = RAG.search(query="What animation studio did Miyazaki found?", k=10)
    print(results)
[Jan 28, 19:18:40] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
.pyvenv/ragatouille/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(


[Jan 28, 19:18:41] #> Creating directory .ragatouille/colbert/indexes/test_index_1706465921.111713 


[Jan 28, 19:18:43] [0]           #> Encoding 1 passages..
  0%|                                                                                        | 0/1 [00:00<?, ?it/s].pyvenv/ragatouille/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.16s/it]
[Jan 28, 19:18:44] [0]           avg_doclen_est = 7.0    len(local_sample) = 1
[Jan 28, 19:18:44] [0]           Creating 32 partitions.
[Jan 28, 19:18:44] [0]           *Estimated* 7 embeddings.
[Jan 28, 19:18:44] [0]           #> Saving the indexing plan to .ragatouille/colbert/indexes/test_index_1706465921.111713/plan.json ..
Traceback (most recent call last):
  File "Documents/Exploring/Playgrounds/explore-colbert-ragatouille/main.py", line 51, in <module>
    RAG.index(
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 183, in index
    return self.model.index(
           ^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 349, in index
    self.indexer.index(
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexer.py", line 78, in index
    self.__launch(collection)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexer.py", line 89, in __launch
    launcher.launch(self.config, collection, shared_lists, shared_queues, self.verbose)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/infra/launcher.py", line 34, in launch
    return_val = run_process_without_mp(self.callee, new_config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/infra/launcher.py", line 103, in run_process_without_mp
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 232, in train
    centroids = self._train_kmeans(sample, shared_lists)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 304, in _train_kmeans
    centroids = compute_faiss_kmeans(*args_)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 507, in compute_faiss_kmeans
    kmeans.train(sample)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/extra_wrappers.py", line 457, in train
    clus.train(x, self.index, weights)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/class_wrappers.py", line 85, in replacement_train
    self.train_c(n, swig_ptr(x), index)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/swigfaiss.py", line 2165, in train
    return _swigfaiss.Clustering_train(self, n, x, index, x_weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t *, const faiss::Index *, faiss::Index &, const float *)
at /Users/runner/work/faiss-wheels/faiss-wheels/faiss/faiss/Clustering.cpp:281:
Error: 'nx >= k' failed: Number of training points (7) should be at least as large as number of clusters (32)

This is on macOS 14.3 (M1) with Python 3.11, latest ragatouille.

The Miyazaki example works.

Batch search in-memory without index results in TypeError if you pass document_metadatas

My code:

def rag(args: SearchArgs) -> List[List[SearchResult]]:
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0", verbose=0)
    metadata = [doc["metadata"] for doc in args["docs"]]
    RAG.encode(
        [doc["content"] for doc in args["docs"]],
        document_metadatas=metadata,
    )
    k = args.get("k", 5)
    search_results = RAG.search_encoded_docs(query=args["query"], k=k)
    # ....
    return search_results
    
res = rag(
    {
        "query": ["Hello World", "This is a test"],
        "docs": [
            {"content": "doc1", "metadata": {"id": 0}},
            {"content": "doc2", "metadata": {"id": 1}},
        ],
    }
)

Error:

Documents encoded!
Traceback (most recent call last):
  File "/Users/james/Projects/TS/open-recommender/packages/cli/src/rag/rag.py", line 996, in <module>
    res = rag(
          ^^^^
  File "/Users/james/Projects/TS/open-recommender/packages/cli/src/rag/rag.py", line 37, in rag
    search_results = RAG.search_encoded_docs(query=args["query"], k=k)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/Projects/TS/open-recommender/env/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 377, in search_encoded_docs
    return self.model.search_encoded_docs(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/Projects/TS/open-recommender/env/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 722, in search_encoded_docs
    result["result_index"]
    ~~~~~~^^^^^^^^^^^^^^^^
TypeError: list indices must be integers or slices, not str

One thing I found confusing, and which seems to be causing this error too, is that search_encoded_docs returns either List[SearchResult] or a nested List[List[SearchResult]] depending on the number of queries. Maybe it would be better to always return List[List[SearchResult]]; otherwise we have to implement conditional logic based on the number of queries we pass, which we might not know ahead of time.

The error does not happen if I only pass one query, or if I don't pass document_metadatas.

Workaround:

def rag(args: SearchArgs) -> List[List[SearchResult]]:
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0", verbose=0)
    metadatas = [doc["metadata"] for doc in args["docs"]]
    RAG.encode(
        [doc["content"] for doc in args["docs"]],
        # document_metadatas=metadatas,  # omitted to avoid the TypeError above
    )
    k = args.get("k", 5)
    search_results = RAG.search_encoded_docs(query=args["query"], k=k)
    if not isinstance(search_results[0], list):
        # Single query: flat List[SearchResult]. Attach metadata manually and
        # wrap in a list so the return type is always List[List[SearchResult]].
        for result in search_results:
            result["metadata"] = metadatas[result["result_index"]]
        return [search_results]
    else:
        # Multiple queries: nested List[List[SearchResult]].
        for batch in search_results:
            for result in batch:
                result["metadata"] = metadatas[result["result_index"]]
        return search_results

Removal of 20 second delay in overwriting index

Hello,

When I run

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
processor = CorpusProcessor()
my_documents = processor.process_corpus(segments)
index_path = RAG.index(index_name="my_index", collection=my_documents, split_documents=True, max_document_length = 256*4)

where segments is a list of strings. If my_index already exists, I get the message:

Jan 18, 17:36:43] #> Note: Output directory .ragatouille/colbert/indexes/my_index already exists


[Jan 18, 17:36:43] #> Will delete 10 files already at .ragatouille/colbert/indexes/my_index in 20 seconds...

I was wondering if I could eliminate this 20-second delay?
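
For reference, a possible workaround (a sketch, not a confirmed fix): colbert's Indexer supports several overwrite modes, and RAGatouille's index() takes an overwrite_index argument (visible elsewhere in this tracker). Assuming overwrite_index is forwarded to the Indexer, and assuming your installed colbert version supports the "force_silent_overwrite" mode that deletes the old index without the safety wait:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Assumption: overwrite_index is passed through to colbert's Indexer, and
# "force_silent_overwrite" skips the 20-second countdown before deletion.
index_path = RAG.index(
    index_name="my_index",
    collection=my_documents,  # the processed segments from the snippet above
    split_documents=True,
    max_document_length=256 * 4,
    overwrite_index="force_silent_overwrite",
)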

Use metadata filter in search?

Does the ColBERT implementation here (or elsewhere) allow for indexing and filtering documents by metadata? For example, something like,

results = RAG.search(query, filter={"doc_type": "doc_type_im_interested_in"})
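
There doesn't appear to be a built-in filter argument. A common stopgap is to over-fetch and filter client-side on the document_metadata field that search() returns when the index was built with document_metadatas (a sketch under that assumption):

results = RAG.search(query, k=50)  # over-fetch well past what you actually need
filtered = [
    r for r in results
    if r.get("document_metadata", {}).get("doc_type") == "doc_type_im_interested_in"
][:5]  # then keep the top few matches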

[MacOS/w11] segmentation fault; testing "Miyazaki" example locally

Testing the Miyazaki example locally:

# RAGtest.py

from ragatouille import RAGPretrainedModel
from ragatouille.utils import get_wikipedia_page


def run():
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    documents = [get_wikipedia_page("Hayao_Miyazaki")]
    document_ids = ["miyazaki"]
    document_metadatas = [{"entity": "person", "source": "wikipedia"}]

    RAG.index(
        index_name="miyazaki",
        collection=documents,
        document_ids=document_ids,
        document_metadatas=document_metadatas,
        max_document_length=180, 
        split_documents=True
    )

    results = RAG.search(query="What is Miyazaki's first work?", k=3)
    print(results)


if __name__ == '__main__':
    run()

output

[Jan 29, 17:15:38] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/username/anaconda3/envs/alts/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(


[Jan 29, 17:15:40] #> Creating directory .ragatouille/colbert/indexes/miyazaki 


[Jan 29, 17:15:43] [0]           #> Encoding 81 passages..
  0%|                                                     | 0/2 [00:00<?, ?it/s]/Users/username/anaconda3/envs/alts/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
 50%|██████████████████████▌                      | 1/2 [00:21<00:21, 21.27s/it]/Users/username/anaconda3/envs/alts/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
100%|█████████████████████████████████████████████| 2/2 [00:26<00:00, 13.38s/it]
[Jan 29, 17:16:10] [0]           avg_doclen_est = 129.82716369628906     len(local_sample) = 81
[Jan 29, 17:16:10] [0]           Creating 1,024 partitions.
[Jan 29, 17:16:10] [0]           *Estimated* 10,516 embeddings.
[Jan 29, 17:16:10] [0]           #> Saving the indexing plan to .ragatouille/colbert/indexes/miyazaki/plan.json ..
WARNING clustering 9991 points to 1024 centroids: please provide at least 39936 training points
Clustering 9991 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
[2]    31903 segmentation fault  python RAGtest.py
/Users/username/anaconda3/envs/alts/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Running on a MacBook Pro (Intel),
using the latest version (0.0.5a2).

Stuck at 'Starting...' on windows

Trying to run the sample code, but it gets stuck at the following line every time.

[Jan 08, 15:49:04] #> Creating directory .ragatouille/colbert\indexes/my_index1 

#> Starting...

sample code

from ragatouille import RAGPretrainedModel
from ragatouille.utils import get_wikipedia_page

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
my_documents = [get_wikipedia_page("Hayao_Miyazaki"), get_wikipedia_page("Studio_Ghibli")]
index_path = RAG.index(index_name="my_index", collection=my_documents)

I am using Python 3.11.5 with conda.
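
One thing worth trying (a guess, not a confirmed fix): ColBERT launches worker processes via multiprocessing, and Windows uses the spawn start method, which re-imports the main module in each worker. Wrapping the sample code in a main guard prevents the workers from re-running the indexing code on import:

from ragatouille import RAGPretrainedModel
from ragatouille.utils import get_wikipedia_page

def main():
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    my_documents = [get_wikipedia_page("Hayao_Miyazaki"), get_wikipedia_page("Studio_Ghibli")]
    return RAG.index(index_name="my_index", collection=my_documents)

if __name__ == "__main__":  # required for spawn-based multiprocessing on Windows
    main()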

Support CPU

Hi, I want to run RAGatouille on CPU. Is that possible? I don't see any parameter to define the device.
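
Not RAGatouille-specific, but since the device appears to be picked automatically, one generic way to force CPU with any PyTorch-based library is to hide the GPUs before torch gets imported:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs; torch then falls back to CPU

from ragatouille import RAGPretrainedModel  # import only after setting the env var
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")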

Thanks in advance

Version of python being used ?

Some dependent library is failing. Please specify the Python version you are using.

ImportError: cannot import name 'Iterator' from 'typing_extensions' (/opt/conda/lib/python3.10/site-packages/typing_extensions.py)

Indexing on Mac: using 'mps' device?

Hi,

I just installed RAGatouille on my Mac and tried to create an index. It worked.

However, output shows several warnings:

UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.

So does it use the Apple M2 chip's GPU, or only the CPU? How do I use torch.device('mps')?
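
That GradScaler warning only says CUDA is absent; it doesn't tell you anything about MPS. You can at least check whether your torch build sees the Metal backend (whether RAGatouille/ColBERT actually route tensors to it is a separate question):

import torch

print(torch.backends.mps.is_available())  # True if the Metal (MPS) device can be used
print(torch.backends.mps.is_built())      # True if this torch build includes MPS support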

TIA

colbertv1 or colbertv2

Hello,

First, thanks for your work, this is amazing! :)

I am wondering: when you train a ColBERT in another language, as you did for Japanese, is the RAGatouille training process equivalent to ColBERT v1 or v2?

RAGTrainer not working anymore with base bert models with latest version

In Version 0.0.3a1, using the following to initialize the RAGTrainer worked:

from ragatouille import RAGTrainer
trainer = RAGTrainer(model_name="HBOColbert", pretrained_model_name="deepset/gbert-large", language_code="de")

I just found out that with version 0.0.4a2 the same code no longer works and produces the output below. I also tried downloading the BERT model locally and pointing to the local folder, but I get the same kind of error:

Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    response.raise_for_status()
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/None/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/utils/hub.py", line 389, in cached_file
    resolved_file = hf_hub_download(
                    ^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1368, in hf_hub_download
    raise head_call_error
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1238, in hf_hub_download
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 409, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 323, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65a26540-4ffa46f312262c9956872e4f;02ebfce3-0189-4426-9f49-9cc393b0cfde)

Repository Not Found for url: https://huggingface.co/None/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/main.py", line 180, in <module>
    prepare_training_data_out_of_synthetic_dataset(colbert_training_query_docs_path,colbert_model_path,latest_model_name,local_pretrained_model,language,colbert_training_triplets_path)
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/synthethic_retrieval_dataset_generation.py", line 270, in prepare_training_data_out_of_synthetic_dataset
    trainer = RAGTrainer(model_name=latest_model_name, pretrained_model_name=pretrained_model_name, language_code=language_code)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/ragatouille/RAGTrainer.py", line 45, in __init__
    self.model = ColBERT(
                 ^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 62, in __init__
    self.inference_ckpt = Checkpoint(self.checkpoint, colbert_config=self.config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/colbert/modeling/checkpoint.py", line 24, in __init__
    self.query_tokenizer = QueryTokenizer(self.colbert_config, verbose=self.verbose)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/colbert/modeling/tokenization/query_tokenization.py", line 12, in __init__
    HF_ColBERT = class_factory(config.checkpoint)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/colbert/modeling/hf_colbert.py", line 59, in class_factory
    loadedConfig  = AutoConfig.from_pretrained(name_or_path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1082, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/configuration_utils.py", line 644, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/configuration_utils.py", line 699, in _get_config_dict
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/utils/hub.py", line 410, in cached_file
    raise EnvironmentError(
OSError: None is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
Exception ignored in: <function ColBERT.__del__ at 0x7f124c2a96c0>
Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 463, in __del__
AttributeError: 'ColBERT' object has no attribute 'run_context'
(venv) tm16@ThewindsHPZBookFury:~/Work/00_RandomCoding/e2e_retrieval_pipeline$ /home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/bin/python /home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/main.py
/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/langchain/llms/__init__.py:548: LangChainDeprecationWarning: Importing LLMs from langchain is deprecated. Importing from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead:

`from langchain_community.llms import OpenAI`.

To install langchain-community run `pip install -U langchain-community`.
  warnings.warn(
/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:115: LangChainDeprecationWarning: The class `OpenAI` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use langchain_openai.OpenAI instead.
  warn_deprecated(
bert base
bert base
Some weights of HF_ColBERT were not initialized from the model checkpoint at deepset/gbert-large and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    response.raise_for_status()
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/None/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/utils/hub.py", line 389, in cached_file
    resolved_file = hf_hub_download(
                    ^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1368, in hf_hub_download
    raise head_call_error
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1238, in hf_hub_download
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 409, in _request_wrapper
    hf_raise_for_status(response)
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 323, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65a26710-578e59d841d16b9b394a8895;a8aa5928-f89b-4ff8-9528-8c535da29b00)

Repository Not Found for url: https://huggingface.co/None/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/main.py", line 178, in <module>
    trainer = RAGTrainer(model_name="HBOColbert", pretrained_model_name="deepset/gbert-large", language_code="de")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/ragatouille/RAGTrainer.py", line 45, in __init__
    self.model = ColBERT(
                 ^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 62, in __init__
    self.inference_ckpt = Checkpoint(self.checkpoint, colbert_config=self.config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/colbert/modeling/checkpoint.py", line 24, in __init__
    self.query_tokenizer = QueryTokenizer(self.colbert_config, verbose=self.verbose)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/colbert/modeling/tokenization/query_tokenization.py", line 12, in __init__
    HF_ColBERT = class_factory(config.checkpoint)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/colbert/modeling/hf_colbert.py", line 59, in class_factory
    loadedConfig  = AutoConfig.from_pretrained(name_or_path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1082, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/configuration_utils.py", line 644, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/configuration_utils.py", line 699, in _get_config_dict
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/transformers/utils/hub.py", line 410, in cached_file
    raise EnvironmentError(
OSError: None is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
Exception ignored in: <function ColBERT.__del__ at 0x7f523c515800>
Traceback (most recent call last):
  File "/home/tm16/Work/00_RandomCoding/e2e_retrieval_pipeline/venv/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 463, in __del__
AttributeError: 'ColBERT' object has no attribute 'run_context'

Problem with path in add_to_index

Using RAGatouille==0.0.6a2 in Colab, trying .add_to_index:

for i, batch in enumerate(batches, start=0):
    RAG.add_to_index(
        new_collection=batch,
        index_name="dharma_colb",
        split_documents=True,
    )

error message

WARNING: add_to_index support is currently experimental! add_to_index support will be more thorough in future versions

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/colbert/infra/config/base_config.py in load_from_index(cls, index_path)
     93             metadata_path = os.path.join(index_path, "metadata.json")
---> 94             loaded_config, _ = cls.from_path(metadata_path)
     95         except:

6 frames

FileNotFoundError: [Errno 2] No such file or directory: '.ragatouille/colbert/indexes/colbert/indexes/dharma_colb/metadata.json'


During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/colbert/infra/config/base_config.py in from_path(cls, name)
     42     @classmethod
     43     def from_path(cls, name):
---> 44         with open(name) as f:
     45             args = ujson.load(f)
     46 

FileNotFoundError: [Errno 2] No such file or directory: '.ragatouille/colbert/indexes/colbert/indexes/dharma_colb/plan.json'

Note that colbert/indexes is doubled in the path.

Number of Triplets

Hello Benjamin,

Again thank you for this amazing work! :)
There is something I do not understand: I have 400K (query, positive) pairs from MS MARCO, but when I create the training dataset with hard negative mining set to 10, I get 40M triplets. I do not understand how; do you have an explanation?

Thank you!

Import failing on ubuntu

Hi team! Very excited to try this out. I am getting a bit of a bizarre issue; I'd appreciate any insights into the below:

I am simply trying to run the examples/01-basic_indexing_and_search.ipynb notebook. This works perfectly fine on my M1 MacBook. However, when I try to run it on a remote Ubuntu machine with an NVIDIA GPU installed, it hangs on the first cell.

I initially thought perhaps it's a networking issue and loading the pre-trained model is simply taking a long time. However, the cell still hangs on just the import line. (I have left it running and after an hour it is still stuck.)

  • The notebook environment works fine for other imported packages, and is responsive (hello world)
  • Setting logging level to DEBUG all I get are these (possibly unrelated) logs
TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Popen(['git', 'version'], cwd=<REMOVED>, stdin=None, shell=False, universal_newlines=False)
  • This issue persists after creating a fresh poetry environment with only the following installed:
python = "^3.10"
RAGatouille = "^0.0.4b2"
ipykernel = "^6.29.0"

Any help would be greatly appreciated!

RAGatouille seems to break when ingesting the colbertv2 pdf?

I used LlamaIndex to parse the ColBERTv2 arXiv PDF into 20 docs, tried putting the text into RAGatouille, and got the error below.

It seems to work on the first 5 docs and the first 10 docs, for some reason.

getting this error:

File ~/Programming/llama-hub/.venv/lib/python3.10/site-packages/ragatouille/models/colbert.py:253, in ColBERT.search(self, query, index_name, k, force_fast, zero_index_ranks)
    249     result_for_query = []
    250     for id_, rank, score in zip(*result):
    251         result_for_query.append(
    252             {
--> 253                 "content": self.searcher.collection[id_],
    254                 "score": score,
    255                 "rank": rank - 1 if zero_index_ranks else rank,
    256             }
    257         )
    258     to_return.append(result_for_query)
    260 if len(to_return) == 1:

File ~/Programming/llama-hub/.venv/lib/python3.10/site-packages/colbert/data/collection.py:25, in Collection.__getitem__(self, item)
     23 def __getitem__(self, item):
     24     # TODO: Load from disk the first time this is called. Unless self.data is already not None.
---> 25     return self.data[item]

IndexError: list index out of range

Here's the notebook https://drive.google.com/file/d/1qEpaFSe7Vjhw4hOWHYc2wg49TLo9jCPI/view?usp=sharing

Utilizing clustering

Hi, sorry if this is an ignorant question, but would it be possible to use the calculated centroids for NLP tasks such as summarization? With dense embeddings it's possible to cluster the dataset and use the documents closest to each cluster's centroid as representatives of that cluster for summarization. Would it be possible to do something similar with the centroids calculated by RAGatouille?
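
For what it's worth, ColBERT's residual codec writes its trained k-means centroids into the index directory, so they can at least be inspected directly. A minimal sketch, assuming the default RAGatouille index location and the usual centroids.pt file name:

import torch

# Assumption: the index was built at the default RAGatouille location and
# ColBERT's residual codec saved its k-means centroids as centroids.pt.
centroids = torch.load(".ragatouille/colbert/indexes/my_index/centroids.pt", map_location="cpu")
print(centroids.shape)  # roughly (num_partitions, 128) for colbertv2.0

Keep in mind these are centroids over token-level embeddings rather than document embeddings, so "documents closest to a centroid" does not carry over as directly as in the single-vector dense case.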

llama_index_sentence_splitter issues

ragatouille 0.0.4b2 , ubuntu 22.04

When I run the sample code (documents is just a list of strings), I get:

Traceback (most recent call last):
  File "/workspace/three_methods_ranking2.py", line 160, in <module>
    my_documents = processor.process_corpus(documents)
  File "/usr/local/lib/python3.10/dist-packages/ragatouille/data/corpus_processor.py", line 22, in process_corpus
    documents = self.document_splitter_fn(documents, **splitter_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ragatouille/data/preprocessors.py", line 9, in llama_index_sentence_splitter
    docs = [[Document(text=doc)] for doc in documents]
TypeError: 'NoneType' object is not iterable
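
For what it's worth, the traceback shows documents arriving as None inside the splitter (the comprehension iterates over it directly), so the failure happens before any splitting. A minimal sanity check around the call, using the same API as in the report above:

from ragatouille.data import CorpusProcessor

documents = ["first document text...", "second document text..."]
# Verify the input really is a non-empty list of strings before splitting.
assert documents is not None and all(isinstance(d, str) for d in documents)

processor = CorpusProcessor()
my_documents = processor.process_corpus(documents)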

Fully support index-free encoding and querying

Most of the necessary functions are currently present, but not fully implemented.

While it can run slower and is memory intensive, there's nothing stopping us from querying smaller collections on-device, by encoding the documents and performing the computation without building an index.

The goal here would be for RAGPretrainedModel/the ColBERT model class to support additional .index_free_encode() and .index_free_search() functions (or better naming). The former would encode docs and store their representations in memory, while the latter would query them.

Functionally this is very similar to rerank(), except that encoding and searching are performed at different stages and the encodings are stored, rather than computed on the fly as in rerank().
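
For context, a rough sketch of the intended flow, reusing the encode()/search_encoded_docs() names that already appear elsewhere in this tracker (naming and signatures subject to change):

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Encode a small collection in memory; nothing is written to disk.
RAG.encode([
    "ColBERT is a late-interaction retrieval model.",
    "Studio Ghibli is a Japanese animation studio.",
])

# Query the stored in-memory representations directly, no index required.
results = RAG.search_encoded_docs(query="What is ColBERT?", k=2)
print(results)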

Fail to create an index on Ubuntu / Linux environment.

Env: Ubuntu 22.04 (Jammy Jellyfish). Just a normal index call with a subset of documents. After this error, I can see some JSON files created.
(screenshot of the error omitted)

Python env dependencies (requirement):
https://pastebin.com/9yHL0d8b

overwrite_index = False
rag.index(
    index_name=index_id,
    max_document_length=500,
    overwrite_index=overwrite_index,
    collection=all_doc_text,
    document_ids=ids,
    document_metadatas=metadatas,
)

File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 187, in index
return self.model.index(
^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 349, in index
self.indexer.index(
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/colbert/indexer.py", line 78, in index
self.__launch(collection)
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/colbert/indexer.py", line 83, in __launch
manager = mp.Manager()
^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/context.py", line 57, in Manager
m.start()
File "/home/german/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/managers.py", line 567, in start
self._address = reader.recv()
^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/connection.py", line 249, in recv
buf = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
raise EOFError


Then, trying to execute it as a retriever:

File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/langchain_core/retrievers.py", line 281, in aget_relevant_documents
raise e
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/langchain_core/retrievers.py", line 274, in aget_relevant_documents
result = await self._aget_relevant_documents(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/langchain_core/retrievers.py", line 166, in _aget_relevant_documents
return await run_in_executor(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/langchain_core/runnables/config.py", line 490, in run_in_executor
return await asyncio.get_running_loop().run_in_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/ragatouille/integrations/_langchain.py", line 20, in _get_relevant_documents
docs = self.model.search(query, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 296, in search
return self.model.search(
^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 446, in search
self._load_searcher(index_name=index_name, force_fast=force_fast)
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 409, in _load_searcher
self.searcher = Searcher(
^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/colbert/searcher.py", line 33, in init
self.index_config = ColBERTConfig.load_from_index(self.index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/colbert/infra/config/base_config.py", line 97, in load_from_index
loaded_config, _ = cls.from_path(metadata_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/german/.pyenv/versions/3.11.3/envs/agile_clean/lib/python3.11/site-packages/colbert/infra/config/base_config.py", line 44, in from_path
with open(name) as f:
^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.ragatouille/colbert/indexes/test_index_id/plan.json'

Ability to store metadata with the content

First of all, thanks a bunch for RAGatouille -- it's been a very nice library to make use of in some of the experiments I've been doing lately!

One feature I was missing was the ability to store some metadata (e.g. the source of the content or some other attribute) in the index -- similarly to how one would do so within chromadb, for instance (https://docs.trychroma.com/getting-started#4-add-some-text-documents-to-the-collection).

Do you think something like that would be possible? If so, feel free to leave some pointers -- I'd be happy to see if I can contribute a quick PR with it.
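For illustration, this is the kind of API I have in mind, modeled on chromadb's; the document_metadatas argument is a sketch, not an existing parameter:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(
    index_name="my_index",
    collection=["first document text", "second document text"],
    # Hypothetical: one metadata dict per document, returned alongside results.
    document_metadatas=[{"source": "wiki"}, {"source": "arxiv"}],
)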

Thanks!

Integrate DSPy as a third main API Class

Ongoing project.

The goal is for RAGatouille to support more than just ColBERT, and build our way to UDAPDR support.

Integrating DSPy is the next big milestone. There are no definite plans yet for the best path to integration, but the initial goal is to be able to reproduce the HotPotQA example in under 10 lines of code.

RAGatouille should also be a drop-in replacement for the current ColBERTv2 DSPy retriever (which uses a separate ColBERT server at the moment).
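For context, this is roughly how the server-backed retriever is wired up in DSPy today; the drop-in class at the end is purely hypothetical:

import dspy

# Current approach: a retriever client pointing at a standalone ColBERTv2
# server (the URL is a placeholder).
rm = dspy.ColBERTv2(url="http://localhost:8893/api/search")
dspy.settings.configure(rm=rm)

# Hypothetical goal: the same role filled by a local RAGatouille index,
# with no separate server process.
# rm = RAGatouilleRM(index_name="my_index")  # does not exist yet
# dspy.settings.configure(rm=rm)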

Fine-tuning, ValueError: DistributedDataParallel device_ids and output_device arguments only work...

Hi,

Trying for the first time to fine-tune colbert-ir/colbertv2.0 on my Mac, an error is shown in the log:

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}

My code is just the following, mostly copy/pasted from the readme (I removed the all_documents arg):

from ragatouille import RAGTrainer

trainer = RAGTrainer(model_name="MyFineTunedColBERT",
                     pretrained_model_name="colbert-ir/colbertv2.0")  # In this example, we run fine-tuning

# This step handles all the data processing, check the examples for more details!
trainer.prepare_training_data(raw_data=pairs,
                              data_out_path="../data/",
                              # all_documents=my_full_corpus
                              )

pairs is a list of 3581 tuples such as:

('1234yf', "Le 1234yf est un bla bla...")

Here is the trace:

...
Using config.bsize = 32 (per process) and config.accumsteps = 1
[Jan 28, 14:42:04] #> Loading the queries from ../data/queries.train.colbert.tsv ...
[Jan 28, 14:42:04] #> Got 3581 queries. All QIDs are unique.

[Jan 28, 14:42:04] #> Loading collection...
0M 
[Jan 28, 14:42:05] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Process Process-5:
Traceback (most recent call last):
  File "/Users/fps/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/fps/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 128, in setup_new_process
    return_val = callee(config, *args)
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/colbert/training/training.py", line 55, in train
    colbert = torch.nn.parallel.DistributedDataParallel(colbert, device_ids=[config.rank],
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 603, in __init__
    self._log_and_throw(
  File "/Users/fps/.pyenv/versions/fps_env/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 769, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.

This didn't stop the execution, which seems to be continuing (without any new output).

Slow index creation [Windows + WSL]

Hey all, I am struggling to index a corpus of ~1.1 million passages (this is after preprocessing). I left the process running all night and it made 1% progress; that can't be right. I am using 11/32 GB of RAM and 0/8 GB of VRAM.

Is it supposed to use a GPU in the indexing process? It detects the GPU but doesn't use it.

Querying on Subset of Document_IDs

Would love to be able to pass in an array of document IDs as an argument to the query function, representing the subset of documents to query. I'm not familiar enough with the inner workings of the technology to propose a resolution myself, but I would gladly take some guidance from someone more senior so I can produce a pull request.
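A sketch of the desired call, where doc_ids is a hypothetical parameter name:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
results = RAG.search(
    "my query",
    index_name="my_index",
    k=5,
    doc_ids=["doc_3", "doc_17", "doc_42"],  # hypothetical: only search these documents
)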

Slow (?) indexing on Apple m1

Hi - really excited to try RAGatouille. On an Apple Mac with an M1 Max, it's taken over 12 hours to index. Is this expected?

PyTorch emitted some warnings about CUDA not being available, but otherwise it seems to be running without error.

Below is the output in Jupyter in VS Code - it's still running:

[Jan 14, 21:51:57] #> Creating directory .ragatouille/colbert/indexes/Miyazaki 


#> Starting...
[Jan 14, 21:51:59] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/../raga/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(
  0%|          | 0/2 [00:00<?, ?it/s]/Users/../mambaforge/envs/raga/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
[Jan 14, 21:52:00] [0] 		 #> Encoding 81 passages..
 50%|█████     | 1/2 [00:03<00:03,  3.22s/it]/Users..//mambaforge/envs/raga/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
100%|██████████| 2/2 [00:03<00:00,  1.94s/it]
WARNING clustering 10001 points to 1024 centroids: please provide at least 39936 training points
[Jan 14, 21:52:04] [0] 		 avg_doclen_est = 129.9629669189453 	 len(local_sample) = 81
[Jan 14, 21:52:04] [0] 		 Creating 1,024 partitions.
[Jan 14, 21:52:04] [0] 		 *Estimated* 10,527 embeddings.
[Jan 14, 21:52:04] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 10001 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s

Add utility for exporting a checkpoint to onnx format for serving in Vespa

First, thank you for this great work! I'm glad to see more interest in multi-vector representations for retrieval. I'm opening this feature request to discuss the best option for adding a utility to export a checkpoint to onnx format for serving in Vespa.

The following routine is what we have used to export the v2 checkpoint, but it is not directly tied to the HF_COLBERT class in the colbert repo, which handles the magic with the base class.

The below works because we know that the saved torch pickle file has the bert and linear properties (see base_colbert).

Do you have any thoughts on how to add a utility to this repo that can export the model to an ONNX file?

from transformers import AutoModel, BertPreTrainedModel
import torch
import torch.nn as nn

class VespaColBERT(BertPreTrainedModel):

    def __init__(self, config, dim):
        super().__init__(config)
        self.bert = AutoModel.from_config(config)
        # Projects hidden states down to the ColBERT embedding dim (128 for v2).
        self.linear = nn.Linear(config.hidden_size, dim, bias=False)
        self.init_weights()

    def forward(self, input_ids, attention_mask):
        Q = self.bert(input_ids, attention_mask=attention_mask)[0]
        Q = self.linear(Q)
        # L2-normalize each token vector so MaxSim reduces to dot products.
        return torch.nn.functional.normalize(Q, p=2, dim=2)

vespa_colbert = VespaColBERT.from_pretrained("colbert-ir/colbertv2.0", dim=128)

# Dummy inputs to trace the graph; batch and sequence dims are marked dynamic.
input_names = ["input_ids", "attention_mask"]
output_names = ["contextual"]
input_ids = torch.ones(1, 32, dtype=torch.int64)
attention_mask = torch.ones(1, 32, dtype=torch.int64)
args = (input_ids, attention_mask)
torch.onnx.export(vespa_colbert,
                  args=args,
                  f="colbertv2.onnx",
                  input_names=input_names,
                  output_names=output_names,
                  dynamic_axes={
                      "input_ids": {0: "batch", 1: "seq"},
                      "attention_mask": {0: "batch", 1: "seq"},
                      "contextual": {0: "batch", 1: "seq"},
                  },
                  opset_version=17)
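And a quick sanity check of the exported file, assuming onnxruntime is installed:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("colbertv2.onnx")
outputs = session.run(["contextual"], {
    "input_ids": np.ones((1, 32), dtype=np.int64),
    "attention_mask": np.ones((1, 32), dtype=np.int64),
})
print(outputs[0].shape)  # expected: (1, 32, 128)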

Example code from README seems to not work due to path differences in index name

I've run the example code as written from the README.md (on 0.0.4b1) and it seems to fail. I'm running on a Mac M1, in PyCharm with poetry on Python 3.9 (default parameters).

For indexing, I ran:

from ragatouille import RAGPretrainedModel
from ragatouille.data import CorpusProcessor
from ragatouille.utils import get_wikipedia_page

if __name__ == '__main__':
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    my_documents = [get_wikipedia_page("Hayao_Miyazaki"), get_wikipedia_page("Studio_Ghibli")]
    processor = CorpusProcessor()
    my_documents = processor.process_corpus(my_documents)
    index_path = RAG.index(index_name="my_index", collection=my_documents)

For searching I ran:

from ragatouille import RAGPretrainedModel

query = "What manga did Hayao Miyazaki write?"
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
results = RAG.search(query, index_name="my_index")

I get the following error:

Traceback (most recent call last):
File "/Users/gal/Documents/knowledgedb/rag_test.py", line 12, in
results = RAG.search(query, index_name="my_index")
File "/Users/gal/Library/Caches/pypoetry/virtualenvs/knowledgedb-OLhc9Epa-py3.9/lib/python3.9/site-packages/ragatouille/RAGPretrainedModel.py", line 187, in search
return self.model.search(
File "/Users/gal/Library/Caches/pypoetry/virtualenvs/knowledgedb-OLhc9Epa-py3.9/lib/python3.9/site-packages/ragatouille/models/colbert.py", line 279, in search
self._load_searcher(index_name=index_name, force_fast=force_fast)
File "/Users/gal/Library/Caches/pypoetry/virtualenvs/knowledgedb-OLhc9Epa-py3.9/lib/python3.9/site-packages/ragatouille/models/colbert.py", line 242, in _load_searcher
self.searcher = Searcher(
File "/Users/gal/Library/Caches/pypoetry/virtualenvs/knowledgedb-OLhc9Epa-py3.9/lib/python3.9/site-packages/colbert/searcher.py", line 33, in init
self.index_config = ColBERTConfig.load_from_index(self.index)
File "/Users/gal/Library/Caches/pypoetry/virtualenvs/knowledgedb-OLhc9Epa-py3.9/lib/python3.9/site-packages/colbert/infra/config/base_config.py", line 97, in load_from_index
loaded_config, _ = cls.from_path(metadata_path)
File "/Users/gal/Library/Caches/pypoetry/virtualenvs/knowledgedb-OLhc9Epa-py3.9/lib/python3.9/site-packages/colbert/infra/config/base_config.py", line 44, in from_path
with open(name) as f:
FileNotFoundError: [Errno 2] No such file or directory: '.ragatouille/my_index/plan.json'

Fixing the path (changing the index name to 'colbert/indexes/my_index', which seems to be the path it's expecting) fixes it.
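i.e. the workaround, in full (the path layout may differ between versions):

from ragatouille import RAGPretrainedModel

query = "What manga did Hayao Miyazaki write?"
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Prefix the index name with the subdirectory the searcher expects.
results = RAG.search(query, index_name="colbert/indexes/my_index")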

More examples & documentation

Self-explanatory, currently very barebones. Any contribution, be it documentation, more examples, or deeper tutorials, is very welcome.
