Inconsistent search results length for high top-k values (ragatouille issue, open)

ABCbum commented on July 3, 2024

Inconsistent search results length for high top-k values

Comments (3)

bclavie commented on July 3, 2024 1

Hey! This isn't a full solution to your problem (which is basically due to how the optimised retrieval engine works, and the default/dynamic hyperparameters not being very strong for small collections), but I think that for just ~800 documents, for benchmarking purposes, you could alleviate this issue by using in-memory encoding rather than indexing.
(Until I build a proper, nice HNSW-style index, I'm also planning on letting users create an "index" by persisting their in-memory encoding, which will work really well for a relatively low number of documents!)

E.g. in your situation, replace

from ragatouille import RAGPretrainedModel

# Indexing
RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
index_path = RAG.index(index_name="my_index", collection=docs, document_ids=doc_ids)

# Retrieving
RAG = RAGPretrainedModel.from_index('.ragatouille/colbert/indexes/finetuned_index')
results = RAG.search(query, k=500)

with

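from ragatouille import RAGPretrainedModel

# Encode the documents in memory (no index is built)
RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
RAG.encode(docs)

# Retrieving: scores the query against every encoded document
results = RAG.search_encoded_docs(query, k=500)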

This will actively search through every single document rather than relying on PLAID-style approximation, which for small datasets + high k values guarantees that you get the number of results you ask for (up to the number of documents encoded). The computational overhead is minimal at your data scale: on my machine, it takes ~45ms to query the index and ~55ms to query the in-memory encoded docs.
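
To illustrate why scoring every document always yields min(k, number of documents) results, here is a minimal pure-Python sketch of exhaustive late-interaction (MaxSim) search. It is not RAGatouille's implementation: the function names (`maxsim_score`, `exhaustive_search`), the result keys, and the toy 2-dimensional token vectors are all assumptions made up for this example.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token vector, take its best
    match among the document's token vectors, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def exhaustive_search(query_vecs, encoded_docs, k):
    """Score every document (no candidate pruning) and return the top k.

    Because nothing is filtered out before scoring, the result list always
    has min(k, len(encoded_docs)) entries.
    """
    scored = [
        (maxsim_score(query_vecs, doc_vecs), idx)
        for idx, doc_vecs in enumerate(encoded_docs)
    ]
    top = heapq.nlargest(k, scored)
    return [
        {"result_index": idx, "score": score, "rank": rank}
        for rank, (score, idx) in enumerate(top, start=1)
    ]


# Hypothetical toy data: three "documents", each a list of token vectors.
encoded_docs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5]],
    [[-1.0, 0.0]],
]
query_vecs = [[1.0, 0.0]]

results = exhaustive_search(query_vecs, encoded_docs, k=500)
print(len(results))  # 3: every document is returned, since k exceeds the collection size
```

An approximate engine prunes candidates before scoring, so with a small collection and a large k it can run out of candidates before reaching k; the exhaustive version trades a little latency for a predictable result count.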

bclavie commented on July 3, 2024 1

Hey, this will come along with #137 (as well as making full-vectors indexing the default index for small collections)!

ABCbum commented on July 3, 2024

Thanks, that works well! Small detail, but I think it'd be nice to add document_ids to RAG.encode, similar to how it's done with RAG.index, so that both return the same result format.
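
Until something like that exists, one possible workaround is to map results back to IDs on the caller's side. The sketch below is a hypothetical helper, not part of RAGatouille: `attach_document_ids` is an invented name, and it assumes each result is a dict carrying the original document text under a "content" key and that the texts passed to RAG.encode are unique. The "document_id" key is chosen here only to resemble an ID-bearing result format.

```python
def attach_document_ids(results, docs, doc_ids):
    """Return copies of `results` with a "document_id" key added.

    Assumptions (not confirmed by the library): every result dict has a
    "content" key holding the exact text that was encoded, and `docs` and
    `doc_ids` are parallel lists.
    """
    if len(docs) != len(doc_ids):
        raise ValueError("docs and doc_ids must have the same length")

    # If the same text appears twice, the first ID wins.
    id_by_content = {}
    for doc, doc_id in zip(docs, doc_ids):
        id_by_content.setdefault(doc, doc_id)

    # Unknown content maps to None rather than raising.
    return [
        {**result, "document_id": id_by_content.get(result["content"])}
        for result in results
    ]


# Hypothetical data standing in for real search output.
docs = ["first passage", "second passage"]
doc_ids = ["doc-a", "doc-b"]
fake_results = [
    {"content": "second passage", "score": 12.5, "rank": 1},
    {"content": "first passage", "score": 9.0, "rank": 2},
]

with_ids = attach_document_ids(fake_results, docs, doc_ids)
print([r["document_id"] for r in with_ids])
```

Matching on text is fragile if documents are split or normalised before encoding, so this is only a stopgap for benchmarking setups like the one described above.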
