Hi, Thanks for releasing this library. I am planni

ColBERT on Wikipedia corpus about colbert HOT 1 CLOSED

stanford-futuredata commented on August 26, 2024

ColBERT on Wikipedia corpus


Comments (1)

okhat commented on August 26, 2024

Hi Shashank! Sorry for the late response.

I strongly recommend using a passage-level Wikipedia corpus. It's common in the Open-QA literature (e.g., our ColBERT-QA paper) to divide Wikipedia into 100-word or (say) 200-token passages, keeping the title of the page at the start of each passage.
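As a rough illustration of the passage splitting described above, here is a minimal sketch that cuts a page into fixed-size word windows and prepends the page title to each one. The function name `split_into_passages`, the `" | "` separator between title and text, and whitespace-based word counting are all assumptions made for this example, not something prescribed by ColBERT or the ColBERT-QA paper.

```python
from typing import List


def split_into_passages(title: str, text: str, words_per_passage: int = 100) -> List[str]:
    """Split one page into consecutive, non-overlapping passages.

    Words are counted by whitespace splitting (an assumption; a tokenizer-based
    limit such as 200 tokens would need the model's tokenizer instead).
    """
    words = text.split()
    passages = []
    for start in range(0, len(words), words_per_passage):
        chunk = " ".join(words[start:start + words_per_passage])
        # Keep the page title at the start of every passage.
        # The " | " separator is a hypothetical choice for this sketch.
        passages.append(f"{title} | {chunk}")
    return passages


if __name__ == "__main__":
    # Toy page with 250 "words": yields passages of 100, 100, and 50 words.
    page_text = " ".join(f"w{i}" for i in range(250))
    passages = split_into_passages("Example page", page_text)
    print(len(passages))  # 3
```

Each resulting passage can then be written out as one entry of the collection you index, so that retrieval operates at the passage level rather than over whole pages.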

For the second one, encoding the corpus (or the queries) with colbert.index can give you files with all the embeddings. Or you can use the ModelInference class from colbert/modeling/inference.py, and in particular queryFromText and docFromText. See existing uses in the code for how to do this; it's pretty simple!
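For the ModelInference route, a minimal sketch might look like the following. It assumes you already have a constructed `ModelInference` object (building it depends on how the checkpoint is loaded in your version of the code, so that part is left out) and that `queryFromText` and `docFromText` each accept a list of strings. The wrapper name `embed_queries_and_passages` is hypothetical, and the exact return types and optional arguments should be checked against colbert/modeling/inference.py.

```python
from typing import Any, List, Tuple


def embed_queries_and_passages(
    inference: Any, queries: List[str], passages: List[str]
) -> Tuple[Any, Any]:
    """Encode queries and passages with an existing inference object.

    `inference` is assumed to expose queryFromText and docFromText,
    as ModelInference does; nothing here constructs or loads a model.
    """
    # Query-side embeddings for every query string.
    query_embeddings = inference.queryFromText(queries)
    # Document-side embeddings for every passage string.
    passage_embeddings = inference.docFromText(passages)
    return query_embeddings, passage_embeddings
```

The returned objects can then be saved or post-processed however you need; what they contain exactly is determined by the library, not by this wrapper.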

Let me know if you face any issues!
