
pyterrier_colbert's Introduction

pyterrier-colbert & ColBERT-PRF

Advanced PyTerrier bindings for ColBERT, covering dense indexing and retrieval. This package also includes the implementations of ColBERT PRF, approximate ANN scoring and query embedding pruning.

Usage

Given an existing ColBERT checkpoint, an end-to-end ColBERT dense retrieval index can be created as follows:

from pyterrier_colbert.indexing import ColBERTIndexer
indexer = ColBERTIndexer("/path/to/checkpoint.dnn", "/path/to/index", "index_name")
indexer.index(dataset.get_corpus_iter())

An end-to-end ColBERT dense retrieval pipeline can be formulated as follows:

from pyterrier_colbert.ranking import ColBERTFactory
pytcolbert = ColBERTFactory("/path/to/checkpoint.dnn", "/path/to/index", "index_name")
dense_e2e = pytcolbert.end_to_end()

A ColBERT re-ranker of BM25 can be formulated as follows (you will need to have an index with the text saved - the Terrier data repository conveniently provides such an index):

bm25 = pt.terrier.Retriever.from_dataset('msmarco_passage', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'text'])
sparse_colbert = bm25 >> pytcolbert.text_scorer()

Thereafter it is possible to conduct a side-by-side comparison of effectiveness:

pt.Experiment(
    [bm25, sparse_colbert, dense_e2e],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["BM25", "BM25 >> ColBERT", "Dense ColBERT"]
)

ColBERT PRF

You can use ColBERTFactory to obtain ColBERT PRF pipelines, as follows:

colbert_prf_rank = pytcolbert.prf(rerank=False)
colbert_prf_rerank = pytcolbert.prf(rerank=True)
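
As with the other pipelines above, these can be evaluated side-by-side using pt.Experiment - a minimal sketch, assuming dataset and dense_e2e are defined as in the Usage section:

# sketch: comparing ColBERT PRF (ranking and re-ranking variants) with plain dense retrieval
pt.Experiment(
    [dense_e2e, colbert_prf_rank, colbert_prf_rerank],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["Dense ColBERT", "ColBERT PRF (ranker)", "ColBERT PRF (reranker)"]
)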

ColBERT PRF requires the ColBERT index to have aligned token ids. During indexing, use the ids=True kwarg for ColBERTIndexer, as follows:

indexer = ColBERTIndexer("/path/to/checkpoint.dnn", "/path/to/index", "index_name", ids=True)

If you use ColBERT PRF in your research, you must cite our ICTIR 2021 paper (citation included below).

All of our results files are available from the paper's Virtual Appendix.

Approximate ANN Scoring and Query Embedding Pruning

This repository contains code to apply the techniques of query embedding pruning [Tonellotto21] and approximate ANN ranking [Macdonald21a].

Query embedding pruning can be applied using the following pipeline:

qep_pipe5 = (factory.query_encoder() 
            >> pyterrier_colbert.pruning.query_embedding_pruning(factory, 5) 
            >> factory.set_retrieve(query_encoded=True)
            >> factory.index_scorer(query_encoded=False)
)

where 5 is the number of query embeddings to retain, selected based on collection frequency.

Approximate ANN scoring can be applied using the following pipeline:

ann_pipe = (factory.ann_retrieve_score() % 200) >> factory.index_scorer(query_encoded=True)

where 200 is the number of top-scored ANN candidates to forward for exact scoring.
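
Both pipelines are standard PyTerrier transformers, so they can be used directly for a single query, or compared in a pt.Experiment as in the Usage section above - a minimal sketch:

# sketch: run the approximate ANN pipeline for one query, keeping only the top 10 results
(ann_pipe % 10).search("chemical reactions")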

Demos

  • vaswani.ipynb - [Github] [Colab] - demonstrates end-to-end dense retrieval and indexing on the Vaswani corpus (~11k documents)
  • colbertprf-msmarco-passages.ipynb - [Github] - demonstrates ColBERT PRF on the TREC Deep Learning track (MSMARCO) passage ranking tasks.
  • cikm2021-demos.ipynb - [Github] - demonstrates ANN scoring and Query Embedding Pruning on the TREC Deep Learning track (MSMARCO) passage ranking tasks.
  • colbert_text_and_explain.ipynb - [Github] [Colab] - demonstrates using a ColBERT model for scoring text, and for explaining an interaction. If you use one of these interaction diagrams, please cite [Macdonald21].

Resource Requirements

You will need a GPU to use this, preferably more than one. You will also need lots of RAM - ColBERT requires you to load the entire index into memory.

Name             | Corpus size   | Indexing Time     | Index Size
Vaswani          | 11k abstracts | 2 minutes (1 GPU) | 163 MB
MSMARCO Passages | 8M passages   | ~24 hours (1 GPU) | 192 GB

Installation

This package can be installed using Pip, and then used with PyTerrier. See also the examples notebooks.

pip install -q git+https://github.com/terrierteam/pyterrier_colbert.git
conda install -c pytorch faiss-gpu=1.6.5 # or faiss-cpu
#on Colab: pip install faiss-gpu==1.6.5

NB: ColBERT requires FAISS, namely the faiss-gpu package, to be installed. pip install faiss-gpu does NOT usually work. FAISS recommends using Anaconda to install faiss-gpu. On Colab, you need to resort to pip install. We recommend faiss-gpu version 1.6.3, not 1.7.0.

References

  • [Khattab20]: Omar Khattab, Matei Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of SIGIR 2020. https://arxiv.org/abs/2004.12832
  • [Macdonald20]: Craig Macdonald, Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271
  • [Macdonald21]: Craig Macdonald, Nicola Tonellotto, Iadh Ounis. On Single and Multiple Representations in Dense Passage Retrieval. In Proceedings of IIR 2021. https://arxiv.org/abs/2108.06279
  • [Macdonald21a]: Craig Macdonald, Nicola Tonellotto. On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval. In Proceedings of CIKM 2021. https://arxiv.org/abs/2108.11480
  • [Tonellotto21]: Nicola Tonellotto, Craig Macdonald. Query Embedding Pruning for Dense Retrieval. In Proceedings of CIKM 2021. https://arxiv.org/abs/2108.10341
  • [Wang21]: Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis. Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval. In Proceedings of ICTIR 2021. https://arxiv.org/abs/2106.11251

Credits

  • Craig Macdonald, University of Glasgow
  • Nicola Tonellotto, University of Pisa
  • Sanjana Karumuri, University of Glasgow
  • Xiao Wang, University of Glasgow
  • Muhammad Hammad Khan, University of Glasgow
  • Sean MacAvaney, University of Glasgow
  • Sasha Petrov, University of Glasgow


pyterrier_colbert's Issues

Performance on Robust04

Hi @cmacdonald,

After the pull request, the code runs perfectly.
However, I have a performance issue.

Requirements:
python-terrier==0.9.1
faiss-gpu==1.6.5
pyterrier-colbert==0.0.1

To create the index I am executing the following code:

checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
indexer = ColBERTIndexer(checkpoint, "./index_robust04", "my_index/", chunksize=3, skip_empty_docs=True)
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index(dataset.get_corpus_iter())

With the following output

#> Sample has shape (4352839, 128)
[feb 10, 15:32:26] Preparing resources for 1 GPUs.
[feb 10, 15:32:26] #> Training with the vectors...
[feb 10, 15:32:26] #> Training now (using 1 GPUs)...
0.06014108657836914
11.042617559432983
0.0002636909484863281
[feb 10, 15:32:37] Done training!

[feb 10, 15:32:37] #> Indexing the vectors...
[feb 10, 15:32:37] #> Loading ('./index_robust04/my_index/0.pt', './index_robust04/my_index/1.pt', './index_robust04/my_index/2.pt') (from queue)...
[feb 10, 15:32:43] #> Processing a sub_collection with shape (36038509, 128)
[feb 10, 15:32:43] Add data with shape (36038509, 128) (offset = 0)..
  IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432
33488896/36038509 (25.997 s)   Flush indexes to CPU
35979264/36038509 (28.914 s)   Flush indexes to CPU
add(.) time: 29.045 s           --               index.ntotal = 36038509
[feb 10, 15:33:12] #> Loading ('./index_robust04/my_index/3.pt', './index_robust04/my_index/4.pt', './index_robust04/my_index/5.pt') (from queue)...
[feb 10, 15:33:13] #> Processing a sub_collection with shape (33680999, 128)
[feb 10, 15:33:13] Add data with shape (33680999, 128) (offset = 36038509)..
33488896/33680999 (25.242 s)   Flush indexes to CPU
33619968/33680999 (26.493 s)   Flush indexes to CPU
add(.) time: 26.553 s           --               index.ntotal = 69719508
[feb 10, 15:33:39] #> Loading ('./index_robust04/my_index/6.pt', './index_robust04/my_index/7.pt', None) (from queue)...
[feb 10, 15:33:40] #> Processing a sub_collection with shape (17337319, 128)
[feb 10, 15:33:40] Add data with shape (17337319, 128) (offset = 69719508)..
17301504/17337319 (12.993 s)   Flush indexes to CPU
add(.) time: 13.636 s           --               index.ntotal = 87056827
[feb 10, 15:33:54] Done indexing!
[feb 10, 15:33:54] Writing index to ./index_robust04/my_index/ivfpq.100.faiss ...
[feb 10, 15:33:55]

Done! All complete (for slice #1 of 1)!
#> Faiss encoding complete
#> Indexing complete, Time elapsed 1143.59 seconds

Then I renamed ivfpq.100.faiss to ivfpq.faiss, otherwise the codebase crashes.

And I tried to execute some experiments with the following code:

from pyterrier_colbert.ranking import ColBERTFactory
checkpoint = "http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
pytcolbert = ColBERTFactory(checkpoint,"./robust", "my_index/")
dense_e2e = pytcolbert.end_to_end()
res = pt.Experiment(
  [BM25, dense_e2e],
  queries,
  qrels,
  eval_metrics=["ndcg_cut_10", "recall"],
  names=["BM25", "Dense ColBERT"],
)

However, the Dense ColBERT results are the following:

            name  ndcg_cut_10       R@5      R@10     R@15      R@20      R@30     R@100     R@200     R@500    R@1000
0           BM25     0.434104  0.086331  0.140303  0.18080  0.206941  0.249246  0.405284  0.492415  0.610984  0.689337
1  Dense ColBERT     0.062902  0.008783  0.014293  0.01824  0.020652  0.025027  0.041959  0.053847  0.073959  0.088344

Can you help me with this problem?

Thanks in advance,
Andrea

Colbert PRF as a textual reranker

Maik Frobe requested Colbert prf as a textual reranker.

I think the code should look like this:

colbert = ColBERTModelOnlyFactory(checkpoint)
bm25 = pt.BatchRetrieve(sparse_index, wmodel='BM25', metadata=['docno', 'text'])
cprf_reranker = (
    bm25 
    >> colbert.text_encoder() 
    >> ColbertPRF(colbert, k=64, fb_embs=10, beta=1, fb_docs=10, return_docs=True) 
    >> colbert.scorer()
)

but: The only thing the index is used for is the token-level IDF, so we'd need to work around that...
https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L1020-L1024

Cc/ @seanmacavaney

No faiss index found

When rebuilding the vaswani.ipynb experiment with my own index, retrieval was not possible because ivfpq.faiss was expected, but ivfpq.100.faiss was built. After renaming this, everything worked as expected.

The Indexing:

checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"
from pyterrier_colbert.indexing import ColBERTIndexer
indexer = ColBERTIndexer(checkpoint, "/content", "colbert_index", chunksize=3)
files = pt.io.find_files("/content/Data")
gen = pt.index.treccollection2textgen(files)
indexer.index(gen)

The Retrieval:

from pyterrier_colbert.ranking import ColBERTFactory
import pyterrier_colbert.indexing
pyterrier_colbert_factory = pyterrier_colbert.ranking.ColBERTFactory(checkpoint, "/content", "colbert_index")
colbert_e2e = pyterrier_colbert_factory.end_to_end()
(colbert_e2e % 10).search("chemical reactions")

Is there an Error in my code, or is this a bug?
Kind Regards,
Wilhelm

Indexing dataset read using `pt.io.find_files`

Is there a recommended way to build an index on a corpus that isn't provided via a PyTerrier dataset? The example code provided shows that this is an appropriate way to build an index:

from pyterrier_colbert.indexing import ColBERTIndexer
indexer = ColBERTIndexer("/path/to/checkpoint.dnn", "/path/to/index", "index_name")
indexer.index(dataset.get_corpus_iter())

This code relies on the dataset.get_corpus_iter() function. The dataset I'm working with is a TREC collection which, when only using PyTerrier, I have indexed using the following approach:

news_corpus_files = pt.io.find_files("/content/news_corpus/text")

indexer = pt.TRECCollectionIndexer(
    "/content/news_index",
    meta={"docno": 26, "text": 4096},
    meta_tags={"text": "ELSE"},
    verbose=True,
)

indexer.index(news_corpus_files)

Is there a similar way to use the ColBERTIndexer using a dataset that was read using pt.io.find_files()?

Thanks in advance for any help with this issue!
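
A possible approach (a sketch, not an official answer): other issues on this page convert TREC collection files found with pt.io.find_files into the iterator of {'docno', 'text'} records that ColBERTIndexer.index() expects, using pt.index.treccollection2textgen:

from pyterrier_colbert.indexing import ColBERTIndexer

files = pt.io.find_files("/content/news_corpus/text")
gen = pt.index.treccollection2textgen(files)   # yields dicts with 'docno' and 'text'
# hypothetical index location and name
indexer = ColBERTIndexer("/path/to/checkpoint.dnn", "/content/news_index_colbert", "colbert_index")
indexer.index(gen)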

Kernel crash

I have tried to run the vaswani.ipynb notebook in two different machines, but in both cases the kernel crashed in the same point. This is the log that I can see before the crash:

vaswani documents:   0%|          | 0/11429 [00:00<?, ?it/s]

[Nov 10, 15:37:24] [0] 		 #> Local args.bsize = 128
[Nov 10, 15:37:24] [0] 		 #> args.index_root = ../content
[Nov 10, 15:37:24] [0] 		 #> self.possible_subset_sizes = [69905]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[Nov 10, 15:37:30] #> Loading model checkpoint.
[Nov 10, 15:37:30] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip

/home/s7949670/.local/lib/python3.8/site-packages/torch/hub.py:452: UserWarning: Falling back to the old format < 1.6. This support will be deprecated in favor of default zipfile format introduced in 1.6. Please redo torch.save() to save it in the new zipfile format.
  warnings.warn('Falling back to the old format < 1.6. This support will be '

[Nov 10, 15:37:46] #> checkpoint['epoch'] = 0
[Nov 10, 15:37:46] #> checkpoint['batch'] = 44500




[Nov 10, 15:37:53] #> Note: Output directory ../content already exists




[Nov 10, 15:37:53] #> Creating directory ../content/colbertindex 


vaswani documents: 100%|██████████| 11429/11429 [00:28<00:00, 399.40it/s]

[Nov 10, 15:38:04] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 61.2k (overall),  62.0k (this encoding),  12955.9M (this saving)
[Nov 10, 15:38:04] [0] 		 [NOTE] Done with local share.
[Nov 10, 15:38:04] [0] 		 #> Joining saver thread.
[Nov 10, 15:38:04] [0] 		 #> Saved batch #0 to ../content/colbertindex/0.pt 		 Saving Throughput = 3.3M passages per minute.

#> num_embeddings = 581496
[Nov 10, 15:38:04] #> Starting..
[Nov 10, 15:38:04] #> Processing slice #1 of 1 (range 0..1).
[Nov 10, 15:38:04] #> Will write to ../content/colbertindex/ivfpq.100.faiss.
[Nov 10, 15:38:04] #> Loading ../content/colbertindex/0.sample ...
#> Sample has shape (29074, 128)
[Nov 10, 15:38:04] Preparing resources for 1 GPUs.
[Nov 10, 15:38:04] #> Training with the vectors...
[Nov 10, 15:38:04] #> Training now (using 1 GPUs)...
0.4895319938659668
23.038629055023193
0.00026726722717285156
[Nov 10, 15:38:28] Done training!

[Nov 10, 15:38:28] #> Indexing the vectors...
[Nov 10, 15:38:28] #> Loading ('../content/colbertindex/0.pt', None, None) (from queue)...
[Nov 10, 15:38:28] #> Processing a sub_collection with shape (581496, 128)
[Nov 10, 15:38:28] Add data with shape (581496, 128) (offset = 0)..
  IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432

Does anyone have an idea of what the problem might be?

Indexing fails when writing ivfpq.100.faiss

Hello. I have already obtained most of the needed index files, such as doclens.10.json and docnos.pkl.gz, but the last step, writing the ivfpq.100.faiss file, failed. So I want to use the files already obtained to write the ivfpq.100.faiss file. My code is as follows:

indexer = ColBERTIndexer(checkpoint, "/home/yujy/code/Colbert_PRF/index", "robust04_index",skip_empty_docs=True,chunksize=6,ids=True)
# dataset = pt.get_dataset("trec-deep-learning-passages")
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index02(dataset.get_corpus_iter())
    def index02(self, iterator):
        docnos = []
        docid = 0
        def convert_gen(iterator):
            import pyterrier as pt
            nonlocal docnos
            nonlocal docid
            if self.num_docs is not None:
                iterator = pt.tqdm(iterator, total=self.num_docs, desc="encoding", unit="d")
            for l in iterator:
                l["docid"] = docid
                docnos.append(l['docno'])
                docid += 1
                yield l
        self.args.generator = convert_gen(iterator)
        index_faiss(self.args)
        print("#> Faiss encoding complete")

But it did not work and got stuck here:

[ 21:30:28] #> Indexing the vectors...
[ 21:30:28] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)...

Unable to extract query/object scores.

Hello,

I recently used your code for colbert retrieval

res = (0.8 * colbert_e2e_base + 0.2 * colbert_e2e).transform(topic)
pt.Experiment(
    [res1],
    topic.head(50),
    #dataset.get_topics().head(50),
    dataset.get_qrels(),
    eval_metrics = ["map", "recip_rank", "mrt", "P_10", "P_20", "ndcg_cut_20"],
    names = ["ColBERT_ours"]
)

I meet this problem

TypeError: Unable to extract query/object scores.
I tried to use the pytrec_eval native method, but I got the following problem. What might the problem be?
image

Thank you very much for your time!

torch cannot load pt files

[Apr 25, 10:52:31] #> Indexing the vectors...
[Apr 25, 10:52:31] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)..

Error when loading pt file:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/faiss.py", line 92, in _loader_thread
    sub_collection = [load_index_part(filename) for filename in filenames if filename is not None]
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/faiss.py", line 92, in <listcomp>
    sub_collection = [load_index_part(filename) for filename in filenames if filename is not None]
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/index_manager.py", line 17, in load_index_part
    part = torch.load(filename)
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive

The resulting query_token often contains noise

When using code to get PRF:

prf_rank = pytcolbert.prf(rerank=False,fb_docs=10,fb_embs=10)
df=prf_rank.search(query).head(1)
query_toks=df.iloc[0]['query_toks']

The resulting query_toks often contain noise, such as:
['coating', 'smuggling', 'radioactive', 'kazakhstan', 'legislation', '##b', '12', 'minister', '"', '"']
The query_toks contain unrelated symbols or numbers instead of words.

Is this due to the use of the robust2004 dataset? Is the noise caused by selecting documents during the PRF process and extracting contents of the document set other than ?

In the indexing phase, I used the code from the pull request; that is, when using the robust2004 document set, the contents of line["text"] are extracted. Is this also required in the PRF phase?

dataset = pt.get_dataset("trec-deep-learning-passages") dataset not found (Error 404)

When recreating the ColBERT PRF demo, indexer.index(dataset.get_corpus_iter()) raised a 404 Not Found error during the index-generation step for the MSMARCO passage ranking corpus.
Changing dataset = pt.get_dataset("trec-deep-learning-passages") to dataset = pt.get_dataset("msmarco_passages") still cannot find the collection; the 404 error for the passages remains.

Bug downloading the dataset

import pyterrier as pt
import faiss
from pyterrier_colbert.indexing import ColBERTIndexer

if not pt.started(): pt.init()

dataset = pt.get_dataset('vaswani')
print("Files in vaswani corpus: %s " % dataset.get_corpus())

This code raises the following error: 'NoneType' object is not callable.

I noticed that the problem is caused by importing ColBERTIndexer. Indeed, the snippet works if this import is simply moved to the bottom:

import pyterrier as pt
import faiss

if not pt.started(): pt.init()

dataset = pt.get_dataset('vaswani')
print("Files in vaswani corpus: %s " % dataset.get_corpus())

from pyterrier_colbert.indexing import ColBERTIndexer

Can I include metadata in rerank method?

Hello,

I'm trying to use colbert.text_scorer() to do the re-ranking in the pipeline, but it seems there is no option for me to include the metadata, and the output of colbert.text_scorer() only returns the docno.

Therefore, even though my pipeline below has already included the text when doing BM25, I still need to go back to my data to match the text for each docno.

import pyterrier_colbert.ranking

colbert = pyterrier_colbert.ranking.ColBERTModelOnlyFactory(checkpoint)
pipe = (pt.BatchRetrieve(index, wmodel="BM25", metadata=["docno", "text"])
            >> pt.text.sliding(text_attr='text', length=128, stride=64, prepend_attr=None)
            >> colbert.text_scorer()
            >> pt.text.max_passage())
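
One possible workaround (a sketch only, assuming the collection is also available via an ir_datasets identifier, as in the "different scorer performs differently" issue below; 'irds:msmarco-passage' is just an example id): re-attach the text column to the re-ranked results with pt.text.get_text:

# sketch: re-fetch the 'text' metadata for each docno after re-ranking
pipe_with_text = pipe >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')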

how to score all documents in the index

Hi,
While using the dense retrieval pipeline via factory.end_to_end() I notice that not all documents are ranked and returned by the search(query) method of the pipeline. I am struggling a bit to understand where I explicitly provide the number of results to return.

For example, my index has 5000 documents, but the search returns around 3000 documents even when I call search like this:

ranked_df = (end_to_end_factory % 5000).search('my query')

Can someone please guide me in this regard?

How to run ColBERT indexing on multiple GPUs

I tried to run ColBERT indexing on trec-deep-learning-passages, and my environment has 4 GPUs available. But when I call the indexing APIs as below, only 1 GPU is utilized.

from pyterrier_colbert.indexing import ColBERTIndexer
indexer = ColBERTIndexer(checkpoint, "/path/to/index", "index_name", ids=True)
indexer.index(dataset.get_corpus_iter())

How can I leverage all 4 GPUs with the PyTerrier API?

Request for Dense Indexed Data

Hello,

This is Delaram, MSc student at the University of Windsor, Canada. I am currently researching query reformulation, explicitly focusing on robust04, dbpedia, clueweb09b, antique, and gov2 datasets using sparse retrieval methods. However, as I progress in my work, I need to use dense retrieval methods, which require its own version of indexed data. Unfortunately, I do not have the indexes for dense retrieval. I do not have the document corpus to build the indexes either. I only have the sparse indexed data.

Given your expertise and the remarkable work you've undertaken in dense retrieval methods and pyterrier_colbert, I was wondering if you or your team have generated dense indexes for the above datasets. If so, I would be grateful if you could share them with me. Access to dense indexed data would be incredibly beneficial for my ongoing research efforts.

Please let me know if you would be open to this collaboration or if you have any specific requirements or conditions regarding the sharing of dense indexed data. Your assistance and generosity would be greatly appreciated.

Retrieval Issue

Hello,

I already have a pretty big index and now I am trying to do some test retrieval.
This is the code I tried first:

import faiss
assert faiss.get_num_gpus() > 0

import pyterrier as pt
pt.init()

import torch
print('torch version' , torch.__version__)
x = torch.rand(5, 3)
print(x)

checkpoint="/home/s2003857/javaIndex/checkpoint/colbert-10000.dnn"

from pyterrier_colbert.indexing import ColBERTIndexer
from pyterrier_colbert.ranking import ColBERTFactory

pyterrier_colbert_factory = pyterrier_colbert.ranking.ColBERTFactory(checkpoint, "/home/s2003857/javaIndex/indextest", "colbert_java_index")
#pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()
(colbert_e2e % 10).search("chemical reactions")

print("retrival is da")

After getting the error

Traceback (most recent call last):
  File "smallretrival.py", line 17, in <module>
    pyterrier_colbert_factory = pyterrier_colbert.ranking.ColBERTFactory(checkpoint, "/home/s2003857/javaIndex/indextest", "colbert_java_index")
NameError: name 'pyterrier_colbert' is not defined

I changed the code to:

import faiss
assert faiss.get_num_gpus() > 0

import pyterrier as pt
pt.init()

import torch
print('torch version' , torch.__version__)
x = torch.rand(5, 3)
print(x)

checkpoint="/home/s2003857/javaIndex/checkpoint/colbert-10000.dnn"

from pyterrier_colbert.indexing import ColBERTIndexer
from pyterrier_colbert.ranking import ColBERTFactory

pyterrier_colbert_factory = ColBERTFactory(checkpoint, "/home/s2003857/javaIndex/indextest", "colbert_java_index")
#pyterrier_colbert_factory = indexer.ranking_factory()

colbert_e2e = pyterrier_colbert_factory.end_to_end()
(colbert_e2e % 10).search("chemical reactions")

print("retrival is da")

The full Error I am getting is this:

PyTerrier 0.8.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
torch version 1.10.1+cu113
tensor([[0.3226, 0.8167, 0.1429],
        [0.7141, 0.0719, 0.4174],
        [0.6066, 0.5820, 0.4509],
        [0.7547, 0.5944, 0.4332],
        [0.8414, 0.6289, 0.9862]])
1.10.1+cu113
Some weights of ColBERT were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[May 23, 21:38:52] #> Loading model checkpoint.
[May 23, 21:38:52] #> Loading checkpoint /home/s2003857/javaIndex/checkpoint/colbert-10000.dnn
[May 23, 21:38:53] #> checkpoint['epoch'] = 0
[May 23, 21:38:53] #> checkpoint['batch'] = 10000
Traceback (most recent call last):
  File "smallretrival.py", line 20, in <module>
    colbert_e2e = pyterrier_colbert_factory.end_to_end()
  File "/home/s2003857/javaIndex/venv/lib/python3.8/site-packages/pyterrier_colbert/ranking.py", line 773, in end_to_end
    return self.set_retrieve() >> self.index_scorer(query_encoded=True)
  File "/home/s2003857/javaIndex/venv/lib/python3.8/site-packages/pyterrier_colbert/ranking.py", line 607, in set_retrieve
    faiss_index = self._faiss_index()
  File "/home/s2003857/javaIndex/venv/lib/python3.8/site-packages/pyterrier_colbert/ranking.py", line 586, in _faiss_index
    self.faiss_index = FaissIndex(self.index_path, faiss_index_path, self.args.nprobe, self.args.part_range, mmap=self.faisstype == 'mmap')
TypeError: __init__() got an unexpected keyword argument 'mmap'

Is this some kind of installation issue?
Thanks in advance and kind regards
Wilhelm.

CodeBERT as base model

Hello, we are trying to use ColBERT for Code Retrieval. Therefore we would like to use a different base model than BERT, namely CodeBERT. By applying the changes contained in this commit hueck/ColBERT@1d268f5 we obtained this ColBERT checkpoint.

Is there a simple way to integrate a checkpoint based on a different architecture? I think this would be a useful feature to possibly improve the performance of the model.

I tried to customize pyterrier myself, but after fixing minor problems I ran into the following error, which I assume is not related to the custom checkpoint.

TypeError                                 Traceback (most recent call last)
<ipython-input-8-ae1901375f17> in <module>()
      7 gen = pt.index.treccollection2textgen(files)
      8 
----> 9 indexer.index(gen)


/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in index(self, iterator)
    326         create_directory(self.args.index_root)
    327         create_directory(self.args.index_path)
--> 328         ceg.encode()
    329         self.colbert = ceg.colbert
    330         self.checkpoint = ceg.checkpoint

/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in encode(self)
    401             t1 = time.time()
    402             batch = self._preprocess_batch(offset, lines)
--> 403             embs, doclens, ids = self._encode_batch(batch_idx, batch)
    404             if DEBUG:
    405                 assert sum(doclens) == len(ids), (batch_idx, len(doclens), len(ids))

/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in _encode_batch(self, batch_idx, batch)
    352     def _encode_batch(self, batch_idx, batch):
    353         with torch.no_grad():
--> 354             embs, ids = self.inference.docFromText(batch, bsize=self.args.bsize, keep_dims=False, with_ids=True)
    355             assert type(embs) is list
    356             assert len(embs) == len(batch)

TypeError: docFromText() got an unexpected keyword argument 'with_ids'

Is this related to stanford-futuredata/ColBERT#30? It seems that pyterrier assumes that this pull request has been merged.
@cmacdonald, could you maybe explain the reason for this pull request? We don't want to mask the punctuation; is there a way to just bypass it?

To reproduce the error you can use this colab. Note that it uses forked ColBERT and pyterrier_colbert versions.

Thank you for your help!

Data Format when indexing

@cmacdonald
When trying to index my custom dataset with a trained ColBERT checkpoint, what should the dataset format be like? My corpus is a pandas dataframe organized as follows

image

and I am using the following code to declare the indexer and as well as to index

indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "colbert_content", "colbertindex", chunksize=3) 

indexer.index(corpus)

But if I simply run this code (with corpus being the pandas dataframe that looks like the screenshot above) I get an error.

How should I organize my corpus data (the data that I want to search) in order to index it properly?
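
For illustration, a minimal sketch (assuming the dataframe has columns named docno and text; other column names would need mapping to those two keys): the indexer consumes an iterator of dicts with 'docno' and 'text' keys, so a dataframe can be converted like this:

# sketch: turn a pandas dataframe into the iterator of dicts expected by ColBERTIndexer
def df_to_iter(df):
    for row in df.itertuples(index=False):
        yield {"docno": str(row.docno), "text": row.text}

indexer.index(df_to_iter(corpus))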

Does this support ColBERTv2?

Wondering if this would support ColBERT v2? E.g. by installing the main branch of ColBERT and providing a ColBERTv2 checkpoint to pyterrier_colbert?

pyterrier_colbert installation

I was trying to build pyterrier_colbert indexes using Visual Studio Code. When I tried to install pyterrier_colbert from GitHub using "pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git", I got the error messages "ERROR: Failed building wheel for tokenizers" and "ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects". How can this be fixed?

Error installing pyterrier_colbert

After pip install --upgrade git+https://github.com/terrierteam/pyterrier_colbert.git in a fresh venv, everything worked as expected until

Building wheels for collected packages: pyterrier-colbert, ColBERT, python-terrier, chest, ir-measures, sklearn, wget, alembic, databricks-cli, warc3-wet-clueweb09, cwl-eval, cbor

Then the following error occurred:

Building wheels for collected packages: pyterrier-colbert, ColBERT, python-terrier, chest, ir-measures, sklearn, wget, alembic, databricks-cli, warc3-wet-clueweb09, cwl-eval, cbor
  Building wheel for pyterrier-colbert (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ix1udm9s/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ix1udm9s/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-9z69uhdx
       cwd: /tmp/pip-req-build-ix1udm9s/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for pyterrier-colbert
  Running setup.py clean for pyterrier-colbert
  Building wheel for ColBERT (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/ColBERT/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/ColBERT/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-5ypvgl1z
       cwd: /tmp/pip-install-svmltq1x/ColBERT/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for ColBERT
  Running setup.py clean for ColBERT
  Building wheel for python-terrier (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/python-terrier/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/python-terrier/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-it8vr478
       cwd: /tmp/pip-install-svmltq1x/python-terrier/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for python-terrier
  Running setup.py clean for python-terrier
  Building wheel for chest (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/chest/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/chest/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-l05_rc7s
       cwd: /tmp/pip-install-svmltq1x/chest/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for chest
  Running setup.py clean for chest
  Building wheel for ir-measures (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/ir-measures/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/ir-measures/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-a7mqpiqx
       cwd: /tmp/pip-install-svmltq1x/ir-measures/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for ir-measures
  Running setup.py clean for ir-measures
  Building wheel for sklearn (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/sklearn/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/sklearn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-5ul6w29m
       cwd: /tmp/pip-install-svmltq1x/sklearn/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for sklearn
  Running setup.py clean for sklearn
  Building wheel for wget (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/wget/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/wget/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-nwdabhto
       cwd: /tmp/pip-install-svmltq1x/wget/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for wget
  Running setup.py clean for wget
  Building wheel for alembic (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/alembic/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/alembic/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-6mkla7_u
       cwd: /tmp/pip-install-svmltq1x/alembic/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for alembic
  Running setup.py clean for alembic
  Building wheel for databricks-cli (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/databricks-cli/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/databricks-cli/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-g_rb9wlh
       cwd: /tmp/pip-install-svmltq1x/databricks-cli/
  Complete output (8 lines):
  /tmp/pip-install-svmltq1x/databricks-cli/setup.py:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for databricks-cli
  Running setup.py clean for databricks-cli
  Building wheel for warc3-wet-clueweb09 (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/warc3-wet-clueweb09/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/warc3-wet-clueweb09/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-6lyf4scz
       cwd: /tmp/pip-install-svmltq1x/warc3-wet-clueweb09/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for warc3-wet-clueweb09
  Running setup.py clean for warc3-wet-clueweb09
  Building wheel for cwl-eval (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/cwl-eval/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/cwl-eval/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-ez9kc24k
       cwd: /tmp/pip-install-svmltq1x/cwl-eval/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for cwl-eval
  Running setup.py clean for cwl-eval
  Building wheel for cbor (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/s2003857/colbert_pyterrier/project_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-svmltq1x/cbor/setup.py'"'"'; __file__='"'"'/tmp/pip-install-svmltq1x/cbor/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-vjku2nkg
       cwd: /tmp/pip-install-svmltq1x/cbor/
  Complete output (8 lines):
  /usr/lib/python3.8/distutils/extension.py:131: UserWarning: Unknown Extension options: 'headers'
    warnings.warn(msg)
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'
  ----------------------------------------
  ERROR: Failed building wheel for cbor
  Running setup.py clean for cbor
Failed to build pyterrier-colbert ColBERT python-terrier chest ir-measures sklearn wget alembic databricks-cli warc3-wet-clueweb09 cwl-eval cbor
Installing collected packages: numpy, gunicorn, werkzeug, MarkupSafe, jinja2, click, itsdangerous, Flask, prometheus-client, prometheus-flask-exporter, pytz, entrypoints, zipp, importlib-metadata, sqlparse, six, python-dateutil, pandas, greenlet, sqlalchemy, websocket-client, charset-normalizer, certifi, idna, urllib3, requests, docker, Mako, python-editor, alembic, pyparsing, packaging, querystring-parser, pyyaml, tabulate, databricks-cli, smmap, gitdb, typing-extensions, gitpython, protobuf, cloudpickle, mlflow, wheel, markdown, pyasn1, pyasn1-modules, cachetools, rsa, google-auth, absl-py, oauthlib, requests-oauthlib, google-auth-oauthlib, tensorboard-plugin-wit, grpcio, tensorboard-data-server, tensorboard, torch, tqdm, sentencepiece, regex, joblib, sacremoses, tokenizers, filelock, transformers, ujson, ColBERT, heapdict, chest, deprecation, dill, ijson, zlib-state, cbor, trec-car-tools, warc3-wet-clueweb09, warc3-wet, lxml, lz4, pyautocorpus, soupsieve, beautifulsoup4, ir-datasets, cwl-eval, pyndeval, pytrec-eval-terrier, ir-measures, multiset, matchpy, more-itertools, typish, nptyping, cython, pyjnius, scipy, threadpoolctl, scikit-learn, sklearn, patsy, statsmodels, wget, python-terrier, pyterrier-colbert
    Running setup.py install for alembic ... done
    Running setup.py install for databricks-cli ... done
    Running setup.py install for ColBERT ... done
    Running setup.py install for chest ... done
    Running setup.py install for cbor ... done
    Running setup.py install for warc3-wet-clueweb09 ... done
    Running setup.py install for cwl-eval ... done
    Running setup.py install for ir-measures ... done
    Running setup.py install for sklearn ... done
    Running setup.py install for wget ... done
    Running setup.py install for python-terrier ... done
    Running setup.py install for pyterrier-colbert ... done
Successfully installed ColBERT-0.2.0 Flask-2.0.2 Mako-1.1.6 MarkupSafe-2.0.1 absl-py-1.0.0 alembic-1.4.1 beautifulsoup4-4.10.0 cachetools-4.2.4 cbor-1.0.0 certifi-2021.10.8 charset-normalizer-2.0.9 chest-0.2.3 click-8.0.3 cloudpickle-2.0.0 cwl-eval-1.0.10 cython-0.29.25 databricks-cli-0.16.2 deprecation-2.1.0 dill-0.3.4 docker-5.0.3 entrypoints-0.3 filelock-3.4.0 gitdb-4.0.9 gitpython-3.1.24 google-auth-2.3.3 google-auth-oauthlib-0.4.6 greenlet-1.1.2 grpcio-1.42.0 gunicorn-20.1.0 heapdict-1.0.1 idna-3.3 ijson-3.1.4 importlib-metadata-4.8.2 ir-datasets-0.5.0 ir-measures-0.2.3 itsdangerous-2.0.1 jinja2-3.0.3 joblib-1.1.0 lxml-4.6.4 lz4-3.1.10 markdown-3.3.6 matchpy-0.5.5 mlflow-1.22.0 more-itertools-8.12.0 multiset-2.1.1 nptyping-1.4.4 numpy-1.21.4 oauthlib-3.1.1 packaging-21.3 pandas-1.3.4 patsy-0.5.2 prometheus-client-0.12.0 prometheus-flask-exporter-0.18.6 protobuf-3.19.1 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyautocorpus-0.1.8 pyjnius-1.3.0 pyndeval-0.0.2 pyparsing-3.0.6 pyterrier-colbert-0.0.1 python-dateutil-2.8.2 python-editor-1.0.4 python-terrier-0.7.1 pytrec-eval-terrier-0.5.1 pytz-2021.3 pyyaml-6.0 querystring-parser-1.2.4 regex-2021.11.10 requests-2.26.0 requests-oauthlib-1.3.0 rsa-4.8 sacremoses-0.0.46 scikit-learn-1.0.1 scipy-1.7.3 sentencepiece-0.1.96 six-1.16.0 sklearn-0.0 smmap-5.0.0 soupsieve-2.3.1 sqlalchemy-1.4.28 sqlparse-0.4.2 statsmodels-0.13.1 tabulate-0.8.9 tensorboard-2.7.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.0 threadpoolctl-3.0.0 tokenizers-0.8.1rc1 torch-1.10.0 tqdm-4.62.3 transformers-3.0.2 trec-car-tools-2.5.4 typing-extensions-4.0.1 typish-1.9.3 ujson-4.3.0 urllib3-1.26.7 warc3-wet-0.2.3 warc3-wet-clueweb09-0.2.5 websocket-client-1.2.3 werkzeug-2.0.2 wget-3.2 wheel-0.37.0 zipp-3.6.0 zlib-state-0.1.5

I do not know if this is intended or not. What happened here?

Kind regards!

Unable to view the full query_toks

Hello, I got the results after ColBERT PRF, e.g. query_toks=[##´, vinegar, baking, reactions, substances,... According to the contents of the document there should be 10 query_toks, but I cannot see all of them because they are truncated with ellipses. If I want to see the full query_toks and query_embs, what parameters should I change?
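
A guess: the ellipsis looks like pandas' default column-width truncation rather than anything specific to pyterrier_colbert, so widening the pandas display options should reveal the full values - a sketch:

import pandas as pd

# sketch: disable pandas' column-width truncation so list-valued columns
# such as query_toks and query_embs are printed in full
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)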

Number of epochs to train the Colbert model in the fit( )

Hi everyone
I had a question about training the Colbert model

How can I change the epoch and batch size for the fit( ) function used to train the model?

What is the number of default epochs in the fit( ) function?

Where can I see the settings for the number of epochs and the batch size for the fit( ) function?

Thank you

different scorer performs differently

The different scorers factory.scorer(), factory.text_scorer() and factory.index_scorer() generate different embeddings in the re-ranking scenario, causing performance differences in terms of nDCG@10 and MAP@1k.
Pipelines tested:

pipe1 =(factory.query_encoder()  
        >> bm25_terrier_stemmed_text
        >>pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
        >>factory.text_encoder()
        >>factory.scorer())
pipe2 = (bm25_terrier_stemmed_text
         >>pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
         >> factory.text_scorer())

pipe3 = (bm25_terrier_stemmed_text>>factory.index_scorer())
from pyterrier.measures import *
df = pt.Experiment(
    [pipe1, pipe2,pipe3],
    topics2019,
    qrels2019,
    batch_size=10, 
    verbose=True,
    filter_by_qrels=True,
    eval_metrics=[nDCG@10,RR(rel=2)@10,  AP(rel=2)@1000, R(rel=2)@1000],
    names=["pipe1","pipe2","pipe3"]
)

image

Error while indexing (flushing to cpu)

Hey,
I was using the notebook vaswani.ipynb and everything worked fine.
I tried to redo the indexing on a given remote machine, and I hope I installed everything needed:

absl-py                   1.0.0     
alembic                   1.4.1     
beautifulsoup4            4.10.0    
cachetools                4.2.4     
cbor                      1.0.0     
certifi                   2021.10.8 
charset-normalizer        2.0.9     
chest                     0.2.3     
click                     8.0.3     
cloudpickle               2.0.0     
ColBERT                   0.2.0     
cwl-eval                  1.0.10    
Cython                    0.29.25   
databricks-cli            0.16.2    
deprecation               2.1.0     
dill                      0.3.4     
docker                    5.0.3     
entrypoints               0.3       
faiss-gpu                 1.6.5     
filelock                  3.4.0     
Flask                     2.0.2     
gitdb                     4.0.9     
GitPython                 3.1.24    
google-auth               2.3.3     
google-auth-oauthlib      0.4.6     
greenlet                  1.1.2     
grpcio                    1.42.0    
gunicorn                  20.1.0    
HeapDict                  1.0.1     
idna                      3.3       
ijson                     3.1.4     
importlib-metadata        4.8.2     
ir-datasets               0.5.0     
ir-measures               0.2.3     
itsdangerous              2.0.1     
Jinja2                    3.0.3     
joblib                    1.1.0     
lxml                      4.6.5     
lz4                       3.1.10    
Mako                      1.1.6     
Markdown                  3.3.6     
MarkupSafe                2.0.1     
matchpy                   0.5.5     
mlflow                    1.22.0    
more-itertools            8.12.0    
multiset                  2.1.1     
nptyping                  1.4.4     
numpy                     1.21.4    
oauthlib                  3.1.1     
packaging                 21.3      
pandas                    1.3.5     
patsy                     0.5.2     
pip                       20.0.2    
pkg-resources             0.0.0     
prometheus-client         0.12.0    
prometheus-flask-exporter 0.18.6    
protobuf                  3.19.1    
pyasn1                    0.4.8     
pyasn1-modules            0.2.8     
pyautocorpus              0.1.8     
pyjnius                   1.3.0     
pyndeval                  0.0.2     
pyparsing                 3.0.6     
pyterrier-colbert         0.0.1     
python-dateutil           2.8.2     
python-editor             1.0.4     
python-terrier            0.7.1     
pytrec-eval-terrier       0.5.1     
pytz                      2021.3    
PyYAML                    6.0       
querystring-parser        1.2.4     
regex                     2021.11.10
requests                  2.26.0    
requests-oauthlib         1.3.0     
rsa                       4.8       
sacremoses                0.0.46    
scikit-learn              1.0.1     
scipy                     1.7.3     
sentencepiece             0.1.96    
setuptools                44.0.0    
six                       1.16.0    
sklearn                   0.0       
smmap                     5.0.0     
soupsieve                 2.3.1     
SQLAlchemy                1.4.28    
sqlparse                  0.4.2     
statsmodels               0.13.1    
tabulate                  0.8.9     
tensorboard               2.7.0     
tensorboard-data-server   0.6.1     
tensorboard-plugin-wit    1.8.0     
threadpoolctl             3.0.0     
tokenizers                0.8.1rc1  
torch                     1.10.0    
tqdm                      4.62.3    
transformers              3.0.2     
trec-car-tools            2.5.4     
typing-extensions         4.0.1     
typish                    1.9.3     
ujson                     4.3.0     
urllib3                   1.26.7    
warc3-wet                 0.2.3     
warc3-wet-clueweb09       0.2.5     
websocket-client          1.2.3     
Werkzeug                  2.0.2     
wget                      3.2       
wheel                     0.37.0    
zipp                      3.6.0     
zlib-state                0.1.5 

The code I ran was:

import faiss
assert faiss.get_num_gpus() > 0

import pyterrier as pt
pt.init()

#rm -rf /content/ARQIndex/

checkpoint="http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip"

from pyterrier_colbert.indexing import ColBERTIndexer

indexer = ColBERTIndexer(checkpoint, "/home/colbert_pyterrier/indextest", "colbert_smallindex", chunksize=3)
files = pt.io.find_files("/home/colbert_pyterrier/data/small")
gen = pt.index.treccollection2textgen(files)
indexer.index(gen)

I had already tested the data I used on the Colab mentioned above.

This is the error I got:

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
[Dec 28, 17:29:59] [0] 		 #> Local args.bsize = 128
[Dec 28, 17:29:59] [0] 		 #> args.index_root = /home/colbert_pyterrier/indextest
[Dec 28, 17:29:59] [0] 		 #> self.possible_subset_sizes = [69905]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[Dec 28, 17:30:05] #> Loading model checkpoint.
[Dec 28, 17:30:05] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip
/home/colbert_pyterrier/project_env/lib/python3.8/site-packages/torch/hub.py:513: UserWarning: Falling back to the old format < 1.6. This support will be deprecated in favor of default zipfile format introduced in 1.6. Please redo torch.save() to save it in the new zipfile format.
  warnings.warn('Falling back to the old format < 1.6. This support will be '
[Dec 28, 17:30:13] #> checkpoint['epoch'] = 0
[Dec 28, 17:30:13] #> checkpoint['batch'] = 44500




[Dec 28, 17:30:14] #> Note: Output directory /home/colbert_pyterrier/indextest already exists




[Dec 28, 17:30:14] #> Creating directory /home/colbert_pyterrier/indextest/colbert_smallindex 


[Dec 28, 17:30:55] [0] 		 #> Completed batch #0 (starting at passage #0) 		Passages/min: 14.8k (overall),  16.6k (this encoding),  6486.0M (this saving)
[Dec 28, 17:30:55] [0] 		 [NOTE] Done with local share.
[Dec 28, 17:30:55] [0] 		 #> Joining saver thread.
[Dec 28, 17:30:56] [0] 		 #> Saved batch #0 to /home/colbert_pyterrier/indextest/colbert_smallindex/0.pt 		 Saving Throughput = 864.8k passages per minute.

#> num_embeddings = 1572487
[Dec 28, 17:30:56] #> Starting..
[Dec 28, 17:30:56] #> Processing slice #1 of 1 (range 0..1).
[Dec 28, 17:30:56] #> Will write to /home/colbert_pyterrier/indextest/colbert_smallindex/ivfpq.100.faiss.
[Dec 28, 17:30:56] #> Loading /home/colbert_pyterrier/indextest/colbert_smallindex/0.sample ...
#> Sample has shape (78624, 128)
[Dec 28, 17:30:56] Preparing resources for 2 GPUs.
[Dec 28, 17:30:56] #> Training with the vectors...
[Dec 28, 17:30:56] #> Training now (using 2 GPUs)...
0.6025171279907227
24.62867784500122
0.01384115219116211
[Dec 28, 17:31:21] Done training!

[Dec 28, 17:31:21] #> Indexing the vectors...
[Dec 28, 17:31:21] #> Loading ('/home/colbert_pyterrier/indextest/colbert_smallindex/0.pt', None, None) (from queue)...
[Dec 28, 17:31:21] #> Processing a sub_collection with shape (1572487, 128)
[Dec 28, 17:31:21] Add data with shape (1572487, 128) (offset = 0)..
IndexShards shard 0 select modulo 2 = 0
  IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432
IndexShards shard 1 select modulo 2 = 1
  IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432
1507328/1572487 (0.614 s)   Flush indexes to CPU
Traceback (most recent call last):
  File "smallindex.py", line 16, in <module>
    indexer.index(gen)
  File "/home/colbert_pyterrier/project_env/lib/python3.8/site-packages/pyterrier_colbert/indexing.py", line 343, in index
    index_faiss(self.args)
  File "/home/colbert_pyterrier/project_env/lib/python3.8/site-packages/colbert/indexing/faiss.py", line 108, in index_faiss
    index.add(sub_collection)
  File "/home/colbert_pyterrier/project_env/lib/python3.8/site-packages/colbert/indexing/faiss_index.py", line 48, in add
    self.gpu.add(self.index, data, self.offset)
  File "/home/colbert_pyterrier/project_env/lib/python3.8/site-packages/colbert/indexing/faiss_index_gpu.py", line 118, in add
    self._flush_to_cpu(index, nb, offset)
  File "/home/colbert_pyterrier/project_env/lib/python3.8/site-packages/colbert/indexing/faiss_index_gpu.py", line 135, in _flush_to_cpu
    self.gpu_index.sync_with_shard_indexes()
AttributeError: 'IndexShards' object has no attribute 'sync_with_shard_indexes'

What went wrong?
Thank you for your help, and sorry for the long output.
Kind regards
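
As context for the AttributeError above: it means the installed faiss build does not provide the IndexShards.sync_with_shard_indexes method that ColBERT's multi-GPU indexing code calls, which usually points to a faiss release newer than the one that code was written against. A minimal diagnostic sketch, assuming a pip- or conda-installed faiss-gpu (the embedding dimension of 128 is illustrative):

import faiss

# Check the installed faiss version and whether the method ColBERT's GPU
# sharding code relies on is present; if it is missing, installing an older
# faiss-gpu release (or patching the call) is one way forward.
print(faiss.__version__)
shards = faiss.IndexShards(128)  # 128 = embedding dimension, illustrative only
print(hasattr(shards, "sync_with_shard_indexes"))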

TypeError: in method 'GpuIndex_add_with_ids', argument 4 of type 'faiss::Index::idx_t const *'

Hi everybody,
I have a problem. This is my code:

import pandas as pd
from pyterrier_colbert.indexing import ColBERTIndexer

indexer = ColBERTIndexer(addr_checkpoint, addr + "Indexing/", "index_doc", chunksize=3)
doc_26k_collection = pd.read_csv(addr + 'Sajad_ds_small/doc_26k.tsv', sep='\t')
doc_collection = doc_26k_collection.rename({'doc_id': 'docno'}, axis='columns')  # the indexer expects a 'docno' column
doc_collection['docno'] = doc_collection['docno'].astype(str)
indexer.index(doc_collection.to_dict("records"))

Outputs:

TypeError Traceback (most recent call last)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_7936/1940468918.py in <module>
13
14 # ###############
---> 15 indexer2.index(doc_26k_collection_new.to_dict("records"))

C:\ProgramData\Anaconda3\lib\site-packages\pyterrier_colbert\indexing.py in index(self, iterator)
351 warn("Default computation chooses", self.args.partitions,
352 "partitions (for {} embeddings)".format(num_embeddings))
--> 353 index_faiss(self.args)
354 print("#> Faiss encoding complete")
355 endtime = timer()

C:\ProgramData\Anaconda3\lib\site-packages\colbert\indexing\faiss.py in index_faiss(args)
106
107 print_message("#> Processing a sub_collection with shape", sub_collection.shape)
--> 108 index.add(sub_collection)
109
110 print_message("Done indexing!")

C:\ProgramData\Anaconda3\lib\site-packages\colbert\indexing\faiss_index.py in add(self, data)
46
47 if self.gpu.ngpu > 0:
---> 48 self.gpu.add(self.index, data, self.offset)
49 else:
50 self.index.add(data)
C:\ProgramData\Anaconda3\lib\site-packages\colbert\indexing\faiss_index_gpu.py in add(self, index, data, offset)
107 xs = data[i0:i1]
108
--> 109 self.gpu_index.add_with_ids(xs, np.arange(offset+i0, offset+i1))
110
111 if self.max_add > 0 and self.gpu_index.ntotal > self.max_add:

C:\ProgramData\Anaconda3\lib\site-packages\faiss\__init__.py in replacement_add_with_ids(self, x, ids)
233
234 assert ids.shape == (n, ), 'not same nb of vectors as ids'
--> 235 self.add_with_ids_c(n, swig_ptr(x), swig_ptr(ids))
236
237 def replacement_assign(self, x, k, labels=None):

C:\ProgramData\Anaconda3\lib\site-packages\faiss\swigfaiss_avx2.py in add_with_ids(self, n, x, ids)

-> 8867 return _swigfaiss_avx2.GpuIndex_add_with_ids(self, n, x, ids)
8868
8869 def assign(self, n, x, labels, k=1):

TypeError: in method 'GpuIndex_add_with_ids', argument 4 of type 'faiss::Index::idx_t const *'

OS: Windows 10

Faiss version: faiss-gpu

Installed from: anaconda

Faiss compilation options:

Running on: GPU (1 GPU is used)

Interface: Python (python 3.8)

My code runs on Google Colab but doesn't work on our system. How can I solve this problem? I hope you can help me.
Thank you
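
For context, the TypeError above ("argument 4 of type 'faiss::Index::idx_t const *'") typically indicates an integer-width mismatch: faiss expects 64-bit document ids, while np.arange without an explicit dtype produces 32-bit integers on Windows builds of NumPy. A small illustration of the mismatch, not a patch to either library:

import numpy as np

# On Windows, NumPy's default integer type is 32-bit, but faiss's add_with_ids
# expects 64-bit idx_t ids; the second array is the form faiss can accept.
ids32 = np.arange(0, 1000)                    # dtype int32 on Windows
ids64 = np.arange(0, 1000, dtype=np.int64)    # explicit 64-bit ids
print(ids32.dtype, ids64.dtype)

If that is what is happening here, one possible workaround is to cast the ids created by the np.arange call shown in the traceback (in colbert/indexing/faiss_index_gpu.py) to np.int64.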

Resume building the index when the process crashes

Hi,

I have recently switched to PyTerrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating this wonderful framework.

I have been going through the pain of building a ColBERT dense index for MSMARCO passages v2: the process took a long time and crashed halfway through due to some technical issues.

I wonder if there is built-in support for resuming index building; if not, I would appreciate any tips on doing so while keeping the integrity of the partially built index. Thank you.
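
One manual mitigation, offered here only as a sketch rather than a documented pyterrier_colbert feature, is to index the corpus in independent shards so that a crash only loses the shard currently being built; the shard size, paths, index names, and the `dataset` variable below are illustrative, and the resulting shard indexes would still need to be queried or combined separately.

import itertools
from pyterrier_colbert.indexing import ColBERTIndexer

def shards(iterator, shard_size=500_000):
    # yield successive lists of up to shard_size documents from the corpus iterator
    iterator = iter(iterator)
    while True:
        batch = list(itertools.islice(iterator, shard_size))
        if not batch:
            break
        yield batch

corpus_iter = dataset.get_corpus_iter()  # assumes `dataset` is defined as in the examples above
for i, batch in enumerate(shards(corpus_iter)):
    indexer = ColBERTIndexer("/path/to/checkpoint.dnn", "/path/to/index", "shard_%d" % i)
    indexer.index(batch)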

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.