stanford-futuredata / colbert

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

License: MIT License

Python 92.98% C++ 5.73% Cuda 1.29%

colbert's People

yufanpapa qianrenjian d3v3l0 thuyshao humdingers lurunchik terrierteam krishnadurai mmirshekari ningshiqi qratosone kyoungrok0517 ml-lab ansarturia amallia cmacdonald phil1108 iglimanaj petulla crescentluna yastil casperhansen bbmmjjdd xhyandwyy dscohen muramon emaiuileboo ericdoug-qi fword jobergum ashwinparanjape seanmacavaney victorzuanazzi anreu qinyinmm bangush gaonnr vjeronymo2 aloneirew jamesdeantonis harryzhuangtw jongwon-jay-lee akakaala wonderingalways suhas1999 kentonmurray e-budur littlewine mieseung socioprophet xrr233 codeaudit xiangli1999 xiaoya-li hejudgin jihyukkim-nlp mkolodne hueck hammadkhann techthiyanes cramraj8 sashank06 thongnt99 zcf131016 moontree kimyeondu lorr1 minkow hannawong oii123 jeremi-nh tginart kenzosakiyama arthurcamara hubayirp taoshen58 yuki617 chevincherry wyfunique yaasha chanlim kding1 mattchurgin pavitrayuvaraj kiminh mholmeslinder txdas atamborrino raphaelsty gabinguo sarshaw nashid nicexw lhfazry gluver smiyawaki0820 index08 thakur-nandan jay-gt mitsuhiko-nozawa

colbert's Issues

Questions about the code semantic

Hi. I've read your paper and now I'm trying to understand your code. Most of it is intuitive; thanks for the readable and runnable code :). Unfortunately, I'm having trouble understanding why the labels are filled with only zeros. Please bear with me: I'm quite new to PyTorch, so the question might be silly.

labels = torch.zeros(args.bsize, dtype=torch.long, device=DEVICE)
...
loss = criterion(out, labels[:out.size(0)])
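
One way to read the zero labels, assuming `out` has shape `(batch, 2)` with the positive passage's score in column 0 and the negative passage's score in column 1: the "correct class" of every row is index 0, so the target vector is all zeros. The pure-Python sketch below mirrors what a softmax cross-entropy criterion computes for a single row; the function name is illustrative and not taken from the repository.

```python
import math

def pairwise_softmax_loss(pos_score, neg_score, label=0):
    """Cross-entropy over the two scores [pos, neg] with the given target index."""
    scores = [pos_score, neg_score]
    # log-sum-exp, shifted by the max for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    # negative log-probability of the target class
    return log_z - scores[label]

# Label 0 means "column 0 (the positive passage) should win".
good = pairwise_softmax_loss(pos_score=5.0, neg_score=1.0)  # positive ranked higher
bad = pairwise_softmax_loss(pos_score=1.0, neg_score=5.0)   # negative ranked higher
print(good < bad)
```

Under that assumption, the loss is small when the positive outscores the negative and large otherwise, which is the pairwise ranking objective.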

training resumption reverts to `bert-base-cased` rather than the provided checkpoint

Hello again,

When I run training from a checkpoint (providing the --resume flag and a --checkpoint flag), I think the training (that occurs in the lines around here) doesn't actually use the checkpoint, but instead starts from the un-finetuned bert-base that would be used if starting from scratch.

Is this intended? What should I do if I would like to resume from the checkpoint? Some evidence that I am right (I think I am, but I could be wrong) is provided below. If I am right, I'd be happy to provide a pull request with a fix. Thanks!

Code evidence: There are two models defined, checkpoint (the already partially finetuned model) and colbert (taken from a fresh bert-base). It looks like the checkpoint is largely ignored while colbert is what gets all the training action.

Anecdotal evidence: I tried to resume training after 80000 batches of size 32 each, at which point my model was pretty far along in its training. However, when I hit resume, I noticed that the loss rate started at a poor value and improved dramatically, as is typically the case at the fresh start of the finetuning process.

Long document/passage splitting in ColBERT

I would like to ask how/whether ColBERT handles long documents and whether there is any splitting etc. going on.

I think I've asked this question in the past @okhat and the response was negative (there's no handling whatsoever, it's just matching on the first X tokens of the passage, depending on doc_maxlen, which is often set to 180), but I can't fully remember and the following logs made me reconsider:

[Jul 21, 11:54:40] #> Processing batch #0..
[Jul 21, 11:54:40] #> Fetching parts 2--3 from queue..
[Jul 21, 11:54:49] #> Using strides [108, 180]..

In this case, what are the strides? I recognize 180 as the doc_maxlen. Is this strictly practical and related to batching, or does it have something to do with document splitting?

thanks!

long document ranking

Hi, how can ColBERT be used for long document ranking?
I see that your team has submitted a "ColBERT MaxP end-to-end" model to the MS MARCO Document Ranking Leaderboard. Would you mind releasing the code and updating this repository?
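
For context, a common way to apply a passage-level ranker to long documents is MaxP aggregation: split each document into passages, score every passage, and use the best passage score as the document score. The sketch below assumes passage scores are already available; the `(doc_id, score)` input layout is hypothetical, and this is not the code behind the leaderboard submission.

```python
from collections import defaultdict

def maxp_rank(passage_scores):
    """Aggregate (doc_id, passage_score) pairs into a ranked list of documents.
    Each document is scored by its best passage (MaxP)."""
    best = defaultdict(lambda: float("-inf"))
    for doc_id, score in passage_scores:
        best[doc_id] = max(best[doc_id], score)
    # Highest-scoring document first
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

scores = [("D1", 12.5), ("D1", 18.0), ("D2", 17.2), ("D3", 9.1), ("D2", 3.0)]
ranking = maxp_rank(scores)
```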

Running/debugging ColBERT on CPU

Hello,
I am trying to make some changes/adaptations in the code of ColBERT and I want to run locally on a MacBook without GPU.

However, when I try to create the index (python -m colbert.index ... ) I get
AssertionError: Torch not compiled with CUDA enabled

On the other hand, when I generate the index on a GPU, transfer it to my local machine and run python -m colbert.index_faiss ... I get:

[Jun 16, 14:29:18] #> Will write to /Users/amkrasakis/data/faiss_indexes/MSMARCO.L2.32x200k.180len.small/ivfpq.32768.faiss.
[Jun 16, 14:29:18] #> Loading /Users/amkrasakis/data/faiss_indexes/MSMARCO.L2.32x200k.180len.small/0.pt ...
[Jun 16, 14:29:24] #> Loading /Users/amkrasakis/data/faiss_indexes/MSMARCO.L2.32x200k.180len.small/1.pt ...
#> Sample has shape (2946381, 128)
[Jun 16, 14:29:25] #> Training with the vectors...
[Jun 16, 14:29:25] #> Training now (using 0 GPUs)...
Segmentation fault: 1

Is there a solution to that?
For instance, since colbert.index_faiss does not complain about the missing GPU (at least so far), could I generate the faiss index on a Linux node with a GPU and afterwards transfer it to another machine with FAISS CPU?

Thanks for your help!

Batch Reranking is very slow

Hello,

While batch retrieval is very fast (one pass in .3 seconds), batch reranking is very slow (one pass in roughly 10 minutes). I think this is because the entire index is loaded in steps. I think this is unnecessary because we only care about a subset of the index. For example, a retrieval step for me returned 2000 pids, which multiplied by 512 doc max len, means we care about 2000 * 512 = 1,024,000 vectors. Isn't there a way to load only these vectors using a mapping? This might cut the time down dramatically.

Curious what your thoughts are on this.

Thanks!
Jamie

EDIT: I think the faiss.DirectMap might be helpful here. We know the indexes of the vectors of interest, so we can simply call index.reconstruct(idx) for each idx, which should fairly quickly give us the full matrix for each doc.
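
The mapping idea can be sketched with prefix sums over document lengths: if the token embeddings of all passages are stored back to back, the rows for passage `pid` are `offsets[pid]` up to `offsets[pid] + doclens[pid]`. The flat layout and the names below are assumptions for illustration, not the repository's API.

```python
from itertools import accumulate

def build_offsets(doclens):
    """offsets[pid] is the row of the first embedding of passage pid."""
    return [0] + list(accumulate(doclens))[:-1]

def rows_for_pids(pids, doclens, offsets):
    """Return the (start, end) row range for each requested passage."""
    return {pid: (offsets[pid], offsets[pid] + doclens[pid]) for pid in pids}

doclens = [3, 5, 2, 4]            # number of token embeddings per passage
offsets = build_offsets(doclens)  # [0, 3, 8, 10]
ranges = rows_for_pids([1, 3], doclens, offsets)
```

Note that 2000 * 512 is an upper bound: passages shorter than the doc max len contribute fewer rows.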

subclassing CollectionEncoder

We would like to support indexing data in formats other than the MSMARCO passage file, for instance an iterator or a dataframe.
Most of the changes would be in CollectionEncoder:

_batch_passages() and _preprocess_batch() could be overridden
encode() is nearly fine, except for `with open(self.collection) as fi:`, which assumes a file. Could this be extracted to another class method?

Expected result from the reranking without indexing

Hi Omar,
Is there some number we should be looking for, roughly speaking, when doing the reranking (without indexing) on the msmarco dev set? (EDIT: Would be equivalent to Table 1 in the paper?)
The reason is that I'd like to be able to use this performance to gauge the performance of the full indexed retrieval. Would this work, or does this metric not really correlate with the full indexed performance?

I trained a model with a slightly larger batch size and without gradient accumulation for 200000 steps, and using the msmarco evaluation script from anserini, the MRR@10 is around 0.3514. Does this look reasonable?
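
For reference, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage within the top 10 (0 if none appears). A minimal sketch, assuming rankings and relevance judgments are held in plain dicts (this layout is hypothetical and differs from the evaluation script's file formats):

```python
def mrr_at_k(rankings, qrels, k=10):
    """rankings: {qid: [pid, ...] in ranked order}; qrels: {qid: set of relevant pids}."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(rankings.get(qid, [])[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(qrels)

rankings = {"q1": [7, 3, 9], "q2": [4, 5, 6], "q3": [1, 2]}
qrels = {"q1": {3}, "q2": {4}, "q3": {8}}
score = mrr_at_k(rankings, qrels)  # (1/2 + 1 + 0) / 3
```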

Stuck on baseline reproduction

I have successfully reproduced the results of the final ColBERT. However, I am stuck on reproducing the baselines, i.e. [A], [B], [C] and [D] in the picture below.

Could you give me some help, for example by sharing the source code of the ablation study?

about ColBERT(BertPreTrainedModel)

Hello, I am reading your code to replicate the experiment. I have some questions about the model in "model.py".

1. In the query() function, "queries" are word lists, so they cannot be passed to self.tokenizer.encode(), whose standard input is text.
2. In the doc() function,

docs = [["[unused1]"] + self._tokenize(d)[:self.doc_maxlen-3] for d in docs]

the result of "self._tokenize()" is a word list, not a word-piece list, so it should not be truncated by doc_maxlen, which limits the number of word-piece tokens.
3. Although the paper says "Unlike queries, we do not append [mask] tokens to documents.", in the code the encoding function is "_encode()" for both queries and docs, with the same [mask] padding.

is v0.2 available now?

@okhat Hi! Thanks for your paper and repo! It seems that everything for v2 is ready. I just want to check whether it is fully usable in terms of the performance of a model trained with version 2, i.e. with the command

python -m colbert.train --triples triples.train.small.tsv

thx! : )

Errors when trying to interface directly with the underlying API for re-ranking

I keep getting the pictured error:

Upon investigation, I see that stride is referenced here but isn't defined prior in the method. Can you please explain if this is intended or a bug?

Thanks

Support for Images

If there are images/figures in the document, is it possible to retrieve the images + text from the document using text as the query?

Performance Issues with RoBERTa Models

I am currently training a multilingual model with your approach, and with bert-base-multilingual-uncased it works great.
Now I tried switching to xlm-roberta-base (which in general is better pre-trained than mBERT), but performance is far off.
Both are trained on the same system with the same batch size.

Here is a plot of the loss over training steps:

Evaluation performance is very different as well:
mBERT @ 32k Steps: MRR@10 0.22
XLM-RoBERTa @ 32k Steps: MRR@10 0.07

As RoBERTa uses a BPE vocab, I had to add unused tokens by hand and initialize the embeddings for them randomly (transformers does that with mean=0 and std=0.02):

        self.tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
        self.tokenizer.add_tokens(["[unused0]"])
        self.tokenizer.add_tokens(["[unused1]"])
        
        self.skiplist = {w: True for w in string.punctuation}

        self.bert = XLMRobertaModel(config)
        self.bert.resize_token_embeddings(len(self.tokenizer))

https://github.com/Phil1108/ColBERT/blob/b786e2e7ef1c13bac97a5f8a35aa9e5ff9f3425f/src/model.py#L24

Is the bad performance caused by RoBERTa not including next-sentence prediction in its pre-training? Or is the ColBERT approach not transferable to BPE vocabs?
Or is my way of adding unused tokens to the vocab causing undetected problems?

Proposal for an inference class

Hi all,

Here is the code for the class I'm currently using to wrap colbert stuff. (sorry if there are some errors; I tried to delete the extra internal code that's only relevant to my team). Maybe something like this could be merged into the actual repo?

Jamie

from dataclasses import dataclass, field
import os

import torch
from transformers.modeling_utils import no_init_weights
from transformers import BertConfig

from colbert.modeling.inference import ModelInference
from colbert.ranking.rankers import Ranker
from colbert.modeling.colbert import ColBERT

@dataclass
class RankerArgs:
    index_path: str = field(metadata={"help": "path to doclens files"})
    faiss_index_path: str = field(metadata={"help": "path to faiss indices (often the same place as `index_path`)"})
    nprobe: int = field(metadata={"help": "the number of clusters to visit during faiss search"})
    part_range: range = field(init=False)

    def __post_init__(self):
        self.part_range = None

class ColbertRetriever:
    def __init__(
        self,
        colbert_model_path: str,
        index_path: str,
        faiss_index_path: str,
        amp: bool = False,  # moved after the required parameters: a default cannot precede a non-default
        nprobe: int = 10,
        faiss_depth: int = 1024,
    ):

        inference = ModelInference(
            ColbertModel.from_saved_model(colbert_model_path), amp=amp
        )
        ranker_args = RankerArgs(index_path, faiss_index_path, nprobe)
        self.ranker = Ranker(ranker_args, inference, faiss_depth=faiss_depth)

    def retrieve_and_rerank(self, query: str, k: int):
        Q = self.ranker.encode([query])  # encode the query
        pids, scores = self.ranker.rank(Q)  # rank
        
        assert k <= len(pids)

        pids = pids[:k]
        scores = scores[:k]
        return pids, scores

class ColbertModel(ColBERT):
    @classmethod
    def from_saved_model(cls, model_path: str) -> "ColbertModel":
        """
        load colbert from a saved model

        Parameters
        ----------
        model_path : str
            the full path to a file containing a json with state_dict and other things

        Returns
        -------
        a colbert model
        """

        model_dict = torch.load(model_path)

        config = BertConfig()

        args = model_dict["arguments"]

        with no_init_weights():
            model = cls(
                config,
                query_maxlen=args["query_maxlen"],
                doc_maxlen=args["doc_maxlen"],
                mask_punctuation=args["mask_punctuation"],
                dim=args["dim"],
                similarity_metric=args["similarity"],
            )

        cls._load_state_dict_into_model(
            model, model_dict["model_state_dict"], model_path
        )

        model.eval()
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model.to(device)
        return model

Request for defining the command line parameters in more details

Hi @okhat,

Thanks for sharing and maintaining the codes. I had a small clarification question regarding the command line parameters used while indexing - https://github.com/stanford-futuredata/ColBERT#indexing

We have to enter "--root" twice while indexing. Do we have to give the same directory as the experiment directory used while training ("/root/to/experiments/") for both --root params?
It would be very helpful if a short description could be added for all the params used in the training, validation, indexing and retrieval steps.

I am also getting this error -
ColBERT/ColBERT-master/colbert/index.py", line 28, in main
assert not os.path.exists(args.index_path), args.index_path

which I believe can be solved once correct command line parameters are given. Do you have any suggestions on this?

Thanks.

ColBERT on Wikipedia corpus

Hi,

Thanks for releasing this library.

I am planning to use ColBERT for a ranking task on the Wikipedia corpus (as part of the FAIR Ranking track: https://fair-trec.github.io/). Briefly, given a keyword consisting of terms related to Wiki articles, the task is to generate a ranked list of Wiki docs. I have a couple of questions about using the model on the task:

Wikipedia docs are typically very long. To use them in the model, if I truncate each doc (say, take the first 500 words), will it affect the performance of the model?
I want to use the query and document embeddings from ColBERT as features in another model. Is there a way to get the query and document embeddings after training?

Thanks.

can't load full index into memory

What's the easiest way to use ColBERT without loading the full index into memory? We are building an index off of the wiki_dpr dataset (and eventually more), so we have about 21 million passages and counting. The full index is about 630 GB on disk and we have 230 GB of memory to work with (hopefully not needing nearly the full 230). I understand that faiss allows for this type of search (only metadata gets loaded into memory and the actual vectors stay on disk), so I'm curious whether you support this in the current repo.

When running the retrieval script in the README, I run into memory issues once I start building IndexPart here. Should I be doing something with the index_part param? Any insight would be greatly appreciated.

Thanks!

(UPDATE: fyi, the retrieve script runs properly on a tiny dev subset of the dataset)

Make possible to pip install

Hello, thanks for your repository and SIGIR paper. We would like to develop wrappers on top of ColBERT. Would it be possible to make the repo compatible with pip? This would need:

make a setup.py
rename src directory as colbert

TREC-CAR pre-trained model in PyTorch

I am trying to look at the TREC-CAR data and the model pre-trained on its Wikipedia training set. However, the previous work provides the pre-trained model only in TensorFlow. Did you have to convert the TensorFlow checkpoint files to PyTorch to train your model on top?

Effectiveness comparison with Poly-Encoder for text matching tasks

Hi, I just wonder which of ColBERT and Poly-Encoder is better for text matching tasks (i.e. deciding whether a sentence pair is in the same class). Have you compared them? Thx!

on [PAD] as an actual token

A real MSMARCO document:

427711	Peripheral Vascular Disease (PVD; Peripheral Artery Disease [PAD]). Peripheral vascular disease (PVD) is a problem with poor blood flow. It affects blood vessels outside of the heart and brain and gets worse over time.Parts of the body, like the brain, heart, arms, or legs, may not get enough blood.The legs and feet are most commonly affected.eripheral vascular disease (PVD) is a problem with poor blood flow. It affects blood vessels outside of the heart and brain and gets worse over time. Parts of the body, like the brain, heart, arms, or legs, may not get enough blood. The legs and feet are most commonly affected.

Note that [PAD] here is an actual token. The BERT tokenizer assigns it the actual PAD token id, i.e. 0. Later, colbert.mask() removes it (as it is token id 0). Note that normal padding tokens are removed by the attention mask passed to self.bert(). There are 11 such documents with [PAD] as a token in the MSMARCO passage corpus.

So is colbert.mask() correct, or should the BertTokenizer have tokenised it normally?
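
The effect described above can be reproduced without a tokenizer. The sketch assumes a mask function that drops any position whose id is in a skiplist or equals the padding id 0, which is how the issue describes colbert.mask(); the token ids other than 0, 101 and 102 are made up for illustration.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary

def mask(input_ids, skiplist=frozenset()):
    """Keep a token unless it is in the skiplist or has the padding id."""
    return [[(x not in skiplist) and (x != PAD_ID) for x in doc] for doc in input_ids]

# 101/102 are [CLS]/[SEP]; 5001..5003 stand in for ordinary word pieces.
# The 0 at position 2 is a literal "[PAD]" from the passage text;
# the trailing zeros are real padding.
doc = [101, 5001, 0, 5002, 5003, 102, 0, 0]
kept = mask([doc])[0]
```

Both kinds of zero are dropped, so a mask based on the token id alone cannot tell a literal "[PAD]" in the text from real padding.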

Feedback on ColBERT

Hello, this is not an issue but feedback. As you know, we at the vespa.ai team have been working on the ColBERT model, as it is cost-effective on CPU with almost the same accuracy as full cross-attention models. By using 32 dims per contextualized token representation, the memory footprint is not a huge concern, as you get a lot of memory and v-cpu for free if you can avoid having a GPU. Example pricing from AWS EC2 (on-demand/Linux):

m5a.12xlarge 48 v-cpu, 192GB => $2.064 per hour
p3.2xlarge 8 v-cpu, 1 Nvidia Tesla V100 GPUs (16GB), 61GB RAM => $3.06 per hour

Vespa currently supports float32, but will soon add bfloat16 for our tensor representation, which will reduce the memory footprint by 50%, from 32 bits per value to 16. (Some data from our work on ColBERT is documented in vespa-engine/vespa#15854.)

Now to the feedback on the modelling:

It would be nice if the tokenization parts could be moved out of the torch ColBERT model. This would allow direct export of the model to ONNX format for fast serving (e.g. with ONNX-RT).
Move to AutoModel instead of BertModel so users can choose by an argument which pre-trained checkpoint should be used. For example, it's rumoured that MiniLM (https://huggingface.co/microsoft/MiniLM-L12-H384-uncased), which uses the same vocabulary and tokenization as BERT, has good accuracy on ranking while being approximately 3x faster than bert-base-uncased.
We released a snapshot based on a bert-medium model, https://huggingface.co/vespa-engine/colbert-medium, but to do so we had to alter the training routine to use bert-medium, so it gets messy to replicate.

Thank you.

Question on how are we storing in FAISS?

Hello Team,

Thanks a lot for sharing the code. :)

I am trying to understand how the embeddings of a document are saved; please let me know if my understanding is wrong.

Say I have a document_1/passage with 10 words, and after tokenization it gives me 15 tokens, so my final embeddings would be [1,15,128], considering I have set the compressed dim = 128.
While storing these final embeddings, we do two levels of quantization on every contextual word embedding and then save it to FAISS. But are we saving them batch-wise, i.e. will all 15 embeddings be stored against some batchidx?

I am still learning how FAISS works, so I am really sorry if I have asked an amateur question.
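
One way to picture the storage, assuming every token embedding is added to the FAISS index as its own row (rather than one entry per passage or per batch): a side table maps each row id back to the passage it came from, so nearest-neighbour hits on token vectors can be turned into candidate passages. The names below are illustrative.

```python
def build_emb2pid(doclens):
    """Map every embedding row to the id of the passage it came from."""
    emb2pid = []
    for pid, length in enumerate(doclens):
        emb2pid.extend([pid] * length)
    return emb2pid

doclens = [15, 4, 9]              # passage 0 has 15 token embeddings, and so on
emb2pid = build_emb2pid(doclens)

hits = [0, 14, 15, 20]            # row ids returned by a (hypothetical) token-level search
candidate_pids = sorted({emb2pid[e] for e in hits})
```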

trying distributed multigpu

Hi, thank you for a great code release,

I've been trying to train with 2 gpus in the new v2.0 code, but having trouble with pytorch distributed parallel.

I used the command:
CUDA_VISIBLE_DEVICES=0,1 python3.7 -m torch.distributed.launch \
--nproc_per_node=${WORLD_SIZE} colbert/train.py \
--triples $TRIPLES_PATH \
--local_rank 2 \
--accum 1
but I'm not sure if it is correct. At the moment the training doesn't seem to run (the model loads on 1 GPU but the training loop hangs).

I'm wondering, is distributed parallel meant to work for multi-GPU training in the new code? If it works, will we be able to speed up training with multiple GPUs?
Thank you!

Retrieve/rerank pipeline with Colbert

Hi,

Is there an easy way to make a pipeline that retrieves, then reranks, and returns the run filepaths (without having to manually run both commands and pass the topk arguments)?

I'm not sure how easy it is to make a Python pipeline (I saw you wrote/extended your own argument parser).
I also started trying to do something with bash, but that means I have to parse strings etc., which is error-prone.

Thanks!

Trained DistilBERT-based Checkpoint

Hi,

Thanks for this great model 🎉!

I just published a knowledge-distilled ColBERT checkpoint: https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco It's based on a 6-layer DistilBERT and trained with our Margin-MSE (https://arxiv.org/abs/2010.02666) distillation. It gets up to .375 MRR@10 on MSMARCO-DEV and .744 NDCG@10 on TREC-DL'19 when re-ranking top-1K BM25 results.

The model definition & training code we used (https://github.com/sebastian-hofstaetter/neural-ranking-kd/blob/main/minimal_colbert_usage_example.ipynb) is slightly different than in this repo, but if you are interested, we could add our definition as another option to easily use the checkpoint.

Best,
Sebastian

Release checkpoints and datasets

Just wondering if you can release "official" model checkpoints, ideally to the HuggingFace model repo (ideally both, also the model for TREC CAR trained only on the Wikipedia training set). It would also help experimentation (improve time to reproduce results) if you could also release datasets representing the document embeddings for MS MARCO Passages, TREC CAR (used in the paper) and even other common datasets (Google Natural Questions/Wiki passages used in DPR, for example).

Examples of similar releases include:

Indexing and retrieval arguments for ColBERT quantization

Hi,

Thanks for releasing the code!
I am trying to replicate the MSMARCO DEV passage ranking results using the quantization branch but am unable to do so.

Here's the set of commands I use to index:

CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=6 \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.index --amp --doc_maxlen 180 --mask-punctuation --bsize 256 --compress \
--checkpoint /path_to_checkpoint/colbert.dnn \
--collection /path_to collection/collection.tsv \
--index_root /path_to_index/indexes --index_name MSMARCO.psg.32x200k \
--root /path_to_root_dir/ --experiment MSMARCO-psg

python -m colbert.index_faiss \
--index_root /path_to_index/indexes --index_name MSMARCO.psg.32x200k \
--partitions 32768 --sample 0.3 \
--root /path_to_root_dir/ --experiment MSMARCO-psg

For end-to-end retrieval, I use the same command as in the README of the master branch

python -m colbert.retrieve \
--amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--queries /exp/scale21/data/multi-msmarco/source/queries.dev.small.tsv \
--nprobe 32 --partitions 32768 --faiss_depth 1024 \
--index_root /path_to_index/indexes/ --index_name MSMARCO.psg.32x200k \
--checkpoint /path_to_checkpoint/colbert.dnn \
--root /path_to_root_dir/ --experiment MSMARCO-psg

@okhat Can you let me know if I'm missing some parameters specific to indexing or retrieval that I need to additionally incorporate?

Thanks in advance,
Suraj

Checkpoints + usage of different base BERT model

Hello, I would love to use ColBERT in my project as a retriever component in a retriever-reader architecture, hence I would prefer to use the huggingface API to create the ColBERT model in the code and load the checkpoints :) I have three small questions:

About the checkpoints of the pretrained on MsMARCO ColBERT. Is it possible to download the pretrained model's weights so I could use them in my implementation? Or the only solution is to use the training script and slightly modify it so I can save the weights for later use.
The used BERT model was the bert-base-uncased. My project's domain is the medical area, hence I was wondering whether using the training script with a small change - replacing the bert-base-uncased with the biobert would result in a properly pretrained model.
Is there any plan on creating the implementation of the the ColBERT's workflow available via the huggingface library?

Thank you very much for your work 💪🏻

train colbert with L2 similarity, but rerank with cosine similarity?

As your paper notes, you used L2 similarity during end-to-end retrieval, but in your code (index_ranker->rank()), in the second stage of end-to-end retrieval, you rerank the passages with cosine similarity. Shouldn't you also use L2 similarity here?
Thanks!
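
A relevant identity, assuming the token embeddings are L2-normalized: for unit vectors q and d, ||q - d||^2 = 2 - 2 (q . d), so negative squared L2 distance and cosine similarity order candidates identically. The sketch below checks this numerically on made-up vectors; it does not show which metric the repository's code paths actually use.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(q, d):
    # Inputs are unit vectors, so the dot product is the cosine
    return sum(a * b for a, b in zip(q, d))

def neg_sq_l2(q, d):
    return -sum((a - b) ** 2 for a, b in zip(q, d))

q = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([0.9, 2.1, 0.4], [-1.0, 0.3, 2.0], [2.0, 0.1, 0.1])]

by_cos = sorted(range(len(docs)), key=lambda i: cosine(q, docs[i]), reverse=True)
by_l2 = sorted(range(len(docs)), key=lambda i: neg_sq_l2(q, docs[i]), reverse=True)
```

With a MaxSim-style sum over query tokens, the two totals then differ only by a constant per query, which leaves the ranking unchanged.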

FAISS indexing parameters

Dear authors,
thank you for your nice work and providing the code repository.
I would like to use your model to index a collection using faiss. However I see a few parameters (in the faiss indexing example command) that I do not understand.

python -m colbert.index_faiss \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--partitions 32768 --sample 0.3 \
--root /root/to/experiments/ --experiment MSMARCO-psg

One is sample and the other is partitions. My guess is that the second one splits the generated index file into different partitions and hence is not so important (correct me if I'm wrong), but what about sample?

Also, is the --root /root/to/experiments/ expected to be the colbert code directory (this repo)?

Thank you

Not understanding class IndexRanker()

Hi @okhat,

I'm having quite a hard time understanding what is being done in class IndexRanker() and I could really use your help.
It looks like doclens and strides play an important role in determining the output_scores here but I don't understand why we are taking doclens into consideration when calculating the scores of the retrieved pids.

Among some of my questions are:

Why are we calculating the strides?
Why do we need to compare the doclens of the retrieved pids with the strides?
assignments = (doclens.unsqueeze(1) > torch.tensor(self.strides).unsqueeze(0) + 1e-6).sum(-1)
Why did we create new views of the embeddings tensor and use some cumulative doclens values as indexes to select values from those new views of the embeddings tensor?

def _create_views(self, tensor):
    views = []

    for stride in self.strides:
        outdim = tensor.size(0) - stride + 1
        view = torch.as_strided(tensor, (outdim, stride, self.dim), (self.dim, self.dim, 1))
        views.append(view)

    return views

D = torch.index_select(input=views[group_idx], dim=0, index=group_offsets_uniq, out=D_buffers[group_idx][:D_size])

Thanks a lot!
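
Read literally, the `assignments` line quoted above is a bucketing step: comparing each document length against every stride and summing the booleans counts how many strides the document exceeds, which is the index of the smallest stride wide enough to hold it. A plausible purpose (an assumption, not confirmed in this thread) is to group documents into a few fixed widths so each group can be gathered as one dense tensor. A pure-Python equivalent with example values:

```python
def assign_to_strides(doclens, strides):
    """For each doc length, count how many strides it exceeds.
    The count is the index of the smallest stride that fits the document."""
    return [sum(length > s + 1e-6 for s in strides) for length in doclens]

strides = [108, 180]            # example values
doclens = [40, 108, 109, 180]
groups = assign_to_strides(doclens, strides)
```

Documents of length up to 108 land in group 0 and are read through the 108-wide view; longer ones land in group 1 and use the 180-wide view.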

Faiss indexing assertion

[Nov 29, 12:42:01] #> Processing slice #1 of 1 (range 0..24).
<snip>
[Nov 29, 13:07:41] #> Indexing the vectors...
[Nov 29, 13:07:41] #> Loading ('/local/terrier/Indices/msMarco/2020_craig/colbert_passage/index_name/0.pt', '/local/terrier/Indices/msMarco/2020_craig/colbert_passage/index_name/1.pt', '/local/terrier/Indices/msMarco/2020_craig/colbert_passage/index_name/2.pt') (from queue)...
[Nov 29, 13:08:09] #> Processing a sub_collection with shape (86536593, 128)
[Nov 29, 13:08:09] Add data with shape (86536593, 128) (offset = 0)..
  IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432
33488896/86536593 (219.002 s)   Flush indexes to CPU
Traceback (most recent call last):
  File "/users/tr.craigm/anaconda3/envs/colbert/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/users/tr.craigm/anaconda3/envs/colbert/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/index_faiss.py", line 43, in <module>
    main()
  File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/index_faiss.py", line 39, in main
    index_faiss(args)
  File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss.py", line 108, in index_faiss
    index.add(sub_collection)
  File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss_index.py", line 48, in add
    self.gpu.add(self.index, data, self.offset)
  File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss_index_gpu.py", line 112, in add
    self._flush_to_cpu(index, nb, offset)
  File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss_index_gpu.py", line 137, in _flush_to_cpu
    assert index.ntotal == offset+nb, (index.ntotal, offset+nb, offset, nb)
AssertionError: (33619968, 86536593, 0, 86536593)

Can you explain the purpose of the assertion? Won't index.ntotal differ from the number processed if we are only doing one part of the index?
Do I need to set slices to match the number of .pt files written by index.py (24)?

There is a lot of terminology in the indexing code that isn't detailed in the paper.

Is it suitable for text matching tasks?

For text matching tasks, ColBERT and Sentence-BERT (SBERT) are both representation-based models. How do ColBERT and Sentence-BERT compare in effectiveness? Will ColBERT be better?

Training stopping after few hundred steps

Hello Omar,
Thank you very much for sharing your awesome work.
I am currently trying to test ColBERT on the BioAsq8 dataset (3200 queries with approximately 10 relevant documents per query in the training set).

I have two questions about ColBERT.

The main problem I face is that when I train ColBERT on a triples.p file, the training sometimes stops after a couple of hundred or thousand steps: each step takes less than a second, and then it stops printing and saving checkpoints, but the process is still running and doesn't exit. Is there somewhere in the code where training is stopped when the avg loss no longer changes? I don't see it in the code. And my GPU still has available memory.
Also, I was wondering if there is a reason why you don't consider epochs in your training code. I guess that for the MSMARCO dataset, the triples.p file is long enough. But in general, there is no problem with iterating over the train set, right? In particular, I want to sample the negatives not randomly but with BM25 negatives.

Again thank you for this great repository.
Alexandre

Question about scoring function

Thanks for sharing the code. I have a question about score calculation.

In the doc function, the representation of the document is multiplied by the mask.

However, could the max(2) in the score function accidentally choose the score of a padding token (whose score is 0) if all the non-padding token scores are negative?

I ask because I checked the implementation here and found that they assign a large negative value, score[~exp_mask] = - 10000, before searching for the max score.
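
The concern can be made concrete with a toy row of similarities for one query token. The sketch assumes padded positions have zeroed embeddings (hence similarity exactly 0) and compares a plain max with a max taken after filling padded positions with a large negative value; the function name and values are illustrative.

```python
def maxsim(sim_row, doc_mask, fill=None):
    """Max similarity of one query token against the document tokens.
    sim_row[j] is the similarity to document position j;
    doc_mask[j] is False for padding."""
    if fill is not None:
        sim_row = [s if keep else fill for s, keep in zip(sim_row, doc_mask)]
    return max(sim_row)

# Two real tokens with negative similarity, then two padded positions
# whose zeroed embeddings give a similarity of exactly 0.
sims = [-0.30, -0.10, 0.0, 0.0]
mask = [True, True, False, False]

unmasked = maxsim(sims, mask)              # a padded position wins
masked = maxsim(sims, mask, fill=-10000)   # the best real token wins
```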

Index does not fit to RAM

Hi,
I am trying to run inference (retrieve.py) on a quite big collection and I get the following error:

RuntimeError: [enforce fail at CPUAllocator.cpp:64]. DefaultCPUAllocator: can't allocate memory:you tried to allocate 664833940224 bytes. Error code 12 (Cannot allocate memory)

To my understanding, that means that ColBERT requires 600+ GB of memory, which the machine does not have. Is there any way to bypass this issue?
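
As a quick check of the size in the error message, the requested byte count converts as follows (plain arithmetic, no assumptions about how the index is laid out):

```python
requested_bytes = 664833940224  # from the error message

gib = requested_bytes / 2**30   # binary gigabytes (GiB)
gb = requested_bytes / 10**9    # decimal gigabytes (GB)

print(f"{gib:.2f} GiB, {gb:.2f} GB")
```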

Can I read the file from an SSD and how much slower would the process be?

Some colleagues also suggested using huggingface.co datasets for that but I'm unsure whether this is feasible or where to start!

Request: Adaptation for user-facing search application

Thank you for sharing this incredible repo, great work!

Towards the end of your README, you mention providing pointers for converting the code, especially the retrieval loop, for use in a production environment. Could you provide any further details, such as which changes to this script should be considered for user-facing applications?

args.milliseconds is always empty in ranking.py

Hi,

args.milliseconds is defined here but never updated. It triggers a failure here, and thus the Avg Latency is missing and not computed.
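
For illustration, a latency log of this kind is usually filled by timing each call and appending the elapsed milliseconds, with a guard for the empty case. The sketch below is generic: `timed_call` and `average_latency` are hypothetical helpers and are not part of ranking.py.

```python
import time


def timed_call(fn, *args, log):
    """Call fn(*args), append the elapsed time in milliseconds to `log`."""
    start = time.perf_counter()
    result = fn(*args)
    log.append((time.perf_counter() - start) * 1000.0)
    return result


def average_latency(milliseconds):
    """Return the mean latency, or None when nothing was recorded."""
    if not milliseconds:
        return None
    return sum(milliseconds) / len(milliseconds)


latencies_ms = []
result = timed_call(sum, [1, 2, 3], log=latencies_ms)
```

Returning None for an empty list avoids a division by zero when no query has been timed yet.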

Can't build faiss index

Thanks for the great repo! I'm trying to build a faiss index for retrieval, but can't get the script to run. I was originally using python3.8 and torch 1.8 in a docker container, but also downgraded to torch 1.6 to see if that would work.

I'm running

CUDA_VISIBLE_DEVICES="0,1" OMP_NUM_THREADS=1 \
python3.8 -m torch.distributed.launch --nproc_per_node=1 -m \
        index --root $PWD/experiments/ --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
        --checkpoint out-of-the-box-model.pt \
        --collection passages.tsv \
        --index_root /faiss --index_name INDEX \
        --root $PWD/experiments/ --experiment out_of_the_box \
        --local_rank 0

But I get the error "Default process group is not initialized". That also happens even if I manually put dist.init_process_group('nccl') in the script.

Do you know why this is happening?

Thanks!
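
As a general diagnostic, PyTorch's default env:// initialization for torch.distributed reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, so it can help to check those before calling dist.init_process_group. The sketch below is only that check; `missing_dist_env` is a hypothetical helper, and it does not explain why the indexing script raised this particular error.

```python
import os

# Variables read by torch.distributed's env:// init method.
REQUIRED_ENV = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_dist_env(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV if not environ.get(name)]


# Example with a partially configured environment.
example_missing = missing_dist_env(
    {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"}
)
```

Calling `missing_dist_env()` with no argument inspects the real process environment.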

Question about query augmentation

In the paper, I noticed that query augmentation is done by padding the query with [MASK] tokens, while the document side is not padded. In the code, however, I find that queries and documents use the same _encode function. Unless I'm missing something, the attention mask disables the effect of the padding, whatever the padding token is.

    def _encode(self, x, max_length):
        input_ids = self.tokenizer.encode(x, add_special_tokens=True, max_length=max_length)

        padding_length = max_length - len(input_ids)
        attention_mask = [1] * len(input_ids) + [0] * padding_length
        input_ids = input_ids + [103] * padding_length

        return input_ids, attention_mask
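
The two masking choices being discussed can be put side by side. This is a sketch on raw id lists with a hypothetical `encode_ids` helper, not the repository's tokenizer code. As far as the standard Hugging Face BERT implementation goes, an attention mask of 0 stops other tokens from attending to a position, but the encoder still returns an output vector there, so whether the [MASK] slots matter depends on what the scoring code does with those vectors.

```python
MASK_ID = 103  # padding id used in the snippet above ([MASK] in BERT vocabularies)


def encode_ids(input_ids, max_length, attend_to_padding=False):
    """Pad with [MASK] ids and build the matching attention mask."""
    padding_length = max_length - len(input_ids)
    if padding_length < 0:
        raise ValueError("input is longer than max_length")
    pad_attention = int(attend_to_padding)
    attention_mask = [1] * len(input_ids) + [pad_attention] * padding_length
    return input_ids + [MASK_ID] * padding_length, attention_mask


query_ids = [101, 2054, 2003, 102]  # arbitrary example ids
ids, as_in_snippet = encode_ids(query_ids, 6)
_, attended = encode_ids(query_ids, 6, attend_to_padding=True)
```

The default matches the snippet above (padding is not attended to); `attend_to_padding=True` shows the alternative in which the [MASK] slots take full part in self-attention.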

Conda environment setup

Hi,

When I try to create a conda environment using the conda_env.yml file from v0.2, I get the following error:

Found conflicts! Looking for incompatible packages.

I then tried manually installing each package, however the same error occurred. Is there a workaround to this issue, or could it be an isolated problem?

Hands On with Colab

Hi, great article! The code looks very clean and understandable. Would it be possible to provide a Colab notebook with a tiny sample dataset and the pre-trained model?

About validation

In validation, is the format of the top1000.dev dataset qid, pid, query, passage? If so, the load_topK_pids function seems to treat the passage field as the label?
assert qrels is None or topK_positives is None, "Cannot have both qrels and an annotated top-K file!" Can't qrels and top-K exist at the same time?

Speeding up `FaissIndex` loading

This for-loop takes like 75 seconds for me, which is annoying to run every time I load the index.

I think we can shorten it by doing the following in the index.sh part of the pipeline:

total_num_embeddings = sum(all_doclens)
emb2pid = torch.zeros(total_num_embeddings, dtype=torch.int)
offset_doclens = 0

for pid, dlength in enumerate(all_doclens):
    emb2pid[offset_doclens: offset_doclens + dlength] = pid_offset + pid
    offset_doclens += dlength

torch.save(emb2pid, os.path.join(index_path, "emb2pid.pt"))

and then in place of the current code in faiss_index.py:

emb2pids = torch.load(os.path.join(index_path, "emb2pid.pt"))
if part_range is not None:
    emb2pids = emb2pids[emb2pids >= part_range.start]
    emb2pids = emb2pids[emb2pids < part_range.stop]

self.emb2pids = emb2pids

I think this would cut the time from 75 seconds to like 3 seconds, which would be a significant speedup
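
The mapping itself is simple to state: embedding i belongs to the passage whose token span contains i. The sketch below builds it with the standard library only, using a hypothetical `build_emb2pid` helper and a compact integer array in place of a tensor. In PyTorch, `torch.repeat_interleave` over the document lengths is one vectorized way to get the same result without a Python loop.

```python
from array import array
from itertools import chain, repeat


def build_emb2pid(all_doclens, pid_offset=0):
    """Map each embedding position to its passage id.

    all_doclens[k] is the number of embeddings of passage pid_offset + k.
    """
    return array(
        "i",
        chain.from_iterable(
            repeat(pid_offset + pid, dlength)
            for pid, dlength in enumerate(all_doclens)
        ),
    )


emb2pid_demo = build_emb2pid([2, 3, 1], pid_offset=10)
```

Three passages with 2, 3 and 1 embeddings give six entries in total, one per embedding.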

what is purpose of Sort by maxlens in tensorize_triples

Hi, what is the purpose of sorting by maxlens in tensorize_triples? I don't think it has any effect.

indices = maxlens.sort().indices

Q_ids, Q_mask = Q_ids[indices], Q_mask[indices]
D_ids, D_mask = D_ids[:, indices], D_mask[:, indices]

(positive_ids, negative_ids), (positive_mask, negative_mask) = D_ids, D_m
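
One common motivation for length sorting, which the excerpt does not confirm as the reason here, is that sub-batches of similar length need less padding. The sketch below counts padded tokens for sorted and unsorted lengths using a hypothetical `padded_tokens` helper. The saving only appears if each sub-batch is padded or trimmed to its own longest sequence.

```python
def padded_tokens(lengths, bsize):
    """Total tokens processed when each sub-batch is padded to its own max."""
    total = 0
    for start in range(0, len(lengths), bsize):
        batch = lengths[start:start + bsize]
        total += max(batch) * len(batch)
    return total


lengths = [5, 50, 6, 48]  # made-up sequence lengths
unsorted_cost = padded_tokens(lengths, bsize=2)
sorted_cost = padded_tokens(sorted(lengths), bsize=2)
```

If every sub-batch is instead padded to one global maximum length, sorting changes nothing in this count.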

on memory usage and FaissIndex.emb2pid

On the MSMARCO passage ranking task, this tensor is 687M entries - i.e. 2.4GB?

A multiprocessing Pool is then created with 16 processes, thus copying that tensor 16 times and consuming about 41GB of RAM.

The workaround is to make the tensor shared by calling share_memory_() - see https://pytorch.org/docs/stable/tensors.html#torch.Tensor.share_memory_:

        print_message("len(self.emb2pid) =", len(self.emb2pid))
        # prevent this being copied to each fork
        self.emb2pid.share_memory_()
        self.parallel_pool = Pool(16)

Despite this, I still have memory problems in our environment: the forking simply consumes too much memory, and the job is killed. The faiss index is probably copied 16 times as well, and so is all_doclens (though the latter is not needed at that point).
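
The memory estimate can be checked with a few lines of arithmetic. This sketch assumes exactly 687 million entries of 4 bytes each (a 32-bit integer tensor) and that every one of the 16 workers holds a full copy. Both are assumptions for illustration, not measurements.

```python
NUM_ENTRIES = 687_000_000  # "687M entries" from the report, rounded
BYTES_PER_ENTRY = 4        # assumes a 32-bit integer tensor
NUM_WORKERS = 16

one_copy_bytes = NUM_ENTRIES * BYTES_PER_ENTRY
all_copies_bytes = one_copy_bytes * NUM_WORKERS

one_copy_gib = one_copy_bytes / 2**30
all_copies_gib = all_copies_bytes / 2**30
```

Under these assumptions a single copy is roughly 2.6 GiB and 16 copies come to roughly 41 GiB, in line with the total quoted above.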

Regarding training process

Hello Omar,
Thanks for open-sourcing the code for this amazing work.

This is not really an issue, more of a question.

The paper mentions that, for MS MARCO, the model was trained for 200k iterations with a batch size of 32 to approximately reproduce the results, so it was effectively trained on 6.4 million triples. This means it was not trained on the full triples.small.tsv (39 million triples). Is my understanding correct?
I am trying to train on the MS MARCO triples. During training, the loss of the current batch decreases only for the first few steps and oscillates in later iterations. Did you face the same issue while training? Should this be read as the model not training, or is it expected because the model sees new examples in every batch?
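
The numbers in the question can be worked through directly, and a running average is one simple way to look past per-batch noise in the loss. The sketch below uses the figures quoted above (39 million is approximate); `running_average` is a hypothetical helper and the loss values are made up.

```python
STEPS = 200_000
BSIZE = 32
TOTAL_TRIPLES = 39_000_000  # approximate size quoted in the question

triples_seen = STEPS * BSIZE
fraction_seen = triples_seen / TOTAL_TRIPLES


def running_average(losses):
    """Return the mean of all losses seen so far, one value per step."""
    averages, total = [], 0.0
    for step, loss in enumerate(losses, start=1):
        total += loss
        averages.append(total / step)
    return averages


smoothed = running_average([1.0, 3.0, 2.0])  # made-up per-batch losses
```

So 200k steps at batch size 32 cover about 16% of the triples, and each batch consists of examples the model has not seen before, which is one reason the per-batch loss can look noisy.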

MSMARCO passage collection run

Hello,

Thank you for your amazing work.
I was wondering if it is possible to share your run on the MS MARCO passage collection dev set?
