stanford-futuredata / colbert
ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
License: MIT License
Hi. I've read your paper and now I'm trying to understand your code. Most of it is intuitive, thanks for the readable and runnable code :). Unfortunately, I'm having trouble understanding why the labels are filled with only zeros. Please bear with me: I'm quite new to PyTorch, so the question might be silly.
labels = torch.zeros(args.bsize, dtype=torch.long, device=DEVICE)
...
loss = criterion(out, labels[:out.size(0)])
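One possible reading of the all-zero labels, sketched under the assumption that each row of `out` holds the positive passage's score in column 0 and the negative passage's score in column 1 (this layout is an assumption, not something confirmed in the post). Cross-entropy with target class 0 would then simply reward rows where the positive outscores the negative. The sketch uses plain Python instead of PyTorch, and `cross_entropy_row` is a hypothetical helper.

```python
import math

def cross_entropy_row(scores, target):
    """Cross-entropy of one row of scores against a target class index."""
    # log-sum-exp, subtracting the max for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

# Assumed layout: column 0 = positive passage score, column 1 = negative.
good_row = [5.0, 1.0]  # positive scored higher than negative
bad_row = [1.0, 5.0]   # negative scored higher than positive

label = 0  # the "correct class" is always column 0
loss_good = cross_entropy_row(good_row, label)
loss_bad = cross_entropy_row(bad_row, label)
print(loss_good, loss_bad)
```

Under that assumption, the slice `labels[:out.size(0)]` would only trim the label vector when a batch comes out smaller than `args.bsize`.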
Hello again,
When I run training from a checkpoint (providing the --resume flag and a --checkpoint flag), I think the training (that occurs in the lines around here) doesn't actually use the checkpoint, but instead starts from the un-finetuned bert-base that would be used if starting from scratch.
Is this intended? What should I do if I would like to resume from the checkpoint? Some evidence that I am right (I think I am, but I could be wrong) is provided below. If I am right, I'd be happy to provide a pull request with a fix. Thanks!
Code evidence: There are two models defined, checkpoint (the already partially finetuned model) and colbert (taken from a fresh bert-base). It looks like the checkpoint is largely ignored while colbert is what gets all the training action.
Anecdotal evidence: I tried to resume training after 80000 batches of size 32 each, at which point my model was pretty far along in its training. However, when I hit resume, I noticed that the loss rate started at a poor value and improved dramatically, as is typically the case at the fresh start of the finetuning process.
I would like to ask how/whether ColBERT handles long documents and whether there is any splitting etc. going on.
I think I've asked this question in the past @okhat and the response was negative (there's no handling whatsoever, it's just matching on the first X tokens of the passage, depending on doc_maxlen, which is often set to 180), but I can't fully remember and the following logs made me reconsider:
[Jul 21, 11:54:40] #> Processing batch #0..
[Jul 21, 11:54:40] #> Fetching parts 2--3 from queue..
[Jul 21, 11:54:49] #> Using strides [108, 180]..
In this case, what are the strides? I recognize 180 as the doc_maxlen. Is this strictly practical and related to batching, or does it have something to do with document splitting?
thanks!
Hi, how can ColBERT be used for long document ranking?
I see that your team has submitted a "ColBERT MaxP end-to-end" model to the MS MARCO Document Ranking Leaderboard. Would you mind releasing the code and updating this repository?
Hello,
I am trying to make some changes/adaptations in the code of ColBERT and I want to run locally on a MacBook without GPU.
However, when I try to create the index (python -m colbert.index ...) I get
AssertionError: Torch not compiled with CUDA enabled
On the other hand, when I generate the index on a GPU, transfer it to my local machine and run python -m colbert.index_faiss ... I get:
[Jun 16, 14:29:18] #> Will write to /Users/amkrasakis/data/faiss_indexes/MSMARCO.L2.32x200k.180len.small/ivfpq.32768.faiss.
[Jun 16, 14:29:18] #> Loading /Users/amkrasakis/data/faiss_indexes/MSMARCO.L2.32x200k.180len.small/0.pt ...
[Jun 16, 14:29:24] #> Loading /Users/amkrasakis/data/faiss_indexes/MSMARCO.L2.32x200k.180len.small/1.pt ...
#> Sample has shape (2946381, 128)
[Jun 16, 14:29:25] #> Training with the vectors...
[Jun 16, 14:29:25] #> Training now (using 0 GPUs)...
Segmentation fault: 1
Is there some solution to that?
For instance, since colbert.index_faiss does not complain about no GPU (at least so far) could I generate the faiss index in a linux node with GPU and afterwards transfer this to another machine with FAISS cpu?
Thanks for your help!
Hello,
While batch retrieval is very fast (one pass in .3 seconds), batch reranking is very slow (one pass in roughly 10 minutes). I think this is because the entire index is loaded in steps. I think this is unnecessary because we only care about a subset of the index. For example, a retrieval step for me returned 2000 pids, which multiplied by 512 doc max len, means we care about 2000 * 512 = 1,024,000 vectors. Isn't there a way to load only these vectors using a mapping? This might cut the time down dramatically.
Curious what your thoughts are on this.
Thanks!
Jamie
EDIT: I think the faiss.DirectMap might be helpful here. We know the indexes of the vectors of interest, so we can simply call index.reconstruct(idx) for each idx, and that will pretty quickly get us the full matrix for each doc.
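The "mapping" idea above can be sketched as follows, assuming the embeddings sit in one flat matrix in pid order and that a list of per-document token counts (doclens) is available. Both are assumptions about the index layout, and `embedding_ranges` is a hypothetical helper, not part of the repo.

```python
from itertools import accumulate

def embedding_ranges(doclens, pids):
    """Return the (start, end) row range in the flat embedding matrix for each pid."""
    # offsets[i] is the row where document i starts; offsets[-1] is the total row count.
    offsets = [0] + list(accumulate(doclens))
    return [(offsets[pid], offsets[pid + 1]) for pid in pids]

doclens = [3, 5, 2, 4]  # toy per-document token counts
ranges = embedding_ranges(doclens, pids=[1, 3])
total_rows = sum(end - start for start, end in ranges)
print(ranges, total_rows)
```

Only the rows inside these ranges would need to be read for reranking, instead of every index part.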
We would like to index data in formats other than the MSMARCO passage file - for instance, an iterator or a dataframe.
Most of the changes would be in CollectionEncoder:
with open(self.collection) as fi:
which assumes a file. Could this be extracted to another class method?
Hi Omar,
Is there some number we should be looking for, roughly speaking, when doing the reranking (without indexing) on the msmarco dev set? (EDIT: Would be equivalent to Table 1 in the paper?)
The reason is that I'd like to be able to use this performance to gauge the performance of the full indexed retrieval. Would this work, or does this metric not really correlate with the full indexed performance?
I trained a model with a slightly larger batch size and without gradient accumulation for 200000 steps, and using the msmarco evaluation script from anserini, the MRR@10 is around 0.3514. Does this look reasonable?
Hello, I am reading your code to replicate the experiment. I have some questions about the model in "model.py".
docs = [["[unused1]"] + self._tokenize(d)[:self.doc_maxlen-3] for d in docs]
The result of "self._tokenize()" is a word list, not a word-piece list, so it is improper to truncate it by doc_maxlen, which limits the number of word-piece tokens.
3. Although the paper says "Unlike queries, we do not append [mask] tokens to documents.", in the code the encoding function "_encode()" is used for both queries and docs with the same [mask] padding.
@okhat Hi! Thanks for your paper and repo! It seems that everything for v2 is ready. I just want to make sure it is fully usable in terms of the performance of a model trained with version 2, i.e. with the command
python -m colbert.train --triples triples.train.small.tsv
thx! : )
I keep getting the pictured error:
Upon investigation, I see that stride is referenced here but isn't defined earlier in the method. Can you please explain if this is intended or a bug?
Thanks
Hi
If there are images/figures in the document, is it possible to retrieve the images + text from the document using text as the query?
I am currently training a multilingual Model with your approach and with the bert-base-multilingual-uncased it works great.
Now I tried switching to xlm-roberta-base (which in general is better pre-trained than the mBERT) but performance is far off.
Both are trained on the same system with the same batch size.
Here a plot of the loss over training steps:
Evaluation performance is very different as well:
mBERT @ 32k Steps: MRR@10 0.22
XLM-RoBERTa @ 32k Steps: MRR@10 0.07
As RoBERTa uses a BPE vocab, I had to add unused tokens by hand and initialize the embeddings for them randomly (transformers does that with mean=0 and std=0.02):
self.tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
self.tokenizer.add_tokens(["[unused0]"])
self.tokenizer.add_tokens(["[unused1]"])
self.skiplist = {w: True for w in string.punctuation}
self.bert = XLMRobertaModel(config)
self.bert.resize_token_embeddings(len(self.tokenizer))
https://github.com/Phil1108/ColBERT/blob/b786e2e7ef1c13bac97a5f8a35aa9e5ff9f3425f/src/model.py#L24
Is the bad performance caused by the problem that RoBERTa doesn't include a next-sentence Prediction in its pre-training? Or is the ColBERT approach not transferable to BPE vocabs?
Or is my way of adding unused tokens to the vocab causing undetected problems?
Hi all,
Here is the code for the class I'm currently using to wrap colbert stuff. (sorry if there are some errors; I tried to delete the extra internal code that's only relevant to my team). Maybe something like this could be merged into the actual repo?
Jamie
from dataclasses import dataclass, field
import os
import torch
from transformers.modeling_utils import no_init_weights
from transformers import BertConfig
from colbert.modeling.inference import ModelInference
from colbert.ranking.rankers import Ranker
from colbert.modeling.colbert import ColBERT


@dataclass
class RankerArgs:
    index_path: str = field(metadata={"help": "path to doclens files"})
    faiss_index_path: str = field(metadata={"help": "path to faiss indices (often the same place as `index_path`)"})
    nprobe: int = field(metadata={"help": "the number of clusters to visit during faiss search"})
    part_range: range = field(init=False)

    def __post_init__(self):
        # no part restriction: use the whole index
        self.part_range = None


class ColbertRetriever:
    def __init__(
        self,
        colbert_model_path: str,
        # parameters without defaults must come before those with defaults
        index_path: str,
        faiss_index_path: str,
        amp: bool = False,
        nprobe: int = 10,
        faiss_depth: int = 1024,
    ):
        inference = ModelInference(
            ColbertModel.from_saved_model(colbert_model_path), amp=amp
        )
        ranker_args = RankerArgs(index_path, faiss_index_path, nprobe)
        self.ranker = Ranker(ranker_args, inference, faiss_depth=faiss_depth)

    def retrieve_and_rerank(self, query: str, k: int):
        Q = self.ranker.encode([query])  # encode the query
        pids, scores = self.ranker.rank(Q)  # rank
        assert k <= len(pids)
        pids = pids[:k]
        scores = scores[:k]
        return pids, scores


class ColbertModel(ColBERT):
    @classmethod
    def from_saved_model(cls, model_path: str) -> "ColbertModel":
        """
        load colbert from a saved model

        Parameters
        ----------
        model_path : str
            the full path to a file containing a json with state_dict and other things

        Returns
        -------
        a colbert model
        """
        model_dict = torch.load(model_path)
        config = BertConfig()
        args = model_dict["arguments"]
        with no_init_weights():
            model = cls(
                config,
                query_maxlen=args["query_maxlen"],
                doc_maxlen=args["doc_maxlen"],
                mask_punctuation=args["mask_punctuation"],
                dim=args["dim"],
                similarity_metric=args["similarity"],
            )
        cls._load_state_dict_into_model(
            model, model_dict["model_state_dict"], model_path
        )
        model.eval()
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model.to(device)
        return model
Hi @okhat,
Thanks for sharing and maintaining the code. I had a small clarification question regarding the command line parameters used while indexing - https://github.com/stanford-futuredata/ColBERT#indexing
We have to enter "--root" twice while indexing. Do we have to give the same directory as the experiment directory used while training ("/root/to/experiments/") for both --root params?
It would be very helpful if a small description could be added for all the params used in the training, validation, indexing and retrieval steps.
I am also getting this error -
ColBERT/ColBERT-master/colbert/index.py", line 28, in main
assert not os.path.exists(args.index_path), args.index_path
which I believe can be solved once correct command line parameters are given. Do you have any suggestions on this?
Thanks.
Hi,
Thanks for releasing this library.
I am planning to use ColBERT for a ranking task on Wikipedia corpus (as part of FAIR Ranking track: https://fair-trec.github.io/). Briefly, given a keyword consisting of terms related to Wiki articles, the task is to generate a rank list of Wiki docs. I have a couple of questions about using the model on the task:
Wikipedia docs are typically very long. To use them in the model, if I truncate each doc (say, to the first 500 words), will it affect the performance of the model?
I want to use the query and document embeddings from ColBERT as features in another model. Is there a way to get the query and document embeddings after training?
Thanks.
What's the easiest way to use ColBERT without loading the full index into memory? We are building an index off of the wiki_dpr dataset (and eventually more), so we have about 21 million passages and counting. The full index is about 630Gb on disk and we have 230Gb of memory to work with (hopefully not needing nearly the full 230). I understand that faiss allows for this type of search (only metadata gets loaded into memory and the actual vectors stay on disk), so curious whether you support this in the current repo.
When running the retrieval script in the README, I run into memory issues once I start building IndexPart here. Should I be doing something with the index_part param? Any insight would be greatly appreciated.
Thanks!
(UPDATE: fyi, the retrieve script runs properly on a tiny dev subset of the dataset)
Hello, thanks for your repository and SIGIR paper. We would like to develop wrappers on top of ColBERT. Would it be possible to make the repo compatible with pip. This would need:
I am trying to look at TREC-CAR data and its model pre-trained on the Wikipedia training set. However, the previous work provides the pre-trained model only in TensorFlow. Did you have to convert the TensorFlow checkpoint files into PyTorch to train your model on top?
Hi, I just wonder which is better for the text matching task (deciding whether a sentence pair is in the same class): ColBERT or Poly-Encoder? Have you compared them? Thx!
A real MSMARCO document:
427711 Peripheral Vascular Disease (PVD; Peripheral Artery Disease [PAD]). Peripheral vascular disease (PVD) is a problem with poor blood flow. It affects blood vessels outside of the heart and brain and gets worse over time.Parts of the body, like the brain, heart, arms, or legs, may not get enough blood.The legs and feet are most commonly affected.eripheral vascular disease (PVD) is a problem with poor blood flow. It affects blood vessels outside of the heart and brain and gets worse over time. Parts of the body, like the brain, heart, arms, or legs, may not get enough blood. The legs and feet are most commonly affected.
Note that [PAD] here is an actual token. The BERT tokenizer assigns it the actual PAD token id, i.e. 0. Later, colbert.mask() removes it (as it is token id 0). Note that normal padding tokens are removed by the attention mask passed to self.bert(). There are 11 such documents with [PAD] as a token in the MSMARCO passage corpus.
So is colbert.mask() correct, or should the BertTokenizer have tokenised it normally?
Hello, this is not an issue but feedback. As you know, we at the vespa.ai team have been working on the ColBERT model, as it is cost-effective on CPU with almost the same accuracy as full cross-attention models. By using 32 dims per contextualized token representation, the memory footprint is not a huge concern, as you get a lot of memory and v-cpu for free if you can avoid having a GPU. Example pricing from AWS EC2 (on-demand/Linux):
m5a.12xlarge 48 v-cpu, 192GB => $2.064 per hour
p3.2xlarge 8 v-cpu, 1 Nvidia Tesla V100 GPUs (16GB), 61GB RAM => $3.06 per hour
Vespa currently supports float32, but will soon add bfloat16 for our tensor representation, which will reduce the memory footprint by 50%, from 32 bits per value to 16. (Some data from our work on ColBERT is documented in vespa-engine/vespa#15854)
Now to the feedback on the modelling:
Thank you.
Hello Team,
Thanks a lot for sharing the code. :)
I am trying to understand how the embeddings of a document are saved; please let me know if my understanding is wrong.
Say I have a document_1/passage with 10 words, and after tokenization it gives me 15 tokens, so my final embeddings would have shape [1,15,128], considering I have set the compressed dim = 128.
While storing these final embeddings, we do two levels of quantization on every contextual word embedding and then save it to FAISS, but are we saving them batch-wise, i.e. will all 15 embeddings be stored against some batchidx?
I am still learning how FAISS works, so I am really sorry if I have asked an amateur question.
Hi, thank you for a great code release,
I've been trying to train with 2 gpus in the new v2.0 code, but having trouble with pytorch distributed parallel.
I used the command:
CUDA_VISIBLE_DEVICES=0,1 python3.7 -m torch.distributed.launch \
--nproc_per_node=${WORLD_SIZE} colbert/train.py \
--triples $TRIPLES_PATH \
--local_rank 2 \
--accum 1
but I'm not sure if it is correct - At the moment the training doesn't seem to run (model loads on 1 gpu but training loop hangs)
I'm wondering, is the distributed parallel meant to work for training with multigpu in the new code? If it works, will we be able to speed up the training with multigpu?
Thank you!
Hi,
Is there an easy way to make a pipeline that retrieves, then reranks and returns the run filepaths (without having to manually run both commands and pass the topk arguments) ?
I'm not sure how easy it is to make a Python pipeline (I saw you wrote/extended your own argument parser).
I also started trying to do something with bash, but it means I have to parse strings etc., which is error-prone.
Thanks!
Hi,
Thanks for this great model!
I just published a knowledge-distilled ColBERT checkpoint: https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco It's based on a 6-layer DistilBERT and trained with our Margin-MSE (https://arxiv.org/abs/2010.02666) distillation, it gets up to .375 MRR@10 on MSMARCO-DEV and .744 NDCG@10 on TREC-DL'19 when re-ranking top-1K BM25 results.
The model definition & training code we used (https://github.com/sebastian-hofstaetter/neural-ranking-kd/blob/main/minimal_colbert_usage_example.ipynb) is slightly different than in this repo, but if you are interested, maybe we can add our definition as another option to easily use the checkpoint?
Best,
Sebastian
Just wondering if you can release "official" model checkpoints, ideally to the HuggingFace model repo (ideally both, also the model for TREC CAR trained only on the Wikipedia training set). It would also help experimentation (improve time to reproduce results) if you could also release datasets representing the document embeddings for MS MARCO Passages, TREC CAR (used in the paper) and even other common datasets (Google Natural Questions/Wiki passages used in DPR, for example).
Examples of similar releases include:
Hi,
Thanks for releasing the code!
I am trying to replicate the MSMARCO DEV passage ranking results using the quantization branch but am unable to do so.
Here's the set of commands I use to index:
CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=6 \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.index --amp --doc_maxlen 180 --mask-punctuation --bsize 256 --compress \
--checkpoint /path_to_checkpoint/colbert.dnn \
--collection /path_to collection/collection.tsv \
--index_root /path_to_index/indexes --index_name MSMARCO.psg.32x200k \
--root /path_to_root_dir/ --experiment MSMARCO-psg
python -m colbert.index_faiss \
--index_root /path_to_index/indexes --index_name MSMARCO.psg.32x200k \
--partitions 32768 --sample 0.3 \
--root /path_to_root_dir/ --experiment MSMARCO-psg
For end-to-end retrieval, I use the same command as in the README of the master branch
python -m colbert.retrieve \
--amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--queries /exp/scale21/data/multi-msmarco/source/queries.dev.small.tsv \
--nprobe 32 --partitions 32768 --faiss_depth 1024 \
--index_root /path_to_index/indexes/ --index_name MSMARCO.psg.32x200k \
--checkpoint /path_to_checkpoint/colbert.dnn \
--root /path_to_root_dir/ --experiment MSMARCO-psg
@okhat Can you let me know if I'm missing some parameters specific to indexing or retrieval that I need to additionally incorporate?
Thanks in advance,
Suraj
Hello, I would love to use ColBERT in my project as a retriever component in a retriever-reader architecture, hence I would prefer to use the huggingface API to create the ColBERT model in the code and load the checkpoints :) I have three small questions:
bert-base-uncased. My project's domain is the medical area, hence I was wondering whether using the training script with a small change - replacing the bert-base-uncased with the biobert - would result in a properly pretrained model.
huggingface library?
Thank you very much for your work
As your paper notes, you used L2 similarity during end-to-end retrieval, but in your code, in index_ranker->rank(), the second stage of end-to-end retrieval reranks the passages with cosine similarity. Shouldn't you also use L2 similarity here?
Thanks!
Dear authors,
thank you for your nice work and providing the code repository.
I would like to use your model to index a collection using faiss. However I see a few parameters (in the faiss indexing example command) that I do not understand.
python -m colbert.index_faiss \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--partitions 32768 --sample 0.3 \
--root /root/to/experiments/ --experiment MSMARCO-psg
One is sample and the other partitions. My guess is that the second one splits the generated index file into different partitions, hence not so important (correct me if I'm wrong), but what about sample?
Also, is the --root /root/to/experiments/ expected to be the colbert code directory (this repo)?
Thank you
Hi @okhat,
I'm having quite a hard time understanding what is being done in class IndexRanker() and I could really use your help.
It looks like doclens and strides play an important role in determining the output_scores here but I don't understand why we are taking doclens into consideration when calculating the scores of the retrieved pids.
Among some of my questions are:
Why are we calculating the strides?
Why do we need to compare the doclens of the retrieved pids with the strides?
assignments = (doclens.unsqueeze(1) > torch.tensor(self.strides).unsqueeze(0) + 1e-6).sum(-1)
Why did we create new views of the embeddings tensor and use some cumulative doclens values as indexes to select values from those new views of the embeddings tensor?
def _create_views(self, tensor):
    views = []
    for stride in self.strides:
        outdim = tensor.size(0) - stride + 1
        view = torch.as_strided(tensor, (outdim, stride, self.dim), (self.dim, self.dim, 1))
        views.append(view)
    return views
D = torch.index_select(input=views[group_idx], dim=0, index=group_offsets_uniq, out=D_buffers[group_idx][:D_size])
Thanks a lot!
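To make the two quoted snippets easier to discuss, here is a plain-Python restatement of what they compute on toy data. It only mirrors the arithmetic of the `assignments` expression and the shape logic of `_create_views`; the explanation of why the repo does this is not confirmed here. `assign_to_strides` and `window_view` are hypothetical helper names.

```python
def assign_to_strides(doclens, strides):
    """For each doclen, count how many strides are strictly smaller than it.

    Mirrors: (doclens.unsqueeze(1) > strides.unsqueeze(0) + 1e-6).sum(-1)
    """
    return [sum(1 for s in strides if dl > s + 1e-6) for dl in doclens]

def window_view(rows, stride):
    """Overlapping windows: window i covers rows i .. i + stride - 1.

    Mirrors the (outdim, stride, dim) shape of the as_strided view,
    where outdim = len(rows) - stride + 1.
    """
    return [rows[i:i + stride] for i in range(len(rows) - stride + 1)]

strides = [108, 180]
doclens = [50, 108, 109, 180]
assignments = assign_to_strides(doclens, strides)
print(assignments)

windows = window_view(list(range(5)), stride=3)
print(windows)
```

Read this way, each document is assigned to the smallest stride that can hold it, and indexing a view at a document's starting offset yields a fixed-width block beginning at that document. That would allow documents of similar length to be gathered into one rectangular batch, though this interpretation is an inference from the snippets.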
[Nov 29, 12:42:01] #> Processing slice #1 of 1 (range 0..24).
<snip>
[Nov 29, 13:07:41] #> Indexing the vectors...
[Nov 29, 13:07:41] #> Loading ('/local/terrier/Indices/msMarco/2020_craig/colbert_passage/index_name/0.pt', '/local/terrier/Indices/msMarco/2020_craig/colbert_passage/index_name/1.pt', '/local/terrier/Indices/msMarco/2020_craig/colbert_passage/index_name/2.pt') (from queue)...
[Nov 29, 13:08:09] #> Processing a sub_collection with shape (86536593, 128)
[Nov 29, 13:08:09] Add data with shape (86536593, 128) (offset = 0)..
IndexIVFPQ size 0 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=1 reserveVecs=33554432
33488896/86536593 (219.002 s) Flush indexes to CPU
Traceback (most recent call last):
File "/users/tr.craigm/anaconda3/envs/colbert/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/users/tr.craigm/anaconda3/envs/colbert/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/index_faiss.py", line 43, in <module>
main()
File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/index_faiss.py", line 39, in main
index_faiss(args)
File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss.py", line 108, in index_faiss
index.add(sub_collection)
File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss_index.py", line 48, in add
self.gpu.add(self.index, data, self.offset)
File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss_index_gpu.py", line 112, in add
self._flush_to_cpu(index, nb, offset)
File "/users/tr.craigm/projects/pyterrier/ColBERT/colbert/indexing/faiss_index_gpu.py", line 137, in _flush_to_cpu
assert index.ntotal == offset+nb, (index.ntotal, offset+nb, offset, nb)
AssertionError: (33619968, 86536593, 0, 86536593)
Can you explain the purpose of the assertion? Surely index.ntotal will not be the number processed if we are only doing one part of the index?
Do I need to set slices to match the number of .pt files written by index.py (24)?
There is a lot of terminology in the indexing code that isn't detailed in the paper.
In the text matching task, ColBERT and Sentence-BERT (SBERT) are both representation-based models. I would like to ask how ColBERT and Sentence-BERT compare in effectiveness: will ColBERT be better?
Hello Omar,
Thank you very much for sharing your awesome work.
I am currently trying to test ColBERT on the BioAsq8 dataset (3200 queries with approximately 10 relevant documents per query in the training set).
I have two questions about ColBERT.
The main problem that I face is that when I train ColBERT on a triples.p file, the training sometimes stops after a couple of hundred or thousand steps: i.e., each step takes less than a second and then it stops printing and saving checkpoints, but the process is still running and doesn't exit. Is there somewhere in the code where training is stopped when the average loss no longer changes? I don't see it in the code. And my GPU still has available memory.
Also, I was wondering if there is a reason why you don't consider epochs in your training code. I guess that for the MSMARCO dataset, the triples.p file is long enough. But in general, there is no problem with iterating over the train set, right? In particular, I want to sample the negatives not randomly but from BM25 results.
Again thank you for this great repository.
Alexandre
Hi
Thanks for sharing the code. I have a question about score calculation.
In the doc function, the representation of the document is multiplied by the mask.
However, could the max(2) in the score function accidentally choose the padding token score (which is 0) if all the non-padding token scores are negative?
I ask because I checked the implementation here and found that they assign a large negative value, score[~exp_mask] = - 10000, before searching for the max score.
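The concern can be reproduced on toy numbers. The sketch below uses plain Python lists instead of tensors, `max_sim` is a hypothetical helper, and it assumes padded positions score exactly 0 because their embeddings were zeroed by the mask.

```python
NEG_FILL = -10000.0  # the fill value quoted in the post above

def max_sim(scores, mask, fill=None):
    """Max over document-token scores; optionally overwrite padded positions first."""
    if fill is not None:
        scores = [s if keep else fill for s, keep in zip(scores, mask)]
    return max(scores)

# One query token against four document positions; the last two are padding.
mask = [True, True, False, False]
raw_scores = [-0.3, -0.1, 0.0, 0.0]

unmasked_max = max_sim(raw_scores, mask)           # padding wins with 0.0
masked_max = max_sim(raw_scores, mask, NEG_FILL)   # best real token wins
print(unmasked_max, masked_max)
```

Without the fill, the maximum comes from a padded position whenever every real token scores below zero; with the fill, the maximum is always taken over real tokens.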
Hi,
I am trying to run inference (retrieve.py) on quite a big collection and I get the following error:
RuntimeError: [enforce fail at CPUAllocator.cpp:64]. DefaultCPUAllocator: can't allocate memory:you tried to allocate 664833940224 bytes. Error code 12 (Cannot allocate memory)
To my understanding, that means that colbert requires 600+ gb of memory, which the machine does not have. Is there any way to bypass this issue?
Can I read the file from an SSD and how much slower would the process be?
Some colleagues also suggested using huggingface.co datasets for that but I'm unsure whether this is feasible or where to start!
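For scale, the allocation in the error message can be converted into more familiar units. The per-vector size below (128 dimensions stored as 2-byte floats) is an assumption made for illustration only; the post does not state the dimension or dtype of the index.

```python
requested_bytes = 664_833_940_224  # taken from the error message above

gib = requested_bytes / 2**30

# Assumption (not stated in the post): 128-dim vectors, 2 bytes per value.
bytes_per_vector = 128 * 2
n_vectors, remainder = divmod(requested_bytes, bytes_per_vector)

print(f"{gib:.2f} GiB")
print(f"{n_vectors:,} vectors under the assumed layout (remainder {remainder})")
```

If that assumption holds, the allocation corresponds to roughly 2.6 billion token embeddings held in RAM at once, which is why reading parts of the index on demand, or memory-mapping it from an SSD, would be the natural direction to explore.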
Thank you for sharing this incredible repo, great work!
Towards the end of your Readme file, you mention about providing pointers for the conversion of the code, especially concerning the retrieval loop, for when considering the model in a production environment. Are you able to provide any further details, such as what changes should be considered in this script for user-facing applications?
Thanks for the great repo! I'm trying to build a faiss index for retrieval, but can't get the script to run. I was originally using python3.8 and torch 1.8 in a docker container, but also downgraded to torch 1.6 to see if that would work.
I'm running
CUDA_VISIBLE_DEVICES="0,1" OMP_NUM_THREADS=1 \
python3.8 -m torch.distributed.launch --nproc_per_node=1 -m \
index --root $PWD/experiments/ --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--checkpoint out-of-the-box-model.pt \
--collection passages.tsv \
--index_root /faiss --index_name INDEX \
--root $PWD/experiments/ --experiment out_of_the_box \
--local_rank 0
But I get the error "Default process group is not initialized". That also happens even if I manually put dist.init_process_group('nccl') in the script.
Do you know why this is happening?
Thanks!
In the paper, I notice that query augmentation is done by padding the query with [MASK] tokens, while the doc side is not padded this way. In the code, however, I find that queries and docs use the same _encode function. Unless I am missing something, the attention mask disables the effect of the padding, whatever token it is.
def _encode(self, x, max_length):
    input_ids = self.tokenizer.encode(x, add_special_tokens=True, max_length=max_length)
    padding_length = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * padding_length
    input_ids = input_ids + [103] * padding_length
    return input_ids, attention_mask
Hi,
When I try to create a conda environment using the conda_env.yml file from v0.2, I get the following error:
Found conflicts! Looking for incompatible packages.
I then tried manually installing each package, however the same error occurred. Is there a workaround to this issue, or could it be an isolated problem?
Hi, great article! The code looks very clean and understandable. Is it possible to make a Colab with tiny sample data and the pre-trained model?
This for-loop takes like 75 seconds for me, which is annoying to run every time I load the index.
I think we can shorten it by doing the following in the index.sh part of the pipeline:
total_num_embeddings = sum(all_doclens)
emb2pid = torch.zeros(total_num_embeddings, dtype=torch.int)
offset_doclens = 0
for pid, dlength in enumerate(all_doclens):
    emb2pid[offset_doclens: offset_doclens + dlength] = pid_offset + pid
    offset_doclens += dlength
torch.save(emb2pid, os.path.join(index_path, "emb2pid.pt"))
and then in place of the current code in faiss_index.py:
emb2pids = torch.load(os.path.join(index_path, "emb2pid.pt"))
if part_range is not None:
    emb2pids = emb2pids[emb2pids >= part_range.start]
    emb2pids = emb2pids[emb2pids < part_range.stop]
self.emb2pids = emb2pids
I think this would cut the time from 75 seconds to like 3 seconds, which would be a significant speedup
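A toy, list-based version of the proposal, to show what the saved mapping contains and what the range filter does to it. `build_emb2pid` and `restrict_to_range` are hypothetical names, and the real code works on torch tensors rather than lists.

```python
def build_emb2pid(doclens, pid_offset=0):
    """Map every embedding row to the pid of the document it belongs to."""
    emb2pid = []
    for pid, dlength in enumerate(doclens):
        emb2pid.extend([pid_offset + pid] * dlength)
    return emb2pid

def restrict_to_range(emb2pid, start, stop):
    """Keep only the entries whose pid lies in [start, stop)."""
    return [pid for pid in emb2pid if start <= pid < stop]

doclens = [2, 3, 1]  # toy per-document token counts
emb2pid = build_emb2pid(doclens)
subset = restrict_to_range(emb2pid, start=1, stop=3)
print(emb2pid, subset)
```

One thing to check before adopting the filter: after it is applied, position i in the result no longer corresponds to row i of the full embedding matrix, so it is only safe if the code that consumes the mapping accounts for the shifted positions.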
Hi, what is the purpose of sorting by maxlens in tensorize_triples? I don't think it is effective.
indices = maxlens.sort().indices
Q_ids, Q_mask = Q_ids[indices], Q_mask[indices]
D_ids, D_mask = D_ids[:, indices], D_mask[:, indices]
(positive_ids, negative_ids), (positive_mask, negative_mask) = D_ids, D_m
On the MSMARCO passage ranking task, this tensor is 687M entries - i.e. 2.4GB?
A multiprocessing Pool is then created for 16 processors - thus copying that tensor 16 times, consuming about 41GB of RAM.
The workaround is to make the tensor shared, by calling share_memory_() - see https://pytorch.org/docs/stable/tensors.html#torch.Tensor.share_memory_:
print_message("len(self.emb2pid) =", len(self.emb2pid))
# prevent this being copied to each fork
self.emb2pid.share_memory_()
self.parallel_pool = Pool(16)
Despite this, I still have memory problems in our environment - the forking consumes just too much memory, and the job is killed. Probably the faiss index is copied 16 times as well, and all_doclens also (though the latter is not needed at that point).
Hello Omar,
Thanks for open-sourcing the code for this amazing work.
This is not really an issue, more of a question.
The paper mentions that for MSMARCO the model was trained for 200k iterations with a batch size of 32 to approximately reproduce the results, so effectively on 6.4 million triples. This means it was not trained on the full triples.small.tsv (39 million points). Is my understanding correct?
I am trying to train on MSMARCO triples. During training, the individual batch loss decreases only for the first few steps and oscillates in later iterations. Did you face the same issue while training? Should this be read as the model not training, or is it expected because the model sees new examples in every batch?
Hello,
Thank you for your amazing work.
I was wondering if it is possible to share your run on the MS MARCO passage collection dev set?