Comments (9)
That's right, MS MARCO doesn't have titles. But lazy_batcher isn't used with the official MS MARCO data. You can see that training.py automatically selects eager_batcher instead for this type of dataset.
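A hypothetical sketch of that selection (the real training.py may differ in its exact flags; the function name here is made up): LazyBatcher looks up passages in a separate collection file, which carries titles, while EagerBatcher reads raw text directly from the triples, as with the official MS MARCO data.

```python
# Hypothetical sketch, not the repo's actual code: the batcher is chosen
# by whether a separate collection file (with titles) is supplied.

def select_batcher(collection_path=None):
    """Pick LazyBatcher when a collection file is given, else EagerBatcher."""
    return "LazyBatcher" if collection_path is not None else "EagerBatcher"
```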
from colbert.
Thanks for reaching out! First off, it looks like you're using v0.1 but you might find the v0.2 branch a lot richer in features and more systematic.
For Q1 and Q2, queries (the input to self.query(queries)) and docs (the input to self.doc(docs)) are not lists of words. They are batches (lists) of strings, which are then tokenized into WordPiece tokens. Also note that our v0.1 branch uses HuggingFace Transformers version 2; the more recent branch uses Transformers version 3, though.
For Q3, this is an implementation trick for concise code; the statement in the paper is correct. The tokens appended to the document are masked twice: (a) their attention_mask is set to zero (this applies in general), and (b) in self.doc (lines 51--55) their output embeddings are also masked. Subsequently, during indexing, these embeddings are dropped entirely. This is applied to documents but not to queries. Thus, in fact, "Unlike queries, we do not append [mask] tokens to documents."
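A minimal pure-Python sketch of the double masking described above (the token ids are illustrative BERT-style values; this is not the actual ColBERT code):

```python
# Illustrative token ids; the real code gets these from the tokenizer.
MASK_ID, PAD_ID = 103, 0

def doc_attention_mask(token_ids):
    """(a) Zero the attention mask at [MASK]/padding positions."""
    return [0 if t in (MASK_ID, PAD_ID) else 1 for t in token_ids]

def mask_doc_embeddings(embeddings, attention_mask):
    """(b) Zero the output embeddings at the same positions.
    During indexing, these zeroed vectors are dropped entirely."""
    return [[x * m for x in emb] for emb, m in zip(embeddings, attention_mask)]
```

Because the same positions are zeroed in both the attention mask and the output, the encoder's embeddings for real document tokens are never influenced by the appended tokens.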
Let me know if you have any further questions!
Thanks a lot for such a quick reply!
For Q1 and Q2, sorry, I did not notice the Transformers version. I will read the v0.2 branch right now.
For Q3, I understand your meaning, but I still think the statement "Unlike queries, we do not append [mask] tokens to documents." is misleading. Actually, you do append [mask] tokens; they are then masked, like the punctuation in the skip list.
Not really: the query encoder is augmented with [MASK]s, whereas the document encoder is not.
The document encoder never "sees" the [mask] tokens: they're masked in the input, the attention, and the output, just like padding. This particular branch just saves a few "if" statements through clever use of attention and MaxSim masks, which can admittedly be confusing.
Hello, I've read the v0.2 code and found you have fixed the issue.
Another question: why do you sort the samples by 'maxlen' within a batch?
Indeed, v0.2 uses a more straightforward implementation for this, so it's clearer. However, the behavior is identical to v0.1; there was no 'issue' to fix.
Sorting by maxlen is just for efficiency during training, in case you use --accum N with N > 1. It helps reduce the amount of padding used for document representations. (This padding is what is masked and dropped, per the responses above; it's needed to allow batch processing of variable-length documents.)
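To make the padding argument concrete, here is a back-of-the-envelope sketch (the function name and the document lengths are made up): when a batch is split into --accum sub-batches, each sub-batch is padded only to its own longest document, so sorting by length groups similar lengths together and wastes fewer padded tokens.

```python
def padded_tokens(lengths, accum):
    """Total tokens after padding, splitting `lengths` into `accum` chunks."""
    size = len(lengths) // accum
    total = 0
    for i in range(0, len(lengths), size):
        chunk = lengths[i:i + size]
        total += max(chunk) * len(chunk)  # pad chunk to its longest doc
    return total

lengths = [180, 30, 160, 40]                     # doc lengths in one batch
cost_unsorted = padded_tokens(lengths, 2)        # chunks [180, 30], [160, 40]
cost_sorted = padded_tokens(sorted(lengths), 2)  # chunks [30, 40], [160, 180]
# cost_sorted < cost_unsorted: sorting reduces wasted padding
```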
Wow, that is such precise pruning! Thank you for your reply; I learned a lot from reading the code! I had just been roughly shuffling samples before feeding them to the trainer.
Hello, in "lazy_batcher.py", the function "_load_collection" is defined as follows:

def _load_collection(self, path):
    print_message("#> Loading collection...")

    collection = []

    with open(path) as f:
        for line_idx, line in enumerate(f):
            pid, passage, title, *_ = line.strip().split('\t')
            assert pid == 'id' or int(pid) == line_idx

            passage = title + ' | ' + passage
            collection.append(passage)

    return collection

However, it seems there is no 'title' column in the "collection.tsv" file.
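As an aside, the loading loop can be written to tolerate both layouts; a hypothetical sketch (not the repo's code) that accepts 2-column (pid, passage) and 3-column (pid, passage, title) TSV lines:

```python
def parse_collection_line(line_idx, line):
    """Accept 'pid\\tpassage' or 'pid\\tpassage\\ttitle' lines."""
    pid, passage, *rest = line.strip().split('\t')
    assert pid == 'id' or int(pid) == line_idx
    # Prepend the title only when the column is actually present.
    return (rest[0] + ' | ' + passage) if rest else passage
```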
check again? :) https://github.com/microsoft/MSMARCO-Passage-Ranking
Hi @okhat ,
If my dataset has titles and I want to use EagerBatcher instead of LazyBatcher (LazyBatcher seems more complicated than EagerBatcher), is it okay if my collection is already in (title + ' | ' + passage) format?
Can you explain in more detail when we should use LazyBatcher?