Comments (9)
That's right, MS MARCO doesn't have titles. But lazy_batcher isn't used with the official MS MARCO data. You can see that training.py automatically selects eager_batcher instead for this type of dataset.
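A hypothetical sketch of that selection (the real training.py may differ in its exact flags; the function name here is made up): LazyBatcher looks up passages in a separate collection file, which carries titles, while EagerBatcher reads raw text directly from the triples, as with the official MS MARCO data.

```python
# Hypothetical sketch, not the repo's actual code: the batcher is chosen
# by whether a separate collection file (with titles) is supplied.

def select_batcher(collection_path=None):
    """Pick LazyBatcher when a collection file is given, else EagerBatcher."""
    return "LazyBatcher" if collection_path is not None else "EagerBatcher"
```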
from colbert.
Thanks for reaching out! First off, it looks like you're using v0.1 but you might find the v0.2 branch a lot richer in features and more systematic.
For Q1 and Q2, queries (the input to self.query(queries)) and docs (the input to self.doc(docs)) are not lists of words. They are batches (lists) of strings, which are then tokenized into WordPiece tokens. Also note that our v0.1 branch uses HuggingFace Transformers version 2; the more recent branch uses Transformers version 3, though.
For Q3, this is an implementation trick for concise code; the statement in the paper is correct. The tokens appended to the document are masked twice: (a) their attention_mask is set to zero (this applies in general), and (b) in self.doc (lines 51--55) their output embeddings are also masked. Subsequently, during indexing, these embeddings are dropped entirely. This is applied to documents but not to queries. Thus, in fact, "Unlike queries, we do not append [mask] tokens to documents."
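A minimal pure-Python sketch of the double masking described above (the token ids are illustrative BERT-style values; this is not the actual ColBERT code):

```python
# Illustrative token ids; the real code gets these from the tokenizer.
MASK_ID, PAD_ID = 103, 0

def doc_attention_mask(token_ids):
    """(a) Zero the attention mask at [MASK]/padding positions."""
    return [0 if t in (MASK_ID, PAD_ID) else 1 for t in token_ids]

def mask_doc_embeddings(embeddings, attention_mask):
    """(b) Zero the output embeddings at the same positions.
    During indexing, these zeroed vectors are dropped entirely."""
    return [[x * m for x in emb] for emb, m in zip(embeddings, attention_mask)]
```

Because the same positions are zeroed in both the attention mask and the output, the encoder's embeddings for real document tokens are never influenced by the appended tokens.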
Let me know if you have any further questions!
Thanks a lot for such a quick reply!
For Q1 and Q2, sorry, I did not notice the Transformers version. I will read the v0.2 branch right now.
For Q3, I understand your meaning, but I still think the statement "Unlike queries, we do not append [mask] tokens to documents." is misleading. Actually, you do append [mask] tokens; they are then masked, like the punctuation in the skip list.
Not really: the query encoder is augmented with [MASK]s, whereas the document encoder is not.
The document encoder never "sees" the [mask] tokens: they're masked in the input, the attention, and the output, just like padding. This particular branch just saves a few "if" statements through clever use of attention and MaxSim masks, which can admittedly be confusing.
Hello, I've read the v0.2 code and found you have fixed the issue.
Another question: why do you sort the samples by 'maxlen' within a batch?
Indeed, v0.2 uses a more straightforward implementation for this, so it's clearer. However, the behavior is identical to v0.1; there was no 'issue' to fix.
Sorting by maxlen is just for efficiency during training, in case you use --accum N with N > 1. It helps reduce the amount of padding used for document representations. (This padding is what is masked and dropped, per the responses above; it's needed to allow batch processing of variable-length documents.)
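To make the padding argument concrete, here is a back-of-the-envelope sketch (the function name and the document lengths are made up): when a batch is split into --accum sub-batches, each sub-batch is padded only to its own longest document, so sorting by length groups similar lengths together and wastes fewer padded tokens.

```python
def padded_tokens(lengths, accum):
    """Total tokens after padding, splitting `lengths` into `accum` chunks."""
    size = len(lengths) // accum
    total = 0
    for i in range(0, len(lengths), size):
        chunk = lengths[i:i + size]
        total += max(chunk) * len(chunk)  # pad chunk to its longest doc
    return total

lengths = [180, 30, 160, 40]                     # doc lengths in one batch
cost_unsorted = padded_tokens(lengths, 2)        # chunks [180, 30], [160, 40]
cost_sorted = padded_tokens(sorted(lengths), 2)  # chunks [30, 40], [160, 180]
# cost_sorted < cost_unsorted: sorting reduces wasted padding
```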
Wow, that is such precise pruning! Thank you for your reply; I learned a lot from reading the code! I had just been roughly shuffling samples before feeding them to the trainer.
Hello, in "lazy_batcher.py", the function "_load_collection" is defined as follows:

def _load_collection(self, path):
    print_message("#> Loading collection...")

    collection = []

    with open(path) as f:
        for line_idx, line in enumerate(f):
            pid, passage, title, *_ = line.strip().split('\t')
            assert pid == 'id' or int(pid) == line_idx

            passage = title + ' | ' + passage
            collection.append(passage)

    return collection

However, it seems there is no 'title' column in the "collection.tsv" file.
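As an aside, the loading loop can be written to tolerate both layouts; a hypothetical sketch (not the repo's code) that accepts 2-column (pid, passage) and 3-column (pid, passage, title) TSV lines:

```python
def parse_collection_line(line_idx, line):
    """Accept 'pid\\tpassage' or 'pid\\tpassage\\ttitle' lines."""
    pid, passage, *rest = line.strip().split('\t')
    assert pid == 'id' or int(pid) == line_idx
    # Prepend the title only when the column is actually present.
    return (rest[0] + ' | ' + passage) if rest else passage
```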
check again? :) https://github.com/microsoft/MSMARCO-Passage-Ranking
Hi @okhat ,
If my dataset has titles and I want to use EagerBatcher instead of LazyBatcher (LazyBatcher seems more complicated than EagerBatcher), is it okay if my collection is already in (title + ' | ' + passage) format?
Can you explain in more detail when we should use LazyBatcher?