
Comments (9)

okhat commented on July 3, 2024

That's right, MS MARCO doesn't have titles. But lazy_batcher isn't used with the official MS MARCO data. You can see that training.py automatically selects eager_batcher instead for this type of dataset.

from colbert.

okhat commented on July 3, 2024

Thanks for reaching out! First off, it looks like you're using v0.1 but you might find the v0.2 branch a lot richer in features and more systematic.

For Q1 and Q2, queries (the input to self.query(queries)) and docs (the input to self.doc(docs)) are not lists of words. They are batches (lists) of strings, which are then tokenized into WordPiece tokens. Also notice that our v0.1 branch uses HuggingFace Transformers version 2. The more recent branch uses Transformers version 3, though.

For Q3, this is an implementation trick for concise code; the statement in the paper is correct. The reason for this is that the tokens appended to the document are masked twice: (a) their attention_mask is set to zero (this applies in general) and (b) in self.doc lines 51--55 their output embeddings are also masked. Subsequently, during indexing, these embeddings are dropped entirely. This is applied to the document but not to the query. Thus, in fact, "Unlike queries, we do not append [mask] tokens to documents."
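A minimal sketch of the double masking described above, written in plain Python rather than the repository's PyTorch code. The helper names (pad_batch, mask_outputs, drop_padding), the "[PAD]" filler string, and the toy 2-dimensional "embeddings" are assumptions made for illustration; they are not ColBERT's actual functions or values.

```python
PAD = "[PAD]"  # hypothetical stand-in for the filler token used to pad a document


def pad_batch(docs):
    """Pad tokenized documents to a common length and build attention masks."""
    max_len = max(len(d) for d in docs)
    tokens = [d + [PAD] * (max_len - len(d)) for d in docs]
    # (a) filler positions get attention_mask == 0
    attention_mask = [[1] * len(d) + [0] * (max_len - len(d)) for d in docs]
    return tokens, attention_mask


def mask_outputs(embeddings, attention_mask):
    """(b) Zero the output embedding at every masked position."""
    return [
        [[x * m for x in vec] for vec, m in zip(doc, mask)]
        for doc, mask in zip(embeddings, attention_mask)
    ]


def drop_padding(embeddings, attention_mask):
    """Keep only real-token embeddings, mimicking what happens at indexing time."""
    return [
        [vec for vec, m in zip(doc, mask) if m == 1]
        for doc, mask in zip(embeddings, attention_mask)
    ]


docs = [["colbert", "ranks", "passages"], ["late", "interaction"]]
tokens, attention_mask = pad_batch(docs)

# Toy encoder output: one 2-d vector per position, filler positions included.
embeddings = [[[1.0, 2.0] for _ in doc] for doc in tokens]

masked = mask_outputs(embeddings, attention_mask)
indexed = drop_padding(masked, attention_mask)
```

In this sketch the shorter document keeps only its two real-token vectors after drop_padding, so the filler position contributes nothing to what gets indexed.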

Let me know if you have any further questions!


kaishxu commented on July 3, 2024

Thanks a lot for such a quick reply!!!!!!!

For Q1 and Q2, sorry, I did not notice the "Transformers" version. I will read the v0.2 branch right now.
For Q3, I understand what you mean, but I still think the statement "Unlike queries, we do not append [mask] tokens to documents." is misleading. You do append [mask] tokens; they are just masked out, like the punctuation in the skip list.


okhat commented on July 3, 2024

Not really, the query encoder is augmented with [MASK]s whereas the document encoder is not.

The document encoder never "sees" the [mask] tokens: they're masked in input, attention, and output, just like padding. This particular branch just saves a few "if" statements by clever use of attention and MaxSim masks, which could be confusing though.


kaishxu commented on July 3, 2024

Hello, I've read the v0.2 code and found you have fixed the issue.

Another question: why do you sort the samples in a batch by 'maxlen'?


okhat commented on July 3, 2024

Indeed, v0.2 uses a more straightforward implementation for this so it's clearer. However, the behavior is identical to v0.1 as there was no 'issue' to fix.

The sorting by maxlen is just for efficiency during training, in case you use --accum N where N > 1. It helps reduce the amount of padding used for document representations. (This padding is what is masked and dropped, in the responses above. It's needed to allow batch processing of variable-length documents.)
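To see why sorting helps only when N > 1, here is a rough sketch in plain Python. It assumes that --accum N splits a batch into N consecutive sub-batches and that each sub-batch is padded to its own longest document; that splitting scheme, the helper name padding_cost, and the document lengths are all assumptions of this example, not taken from the repository.

```python
def padding_cost(lengths, n_chunks):
    """Total filler positions when `lengths` is split into consecutive chunks,
    each padded to the longest document in that chunk.

    Assumes len(lengths) is divisible by n_chunks.
    """
    size = len(lengths) // n_chunks
    total = 0
    for start in range(0, len(lengths), size):
        chunk = lengths[start:start + size]
        total += sum(max(chunk) - n for n in chunk)
    return total


# Made-up token counts for the documents in one training batch.
doc_lengths = [180, 20, 150, 30, 170, 25, 160, 35]

unsorted_cost = padding_cost(doc_lengths, n_chunks=2)        # long and short mixed
sorted_cost = padding_cost(sorted(doc_lengths), n_chunks=2)  # similar lengths together
single_cost = padding_cost(doc_lengths, n_chunks=1)          # no accumulation
```

With these lengths, sorting cuts the padded positions from 630 to 90 when the batch is split in two. With a single chunk the cost is 670 whether or not the batch is sorted, which matches the point that sorting only matters for N > 1.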


kaishxu commented on July 3, 2024

WOW, that is such precise pruning! Thank you for your reply. I learned a lot from reading the code!!!! I just roughly shuffle the samples before feeding them into the trainer.


kaishxu commented on July 3, 2024

Hello, in "lazy_batcher.py", the function "_load_collection" is defined as follows.

def _load_collection(self, path):
    print_message("#> Loading collection...")

    collection = []

    with open(path) as f:
        for line_idx, line in enumerate(f):
            pid, passage, title, *_ = line.strip().split('\t')
            assert pid == 'id' or int(pid) == line_idx

            passage = title + ' | ' + passage
            collection.append(passage)

However, there seems to be no 'title' column in the "collection.tsv" file.
Could you check again? :) https://github.com/microsoft/MSMARCO-Passage-Ranking
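For illustration only, here is a sketch of a loader that tolerates both layouts: three-column lines (pid, passage, title) and two-column lines (pid, passage), as in a collection.tsv without titles. load_collection is a hypothetical standalone helper, not the repository's method; it omits the print_message call and reads from any iterable of lines so that it runs without a file. The sample lines are made up.

```python
import io


def load_collection(lines):
    """Build a list of passages, prefixing the title only when one is present."""
    collection = []
    for line_idx, line in enumerate(lines):
        # Split off pid and passage; anything after them lands in `rest`.
        pid, passage, *rest = line.rstrip("\n").split("\t")
        assert pid == "id" or int(pid) == line_idx

        # Use the third column as the title if it exists and is non-empty.
        title = rest[0] if rest and rest[0] else None
        if title is not None:
            passage = title + " | " + passage
        collection.append(passage)
    return collection


# Made-up sample lines standing in for two differently formatted files.
with_titles = io.StringIO("0\tParis is the capital of France.\tParis\n")
without_titles = io.StringIO("0\tParis is the capital of France.\n")

titled = load_collection(with_titles)
untitled = load_collection(without_titles)
```

The original three-way unpacking (pid, passage, title, *_) raises a ValueError on a two-column line, which is why this variant collects the optional columns into rest instead.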


hieudx149 commented on July 3, 2024

Hi @okhat,
If my dataset has titles and I want to use EagerBatcher instead of LazyBatcher (LazyBatcher seems more complicated than EagerBatcher), is it OK if my collection is already in (title + ' | ' + passage) format?
Can you explain in more detail when we should use LazyBatcher?

