Comments (6)

okhat avatar okhat commented on June 24, 2024

Hi @gm0616,

Actually, the code is almost exactly the same, with a couple of additional short scripts. To start, you just need to split the long documents into passages and create triples for supervision. "MaxP" just means MaxPassage, that is, we assign each document the score of its highest-scoring passage.
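As a minimal sketch of the MaxP aggregation described above: each document gets the score of its highest-scoring passage. The function name, the passage IDs, and the `passage_to_doc` mapping are hypothetical and chosen for illustration; they are not taken from the ColBERT codebase.

```python
def maxp_document_scores(passage_scores, passage_to_doc):
    """Aggregate passage scores into document scores with MaxP.

    passage_scores: dict mapping passage ID -> relevance score for one query.
    passage_to_doc: dict mapping passage ID -> ID of the document it came from.
    (Both structures are assumptions made for this sketch.)
    """
    doc_scores = {}
    for pid, score in passage_scores.items():
        doc_id = passage_to_doc[pid]
        # Keep only the best passage score seen so far for each document.
        if doc_id not in doc_scores or score > doc_scores[doc_id]:
            doc_scores[doc_id] = score
    return doc_scores


scores = {"d1_p0": 1.0, "d1_p1": 3.5, "d2_p0": 2.0}
mapping = {"d1_p0": "d1", "d1_p1": "d1", "d2_p0": "d2"}
doc_scores = maxp_document_scores(scores, mapping)
# Rank documents by their MaxP score, best first.
ranking = sorted(doc_scores, key=doc_scores.get, reverse=True)
```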

I agree releasing the extra instructions and scripts will be useful for others. I will update here on this.

from colbert.

gm0616 avatar gm0616 commented on June 24, 2024

Thanks for the detailed reply. I think I get the idea of the process.

To start, you just need to split the long documents into passages and create triples for supervision.

And what are your experiment settings here?

  • maximum length of the passage?
  • doc stride?
  • how to get triplets? Are all other passages used as negative samples, or are just a few passages sampled as hard negatives?

okhat avatar okhat commented on June 24, 2024

The maximum length was 450 BERT tokens. I did no hyperparameter tuning over this, however, so it's possible that other choices work too. The doc stride was also untuned. For this long-document task, I think it was around 60 BERT tokens, IIRC.
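A rough sketch of sliding-window splitting with these settings follows. It assumes the document is already tokenized into a list of BERT tokens, and it assumes the "doc stride" of about 60 means the overlap between consecutive passages; if it instead means the step between window starts, pass `overlap = max_len - 60`. The function name and parameters are hypothetical, not part of ColBERT.

```python
def split_into_passages(tokens, max_len=450, overlap=60):
    """Split a token list into overlapping passages of at most max_len tokens.

    Assumption: `overlap` is the number of tokens shared by consecutive
    passages, so each new window starts max_len - overlap tokens later.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if not tokens:
        return []
    step = max_len - overlap
    passages = []
    start = 0
    while True:
        passages.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += step
    return passages


# Integers stand in for BERT tokens in this toy example.
passages = split_into_passages(list(range(1000)))
```

Every passage except possibly the last has exactly `max_len` tokens, and each one repeats the final `overlap` tokens of the passage before it.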

To get initial training triples, we use zero-shot transfer of ColBERT trained on the passage task of MS MARCO. I suspect you can also start from BM25. We then divide the top-1000 passages retrieved into buckets: positive, negative, ignored. Negatives are passages that come from negative documents, positives are the best (one or more) passages that come from the positive document, and the rest are ignored (they are technically very weak positives; hence, not used in training).
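The bucketing step could be sketched roughly as below. This is an illustration under stated assumptions, not the actual script: `ranked` is a hypothetical list of `(passage_id, doc_id)` pairs for one query, already sorted best first (for example, the top-1000 from a zero-shot ColBERT or BM25 run), `num_best` is an assumed knob for "one or more" best passages per positive document, and pairing every positive with every negative is just one simple way to form triples.

```python
def bucket_passages(ranked, positive_doc_ids, num_best=1):
    """Divide ranked passages into positive, negative, and ignored buckets.

    ranked: list of (passage_id, doc_id), sorted by retrieval score, best first.
    positive_doc_ids: set of document IDs labeled relevant for the query.
    num_best: how many top passages per positive document count as positives
              (an assumption; the thread only says "one or more").
    """
    positives, negatives, ignored = [], [], []
    kept_per_doc = {}
    for pid, doc_id in ranked:
        if doc_id not in positive_doc_ids:
            # Any passage from a negative document is a negative.
            negatives.append(pid)
        elif kept_per_doc.get(doc_id, 0) < num_best:
            # The best-ranked passages of a positive document are positives.
            positives.append(pid)
            kept_per_doc[doc_id] = kept_per_doc.get(doc_id, 0) + 1
        else:
            # Remaining passages of a positive document are weak positives
            # and are left out of training.
            ignored.append(pid)
    return positives, negatives, ignored


def make_triples(query_id, positives, negatives):
    """Pair each positive with each negative (a simple illustrative choice)."""
    return [(query_id, pos, neg) for pos in positives for neg in negatives]


ranked = [("a0", "A"), ("b0", "B"), ("a1", "A"), ("c0", "C")]
positives, negatives, ignored = bucket_passages(ranked, {"A"})
triples = make_triples("q1", positives, negatives)
```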

gm0616 avatar gm0616 commented on June 24, 2024

Thank you for the fast response! The method of constructing triplets sounds like a great idea; I will try that as well. Thanks!!

okhat avatar okhat commented on June 24, 2024

By the way, let me know if you need the ColBERT ranking output on this task (e.g., if you'd like to re-rank it). We're happy to share/release it.

ashokrajab avatar ashokrajab commented on June 24, 2024

I think it's around 60 BERT tokens

I wonder whether using a sliding window technique in any way hinders the retrieval of lower-ranked documents.

In a sliding window implementation, the same token will appear in multiple segments. So during retrieval, in the first stage, among all the top k' token embeddings, many will essentially be a slight variation of the same token. This in essence would prevent the tokens from other documents from being retrieved.

Is this something one needs to be wary of, @okhat?
