Hi, when it comes to long document ranking, how can ColBERT be used to solve the problem?

long document ranking, about stanford-futuredata/colbert

Comments (6)

okhat commented on June 24, 2024

Actually, the code is almost exactly the same, with a couple of additional short scripts. To start, you just need to split the long documents into passages and create triples for supervision. "MaxP" just means MaxPassage, that is, we assign each document the score of its highest-scoring passage.
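
The MaxP rule described above can be sketched in a few lines. This is only an illustration, not the ColBERT repository's own script: the function name `maxp_document_scores` and the `(doc_id, score)` input layout are assumptions made for the example.

```python
def maxp_document_scores(passage_scores):
    """Aggregate passage scores into document scores with MaxP.

    passage_scores: iterable of (doc_id, score) pairs for a single query,
    one pair per scored passage (hypothetical input layout).
    """
    best = {}
    for doc_id, score in passage_scores:
        # Keep only the highest passage score seen for each document.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Rank documents by their best passage score, highest first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


scores = [("d1", 12.5), ("d2", 20.1), ("d1", 18.0), ("d2", 7.3)]
ranking = maxp_document_scores(scores)
```

Here `d2` ranks above `d1` because its best passage (20.1) outscores the best passage of `d1` (18.0), regardless of how the other passages scored.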

I agree releasing the extra instructions and scripts will be useful for others. I will update here on this.

gm0616 commented on June 24, 2024

Thanks for the detailed reply. I think I understand the process now.

To start, you just need to split the long documents into passages and create triples for supervision.

And what were your experimental settings here?

maximum length of the passage?
doc stride?
how do you get the triples? Are all other passages used as negative samples, or are just a few passages sampled as hard negatives?

okhat commented on June 24, 2024

The maximum length was 450 BERT tokens. I applied no hyperparameter tuning over this, however, so it's possible that other choices work too. The doc stride was also untuned. For this long-document task, I think it's around 60 BERT tokens, IIRC.
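
A sliding-window split with these settings might look like the sketch below. It assumes that "doc stride" means the number of tokens shared by consecutive passages; if it instead means the step between window starts, set the step directly. The function name `split_into_passages` and its parameters are hypothetical, and the input is taken to be a list of token ids that has already been tokenized.

```python
def split_into_passages(token_ids, max_len=450, overlap=60):
    """Split a token sequence into overlapping fixed-size windows.

    Assumption: `overlap` is the number of tokens that two consecutive
    passages share, so each window starts (max_len - overlap) tokens
    after the previous one.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []
    step = max_len - overlap
    passages = []
    start = 0
    while True:
        passages.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(token_ids):
            break
        start += step
    return passages


tokens = list(range(1000))  # stand-in for real BERT token ids
passages = split_into_passages(tokens)
```

With 1000 tokens this yields three passages starting at offsets 0, 390, and 780, and the last one is shorter than `max_len`.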

To get initial training triples, we use zero-shot transfer of ColBERT trained on the passage task of MS MARCO. I suspect you can also start from BM25. We then divide the top-1000 passages retrieved into buckets: positive, negative, ignored. Negatives are passages that come from negative documents, positives are the best (one or more) passages that come from the positive document, and the rest are ignored (they are technically very weak positives; hence, not used in training).
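
The bucketing step can be sketched roughly as follows. Several details are assumptions, since the thread does not specify them: each query is taken to have a single positive document, the "best" passages are taken to be the top-ranked ones in the retrieved list, and every positive is paired with every negative. The names `bucket_passages`, `make_triples`, and `num_positives` are hypothetical.

```python
def bucket_passages(ranked_passages, positive_doc_id, num_positives=1):
    """Split one query's retrieved passages into three buckets.

    ranked_passages: list of (passage_id, doc_id) pairs, best first
    (for example, the top-1000 passages from a first-stage ranker).
    """
    positives, negatives, ignored = [], [], []
    for passage_id, doc_id in ranked_passages:
        if doc_id != positive_doc_id:
            # Passages from any other document count as negatives.
            negatives.append(passage_id)
        elif len(positives) < num_positives:
            # The top-ranked passages of the positive document are positives.
            positives.append(passage_id)
        else:
            # Remaining passages of the positive document are left out.
            ignored.append(passage_id)
    return positives, negatives, ignored


def make_triples(query_id, positives, negatives):
    # Assumption: pair every positive with every negative.
    return [(query_id, pos, neg) for pos in positives for neg in negatives]


ranked = [("p1", "dA"), ("p2", "dB"), ("p3", "dA"), ("p4", "dC")]
positives, negatives, ignored = bucket_passages(ranked, positive_doc_id="dA")
triples = make_triples("q1", positives, negatives)
```

In practice one would likely sample a subset of the negatives rather than use all of them, but the thread does not say how that was done.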

gm0616 commented on June 24, 2024

Thank you for the fast response! The method of constructing triplets sounds like a great idea, I will try that as well. Thanks!!

okhat commented on June 24, 2024

By the way, let me know if you need the ColBERT ranking output on this task (e.g., if you'd like to re-rank it). We're happy to share/release it.

ashokrajab commented on June 24, 2024

I think it's around 60 BERT tokens

I wonder whether using a sliding window technique in any way hinders the retrieval of lower-ranked documents.

In a sliding window implementation, the same token will appear in multiple segments. So during retrieval, in the first stage, among all the top k' token embeddings, many will essentially be a slight variation of the same token. This in essence would prevent the tokens from other documents from being retrieved.

Is this something one needs to be wary of, @okhat?
