Hi, Thanks for this great model 🎉! I just publish


Trained DistilBERT-based Checkpoint, about stanford-futuredata/colbert

Comments (5)

littlewine commented on June 24, 2024

Maybe I am missing something, but is there a checkpoint of the original work's pretrained encoder (pretrained on MS MARCO) somewhere in this repo?
Or do we have to retrain from scratch to use the model on a different collection?

On Hugging Face, I found https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco (thanks @sebastian-hofstaetter) and also https://huggingface.co/vespa-engine/colbert-medium, which seems to be what I wanted (though it is not from the original authors). Maybe it would be good to add these to the README, @okhat?


okhat commented on June 24, 2024

Hey Sebastian!

Margin-MSE is awesome work---thanks for applying it to ColBERT and releasing the checkpoint! The results are impressive.

I have two thoughts. First, it would be great to test this out with end-to-end retrieval, but the d=768 embeddings enlarge the index by a factor of six compared with 128-dimensional embeddings. Second, we have aggressive quantization for ColBERT, due for release very soon, which represents each vector with just 32 bytes, so maybe that will ease this a bit.
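The factor of six can be checked with a little arithmetic. The following is a minimal sketch, not code from the ColBERT repository: the helper `index_size_bytes` and the corpus size `NUM_VECTORS` are hypothetical, and 2 bytes per value is assumed because the thread states that a 128-dimensional embedding takes 256 bytes.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 2) -> int:
    """Raw size of an uncompressed index that stores one `dim`-sized vector per token.

    bytes_per_value=2 is an assumption (16-bit values), chosen so that a
    128-dimensional vector takes 256 bytes.
    """
    return num_vectors * dim * bytes_per_value


# Hypothetical number of token embeddings, picked only for illustration.
NUM_VECTORS = 1_000_000

size_128 = index_size_bytes(NUM_VECTORS, dim=128)
size_768 = index_size_bytes(NUM_VECTORS, dim=768)
growth = size_768 / size_128

print(f"d=128: {size_128 / 1e6:.0f} MB")
print(f"d=768: {size_768 / 1e6:.0f} MB")
print(f"growth: {growth:.0f}x")
```

The growth factor depends only on the ratio of dimensions (768 / 128), so it holds for any corpus size as long as the per-value precision is unchanged.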

I will take a look at your links. A merge will be really cool!


sebastian-hofstaetter commented on June 24, 2024

Great :) Yes, I did try end-to-end retrieval, but FAISS did not like the 1 TB index, even on the largest server I have access to. Does your quantization work on an existing checkpoint? If not, I could also retrain a model with Margin-MSE to compress the output vectors to a smaller dimension.


okhat commented on June 24, 2024

I'm guessing you used FAISS with a large index type, maybe FlatL2 or HNSW.

For ColBERT, we use IVFPQ, which decreases the index size dramatically, and I've faced no issues with very large indexes (e.g., the full-document version of MS MARCO).

The only challenge I see is how to reconcile the two model definitions, since there are a couple of differences in the base model (DistilBERT) and in masking.


okhat commented on June 24, 2024

Hey @sebastian-hofstaetter !

I thought you might be interested to know about our new quantization branch. By default, it represents each vector in just 32 bytes. I generally get very similar results with this compared to using the full 128-dim embedding, which uses 256 bytes.
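To put those two numbers side by side, here is a small sketch of the storage budget they imply. It uses only the figures quoted above (128 dimensions, 256 bytes, 32 bytes); the function name `storage_budget` is hypothetical, and the result is an average budget per dimension, not a description of how the quantization branch actually encodes vectors.

```python
def storage_budget(dim: int, full_bytes: int, quantized_bytes: int) -> dict:
    """Derive per-dimension storage figures from per-vector byte counts."""
    return {
        # Bytes spent on each dimension in the uncompressed vector.
        "full_bytes_per_dim": full_bytes / dim,
        # Average number of bits available per dimension after quantization.
        "quantized_bits_per_dim": quantized_bytes * 8 / dim,
        # How many times smaller the quantized vector is.
        "compression_ratio": full_bytes / quantized_bytes,
    }


budget = storage_budget(dim=128, full_bytes=256, quantized_bytes=32)
for name, value in budget.items():
    print(f"{name}: {value}")
```

Under these figures the quantized representation is eight times smaller than the full one, which leaves an average of two bits per dimension.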

