Comments (5)

littlewine commented on June 24, 2024

Maybe I am missing something, but is there a checkpoint of the original work's pretrained encoder (pretrained on MS MARCO) somewhere in this repo?
Or do we have to retrain from scratch to use the model on a different collection?

On Hugging Face, I found https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco -- thanks @sebastian-hofstaetter -- and also https://huggingface.co/vespa-engine/colbert-medium, which seems to be what I wanted (though not from the original authors). Maybe it would be good to add these to the README, @okhat?
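
For anyone finding this later, a minimal sketch of loading one of those checkpoints with the transformers library. This is my own illustration, not the authors' code: AutoModel pulls in the underlying encoder weights, but the ColBERT-specific projection and scoring logic lives in each checkpoint's own model class (see the respective model cards), so loading may require that class instead.

```python
# Hedged sketch: load the encoder behind one of the checkpoints above.
# If the checkpoint declares a custom architecture, AutoModel may fail or
# skip the ColBERT projection head; fall back to the model card's own class.
from transformers import AutoModel, AutoTokenizer

name = "sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

batch = tokenizer("why is the sky blue?", return_tensors="pt")
token_embs = encoder(**batch).last_hidden_state  # (1, seq_len, 768)
```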

okhat commented on June 24, 2024

Hey Sebastian!

Margin-MSE is awesome work---thanks for applying it to ColBERT and releasing the checkpoint! The results are impressive.

I have two thoughts: it would be great to test this out with end-to-end retrieval, but using d=768 embeddings enlarges the index by a factor of six relative to ColBERT's default d=128. We have aggressive quantization for ColBERT coming for release very soon, which will represent each vector with just 32 bytes, so maybe that will ease this a bit.
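
The back-of-envelope arithmetic behind those numbers, as I read them (my own sketch, assuming 2-byte float16 storage per dimension):

```python
# Rough per-token-embedding storage, assuming float16 (2 bytes/dimension).
bytes_d768 = 768 * 2   # 1536 bytes per vector at d=768
bytes_d128 = 128 * 2   # 256 bytes per vector at ColBERT's default d=128

print(bytes_d768 / bytes_d128)  # 6.0 -> the "factor of six"
print(bytes_d128 / 32)          # 8.0 -> further shrink from 32-byte codes
```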

I will take a look at your links. A merge would be really cool!

sebastian-hofstaetter commented on June 24, 2024

Great :) Yes, I did try end-to-end retrieval, but FAISS did not like the 1 TB index on even the largest server I have access to. Does your quantization work on an existing checkpoint? If not, I could also retrain a model with Margin-MSE to compress the output vectors to a smaller dimension.

okhat commented on June 24, 2024

I'm guessing you used FAISS with a large index type, maybe FlatL2 or HNSW.

For ColBERT, we use IVFPQ, which decreases the index size dramatically; I've faced no issues with very large indexes (e.g., the full-document version of MS MARCO). A sketch of the setup is below.
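
For concreteness, here is a minimal IVFPQ sketch (my own illustration with placeholder parameters, not the repo's actual indexing code). With m=32 sub-quantizers at 8 bits each, every vector is stored in 32 bytes:

```python
# Minimal FAISS IVFPQ illustration; parameters are placeholders.
import faiss
import numpy as np

d = 128                # embedding dimension
nlist = 1024           # number of coarse IVF clusters
m, nbits = 32, 8       # 32 sub-quantizers x 8 bits = 32 bytes per vector

xb = np.random.rand(100_000, d).astype("float32")  # stand-in embeddings

quantizer = faiss.IndexFlatL2(d)                   # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)        # learn centroids and PQ codebooks
index.add(xb)

index.nprobe = 32      # clusters to visit per query
D, I = index.search(xb[:5], 10)                    # top-10 neighbors
```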

The only challenge I see is how to reconcile the two model definitions, since there are a couple of differences in the base model (DistilBERT) and in masking.

okhat commented on June 24, 2024

Hey @sebastian-hofstaetter!

I thought you might be interested in our new quantization branch. By default, it represents each vector in just 32 bytes, and I generally get very similar results with it as with the full 128-dim embeddings, which use 256 bytes each.
