Comments (5)
Maybe I am missing something, but is there a model checkpoint of the pretrained encoder from the original work (pretrained on MS MARCO) somewhere in this repo?
Or do we have to retrain from scratch to use the model on a different collection?
On Hugging Face, I found https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco (thanks @sebastian-hofstaetter) and also https://huggingface.co/vespa-engine/colbert-medium, which seems to be what I wanted (but is not from the original authors). Maybe it would be good to add it to the README, @okhat?
Hey Sebastian!
Margin-MSE is awesome work---thanks for applying it to ColBERT and releasing the checkpoint! The results are impressive.
I have two thoughts: it would be great to test this out with end-to-end retrieval, but the use of d=768 embeddings enlarges the index by a factor of six compared with 128-dimensional embeddings. We have aggressive quantization for ColBERT that will be released very soon, so maybe that will ease this a bit. It will represent each vector with just 32 bytes.
I will take a look at your links. A merge will be really cool!
Great :) Yes, I did try to do end-to-end retrieval, but FAISS did not like the 1 TB index on even the largest server I have access to. Does your quantization work on an existing checkpoint? If not, I could also retrain a model with Margin-MSE to compress the output vectors to a smaller dimension.
I'm guessing you used FAISS with a large index type, maybe FlatL2 or HNSW.
For ColBERT, we use IVFPQ, which decreases the index size dramatically, and I've faced no issues with very large indexes (e.g., the full-document version of MS MARCO).
The only challenge I see is how to reconcile the two model definitions, since there are a couple of differences in the base model (DistilBERT) and in masking.
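To make the size difference concrete, here is a rough back-of-the-envelope sketch comparing per-vector storage for a flat (uncompressed) index with the codes of a product-quantized one. The dimension, number of sub-quantizers, bits per code, and vector count are all hypothetical values picked for illustration, not ColBERT's actual settings, and the estimate ignores vector ids and coarse centroids.

```python
import math


def flat_bytes_per_vector(dim: int, bytes_per_value: int = 4) -> int:
    """Uncompressed storage: one value per dimension (float32 by default)."""
    return dim * bytes_per_value


def pq_bytes_per_vector(num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Product quantization stores one small code per sub-quantizer."""
    return math.ceil(num_subquantizers * bits_per_code / 8)


# Hypothetical settings, chosen only for illustration.
dim = 128
num_subquantizers = 16
num_vectors = 1_000_000_000

flat_total = flat_bytes_per_vector(dim) * num_vectors
pq_total = pq_bytes_per_vector(num_subquantizers) * num_vectors

print(f"flat index:  {flat_total / 1e9:.0f} GB")
print(f"PQ codes:    {pq_total / 1e9:.0f} GB")
print(f"reduction:   {flat_total // pq_total}x")
```

Under these assumed numbers the compressed codes are 32 times smaller than the flat vectors, which is the kind of gap that can decide whether an index fits on a single server.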
Hey @sebastian-hofstaetter !
I thought you might be interested to know about our new quantization branch. By default, it represents each vector in just 32 bytes. I generally get very similar results with this compared to using the full 128-dim embedding, which uses 256 bytes.
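The figures in this thread can be checked with a little arithmetic. The sketch below assumes that "128-dim in 256 bytes" means 2 bytes per value (for example float16), and that a 768-dim model would be stored at the same precision. Both are assumptions, not something stated above.

```python
def bytes_per_vector(dim: int, bytes_per_value: int) -> int:
    """Storage for one uncompressed embedding vector."""
    return dim * bytes_per_value


# 128 dimensions in 256 bytes implies 2 bytes per value (assumed: float16).
BYTES_PER_VALUE = 256 // 128

full_128 = bytes_per_vector(128, BYTES_PER_VALUE)
full_768 = bytes_per_vector(768, BYTES_PER_VALUE)  # assumed same precision
quantized = 32  # bytes per vector on the quantization branch

print(f"128-dim: {full_128} bytes, 768-dim: {full_768} bytes")
print(f"768-dim vs 128-dim: {full_768 // full_128}x larger")
print(f"128-dim vs quantized: {full_128 // quantized}x larger")
```

This reproduces the factor of six between d=768 and d=128, and shows that the 32-byte representation is eight times smaller than the full 128-dim embedding.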