Comments (6)
Thank you for the kind words!
Partitions is important: it sets the number of centroids FAISS uses for indexing and search. Higher means slower FAISS indexing but faster retrieval. You can make the number 2x or 4x smaller and it would still be fine.
The sample dictates how much of the data is used for FAISS indexing, so here it's 30%. If you drop this parameter completely, the default will internally be 5%. More is better, but 5--30% is enough.
from colbert.
Yes, all documents will be indexed! Irrespective of what you choose for sample, all embeddings are going to be stored.
Sampling just dictates the not-so-critical aspect of how to create internal representations without too much cost.
from colbert.
So as far as I understand, sample is used for indexing, but a sample param of 0.3 does not mean that only 30% of the documents will be indexed. Correct?
from colbert.
Another semi-related indexing parameter question:
What does the --doc_maxlen 180 parameter do? From my understanding, it truncates passages to 180 tokens, but what happens if a passage/document is longer than 180 tokens? Does it throw away the rest of the document (FirstP), or does it split the document into multiple passages and search across all of them (MaxP)?
Also, was 180 the value used in the original work? I am planning to use ColBERT for indexing other datasets apart from MSMarco-passage, so I am not sure whether I would in fact have to retrain everything from scratch.
After trying to index a collection using the checkpoint from the original work, transformers gave me this warning, which I guess should alarm me:
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
If I understand correctly, trying to index using a checkpoint created with a different --doc_maxlen would likely create inconsistencies and result in a worse representation of the corpus.
Thank you again for your help! :)
from colbert.
By default it's FirstP. You'll have to split the documents up if you want to implement MaxP on top of this.
You don't have to retrain. Just split up the passages to 100--150 words (with Python whitespace split) and select an appropriate --doc_maxlen in the range 180--256. It should work fine.
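The whitespace-based splitting suggested above can be sketched as follows. The function name and the 120-word window are illustrative choices (inside the suggested 100--150 range), not part of ColBERT's code:

```python
def split_passages(document: str, words_per_passage: int = 120) -> list[str]:
    """Split a long document into ~120-word passages using Python's
    whitespace split, so each passage fits within --doc_maxlen."""
    words = document.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

doc = ("word " * 300).strip()
passages = split_passages(doc)
# 300 words at 120 words per passage -> 3 passages (120, 120, and 60 words)
```

Each resulting passage then becomes its own line in the collection, and searching over all of them approximates MaxP behavior on top of the default FirstP.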
from colbert.
Hello,
I have 2 questions, about sampling and index.add.
- sampling
What I've understood so far is that sampling is just for analysing the distribution of the documents (the collection) for FAISS. Is that right? If so, do we use only 30% to decrease the cost, because sampling 100% would make index.train() take too much time?
- index.add
Why do we feed only three ".pt" files to the index.add function at a time?
index.add(sub_collection)
Can't we just feed all the files to the function at once?
Thank you!
Also, I would like to know about the following:
- the role of slice in faiss_index.py
- the role of chunck_size in encoder.py
- how the .pt files are made: with what batch size and subset size?
from colbert.
Related Issues (20)
- Duplicate search results when `k` is a high value HOT 8
- Recall and MRR for checkpoint different from paper HOT 2
- Using GPU, ColBERT.try_load_torch_extensions from IndexUpdater reports "Error building extension 'segmented_lookup_cpp'" HOT 1
- How to see progress of Indexing HOT 2
- About training from scratch HOT 8
- low recall@1k HOT 2
- about the ranking operation HOT 1
- FAILED: decompress_residuals.cuda.o, ninja: build stopped: subcommand failed (from: ColBERT/colbert/indexing/codecs/decompress_residuals.cu) HOT 5
- Fine tuning using colbert-ir/colbertv2.0 using training script now gives error HOT 5
- Using with more recent versions of Python than 3.8 HOT 3
- Parallel indexing operations
- What does the nway document actually indicate? HOT 1
- Why is the labels initialised to zeros? HOT 1
- RuntimeError: quantile() input tensor is too large HOT 4
- Set batch size when indexing HOT 3
- troubleshooting encoding performance HOT 1
- Pre-filtering the documents based on metadata before late-interaction HOT 5
- What is Colbert v1.9?
- Issue: Training "resume" and "resume_optimizer" implementation was removed
- Irrelevant results returned by the Colbert V2 Model HOT 1