Comments (16)
@Phil1108 @xuuuluuu The issue arises due to the use of an incorrect model prefix in this line
https://github.com/Phil1108/ColBERT/blob/b786e2e7ef1c13bac97a5f8a35aa9e5ff9f3425f/src/model.py#L30
For XLM-RoBERTa (or RoBERTa) models, the prefix should be `roberta`, not `bert`. Because of this mismatch, the model parameters are not loaded from the pretrained checkpoint and are instead initialized from scratch.
Here's a related issue: huggingface/transformers#8407
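The failure mode can be reproduced with plain PyTorch: checkpoint keys are matched by submodule name, so a backbone stored under `self.bert` silently ignores keys prefixed with `roberta.`. The `Backbone` and wrapper classes below are hypothetical, purely to illustrate the mechanism:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

class WrongWrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = Backbone()      # wrong attribute name for a RoBERTa checkpoint

class RightWrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.roberta = Backbone()   # matches the checkpoint's "roberta." key prefix

# Simulate a RoBERTa-style checkpoint: all keys start with "roberta."
ckpt = {"roberta." + k: v for k, v in Backbone().state_dict().items()}

wrong = WrongWrapper().load_state_dict(ckpt, strict=False)
right = RightWrapper().load_state_dict(ckpt, strict=False)

# Under the wrong prefix, every checkpoint key is "unexpected" and every
# model key is "missing" -- so the weights stay randomly initialized.
print(len(wrong.unexpected_keys))  # 2
print(len(right.unexpected_keys))  # 0
```

This is exactly why training "works" (no error is raised with non-strict loading) while quality collapses.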
from colbert.
@puppetm4st3r
I have not trained an XLM-RoBERTa model with the solution provided above.
But as you can see in my charts above, I once trained a ColBERT model based on multilingual BERT that worked quite well.
I have not uploaded it anywhere, though; I might do that when I have some time in the next couple of weeks.
If it's urgent, just write me a DM.
@xuuuluuu
No, I felt like I was stuck in a dead end there, so I stopped working on (XLM-)RoBERTa.
My best guess is still that RoBERTa's missing next-sentence prediction objective is the main problem here (I think some people also identified that as a reason why RoBERTa is hard to use as an SBERT base model).
I am not saying that it won't work at all, but it will probably take far more time to find a suitable configuration.
You can see my code here: https://github.com/Phil1108/ColBERT/tree/XLM-lr
Hi Philipp,
I'm not surprised that RoBERTa is worse than BERT with ColBERT---we do not recommend RoBERTa. As you say, I do suspect it may have to do with next-sentence prediction.
But I'm surprised at how large the difference is. Did you try tuning the learning rate? The default 3e-06 may be too small for RoBERTa.
Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.
Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.
Seems resolved? Closing but feel free to re-open if you have further questions about this.
Hi,
I have tried lots of things, but the loss behaves strangely.
Tuning the learning rate seems to make it worse, even when introducing a learning rate with linear decay.
This would support the proposal that RoBERTa doesn't work with the ColBERT approach due to the missing NSP training objective.
However, trying ALBERT (which is trained with a sophisticated sentence-order objective) leads to similar results as RoBERTa.
Also, this paper, which I think is very close to the ColBERT paper (https://www.aclweb.org/anthology/2020.coling-main.568.pdf), reports an increase in accuracy when using RoBERTa over BERT. (So the missing NSP objective doesn't seem to really influence these passage-ranking tasks.)
One possible explanation is that RoBERTa needs far more fine-tuning steps (or larger batch sizes) to achieve the same result. So I recently switched to A100 GPUs, increased the batch size from 32 to 150, and now I am waiting for the training and evaluation to finish.
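For reference, a linear-decay schedule of the kind mentioned above can be sketched with plain PyTorch (the warmup and step counts below are made up for illustration; `transformers.get_linear_schedule_with_warmup` implements the same shape):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_schedule(optimizer, warmup_steps, total_steps):
    # LR ramps up linearly over warmup_steps, then decays linearly to 0.
    def factor(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, factor)

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.AdamW(params, lr=3e-6)   # ColBERT's default LR
sched = linear_schedule(opt, warmup_steps=100, total_steps=1000)

for _ in range(500):
    opt.step()
    sched.step()
print(opt.param_groups[0]["lr"])  # 3e-6 * 500/900, roughly 1.67e-06
```

Note that the peak LR itself is still a separate knob; the schedule only shapes how it rises and falls.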
> Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.
>
> Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.

I'm curious about your response to these.
> Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.

No, I explicitly checked that by printing out the input_ids. (This code is not yet uploaded in my fork because I do the inference and evaluation locally.)
Additionally, I changed the [unused0] & [unused1] tokens both to [UNK] and found that it made only a minor difference in BERT's performance. So it could hardly be a tokenizer/vocab problem, I guess.
> Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.
>
> I'm curious about your response to these.

I've done that here, quick and dirty: https://github.com/Phil1108/ColBERT/blob/XLM-lr/src/model.py#L27
When skipping these, the performance became slightly better, but not significantly.
(Both answers here refer to the 32k checkpoints, not the 400k ones, due to time limitations.)
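For context, ColBERT's punctuation skipping amounts to zeroing out punctuation tokens' embeddings before MaxSim scoring. A minimal sketch of that masking (using a toy vocabulary rather than ColBERT's actual skiplist code):

```python
import string

def build_skiplist(token_to_id):
    # Map each punctuation character to its vocab id, if present.
    return {token_to_id[c] for c in string.punctuation if c in token_to_id}

def punctuation_mask(input_ids, skiplist):
    # 1.0 keeps a token's embedding in scoring; 0.0 drops it.
    return [0.0 if t in skiplist else 1.0 for t in input_ids]

# Toy vocabulary, purely for illustration
vocab = {"hello": 5, ",": 6, "world": 7, "!": 8}
skip = build_skiplist(vocab)
print(punctuation_mask([5, 6, 7, 8], skip))  # [1.0, 0.0, 1.0, 0.0]
```

Disabling the skip is then just passing an empty skiplist, which makes it easy to A/B test for a new backbone.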
I see. I'd stick to the original ColBERT then, based on BERT and its multilingual variants.
I can't offer a lot of details about multilingual RoBERTa models without trying them myself.
Yeah, I see. Multilingual BERTs work very well, even cross-lingually. The loss of accuracy compared with the English-only model is not significant. With the larger batch size on the A100s, I think we can even beat the English-only model.
Sadly, there is still not a single open-sourced large multilingual BERT (it seems to me that Google wants to keep them private). That was the initial reason why I had to switch to XLM-RoBERTa (Large) models: these are nearly the only multilingual large models available.
@Phil1108 I am working on a similar idea with RoBERTa. After adding the unused token to the vocab,
`self.tokenizer.add_tokens(["[unused1]"])`
the model does not work properly. Have you figured out where the problem is?
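One common cause (an assumption here, not verified against your code): after `add_tokens`, the embedding matrix must be grown, or the new token's id points past its last row. In transformers that is `model.resize_token_embeddings(len(tokenizer))`; the mechanism can be sketched with a plain `nn.Embedding`:

```python
import torch
import torch.nn as nn

def resize_embeddings(emb: nn.Embedding, new_size: int) -> nn.Embedding:
    # Keep the pretrained rows; new rows start randomly initialized,
    # which is also why a freshly added token carries no useful signal
    # until it has been trained.
    new = nn.Embedding(new_size, emb.embedding_dim)
    with torch.no_grad():
        new.weight[: emb.num_embeddings] = emb.weight
    return new

old = nn.Embedding(10, 4)           # toy "pretrained" table of 10 tokens
grown = resize_embeddings(old, 11)  # room for one added token, id 10
assert torch.equal(grown.weight[:10], old.weight)
```

Even with the resize done correctly, the new row is random at the start of fine-tuning, which fits the observation below that its embedding is hard to learn.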
@Phil1108
Thanks for the reply. My recent finding is that the RoBERTa model performs worse when the newly added token is critical to the task. It seems that the embedding of the newly added token cannot be learned properly. I am investigating the reason...
@xuuuluuu
Ah okay, that could also be the reason; maybe it is the BPE that is not working as intended.
It surprises me a little, though, because I ran experiments with BERT where I accidentally assigned the two special tokens incorrectly, so both became [UNK] tokens, and it still produced better results than RoBERTa.
So my overall conclusion was that introducing the tokens boosts performance, but BERT learns very well even without them.
Have you tried getting rid of the special tokens completely?
Without adding the special tokens, they become [UNK] tokens to the RoBERTa model. As my model relies heavily on those tokens, it is understandable that it performs badly. It is surprising that the BERT model performs better even with [UNK] than RoBERTa does.
I found that the vocab of the RoBERTa model has three unused tokens: "madeupword0000": 50261, "madeupword0001": 50262, "madeupword0002": 50263. Not sure if you can simply use these tokens for your case; for mine, I need many more special tokens.
@Phil1108 Did you resolve the issue? I'm looking for a multilingual model like ColBERT.
@Phil1108 My model performs quite badly; can you share your work? Maybe it can help me improve.