Comments (16)

surajnair04 commented on July 3, 2024

@Phil1108 @xuuuluuu The issue arises due to the use of an incorrect model prefix in this line
https://github.com/Phil1108/ColBERT/blob/b786e2e7ef1c13bac97a5f8a35aa9e5ff9f3425f/src/model.py#L30

For XLM-RoBERTa (or RoBERTa) models, the prefix should be roberta, not bert. Because of this mismatch, the model parameters are not loaded from the pretrained checkpoint but are instead initialized from scratch.
Here's a related issue: huggingface/transformers#8407
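
For anyone who hits this, here is a minimal sketch of one way to sidestep the prefix mismatch by wrapping AutoModel instead of subclassing a BERT-specific class (illustrative only; the class and argument names below are made up, this is not the actual ColBERT code):

```python
# Hugging Face matches checkpoint weights against the attribute named by
# base_model_prefix ("roberta" for XLM-RoBERTa, "bert" for BERT). If the
# encoder lives under the wrong attribute name, from_pretrained() leaves it
# randomly initialized. Wrapping AutoModel avoids the problem entirely.
import torch.nn as nn
from transformers import AutoModel

class ColBERTEncoder(nn.Module):
    def __init__(self, model_name: str = "xlm-roberta-base", dim: int = 128):
        super().__init__()
        # AutoModel resolves the correct architecture and loads the pretrained
        # weights itself, whether the base model is BERT or (XLM-)RoBERTa.
        self.encoder = AutoModel.from_pretrained(model_name)
        self.linear = nn.Linear(self.encoder.config.hidden_size, dim, bias=False)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        return self.linear(hidden)
```

If you keep the subclassing approach instead, the key point is the same: the encoder attribute must be named roberta so that the checkpoint's state-dict keys line up.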

Phil1108 commented on July 3, 2024

@puppetm4st3r
I have not trained an XLM-RoBERTa model with the solution provided above.
But as you can see in my charts above, I once trained a ColBERT model based on multilingual BERT that worked quite well.
I have not uploaded it anywhere though; I might do that when I have some time in the next couple of weeks.

If it's urgent, just write me a DM.

Phil1108 commented on July 3, 2024

@xuuuluuu
No, I felt like I was stuck in a dead end there, so I stopped working on (XLM-)RoBERTa.

My best guess is still that RoBERTa's missing next-sentence prediction objective is the main problem here (I think some people have also identified that as a reason why RoBERTa is hard to use as an SBERT base model).

I am not saying that it won't work at all, but it will probably need far more time to find a suitable configuration.

You can see my code here: https://github.com/Phil1108/ColBERT/tree/XLM-lr

okhat commented on July 3, 2024

Hi Philipp,

I'm not surprised that RoBERTa is worse than BERT with ColBERT---we do not recommend RoBERTa. As you say, I do suspect it may have to do with next-sentence prediction.

But I'm surprised at how large the difference is. Did you try tuning the learning rate? The default 3e-06 may be too small for RoBERTa.
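
As a rough sketch of what I mean (illustrative values and names only, not ColBERT's actual trainer), a larger peak learning rate with warmup and linear decay could be set up with the transformers scheduler helpers:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, peak_lr=1e-5, warmup_steps=1000):
    # peak_lr and warmup_steps are placeholder values, not recommendations
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, eps=1e-8)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    return optimizer, scheduler

# training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```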

Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.

Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.

okhat commented on July 3, 2024

Seems resolved? Closing but feel free to re-open if you have further questions about this.

Phil1108 commented on July 3, 2024

Hi,
I have tried lots of things, but the loss behaves strangely.

So tuning the learning rate seems to make it worse, even when introducing a learning-rate schedule with linear decay:
[chart]

This would support the hypothesis that RoBERTa doesn't work with the ColBERT approach because of the missing NSP training objective.

However, trying ALBERT (which is trained with a sentence-order prediction objective) leads to results similar to RoBERTa's:
[chart]

Also, this paper, which I think is very close to the ColBERT paper (https://www.aclweb.org/anthology/2020.coling-main.568.pdf), reports an increase in accuracy when using RoBERTa over BERT. (So the missing NSP objective doesn't seem to really affect these passage-ranking tasks.)

One possible explanation would be that RoBERTa needs far more fine-tuning steps (or larger batch sizes) to achieve the same result. So I recently switched to A100 GPUs, increased the batch size from 32 to 150, and now I am waiting for the training and evaluation to finish.

okhat commented on July 3, 2024

> Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.
>
> Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.

I'm curious about your response to these.

Phil1108 commented on July 3, 2024

> Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.

No, I explicitly checked that by printing out the input_ids. (This code is not yet uploaded to my fork because I do the inference and evaluation locally.)
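
For reference, a minimal version of that kind of check (the checkpoint path below is a placeholder):

```python
from transformers import AutoTokenizer

train_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
infer_tok = AutoTokenizer.from_pretrained("path/to/finetuned/checkpoint")  # placeholder path

text = "what is the capital of france?"
# identical input_ids means training and inference tokenize the text the same way
assert train_tok(text)["input_ids"] == infer_tok(text)["input_ids"]
print(train_tok(text)["input_ids"])
```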

Additionally, I changed both the [unused0] and [unused1] tokens to [UNK] and found that it made only a minor difference to BERT's performance. So it can hardly be a tokenizer/vocab problem, I guess.

> Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.
>
> I'm curious about your response to these.

I've done that quick and dirty here: https://github.com/Phil1108/ColBERT/blob/XLM-lr/src/model.py#L27
Skipping these made the performance a little better, but not significantly so.

(Both answers here refer to the 32k checkpoints, not the 400k, due to time limitations.)
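
For context, a rough sketch of what such a punctuation skiplist amounts to (illustrative, not the exact code in the fork):

```python
import string
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# collect the ids that punctuation symbols map to
skiplist = {tid for symbol in string.punctuation
            for tid in tokenizer(symbol, add_special_tokens=False)["input_ids"]}

def punctuation_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # True for tokens to keep, False for punctuation-only tokens
    keep = [[tid not in skiplist for tid in row] for row in input_ids.tolist()]
    return torch.tensor(keep, dtype=torch.bool)

# document embeddings at punctuation positions would then be zeroed out:
# embeddings = embeddings * punctuation_mask(input_ids).unsqueeze(-1)
```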

okhat commented on July 3, 2024

I see. I'd stick to the original ColBERT then, based on BERT and its multilingual variants.

I can't offer a lot of details about multilingual RoBERTa models without trying them myself.

Phil1108 commented on July 3, 2024

Yeah, I see. Multilingual BERTs work very well, even cross-lingually. The loss of accuracy compared with the English-only model is not significant. With the larger batch size on the A100s, I think we can even beat the English-only model.

Sadly, there is still not a single open-sourced large multilingual BERT (it seems to me that Google wants to keep them private). That was the initial reason I had to switch to XLM-RoBERTa (Large) models, because they are nearly the only large multilingual models available.

xuuuluuu commented on July 3, 2024

@Phil1108 I am working on a similar idea with RoBERTa. After adding the unused token to the vocab,

self.tokenizer.add_tokens(["[unused1]"])

the model does not work properly. Have you figured out where the problem is?

xuuuluuu commented on July 3, 2024

@Phil1108
Thanks for the reply. My recent finding is that the RoBERTa model performs worse when the newly added token is critical to the task. It seems that the embedding of the newly added token cannot be learned properly. I am checking the reason...
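
The usual checklist for newly added tokens looks roughly like this (sketched with roberta-base; illustrative, not my exact code):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

num_added = tokenizer.add_tokens(["[unused1]"])
if num_added > 0:
    # the embedding matrix must grow to cover the new id; the new row starts
    # randomly initialized, so it is only useful if it receives gradient updates
    model.resize_token_embeddings(len(tokenizer))
```

If the embedding layer were frozen during fine-tuning, the new token's vector would never move, which could also explain this behaviour.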

Phil1108 commented on July 3, 2024

@xuuuluuu
Ah okay, that could also be the reason; maybe it is the BPE that is not working as intended.

It surprises me a little bit, though, because I have run experiments with BERT where I accidentally assigned the two special tokens incorrectly, so both became [unknown] tokens, and it still produced better results than RoBERTa.
So my overall conclusion was that introducing the tokens boosts performance, but BERT learns very well even without them.

Have you tried getting rid of the special tokens completely?

xuuuluuu commented on July 3, 2024

@Phil1108

Without adding the special tokens, they become "[unknown]" to the RoBERTa model. Since my model relies heavily on those tokens, it is understandable that it performs badly. It is surprising that the BERT model performs better with "[unknown]" than RoBERTa does.

I found that the RoBERTa vocab has three unused tokens: "madeupword0000": 50261, "madeupword0001": 50262, "madeupword0002": 50263. I am not sure whether you can simply use these tokens for your case. In my case, I need many more special tokens.
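
A quick way to check those slots (and to map markers onto them instead of growing the vocab), as an illustrative sketch with the Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

for token in ["madeupword0000", "madeupword0001", "madeupword0002"]:
    tid = tokenizer.convert_tokens_to_ids(token)
    in_vocab = tid != tokenizer.unk_token_id
    print(token, tid, "in vocab" if in_vocab else "NOT in vocab")

# e.g. reuse one of the existing slots as a marker instead of calling add_tokens()
marker_id = tokenizer.convert_tokens_to_ids("madeupword0000")
```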

puppetm4st3r commented on July 3, 2024

@Phil1108 did you resolve the issue? I'm looking for a multilingual model like ColBERT.

hiepxanh commented on July 3, 2024

@Phil1108 my model performs quite badly; can you share your work? Maybe it could help me improve.
