Comments (16)

surajnair04 commented on July 3, 2024

@Phil1108 @xuuuluuu The issue arises due to the use of an incorrect model prefix in this line
https://github.com/Phil1108/ColBERT/blob/b786e2e7ef1c13bac97a5f8a35aa9e5ff9f3425f/src/model.py#L30

For XLM-RoBERTa (or RoBERTa) models, the prefix should be roberta, not bert. Because of this mismatch, the model parameters are not loaded from the pretrained checkpoint but are instead initialized from scratch.
Here's a related issue: huggingface/transformers#8407
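
For anyone who hits this, here is a minimal sketch of one way to sidestep the prefix mismatch by wrapping AutoModel instead of subclassing a BERT-specific class (illustrative only; the class and argument names below are made up, this is not the actual ColBERT code):

```python
# Hugging Face matches checkpoint weights against the attribute named by
# base_model_prefix ("roberta" for XLM-RoBERTa, "bert" for BERT). If the
# encoder lives under the wrong attribute name, from_pretrained() leaves it
# randomly initialized. Wrapping AutoModel avoids the problem entirely.
import torch.nn as nn
from transformers import AutoModel

class ColBERTEncoder(nn.Module):
    def __init__(self, model_name: str = "xlm-roberta-base", dim: int = 128):
        super().__init__()
        # AutoModel resolves the correct architecture and loads the pretrained
        # weights itself, whether the base model is BERT or (XLM-)RoBERTa.
        self.encoder = AutoModel.from_pretrained(model_name)
        self.linear = nn.Linear(self.encoder.config.hidden_size, dim, bias=False)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        return self.linear(hidden)
```

If you keep the subclassing approach instead, the key point is the same: the encoder attribute must be named roberta so that the checkpoint's state-dict keys line up.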

Phil1108 commented on July 3, 2024

@puppetm4st3r
I have not trained an XLM-RoBERTa model with the solution provided above.
But as you can see in my charts above, I once trained a ColBERT model based on multilingual BERT that worked quite well.
I have not uploaded it anywhere though; I might do that when I have some time in the next couple of weeks.

If it's urgent, just write me a DM.

Phil1108 commented on July 3, 2024

@xuuuluuu
No, I felt like I was stuck in a dead end there, so I stopped working on (XLM-)RoBERTa.

My best guess is still that RoBERTa's missing next-sentence prediction objective is the main problem here (I think some people have also identified that as a reason why RoBERTa is hard to use as an SBERT base model).

I am not saying that it won't work at all, but it will probably need far more time to find a suitable configuration.

You can see my code here: https://github.com/Phil1108/ColBERT/tree/XLM-lr

okhat commented on July 3, 2024

Hi Philipp,

I'm not surprised that RoBERTa is worse than BERT with ColBERT---we do not recommend RoBERTa. As you say, I do suspect it may have to do with next-sentence prediction.

But I'm surprised at how large the difference is. Did you try tuning the learning rate? The default 3e-06 may be too small for RoBERTa.
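
As a rough sketch of what I mean (illustrative values and names only, not ColBERT's actual trainer), a larger peak learning rate with warmup and linear decay could be set up with the transformers scheduler helpers:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, peak_lr=1e-5, warmup_steps=1000):
    # peak_lr and warmup_steps are placeholder values, not recommendations
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, eps=1e-8)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    return optimizer, scheduler

# training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```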

Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.

Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.

okhat commented on July 3, 2024

Seems resolved? Closing but feel free to re-open if you have further questions about this.

Phil1108 commented on July 3, 2024

Hi,
I have tried lots of things, but the loss behaves strangely.

So tuning the learning rate seems to make it worse, even when introducing a learning-rate schedule with linear decay:
[chart]

This would support the hypothesis that RoBERTa doesn't work with the ColBERT approach because of the missing NSP training objective.

However, trying ALBERT (which is trained with a sentence-order prediction objective) leads to results similar to RoBERTa's:
[chart]

Also, this paper, which I think is very close to the ColBERT paper (https://www.aclweb.org/anthology/2020.coling-main.568.pdf), reports an increase in accuracy when using RoBERTa over BERT. (So the missing NSP objective doesn't seem to really affect these passage-ranking tasks.)

One possible explanation would be that RoBERTa needs far more fine-tuning steps (or larger batch sizes) to achieve the same result. So I recently switched to A100 GPUs, increased the batch size from 32 to 150, and now I am waiting for the training and evaluation to finish.

okhat commented on July 3, 2024

> Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.
>
> Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.

I'm curious about your response to these.

Phil1108 commented on July 3, 2024

> Also, when you modify the tokenizer, do you do this in training only? Or is it applied for inference as well? You want to be sure the model is not using the wrong tokenizer later.

No, I explicitly checked that by printing out the input_ids. (This code is not yet uploaded to my fork because I do the inference and evaluation locally.)
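
For reference, a minimal version of that kind of check (the checkpoint path below is a placeholder):

```python
from transformers import AutoTokenizer

train_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
infer_tok = AutoTokenizer.from_pretrained("path/to/finetuned/checkpoint")  # placeholder path

text = "what is the capital of france?"
# identical input_ids means training and inference tokenize the text the same way
assert train_tok(text)["input_ids"] == infer_tok(text)["input_ids"]
print(train_tok(text)["input_ids"])
```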

Additionally, I changed both the [unused0] and [unused1] tokens to [UNK] and found that it made only a minor difference to BERT's performance. So it can hardly be a tokenizer/vocab problem, I guess.

> Lastly, when trying a new model you may want to avoid skipping punctuation just to be sure.
>
> I'm curious about your response to these.

I've done that quick and dirty here: https://github.com/Phil1108/ColBERT/blob/XLM-lr/src/model.py#L27
Skipping these made the performance a little better, but not significantly so.

(Both answers here refer to the 32k checkpoints, not the 400k, due to time limitations.)
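
For context, a rough sketch of what such a punctuation skiplist amounts to (illustrative, not the exact code in the fork):

```python
import string
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# collect the ids that punctuation symbols map to
skiplist = {tid for symbol in string.punctuation
            for tid in tokenizer(symbol, add_special_tokens=False)["input_ids"]}

def punctuation_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # True for tokens to keep, False for punctuation-only tokens
    keep = [[tid not in skiplist for tid in row] for row in input_ids.tolist()]
    return torch.tensor(keep, dtype=torch.bool)

# document embeddings at punctuation positions would then be zeroed out:
# embeddings = embeddings * punctuation_mask(input_ids).unsqueeze(-1)
```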

okhat commented on July 3, 2024

I see. I'd stick to the original ColBERT then, based on BERT and its multilingual variants.

I can't offer a lot of details about multilingual RoBERTa models without trying them myself.

Phil1108 commented on July 3, 2024

Yeah, I see. Multilingual BERTs work very well, even cross-lingually. The loss of accuracy compared with the English-only model is not significant. With the larger batch size on the A100s, I think we can even beat the English-only model.

Sadly, there is still not a single open-sourced large multilingual BERT (it seems to me that Google wants to keep them private). That was the initial reason I had to switch to XLM-RoBERTa (Large) models, because they are nearly the only large multilingual models available.

xuuuluuu commented on July 3, 2024

@Phil1108 I am working on a similar idea with RoBERTa. After adding the unused token to the vocab,

self.tokenizer.add_tokens(["[unused1]"])

the model does not work properly. Have you figured out where the problem is?

xuuuluuu commented on July 3, 2024

@Phil1108
Thanks for the reply. My recent finding is that the RoBERTa model performs worse when the newly added token is critical to the task. It seems that the embedding of the newly added token cannot be learned properly. I am checking the reason...
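
The usual checklist for newly added tokens looks roughly like this (sketched with roberta-base; illustrative, not my exact code):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

num_added = tokenizer.add_tokens(["[unused1]"])
if num_added > 0:
    # the embedding matrix must grow to cover the new id; the new row starts
    # randomly initialized, so it is only useful if it receives gradient updates
    model.resize_token_embeddings(len(tokenizer))
```

If the embedding layer were frozen during fine-tuning, the new token's vector would never move, which could also explain this behaviour.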

Phil1108 commented on July 3, 2024

@xuuuluuu
Ah okay, that could also be the reason; maybe it is the BPE that is not working as intended.

It surprises me a little bit, though, because I have run experiments with BERT where I accidentally assigned the two special tokens incorrectly, so both became [unknown] tokens, and it still produced better results than RoBERTa.
So my overall conclusion was that introducing the tokens boosts performance, but BERT learns very well even without them.

Have you tried getting rid of the special tokens completely?

xuuuluuu commented on July 3, 2024

@Phil1108

Without adding the special tokens, they become "[unknown]" to the RoBERTa model. Since my model relies heavily on those tokens, it is understandable that it performs badly. It is surprising that the BERT model performs better with "[unknown]" than RoBERTa does.

I found that the RoBERTa vocab has three unused tokens: "madeupword0000": 50261, "madeupword0001": 50262, "madeupword0002": 50263. I am not sure whether you can simply use these tokens for your case. In my case, I need many more special tokens.
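
A quick way to check those slots (and to map markers onto them instead of growing the vocab), as an illustrative sketch with the Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

for token in ["madeupword0000", "madeupword0001", "madeupword0002"]:
    tid = tokenizer.convert_tokens_to_ids(token)
    in_vocab = tid != tokenizer.unk_token_id
    print(token, tid, "in vocab" if in_vocab else "NOT in vocab")

# e.g. reuse one of the existing slots as a marker instead of calling add_tokens()
marker_id = tokenizer.convert_tokens_to_ids("madeupword0000")
```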

puppetm4st3r commented on July 3, 2024

@Phil1108 did you resolve the issue? I'm looking for a multilingual model like ColBERT.

hiepxanh commented on July 3, 2024

@Phil1108 my model performs quite badly; can you share your work? Maybe it could help me improve.
