
XLM-R support (allenai/longformer) · 24 comments · Open

allenai commented on August 15, 2024
XLM-R support


Comments (24)

ibeltagy commented on August 15, 2024

@JohannesTK, in case you are still interested, I have just added a notebook that demonstrates how we pretrain Longformer starting from the RoBERTa checkpoint. It should be easy to reuse this notebook to pretrain your XLM-R-Long.
The notebook is here: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb

ibeltagy commented on August 15, 2024

We don't have plans to implement it for XLM-R, but our procedure to pretrain Longformer starting from the RoBERTa checkpoint (beginning of Section 5) can easily be applied to most other models (including XLM-R). Here's a summary of the steps:

  • replace the standard self-attention with LongformerSelfAttention (something like this)
  • create a position embedding matrix with the maximum sequence length you want, say 4096
  • replicate the position embeddings of the first 512 positions multiple times to fill the 4096 positions (at this point, you have an OK model that works for long sequences; check Table 11, row 7). A minimal sketch of this step follows the list.
  • continue MLM pretraining. This is a little bit more work but nothing complicated.
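
A minimal sketch of the position-embedding step (variable names are illustrative, using roberta-base for concreteness; the full procedure is in the convert_model_to_long notebook and in the code posted later in this thread):

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained('roberta-base')
old_embed = model.roberta.embeddings.position_embeddings.weight  # shape (514, hidden_size)
max_pos = 4096 + 2  # RoBERTa reserves positions 0 and 1
new_embed = old_embed.new_empty(max_pos, old_embed.size(1))
# copy the 512 trained position embeddings repeatedly until the new matrix is full
k, step = 2, old_embed.size(0) - 2
while k < max_pos - 1:
    new_embed[k:k + step] = old_embed[2:]
    k += step
model.roberta.embeddings.position_embeddings.weight.data = new_embed
model.config.max_position_embeddings = max_pos  # keep the config in sync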

ibeltagy commented on August 15, 2024

how long it takes to pretrain XLM-R model?

@samru-rai, as mentioned in the notebook, you can still get a reasonable model even with zero pretraining. Additional pretraining definitely helps but for RoBERTa you get diminishing returns after processing around 800M tokens (around 2 days on a single GPU). With models other than RoBERTa, you will probably see the same general pattern but with different numbers.

MarkusSagen commented on August 15, 2024

@MarkusSagen Interesting. Hoping that you will open-source it. If it is of any help, I can provide resources for training it beyond 6000 iterations.

@rplawate I've gotten the go-ahead. Send me an email and we can take it from there if you're still interested.

JohannesTK commented on August 15, 2024

Thanks for the thorough answer & advice!

JohannesTK commented on August 15, 2024

@ibeltagy, thank you! Will give it a spin.

davidhsv commented on August 15, 2024

I tried to do that, but I'm getting an error:

C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py:26: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer: Num examples = 2461
INFO:transformers.trainer: Batch size = 16
Evaluation: 0%| | 0/154 [00:01<?, ?it/s]
Traceback (most recent call last):
File "", line 149, in
File "", line 87, in pretrain_and_evaluate
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\trainer.py", line 745, in evaluate
output = self._prediction_loop(eval_dataloader, description="Evaluation")
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\trainer.py", line 823, in _prediction_loop
outputs = model(**inputs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch_utils.py", line 395, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_roberta.py", line 231, in forward
outputs = self.roberta(
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 755, in forward
encoder_outputs = self.encoder(
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 433, in forward
layer_outputs = layer_module(
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 370, in forward
self_attention_outputs = self.attention(
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 314, in forward
self_outputs = self.self(
File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
TypeError: forward() takes from 2 to 4 positional arguments but 7 were given

davidhsv commented on August 15, 2024

the code:

#%%
import logging
import os
import math
from dataclasses import dataclass, field
from transformers import XLMRobertaForMaskedLM, LongformerTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention
from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import LineByLineTextDataset

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


class XLMRobertaLongForMaskedLM(XLMRobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the modeling_bert.BertSelfAttention object with LongformerSelfAttention
            layer.attention.self = LongformerSelfAttention(config, layer_id=i)


def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-large')
    tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', model_max_length=max_pos, use_fast=True)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0, 1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        # global attention projections start as copies of the local ones
        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer


def copy_proj_layers(model):
    for i, layer in enumerate(model.roberta.encoder.layer):
        layer.attention.self.query_global = layer.attention.self.query
        layer.attention.self.key_global = layer.attention.self.key
        layer.attention.self.value_global = layer.attention.self.value
    return model


def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):
    val_dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                        file_path=args.val_datapath,
                                        block_size=tokenizer.max_len)
    if eval_only:
        train_dataset = val_dataset
    else:
        logger.info(f'Loading and tokenizing training data is usually slow: {args.train_datapath}')
        train_dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                              file_path=args.train_datapath,
                                              block_size=tokenizer.max_len)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, eval_dataset=val_dataset, prediction_loss_only=True)

    eval_loss = trainer.evaluate()
    eval_loss = eval_loss['eval_loss']
    logger.info(f'Initial eval bpc: {eval_loss / math.log(2)}')

    if not eval_only:
        trainer.train(model_path=model_path)
        trainer.save_model()

        eval_loss = trainer.evaluate()
        eval_loss = eval_loss['eval_loss']
        logger.info(f'Eval bpc after pretraining: {eval_loss / math.log(2)}')


@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})


parser = HfArgumentParser((TrainingArguments, ModelArgs,))

training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    # 'script.py',
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '1',  # 2 for a 32GB GPU with fp32
    # '--device', 'cuda0',  # one GPU
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

model_path = f'{training_args.output_dir}/xlm-roberta-large-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting xlm-roberta-large into xlm-roberta-large-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

logger.info(f'Loading the model from {model_path}')
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', use_fast=True)
model = XLMRobertaLongForMaskedLM.from_pretrained(model_path)

logger.info(f'Pretraining xlm-roberta-large-{model_args.max_pos} ... ')

training_args.max_steps = 3  ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<

pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)

logger.info(f'Copying local projection layers into global projection layers ... ')
model = copy_proj_layers(model)
logger.info(f'Saving model to {model_path}')
model.save_pretrained(model_path)

ibeltagy commented on August 15, 2024

I don't know which version of HF you have, so I can't be sure, but it looks like the forward function of BertSelfAttention here has a different input format compared to LongformerSelfAttention here. You can implement a small class around LongformerSelfAttention that takes the input from BERT and converts it to the format expected by LongformerSelfAttention; something like the sketch below. We did the same thing when working on converting BART (check here).
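
For illustration, a rough sketch of such a wrapper. The class name is made up, and the exact forward() signatures differ between transformers releases, so the argument list below is inferred from the error above rather than taken from the notebook:

from transformers.modeling_longformer import LongformerSelfAttention

class LongformerSelfAttentionWrapper(LongformerSelfAttention):
    # Accept the positional arguments BertSelfAttention receives in this
    # transformers version, but pass on only the ones LongformerSelfAttention
    # understands; the remaining BERT-style arguments are ignored here.
    def forward(self, hidden_states, attention_mask=None, head_mask=None,
                encoder_hidden_states=None, encoder_attention_mask=None,
                output_attentions=False):
        return super().forward(hidden_states,
                               attention_mask=attention_mask,
                               output_attentions=output_attentions)

You would then assign LongformerSelfAttentionWrapper (instead of LongformerSelfAttention) to layer.attention.self when converting the model.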

davidhsv commented on August 15, 2024

Thanks for the response! Unfortunately, I'm a newbie in this area. I would love to have the best multilingual model as a Longformer, so I'll subscribe for any news!

ibeltagy commented on August 15, 2024

This is a pretty easy issue to fix. Put a breakpoint here, then compare the parameters passed to self.self(...) with the arguments expected by LongformerSelfAttention here.

davidhsv commented on August 15, 2024

I tried to rerun the original pynb, just for mental healthiness, and I can't make it run. I tried in python 3.8 and 3.7, installing the pip install -r requirements.txt file on pycharm with conda env.
I tried on google colab too, no luck there too.

Take a look here:
https://colab.research.google.com/drive/1skFNZ1pil1YG6mzO8jLGE4L-AASGTN5E?usp=sharing

My main goal is to make it work first and then switch it to XLM-RoBERTa. I think it will be a simple model change, because XLM-RoBERTa has the same architecture as RoBERTa.

Thank you for your help in advance!

ibeltagy commented on August 15, 2024

Thanks, @davidhsv, for reporting this. It looks like the recent release of the HF code changed LongformerSelfAttention a bit, making it less compatible with BertSelfAttention. I will fix the notebook soon and let you know.

davidhsv commented on August 15, 2024

Thanks! I really appreciate that :)

Sorry for not being able to contribute more; I'm a recovering Java developer learning data science.

samru-rai commented on August 15, 2024

Does anyone have an approximation of how long it takes to pretrain an XLM-R model? Assuming pretraining starts from a checkpoint of XLM-R from HF https://huggingface.co/transformers/model_doc/xlmroberta.html

ibeltagy commented on August 15, 2024

@davidhsv, @samru-rai, fixed. Can you please try the notebook again and let me know if you run into any issues?

CyndxAI commented on August 15, 2024

@ibeltagy Regarding diminishing returns on additional pretraining, you mean in terms of improvements on the same pretraining corpus, right, not e.g. domain-adaptive pretraining on a different corpus?

ibeltagy commented on August 15, 2024

@CyndxAI, good point, yes, if you are training the long version + adapting to a new domain, more training will be needed.

MarkusSagen commented on August 15, 2024

Does anyone have an approximation of how long it takes to pretrain an XLM-R model? Assuming pretraining starts from a checkpoint of XLM-R from HF https://huggingface.co/transformers/model_doc/xlmroberta.html

For me, training on a single GPU with the same hyperparameters, for 3000 iterations and with transformers 3.0.2, took 3 days and 11 hours.

Also used fp16 and no gradient checkpointing.

rplawate commented on August 15, 2024

Does anyone have an approximation of how long it takes to pretrain an XLM-R model? Assuming pretraining starts from a checkpoint of XLM-R from HF https://huggingface.co/transformers/model_doc/xlmroberta.html

For me, training on a single GPU with the same hyperparameters, for 3000 iterations and with transformers 3.0.2, took 3 days and 11 hours.

Also used fp16 and no gradient checkpointing.

@MarkusSagen Is there any chance you could share the pretrained multilingual longformer model and inference code?

MarkusSagen commented on August 15, 2024

@rplawate It depends. I'm doing a master's thesis at a company, investigating whether long context can be transferred to low-resource languages by extending the context of multilingual models and training in English only. If I get permission from the company, then yes, the aim is to release it on Hugging Face. I settled on training for 6000 iterations, but the training and eval loss could be decreased further.

rplawate commented on August 15, 2024

@MarkusSagen Interesting. Hoping that you will open-source it. If it is of any help, I can provide resources for training it beyond 6000 iterations.

peakji commented on August 15, 2024

For those still interested, I've made a model initialized with XLM-RoBERTa's weights without further pretraining. The output of the long version should be identical to the original model for input sequences with lengths < 0.5 * attention window size.

As @ibeltagy mentioned earlier, the intermediate model produced by just copying the position embeddings and linear projections is already good enough to be fine-tuned on a downstream task.

The model could also be used as a starting point for pretraining on other languages, like what @MarkusSagen did with the English WikiText-103 corpus.

Variants of the model are available on the Hugging Face model hub:

Model   attention_window   hidden_size   num_hidden_layers   model_max_length
base    256                768           12                  16384
large   512                1024          24                  16384

And the notebook for replicating the models is available here: https://github.com/hyperonym/dirge/blob/master/models/xlm-roberta-longformer/convert.ipynb. Instead of swapping the self-attention implementation of RoBERTa, the notebook starts with a blank Longformer and copies the weights into it, which may be an easier route for converting other BERTology variants to their long versions.
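
For reference, a very rough sketch of that "blank Longformer, then copy weights" idea. The config values, hub IDs, and key mapping below are illustrative assumptions, not taken from convert.ipynb, and the position embeddings would still need the replication step discussed earlier in the thread:

from transformers import LongformerConfig, LongformerForMaskedLM, XLMRobertaForMaskedLM

src = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-base')
config = LongformerConfig.from_pretrained('allenai/longformer-base-4096',
                                          vocab_size=src.config.vocab_size)
dst = LongformerForMaskedLM(config)  # blank Longformer with matching dimensions

src_sd, dst_sd = src.state_dict(), dst.state_dict()
for name, tensor in src_sd.items():
    # module layouts mirror each other, only the base-model prefix differs
    target = name.replace('roberta.', 'longformer.')
    if target in dst_sd and dst_sd[target].shape == tensor.shape:
        dst_sd[target] = tensor.clone()
for i in range(config.num_hidden_layers):
    prefix = f'longformer.encoder.layer.{i}.attention.self'
    for proj in ('query', 'key', 'value'):
        # global attention projections start as copies of the local ones
        dst_sd[f'{prefix}.{proj}_global.weight'] = dst_sd[f'{prefix}.{proj}.weight'].clone()
        dst_sd[f'{prefix}.{proj}_global.bias'] = dst_sd[f'{prefix}.{proj}.bias'].clone()
dst.load_state_dict(dst_sd)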

ricardorei commented on August 15, 2024

@peakji Hey! Thanks for sharing. When I click the notebook link I get a 404 error. Can you share it again?
