Comments (24)
@JohannesTK, in case you are still interested, I have just added a notebook that demonstrates how we pretrain Longformer starting from the RoBERTa checkpoint. It should be easy to reuse this notebook to pretrain your XLM-R-Long.
The notebook is here: https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb
from longformer.
We don't have plans to implement it for XLM-R, but our procedure for pretraining Longformer starting from the RoBERTa checkpoint (beginning of Section 5) can easily be applied to most other models, including XLM-R. Here's a summary of the steps:
- replace the standard self-attention with LongformerSelfAttention (something like this)
- create a position embedding matrix with the maximum sequence length you want, say 4096
- replicate the position embeddings of the first 512 positions multiple times to fill the 4096 positions (at this point, you have an OK model that works for long sequences; check Table 11, row 7)
- continue MLM pretraining. This is a little more work, but nothing complicated
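The position-embedding step above can be sketched in plain Python. This is only an illustration of the index arithmetic, not the notebook's code: `tile_position_rows` is a hypothetical helper, the rows are string stand-ins for embedding vectors, and the two reserved leading rows follow the RoBERTa convention mentioned in the script posted later in this thread.

```python
def tile_position_rows(old_rows, new_num_positions, reserved=2):
    """Build a longer position table by repeating the usable rows of a shorter one.

    old_rows: existing rows; the first `reserved` rows are kept as they are.
    new_num_positions: number of usable positions wanted (e.g. 4096).
    """
    usable = len(old_rows) - reserved
    if usable <= 0:
        raise ValueError("old table has no usable position rows")
    # Keep the reserved rows, then cycle through the usable rows.
    new_rows = list(old_rows[:reserved])
    for k in range(new_num_positions):
        new_rows.append(old_rows[reserved + k % usable])
    return new_rows


# Stand-in for a (512 + 2)-row position embedding matrix.
old_table = [f"row{i}" for i in range(512 + 2)]
new_table = tile_position_rows(old_table, 4096)
```

With real weights the same loop would copy tensor slices instead of list items; the point is only that new position `k` reuses old position `k % 512`.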
How long does it take to pretrain an XLM-R model?
@samru-rai, as mentioned in the notebook, you can still get a reasonable model even with zero pretraining. Additional pretraining definitely helps but for RoBERTa you get diminishing returns after processing around 800M tokens (around 2 days on a single GPU). With models other than RoBERTa, you will probably see the same general pattern but with different numbers.
@MarkusSagen interesting. Hoping that you will open-source it. If it is of any help, I can provide resources for training it further than 6000 iterations.
@rplawate I've gotten the go-ahead. Send me an email and we can take it from there if you are still interested.
Thanks for the thorough answer & advice!
@ibeltagy, thank you! Will give it a spin.
I tried to do that, but I'm getting an error:
C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py:26: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer: Num examples = 2461
INFO:transformers.trainer: Batch size = 16
Evaluation: 0%| | 0/154 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "", line 149, in
  File "", line 87, in pretrain_and_evaluate
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\trainer.py", line 745, in evaluate
    output = self._prediction_loop(eval_dataloader, description="Evaluation")
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\trainer.py", line 823, in _prediction_loop
    outputs = model(**inputs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_roberta.py", line 231, in forward
    outputs = self.roberta(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 755, in forward
    encoder_outputs = self.encoder(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 433, in forward
    layer_outputs = layer_module(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 370, in forward
    self_attention_outputs = self.attention(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\transformers\modeling_bert.py", line 314, in forward
    self_outputs = self.self(
  File "C:\Users\david\Anaconda3\envs\longformer from roberta\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() takes from 2 to 4 positional arguments but 7 were given
the code:
#%%
import logging
import os
import math
from dataclasses import dataclass, field
from transformers import XLMRobertaForMaskedLM, LongformerTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention
from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import LineByLineTextDataset
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

class XLMRobertaLongForMaskedLM(XLMRobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = LongformerSelfAttention(config, layer_id=i)

def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-large')
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large", model_max_length=max_pos, use_fast=True)
    config = model.config
    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value
        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value
        layer.attention.self = longformer_self_attn
    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

def copy_proj_layers(model):
    for i, layer in enumerate(model.roberta.encoder.layer):
        layer.attention.self.query_global = layer.attention.self.query
        layer.attention.self.key_global = layer.attention.self.key
        layer.attention.self.value_global = layer.attention.self.value
    return model

def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):
    val_dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                        file_path=args.val_datapath,
                                        block_size=tokenizer.max_len)
    if eval_only:
        train_dataset = val_dataset
    else:
        logger.info(f'Loading and tokenizing training data is usually slow: {args.train_datapath}')
        train_dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                              file_path=args.train_datapath,
                                              block_size=tokenizer.max_len)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, eval_dataset=val_dataset, prediction_loss_only=True, )
    eval_loss = trainer.evaluate()
    eval_loss = eval_loss['eval_loss']
    logger.info(f'Initial eval bpc: {eval_loss / math.log(2)}')
    if not eval_only:
        trainer.train(model_path=model_path)
        trainer.save_model()
        eval_loss = trainer.evaluate()
        eval_loss = eval_loss['eval_loss']
        logger.info(f'Eval bpc after pretraining: {eval_loss / math.log(2)}')

@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))
training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    #'script.py',
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '1',  # 2 - 32GB gpu with fp32
    #'--device', 'cuda0',  # one GPU
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'
model_path = f'{training_args.output_dir}/xlm-roberta-large-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)
logger.info(f'Converting roberta-base into roberta-large-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)
logger.info(f'Loading the model from {model_path}')
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large", use_fast=True)
model = XLMRobertaLongForMaskedLM.from_pretrained(model_path)
logger.info(f'Pretraining xlm-roberta-base-{model_args.max_pos} ... ')
training_args.max_steps = 3 ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<
pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
logger.info(f'Copying local projection layers into global projection layers ... ')
model = copy_proj_layers(model)
logger.info(f'Saving model to {model_path}')
model.save_pretrained(model_path)
I don't know which version of HF you have, so I can't be sure, but it looks like the forward function of `BertSelfAttention` here has a different input format compared to `LongformerSelfAttention` here. You can implement a small class around `LongformerSelfAttention` that takes the input from BERT and converts it to the format expected by `LongformerSelfAttention`. We did the same thing when working on converting BART (check here).
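A minimal sketch of such a wrapper class, using stand-ins so it runs without `transformers` installed. `WindowedSelfAttention` is a hypothetical placeholder for an attention module whose `forward` accepts fewer arguments, and the six-argument call order is an assumption modelled on the traceback above, not a statement about any particular HF release.

```python
class WindowedSelfAttention:
    """Hypothetical stand-in: forward takes 2 to 4 positional arguments (incl. self)."""

    def forward(self, hidden_states, attention_mask=None, output_attentions=False):
        return (hidden_states,)


class BertStyleAdapter(WindowedSelfAttention):
    """Accepts the wider BERT-style argument list and forwards only what is understood."""

    def forward(self, hidden_states, attention_mask=None, head_mask=None,
                encoder_hidden_states=None, encoder_attention_mask=None,
                output_attentions=False):
        # Drop head_mask and the encoder_* arguments; pass the rest by keyword.
        return super().forward(hidden_states,
                               attention_mask=attention_mask,
                               output_attentions=output_attentions)


# Six positional arguments plus self make the 7 reported in the error above.
bert_style_args = ("hidden", None, None, None, None, False)

try:
    WindowedSelfAttention().forward(*bert_style_args)
    raw_call_failed = False
except TypeError:
    raw_call_failed = True

adapted_output = BertStyleAdapter().forward(*bert_style_args)
```

Silently dropping `head_mask` is only acceptable when the caller never relies on it; a stricter adapter could assert that the dropped arguments are `None`.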
Thanks for the response! Unfortunately, I'm a newbie in this area. I would love to have the best multilingual model in Longformer, so I'm going to subscribe for any news!
This is a pretty easy issue to fix. Put a breakpoint here, then compare the parameters passed to `self.self(...)` with the arguments expected by `LongformerSelfAttention` here.
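As an alternative to stepping through with a breakpoint, the comparison can be sketched with the standard-library `inspect` module. The two functions below are hypothetical stand-ins for the caller-side and callee-side `forward` signatures; in practice you would pass the real `forward` methods of the two attention classes from your installed version.

```python
import inspect


def caller_style_forward(hidden_states, attention_mask=None, head_mask=None,
                         encoder_hidden_states=None, encoder_attention_mask=None,
                         output_attentions=False):
    """Stand-in for the argument list the surrounding layer passes."""


def callee_style_forward(hidden_states, attention_mask=None, output_attentions=False):
    """Stand-in for the argument list the replacement attention accepts."""


def unexpected_parameters(caller, callee):
    """Names the caller supplies that the callee's signature does not accept."""
    accepted = set(inspect.signature(callee).parameters)
    return [name for name in inspect.signature(caller).parameters
            if name not in accepted]


extra = unexpected_parameters(caller_style_forward, callee_style_forward)
print(extra)
```

The names listed in `extra` are the ones a wrapper would need to drop or translate.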
I tried to rerun the original ipynb, just as a sanity check, and I can't make it run. I tried Python 3.8 and 3.7, installing the requirements with pip install -r requirements.txt in PyCharm with a conda env.
I tried on Google Colab too, with no luck there either.
Take a look here:
https://colab.research.google.com/drive/1skFNZ1pil1YG6mzO8jLGE4L-AASGTN5E?usp=sharing
My main goal is to make it work first and then switch it to use XLMRoberta. I think it will be a simple model change, because XLMRoberta is the same as RoBERTa.
Thank you in advance for your help!
Thanks, @davidhsv, for reporting this. Looks like the recent release of the HF code changed `LongformerSelfAttention` a bit, making it less compatible with `BertSelfAttention`. I will fix the notebook soon and let you know.
Thanks! I really appreciate that :)
Sorry for not being able to contribute more, I'm a recovering java developer learning data science.
Does anyone have an estimate of how long it takes to pretrain an XLM-R model, assuming we start from a checkpoint of XLM-R from HF? https://huggingface.co/transformers/model_doc/xlmroberta.html
@davidhsv, @samru-rai, fixed. Can you please try the notebook again and let me know if you run into any issues.
@ibeltagy regarding diminishing returns on additional pretraining, you mean in terms of improvements on the same pretraining corpus right, not e.g. domain-adaptive pretraining on a different corpus?
@CyndxAI, good point, yes, if you are training the long version + adapting to a new domain, more training will be needed.
Does anyone have an estimate of how long it takes to pretrain an XLM-R model, assuming we start from a checkpoint of XLM-R from HF? https://huggingface.co/transformers/model_doc/xlmroberta.html
For me, training on a single GPU, with the same hyperparameters, for 3000 iterations and with transformers 3.0.2, took 3 days and 11 hours.
I also used fp16 and no gradient checkpointing.
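For a rough sense of scale, the figure above can be turned into a per-iteration cost. This is plain arithmetic on the reported numbers; the projection to 6000 iterations assumes the speed stays constant, which the thread does not confirm.

```python
from datetime import timedelta

# Reported above: 3000 iterations in 3 days and 11 hours on a single GPU.
elapsed = timedelta(days=3, hours=11)
iterations = 3000

seconds_per_iteration = elapsed.total_seconds() / iterations

# Assumption: constant speed, so time scales linearly with iteration count.
target_iterations = 6000
projected = elapsed * (target_iterations / iterations)

print(f"{seconds_per_iteration:.1f} s per iteration")
print(f"projected time for {target_iterations} iterations: {projected}")
```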
@MarkusSagen Is there any chance you could share the pretrained multilingual longformer model and inference code?
@rplawate It depends. I'm doing a master's thesis at a company, investigating whether long context can be transferred to low-resource languages by extending the context of multilingual models and training in English only. If I get permission from the company, then yes, the aim is to release it on Hugging Face. I settled on training for 6000 iterations, but the training and eval loss could be decreased further.
@MarkusSagen interesting. Hoping that you will open-source it. If it is of any help, I can provide resources for training it further than 6000 iterations.
For those still interested, I've made a model initialized with XLM-RoBERTa's weights without further pretraining. The output of the long version should be identical to the original model for input sequences with lengths < 0.5 * attention window size.
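One way to see why short inputs behave like the original model: under a simplified banded-attention picture, where each token attends to `attention_window // 2` neighbours on each side, a short enough sequence is fully covered by the band. The sketch below only models that band; it ignores global attention, padding, and any implementation detail of a real Longformer, and both function names are hypothetical.

```python
def attends(i, j, attention_window):
    """True if token i can attend to token j under a symmetric sliding window."""
    return abs(i - j) <= attention_window // 2


def window_is_full_attention(seq_len, attention_window):
    """True if every token can attend to every other token in the sequence."""
    return all(attends(i, j, attention_window)
               for i in range(seq_len)
               for j in range(seq_len))


short_ok = window_is_full_attention(seq_len=3, attention_window=8)
long_ok = window_is_full_attention(seq_len=6, attention_window=8)
```

Here `short_ok` holds because no two of the 3 tokens are more than 4 positions apart, while the 6-token case has pairs 5 positions apart that fall outside the band.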
As @ibeltagy mentioned earlier, the intermediate model produced by just copying the position embeddings and linear projections is already good enough to be fine-tuned on a downstream task.
The model could also be used as a starting point for pretraining on other languages, like what @MarkusSagen did with the English WikiText-103 corpus.
Variants of the model are available on Hugging Face model hub:
Model | attention_window | hidden_size | num_hidden_layers | model_max_length |
---|---|---|---|---|
base | 256 | 768 | 12 | 16384 |
large | 512 | 1024 | 24 | 16384 |
The notebook for replicating the models is available here: https://github.com/hyperonym/dirge/blob/master/models/xlm-roberta-longformer/convert.ipynb
Instead of swapping the self-attention implementation of RoBERTa, the notebook starts with a blank Longformer and copies the weights into it. This might make it easier to convert other BERTology variants to their long versions.
@peakji Hey! thanks for sharing. When I click the notebook link I get a 404 error. Can you share it again?