
[WIP] LongformerEncoderDecoder about longformer HOT 40 OPEN

allenai avatar allenai commented on August 15, 2024
[WIP] LongformerEncoderDecoder


Comments (40)

patil-suraj avatar patil-suraj commented on August 15, 2024 2

Thank you @ibeltagy
The goal is to handle sequences as long as 32k tokens, so for now I've decided to go ahead with Reformer.

But yes, I'm indeed interested in LongformerEncoderDecoder; since it's already pre-trained, it would be useful for a lot of other tasks. I'll try to give it a shot.


matt-peters avatar matt-peters commented on August 15, 2024 2

For evaluation of Long-BART, I'd recommend a summarization dataset. A document summarization dataset, or one with input contexts much longer than 512 tokens, will show the most benefit from Long-BART vs BART.


patil-suraj avatar patil-suraj commented on August 15, 2024 2

Thank you @matt-peters and @ibeltagy for the suggestions. Then I think PubMed seems to be a good choice. As reported in this paper, the average document length for PubMed is 3016.


ibeltagy avatar ibeltagy commented on August 15, 2024 2

Folks here, I just pushed a fix for a bug in the implementation of LongformerEncoderDecoder that affects all results with batch_size > 1. The bug fix is here, and it fixes the issue mentioned in this conversation. Note that the query.reshape solution will lead to wrong behavior with any batch_size > 1.


ibeltagy avatar ibeltagy commented on August 15, 2024 1

@patil-suraj, @ohmeow, you will most probably need this PR huggingface/transformers#4659, which adds support for gradient checkpointing. It will reduce memory usage by around 5x and bring Longformer's memory usage closer to Reformer's. This PR will take a while to be merged, so for now just cherry-pick the commit before running your experiments.


ibeltagy avatar ibeltagy commented on August 15, 2024 1

Very cool, thanks @patil-suraj. I will review the notebook carefully later, but here are a few quick thoughts:

  • Is 2560 for inference only or with forward/backward? If inference only, make sure to use torch.no_grad(), which can reduce memory by something like 10x depending on the number of layers. If forward/backward, then add something like output.sum().backward() just to get a sense of how much memory/time the full forward/backward pass will take (see the sketch after this list).

  • Please try to run with fp16 with optimization level O2. This should give you a 2x memory saving for training and inference.

  • Gradient checkpointing can reduce memory by something like 5x for training.

  • How long is the maximum output length? If it is short, using LongformerSelfAttention won't help.

  • I am curious how it compares to Reformer if you control for the number of layers, heads, and dimensions. If you already have such a comparison, that would be great.
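To make the inference-only vs. forward/backward comparison concrete, here is a minimal sketch (not from the notebook) of how one could probe peak GPU memory in both cases; model and input_ids are placeholders for whatever long-input model and batch you are profiling:

import torch

def peak_memory_mb(model, input_ids, backward=False):
    # Rough peak-GPU-memory probe for one forward (or forward/backward) pass.
    torch.cuda.reset_peak_memory_stats()
    if backward:
        # Forward/backward: sum the output to get a scalar to backprop through.
        output = model(input_ids)[0]
        output.sum().backward()
    else:
        # Inference only: no_grad avoids storing activations for the backward pass.
        with torch.no_grad():
            model(input_ids)
    return torch.cuda.max_memory_allocated() / 1024 ** 2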


patil-suraj avatar patil-suraj commented on August 15, 2024 1

@ibeltagy I went through the gradient checkpointing commit. It adds checkpointing on the encoder layers, so in the case of BART I'll need to add it on both the encoder and decoder layers. Is this the correct approach?

Thanks !


ibeltagy avatar ibeltagy commented on August 15, 2024 1

Yes, adding it on both the encoder and decoder layers will save the most memory.
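As a rough illustration (a sketch only, not the actual PR's wiring), checkpointing a stack of layers with torch.utils.checkpoint looks something like this. The attribute names and the assumption that each layer returns a tuple whose first element is the hidden states follow BART-style code and are illustrative:

from torch.utils.checkpoint import checkpoint

def run_layers_with_checkpointing(layers, hidden_states, attention_mask):
    # Recompute each layer's activations during the backward pass instead of
    # storing them, trading extra compute for a large memory saving.
    for layer in layers:
        # Assumes the layer returns a tuple whose first element is the hidden states.
        hidden_states = checkpoint(layer, hidden_states, attention_mask)[0]
    return hidden_states

# Applied to both stacks of a BART-like model (illustrative attribute names;
# the decoder layers also take the encoder outputs for cross-attention):
#   run_layers_with_checkpointing(model.model.encoder.layers, hidden, enc_mask)
#   run_layers_with_checkpointing(model.model.decoder.layers, hidden, dec_mask)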


armancohan avatar armancohan commented on August 15, 2024 1

The ArXiv and PubMed datasets in the following paper are both abstractive and long:
https://arxiv.org/pdf/1804.05685.pdf

BigPatent is another abstractive dataset of long documents: https://arxiv.org/pdf/1906.03741.pdf


ibeltagy avatar ibeltagy commented on August 15, 2024 1

There's a preliminary working version in this branch: https://github.com/allenai/longformer/tree/encoderdecoder. Check the readme for instructions.


ibeltagy avatar ibeltagy commented on August 15, 2024 1

Good catch. Thanks, @hannes89. I will try it and let you know but you are probably right just from looking at the code.


ibeltagy avatar ibeltagy commented on August 15, 2024 1

This is facebook/bart-large with no additional pretraining. We are currently working on the additional pretraining.


chandu7077 avatar chandu7077 commented on August 15, 2024 1

I am getting the following error:

Error(s) in loading state_dict for LongformerEncoderDecoderForConditionalGeneration:
size mismatch for model.encoder.embed_positions.weight: copying a param with shape torch.Size([16386, 768]) from checkpoint, the shape in current model is torch.Size([1026, 768]).


ibeltagy avatar ibeltagy commented on August 15, 2024

It depends on many things including the task, the attention configuration (window size, global attention), the GPU memory, memory optimization approaches (fp16 and gradient checkpointing), model design (number of layers, embedding size, number of heads, size of output vocabulary) ...

Can you provide more details about the task and the setting you have in mind?


patil-suraj avatar patil-suraj commented on August 15, 2024

Hello @ibeltagy
I want to use sliding window attention for a summarization task with at least 16k seq len, with the BART model and fp16. Can you give an approximate estimate of the memory usage?


ibeltagy avatar ibeltagy commented on August 15, 2024

I am assuming that the 16k is the input sequence, and that the output sequence is short enough.
BART has three self-attention blocks:

  • encoder (16k x 16k): needs LongformerSelfAttention
  • decoder (output length x output length): n^2 attention
  • decoder (16k x output length): n^2 attention (sliding window attention won't make it any faster)

Assuming the output length is 512, the block "decoder (16k x output length): n^2" is going to take about the same memory as "encoder (16k x 16k): LongformerSelfAttention".

In our experiments, we have only used encoder-only models. For the base model (the size of BERT-base), we know that a seqlen of 16k easily works on a 48GB GPU (no global attention). For the large model, 16k will require gradient checkpointing to fit in the 48GB.

So I would say that with gradient checkpointing, fp16, a 16k input sequence, a short output sequence, no (or very little) global attention, and BART-base model size, it might fit on a 48GB GPU.
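As a back-of-the-envelope check on why the cross-attention block matters as much as the encoder block, here is a rough count of attention-score elements per layer and head under the assumptions above (16k input, 512 output, attention window 512); it ignores everything else in the model:

seq_len, out_len, window = 16 * 1024, 512, 512

encoder_sliding = seq_len * window   # encoder self-attention, sliding window
decoder_self    = out_len * out_len  # decoder self-attention, full n^2
decoder_cross   = seq_len * out_len  # decoder cross-attention, full n^2

print(encoder_sliding, decoder_self, decoder_cross)
# encoder_sliding == decoder_cross == 8388608, while decoder_self is only 262144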


ibeltagy avatar ibeltagy commented on August 15, 2024

@patil-suraj, awesome work implementing many of the missing models of Longformer on the huggingface repo. I really appreciate your work.

I am curious how your summarization experiments are going. I think it would be great if we could implement LongformerEncoderDecoder. Would you be interested in giving it a shot? A relatively easy-to-implement version would be to:

  • assume that the output length is short
  • base it off BART (or T5, you decide)
  • use LongformerSelfAttention in the encoder, but keep using regular n^2 attention for the two self-attention components in the decoder
  • follow this notebook to convert BART's encoder into its long version (skip the pretraining step); see the sketch at the end of this comment

This should give us a reasonable baseline to try. After it works, we can consider

  • how to implement it when the output length is also long
  • doing additional pretraining

Let me know what you think. Thanks.
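For reference, a condensed sketch of what the encoder-only conversion could look like (position-embedding tiling plus swapping in LongformerSelfAttention). The attribute names follow the BART and longformer code of that time, but this is only an illustration: a real conversion script would handle the position-embedding offset more carefully and wrap LongformerSelfAttention so its forward() matches BART's layer interface.

import copy
from transformers import BartForConditionalGeneration
from longformer.longformer import LongformerSelfAttention

def convert_bart_to_long(model_name="facebook/bart-large",
                         max_pos=4096, attention_window=512):
    model = BartForConditionalGeneration.from_pretrained(model_name)
    config = model.config

    # 1) Extend the encoder's learned position embeddings by tiling the
    #    pretrained 1024 positions (BART stores them with a 2-position offset,
    #    which a real conversion script handles more carefully).
    current = model.model.encoder.embed_positions.weight.data
    new_embed = current.new_empty(max_pos + 2, config.d_model)
    k = 0
    while k < new_embed.size(0):
        step = min(current.size(0), new_embed.size(0) - k)
        new_embed[k:k + step] = current[:step]
        k += step
    model.model.encoder.embed_positions.weight.data = new_embed
    config.max_position_embeddings = max_pos

    # 2) Give the config the fields LongformerSelfAttention expects.
    config.attention_window = [attention_window] * config.encoder_layers
    config.attention_dilation = [1] * config.encoder_layers
    config.attention_mode = "sliding_chunks"
    config.autoregressive = False
    config.num_attention_heads = config.encoder_attention_heads
    config.hidden_size = config.d_model
    config.attention_probs_dropout_prob = config.attention_dropout

    # 3) Swap each encoder layer's self-attention for LongformerSelfAttention,
    #    copying the pretrained q/k/v projections into the local and global heads.
    for i, layer in enumerate(model.model.encoder.layers):
        long_attn = LongformerSelfAttention(config, layer_id=i)
        long_attn.query = copy.deepcopy(layer.self_attn.q_proj)
        long_attn.key = copy.deepcopy(layer.self_attn.k_proj)
        long_attn.value = copy.deepcopy(layer.self_attn.v_proj)
        long_attn.query_global = copy.deepcopy(layer.self_attn.q_proj)
        long_attn.key_global = copy.deepcopy(layer.self_attn.k_proj)
        long_attn.value_global = copy.deepcopy(layer.self_attn.v_proj)
        # NOTE: in practice this needs a thin wrapper so BART's encoder layer
        # can call LongformerSelfAttention with its own forward() arguments.
        layer.self_attn = long_attn
    return model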


ohmeow avatar ohmeow commented on August 15, 2024

I'll try to give it a shot.

@patil-suraj ... lmk where you are at, and if you would like assistance. I'm looking to use/build a Longformer BART encoder given the success I've had thus far using BART ... but I also don't want to repeat work that is already underway :)


patil-suraj avatar patil-suraj commented on August 15, 2024

@ohmeow Thank you for your interest. I've started playing with it just now. I'll let you know how it goes.


patil-suraj avatar patil-suraj commented on August 15, 2024

Thanks @ibeltagy I'll try to add that as well.

I've been able to replace the BART encoder's attention with LongformerSelfAttention. I tried it on a 35 GB RAM Colab instance. For bart-large it went well up to seq len 2560, but crashed after that. So I think we'll need LongformerSelfAttention in the decoder as well.

@ohmeow @ibeltagy would you mind taking a look at the notebook to see if there's anything wrong?

Here's the repo and colab


patil-suraj avatar patil-suraj commented on August 15, 2024

Thank you @ibeltagy for the suggestions

  • I was trying inference for a single example on CPU. I forgot to use torch.no_grad(). With torch.no_grad() and seq len 4096, it went well in ~3-4GB of memory.
  • Now I'm trying to add fp16 and gradient checkpointing (see the sketch at the end of this comment).
  • For now I want to experiment with 16k input length and 1024 output length.

Also, I'm wondering if we can create a distilled Longformer starting from distilled RoBERTa, which would give us further memory savings. And can we use the recently introduced movement-pruning technique with Longformer as well?
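For the fp16 part, a minimal sketch of what O2 mixed precision with NVIDIA apex looks like; dataloader and compute_loss are placeholders for the actual data pipeline and seq2seq loss, and this is separate from the gradient-checkpointing changes inside the model:

from apex import amp  # NVIDIA apex, for O1/O2 mixed precision

# model and optimizer are assumed to be constructed already
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

for batch in dataloader:                # placeholder training loop
    loss = compute_loss(model, batch)   # hypothetical helper for the seq2seq loss
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()          # scale the loss to avoid fp16 underflow
    optimizer.step()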


ibeltagy avatar ibeltagy commented on August 15, 2024

for now I want to experiment with 16k input length and 1024 output length

cool. Curious to see how this goes. What is the size of the attention_window you are using? How does it compare to the maximum sequence length of BART? It would be a good idea to match that length, or the resulting bart-long model won't work well.

Also I'm wondering if we can create distilled-longformer starting from distilled-roberta, that will give us further memory saving.

We can already run longformer-large-16384 on today's GPUs. Do we have a task that requires a longer sequence?

movement-pruning

Looks cool. So they zero out a large percentage of the model parameters, which reduces the model size, but how does it reduce memory at run time? Using torch.sparse?


patil-suraj avatar patil-suraj commented on August 15, 2024

For BART the maximum sequence length is 1024, and I used an attention_window size of 512. I haven't carried out any proper evaluation yet. So how much extra memory will it take with attention_window 1024?

We can already run longformer-large-16384 on today's GPUs. Do we have a task that requires a longer sequence

I don't have any specific task in mind, I'm just curious about this

how does it reduce memory at run time? use torch.sparse?

I haven't completely gone through it. But I do want to try this with longformer once I'm done with LongformerEncoderDecoder


ibeltagy avatar ibeltagy commented on August 15, 2024

For BART maximum sequence length is 1024, I used the attention_window size 512. I haven't carried out any proper evaluation yet.

My default would be to use attention_window=1024 but it is worth evaluating a few sizes to see how they work. I am referring to an inference-only evaluation, no training.

So how much extra memory it will take with attention_window 1024 ?

The attention component will take 2x the memory, so the increase in full-model memory usage will be less than 2x.

longformer-large-16384

yeah, we can probably train/release a longformer-large-16384 in the future.

how does it reduce memory at run time? use torch.sparse?
I haven't completely gone through it. But I do want to try this with longformer once I'm done with LongformerEncoderDecoder

Much of the sparsity work reduces the number of model parameters but still uses dense operations. This reduces the model size on disk but doesn't save GPU memory, because activations and gradients are what consume most of the GPU memory, not the model parameters. I will be curious to see how torch.sparse works here.


ibeltagy avatar ibeltagy commented on August 15, 2024

@patil-suraj, another trick to save more memory is to use a different window size for different layers. This will require some evaluation to get the setting right, and it might not work without a bit of additional training.
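Something like the following; the particular window sizes are made up, just to illustrate, and config.attention_window in this codebase is already a per-layer list, so only the values change:

# Hypothetical per-layer window schedule: smaller windows in the lower layers,
# wider windows near the top. The sizes below are illustrative only.
num_layers = 12
config.attention_window = (
    [128] * 4 +   # lower layers: mostly local patterns
    [256] * 4 +   # middle layers
    [512] * 4     # top layers: wider context
)
assert len(config.attention_window) == num_layers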


patil-suraj avatar patil-suraj commented on August 15, 2024

@patil-suraj, another trick to save more memory is to use a different window size for different layers. This will require some evaluation to get the setting right, and it might not work without a bit of additional training.

Cool! First I want to train and evaluate long-bart-4096 with fp16 and gradient accumulation, because 16k is going to take a lot of time and resources.

I'm thinking of training it on TriviaQA in text-to-text format. Would this be a good task to evaluate LongBART?


ibeltagy avatar ibeltagy commented on August 15, 2024

The main problem with TriviaQA is that the dataset is large and takes a very long time to train (12 x 8 GPU hours). This is the reason we didn't use it for the ablations (last table in the paper). WikiHop would be much better because it is smaller, but we didn't release its training/evaluation script. I can send you the WikiHop code we have internally, but it is still using the fairseq model and it has a lot of experimental code, so it might be a bit of work to get it running. The other solution is to train/evaluate on a subset of TriviaQA, say 10%, or for a limited number of gradient updates, say half an epoch. In either case, you will still need to get your own baseline numbers because your results won't be comparable to the published ones.


ibeltagy avatar ibeltagy commented on August 15, 2024

This would be a good dataset if there's an abstractive version of it. As far as I can tell, the one discussed in this paper is extractive with a yes/no classification for each sentence.


ibeltagy avatar ibeltagy commented on August 15, 2024

@Adrian-1234, check this issue for the WIP on summarization.


fabrahman avatar fabrahman commented on August 15, 2024

Hi,
Is there any update on the LongformerEncoderDecoder? I wonder if there is any notebook that I can follow or start from to fine-tune on some long documents for abstractive generation?

Thanks


hannes89 avatar hannes89 commented on August 15, 2024

Hi @ibeltagy !

I've been following this issue and the summarization one in the transformers repo, huggingface/transformers#4406, and I've managed to fine-tune the LongformerEncoderDecoder for summarization.

I noticed that global attention doesn't work with the BART encoder directly due to this line. I was wondering if a change like this makes sense (for use with Longformer attention):

From:
attention_mask = invert_mask(attention_mask)

To:
attention_mask = (1.0 - attention_mask)

Thanks!


hannes89 avatar hannes89 commented on August 15, 2024

Thanks! I noticed it when trying to run with multiple GPUs, using find_unused_parameters=False for gradient_checkpointing=True: none of the global attention parameters contributed to the loss (tested as you suggested in #63 (comment)).


ibeltagy avatar ibeltagy commented on August 15, 2024

@hannes89, I just pushed a fix in my fork of the HF repo (which you also need for gradient checkpointing).


patil-suraj avatar patil-suraj commented on August 15, 2024

@ibeltagy, how was this checkpoint https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/longformer-encdec-large-12288.tar.gz trained, i.e. for summarization or using BART's denoising pre-training approach?


ibeltagy avatar ibeltagy commented on August 15, 2024

I just pushed a fix for another bug, also related to batch_size > 1.


ibeltagy avatar ibeltagy commented on August 15, 2024

Update - made many changes that include:

  • Changes related to gradient checkpointing (my branch of HF/transformers)
  • Fixed small issues in the model implementation
  • Fixed some problems in the checkpoints (check the readme for the base and large ones)
  • Added a summarization training script (check the readme for the path)


hannes89 avatar hannes89 commented on August 15, 2024

Thanks for the update, very interesting work!

I saw that you mention global attention in the script, and I was wondering if it is used mainly so that the model uses all of its parameters, or if you think it will help with summarization?

I tried before to remove the global layers from LongformerSelfAttention and run with just local attention, and that seemed to work better.

I was also curious whether you've tried to do the same for Pegasus (drop in the Longformer attention)? I saw that you refer to it in the summarization script, and that the model inherits directly from BartForConditionalGeneration in the transformers repository (except that the model checkpoints use SinusoidalPositionalEmbedding).


ibeltagy avatar ibeltagy commented on August 15, 2024

@hannes89

global attention

You are right, it is only there so that the model uses all of its parameters, which is needed for gradient checkpointing to work.

I tried before to remove the global layers from LongformerSelfAttention and running with just local attention and that seemed to work better.

Better as in faster/less memory, or better as in better results? I would expect the only difference to be a slight reduction in memory usage.

Pegasus

This is something on my todo list but I haven't tried it yet. There are some issues with Pegasus optimization that need to be figured out.


ibeltagy avatar ibeltagy commented on August 15, 2024

@chandu7077, please make sure you install the HF/transformers fork specified in requirements.txt.


siddagra avatar siddagra commented on August 15, 2024

We can already run longformer-large-16384 on today's GPUs. Do we have a task that requires a longer sequence?

It is pretty impractical to run these on consumer GPUs, as most do not have the 18GB+ of VRAM that is required even for inference with a batch size of 1. So a distilled version would definitely be useful.

And loading up a cloud instance just to run inference on a couple hundred documents becomes impractical with the high 18GB+ requirement for a batch size of 1. Even if it were reduced to less than 16GB, that would suffice, as it could at least run on Colab/Kaggle GPUs.

Not to mention that there are indeed some documents longer than 16384, and fine-tuning is almost impossible unless run on pricey cloud servers.

