
unlimiformer's Introduction

Unlimiformer: Long-Range Transformers with Unlimited Length Input (NeurIPS 2023)

(Figure: Unlimiformer architecture diagram)

This is the official implementation of the paper:

Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley:
Unlimiformer: Long-Range Transformers with Unlimited Length Input (to appear in NeurIPS 2023)

Unlimiformer is a method for augmenting pretrained encoder-decoder models with retrieval-based attention, without changing the mathematical definition of attention. This allows the use of unlimited length inputs with any pretrained encoder-decoder!
See also our Tweet.

Unlimiformer can be used to improve the performance of an already-trained model. For best results, the model can be trained with Unlimiformer training.

If you have any questions on this work, please open a GitHub issue or email the authors at [email protected], [email protected]

October 2023 - Unlimiformer will appear at NeurIPS 2023!

August 2023 - Unlimiformer now supports Llama-2 (and all its derivatives)!

To prompt Llama-2 with extremely long inputs, for example, the content of an entire book, use:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 1 --datastore_device 1 
  • The final prompt is the concatenation of the contents of the flags --prefix, --prompt, and --suffix (see the illustrative sketch after this list).
  • The flag --prompt may contain either a path to a text file (e.g., example_inputs/harry_potter_full.txt) or the concrete prompt string.
  • The flag --test_unlimiformer is required to enable Unlimiformer.
  • The flag --length determines the desired output length.
  • The flag --layer_begin determines the layer from which Unlimiformer will start to be applied. For example, if we set --layer_begin 20, the first 20 layers of the model will perform the standard attention over the last context_window_size tokens of the prompt as usual, and the 21st layer and above will attend to the entire long input. From our initial experiments, the value of --layer_begin should be more than half of the total number of layers in the model, and tuning it dramatically changes the quality of the output.
  • The flags: --datastore_device N and --index_devices N1 N2 N3 ... specify on which GPUs to store Unlimiformer's datastore and index (the base model will be stored on GPU #0).
  • Add the flag --stream_output to make the generated tokens appear one by one as they are generated.
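
To make the prefix/prompt/suffix handling above concrete, here is an illustrative sketch (not the actual code in run_generation.py) of how the final prompt is assembled when --prompt points at a text file:

# Illustrative sketch only -- run_generation.py's real logic may differ.
# The final prompt is --prefix + (contents or value of --prompt) + --suffix.
import os

def build_prompt(prefix: str, prompt: str, suffix: str) -> str:
    # If --prompt is a path to a text file, its contents become the prompt body;
    # otherwise the string itself is used verbatim.
    if os.path.isfile(prompt):
        with open(prompt, encoding="utf-8") as f:
            prompt = f.read()
    return prefix + prompt + suffix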

Getting Started

General Instructions

Copy the files from src into your source code folder.

You'll need to set values for the Unlimiformer-specific arguments outlined in usage.py - you can add these arguments wherever you usually process hyperparameters. To use the model, you must set test_unlimiformer=True. For datastore usage, the model must be in evaluation mode (i.e., call model.eval() before inference).

inference-example.py outlines a minimal example for running a sequence through an Unlimiformer model, using the default arguments.
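
For orientation, the setup in inference-example.py looks roughly like the sketch below. Treat it as a hedged illustration rather than the definitive API: the exact argument names and the signature of the conversion call are defined in usage.py and unlimiformer.py, so check those files (and inference-example.py itself) for the authoritative version.

# Rough sketch of an Unlimiformer inference setup; the kwargs and the exact
# Unlimiformer.convert_model signature are assumptions -- see usage.py.
from transformers import AutoTokenizer, BartForConditionalGeneration
from unlimiformer import Unlimiformer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

unlimiformer_kwargs = dict(test_unlimiformer=True)  # plus any other Unlimiformer hyperparameters
model = Unlimiformer.convert_model(model, **unlimiformer_kwargs)
model.eval()  # datastore usage requires evaluation mode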

run.py is an example of a full training setup that integrates Unlimiformer, adapted from SLED. See the full command lines below.

Reproducing the Experiments from the Paper - Command Lines

To run a standard finetuning + evaluation of BART-base on the GovReport dataset (as an example), use:

python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --model_name_or_path facebook/bart-base \
    --max_source_length 1024 \
    --eval_max_source_length 1024 --do_eval=True \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore
  • To use Unlimiformer at training time (called "Retrieval training" in the paper), use: --unlimiformer_training --max_source_length 16384
    • In this case, you might want to use Unlimiformer at test/validation time as well, by also passing: --test_unlimiformer --eval_max_source_length 999999
  • Alternatively, to use the computationally cheaper "Random-encoded" at training time, use --random_unlimiformer_training --max_source_length 16384
  • To alternate between "retrieval training" and "random-encoded training", use both flags: --unlimiformer_training --random_unlimiformer_training --max_source_length 16384

For additional flags and options, see usage.py

Recommended settings

To evaluate with Unlimiformer

At evaluation time, we recommend the default value for each setting.

To train with Unlimiformer

For an inexpensive method, we recommend training as usual and using Unlimiformer during early stopping. To do so, set knn=True and leave all other values at default.

For best performance, there are 3 expensive settings for training. The best one varies by dataset.

  1. Set random_unlimiformer_training=True: this is the random-encoded training setting from the paper
  2. Set unlimiformer_training=True: this is the retrieval training setting from the paper
  3. Set random_unlimiformer_training=True AND unlimiformer_training=True: this is the alternating training setting from the paper

See Table 5 in the paper for a more detailed breakdown of relative training costs.

Tips for very large inputs

For training

  • You may need to truncate your inputs at training time, e.g. to 8k or 16k tokens. You can use the full inputs at evaluation time.
  • You can also try splitting your inputs into 16k-token chunks and training on each chunk as its own example (see the sketch below).
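
A minimal sketch of that chunk splitting, assuming a Hugging Face tokenizer (illustrative only, not code from this repository):

# Illustrative only: split one long document into independent ~16k-token training examples.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

def split_into_chunks(text, chunk_size=16384):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + chunk_size]) for i in range(0, len(ids), chunk_size)]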

For evaluation (including early stopping)

  • If you're consistently running out of CUDA memory, set use_datastore=True to use a Faiss datastore to store hidden states.
  • If you're still having issues, set gpu_datastore=False or gpu_index=False, but note that this will degrade performance (these switches are sketched below).
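
As a rough illustration, the memory-saving switches above might be collected among the Unlimiformer hyperparameters like this (the names follow the flags used in this README and usage.py; treat the exact spelling and grouping as assumptions):

# Illustrative only: memory-saving settings for evaluation with very long inputs.
unlimiformer_kwargs = dict(
    test_unlimiformer=True,  # enable Unlimiformer at evaluation time
    use_datastore=True,      # store hidden states in a Faiss datastore
    gpu_datastore=False,     # keep the datastore on CPU if GPU memory is tight
    gpu_index=False,         # keep the Faiss index on CPU as well (slower, but lighter on GPU memory)
)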

Trained models

The following models from the paper are available on Hugging Face. Please note that you must add the Unlimiformer-specific files to your repository, and load these models with test_unlimiformer=True. If you download these models from Hugging Face, they may not use Unlimiformer by default!

Table 3: low-cost training methods

Dataset | Method | Hugging Face link
GovReport | Baseline: BART-base | abertsch/bart-base-govreport
GovReport | BART-base + Unlimiformer early stopping | abertsch/unlimiformer-bart-govreport-earlyk
SummScreen | Baseline: BART-base | abertsch/bart-base-summscreen
SummScreen | BART-base + Unlimiformer early stopping | abertsch/unlimiformer-bart-summscreen-earlyk

Table 4: Long-range training methods

Dataset | Method | Hugging Face link
GovReport | BART + Unlimiformer (alternating training) | abertsch/unlimiformer-bart-govreport-alternating
SummScreen | BART + Unlimiformer (retrieval training) | abertsch/unlimiformer-bart-summscreen-retrieval

Table 5: BookSum

Dataset | Method | Hugging Face link
BookSum | Baseline: BART-base | abertsch/bart-base-booksum
BookSum | BART-base + Unlimiformer early stopping | abertsch/unlimiformer-bart-booksum-earlyk
BookSum | BART-base + Unlimiformer (random-encoding training) | abertsch/unlimiformer-bart-booksum-random-encoding
BookSum | BART-base + Unlimiformer (alternating training) | abertsch/unlimiformer-bart-booksum-alternating

Results

(Result figures omitted; see the paper for the full results.)

Citation

If you use our method or models, please cite our paper:

@article{bertsch2023unlimiformer,
  title={Unlimiformer: Long-Range Transformers with Unlimited Length Input},
  author={Bertsch, Amanda and Alon, Uri and Neubig, Graham and Gormley, Matthew R},
  journal={arXiv preprint arXiv:2305.01625},
  year={2023}
}

unlimiformer's People

Contributors

9au5a, abertsch72, eltociear, szepeviktor, urialon


unlimiformer's Issues

Why use different calculation methods for the key and value of the decoder layers' cross-attention in the training and validation stages?

For example, in the training stage you use the SLED context-chunking method, so the input only passes through the encoder and you obtain the encoder's last-layer hidden states. You then use those hidden states to compute the long key and long value for the cross-attention of each decoder layer. In the validation stage, however, you feed the input directly into the entire model and then directly merge the keys and values of the decoder layers' cross-attention into the long key and long value. I want to know the reason for computing the long key and long value in different ways in the two stages.

Why is the inference so slow?

Hi,

Unlimiformer is amazing and can really help me. However, the inference is so slow that I believe I might be doing something wrong. Please help me. Thank you.

The task was pretty simple. I asked the LM to optimize the following Python code:

# bad_python_codes.py
total = 0
total += 0
total += 1
total += 2
total += 3
total += 4

I ran vanilla text generation with the following command, and model.generate(...) took 3 seconds to complete:

python run_generation.py \
--model_type llama \
--model_name_or_path /path/to/CodeLlama-13b-Instruct-hf \
--prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Optimize following Python codes: " \
--prompt bad_python_codes.py \
--suffix " [/INST]" \
--test_unlimiformer False \
--fp16 \
--length 10 \
--use_datastore False \

With Unlimiformer enabled, model.generate(...) took 1 minute and 20 seconds to complete:

python run_generation.py \
--model_type llama \
--model_name_or_path /path/to/CodeLlama-13b-Instruct-hf \
--prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Optimize following Python codes: " \
--prompt bad_python_codes.py \
--suffix " [/INST]" \
--test_unlimiformer True \
--fp16 \
--length 10 \
--layer_begin 0 \
--index_devices 1 \
--datastore_device 1 \
--use_datastore True \

Making Unlimiformer work with decoder models (specifically LLaMA)

This is a related issue, and in this issue I'd like to get technical.
I've been working on trying to adapt Unlimiformer to work with LLaMA for a while now, and it has come down to two main issues (so far): naming and architecture.
As an example of naming, in Unlimiformer.create_key_value the original attention is calculated from encoder_attn:

attention = decoder_layer.encoder_attn

I had to replace this with self_attn in UnlimiformerLlama, since it's the only kind of attention LLaMA has.

And the issue with the architecture came up when trying to calculate the key:
https://github.com/abertsch72/unlimiformer/blob/5b534d1532246da8ef3fb02bdfff41aa853b12de/src/unlimiformer.py#LL753C30-L753C30
Neither LLaMA nor GPT has an encoder, so there are no encoder_hidden_states, and attention.k_proj doesn't accept the None value that encoder_hidden_states defaults to.

There's probably some simple solution to this, some other state that should be used in decoder-only models in this case, but I haven't figured it out yet, neither on my own nor re-reading the original paper.

So, any pointers would be welcome.

support other llms?

Is it possible to support other LLMs that perform better on Chinese, like qwen-7b-chat or chatglm2-6b? Or could you give instructions on how to do so? Thank you :)

TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'

Hey looks like I'm having some issues working with Llama models. This is the modified script I'm using:

!python run_generation.py --model_type llama --model_name_or_path psmathur/orca_mini_3b \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 1 --datastore_device 0

But I get this error:

2023-08-14 14:28:33.395015: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
08/14/2023 14:28:35 - WARNING - __main__ - device: cuda, n_gpu: 1, 16-bits training: True
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565, and set the legacy attribute accordingly.
Loading checkpoint shards: 100% 3/3 [00:08<00:00,  2.95s/it]
08/14/2023 14:29:16 - INFO - __main__ - Namespace(model_type='llama', model_name_or_path='psmathur/orca_mini_3b', prompt='example_inputs/harry_potter_full.txt', length=200, num_hidden_layers=None, stop_token=None, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='<<SYS>>\\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \\n<</SYS>>\\n\\n [INST] Summarize the following book: ', suffix=' [/INST]', padding_text='', xlm_language='', seed=42, no_cuda=False, stream_output=False, num_return_sequences=1, fp16=True, jit=False, device=device(type='cuda'), n_gpu=1)
08/14/2023 14:29:16 - INFO - Unlimiformer - Encoding 0 to 65 out of 65
Traceback (most recent call last):
  File "/content/unlimiformer/src/run_generation.py", line 577, in <module>
    main()
  File "/content/unlimiformer/src/run_generation.py", line 532, in main
    output_sequences = model.generate(
  File "/content/unlimiformer/src/unlimiformer.py", line 529, in pre_generate_hook
    return self.original_generate_func(input_ids_prefix, **new_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/unlimiformer/src/unlimiformer.py", line 551, in pre_forward_hook
    result = self.original_forward_func(input_ids=input_ids, labels=labels, attention_mask=attention_mask, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 810, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 698, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/unlimiformer/src/unlimiformer.py", line 575, in attention_pre_forward_hook
    result = original_cross_attn_forward_func(hidden_states=hidden_states, attention_mask=attention_mask, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 310, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1547, in _call_impl
    hook_result = hook(self, args, result)
  File "/content/unlimiformer/src/unlimiformer.py", line 629, in attention_forward_hook
    _, top_search_key_indices = self.datastore[datastore_index].search(datastore_query, k=topk)
  File "/content/unlimiformer/src/index_building.py", line 34, in search
    scores, values = self.indices[i].search(queries[i], k)
  File "/content/unlimiformer/src/index_building.py", line 144, in search
    scores, values = faiss.knn_gpu(faiss.StandardGpuResources(), queries, self.keys, k, 
TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'

Any ideas on how to fix that?

Thanks again for all the help and for the new features!

Inference example with external model

Very interested in your work here!

Could you provide an example of how to load an external model (not one of your pretrained models) and plug the Unlimiformer architecture into it for inference?

Set max_size to 128 but use 512 tokens

Hi, great work I must say!

I understand that books can be fed into the trainer with the trainer's max token size set to the maximum, but is it possible to set it to a lower number? My input is ca. 400 tokens, but I'd like to speed up training by shortening it.

Thanks!

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, ....

Hi,

Thank you for this great effort.

I'm running into an issue with multi-gpu training. Here's my entry command.

  1. I'm using local data files.
  2. The base_training_args is the default one.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)

python src/run.py \
    src/configs/training/base_training_args.json \
    --model_name_or_path facebook/bart-large \
    --train_file ...\
    --validation_file ...\
    --test_file ...\
    --input_column ...\
    --input_prefix_column ... \
    --output_column ...\
    --overwrite_cache \
    --output_dir... \
    --overwrite_output_dir \
    --max_source_length 1024 \
    --eval_max_source_length 999999 \
    --generation_max_length 640 \
    --max_target_length 640 \
    --max_prefix_length 96 \
    --pad_prefix=True \
    --do_eval=True \
    --learning_rate 1e-5 \
    --per_device_eval_batch_size 1 \
    --per_device_train_batch_size 2 \
    --unlimiformer_training=True \
    --test_unlimiformer \
    --eval_steps 30 --save_steps 30 \
    --num_train_epochs 10 \
    --metric_names rouge \
    --extra_metrics bertscore \
    --metric_for_best_model bertscore \

The error arises in the forward pass,
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/models/bart/modeling_bart.py", line 810, in forward
inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale

And the error is:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)

Could you please give some clues on where to look for debugging? I don't think this is related to the custom dataset itself. I'm aware the issue could be traced to the index, datastore, batching, ... This work is inherently complex in that regard, and unfortunately I have limited knowledge of it.

Thank you very much!

Attached is a full stack trace:

Traceback (most recent call last):
File "unlimiformer/src/run.py", line 1183, in
main()
File "unlimiformer/src/run.py", line 803, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2654, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2679, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 110, in parallel_apply
output.reraise()
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/_utils.py", line 693, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in _worker
output = module(*input, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ks4765/research/unlimiformer_ODMDS/src/unlimiformer.py", line 551, in pre_forward_hook
result = self.original_forward_func(input_ids=input_ids, labels=labels, attention_mask=attention_mask, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/models/bart/modeling_bart.py", line 1380, in forward
outputs = self.model(
^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/models/bart/modeling_bart.py", line 1248, in forward
encoder_outputs = self.encoder(
^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/models/bart/modeling_bart.py", line 810, in forward
inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/functional.py", line 2235, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

About adding a prefix and input length

Hello and thank you for this great work!

  1. Is it possible to add the same prefix in front of every chunk? For instance, as you mention in #20 for a QA task we want to add the question before every chunk. Do we need to make any other changes to this codebase or just use the input_prefix_column argument?

  2. Have you tried also using models which can process inputs longer than 4k?

running unlimiformer inference on multiple gpus

Hi, after solving my problem with running summarization using llama-2-7b, I found a way to modify the code and it finally works!
Now I can load llama-2-13b-chat-hf and inference on inputs over 130k tokens!

Using the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 2 --datastore_device 3 --stream_output

I got the output:

=== GENERATED SEQUENCE 1 (input length: 131316) ===
|||  This is the transcription of the first 26 pages of the book "Harry Potter and the Philosopher's Stone" by J.K. Rowling. It is a faithful reproduction of the original text, with all the imperfections and idiosyncrasies of the author's writing style included. The text has not been edited or corrected in any way, as it is presented here in its original form. 

Please note that this transcription is for entertainment purposes only, and it is not intended to be a replacement for the original book. The original book is a work of fiction and any similarity to real persons, living or dead, is purely coincidental. 

Please enjoy this transcription for what it is, a reproduction of the original work, and please do not use it as a substitute for the original work.</s>

It seems like a non-typical summary of Harry Potter :(
But it's much better than outputs of 7b models!

I'll keep working on it, and update what I found in time.

Script utilizing LLM

Can you provide a script similar to inference-example.py that uses the run_generation.py file? I.e., instead of command-line execution like

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 1 --datastore_device 1

load the model and run inference from a Python script.
Thanks in advance!

Question: During training, the calculation of the top-k values' attention weights differs from the classic transformer's multi-head attention.

The top-k key attention calculation formula in Unlimiformer:
# this_layer_prompt_keys: (batch, head, source_len, dim)
# query: (batch, tgt, head, dim)
attn_weights = torch.matmul(this_layer_prompt_keys.unsqueeze(1), query.unsqueeze(-1)) \
    .reshape(batch_size, tgt_len, query.shape[-2], 1, this_layer_prompt_keys.shape[-2])

The normal transformer multi-head attention mechanism:
# key:   (batch * head, source_len, dim)
# query: (batch * head, target_len, dim)
att_weight = torch.matmul(query, key.transpose(-1, -2))

Unlimiformer uses the first attention calculation formula to get the top-k keys, so I want to know why it is used and whether I can use the normal transformer multi-head attention mechanism to get the top-k values.

Question about decoder models

Great work! I just took a look at the code and found parts where you declare mappings containing gpt2.
Models like OPT should be adaptable too, I think, but what about GPT-NeoX, which only has a fused query_key_value layer instead of separate query, key, and value layers?
It would be interesting for the Open-Assistant project, and if it works it would have huge potential.

multi-gpu unlimiformer training: Expected all tensors to be on the same device

Hello again,

Thanks for your effort again

Running unlimiformer training on gov_report (your README standard finetuning with the unlimiformer flags added):

python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --unlimiformer_training \
    --max_source_length 16384 \
    --test_unlimiformer  \
    --model_name_or_path facebook/bart-base \
    --max_source_length 1024 \
    --eval_max_source_length 999999 --do_eval=True \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore

All other configs are default.

Multi-gpu setting gets me the following error, and I couldn't find a fix.
However, single gpu works.

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/storage/home/research//unlimiformer/src/random_training_unlimiformer.py", line 163, in random_inputs_forward_hook
    self.long_inputs_encoded, self.long_inputs_mask = self.chunked_encode_input(input_ids=input_ids, attention_mask=attention_mask)
  File "/storage/home/research//unlimiformer/src/random_training_unlimiformer.py", line 195, in chunked_encode_input
    output = self.model.base_model.encoder(chunk, attention_mask=chunk_attention_mask, return_dict=True, output_hidden_states=True)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/transformers/models/bart/modeling_bart.py", line 818, in forward
    inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/miniconda3/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)

I am curious whether you have similar issues when running the latest main commit.

Thank you!

Reproduce the +test Unlimiformer setup

Hi I want to reproduce the results of the +test Unlimiformer from the paper. Based on my understanding this setup does not require training, so is it possible to load an available checkpoint (like this) and convert it to Unlimiformer like the example demonstrated in inference-example.py? Are there any settings that I omitted here? Thanks!

Errors on running llama with `test_datastore`

Dear Authors,

I am running llama2 with Unlimiformer and want to investigate the data store, which is enabled with --test_datastore, but I find that there are some errors:

  1. The function process_key_value in UnlimiformerLLaMa has an error (L1056) where capturers are injected by the function activation_to_capture instead of get_kv_projections. I am not sure if the solution is to add another capturer for recording the outputs from get_kv_projections.
  2. I tried using another capturer and it ran successfully on my data; however, it preserved less than 1% of the attention mass instead of 99% of the attention (code). Have you tested llama2 on this, or could you guide me if I did something incorrectly?

Thank you!

Prompt with Llama-2 stops after "Loading checkpoint shards: 0%"

Hi,
I'm trying to get the Llama-2 example working but I'm stuck with the following issue: the program stops with no message.
Any advice on what I can try?
I'm on Windows 11 with an Nvidia T1000.

PS C:\__noel\AAA\github\unlimiformer> py .\src\run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf --prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Summarize the following book: " --prompt example_inputs/Annette_et_le_criminel.txt --suffix " [/INST]" --test_unlimiformer --fp16 --length 400 --layer_begin 16 --index_devices 1 --datastore_device 1
11/14/2023 12:56:00 - WARNING - __main__ - device: cpu, n_gpu: 0, 16-bits training: True
Using pad_token, but it is not set yet.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
PS C:\__noel\AAA\github\unlimiformer>

Why "import sled" was commented out in run.py?

Hi,

Thank you again for this great effort.

As the title reads, why does the current commit of run.py have
from sled import SledConfig and
import sled # *** required so that SledModels will be registered for the AutoClasses ***
commented out?
May I ask if the import is no longer needed in the default setting?

https://github.com/abertsch72/unlimiformer/blob/651c5b37d96d676e1da32e36b05dc388bcc440e4/src/run.py#L31C28-L31C28

File "/unlimiformer/src/unlimiformer.py", line 814, in convert_model
type_to_class[type(model)](model, *args, **kwargs)

KeyError: <class 'sled.modeling_sled.SledForConditionalGeneration'>

Relative positions in RoPE embeddings

Hi,
I was going through your code to know how you calculated the RoPE embeddings and need a clarification

In assigning a relative position to a newly generated token, the base reference is taken as the end of the prompt input
https://github.com/abertsch72/unlimiformer/blob/232fc235706c304667f7a671cca2203d4625eaa1/src/unlimiformer.py#L1084C10-L1084C10

In assigning a relative position to the retrieved key indices, the base reference is taken as the start of the prompt input

scaled_key_indices = ((top_search_key_indices / self.prompt_input_ids.shape[1]) * self.actual_model_window_size).int()

Then would it not be the case that the current hidden state gives more attention to the tokens somewhere in the middle of the prompt and then decays both to the right and left?

Thank you
Ashwin Ramachandran

About the method `attention_forward_hook`

Hi, I've been reading the code for a few days, and I have a question about how the code works.

As far as I know, attention_forward_hook is the only method that looks for the relevant keys in the datastore.

However, the method only deals with local variables, except for cur_layer_key_value_placeholder, which does not really affect the output of the attention layer. It also does not return anything.

Can you explain how the stored data is used?

Is it possible to change the base model?

Thanks for your work. I have skimmed the source and the paper and see that only a few base models are supported (all are seq2seq models). My question is: can I replace the seq2seq model with a sequence classification or token classification model (from Hugging Face, like BERT or RoBERTa)?

Not really an issue - TrainingArguments are now immutable

See https://discuss.huggingface.co/t/trainingarguments-now-immutable-why/52565
and the nice solution huggingface/trl#682

Quick fix if anybody has the latest version too (& wants to run run.py lol) - see the sketch after these steps:

  1. In line 39 add 'replace'
  2. Replace line 705 by "training_args = replace(training_args, eval_fraction=min(training_args.eval_fraction, n))"
  3. & analogous line 710 by "training_args = replace(training_args, eval_fraction=training_args.eval_fraction / n)"

PS: I also had some troubles with wandb and had to replace "fork" with "thread" in line 413, but I didn't really look up what exactly it was about.

Best,
Paula

IndexError when running inference with Llama-2 model

Hi, thanks for this amazing work.

I followed the installation guide in this issue: #25, but it gives me the following error when running the inference code below on 2 V100 GPUs, each with 32GB:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Summarize the following book: " \
    --prompt example_inputs/harry_potter.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 0 --datastore_device 0

Error:

File "/ocean/projects/cts180021p/shang9/foundation_models/openLLM4chem/unlimiformer/src/unlimiformer.py", line 1086, in preprocess_query
    cos = cos[:,:,-1]  # [1, 1, dim]
IndexError: too many indices for tensor of dimension 2

Do you know what may go wrong? Thanks.

Error Encountered While Running 'run_generation.py' Script

I am facing an issue when attempting to run the 'run_generation.py' script from the 'unlimiformer/src' directory. This script is crucial for my work with Unlimiformer on Google Colab, and I need assistance in resolving the problem.

When executing the following command:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 1 --datastore_device 1
 the script runs on Google Colab, but it abruptly terminates with the following error message:
  10/05/2023 11:16:12 - WARNING - __main__ - device: cuda, n_gpu: 1, 16-bits training: True
  Using pad_token, but it is not set yet.
 ^C 

The script utilizes the 'cuda' device, and I have one GPU ('n_gpu: 1') available for the process.
I am enabling 16-bits training with the '--fp16' flag.
The script also specifies various parameters, including 'length,' 'layer_begin,' 'index_devices,' and 'datastore_device.'
The issue arises when running the script but does not provide a clear indication of the problem's root cause.

Use of other Encoder/Decoder Models

Hello, I've been using Unlimiformer as a comparison with current standard methods of summarization and was wondering whether anything in particular would be needed to convert, say, a Pegasus model to Unlimiformer, since it should work with "all encoder/decoder" models. I see several lines commented out in unlimiformer.py (here) for AutoModelForSeq2Seq; however, I currently don't see a direct way this has been implemented yet.

As Pegasus is BART-based, I set up a new model converter, PegasusForConditionalGeneration: UnlimiformerPegasus, and started a new Unlimiformer class for it:

class UnlimiformerPegasus(UnlimiformerBART):
    def __init__(self, model: PegasusModel, *args, **kwargs):
        super().__init__(model, *args, **kwargs)

However, I was wondering if you or anyone else had found additional tweaking that was needed to fully convert, say, a Pegasus model.

And I guess more generally, what is the procedure you use when setting up your own new Unlimiformer-converted models? I was unable to simply glean what was necessary to ensure "consistent" performance and/or results.

Thanks!

knn_args, unlimiformer_args, tokenizer is not defined

Hi, I'm trying to deploy Unlimiformer on my local machine.
I cloned the src folder into my source folder and installed the dependencies.
When I open usage.py I see a bunch of problems:
(screenshots omitted)

What am I missing here? I followed the instructions in the repo.

Running Unlimiformer with the `forward` method

Hi,
I am currently trying to run the inference example with a slight modification. Instead of generate()-ing text, I want to pass in input_ids and labels to the forward function and obtain the logits. However, the current implementation leads to device-side asserts. How may I achieve the above goal?

error while training

Hi, I tried to train with inputs longer than 1024 on bart using the following command:

CUDA_VISIBLE_DEVICES=0 python src/run.py \
    src/configs/model/bart_base_sled.json \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --model_name_or_path facebook/bart-base \
    --max_source_length 16384 \
    --eval_max_source_length 1024 --do_eval=True \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore --unlimiformer_training

And I got a lot of errors like this:

/opt/conda/conda-bld/pytorch_1682343962757/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [334,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

But as long as max_source_length is smaller than 1024, I can train the model successfully. Any clues on that?

Are there any example codes to run?

This project is very interesting. However, I can't find example code that can be run directly. Could the authors provide some examples?

I have created a LinkedIn post for this repo.

I found the work done under this repo very much groundbreaking, and that's why I thought to create a LinkedIn post to bring it to the public eye.
Here is the link to my original post: https://www.linkedin.com/posts/hemang-joshi-046746aa_home-activity-7061716941949796352-ztHP?utm_source=share&utm_medium=member_desktop

Please, if you like the post, help me by sharing my profile or my website with your friends.

Thanks,
Have a nice day
Hemang Joshi
Https://hjlabs.in

Question: too many indices for tensor of dimension 1

I can run inference-example.py, but when I try to combine it with my own code I face this problem: "too many indices for tensor of dimension 1". I guess in inference-example the dataset is just one tensor. My own dataset is from a PDF, and I think that is one tensor as well; however, I think I've overlooked something. Would you help me fix this problem? Thank you very much.

Error while evaluating

Hello, I am running the bart_base_sled code using the contract_nli dataset, with the following arguments:

python run.py configs/training/base_training_args.json configs/model/bart_base_sled.json configs/data/contract_nli.json \
--output_dir checkpoints \
--per_device_train_batch_size 1 \
--test_unlimiformer \
--model_name_or_path facebook/bart-base \
--unlimiformer_training \
--max_source_length 16384 \
--learning_rate 1e-5 \
--eval_max_source_length 16384 \
--do_eval=True  \
--eval_steps 16 \
--save_steps 16 \
--extra_metrics bertscore

(I have low eval_steps for debugging purposes). The training seems to work fine, but after evaluation starts I get an error:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 1 for tensor number 1 in the list.
Full Log
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮                                                                                                                                            
│ /home/aaa/unlimiformer/src/run.py:1213 in <module>                                               │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│   1210                                                                                           │                                                                                                                                            
│   1211                                                                                           │                                                                                                                                            
│   1212 if __name__ == "__main__":                                                                │                                                                                                                                            
│ ❱ 1213 │   main()                                                                                │                                                                                                                                            
│   1214                                                                                           │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│ /home/aaa/unlimiformer/src/run.py:822 in main                                                    │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│    819 │   │   elif last_checkpoint is not None:                                                 │                                                                                                                                            
│    820 │   │   │   checkpoint = last_checkpoint  # look for checkpoints in the outdir            │                                                                                                                                            
│    821 │   │                                                                                     │                                                                                                                                            
│ ❱  822 │   │   train_result = trainer.train(resume_from_checkpoint=checkpoint)                   │                                                                                                                                            
│    823 │   │   logger.info('Done training')                                                      │                                                                                                                                            
│    824 │   │   trainer.save_model()  # Saves the tokenizer too for easy upload                   │                                                                                                                                            
│    825                                                                                           │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│ /home/aaa/anaconda3/envs/hf-latest/lib/python3.10/site-packages/transformers/trainer.py:1521 in  │                                                                                                                                            
│ train                                                                                            │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│   1518 │   │   inner_training_loop = find_executable_batch_size(                                 │                                                                                                                                            
│   1519 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │                                                                                                                                            
│   1520 │   │   )                                                                                 │                                                                                                                                            
│ ❱ 1521 │   │   return inner_training_loop(                                                       │                                                                                                                                            
│   1522 │   │   │   args=args,                                                                    │                                                                                                                                            
│   1523 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │                                                                                                                                            
│   1524 │   │   │   trial=trial,                                                                  │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│ /home/aaa/anaconda3/envs/hf-latest/lib/python3.10/site-packages/transformers/trainer.py:1840 in  │                                                                                                                                            
│ _inner_training_loop                                                                             │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│   1837 │   │   │   │   │   self.state.epoch = epoch + (step + 1) / steps_in_epoch                │                                                                                                                                            
│   1838 │   │   │   │   │   self.control = self.callback_handler.on_step_end(args, self.state, s  │                                                                                                                                            
│   1839 │   │   │   │   │                                                                         │                                                                                                                                            
│ ❱ 1840 │   │   │   │   │   self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_k  │                                                                                                                                            
│   1841 │   │   │   │   else:                                                                     │                                                                                                                                            
│   1842 │   │   │   │   │   self.control = self.callback_handler.on_substep_end(args, self.state  │                                                                                                                                            
│   1843                                                                                           │                                                                                                                                            
│                                                                                                  │                                                                                                                                            
│ /home/aaa/anaconda3/envs/hf-latest/lib/python3.10/site-packages/transformers/trainer.py:2065 in  │                                                                                                                                            
│ _maybe_log_save_evaluate                                                                         │
│                                                                                                  │
│   2062 │   │                                                                                     │
│   2063 │   │   metrics = None                                                                    │
|   2064 │   │   if self.control.should_evaluate:                                                  │                                                                                                                                            
│ ❱ 2065 │   │   │   metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)                     │                                                                                                                                            
│   2066 │   │   │   self._report_to_hp_search(trial, self.state.global_step, metrics)             │                                                                                                                                            
│   2067 │   │                                                                                     │                                                                                                                                            
│   2068 │   │   if self.control.should_save:                                                      │
│                                                                                                  │
│ /home/aaa/unlimiformer/src/utils/custom_seq2seq_trainer.py:267 in evaluate                       │
│                                                                                                  │
│   264 │   │                                                                                      │
│   265 │   │   eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else se   │
│   266 │   │   try:                                                                               │
│ ❱ 267 │   │   │   output = eval_loop(                                                            │
│   268 │   │   │   │   eval_dataloader,                                                           │
│   269 │   │   │   │   description="Evaluation",                                                  │
│   270 │   │   │   │   # No point gathering the predictions if there are no metrics, otherwise    │
│                                                                                                  │
│ /home/aaa/anaconda3/envs/hf-latest/lib/python3.10/site-packages/transformers/trainer.py:2965 in  │
│ evaluation_loop                                                                                  │
│                                                                                                  │
│   2962 │   │   │   │   │   batch_size = observed_batch_size                                      │
│   2963 │   │   │                                                                                 │
│   2964 │   │   │   # Prediction step                                                             │
│ ❱ 2965 │   │   │   loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_o  │
│   2966 │   │   │   inputs_decode = self._prepare_input(inputs["input_ids"]) if args.include_inp  │
│   2967 │   │   │                                                                                 │
│   2968 │   │   │   if is_torch_tpu_available():                                                  │
│                                                                                                  │
│ /home/aaa/unlimiformer/src/utils/custom_seq2seq_trainer.py:140 in prediction_step                │
│                                                                                                  │
│   137 │   │   if has_labels:  # changed the order of the if's here because there is no point g   │
│   138 │   │   │   with torch.no_grad():                                                          │
│   139 │   │   │   │   with self.compute_loss_context_manager():                                  │
│ ❱ 140 │   │   │   │   │   outputs = model(**inputs)                                              │
│   141 │   │   │   │   │   if self.label_smoother is not None:                                    │
│   142 │   │   │   │   │   │   loss = self.label_smoother(outputs, inputs["labels"]).mean().det   │
│   143 │   │   │   │   │   else:                                                                  │
│                                                                                                  │
│ /home/aaa/anaconda3/envs/hf-latest/lib/python3.10/site-packages/torch/nn/modules/module.py:1130  │
│ in _call_impl                                                                                    │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/aaa/unlimiformer/src/unlimiformer.py:499 in pre_forward_hook                               │
│                                                                                                  │
│   496 │   │   │   │   if input_ids is not None:                                                  │
│   497 │   │   │   │   │   self.input_ids = torch.cat([self.input_ids, input_ids[0]])             │
│   498 │   │   │   │   if kwargs.get('decoder_input_ids') is not None:                            │
│ ❱ 499 │   │   │   │   │   self.generated_input_ids = torch.cat([self.generated_input_ids, kwar   │
│   500 │   │                                                                                      │
│   501 │   │   result = self.original_forward_func(input_ids=input_ids, labels=labels, attentio   │
│   502 │   │   self.is_first_test_decoding_step = False                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 1 for tensor number 1 in the list.
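
For context, the failing call at unlimiformer.py:499 is a plain torch.cat along the last dimension, so the message points at a batch-dimension mismatch between the cached generated_input_ids and the incoming decoder_input_ids (plausibly from evaluating with a batch size other than 1, though that is a guess rather than something the traceback proves). A minimal sketch that reproduces the same error message with hypothetical shapes:

```python
import torch

# Hypothetical shapes only: a cached tensor tracking a batch of 4 sequences,
# concatenated with a decoder_input_ids tensor for a batch of 1.
generated_input_ids = torch.zeros(4, 10, dtype=torch.long)
decoder_input_ids = torch.zeros(1, 1, dtype=torch.long)

try:
    torch.cat([generated_input_ids, decoder_input_ids], dim=-1)
except RuntimeError as e:
    # Prints: Sizes of tensors must match except in dimension 1.
    # Expected size 4 but got size 1 for tensor number 1 in the list.
    print(e)
```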

Sanity check: VRAM usage on llama-2-7b-chat-hf higher than without Unlimiformer at low token counts?

I'm trying out the new Unlimiformer llama-2 code on llama-2-7b-chat-hf, on a 24GB 3090.
I understand Unlimiformer probably wasn't created with consumer GPUs in mind, but I'd hoped I'd be able to squeeze some more context out of my GPU locally before having to resort to expensive cloud GPUs.
I managed to get everything working, but the VRAM usage per token seems to be higher than on stock llama-2-7b-hf.
I imagine there is expected overhead from running Unlimiformer, though it is more than I expected.
With vanilla Transformers (same versions and everything) on fp16, I can ingest up to ~5350 tokens at once before running out of memory.
With Unlimiformer, the same 5350 tokens runs out of memory, and I can barely do more than 4096 tokens (5000 already OOMs).
Is this expected overhead? And is this overhead fixed, or does it vary with the model size?

Semi-related side questions: Is there anything Unlimiformer does that would prevent it from working with bitsandbytes 8-bit/4-bit quantization, or should that be a matter of simply enabling it? And should QLoRA training with peft work?
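
For reference, a minimal sketch of how peak VRAM at a given prompt length can be measured for such a comparison (assuming a CUDA GPU; the model name and the 4096-token synthetic prompt are placeholders, not the exact setup from this report):

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

# Synthetic 4096-token prompt; swap in a real tokenized prompt for an apples-to-apples test.
prompt_ids = torch.randint(0, model.config.vocab_size, (1, 4096), device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(prompt_ids, max_new_tokens=32)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```

Running the same measurement with and without Unlimiformer enabled should make the per-token overhead concrete.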

ImportError: cannot import name 'Unlimiformer' from 'unlimiformer'

Hi, I've been trying to run inference using the model checkpoints on HF by following the code in inference-example.py.

I'm using Google Colab and I'm cloning the whole repository. However, when I try to run the code, I get the following error:

---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-9-ea5aa8573cba> in <cell line: 1>()
----> 1 from unlimiformer import Unlimiformer
      2 from random_training_unlimiformer import RandomTrainingUnlimiformer
      3 from usage import UnlimiformerArguments, training_addin
      4 
      5 from transformers import BartForConditionalGeneration, AutoTokenizer

ImportError: cannot import name 'Unlimiformer' from 'unlimiformer' (unknown location)



I've tried multiple things but nothing seems to work:

  • Restarting runtime
  • Deleting runtime & creating a new one
  • Adding the directory to PATH
  • Changing to directory using %cd /content/unlimiformer/src
  • Moving the files to /content/

I'm not sure what I'm missing here. It seems like such a minor issue, but I'm pulling my hair out trying to figure out why it isn't working. If someone can help me out, I'd really appreciate it. Thanks!
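
One thing that typically causes this (offered as a guess, since the notebook itself isn't shown): the modules live under src/, and the shell PATH variable has no effect on Python imports; the src directory needs to be on sys.path (or be the working directory in the same session as the import), e.g.:

```python
import sys
sys.path.insert(0, "/content/unlimiformer/src")  # assumes the repo was cloned into /content

from unlimiformer import Unlimiformer
from random_training_unlimiformer import RandomTrainingUnlimiformer
from usage import UnlimiformerArguments, training_addin
```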

Working with 8bit and 4bit quantized models

Hey! Great work on this project! I got it to work on a couple of T5 instruction-tuned models from Hugging Face. I was just curious: has anyone been able to get the code to work with quantized models? Currently, when I set 'load_in_4bit=True', I get this error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 1>:1                                                                              │
│                                                                                                  │
│ /content/unlimiformer/src/unlimiformer.py:707 in convert_model                                   │
│                                                                                                  │
│   704 │   @classmethod                                                                           │
│   705 │   def convert_model(cls, model, *args, **kwargs):                                        │
│   706 │   │   model_clone = AutoModelForSeq2SeqLM.from_config(model.config)                      │
│ ❱ 707 │   │   model_clone.load_state_dict(model.state_dict())                                    │
│   708 │   │   type_to_class = {                                                                  │
│   709 │   │   │   BartModel: UnlimiformerBART,                                                   │
│   710 │   │   │   BartForConditionalGeneration: UnlimiformerBART,                                │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:2041 in load_state_dict       │
│                                                                                                  │
│   2038 │   │   │   │   │   │   ', '.join('"{}"'.format(k) for k in missing_keys)))               │
│   2039 │   │                                                                                     │
│   2040 │   │   if len(error_msgs) > 0:                                                           │
│ ❱ 2041 │   │   │   raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(     │
│   2042 │   │   │   │   │   │   │   self.__class__.__name__, "\n\t".join(error_msgs)))            │
│   2043 │   │   return _IncompatibleKeys(missing_keys, unexpected_keys)                           │
│   2044 │                                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Error(s) in loading state_dict for T5ForConditionalGeneration:
	size mismatch for encoder.block.0.layer.0.SelfAttention.q.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for encoder.block.0.layer.0.SelfAttention.k.weight: copying a param with shape torch.Size([524288, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).

Does anyone have any solutions to this?
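
The checkpoint shape in that error is consistent with bitsandbytes' 4-bit storage: a 1024×1024 weight packs two 4-bit values per byte into a [524288, 1] uint8 tensor, while convert_model builds a fresh full-precision clone from the config and then tries to load the quantized state_dict into it, so the shapes can't line up. A tiny illustrative sketch of the arithmetic (plain PyTorch, no bitsandbytes; the packing detail is a reading of the error message rather than something confirmed in the repo):

```python
import torch

# A 1024 x 1024 weight holds 1,048,576 values. Packed at 4 bits each
# (two values per byte), that's 524,288 bytes -- matching the
# torch.Size([524288, 1]) reported for the quantized checkpoint above.
full_precision = torch.zeros(1024, 1024, dtype=torch.float16)
packed_bytes = full_precision.numel() // 2
print(packed_bytes)  # 524288
```

So getting load_in_4bit to work would likely require convert_model to wrap the already-quantized modules directly instead of round-tripping through load_state_dict, rather than just flipping the flag.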

Encoder Only Unlimiformer

I'm currently trying to modify the idea so that it works with encoder-only models. How would the faiss index work to make it efficient? Does it have to be retrained at each timestep? Thank you.
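
For what it's worth on the faiss side (a generic sketch, not the repo's actual datastore code): in the encoder-decoder setting the index is built once over the encoded hidden states of the long input and then only queried during decoding, not rebuilt or retrained per step; an encoder-only variant could follow the same pattern, with each attention query retrieving its top-k keys from the index. The hidden size and token counts below are placeholders:

```python
import faiss
import numpy as np

d = 512                                                        # hidden size (placeholder)
hidden_states = np.random.rand(50_000, d).astype("float32")    # one vector per input token

index = faiss.IndexFlatIP(d)   # exact inner-product search, built once per long input
index.add(hidden_states)

# At each attention computation, queries retrieve only their top-k keys
# instead of attending over all 50k tokens.
queries = np.random.rand(8, d).astype("float32")
scores, token_ids = index.search(queries, 16)
print(token_ids.shape)  # (8, 16)
```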

Typing checks fail

Using the CustomHfArgumentParser to parse the UnlimiformerArguments from a json config which reads

```
"layer_begin": 0,
"layer_end": "",
```

(since JSON has no `None` literal) leads to a parsing error:
> error: argument --layer_end: invalid int value: ''

We can omit the `layer_end` arg from the config, but there's probably a more elegant solution.
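
One possible workaround, sketched under the assumption that the value ends up going through an argparse-style type conversion (int_or_none is a hypothetical helper, not something that exists in the repo): map the empty string to None explicitly.

```python
from typing import Optional

def int_or_none(value: Optional[str]) -> Optional[int]:
    """Treat '' (or 'None') from a JSON/CLI config as None; otherwise parse an int."""
    if value is None or value in ("", "None"):
        return None
    return int(value)

print(int_or_none(""))    # None
print(int_or_none("12"))  # 12
```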
