neelsjain / neftune Goto Github PK

Official repository of NEFTune: Noisy Embeddings Improves Instruction Finetuning

License: MIT License

Python 97.65% Shell 2.35%

neftune's Introduction

NEFTune

[10/17/2023] NEFTune has been integrated into the Huggingface's TRL (Transformer Reinforcement Learning) library and HF Trainer . See Annoucement.

[11/25/2023] NEFTune has been integrated into Ludwig.ai for LLM fine-tuning. See PR.

Please see the limitations of our study below. Additionally, for generation, we suggest using greedy decoding with a repetition penalty of $1.2$. Note that without the repetition penalty that we have seen performance degrade and generations to degenerate.

Feel free to reach out to me as well (email at the bottom of the page 1 in the paper).

Overview

In this paper, we propose to add random noise to the embedding vectors of the training data during the forward pass of fine-tuning. We show that this simple trick can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead. Noisy Embedding Instruction Fine Tuning (NEFTune), while simple, has a strong impact on downstream conversational quality. When a raw LLM like LLaMA-2-7B is finetuned with noisy embeddings with popular Alpaca dataset, its performance on AlpacaEval improves from 29.8% to 64.7% -- an impressive boost of around 35 percentage points. NEFTune leads to this surprising and large jump in performance on conversational tasks, maintaining performance on factual question answering baselines. Using noisy embeddings seems to be a free lunch for LLM fine-tuning. The paper can be found here.

Note: During training, we observed the training loss values may be very close. The loss histograms in the Analysis section in the paper are the loss values when the noise is turned off. This is where the difference should show up.

Code

The easiest way to incorporate NEFTune into your training procedure is to rewrite the forward for the embedding. An example of one way to do this for LLaMA is provided below. Note different distributed training will require different implementations.

from torch.nn import functional as F

def NEFTune(model, noise_alpha=5):
    def noised_embed(orig_embed, noise_alpha):
        def new_func(x):
            # during training, we add noise to the embedding
            # during generation, we don't add noise to the embedding
            if model.training:
                embed_init = orig_embed(x)
                dims = torch.tensor(embed_init.size(1) * embed_init.size(2))
                mag_norm = noise_alpha/torch.sqrt(dims)
                return embed_init + torch.zeros_like(embed_init).uniform_(-mag_norm, mag_norm)
            else:
                return orig_embed(x)
        return new_func
    ##### NOTE: this is for a LLaMA model ##### 
    ##### For a different model, you need to change the attribute path to the embedding #####
    orig_forward = model.base_model.embed_tokens.forward
    model.base_model.embed_tokens.forward = noised_embed(orig_forward, noise_alpha)
    return model

The code we used to run the experiments can be found in the experiment_code folder.

Limitations

Our study has several limitations. We adopt AlpacaEval as our central measure of instruction following ability for LLMs, which is subject to the biases of a single judge (GPT-4). Additionally, due to limited computing resources, we were not able to validate the success of NEFTune on larger 70B variants of LLaMA-2 on multiple datasets, and we had to rely on fixed hyper-parameters for most NEFTune runs rather than sweeping. Finally, despite our empirical studies, we do not have a conclusive understanding of why NEFTune works.

Call for Feedback

Our study was limited to the settings that we explored. Please feel free to open an issue regarding any weakness of NEFTune. We hope to be open about any issues with NEFTune to help future research and users.

neftune's People

Contributors

Stargazers

Watchers

Forkers

touristshaun codeaudit eltociear techthiyanes brunotech bijoyboban7 chrismii simran135 tianbingsz ianbestfellow arnavgarg1 mc-nya kenakafrosty lichao-cy bbuing9 lemon-awa jlhe2000 chenguang-wang

neftune's Issues

No eval_scoring.py file in the repo

When I look at the eval_generate.py, I need eval_scoring.py to evaluate the performance. Can you upload this missing file?

[Reimplementation] Unable to reproduce results -- Training loss curves are similar

Hello,

We implemented hijacking for mistral and llama in Axolotl (see PR: axolotl-ai-cloud/axolotl#721 ) but are unable to reproduce your results, our loss curves pretty much match the version without noisy embeddings. We then copied the implementation in your code instead of using the one in the readme and same results.

We have tried:

With/Without sample packing
Shorter context (512 like in your paper)
Llama 2 and Mistral 7b

Could you help us understand what we are doing wrong?

can you suggest about chat tuning？

i think "return model" should be within the scope of the NEFTune function, not outside it

Output 0 of ViewBackward0 is a view

Hi,

I am trying to reproduce the experiment by finetuning a Llama2 model and I have the following error when doing the forward pass to get the embeddings:

"Output 0 of ViewBackward0 is a view and its base or another view of its base has been modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one."

Did you have the same error ? I think it is a problem with fsdp but I am using the same wrapping policy as you did

Unable to evaluate the model

Thank you very much for your great work!
I try to run the evaluation script scripts/alpaca_eval.sh and it occurs an error:

File"hug/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 1081, in load_checkpoint_in_model
    checkpoint_files = sorted(list(set(index.values())))
                                   ^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'list'

It could be better if you could help me fix it.

Thank you very much in advance!

Best
Lucas

Question about output embedding from noised tokens

Hi! Thanks for the very interesting paper! We were discussing this paper in the hugging face reading group and we wondered, for the new model trained with noise token embeddings, do the output embeddings have larger cone/lower cosine similarities/different singular values compared to the original model? (For output embeddings, not the token embeddings)

We thought larger cones led to less overfitting intuitively. Anyway, thanks for the paper again!

RuntimeError: ``sharded_state_dict`` can only be used when parameters are flatten and sharded.

Hello ,could you ask me a favor? I have some problem with this code,
I first run convert_hf_to_fsdp.py

HF_CHECKPOINT_PATH=/......./llama2-7b-hf
SAVE_PATH=/......./NEFTune/test_7b/

python convert_hf_to_fsdp.py --load_path $HF_CHECKPOINT_PATH --save_path $SAVE_PATH #--add tokens $NUM_TOKENS

and get the files .metadata and __0_0.distcp under the file test_7b

then I run the `train.py' with config

--init_checkpoint_path /......./NEFTune/test_7b 
--model_config_path /......../llama2-7b-hf 
--wrapped_class_name LlamaDecoderLayer 
--data_path datasets/alpaca-train.jsonl 
--added_tokens 0 
--act_checkpointing 
--lr 5e-5 
--accumulation_steps 8 
--batch_size 1 
--checkpoint_path ./checkpoints/naive 
--neftune_alpha 5

just remove wandb config the change the path

I got a error
RuntimeError: ``sharded_state_dict`` can only be used when parameters are flatten and sharded.
Can you help me see why？

More benchmarks

Do you guys test your model on classical benchmarks such as MMLU, GSM8K, HumanEVAL?
Perhaps the relative improvement on AlpacaEval is merely due to GPT-4's preference for longer responses.

{RecursionError}maximum recursion depth exceeded while calling a Python object

Environment

transformers==4.34.0

I try the patch in README

from torch.nn import functional as F

def NEFTune(model, noise_alpha=5)
    def noised_embed(orig_embed, noise_alpha):
        def new_func(x):
            # during training, we add noise to the embedding
            # during generation, we don't add noise to the embedding
            if model.training:
                embed_init = orig_embed(x)
                dims = torch.tensor(embed_init.size(1) * embed_init.size(2))
                mag_norm = noise_alpha/torch.sqrt(dims)
                return embed_init + torch.zeros_like(embed_init).uniform_(-mag_norm, mag_norm)
            else:
                return orig_embed(x)
        return new_func
    ##### NOTE: this is for a LLaMA model ##### 
    ##### For a different model, you need to change the attribute path to the embedding #####
    model.base_model.model.model.embed_tokens.forward = noised_embed(model.base_model.model.model.embed_tokens, noise_alpha)
    return model

But model.base_model.model.model.embed_tokens.forward fails to work, showing the following error message:

{AttributeError}'LlamaModel' object has no attribute 'model'

Therefore, I tried to edit the patch as below:

 model.base_model.embed_tokens.forward = noised_embed(model.base_model.embed_tokens,
                                                         noise_alpha)

But this will cause another issue:

{RecursionError}maximum recursion depth exceeded while calling a Python object

Could you give me some advice about it? Thank you very much.

quesiton about the noise injection location

Thank for you interesting work. In Algorithm 1, it seems that noise is injected into the embedding of the instruction X. However, the code example shows that the noise is injected into the embedding of the instruction + answer. Is there some inconsistency here?

trainer = Trainer(
    model=model, tokenizer=tokenizer, args=training_args, **data_module
)

how to add it?