tatsu-lab / alpaca_farm

A simulation framework for RLHF and alternatives. Develop your RLHF method without collecting human data.

Home Page: https://arxiv.org/abs/2305.14387

License: Apache License 2.0

Python 99.97% JavaScript 0.03%
deep-learning instruction-following large-language-models reinforcement-learning-from-human-feedback natural-language-processing

alpaca_farm's Introduction

AlpacaFarm

AlpacaFarm: A Simulation Framework for Methods that
Learn from Human Feedback

Code License · Data License · Python 3.10+ · Code style: black

Changing auto-annotators: text-davinci-003 has been deprecated by OpenAI, so we can no longer use the original pool of annotators for automatically generating preferences (for fine-tuning or evaluation). We have therefore switched to the GPT-4 annotator from AlpacaEval 1. All results should thus be compared to models from AlpacaEval 1 rather than the original AlpacaFarm results. Note that over-optimization might not be observed in this new setting (see Figure 4 in the paper). We are sorry for the inconvenience caused.


Research and development on learning from human feedback is difficult because methods like RLHF are complex and costly to run. AlpacaFarm is a simulator that enables research and development on learning from feedback at a fraction of the usual cost, promoting accessible research on instruction following and alignment.

Please read our paper and blog post for details on our research findings.

This repo contains code for simulating pairwise preference feedback from API models, running automated evaluations, and reference implementations of methods for learning from pairwise feedback.

The data needed to run our code is hosted on HuggingFace: https://huggingface.co/datasets/tatsu-lab/alpaca_farm.

Usage and License Notices: AlpacaFarm is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weight diff is also CC BY NC 4.0 (allowing only non-commercial use).

The AlpacaFarm


Workflow

Instruction-following models are typically developed in three steps:

  1. Supervised fine-tuning with demonstrations
  2. Learning from human feedback; usually pairwise preferences
  3. Human evaluation with interaction

The goal of AlpacaFarm is to provide three key components that tackle steps 2 and 3: low-cost simulation of pairwise feedback from API models (e.g. GPT-4, ChatGPT), automated evaluations for method development, and reference implementations of learning algorithms for comparison and modification.

Installation

To install the stable release, run

pip install alpaca-farm

To install from the latest commit on the main branch, run

pip install git+https://github.com/tatsu-lab/alpaca_farm.git

To enable FlashAttention and other optimizations, install the flash-attn and apex packages.

Simulating pairwise preference

Notebook example: see the example notebook in the repository.

For all evaluations and annotations, we use AlpacaEval with our pool of automatic annotators, plus additional noise to simulate the variance of human annotations.

To get started, set the environment variable OPENAI_API_KEY to your OpenAI API key, and (optionally) OPENAI_ORG to the organization ID. You can do this by running

export OPENAI_API_KEY="sk..."
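If you prefer to set these from Python (for example in a notebook), the standard library works as well; a minimal sketch, to be run before the annotator makes any API calls:

import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # same effect as the export above
# os.environ["OPENAI_ORG"] = "..."       # optional organization ID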

To annotate pairs of outputs from your model, use the following code. For more details, or for functions to use if your outputs are in a different format, refer to the example notebook.

from alpaca_farm.auto_annotations import PairwiseAutoAnnotator
import json

# load some data
with open("examples/data/outputs_pairs.json") as f:
    outputs_pairs = json.load(f)[:6]
print(outputs_pairs[-1:])
# [{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.',
#   'input': '',
#   'output_1': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]",
#   'output_2': "Hey everyone! \n\nI'm hosting a dinner party this Friday night and I'd love for all of you to come over. We'll have a delicious spread of food and some great conversations. \n\nLet me know if you can make it - I'd love to see you all there!\n\nCheers,\n[Your Name]"}]

annotator = PairwiseAutoAnnotator()
annotated = annotator.annotate_pairs(outputs_pairs)

print(annotated[-1:])
# [{'instruction': 'If you could help me write an email to my friends inviting them to dinner on Friday, it would be greatly appreciated.', 
# 'input': '', 
# 'output_1': "Dear Friends, \r\n\r\nI hope this message finds you well. I'm excited to invite you to dinner on Friday. We'll meet at 7:00 PM at [location]. I look forward to seeing you there. \r\n\r\nBest,\r\n[Name]", 
# 'output_2': "Hey everyone! \n\nI'm hosting a dinner party this Friday night and I'd love for all of you to come over. We'll have a delicious spread of food and some great conversations. \n\nLet me know if you can make it - I'd love to see you all there!\n\nCheers,\n[Your Name]",
# 'annotator': 'chatgpt_2', 
# 'preference': 2}]

If you have a list of sampled outputs instead of pairs, you can use the following.

multisample_outputs = [dict(instruction="repeat the following", input="yes", output=["yes", "no", "maybe", "repeat"])]
print(annotator.annotate_samples(multisample_outputs))
# [{'sample_id': 0, 
#   'instruction': 'repeat the following', 
#   'input': 'yes', 
#   'output_1': 'yes', 
#   'output_2': 'maybe', 
#   'annotator': 'chatgpt_2', 
#   'preference': 1}]

Running automatic evaluation

For all evaluations, we use AlpacaEval with our pool of automatic annotators.

To get started, set the environment variable OPENAI_API_KEY to your OpenAI API key, and (optionally) OPENAI_ORG to the organization ID. You can do this by running

export OPENAI_API_KEY="sk..."

The easiest way to add your model to the Alpaca leaderboard is to run the following code, which only requires outputs from your model on our eval data.

from alpaca_farm.auto_annotations import alpaca_leaderboard
import datasets

# predict on Alpaca eval data
alpaca_eval_data = datasets.load_dataset("tatsu-lab/alpaca_farm", "alpaca_farm_evaluation")["eval"]
...  # use the data to get outputs for your model and save it
path_to_outputs = "examples/data/eval_gpt-3.5-turbo-0301.json"
# outputs should be a list of json as such:
# [{'instruction': 'What are the names of some famous actors that started their careers on Broadway?', 'input': '', 'output': 'Some famous actors that started their careers on Broadway are Hugh Jackman, Meryl Streep, Denzel Washington, Audra McDonald, and Lin-Manuel Miranda.', 'generator': 'gpt-3.5-turbo-0301', 'dataset': 'helpful_base', 'datasplit': 'eval'},
# ...]

alpaca_leaderboard(path_to_outputs, name="My fancy model")
#                               win_rate  standard_error  n_total  avg_length
# gpt35_turbo_instruct             81.71            1.33      801        1018
# alpaca-farm-ppo-sim-gpt4-20k     44.10            1.74      805         511
# My fancy model                   41.54            2.01      597         327
# alpaca-farm-ppo-human            41.24            1.73      805         803
# alpaca-7b                        26.46            1.54      805         396
# text_davinci_001                 15.17            1.24      804         296
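As a hedged sketch of the elided step above: the outputs file is simply a JSON list of dicts with the fields shown in the comment. Here my_generate is a hypothetical stand-in for however you run your model, and the dataset/datasplit fields are assumed to be copied over from the eval examples:

import json

outputs = []
for example in alpaca_eval_data:
    outputs.append({
        "instruction": example["instruction"],
        "input": example["input"],
        "output": my_generate(example["instruction"], example["input"]),  # hypothetical model call
        "generator": "My fancy model",
        "dataset": example["dataset"],      # assumed present in the eval split
        "datasplit": example["datasplit"],  # assumed present in the eval split
    })

with open("my_model_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)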

Running reference methods

We provide reference implementations of several methods for learning from pairwise feedback. Example code to run these methods can be found in the examples/ directory. This includes supervised fine-tuning, reward modeling, RLHF with PPO, best-of-n decoding, and more.

Below we give example commands for reproducing the model artifacts in our paper. Notes:

  • All training code is tested with FlashAttention enabled on a machine with 8 80GB A100 GPUs.
  • Best-of-n decoding was tested with a single 80GB GPU.
  • Supervised fine-tuning and reward modeling can fit on 4 80GB A100 GPUs, while PPO training currently requires at least 8 80GB GPUs.
  • Before running the code below, make sure to convert your LLaMA checkpoint and tokenizer into HuggingFace format and store it at <your_path_to_hf_converted_llama_ckpt_and_tokenizer>.

Supervised fine-tuning (SFT)

To replicate our SFT10k model fine-tuned from LLaMA in the paper, run

bash examples/scripts/sft.sh \
  <your_output_dir_for_sft10k> \
  <your_wandb_run_name> \
  <your_path_to_hf_converted_llama_ckpt_and_tokenizer>

The SFT10k model will be saved at <your_output_dir_for_sft10k>, and the name of the wandb run will be <your_wandb_run_name>.

Reward modeling

To replicate our reward models trained in the paper, run

bash examples/scripts/reward_modeling.sh \
  <your_output_dir_for_reward_model> \
  <your_wandb_run_name> \
  <your_output_dir_for_sft10k> \
  <preference_dataset_name>

Set <preference_dataset_name> to "alpaca_noisy_multi_preference" for the simulated-preference reward model, and to "alpaca_human_preference" for the human-preference reward model.
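If you want to inspect the preference data before training, both configurations are hosted in the same HuggingFace dataset as the evaluation data; a minimal sketch (split names are not spelled out here, so print the dataset to see them):

import datasets

# Same dataset repo as the evaluation data; config names match <preference_dataset_name>.
prefs = datasets.load_dataset("tatsu-lab/alpaca_farm", "alpaca_human_preference")
print(prefs)                      # shows the available splits and columns
first_split = list(prefs.keys())[0]
print(prefs[first_split][0])      # peek at one preference example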

RLHF with PPO

To replicate our RLHF PPO model trained with simulated reward model in the paper, run

bash examples/scripts/rlhf_ppo.sh \
  <your_output_dir_for_ppo> \
  <your_wandb_run_name> \
  <your_output_dir_for_reward_model> \
  <your_output_dir_for_sft10k> \
  <kl_coef>

<your_output_dir_for_reward_model> should point to either the simulated reward model or the human reward model trained in the previous step. Note that the KL penalty coefficient for human-reward PPO is much larger than for simulated PPO. Set <kl_coef> to 0.0067 for simulated PPO and 0.02 for human PPO to recover our original results. Performance of the PPO model is typically much better than SFT at 20-80 PPO steps (less than 4 passes through the entire set of instructions) and starts to decay with more PPO steps.
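For intuition on what <kl_coef> controls: RLHF-style PPO typically maximizes the reward-model score minus a KL penalty that keeps the policy close to the SFT model. A schematic sketch of that shaped reward (not the exact implementation in this repo):

import torch

def kl_shaped_reward(reward: torch.Tensor, logprob_policy: torch.Tensor,
                     logprob_ref: torch.Tensor, kl_coef: float) -> torch.Tensor:
    # reward: reward-model score for a sampled response.
    # logprob_policy / logprob_ref: per-token log-probs of that response under the
    # current policy and the frozen SFT reference model, respectively.
    kl_estimate = (logprob_policy - logprob_ref).sum(dim=-1)  # per-sequence KL estimate
    return reward - kl_coef * kl_estimate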

Best-of-n decoding

To replicate our best-of-n inference-time decoding results for the AlpacaFarm evaluation suite, run

python examples/best_of_n.py \
  --task "run_best_of_n" \
  --decoder_name_or_path <your_output_dir_for_decoder> \  # Can be SFT model or even PPO tuned model.
  --scorer_name_or_path <your_output_dir_for_reward_model> \
  --num_return_sequences 16 \  # This is the n in best-of-n.
  --per_device_batch_size 4 \  # Reduce this if you don't have enough memory.
  --split "eval" \
  --mixed_precision "bf16" \
  --tf32 True \
  --flash_attn True \
  --output_path <your_output_path_to_store_samples>

You can then use the generated samples at <your_output_path_to_store_samples> directly with our automated evaluation.
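If you want to reproduce the selection step on samples you already have, best-of-n simply keeps the candidate that the reward model scores highest; a generic sketch, not the repo's implementation:

from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    # In practice, score_fn would wrap the trained reward model's forward pass.
    return max(candidates, key=score_fn)

# Toy usage with a stand-in scorer:
print(best_of_n(["short answer", "a longer, more detailed answer"], score_fn=len))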

Expert Iteration

To replicate our expert iteration results for the AlpacaFarm evaluation suite, first produce best-of-n samples. Run

python examples/best_of_n.py \
  --task "run_best_of_n" \
  --decoder_name_or_path <your_output_dir_for_decoder> \  # SFT10k model.
  --scorer_name_or_path <your_output_dir_for_reward_model> \
  --num_return_sequences 16 \  # This is the n in best-of-n.
  --per_device_batch_size 4 \  # Reduce this if you don't have enough memory.
  --split "unlabeled" \
  --mixed_precision "bf16" \
  --tf32 True \
  --flash_attn True \
  --output_path '<your_output_dir_for_expiter_data>/best_of_n_samples.json'

Then perform supervised fine-tuning from the SFT10k checkpoint with the best-of-n samples

bash examples/scripts/expiter.sh \
  <your_output_dir_for_expiter> \
  <your_wandb_run_name> \
  <your_output_dir_for_sft10k> \
  <your_output_dir_for_expiter_data>

Quark

To replicate our Quark results for the AlpacaFarm evaluation suite, run

bash examples/scripts/rlhf_quark.sh \
  <your_output_dir_for_quark> \
  <your_wandb_run_name> \
  <your_output_dir_for_reward_model> \
  <your_output_dir_for_sft10k> \
  <kl_coef>

DPO

To replicate our DPO results for the AlpacaFarm evaluation suite, run

bash examples/scripts/dpo.sh \
  <your_output_dir_for_dpo> \
  <your_wandb_run_name> \
  <your_output_dir_for_sft10k>

OpenAI models

To run the OpenAI reference models with our prompts and decoding hyperparameters, run

python examples/oai_baselines.py \
  --model_name <oai_model_name> \
  --save_path <save_path> 

You can then use the generated samples at <save_path> directly with our automated evaluation.
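For example, the saved outputs can be scored with the leaderboard utility shown earlier (assuming the file follows the same output format):

from alpaca_farm.auto_annotations import alpaca_leaderboard

# <save_path> is the JSON file written by examples/oai_baselines.py above.
alpaca_leaderboard("<save_path>", name="<oai_model_name>")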

Downloading pre-tuned AlpacaFarm models

We provide model checkpoints for reward models and all our reference methods, listed in Table 2 of our paper. Concretely, we tune each reference method in AlpacaFarm simulation and on human preference data and release both versions. The current list of models (available here) includes:

  • sft10k, the supervised learning base model that we collect preference data with.
  • reward-model-sim, the reward model trained on AlpacaFarm preference data.
  • reward-model-human, the reward model trained on human preference data.
  • ppo-sim, the best PPO checkpoint trained in simulation.
  • ppo-human, the best PPO checkpoint trained on human data.
  • expiter-sim, the best expert iteration checkpoint trained in simulation.
  • expiter-human, the best expert iteration checkpoint trained on human data.
  • feedme-sim, the FeedME method trained on simulated preferences.
  • feedme-human, the FeedME method trained on human preferences.
  • reward-condition-sim, the reward conditioning method trained on simulated preferences.

To download and recover these checkpoints, first make sure to have a LLaMA-7B checkpoint converted into the Hugging Face format with transformers>=4.29.2. Then, run the following to download all AlpacaFarm models:

python -m pretrained_models.recover_model_weights \
  --llama-7b-hf-dir <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
  --alpaca-farm-model-name all

Or, specify a particular model name to download just that model:

python -m pretrained_models.recover_model_weights \
  --llama-7b-hf-dir <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
  --alpaca-farm-model-name <one_of_the_model_names_from_above> \
  --models-save-dir <dir_to_save_all_models>

To download either of the reward models individually, you'll need to have sft10k downloaded first to <dir_to_save_all_models>.
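Once recovered, the policy checkpoints (e.g. sft10k) should be standard Hugging Face causal-LM directories (the reward models instead load through the repo's RewardModel class). A minimal generation sanity check, assuming that layout; the path below is a placeholder:

import transformers

model_dir = "<dir_to_save_all_models>/sft10k"  # placeholder for the recovered checkpoint directory
tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
model = transformers.AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("Write a short poem about llamas.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))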

Citation

Please consider citing our work if you use the data or code in this repo.

@misc{dubois2023alpacafarm,
      title={AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback}, 
      author={Yann Dubois and Xuechen Li and Rohan Taori and Tianyi Zhang and Ishaan Gulrajani and Jimmy Ba and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
      year={2023},
      eprint={2305.14387},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

If you use alpaca-farm>=0.2.0, make sure to specify that the annotator changed (as text-davinci-003 is deprecated). The preferences and win rates are now from AlpacaEval 1 and are not comparable to the numbers in our paper. You can cite AlpacaEval as:

@misc{alpaca_eval,
  author = {Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  title = {AlpacaEval: An Automatic Evaluator of Instruction-following Models},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/alpaca_eval}}
}

alpaca_farm's People

Contributors

actions-user, lxuechen, rtaori, stceum, yanndubs


alpaca_farm's Issues

[Reward Model Training] Inconsistent accuracy caused by flash-attention

Many thanks for your excellent work~
When training the reward model, I found that flash-attn affected the final accuracy. I followed the README exactly to reproduce sft10k and then used it to train a reward model. I used all the default parameters, but found that using flash-attn or not made a 3.4% difference in accuracy (60% with flash-attn vs. 56.6% without). The results are shown in the figure below: the pink curve is without flash-attn and the blue curve is with flash-attn. Is this accuracy gap normal?

I noticed that only inference consistency is tested in tests/test_flash_llama.py. Did the authors test back-propagation?

[figure: reward-model eval accuracy curves, with flash-attn (blue) vs. without (pink)]

[tokenization] preprocessing inputs and labels

Hello,

Firstly thanks for open-sourcing all components of alpaca_farm!

I'm looking into data_preprocessor.py and am wondering where/if the labels are set to the input_ids shifted by 1 (something like labels = input_ids[..., 1:] and input_ids = input_ids[..., :-1], i.e. classic next-token prediction).

However, it seems like they're set to input_ids without any shifting? I'm not sure what I'm missing but any clarification would be great :)

[screenshot: the relevant preprocessing code in data_preprocessor.py]
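For context, Hugging Face causal-LM heads apply the shift internally when computing the loss, which is why a preprocessor can pass labels identical to input_ids; a rough paraphrase of that standard transformers behavior (not code from this repo):

import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); labels: (batch, seq), typically equal to input_ids
    # with prompt positions masked to -100.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # ignored (masked) positions
    )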

Reproducibility of pretuned reward model

Hi, thanks for sharing the great project!

I attempted to reproduce the pretuned reward model reward-model-human using the provided script (https://github.com/tatsu-lab/alpaca_farm#reward-modeling), starting from the pretuned sft10k model. But when I measure the eval accuracy on "alpaca_human_preference" (always using the same split from seed=42), I get 64-66% eval accuracy, which is far lower than the 73% eval accuracy of the pretuned reward-model-human. I didn't change any hyper-parameters and used flash LLaMA and apex. Could there be another recipe that was used to create the pretuned reward-model-human?

Huge memory demand of recover_model_weights.py?

[ continuation of #70 ]

When running recover_model_weights.py --alpaca-farm-model-name sft10k, memory use of the python process grows to >30GB, at which point it gets killed by my system due to out of memory. Is this expected behavior?

Log:

Downloading sft10k
Downloading (…)lve/main/config.json: 100%
Downloading (…)model.bin.index.json: 100%
Downloading (…)l-00001-of-00003.bin: 100%
Downloading (…)l-00002-of-00003.bin: 100%
Downloading (…)l-00003-of-00003.bin: 100%
Downloading shards: 100%
Loading checkpoint shards: 100%
Downloading (…)neration_config.json: 100%
Downloading (…)okenizer_config.json: 100%
Downloading tokenizer.model: 100%
Downloading (…)/main/tokenizer.json: 100%
Downloading (…)in/added_tokens.json: 100%
Downloading (…)cial_tokens_map.json: 100%
WARNING:root:Your base LLaMA checkpoint is converted with transformers==4.27.0.dev0, but transformers>=4.29.2 is expected. This may produce a corrupted checkpoint and lead to unexpected behavior. Please regenerate your base LLaMA checkpoint with transformers>=4.29.2.
Loading checkpoint shards:  12%| 4/33 [00:16<01:59,  4.13s/it] Killed

PairwiseAutoAnnotator always "Annotating 0 examples with gpt4_3"

INFO:root:Annotating 0 examples with gpt4_3
INFO:root:Saving all annotations to ./eval_results/answer1.json
I am trying to compare two answers that are of low quality. I am wondering why it is always annotating 0 examples and always returns preference=0. It seems that the annotator skips the evaluation process. Is it because the answers are too meaningless?

Possible issue with gradient accumulation

Hello, thank you for a great work!

While studying the implementation, I suspect that this line, https://github.com/tatsu-lab/alpaca_farm/blob/main/src/alpaca_farm/rl/rl_trainer.py#L150, which zeroes the gradients during gradient accumulation, could zero out all gradients except those from the final gradient-accumulation step (accelerator.sync_gradients), because policy.zero_grad is used instead of optimizer.zero_grad.

I think this could cause all gradients from the intermediate gradient-accumulation steps to be ignored, keeping only the step with sync_gradients=True. Could you let me know about this possible problem? Thank you!
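For reference, a generic PyTorch gradient-accumulation sketch (not this repo's trainer) in which gradients are zeroed only together with the optimizer step, never between micro-batches:

def train_with_accumulation(policy, optimizer, micro_batches, accumulation_steps: int):
    # Generic sketch: zero_grad runs only right after optimizer.step(), so gradients
    # keep accumulating across the intermediate micro-batches.
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        loss = policy(**batch).loss / accumulation_steps
        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()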

Use of decapoda-research/llama-7b-hf checkpoint for the LLaMa-7B

In my test, when the decapoda-research/llama-7b-hf checkpoint is used for LLaMA-7B, the released model does not seem to work.

model_name_or_path = 'tatsu-lab/alpaca-farm-sft10k-wdiff'
model = model_cls.from_pretrained(model_name_or_path, **model_kwargs).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, **tokenizer_kwargs)
model_raw = load_raw_model('decapoda-research/llama-7b-hf', device)
reconstruct_tuned_model(model, model_raw)

source = tokenizer('What is the meaning of life?', return_tensors="pt", add_special_tokens=False)
inputs = source.input_ids
input_size = len(inputs[0])
output = model.generate(inputs=inputs.cuda(), temperature=0.7, do_sample=True, num_beams=1, max_new_tokens=100)

tokenizer.decode(output[0][input_size:], skip_special_tokens=True)
> 'mere latter whilst tout distinction namely coinc nam chang programme bef latter canon proph oracle tout nem sull ze nast mant appar bes appar fick moth rif oracle sup splendid mant bew splendid inse splend nem latter newer nun fick revel vor critics trou grud critics nam bes dar ze stup splend brig oracle stup splendid nun rapp grud trou nun nun nam litt parish parish nem divor mant revel devil splendid nem dal lud jan stan baz newer splendid rapp stan ruby rapp appar revel appar parish latter mighty ze litt bir mighty brig brig splend ga bew mighty'

Generation Issue (probability tensor contains either `inf`, `nan` or element < 0) of Flash-LLaMA with Model Parallelism

I am trying to use Flash-LLaMA with Hugging Face model parallelism on 2 A100-80GB GPUs, and I built my environment following the README in the AlpacaFarm GitHub repo (the flash-attn package version is 1.0.7). I found that generation (using model.generate() for sampling) raises "RuntimeError: probability tensor contains either inf, nan or element < 0". LLaMA without Flash-Attention works well. Also, turning off model parallelism (i.e., using only 1 GPU) does not seem to have this issue.

Before running the code, we need to change this line, where we should add .to(tensor.device) after position_ids.

The code and input file are here.

The code:

import torch
from alpaca_farm import common
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--flash_attn", required = True, type = int)
args = parser.parse_args()

assert torch.cuda.device_count() == 2

args.flash_attn = bool(args.flash_attn)
print("Turn on Flash-Attention : {}".format(args.flash_attn))
model_name_or_path = "../../tatsu-lab/sft10k" # The path to sft10k
model = common.make_generative_lm(
            model_name_or_path = model_name_or_path,
            flash_attn = args.flash_attn,
            fp16 = False,
            bf16 = True,
            low_cpu_mem_usage = True,
            device_map = "auto",
            torch_dtype = torch.bfloat16,
        )

inputs = torch.load("inputs.bin")
input_ids = inputs["queries"].cuda()
attention_mask = inputs["query_attn_masks"].cuda()
responses = model.generate(
            inputs = input_ids,
            attention_mask = attention_mask,
            do_sample = True,
            max_new_tokens = 320,
            pad_token_id = 32000,
            top_p = 1.0,
            top_k = 0,
            temperature = 1.0,
            num_return_sequences = 1,
            # synced_gpus=True,
        )
print(responses)

BaseAnnotator.__init__() got an unexpected keyword argument 'other_keys_to_keep'

Traceback (most recent call last):
  File "/disk2/data/xk/RLPHF/gpt4_annotate/run.py", line 73, in <module>
    main(args)
  File "/disk2/data/xk/RLPHF/gpt4_annotate/run.py", line 52, in main
    annotator = PairwiseAutoAnnotator(annotators_config=args.annotators, saving_path=args.saving_path, openai_api_key=args.open_ai_key)
  File "/root/anaconda3/envs/rlphf/lib/python3.10/site-packages/alpaca_farm/auto_annotations/eval.py", line 170, in __init__
    super().__init__(
  File "/root/anaconda3/envs/rlphf/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 51, in __init__
    super().__init__(*args, **kwargs, primary_keys=self.input_keys + self.output_keys)
  File "/root/anaconda3/envs/rlphf/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 436, in __init__
    super().__init__(*args, **kwargs)
TypeError: BaseAnnotator.__init__() got an unexpected keyword argument 'other_keys_to_keep'

It seems that the argument 'other_keys_to_keep' is not accepted by BaseAnnotator. I checked the initialization of PairwiseAutoAnnotator:
class PairwiseAutoAnnotator(eval_annotators.PairwiseAnnotator):
    def __init__(
        self,
        annotators_config: Union[eval_utils.AnyPath, list[dict[str, Any]]] = "annotator_pool_v0",
        input_keys: Sequence[str] = ("instruction", "input"),
        p_label_flip: Optional[float] = None,
        base_dir: eval_utils.AnyPath = ANNOTATORS_CONFIG_DIR,
        other_keys_to_keep: Sequence[str] = tuple(),
        **kwargs,
    ):
        super().__init__(
            annotators_config=annotators_config,
            input_keys=input_keys,
            p_label_flip=p_label_flip,
            base_dir=base_dir,
            other_keys_to_keep=other_keys_to_keep,
            **kwargs,
        )

environment information:
alpaca-eval 0.5.1 pypi_0 pypi
alpaca-farm 0.1.9 pypi_0 pypi

How can I solve this problem? Thanks!

tried to use bnb, QLora on SFT but have errors

I tried to add bnb and QLoRA to SFT so that I can use fewer compute resources to complete SFT.
I added a bnb_config:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    target_modules=modules,  # `modules`: list of target module names (defined elsewhere)
    bias="none",
    task_type="CAUSAL_LM",
)

I also updated to transformers==4.31.0; running sft.sh reports:

 File "/usr/local/Miniconda3/envs/pytorch310/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 435, in _init_flat_param
    raise ValueError("Integer parameters are unsupported")
ValueError: Integer parameters are unsupported
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40693) of binary: /usr/local/Miniconda3/envs/pytorch310/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/Miniconda3/envs/pytorch310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/Miniconda3/envs/pytorch310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/Miniconda3/envs/pytorch310/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/Miniconda3/envs/pytorch310/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/Miniconda3/envs/pytorch310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/Miniconda3/envs/pytorch310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

integrity_check error with sft10k

Dear authors,

I get the integrity_check error with sft10k.
The base checkpoint was a raw Meta checkpoint (LLaMA-1 7B) converted with transformers==4.34.0.
Any insights would be appreciated!

The command I used is:

  python -m pretrained_models.recover_model_weights \
  --llama-7b-hf-dir <path1> \
  --alpaca-farm-model-name sft10k \
  --models-save-dir <path2>

Use with Llama-2-70b-hf?

We are trying to use alpaca_farm with the Llama-2-70b-hf model downloaded from https://huggingface.co/meta-llama/Llama-2-70b-hf, since the original llama-7b-hf seems to have been taken down from HuggingFace (at least https://huggingface.co/meta-llama/Llama-7b-hf gives a 404).
When we run
python pretrained_models/recover_model_weights.py --llama-7b-hf-dir ../Llama-2-70b-hf/ --alpaca-farm-model-name all
we get

Downloading sft10k
Traceback (most recent call last):
  File "/p/projects/ou/labs/gane/rlhf/git/alpaca_farm-collective/pretrained_models/recover_model_weights.py", line 112, in <module>
    model_tuned, tokenizer_tuned = load_weight_diff(hf_hub_name, is_reward_model, args.device, args.path_to_sft10k)
  File "/p/projects/ou/labs/gane/rlhf/git/alpaca_farm-collective/pretrained_models/recover_model_weights.py", line 48, in load_weight_diff
    model_tuned = transformers.AutoModelForCausalLM.from_pretrained(
  File "/p/projects/ou/labs/gane/rlhf/envs/alpaca-env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 434, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/p/projects/ou/labs/gane/rlhf/envs/alpaca-env/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 873, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/p/projects/ou/labs/gane/rlhf/envs/alpaca-env/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 579, in __getitem__
    raise KeyError(key)
KeyError: 'llama'

even though the config.json in the specified folder contains the entry "model_type": "llama" :

{
  "_name_or_path": "meta-llama/Llama-2-70b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.32.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}

I wonder if this is due to the size mismatch (70b vs 7b) or the version mismatch (llama-2 vs llama) or some other issue. Any ideas?

Using pretrained models

The paper mentions that you performed end-to-end validation of AlpacaFarm. Do you have the code up on Github for that? I want to use the LLM pre-trained on human preferences to generate some more preferences.

model selection of PPO in Table 2

Hi, thank you for your great work here!

After running ppo script (examples/scripts/rlhf_ppo.sh) from your code, there are multiple checkpoints of finetuned PPO models from different training steps.

I wonder how the checkpoint is selected for the PPO results in Table 2.

  1. based on the validation split (2k) or the evaluation data (805)?
  2. based on scores of the trained reward model or simulated preferences from p_sim^eval?

Thank you!

Where is auto_annotations/annotators/annotator_pool_v0/configs.yaml ?


I'm encountering a problem while initializing PairwiseAutoAnnotator. It requires the alpaca_farm/auto_annotations/annotators/annotator_pool_v0/configs.yaml configuration file, which seems to be missing from the expected path. I have updated to the latest GitHub repository, but I still can't find this configuration file even in the updated repository. Could you advise me on how to resolve this?

`_name_or_path` is not stored in llama config.json any more

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py#L237

but it is still used in src/alpaca_farm/common.py:get_pretrained_model_name_with_model_name_or_path to obtain the model family.

At the SFT stage, I used a local directory path as --model-name-or-path, and that local directory path is stored as _name_or_path in the resulting config.json. get_pretrained_model_name_with_model_name_or_path then loaded the llama config.json and failed to look up '_name_or_path'.

Confusing detail preference mapping

Just a little detail:

In the example notebook auto_annotations.ipynb you write:

'preference': the index of the preferred output, here preference=2 so output_1 is prefered.

Shouldn't preference 2 be mapped to output_2 ? Thanks.

Inquiry Regarding Supervised Fine-Tuning with AlpacaFarm Framework for Pythia Models

Hi,

I'm reaching out because I'm eager to utilize the AlpacaFarm framework for supervised fine-tuning of the Pythia models, specifically EleutherAI/pythia-1.4b, on the 10k SFT dataset.

During my initial exploration, I've identified that I may need to make modifications to the make_generative_lm function located at line 94 within src/alpaca_farm/common.py. Could you kindly confirm if this is the correct approach for fine-tuning Pythia models using your framework?

Additionally, I'd appreciate guidance on any other code modifications or adjustments that might be necessary to successfully train the Pythia-series models using the AlpacaFarm framework.

Thank you for your assistance and support.

Best regards,
Hank

Repeated Deprecation Error

I am repeatedly getting this error when I try to run the pairwise annotator. How do I fix this? The problem seems to be from Alpaca Eval.

INFO:root:Sleeping 2 before retrying to call openai API...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 404 Not Found"
WARNING:root:OpenAIError: Error code: 404 - {'error': {'message': 'The model text-davinci-003 has been deprecated, learn more here: https://platform.openai.com/docs/deprecations', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}.
WARNING:root:Unknown error.
It's likely a rate limit so we are retrying...

Here is the full traceback:
Traceback (most recent call last):
  File "/alpaca_farm/test_autoannotations.py", line 19, in <module>
    annotated = annotator.annotate_pairs(outputs_pairs)
  File "/anaconda3/envs/llm-pref/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 263, in annotate_pairs
    return self.__call__(to_annotate, **decoding_kwargs)
  File "/anaconda3/envs/llm-pref/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 188, in __call__
    df_annotated = self._annotate(curr_df_to_annotate, **decoding_kwargs)
  File "/anaconda3/envs/llm-pref/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 295, in _annotate
    curr_annotated = self.annotators[annotator](
  File "/anaconda3/envs/llm-pref/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 665, in __call__
    completions = self.fn_completions(prompts=prompts, **self.completions_kwargs, **decoding_kwargs)
  File "/anaconda3/envs/llm-pref/lib/python3.10/site-packages/alpaca_eval/decoders/openai.py", line 152, in openai_completions
    completions = list(
  File "/anaconda3/envs/llm-pref/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/anaconda3/envs/llm-pref/lib/python3.10/multiprocessing/pool.py", line 861, in next
    self._cond.wait(timeout)
  File "/anaconda3/envs/llm-pref/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt

recover_model_weights on reward-model-sim hits a problem with _name_or_path and backbone_model_name_or_path

Really love your work on open-sourcing the whole pipeline and all components!
But when I want to recover the model weights of reward-model-sim, I found that _name_or_path points to an absolute path (visible in the error below). When I run the weight conversion, I encounter the following error:

Traceback (most recent call last):
  File "/home/jingchu/RLHF/alpaca_farm/pretrained_models/recover_model_weights.py", line 89, in <module>
    model_tuned, tokenizer_tuned = load_weight_diff(hf_hub_name, is_reward_model, args.device)
  File "/home/jingchu/RLHF/alpaca_farm/pretrained_models/recover_model_weights.py", line 32, in load_weight_diff
    model_tuned = RewardModel.from_pretrained(
  File "/anaconda/envs/alpaca_farm/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2611, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/jingchu/RLHF/alpaca_farm/src/alpaca_farm/models/reward_model.py", line 46, in __init__
    self.backbone_model = common.make_generative_lm(config.backbone_model_name_or_path, **kwargs)
  File "/home/jingchu/RLHF/alpaca_farm/src/alpaca_farm/common.py", line 120, in make_generative_lm
    return model_cls.from_pretrained(model_name_or_path, **kwargs)
  File "/anaconda/envs/alpaca_farm/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2251, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "/anaconda/envs/alpaca_farm/lib/python3.10/site-packages/transformers/configuration_utils.py", line 547, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/anaconda/envs/alpaca_farm/lib/python3.10/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/anaconda/envs/alpaca_farm/lib/python3.10/site-packages/transformers/configuration_utils.py", line 650, in _get_config_dict
    raise EnvironmentError(
OSError: Can't load the configuration of '/juice5/scr5/nlp/crfm/human-feedback/models/selfinstruct/sft_v6_llama_7b_regen_v7_3ep'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/juice5/scr5/nlp/crfm/human-feedback/models/selfinstruct/sft_v6_llama_7b_regen_v7_3ep' is the correct path to a directory containing a config.json file

So what should I do to actually convert the model weights, and what is the purpose of this _name_or_path? Many thanks!

Error downloading pre-trained weights

I'm trying to get the pretrained weights for alpaca-farm but I'm running into this error.

I'm using LLama 2 7B for this

Traceback (most recent call last):
  File "../convert_llama_weights_to_hf.py", line 340, in <module>
    main()
  File "../convert_llama_weights_to_hf.py", line 327, in main
    write_model(
  File "../convert_llama_weights_to_hf.py", line 289, in write_model
    shutil.rmtree(tmp_model_path)
  File "/usr/lib/python3.8/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/datasets/ai/llama2/7B/tmp'

lower cuda version?

I tried to execute the following snippet, which turned out to require CUDA >= 11.0; however, I only have access to version 10.2. Is there a way to specify another CUDA version, e.g. in reward_modeling.sh? Thanks

bash examples/scripts/reward_modeling.sh \
  <your_output_dir_for_reward_model> \
  <your_wandb_run_name> \
  <your_output_dir_for_sft10k> \
  <preference_dataset_name>

Differences in results between the paper and the code

Hi, first of all, thank you for releasing the code for this project! Super helpful to anyone doing research on RLHF and alignment.

I noticed that the win rates you published in the paper (Table 2) and the win rates in the code (eval.py) are different.
For example, the win rate of SFT 10k is 36.7 in the paper and 40.8 in the code. When I train using your script, which one should I expect to get?

Thank you

KeyError: 'llama' in /recover_model_weights.py

[continuation of #69 ]
When trying recover_model_weights.py on llama-7b-hf as cloned from https://huggingface.co/decapoda-research/llama-7b-hf (following advice by lxuechen to look for copies of llama-7b on huggingface spaces), we (still) get the error

Downloading sft10k
Traceback (most recent call last):
  File "/p/projects/ou/labs/gane/rlhf/git/alpaca_farm-collective/pretrained_models/recover_model_weights.py", line 112, in <module>
    model_tuned, tokenizer_tuned = load_weight_diff(hf_hub_name, is_reward_model, args.device, args.path_to_sft10k)
  File "/p/projects/ou/labs/gane/rlhf/git/alpaca_farm-collective/pretrained_models/recover_model_weights.py", line 48, in load_weight_diff
    model_tuned = transformers.AutoModelForCausalLM.from_pretrained(
  File "/p/projects/ou/labs/gane/rlhf/envs/alpaca-env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 434, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/p/projects/ou/labs/gane/rlhf/envs/alpaca-env/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 873, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/p/projects/ou/labs/gane/rlhf/envs/alpaca-env/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 579, in __getitem__
    raise KeyError(key)
KeyError: 'llama'

Can you point us to any version of llama that should work, so that we can rule out that it is a problem with the model version?

Running PPO with fewer GPUs

First, thanks for the implementation! The README says "PPO training currently requires at least 8 80GB GPUs." I was wondering if it's possible to run the PPO algorithm with 4 A100 80GB GPUs. I have tried enabling gradient checkpointing for LLaMA, which does not seem to help. I'm also trying peft with DeepSpeed.
I would just like to check if it's possible to run PPO with fewer GPUs at all, and if possible, what changes I should make.

[Discussion] Adding more diverse annotators representing subpopulations?

I plan to study diversity-related aspects of RLHF, so I wonder how I could add more annotators that represent relevant subpopulations, so that the annotator pool is as representative as possible, but without making the annotators behave like stereotypes.
I fear that simply adding something like "You live in the US and are a registered voter of the Democratic Party" or "you are a subsistence farmer in rural India" will not be the right approach...

Problem with PairwiseAutoAnnotator

Hi,
when I ran the following code, it raised a runtime error.

from alpaca_farm.auto_annotations import PairwiseAutoAnnotator
import json

with open("alpaca_farm/examples/data/outputs_pairs.json") as f:
    outputs_pairs = json.load(f)
annotator = PairwiseAutoAnnotator()
annotated = annotator.annotate_pairs(outputs_pairs)

INFO:root:Creating the annotator from annotator_pool_v0.
INFO:root:Saving annotations to /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
Annotation chunk: 0%| | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 0 examples with gpt4_1
INFO:root:Annotating 0 examples with gpt4_2
INFO:root:Annotating 0 examples with gpt4_3
INFO:root:Annotating 0 examples with gpt4_4
INFO:root:Annotating 1 examples with gpt4_5
INFO:root:Using openai_completions on 1 prompts using gpt-4-0314.
INFO:root:Kwargs to completion: {'max_tokens': 250, 'temperature': 1.0}
INFO:root:Kwargs to completion: {'n': 1, 'model': 'gpt-4-0314', 'is_chat': True, 'max_tokens': 250, 'temperature': 1.0}
INFO:root:Creating the annotator from annotator_pool_v0. | 0/1 [00:00<?, ?it/s]
INFO:root:Creating the annotator from annotator_pool_v0.
INFO:root:Creating the annotator from annotator_pool_v0.
INFO:root:Creating the annotator from annotator_pool_v0.
INFO:root:Saving annotations to /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
INFO:root:Saving annotations to /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
Annotation chunk: 0%| | 0/1 [00:00<?, ?it/s]INFO:root:Saving annotations to /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
Annotation chunk: 0%| | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 0 examples with gpt4_1
INFO:root:Saving annotations to /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
INFO:root:Loading all annotations from /Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_farm/auto_annotations/annotators/annotator_pool_v0/annotations_seed0_configs.json.
INFO:root:Annotating 0 examples with gpt4_2
INFO:root:Annotating 0 examples with gpt4_3
INFO:root:Annotating 0 examples with gpt4_4
INFO:root:Annotating 0 examples with gpt4_1
INFO:root:Annotating 0 examples with gpt4_2
INFO:root:Annotating 0 examples with gpt4_1
INFO:root:Annotating 1 examples with gpt4_5
INFO:root:Annotating 0 examples with gpt4_3
INFO:root:Annotating 0 examples with gpt4_2
INFO:root:Annotating 0 examples with gpt4_3
Annotation chunk: 0%| | 0/1 [00:00<?, ?it/s]INFO:root:Annotating 0 examples with gpt4_4
INFO:root:Annotating 0 examples with gpt4_4
INFO:root:Annotating 1 examples with gpt4_5
INFO:root:Annotating 1 examples with gpt4_5
INFO:root:Annotating 0 examples with gpt4_1
INFO:root:Annotating 0 examples with gpt4_2
INFO:root:Annotating 0 examples with gpt4_3
INFO:root:Annotating 0 examples with gpt4_4
INFO:root:Annotating 1 examples with gpt4_5
INFO:root:Using openai_completions on 1 prompts using gpt-4-0314.
INFO:root:Kwargs to completion: {'max_tokens': 250, 'temperature': 1.0}
INFO:root:Kwargs to completion: {'n': 1, 'model': 'gpt-4-0314', 'is_chat': True, 'max_tokens': 250, 'temperature': 1.0}
Annotation chunk: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 1, in
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/Users/langhao/conda/envs/py310/lib/python3.10/runpy.py", line 289, in run_path
INFO:root:Using openai_completions on 1 prompts using gpt-4-0314.
INFO:root:Using openai_completions on 1 prompts using gpt-4-0314.
INFO:root:Using openai_completions on 1 prompts using gpt-4-0314.
INFO:root:Kwargs to completion: {'max_tokens': 250, 'temperature': 1.0}
INFO:root:Kwargs to completion: {'n': 1, 'model': 'gpt-4-0314', 'is_chat': True, 'max_tokens': 250, 'temperature': 1.0}
return _run_module_code(code, init_globals, run_name,
File "/Users/langhao/conda/envs/py310/lib/python3.10/runpy.py", line 96, in _run_module_code
INFO:root:Kwargs to completion: {'max_tokens': 250, 'temperature': 1.0}
INFO:root:Kwargs to completion: {'n': 1, 'model': 'gpt-4-0314', 'is_chat': True, 'max_tokens': 250, 'temperature': 1.0}
_run_code(code, mod_globals, init_globals,
File "/Users/langhao/conda/envs/py310/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/langhao/cycle_rlhf/test.py", line 12, in
annotated = annotator.annotate_pairs(outputs_pairs)
File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 263, in annotate_pairs
return self.call(to_annotate, **decoding_kwargs)
File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 167, in call
INFO:root:Kwargs to completion: {'max_tokens': 250, 'temperature': 1.0}
df_annotated = self._annotate(curr_df_to_annotate, **decoding_kwargs)
File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 254, in _annotate
curr_annotated = self.annotators[annotator](
File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 562, in call
INFO:root:Kwargs to completion: {'n': 1, 'model': 'gpt-4-0314', 'is_chat': True, 'max_tokens': 250, 'temperature': 1.0}
completions = self.fn_completions(prompts=prompts, **self.completions_kwargs, **decoding_kwargs)
File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/decoders/openai.py", line 138, in openai_completions
with multiprocessing.Pool(num_procs) as p:
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/pool.py", line 215, in init
self._repopulate_pool()
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/pool.py", line 306, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/pool.py", line 329, in _repopulate_pool_static
w.start()
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Annotation chunk: 0%| | 0/1 [00:00<?, ?it/s]

Each spawned worker process then prints the same traceback:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
        exitcode = _main(fd, parent_sentinel)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
        prepare(preparation_data)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
        main_content = runpy.run_path(main_path,
      File "/Users/langhao/conda/envs/py310/lib/python3.10/runpy.py", line 289, in run_path
        return _run_module_code(code, init_globals, run_name,
      File "/Users/langhao/conda/envs/py310/lib/python3.10/runpy.py", line 96, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "/Users/langhao/conda/envs/py310/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/Users/langhao/cycle_rlhf/test.py", line 12, in <module>
        annotated = annotator.annotate_pairs(outputs_pairs)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 263, in annotate_pairs
        return self.__call__(to_annotate, **decoding_kwargs)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 167, in __call__
        df_annotated = self._annotate(curr_df_to_annotate, **decoding_kwargs)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 254, in _annotate
        curr_annotated = self.annotators[annotator](
      File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 562, in __call__
        completions = self.fn_completions(prompts=prompts, **self.completions_kwargs, **decoding_kwargs)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/site-packages/alpaca_eval/decoders/openai.py", line 138, in openai_completions
        with multiprocessing.Pool(num_procs) as p:
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/context.py", line 119, in Pool
        return Pool(processes, initializer, initargs, maxtasksperchild,
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/pool.py", line 215, in __init__
        self._repopulate_pool()
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/pool.py", line 306, in _repopulate_pool
        return self._repopulate_pool_static(self._ctx, self.Process,
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/pool.py", line 329, in _repopulate_pool_static
        w.start()
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/process.py", line 121, in start
        self._popen = self._Popen(self)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
        return Popen(process_obj)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
        super().__init__(process_obj)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
        self._launch(process_obj)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
        prep_data = spawn.get_preparation_data(process_obj._name)
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
        _check_not_importing_main()
      File "/Users/langhao/conda/envs/py310/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
        raise RuntimeError('''
    RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

RewardModel.from_pretrained() loads redundant weights (incurs extra ~30GB of RAM)

Hi,

Whenever a saved RewardModel is loaded via RewardModel.from_pretrained(model_path, flash_attn=True, fp16=False, bf16=True, low_cpu_mem_usage=True), it downloads the entire sharded checkpoint (https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/modeling_utils.py#L2876), which is already ~30GB because it contains all the weights of the reward model, including both the backbone model and the reward head. It then calls RewardModel.__init__() (via this line https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/modeling_utils.py#L2966), which loads all the weights of the backbone model (SFT10K, another ~30GB). Surely loading a pretrained model shouldn't require loading the backbone model weights twice?
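
For what it's worth, here is a sketch of the pattern that would avoid the double load: build the backbone from its config inside __init__ (random initialization), so that from_pretrained() populates both the backbone and the reward head from the single sharded checkpoint. This is hypothetical code, not the repository's actual RewardModel; names such as RewardModelSketch, backbone_model, and reward_head are illustrative, and it assumes a LLaMA backbone.

    import torch
    import transformers


    class RewardModelSketch(transformers.PreTrainedModel):
        """Hypothetical reward model whose __init__ never downloads backbone weights."""

        config_class = transformers.LlamaConfig  # assumption: LLaMA backbone

        def __init__(self, config):
            super().__init__(config)
            # Randomly initialized backbone; the real weights arrive only when
            # from_pretrained() loads the sharded reward-model checkpoint.
            self.backbone_model = transformers.AutoModelForCausalLM.from_config(config)
            self.reward_head = torch.nn.Linear(config.hidden_size, 1)

        def forward(self, input_ids, attention_mask=None):
            hidden_states = self.backbone_model(
                input_ids, attention_mask=attention_mask, output_hidden_states=True
            ).hidden_states[-1]
            # Score each sequence from the hidden state of its final token.
            return self.reward_head(hidden_states[:, -1])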

Thanks!

RecursionError: maximum recursion depth exceeded

Dear,

When I use the recover_model_weights module to build the sft10k model, I get a "RecursionError: maximum recursion depth exceeded". Do you know why?

The command I used is as follows:

    python -m pretrained_models.recover_model_weights \
        --llama-7b-hf-dir <...> \
        --alpaca-farm-model-name sft10k \
        --models-save-dir <...>

score with reward model

Hi, thanks for sharing the great project!
I want to score instruction triples <instruction, input, output> with the pretrained reward model directly. How can I use this codebase to achieve that?
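
For concreteness, here is a hedged sketch of what I imagine the usage to be; it assumes the reward model class is importable as alpaca_farm.models.reward_model.RewardModel, that from_pretrained accepts the keyword arguments mentioned in the issue above, that the prompt uses the Alpaca-style template, and that the forward pass exposes a rewards field with one scalar per sequence. Please correct any names that differ from the actual API.

    import torch
    from transformers import AutoTokenizer

    from alpaca_farm.models.reward_model import RewardModel  # assumed module path

    model_dir = "<dir_with_recovered_reward_model>"  # placeholder path
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    reward_model = RewardModel.from_pretrained(
        model_dir, flash_attn=False, bf16=True, low_cpu_mem_usage=True
    )
    reward_model.eval()

    # Alpaca-style prompt template (an assumption; swap in the repo's template if it differs).
    template = (
        "Below is an instruction that describes a task, paired with an input that provides "
        "further context. Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    )
    prompt = template.format(instruction="Translate to French.", input="Good morning.")
    text = prompt + "Bonjour."  # candidate output appended to the prompt

    with torch.no_grad():
        batch = tokenizer(text, return_tensors="pt")
        out = reward_model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    print(out.rewards)  # assumed field holding the scalar score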

Differences in results between the code and the leaderboard

This is excellent work, and I appreciate your open-source contribution. However, I have a couple of points that I find confusing:

  • The leaderboard (https://tatsu-lab.github.io/alpaca_eval/) shows a win rate of 95.28% for GPT-4, but the win rate from the code is only 80%. This inconsistency is perplexing.

  • I apologize for not being able to locate the specific model from your paper in the leaderboard. Could you please clarify which model from the leaderboard corresponds to the one mentioned in your paper?

Thank you for your attention to these matters, and I look forward to your response.

Problem with Simulation case

Hello, when using the code in "Simulating pairwise preference", I set os.environ with OPENAI_API_KEY. However, an error is raised when I try to create an instance of PairwiseAutoAnnotator:

NameError: name 'DUMMY_EXAMPLE' is not defined

Question about KL term

Hi,

Thank you for the great work! I am a novice in RL and have been using this repository to learn RLHF. I have a small question regarding the KL term on line 72 of the ppo_trainer.py file:

    kl = torch.clamp(logprobs - ref_logprobs, min=0.0)
    non_score_rewards = -self.kl_ctl.value * kl

According to equation (8) in https://arxiv.org/pdf/1707.06347.pdf, the KL penalty term is D_KL(p_ref, p). Based on the definition of KL, shouldn't line 72 be torch.clamp(ref_logprobs - logprobs, min=0.0)? I'm a bit confused about why it's torch.clamp(logprobs - ref_logprobs, min=0.0)... this seems to penalize sequences that have high likelihood if scored by ref_policy?
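
For concreteness, writing both directions of the KL divergence as expectations (my restatement of the standard definitions, with \pi the current policy and \pi_{\mathrm{ref}} the reference policy):

    D_{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi}\left[ \log \pi(y) - \log \pi_{\mathrm{ref}}(y) \right]
    D_{KL}(\pi_{\mathrm{ref}} \,\|\, \pi) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}}\left[ \log \pi_{\mathrm{ref}}(y) - \log \pi(y) \right]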

code question for compute_loss in ppo_trainer

I'm wondering: since we already obtain the policy's outputs during the rollout phase, why does compute_loss call self.policy again instead of reusing the outputs stored in rollouts?
Is there a difference between the two self.policy calls, or is there some other reason?

The relevant code is below.

Code for compute_loss:

        values, old_logprob, returns, advantages, queries, query_attn_masks, responses = common.prepare_inputs(
            common.unpack_dict(
                rollouts,
                keys=("values", "logprobs", "returns", "advantages", "queries", "query_attn_masks", "responses"),
            ),
            device=self.accelerator.device,
        )
        outputs = self.policy(queries, query_attn_masks, responses, temperature=self.args.temperature)

        vpred = outputs["values"]
        vpredclipped = torch.clamp(
            vpred,
            min=values - self.args.cliprange_value,
            max=values + self.args.cliprange_value,
        )

Code for rollouts_batch (a subset of rollouts):

            rollouts_batch = {"queries": queries, "query_attn_masks": query_attn_masks, "responses": responses}
            policy_outputs = self.policy(**rollouts_batch, temperature=self.args.temperature)
            policy_outputs = common.unpack_dict(
                policy_outputs, keys=("logprobs", "values", "entropies"), return_type=dict
            )
            rollouts_batch.update(policy_outputs)
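
My current understanding of why the extra forward pass is needed, sketched below so it can be corrected if wrong: PPO runs several optimization steps on the same batch of rollouts, so old_logprob recorded at rollout time stays fixed, while the current policy's log-probabilities change after every gradient update and must be recomputed to form the clipped importance ratio. A minimal, self-contained sketch of that ratio (standard PPO, not code from this repository):

    import torch


    def ppo_policy_loss(logprob, old_logprob, advantages, cliprange=0.2):
        # logprob: recomputed with the *current* parameters at each optimization step.
        # old_logprob: fixed log-probabilities recorded during the rollout phase.
        ratio = torch.exp(logprob - old_logprob)
        unclipped = -advantages * ratio
        clipped = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
        # Take the pessimistic (larger) loss per element, then average.
        return torch.maximum(unclipped, clipped).mean()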
