Learning Customized Human Preferences


This is the repo for the paper Everyone Deserves A Reward: Learning Customized Human Preferences, which contains the Domain-Specific Preference (DSP) dataset and the code for customized reward model (RM) training.

Overview

Reward models (RMs) are essential for aligning LLMs with human preferences. However, the real world is pluralistic: beyond universal values, human preferences diverge across religions, politics, cultures, and more. This raises an interesting question: "How can we learn a customized reward model well while preserving its general preference ability?" To answer it, we collect a Domain-Specific Preference (DSP) dataset and test multiple training and data strategies for customized RM learning.

Domain-Specific Preference (DSP) Dataset

We collect domain-specific preferences for four typical domains: Academy, Business, Entertainment, and Literature&Art. We select 13K instructions from the 52K Alpaca training set, then query ChatGPT with domain-specific system prompts to collect a preferred response for each domain, as shown in the figure below.

The collected data are data/domain_specific_preference.train.json and data/domain_specific_preference.test.json, which are lists of items including the following keys:

  • query: a question collected from the Alpaca training set,
  • responses: a collection of responses for the four application domains (academy, business, entertainment, literature), together with the original response from Alpaca (marked as normal).
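
For a quick sanity check, the raw files can be inspected directly. The snippet below is a minimal sketch that only assumes the query and responses keys described above; the exact container type of responses follows the released JSON files.

import json

# Load the raw DSP training split (path as given above).
with open("data/domain_specific_preference.train.json") as f:
    dsp_data = json.load(f)

print(len(dsp_data))   # number of collected instructions in this split

item = dsp_data[0]
print(item["query"])   # the Alpaca instruction
# `responses` holds one response per domain (academy, business, entertainment,
# literature) plus the original Alpaca response, marked as "normal".
print(item["responses"])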

To convert the collected responses into a preference data format, use the following command:

DOMAIN="academy"
DATA_TYPE="train"

python3 reward_datasets.py \
       --input_data_path data/domain_specific_preference.${DATA_TYPE}.json \
       --domain ${DOMAIN} \
       --output_data_path data/dsp_${DOMAIN}_pairs.${DATA_TYPE}.json \
       --convert --to_pairs

where DOMAIN can be set to any of the four domains and DATA_TYPE to either train or test. After the conversion, the preference data is in a text-score format, where each item contains two keys:

  • text: a list of text, each of which combines the query and one response as a complete human-assistant interaction,
  • score: a list of scores, each of which is a preference score to the corresponding interaction.
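
For illustration, a converted pair item looks roughly like the following; the contents and the prompt template are hypothetical placeholders, and the real format is produced by reward_datasets.py.

# Hypothetical example of one converted item in the text-score (pairwise) format.
pair_item = {
    "text": [
        "Human: <query>\n\nAssistant: <response preferred in the target domain>",
        "Human: <query>\n\nAssistant: <the other response in the pair>",
    ],
    # One preference score per interaction; the higher score marks the preferred one.
    "score": [1.0, 0.0],
}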

Besides the DSP data, we also cleaned the general preference comparisons from the Helpful&Harmless, WebGPT, and GPT-4-LLM datasets and converted them into the same text-score format.

Customized RM training

To learn customized human preferences, we design a three-stage RM training scheme, as shown in the figure above. For base LM training, we directly use pretrained LLMs, including LLaMA-7B and Alpaca-7B. To reproduce our results on general RM fine-tuning (GRFT) and customized RM fine-tuning (CRFT), first install the dependencies with Python 3.8:

pip3 install -r requirements.txt

Note that we conduct all our experiments with 8 A100PRO 40G GPUs.

For general RM fine-tuning, use the following command:

DATA_DIR="data"
TRAIN_DATA_LIST="${DATA_DIR}/helpful.train.json \
                 ${DATA_DIR}/harmless.train.json"
TEST_DATA_LIST="${DATA_DIR}/helpful.test.json \
                 ${DATA_DIR}/harmless.test.json \
                 ${DATA_DIR}/webgpt.test.json \
                 ${DATA_DIR}/gpt4llm.test.json"
LM_COEFF=0.

torchrun --nproc_per_node=8 --master_port=6000 train.py \
    --do_train True --eval_at_start True \
    --lm_loss_coeff ${LM_COEFF} \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --train_data_path ${TRAIN_DATA_LIST} \
    --eval_data_path ${TEST_DATA_LIST} \
    --remove_unused_columns false \
    --reward_domain general \
    --output_dir <your_path_to_save_experiment_results> \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --save_strategy steps \
    --learning_rate 1e-6 \
    --warmup_steps 100 \
    --logging_steps 1 \
    --eval_steps 100 \
    --save_steps 300 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false

Here TRAIN_DATA_LIST can be changed to select which general preference data to use for GRFT. If LM_COEFF is set to a positive value, the RM is additionally trained with a language-modeling loss on the preferred samples.
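
Conceptually, the training objective combines a pairwise ranking loss over preferred/non-preferred interactions with an optional language-modeling term weighted by LM_COEFF. The sketch below illustrates this with plain PyTorch; variable names and reduction details are assumptions, not the repo's exact implementation.

import torch.nn.functional as F

def rm_loss(chosen_reward, rejected_reward, lm_logits=None, lm_labels=None, lm_coeff=0.0):
    # Pairwise ranking loss: push the reward of the preferred interaction
    # above the reward of the non-preferred one.
    loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    if lm_coeff > 0.0 and lm_logits is not None:
        # Optional next-token cross-entropy on the preferred samples only.
        lm_loss = F.cross_entropy(
            lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
            lm_labels[:, 1:].reshape(-1),
            ignore_index=-100,
        )
        loss = loss + lm_coeff * lm_loss
    return loss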

For customized RM fine-tuning, use the following command:

DOMAIN="academy"
DATA_DIR="data"
TRAIN_DATA_LIST="${DATA_DIR}/dsp_${DOMAIN}_pairs.train.json"
TEST_DATA_LIST="${DATA_DIR}/helpful.test.json \
                 ${DATA_DIR}/harmless.test.json \
                 ${DATA_DIR}/webgpt.test.json \
                 ${DATA_DIR}/gpt4llm.test.json \
                 ${DATA_DIR}/dsp_${DOMAIN}_pairs.test.json"
LM_COEFF=0.

torchrun --nproc_per_node=8 --master_port=6000 train.py \
    --do_train True --eval_at_start True \
    --lm_loss_coeff ${LM_COEFF} \
    --model_name_or_path <your_path_to_general_rm_ckpt_and_tokenizer> \
    --train_data_path ${TRAIN_DATA_LIST} \
    --eval_data_path ${TEST_DATA_LIST} \
    --remove_unused_columns false \
    --reward_domain ${DOMAIN} \
    --output_dir <your_path_to_save_experiment_results> \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --save_strategy steps \
    --learning_rate 1e-7 \
    --warmup_steps 0 \
    --logging_steps 1 \
    --eval_steps 1 \
    --save_steps 100 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false

Here DOMAIN can be selected from ["academy", "business", "entertainment", "literature"] to fine-tune the RM on the preferences of different domains.
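
After CRFT, the fine-tuned checkpoint can be used to score a candidate response for its domain. The snippet below is a generic sketch that assumes a Hugging Face checkpoint with a single-score classification head and a hypothetical path; the repo's own model class, prompt template, and loading code may differ.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "<your_path_to_customized_rm_ckpt>"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=1)
model.eval()

text = "Human: <query>\n\nAssistant: <candidate response>"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    reward = model(**inputs).logits.squeeze().item()
print(f"customized reward: {reward:.4f}")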

Citation

@article{cheng2023deserves,
      title={Everyone Deserves A Reward: Learning Customized Human Preferences}, 
      author={Pengyu Cheng and Jiawen Xie and Ke Bai and Yong Dai and Nan Du},
      journal={arXiv preprint arXiv:2309.03126},
      year={2023}
}
