At meta level, PPO based RLHF is performing minor adjustments to weights to align with

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Can we just replace PPO+RLHF with a preference models thats basically a transformer encoder + sigmoid model, trained with BCE. And during finetuning perform a reward maximization by just making the reward model predict 1s? about palm-rlhf-pytorch HOT 5 CLOSED

ssintelli commented on July 29, 2024

Can we just replace PPO+RLHF with a preference models thats basically a transformer encoder + sigmoid model, trained with BCE. And during finetuning perform a reward maximization by just making the reward model predict 1s?

from palm-rlhf-pytorch.

Comments (5)

ssintelli commented on July 29, 2024 1

Yes, you are correct, if simpler approach worked they (OpenAI) would have already tried. The approach I suggested is similar to PPLM.

As RL dynamics might be hard to tame for bigger models, maybe we can use PPLM as auxiliary guidance in addition to PPO, which I hope has minimal overhead as you have implemented much of ingredients already in the codebase.

We can have a schedule where PPLM can have weight of say 0.5 and it can decay to 0.0.

For now, I am closing the issue, if I get some good results using toy huggingface transformers + ppo/pplm combination experiments will post here.

@lucidrains Thanks for awesome work.

from palm-rlhf-pytorch.

ssintelli commented on July 29, 2024

Replacing the core RL algorithm might be too far fetched. Instead of replacing the core RL based algorithm I will see if I can actually apply the above naive supervised end to end reward maximization as guide to PPO as auxiliary task in RLHF Trainer. Sorry if it sounds vague. I will try to work on it on weekends. Also I need to go over all files in repo in details and recent literatures in incorporating human feedbacks to improve LLM so I might be completely wrong with my approach.

from palm-rlhf-pytorch.

lucidrains commented on July 29, 2024

@ssintelli haha, you read my deleted post

yea, let us know if you get PPLM working

from palm-rlhf-pytorch.

EricLee8 commented on July 29, 2024

I am also interested in your experiments, good luck and let me know if you get good results : )

from palm-rlhf-pytorch.

ssintelli commented on July 29, 2024

I didn't try on language however, I tried something similar, with image segmentation and creating pseudo feedback as a didn't have supervision data. Results were inconclusive sometimes good sometimes bad. From my experiments what I could figure out is that we need good amount of supervision data, then overfit the generative model for few epochs on supervised fine-tuning task and then optionally use RL with pesudo feedback and supervised feedback for alignment and task specific results. Actually for my PhD thesis I was initially trying to apply RLHF to improve image segmentation but later as per discussion with my supervisor and some crude experiments paused it for later.

I am still so confused, hope I didn't confuse you either. Hope it might help you even if I didn't answer exactly what I intended to do.

from palm-rlhf-pytorch.

Can we just replace PPO+RLHF with a preference models thats basically a transformer encoder + sigmoid model, trained with BCE. And during finetuning perform a reward maximization by just making the reward model predict 1s? about palm-rlhf-pytorch HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent