Proximal Policy Optimization (PPO) in LLM Training

Reinforcement learning (RL) has proven effective at further improving the ability of LLMs after the Supervised Fine-Tuning (SFT) stage. Especially since the release of DeepSeek-R1, attention has turned to RL in LLM training and to the algorithm used to train DeepSeek-R1: GRPO. GRPO is in fact a variant of PPO, which was proposed by OpenAI in 2017 and later used to train the GPT series. Therefore, before diving into GRPO, it is necessary to understand PPO first.

Components

There are four models in PPO: the Policy Model, the Value Model, the Reference Model, and the Reward Model.

  • Policy Model: The LLM we want to train; usually it has already been pre-trained and supervised fine-tuned.
  • Reference Model: A model that provides a reference output to compare the Policy Model against; usually it is the initial checkpoint of the Policy Model (after SFT, before RL).
  • Reward Model: A model that gives a reward based on the current output; usually it is obtained from the RLHF pipeline.
  • Value Model: A trainable model that estimates the future value based on the current output.

Difference between the Reward Model and the Value Model:
The Reward Model evaluates the quality of the current output, while the Value Model estimates the reward that can still be obtained in the future. The Value Model is used to stabilize training by accounting for global, future reward; the Reward Model only gives an immediate reward based on the current action.
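
As a concrete illustration, the four models could be set up roughly as follows. This is only a sketch assuming a HuggingFace `transformers`-style stack; the checkpoint names and initialization choices are placeholders, not something prescribed by PPO itself.

```python
# Minimal sketch of the four models used in PPO for LLMs.
import copy
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

sft_checkpoint = "my-org/llm-sft"           # hypothetical SFT checkpoint
rm_checkpoint = "my-org/llm-reward-model"   # hypothetical reward-model checkpoint

# Policy Model: the LLM being trained, starting from the SFT weights.
policy = AutoModelForCausalLM.from_pretrained(sft_checkpoint)

# Reference Model: a frozen copy of the initial policy, used for the KL penalty.
reference = copy.deepcopy(policy)
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)

# Reward Model: a frozen scorer of complete responses (e.g. trained on preference data).
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_checkpoint, num_labels=1)
reward_model.eval()

# Value Model: a trainable critic with a scalar head; often initialized from the
# reward model or the SFT model.
value_model = AutoModelForSequenceClassification.from_pretrained(rm_checkpoint, num_labels=1)
```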

Workflow

In each iteration:

  • A question $q$ is sampled from the dataset.
  • $q$ (together with the tokens generated so far, $o_{<t}$) is fed into the Policy Model, which produces the next token $o_t$.
  • Given q,o≤tq, o_{\le t}q,ot, Reward Model gives the reward rφ(q,o≤t)r_{\varphi}(q,o_{\le t})rφ(q,ot); to mitigate over-optimization of the reward model, a per-token KL penalty from Reference Model βlog⁡π(ot∣q,o≤t)πref(ot∣q,o≤t)\beta\log \frac{\pi(o_t|q, o_{\le t})}{\pi_{ref}(o_t|q, o_{\le t})}βlogπref(otq,ot)π(otq,ot). The final reward is
    rt=rφ(q,o≤t)−βlog⁡π(ot∣q,o≤t)πref(ot∣q,o≤t) r_t = r_{\varphi}(q,o_{\le t}) - \beta\log \frac{\pi(o_t|q, o_{\le t})}{\pi_{ref}(o_t|q, o_{\le t})} rt=rφ(q,ot)βlogπref(otq,ot)π(otq,ot)
    It can be observed that the penalty is not always positive: it punishes the policy for assigning a token much more probability than the reference model does, but rewards it for assigning less (which feels a little odd at first glance). The overall effect is to keep the policy conservative, i.e. close to the reference model.
  • Given q,o≤tq, o_{\le t}q,ot, Value Model gives the estimated value vtv_tvt.
  • Given $r_t$ and $v_t$, GAE combines and smooths them to produce the estimated advantage $A_t$; at the same time, the supervised signal $v_{target} = r_t + \gamma v_t$ is passed to the Value Model as the target label for $v_{t-1}$ (yes, $t-1$, following the Bellman equation $v_{t-1} = r_t + \gamma v_t$). The parameters of the Value Model are then updated with an MSE loss or something similar.
  • Given $A_t$, the surrogate objective of PPO is
    $$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{o \sim \pi_{\theta_{old}}(O \mid q)}\left[ \frac{1}{|o|}\sum_{t=1}^{|o|}\min\left[ \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}A_t,\ \text{clip}\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}, 1-\epsilon, 1+\epsilon \right)A_t \right] \right]$$
    where $A_t$ acts like the gradient direction and the $\min(\cdots)$ term acts like the step size of this surrogate objective; the $\text{clip}$ function restricts the magnitude of the update along $A_t$.
  • Using the policy gradient, we update the Policy Model according to $\mathcal{J}_{PPO}(\theta)$ by adjusting the probabilities $\pi_\theta(o_t \mid q, o_{<t})$ (a tensor-level sketch of this update step follows the list).
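
Below is a minimal, tensor-level sketch of the update step described above, assuming the per-token log-probabilities and values have already been gathered for one sampled response. The function name `ppo_step` and the hyperparameter values ($\beta$, $\gamma$, $\lambda$, $\epsilon$) are illustrative choices, not taken from the original papers.

```python
import torch

def ppo_step(logp_new, logp_old, logp_ref, values, reward_score,
             beta=0.1, gamma=1.0, lam=0.95, eps=0.2):
    """One PPO update for a single sampled response of T tokens.

    logp_new     : log pi_theta(o_t | q, o_<t) from the current policy (requires grad), shape [T]
    logp_old     : log pi_theta_old(o_t | q, o_<t) from the policy that sampled o, shape [T]
    logp_ref     : log pi_ref(o_t | q, o_<t) from the frozen Reference Model, shape [T]
    values       : v_t from the Value Model (requires grad), shape [T]
    reward_score : scalar r_phi(q, o) from the Reward Model for the whole response
    """
    T = logp_new.shape[0]

    # Per-token reward: KL penalty on every token, reward-model score on the last token.
    rewards = -beta * (logp_old - logp_ref).detach()
    rewards[-1] = rewards[-1] + reward_score

    # GAE: delta_t = r_t + gamma * v_{t+1} - v_t,  A_t = sum_l (gamma * lam)^l * delta_{t+l}
    vals = values.detach()
    advantages = torch.zeros(T)
    gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        next_value = vals[t + 1] if t + 1 < T else torch.tensor(0.0)
        delta = rewards[t] + gamma * next_value - vals[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae

    # Regression targets for the Value Model (returns = advantages + values).
    value_targets = advantages + vals

    # Clipped surrogate objective (to be maximized) and value loss (MSE).
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    value_loss = (values - value_targets).pow(2).mean()
    return surrogate, value_loss
```

In practice one would maximize the surrogate and minimize the value loss together, e.g. `loss = -surrogate + c_v * value_loss`, averaged over a batch of sampled responses, and then take a gradient step on the Policy Model and Value Model.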

Application

After unsupervised pre-training (base model) and supervised fine-tuning (instruct model), PPO is applied to further improve the performance of the LLM.

The value function employed in PPO is typically another model of a size comparable to the policy model, which brings a substantial memory and computational burden.

In the LLM context, usually only the last token is assigned a reward score by the reward model (because the Reward Model is trained to evaluate the entire response, not each token), which may complicate the training of a value function that is accurate at each token.
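
To make this concrete, here is a tiny sketch with made-up numbers (a 5-token response, a reward-model score of 0.8, and a hypothetical KL coefficient β = 0.1) showing the per-token reward signal the value function has to learn from:

```python
import torch

score = 0.8                                          # reward-model score for the full response
beta = 0.1                                           # KL-penalty coefficient (illustrative)
kl = torch.tensor([0.02, -0.01, 0.05, 0.00, 0.03])   # per-token log(pi / pi_ref)

rewards = -beta * kl      # small KL penalty on every token
rewards[-1] += score      # the reward-model score lands only on the final token
# rewards is now roughly [-0.002, 0.001, -0.005, 0.000, 0.797]: all tokens except
# the last carry almost no reward signal, so the value model has to propagate the
# final score backwards through the whole sequence.
```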
