Reinforcement learning (RL) has proven effective at further improving the abilities of LLMs after the Supervised Fine-Tuning (SFT) stage. Especially since the release of DeepSeek-R1, attention has focused on RL for LLM training and on the algorithm used to train DeepSeek-R1: GRPO. GRPO is in fact a variant of PPO, which was proposed by OpenAI in 2017 and later used in RLHF for the GPT series. Therefore, before diving into GRPO, it is worth understanding PPO first.

Components
There are four models involved in PPO: the Policy Model, the Value Model, the Reference Model, and the Reward Model.
- Policy Model: The LLM we want to train; usually it has already been pre-trained and supervised fine-tuned.
- Reference Model: A frozen model whose token probabilities serve as a reference for the Policy Model (via the KL penalty below); usually it is the initial checkpoint of the Policy Model (after SFT, before RL).
- Reward Model: A model that assigns a reward to the current output; usually it is trained on human preference data as part of the RLHF pipeline.
- Value Model: A trainable model that estimates the expected future reward given the current output.
Difference between Reward Model and Value Model:
The Reward Model evaluates the quality of the current output and gives an immediate reward for the current action; the Value Model estimates the reward still to come in the future. The Value Model is introduced to stabilize training by accounting for global, future reward, whereas the Reward Model only provides immediate feedback.
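To make the setup concrete, here is a minimal sketch of how the four models might be instantiated with Hugging Face `transformers`. The checkpoint paths are placeholders, and the value model is sketched as a simple scalar head on the policy's hidden states rather than a full copy of the policy:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

SFT_CKPT = "path/to/sft-checkpoint"   # placeholder: the SFT'd LLM
RM_CKPT = "path/to/reward-model"      # placeholder: the trained reward model

tokenizer = AutoTokenizer.from_pretrained(SFT_CKPT)

# Policy Model: the LLM being trained, initialized from the SFT checkpoint.
policy = AutoModelForCausalLM.from_pretrained(SFT_CKPT)

# Reference Model: a frozen copy of the same SFT checkpoint,
# used only to compute the KL penalty against the policy.
reference = AutoModelForCausalLM.from_pretrained(SFT_CKPT)
reference.requires_grad_(False)

# Reward Model: frozen; scores a whole (question, response) sequence with one scalar.
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_CKPT, num_labels=1)
reward_model.requires_grad_(False)

# Value Model (critic): trainable; here a scalar head on top of the policy's hidden size.
value_head = torch.nn.Linear(policy.config.hidden_size, 1)
```

In practice the value model can share the policy's backbone (as sketched here) or be a separate full-size network; either way, only the Policy Model and the Value Model receive gradient updates.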
Workflow
In each iteration:
- A question $q$ is sampled from the dataset.
- $q$ (together with the tokens generated so far, $o_{<t}$) is fed into the Policy Model, which produces the next token $o_t$.
- Given $q, o_{\le t}$, the Reward Model gives the reward $r_{\varphi}(q, o_{\le t})$; to mitigate over-optimization of the reward model, a per-token KL penalty against the Reference Model, $\beta\log \frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{ref}(o_t\mid q, o_{<t})}$, is subtracted. The final reward is
$$r_t = r_{\varphi}(q,o_{\le t}) - \beta\log \frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{ref}(o_t\mid q, o_{<t})}$$
Note that the penalty term is not always positive: it punishes the policy for raising a token's probability well above the reference, but actually rewards lowering it below the reference (which looks odd at first; the term is a per-token estimate of the KL divergence, so it is only guaranteed non-negative in expectation over tokens sampled from the policy). Overall it keeps the policy conservative and close to the reference model.
- Given $q, o_{\le t}$, the Value Model gives the estimated value $v_t$.
- Given $r_t$ and $v_t$, GAE combines and smooths them to produce the estimated advantage $A_t$; at the same time, the target $v_{target}=r_t+\gamma v_t$ is passed to the Value Model as the regression label for $v_{t-1}$ (yes, $t-1$, following the Bellman equation $v_{t-1} = r_t+\gamma v_t$). The Value Model's parameters are then updated with an MSE loss or something similar.
- Given $A_t$, the surrogate objective of PPO is
$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{o\sim\pi_{\theta_{old}}(O\mid q)}\left[ \frac{1}{|o|}\sum_{t=1}^{|o|}\min\left[ \frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{old}}(o_t\mid q,o_{<t})}A_t,\ \text{clip}\left( \frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{old}}(o_t\mid q,o_{<t})},\ 1-\epsilon,\ 1+\epsilon \right)A_t \right] \right]$$
where $A_t$ sets the direction and scale of the update (increase the token's probability if $A_t>0$, decrease it if $A_t<0$), while the $\min(\cdots)$ together with the $\text{clip}$ function limits how far a single update can move along $A_t$.
- Using the policy gradient, we update the Policy Model to maximize $\mathcal{J}_{PPO}(\theta)$ by adjusting the probabilities $\pi_\theta(o_t\mid q,o_{<t})$. (A runnable sketch of these per-token computations is given after this list.)
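The sketch below is a simplified, self-contained illustration of these steps on random tensors for a single response: the per-token reward with KL penalty, GAE, the value targets, and the clipped surrogate loss. The function names, the convention of adding the reward-model score only at the last token, and the hyperparameters ($\beta$, $\gamma$, $\lambda$, $\epsilon$) are illustrative assumptions rather than a faithful copy of any particular trainer:

```python
import torch

def per_token_rewards(rm_score, logprobs, ref_logprobs, beta=0.1):
    """r_t = r_phi(q, o_<=t) - beta * log(pi(o_t|.) / pi_ref(o_t|.)).
    The scalar reward-model score is assumed to arrive only at the last token."""
    kl_penalty = beta * (logprobs - ref_logprobs)   # log pi - log pi_ref, per token
    rewards = -kl_penalty
    rewards[-1] = rewards[-1] + rm_score            # add the response-level score at the end
    return rewards

def gae_and_value_targets(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation plus the bootstrapped targets for the value model."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    value_targets = advantages + values   # GAE-smoothed version of r_t + gamma * v_{t+1}
    return advantages, value_targets

def ppo_policy_loss(logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized with gradient descent."""
    ratio = torch.exp(logprobs - old_logprobs)                 # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Toy data for one response of 6 tokens.
T = 6
logprobs = torch.randn(T, requires_grad=True)   # log pi_theta(o_t | q, o_<t)
old_logprobs = logprobs.detach().clone()        # snapshot taken when the response was sampled
ref_logprobs = torch.randn(T)                   # log pi_ref(o_t | q, o_<t)
values = torch.randn(T)                         # v_t from the value model
rm_score = torch.tensor(1.3)                    # reward-model score for the whole response

rewards = per_token_rewards(rm_score, logprobs.detach(), ref_logprobs)
advantages, value_targets = gae_and_value_targets(rewards, values)

policy_loss = ppo_policy_loss(logprobs, old_logprobs, advantages)
value_loss = torch.mean((values - value_targets) ** 2)   # MSE the value model would minimize
policy_loss.backward()   # gradients flow only into the policy's log-probabilities
```

In a real trainer, `value_loss` would also be minimized with respect to the value model's parameters, and the whole procedure would be batched over many prompts and sampled responses.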
Application
PPO-based RL is applied after unsupervised pre-training (which yields the base model) and supervised fine-tuning (which yields the instruct model) to further improve the performance of the LLM.
The value function employed in PPO is typically another model of a size comparable to the policy model, which brings a substantial memory and computational burden.
In the LLM context, usually only the last token is assigned a reward score by the reward model (because the Reward Model is trained to evaluate the entire response, not each individual token), which can complicate the training of a value function that is accurate at every token.
