Reinforcement learning (RL) has proven effective at further improving the abilities of LLMs after the Supervised Fine-Tuning (SFT) stage. Especially since the release of DeepSeek-R1, attention has focused on RL for LLM training and on the algorithm used to train DeepSeek-R1: GRPO. GRPO is in fact a variant of PPO, which was proposed by OpenAI in 2017 and later used in RLHF for the GPT series. Therefore, before diving into GRPO, it is worth understanding PPO first.

Components
There are four models involved in PPO: the Policy Model, the Value Model, the Reference Model, and the Reward Model.
- Policy Model: The LLM we want to train; usually it has already been pre-trained and supervised fine-tuned.
- Reference Model: A frozen model whose token probabilities serve as a reference for the Policy Model (via the KL penalty below); usually it is the initial checkpoint of the Policy Model (after SFT, before RL).
- Reward Model: A model that assigns a reward to the current output; usually it is trained on human preference data as part of the RLHF pipeline.
- Value Model: A trainable model that estimates the expected future reward given the current output.
Difference between Reward Model and Value Model:
The Reward Model evaluates the quality of the current output and gives an immediate reward for the current action; the Value Model estimates the reward still to come in the future. The Value Model is introduced to stabilize training by accounting for global, future reward, whereas the Reward Model only provides immediate feedback.
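To make the setup concrete, here is a minimal sketch of how the four models might be instantiated with Hugging Face `transformers`. The checkpoint paths are placeholders, and the value model is sketched as a simple scalar head on the policy's hidden states rather than a full copy of the policy:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

SFT_CKPT = "path/to/sft-checkpoint"   # placeholder: the SFT'd LLM
RM_CKPT = "path/to/reward-model"      # placeholder: the trained reward model

tokenizer = AutoTokenizer.from_pretrained(SFT_CKPT)

# Policy Model: the LLM being trained, initialized from the SFT checkpoint.
policy = AutoModelForCausalLM.from_pretrained(SFT_CKPT)

# Reference Model: a frozen copy of the same SFT checkpoint,
# used only to compute the KL penalty against the policy.
reference = AutoModelForCausalLM.from_pretrained(SFT_CKPT)
reference.requires_grad_(False)

# Reward Model: frozen; scores a whole (question, response) sequence with one scalar.
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_CKPT, num_labels=1)
reward_model.requires_grad_(False)

# Value Model (critic): trainable; here a scalar head on top of the policy's hidden size.
value_head = torch.nn.Linear(policy.config.hidden_size, 1)
```

In practice the value model can share the policy's backbone (as sketched here) or be a separate full-size network; either way, only the Policy Model and the Value Model receive gradient updates.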
Workflow
In each iteration:
- A question $q$ is sampled from the dataset.
- $q$ (together with the tokens generated so far, $o_{<t}$) is fed into the Policy Model, which produces the next token $o_t$.
- Given $q, o_{\le t}$, the Reward Model gives the reward $r_{\varphi}(q, o_{\le t})$; to mitigate over-optimization of the reward model, a per-token KL penalty against the Reference Model, $\beta\log \frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{ref}(o_t\mid q, o_{<t})}$, is subtracted. The final reward is
$$r_t = r_{\varphi}(q,o_{\le t}) - \beta\log \frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{ref}(o_t\mid q, o_{<t})}$$
Note that the penalty term is not always positive: it punishes the policy for raising a token's probability well above the reference, but actually rewards lowering it below the reference (which looks odd at first; the term is a per-token estimate of the KL divergence, so it is only guaranteed non-negative in expectation over tokens sampled from the policy). Overall it keeps the policy conservative and close to the reference model.
- Given $q, o_{\le t}$, the Value Model gives the estimated value $v_t$.
- Given $r_t$ and $v_t$, GAE combines and smooths them to produce the estimated advantage $A_t$; at the same time, the target $v_{target}=r_t+\gamma v_t$ is passed to the Value Model as the regression label for $v_{t-1}$ (yes, $t-1$, following the Bellman equation $v_{t-1} = r_t+\gamma v_t$). The Value Model's parameters are then updated with an MSE loss or something similar.
- Given $A_t$, the surrogate objective of PPO is
$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{o\sim\pi_{\theta_{old}}(O\mid q)}\left[ \frac{1}{|o|}\sum_{t=1}^{|o|}\min\left[ \frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{old}}(o_t\mid q,o_{<t})}A_t,\ \text{clip}\left( \frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{old}}(o_t\mid q,o_{<t})},\ 1-\epsilon,\ 1+\epsilon \right)A_t \right] \right]$$
where $A_t$ sets the direction and scale of the update (increase the token's probability if $A_t>0$, decrease it if $A_t<0$), while the $\min(\cdots)$ together with the $\text{clip}$ function limits how far a single update can move along $A_t$.
- Using the policy gradient, we update the Policy Model to maximize $\mathcal{J}_{PPO}(\theta)$ by adjusting the probabilities $\pi_\theta(o_t\mid q,o_{<t})$. (A runnable sketch of these per-token computations is given after this list.)
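The sketch below is a simplified, self-contained illustration of these steps on random tensors for a single response: the per-token reward with KL penalty, GAE, the value targets, and the clipped surrogate loss. The function names, the convention of adding the reward-model score only at the last token, and the hyperparameters ($\beta$, $\gamma$, $\lambda$, $\epsilon$) are illustrative assumptions rather than a faithful copy of any particular trainer:

```python
import torch

def per_token_rewards(rm_score, logprobs, ref_logprobs, beta=0.1):
    """r_t = r_phi(q, o_<=t) - beta * log(pi(o_t|.) / pi_ref(o_t|.)).
    The scalar reward-model score is assumed to arrive only at the last token."""
    kl_penalty = beta * (logprobs - ref_logprobs)   # log pi - log pi_ref, per token
    rewards = -kl_penalty
    rewards[-1] = rewards[-1] + rm_score            # add the response-level score at the end
    return rewards

def gae_and_value_targets(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation plus the bootstrapped targets for the value model."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    value_targets = advantages + values   # GAE-smoothed version of r_t + gamma * v_{t+1}
    return advantages, value_targets

def ppo_policy_loss(logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized with gradient descent."""
    ratio = torch.exp(logprobs - old_logprobs)                 # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Toy data for one response of 6 tokens.
T = 6
logprobs = torch.randn(T, requires_grad=True)   # log pi_theta(o_t | q, o_<t)
old_logprobs = logprobs.detach().clone()        # snapshot taken when the response was sampled
ref_logprobs = torch.randn(T)                   # log pi_ref(o_t | q, o_<t)
values = torch.randn(T)                         # v_t from the value model
rm_score = torch.tensor(1.3)                    # reward-model score for the whole response

rewards = per_token_rewards(rm_score, logprobs.detach(), ref_logprobs)
advantages, value_targets = gae_and_value_targets(rewards, values)

policy_loss = ppo_policy_loss(logprobs, old_logprobs, advantages)
value_loss = torch.mean((values - value_targets) ** 2)   # MSE the value model would minimize
policy_loss.backward()   # gradients flow only into the policy's log-probabilities
```

In a real trainer, `value_loss` would also be minimized with respect to the value model's parameters, and the whole procedure would be batched over many prompts and sampled responses.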
Application
PPO-based RL is applied after unsupervised pre-training (which yields the base model) and supervised fine-tuning (which yields the instruct model) to further improve the performance of the LLM.
The value function employed in PPO is typically another model of a size comparable to the policy model, which brings a substantial memory and computational burden.
In the LLM context, usually only the last token is assigned a reward score by the reward model (because the Reward Model is trained to evaluate the entire response, not each individual token), which can complicate the training of a value function that is accurate at every token.
