Search-R1: A Brief Paper Analysis and Code Implementation

Reposted · published 2025-10-21 12:30:02


License: CC 4.0 BY-SA

Original link: https://yours.tools/zh/formatsql.html


GitHub: https://github.com/PeterGriffinJin/Search-R1

Paper: link1, link2

Motivation

Empower reasoning LLMs with a search engine.

Method

Building on PPO, trajectories are generated with access to a given search engine \(R\).
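The interleaved generation described above can be sketched as the loop below. This is only a minimal sketch under assumptions: `generate_until` and `search` are hypothetical stand-ins for the policy LLM and the search engine \(R\), the `<search>`, `<information>` and `<answer>` tags follow the Search-R1 prompt template, and `max_turns` is an illustrative parameter. The toy stubs at the bottom exist only to make the example runnable.

```python
import re
from typing import Callable, List, Tuple

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def rollout(
    question: str,
    generate_until: Callable[[str], str],  # hypothetical policy: continues the context up to </search> or </answer>
    search: Callable[[str], str],          # hypothetical search engine R: query -> retrieved text
    max_turns: int = 4,                    # assumed budget of generation turns
) -> Tuple[str, List[Tuple[str, bool]]]:
    """Return the full trajectory and a list of (segment, is_retrieved) pairs."""
    context = question
    segments: List[Tuple[str, bool]] = []
    for _ in range(max_turns):
        generated = generate_until(context)
        segments.append((generated, False))   # text produced by the policy
        context += generated
        if "<answer>" in generated:
            break                             # final answer reached
        match = SEARCH_RE.search(generated)
        if match is None:
            break                             # neither a search call nor an answer: stop
        info = f"<information>{search(match.group(1).strip())}</information>"
        segments.append((info, True))         # text inserted by R, masked later in the loss
        context += info
    return context, segments


# Toy stubs, for illustration only.
def toy_policy(context: str) -> str:
    if "<information>" not in context:
        return "<think>I need to look this up.</think><search>capital of France</search>"
    return "<think>The passage answers it.</think><answer>Paris</answer>"


def toy_search(query: str) -> str:
    return "Paris is the capital of France."


trajectory, segments = rollout("What is the capital of France?", toy_policy, toy_search)
```

The `is_retrieved` flag on each segment is what a token-level loss mask would later be derived from.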

\[J_{PPO}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\, o\sim\pi_{old}(\cdot|q;R)}\left[\frac{1}{\sum_{t=1}^{|o|}I(o_t)} \sum_{t=1:\,I(o_t)=1}^{|o|} \min\left(\frac{\pi_{\theta}(o_t|q, o_{<t};R)}{\pi_{old}(o_t|q,o_{<t};R)} A_t,\ \mathrm{clip}\left(\frac{\pi_{\theta}(o_t|q,o_{<t};R)}{\pi_{old}(o_t|q, o_{<t};R)}, 1-\epsilon, 1+\epsilon\right)A_t\right)\right] \]

Here, the tokens returned by \(R\) must be masked out of the loss:

\[I(o_t) = \begin{cases} 0, & o_t\mathrm{\ is\ a\ retrieved\ token};\\ 1, & \mathrm{otherwise}. \end{cases} \]
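To make the masked objective concrete, here is a minimal pure-Python sketch of the clipped PPO surrogate averaged over non-retrieved tokens only. It is not the repository's implementation: the function name, the list-based inputs and the default `eps=0.2` are assumptions chosen for illustration, and a real trainer would operate on batched tensors.

```python
import math
from typing import List


def masked_ppo_objective(
    logp_new: List[float],    # log pi_theta(o_t | ...) per token
    logp_old: List[float],    # log pi_old(o_t | ...) per token
    advantages: List[float],  # A_t per token
    mask: List[int],          # I(o_t): 1 for generated tokens, 0 for retrieved tokens
    eps: float = 0.2,         # assumed clipping range
) -> float:
    """Clipped PPO surrogate, averaged over tokens with I(o_t) = 1."""
    total = 0.0
    for lp_new, lp_old, adv, keep in zip(logp_new, logp_old, advantages, mask):
        if not keep:
            continue                                   # retrieved token: contributes nothing
        ratio = math.exp(lp_new - lp_old)              # pi_theta / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    n_kept = sum(mask)
    return total / n_kept if n_kept else 0.0


# Identical policies give ratio 1, so the result is the mean advantage of kept tokens.
same_policy = masked_ppo_objective([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 5.0, 3.0], [1, 0, 1])
# A ratio of e (about 2.72) with a positive advantage is clipped to 1 + eps.
clipped_case = masked_ppo_objective([1.0], [0.0], [1.0], [1])
```

Note that the middle token (advantage 5.0) is ignored entirely in the first call because its mask entry is 0.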

Experiments

PPO is used by default. Overall, the results show that Search-R1's RL training is effective. The training data comes from NQ and HotpotQA.

PPO vs GRPO

The paper finds that PPO is more stable than GRPO and reaches better final performance, while GRPO converges faster.

Instruct model vs base model

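The paper finds that although the instruct model starts with a higher reward than the base model, the two rewards become comparable in the later training steps, and the base model's final performance is better than the instruct model's.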

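(My take: the base model ending up ahead of the instruct model here may be because instruction tuning, through its RL alignment stage, reduces the model's output diversity, which weakens its ability to explore in the search task. However, works such as WebDancer all use instruct models. I think this is because those works do not start with search RL directly. They first run SFT on RFT data, so that the instruct model adapts to the RL format and absorbs domain knowledge for the search task (planning, tool calling, summarization, and so on). If the same RFT post-training were applied to a base model, where the data volume may be small, the base model would tend not to follow instructions. This is why later SFT+RL WebAgent work generally builds on an instruct model.)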

Response length and valid search study
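- Early stage: response length drops noticeably while the reward rises slightly (the model understands the search task better and its outputs become more concise).
- Later stage: response length rebounds and the reward keeps rising (this turns out to be driven by an increase in the number of search calls).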
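The two quantities tracked above can be measured with a small helper like the one below. This is a rough sketch under assumptions: it treats every well-formed `<search>...</search>` span as one search call, excludes `<information>` blocks from the length, and uses whitespace-separated words as a stand-in for tokens. The function name and the returned keys are hypothetical.

```python
import re
from typing import Dict

SEARCH_CALL_RE = re.compile(r"<search>.*?</search>", re.DOTALL)
INFO_RE = re.compile(r"<information>.*?</information>", re.DOTALL)


def response_stats(trajectory: str) -> Dict[str, int]:
    """Count search calls and the length of the policy-generated text."""
    search_calls = len(SEARCH_CALL_RE.findall(trajectory))
    generated_only = INFO_RE.sub(" ", trajectory)     # drop text inserted by the search engine
    return {
        "search_calls": search_calls,
        "generated_words": len(generated_only.split()),  # word count as a proxy for token count
    }


stats = response_stats(
    "<search>a b</search><information>x y z</information><answer>c</answer>"
)
```

Logging these two numbers per training step is enough to reproduce the kind of curve discussed above, where a late rise in length coincides with more search calls.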

Ablation of retrieved token mask

The mask is necessary: the model's training objective is not to predict the retrieved tokens, but to learn tool calling, planning, and summarization.

Number of Retrieved Passages Study in SEARCH-R1 Training

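Retrieving more docs is not always better (the actor model becomes more prone to hallucination or to missing details when summarizing), and retrieving fewer is not always better either (even the best cook cannot make a meal without rice: the model has too little evidence to work with).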
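The number of retrieved passages is simply the `top_k` knob of the retriever, as in the toy sketch below. Everything here is an assumption made for illustration: the word-overlap scoring stands in for a real retriever, and the `Doc i:` layout inside `<information>` is a hypothetical format, not necessarily the one used in the repository.

```python
import re
from typing import List


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve_top_k(query: str, corpus: List[str], top_k: int = 3) -> List[str]:
    """Rank passages by word overlap with the query and keep the best top_k."""
    query_words = _words(query)
    # sorted() is stable, so passages with equal scores keep their corpus order.
    ranked = sorted(corpus, key=lambda p: len(query_words & _words(p)), reverse=True)
    return ranked[:top_k]


def format_information(passages: List[str]) -> str:
    """Wrap the retrieved passages in the block that is appended to the context."""
    body = "\n".join(f"Doc {i}: {p}" for i, p in enumerate(passages, start=1))
    return f"<information>{body}</information>"


corpus = [
    "Bananas are yellow.",
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
]
top2 = retrieve_top_k("capital of France", corpus, top_k=2)
block = format_information(top2)
```

A larger `top_k` lengthens the `<information>` block the actor has to read and summarize, which is where the trade-off described above comes from.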

Group size of GRPO

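A larger GRPO group size gives better results and faster convergence, but training is less stable (I suspect an issue in the paper's experimental setup; I have never run into this kind of sharp reward decrease).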
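For reference, the group size enters GRPO through the advantage estimate: each question is answered by a group of sampled responses, and every response's reward is normalized against the group's mean and standard deviation. The sketch below illustrates that step only. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions; implementations differ on these details.

```python
import statistics
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages for the responses sampled for one question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)          # population std, an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


mixed_group = grpo_advantages([1.0, 0.0, 0.0, 1.0])   # some responses correct, some not
uniform_group = grpo_advantages([0.0, 0.0])           # every response got the same reward
```

When all responses in a group receive the same reward, every advantage is zero and the question contributes no gradient. A larger group is more likely to contain both successes and failures, which is one plausible reason for the faster convergence noted above.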