DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning.
Keywords:
Test-time scaling,
Clip-Higher, which raises the upper clipping bound independently of the lower one, so that low-probability tokens have more room to grow; this promotes the diversity of the system and avoids entropy collapse;
Dynamic Sampling, which improves training efficiency and stability;
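A minimal sketch of Clip-Higher and Dynamic Sampling, to make the two keywords concrete. It is an illustration under stated assumptions, not the authors' implementation: the function names `clip_higher_surrogate` and `keep_prompt`, the default clip values `eps_low=0.2` and `eps_high=0.28`, and the use of scalar per-token log-probabilities are all hypothetical choices made for this example.

```python
import math


def clip_higher_surrogate(logp_new, logp_old, advantage,
                          eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with decoupled clip ranges.

    Assumption: eps_low and eps_high are illustrative defaults. Setting
    eps_high == eps_low recovers the usual symmetric PPO-style clip.
    """
    # Importance ratio between the new and the old policy for this token.
    ratio = math.exp(logp_new - logp_old)
    # Clip into [1 - eps_low, 1 + eps_high]; the upper bound is looser.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped * advantage)


def keep_prompt(rewards):
    """Dynamic-sampling filter (hypothetical helper).

    Keep a prompt only if its group of sampled responses has mixed
    rewards. If every response gets the same reward, the group-relative
    advantage is zero and the prompt contributes no gradient signal.
    """
    return len(set(rewards)) > 1


if __name__ == "__main__":
    # A token whose probability rose by a factor of 1.25, positive advantage.
    lp_old = math.log(0.04)
    lp_new = math.log(0.05)
    print(clip_higher_surrogate(lp_new, lp_old, 1.0))                # looser upper bound
    print(clip_higher_surrogate(lp_new, lp_old, 1.0, eps_high=0.2))  # symmetric clip
    print(keep_prompt([1, 1, 1, 1]), keep_prompt([1, 0, 1, 0]))
```

With the symmetric clip, the ratio of 1.25 is cut off at 1.2, whereas the looser upper bound leaves it unclipped, which is the extra room for exploration that Clip-Higher is meant to provide. The filter drops prompts whose samples are all correct or all wrong, so each batch is filled only with prompts that carry a learning signal.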