DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning | Nature paper translation

Abstract

General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs)(1,2) and chain-of-thought (CoT) prompting(3), have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent on extensive human-annotated demonstrations and the capabilities of models are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM fields, surpassing its counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models.


Main

Reasoning capability, the cornerstone of human intelligence, enables complex cognitive tasks ranging from mathematical problem-solving to logical deduction and programming. Recent advances in AI have demonstrated that LLMs can exhibit emergent behaviours, including reasoning abilities, when scaled to a sufficient size(4,5). However, achieving such capabilities in pre-training typically demands substantial computational resources. In parallel, a complementary line of research has demonstrated that LLMs can be effectively augmented through CoT prompting. This technique, which involves either providing carefully designed few-shot examples or using minimalistic prompts such as “Let’s think step by step”(3,6), enables models to produce intermediate reasoning steps, thereby substantially enhancing their performance on complex tasks. Similarly, further performance gains have been observed when models learn high-quality, multistep reasoning trajectories during the post-training phase(2,7). Despite their effectiveness, these approaches exhibit notable limitations. Their dependence on human-annotated reasoning traces limits scalability and introduces cognitive biases. Furthermore, by constraining models to replicate human thought processes, their performance is inherently capped by the human-provided exemplars, which prevents the exploration of superior, non-human-like reasoning pathways.
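As a concrete illustration of zero-shot CoT prompting, the minimal sketch below simply appends the trigger phrase from ref. 3 to a question before it is sent to a model; the helper name and the example question are illustrative additions, not part of the original work.

```python
# Minimal sketch of zero-shot chain-of-thought (CoT) prompting.
def build_cot_prompt(question: str) -> str:
    """Append the minimalistic reasoning trigger to a question."""
    return f"Q: {question}\nA: Let's think step by step."

print(build_cot_prompt(
    "If a train travels 120 km in 1.5 hours, what is its average speed?"
))
# A model completing this prompt tends to emit intermediate reasoning
# steps before stating the final answer.
```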


To tackle these issues, we aim to explore the potential of LLMs for developing reasoning abilities through self-evolution in an RL framework, with minimal reliance on human labelling efforts. Specifically, we build on DeepSeek-V3 Base(8) and use Group Relative Policy Optimization (GRPO)(9) as our RL framework. The reward signal is based only on the correctness of final predictions against ground-truth answers, without imposing constraints on the reasoning process itself. Notably, we bypass the conventional supervised fine-tuning (SFT) phase before RL training. This design choice originates from our hypothesis that human-defined reasoning patterns may limit model exploration, whereas unrestricted RL training can better incentivize the emergence of new reasoning capabilities in LLMs. Through this process, detailed in the next section, our model (referred to as DeepSeek-R1-Zero) naturally developed diverse and sophisticated reasoning behaviours. To solve reasoning problems, the model exhibits a tendency to generate longer responses, incorporating verification, reflection and the exploration of alternative approaches within each response. Although we do not explicitly teach the model how to reason, it successfully learns improved reasoning strategies through RL.
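For readers unfamiliar with GRPO(9), the sketch below illustrates its core idea, group-relative advantages: rewards for a group of responses sampled for the same prompt are normalized by the group mean and standard deviation, so no learned value function is required. The reward values stand in for the outcome-only signal described above (1 for a correct final answer, 0 otherwise); this is a simplified sketch, not the authors' training code.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize each reward by the mean and standard
    deviation of its own group of sampled responses (no learned critic)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 responses sampled for one prompt, rewarded only on whether the
# final answer matches the ground truth (1.0 correct, 0.0 incorrect).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
# Correct responses receive positive advantages and incorrect ones negative,
# so the policy update reinforces whichever reasoning traces led to correct answers.
```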

 


Although DeepSeek-R1-Zero demonstrates excellent reasoning capabilities, it faces challenges such as poor readability and language mixing, occasionally combining English and Chinese in a single CoT response. Furthermore, the rule-based RL training stage of DeepSeek-R1-Zero is narrowly focused on reasoning tasks, resulting in limited performance in broader areas such as writing and open-domain question answering. To address these challenges, we introduce DeepSeek-R1, a model trained through a multistage learning framework that integrates rejection sampling, RL and supervised fine-tuning, detailed in the ‘DeepSeek-R1’ section. This training pipeline enables DeepSeek-R1 to inherit the reasoning capabilities of its predecessor, DeepSeek-R1-Zero, while aligning model behaviour with human preferences through further non-reasoning data.

 


To enable broader access to powerful AI at a lower energy cost, we have distilled several smaller models and made them publicly available. These distilled models exhibit strong reasoning capabilities, surpassing the performance of their original instruction-tuned counterparts. We believe that these instruction-tuned versions will also greatly contribute to the research community by providing a valuable resource for understanding the mechanisms underlying long CoT reasoning models and for promoting the development of more powerful reasoning models. We release DeepSeek-R1-Zero, DeepSeek-R1, data samples and distilled models to the public as described in the ‘Code availability’ section.

 


DeepSeek-R1-Zero

To implement large-scale RL of DeepSeek-R1-Zero, we use a highly efficient RL pipeline. Specifically, we use GRPO(9) as our RL algorithm, described in Methods section ‘GRPO’. We also use a rule-based reward system to compute accuracy and format rewards, with detailed methodologies outlined in Methods section ‘Reward design’. Finally, our high-performance RL infrastructure, which ensures scalable and efficient training, is described in Supplementary Information, section 2.1.
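The exact reward rules are specified in the paper's Methods section; as a rough sketch of what a rule-based accuracy and format reward can look like (assuming the <think>/<answer> template introduced below, with a hypothetical exact-match answer check), consider the following:

```python
import re

# Expected structure: <think> ... </think> followed by <answer> ... </answer>.
TEMPLATE_RE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the thinking/answer template, else 0.0."""
    return 1.0 if TEMPLATE_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the ground truth exactly; a real
    checker would use maths-aware comparison or unit-test execution for code."""
    match = TEMPLATE_RE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

completion = "<think>3 + 4 = 7</think> <answer>7</answer>"
print(format_reward(completion), accuracy_reward(completion, "7"))  # 1.0 1.0
```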

 


Specifically, we apply the RL technique to the DeepSeek-V3 Base model to train DeepSeek-R1-Zero. During training, we design a straightforward template that requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. The prompt template is shown below.

 


“A conversation between User and Assistant. The User asks a question and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, that is, <think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:”, in which the prompt is replaced with the specific reasoning question during training. We intentionally limit our constraints to this structural format, avoiding any content-specific biases to ensure that we can accurately observe the natural progression of the model during the RL process.
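A minimal sketch of how such a template might be instantiated during training; the wording follows the template quoted above, and the example question is purely illustrative.

```python
SYSTEM_TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question and "
    "the Assistant solves it. The Assistant first thinks about the reasoning "
    "process in the mind and then provides the User with the answer. The "
    "reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., <think> reasoning process "
    "here </think> <answer> answer here </answer>. User: {prompt}. Assistant:"
)

def build_training_prompt(question: str) -> str:
    """Substitute a concrete reasoning question into the structural template."""
    return SYSTEM_TEMPLATE.format(prompt=question)

print(build_training_prompt("What is the sum of the first 100 positive integers?"))
```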

 


Figure 1a shows the performance trajectory of DeepSeek-R1-Zero on the American Invitational Mathematics Examination (AIME) 2024 benchmark throughout the RL training process, in which the average pass@1 score shows a marked increase, jumping from an initial value of 15.6% to 77.9%. Also, by using self-consistency decoding(10), the performance of the model can be further improved, achieving an accuracy of 86.7%. This performance greatly surpasses the average performance across all human competitors of the AIME. Besides the maths competitions, as shown in Supplementary Fig. 8, DeepSeek-R1-Zero also achieves remarkable performance in coding competitions and graduate-level biology, physics and chemistry problems. These results underscore the effectiveness of RL in enhancing the reasoning capabilities of LLMs.
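pass@1 and cons@16 are formally defined in the paper's Supplementary Information; the sketch below illustrates the usual interpretation, assuming pass@1 is estimated as mean single-sample accuracy and cons@16 as majority voting over 16 sampled answers. The numbers are invented for illustration.

```python
from collections import Counter

def pass_at_1(correct_flags: list[bool]) -> float:
    """pass@1 estimated as the mean correctness over independent samples."""
    return sum(correct_flags) / len(correct_flags)

def consistency_answer(sampled_answers: list[str]) -> str:
    """Self-consistency decoding: return the most frequent final answer
    (cons@16 corresponds to drawing 16 samples per problem)."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Invented example: 16 sampled answers to one AIME-style problem.
answers = ["204"] * 9 + ["168"] * 4 + ["72"] * 3
print(consistency_answer(answers))               # majority vote -> "204"
print(pass_at_1([a == "204" for a in answers]))  # fraction of correct samples
```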

 


Fig. 1: Accuracy and output length of DeepSeek-R1-Zero throughout the training process.


a, AIME accuracy of DeepSeek-R1-Zero during training. AIME takes a mathematical problem as input and a number as output, as illustrated in Extended Data Table 1. pass@1 and cons@16 are described in Supplementary Information, section 4.1. The baseline is the average score achieved by human participants in the AIME competition. b, The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time. Note that a training step refers to a single policy update operation.

 



In addition to the progressive enhancement of reasoning capabilities during training, DeepSeek-R1-Zero also demonstrates self-evolutionary behaviour with RL training. As shown in Fig. 1b, DeepSeek-R1-Zero exhibits a steady increase in thinking time throughout training, driven only by intrinsic adaptation rather than external modifications. Making use of long CoT, the model progressively refines its reasoning, generating hundreds to thousands of tokens to explore and improve its problem-solving strategies.


The increase in thinking time helps with the
