Deepseek-R1 论文翻译

Deepseek-R1 论文全名:DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
论文地址:https://arxiv.org/abs/2501.12948
发表日期: 2025年1月22日

Abstract - 摘要

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

我们介绍了我们的第一代推理模型,DeepSeek-R1-Zero 和 DeepSeek-R1。DeepSeek-R1-Zero 是一个通过大规模强化学习(RL)训练的模型,未经过监督微调(SFT)作为初步步骤,展现出卓越的推理能力。 通过 RL,DeepSeek-R1-Zero 自然展现出众多强大而有趣的推理行为。 然而,它面临着可读性差和语言混合等挑战。 为了解决这些问题并进一步提升推理性能,我们引入了 DeepSeek-R1,该模型在 RL 之前结合了多阶段训练和冷启动数据。DeepSeek-R1 在推理任务上的表现与 OpenAI-o1-1217 相当。 为了支持研究社区,我们开源了 DeepSeek-R1-Zero、DeepSeek-R1 以及六个基于 Qwen 和 Llama 从 DeepSeek-R1 中提炼出的密集模型(1.5B、7B、8B、14B、32B、70B)。
(图 1:DeepSeek-R1 的基准测试性能,原文插图略)

1、Introduction - 引言

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).

近年来,大型语言模型(LLMs)经历了快速的迭代和演变(Anthropic, 2024; Google, 2024; OpenAI, 2024a),逐渐缩小了与人工通用智能(AGI)之间的差距。

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources against pre-training. In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI’s o1 series models.

最近,后训练已成为完整训练流程中的一个重要组成部分。研究表明,它能够提高推理任务的准确性,与社会价值观保持一致,并适应用户偏好,同时在计算资源上相较于预训练要求相对较少。 在推理能力的背景下,OpenAI 的 o1(OpenAI, 2024b)系列模型首次通过增加思维链推理过程的长度引入了推理时的扩展。 这种方法在数学、编码和科学推理等各种推理任务中取得了显著的改进。 然而,有效的测试时间扩展的挑战仍然是研究界的一个未解之谜。 一些先前的研究探索了各种方法,包括基于过程的奖励模型(Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023)、强化学习(Kumar et al., 2024)以及搜索算法,如蒙特卡洛树搜索和束搜索(Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024)。 然而,这些方法都未能达到与OpenAI的o1系列模型相当的通用推理性能。

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

在本文中,我们迈出了利用纯强化学习(RL)提高语言模型推理能力的第一步。 我们的目标是探索大型语言模型在没有任何监督数据的情况下发展推理能力的潜力,专注于它们通过纯RL过程的自我演化。 具体而言,我们使用 DeepSeek-V3-Base 作为基础模型,并采用 GRPO (Shao et al., 2024) 作为强化学习框架,以提高模型在推理方面的性能。在训练过程中,DeepSeek-R1-Zero 自然出现了许多强大而有趣的推理行为。 经过数千步的强化学习,DeepSeek-R1-Zero 在推理基准测试中表现出色。 例如,AIME 2024 上的 pass@1 分数从 15.6% 提高到 71.0%,并且通过多数投票,分数进一步提高到 86.7%,与 OpenAI-o1-0912 的表现相匹配。

However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1- Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

然而,DeepSeek-R1-Zero 遇到了一些挑战,例如可读性差和语言混合。 为了解决这些问题并进一步增强推理性能,我们引入了 DeepSeek-R1,该模型结合了一小部分冷启动数据和多阶段训练流程。 具体而言,我们首先收集数千条冷启动数据,以微调 DeepSeek-V3-Base 模型。 接下来,我们执行类似于DeepSeek-R1-Zero的面向推理的强化学习。在强化学习过程接近收敛时,我们通过对强化学习检查点进行拒绝采样,结合来自DeepSeek-V3的监督数据,生成新的SFT数据,涉及写作、事实问答和自我认知等领域,然后对DeepSeek-V3-Base模型进行再训练。 在使用新数据进行微调后,检查点将经历额外的强化学习过程,考虑到所有场景的提示。 经过这些步骤,我们获得了一个称为DeepSeek-R1的检查点,其性能与OpenAI-o1-1217相当。

We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models.

我们进一步探索从DeepSeek-R1到更小的稠密模型的蒸馏。 以Qwen2.5-32B(Qwen,2024b)作为基础模型,直接从DeepSeek-R1进行蒸馏的效果优于对其应用强化学习。这表明,大型基础模型发现的推理模式对于提高推理能力至关重要。 我们开源了提炼后的Qwen和Llama(Dubey等,2024)系列。 值得注意的是,我们的提炼14B模型在性能上大幅超越了最先进的开源QwQ-32B-Preview(Qwen,2024a),而提炼后的32B和70B模型在密集模型的推理基准测试中创下了新纪录。

1.1 Contributions - 贡献

Post-Training: Large-Scale Reinforcement Learning on the Base Model
• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
Distillation: Smaller Models Can Be Powerful Too
• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

后训练:在基础模型上进行大规模强化学习
• 我们直接将强化学习应用于基础模型,而不依赖于监督微调(SFT)作为初步步骤。这种方法使模型能够探索解决复杂问题的思维链(CoT),从而开发出 DeepSeek-R1-Zero。DeepSeek-R1-Zero 展示了自我验证、反思和生成长思维链等能力,标志着研究社区的一个重要里程碑。值得注意的是,这是首个公开的研究,验证了大型语言模型的推理能力可以纯粹通过强化学习来激励,而无需监督微调。这一突破为该领域未来的进展铺平了道路。

• 我们介绍了开发DeepSeek-R1的流程。该流程包含两个强化学习阶段,旨在发现改进的推理模式并与人类偏好对齐,以及两个监督微调阶段,作为模型推理和非推理能力的基础。 我们相信该流程将通过创造更好的模型来惠及行业。

蒸馏:小型模型也可以强大
• 我们证明了较大模型的推理模式可以被蒸馏到较小模型中,从而在性能上优于通过强化学习在小型模型上发现的推理模式。 开源的DeepSeek-R1及其API将使研究社区在未来蒸馏出更好的小型模型受益。

• 利用DeepSeek-R1生成的推理数据,我们对研究界广泛使用的几种密集模型进行了微调。评估结果表明,经过蒸馏的小型密集模型在基准测试中表现出色。 DeepSeek-R1-Distill-Qwen-7B在AIME 2024上取得了55.5%的成绩,超过了QwQ-32B-Preview。此外,DeepSeek-R1-Distill-Qwen-32B在AIME 2024上得分72.6%,在MATH-500上得分94.3%,在LiveCodeBench上得分57.2%。这些结果显著优于之前的开源模型,并且与o1-mini相当。 我们向社区开源了基于Qwen2.5和Llama3系列的1.5B、7B、8B、14B、32B和70B检查点。

1.2 Summary of Evaluation Results - 评估结果摘要

Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks.
Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses 4o on this benchmark.
Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.

推理任务: (1) DeepSeek-R1 在 AIME 2024 上获得 79.8% Pass@1 的分数,略微超过 OpenAI-o1-1217。在 MATH-500 上,它取得了令人印象深刻的 97.3% 的分数,表现与 OpenAI-o1-1217 相当,并显著超越其他模型。 (2) 在与编码相关的任务中,DeepSeek-R1 在代码竞赛任务中表现出专家水平,获得了 2,029 Elo 评级,超越了 96.3% 的人类参与者。在与工程相关的任务中,DeepSeek-R1 的表现略优于 DeepSeek-V3,这可能对开发者的实际工作有所帮助。
知识: 在 MMLU、MMLU-Pro 和 GPQA Diamond 等基准测试中,DeepSeek-R1 取得了卓越的成绩,显著超越 DeepSeek-V3,MMLU 得分为 90.8%,MMLU-Pro 得分为 84.0%,GPQA Diamond 得分为 71.5%。虽然在这些基准测试中的表现略低于 OpenAI-o1-1217,但 DeepSeek-R1 超越了其他闭源模型,展示了其在教育任务中的竞争优势。 在事实基准测试SimpleQA上,DeepSeek-R1的表现优于DeepSeek-V3,展示了其处理基于事实查询的能力。 在这个基准测试中,OpenAI-o1的表现也超过了4o,呈现出类似的趋势。
其他:DeepSeek-R1在广泛的任务中也表现出色,包括创意写作、一般问答、编辑、摘要等。 它在AlpacaEval 2.0上取得了87.6%的长度控制胜率,在ArenaHard上获得了92.3%的胜率,展示了其智能处理非考试导向查询的强大能力。此外,DeepSeek-R1在需要长上下文理解的任务上表现出色,在长上下文基准测试中大幅超越了DeepSeek-V3。

2. Approach - 方法

2.1 Overview - 概述

Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples. 3) Distill the reasoning capability from DeepSeek-R1 to small dense models.

以往的工作在很大程度上依赖大量的监督数据来提升模型性能。 在本研究中,我们证明了推理能力可以通过大规模强化学习(RL)显著提高,即使在没有使用监督微调(SFT)作为冷启动的情况下。 此外,通过加入少量冷启动数据,性能可以进一步增强。 在接下来的部分中,我们介绍:(1) DeepSeek-R1-Zero,它直接对基础模型应用RL,而不使用任何SFT数据;(2) DeepSeek-R1,它从一个经过数千个长链思维(CoT)示例微调的检查点开始应用RL;(3) 从DeepSeek-R1中提炼推理能力到小型密集模型。

2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model - DeepSeek-R1-Zero:在基础模型上进行强化学习

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights.

强化学习在推理任务中表现出了显著的有效性,正如我们之前的研究所证明的(Shao 等,2024 ;Wang 等,2023)。 然而,这些工作严重依赖于监督数据,而收集这些数据是耗时的。 在本节中,我们探讨大型语言模型(LLMs)在没有任何监督数据的情况下发展推理能力的潜力,重点关注它们通过纯强化学习过程的自我演化。 我们首先简要概述我们的强化学习算法,然后展示一些令人兴奋的结果,希望这能为社区提供有价值的见解。

2.2.1. Reinforcement Learning Algorithm - 2.2.1. 强化学习算法

Group Relative Policy Optimization In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question 𝑞, GRPO samples a group of outputs {𝑜1, 𝑜2, · · · , 𝑜𝐺} from the old policy 𝜋𝜃𝑜𝑙𝑑 and then optimizes the policy model 𝜋𝜃 by maximizing the following objective:

群体相对策略优化(GRPO) 为了节省强化学习的训练成本,我们采用群体相对策略优化(GRPO)(Shao 等,2024),该方法省略了通常与策略模型大小相同的评论家(critic)模型,转而从组得分中估计基线。具体而言,对于每个问题 𝑞,GRPO 从旧策略 𝜋𝜃_old 中采样一组输出 {𝑜1, 𝑜2, · · · , 𝑜𝐺},然后通过最大化以下目标来优化策略模型 𝜋𝜃:

$$\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\Big[q\sim P(Q),\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)\Big]\ \frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}A_i,\ \mathrm{clip}\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\varepsilon,\ 1+\varepsilon\right)A_i\right)-\beta\,\mathbb{D}_{KL}\big(\pi_{\theta}\,\|\,\pi_{ref}\big)\right)$$

$$\mathbb{D}_{KL}\big(\pi_{\theta}\,\|\,\pi_{ref}\big)=\frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)}-\log\frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)}-1$$

where 𝜀 and 𝛽 are hyper-parameters, and 𝐴𝑖 is the advantage, computed using a group of rewards {𝑟1,𝑟2, . . . ,𝑟𝐺} corresponding to the outputs within each group:

其中 𝜀 和 𝛽 是超参数,𝐴𝑖 是优势,由每组输出所对应的一组奖励 {𝑟1, 𝑟2, . . . , 𝑟𝐺} 计算得出:

$$A_i=\frac{r_i-\mathrm{mean}(\{r_1,r_2,\cdots,r_G\})}{\mathrm{std}(\{r_1,r_2,\cdots,r_G\})}$$
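下面给出一个极简的 Python 示意,帮助理解上式中"组内相对优势"与带裁剪目标的计算方式;其中的函数名、张量形状和超参数数值均为演示用的假设,并非论文的官方实现。

```python
import numpy as np

def group_relative_advantages(rewards):
    """按上面的优势公式:用组内奖励的均值和标准差归一化,得到每个输出的优势 A_i。"""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # 加小常数避免除零

def grpo_surrogate(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """单个问题 q 上 GRPO 目标的蒙特卡洛估计(示意)。

    logp_new / logp_old / logp_ref:当前策略、旧策略、参考策略对每个输出 o_i 的对数概率,
    这里简化为整条输出的对数概率,形状为 (G,);eps、beta 仅为演示取值。
    """
    logp_new, logp_old, logp_ref = map(np.asarray, (logp_new, logp_old, logp_ref))
    A = group_relative_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)                   # π_θ / π_θ_old
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    policy_term = np.minimum(ratio * A, clipped * A)      # min(rA, clip(r)A)
    # KL 罚项采用上式中的无偏估计形式:π_ref/π_θ - log(π_ref/π_θ) - 1
    log_ratio_ref = logp_ref - logp_new
    kl = np.exp(log_ratio_ref) - log_ratio_ref - 1.0
    return float(np.mean(policy_term - beta * kl))        # RL 训练时最大化该目标

# 示例:一组 G=4 个输出,2 个答对(奖励 1),2 个答错(奖励 0)
obj = grpo_surrogate(logp_new=[-3.1, -2.8, -4.0, -3.5],
                     logp_old=[-3.0, -3.0, -3.9, -3.6],
                     logp_ref=[-3.2, -2.9, -4.1, -3.4],
                     rewards=[1, 0, 1, 0])
print(obj)
```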

2.2.2. Reward Modeling - 2.2.2. 奖励建模

The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.
• Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

奖励是训练信号的来源,它决定了强化学习的优化方向。为了训练DeepSeek-R1-Zero,我们采用了一种基于规则的奖励系统,主要由两种类型的奖励组成:
• 准确性奖励: 准确性奖励模型评估响应是否正确。例如,在具有确定性结果的数学问题中,模型需要以指定格式(例如,在一个框内)提供最终答案,从而实现可靠的基于规则的正确性验证。同样,对于LeetCode问题,可以使用编译器根据预定义的测试用例生成反馈。
• 格式奖励: 除了准确性奖励模型外,我们还采用了格式奖励模型,强制模型将其思考过程放在‘<think>’和‘</think>’标签之间。

我们在开发DeepSeek-R1-Zero时不应用结果或过程神经奖励模型,因为我们发现神经奖励模型可能在大规模强化学习过程中遭受奖励黑客攻击,并且重新训练奖励模型需要额外的训练资源,这使整个训练流程变得复杂。
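下面用一个简化的 Python 片段示意这类基于规则的奖励(准确性奖励 + 格式奖励)可以如何实现;其中的正则表达式、奖励数值和组合方式都是假设,论文没有给出具体实现。

```python
import re

def format_reward(response: str) -> float:
    """格式奖励:检查思考过程是否被包裹在 <think>...</think> 标签中(示意)。"""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """准确性奖励:要求模型把最终答案放进 \\boxed{...} 中,再与标准答案做规则化比对(示意)。"""
    m = re.search(r"\\boxed\{([^{}]*)\}", response)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

def rule_based_reward(response: str, ground_truth: str) -> float:
    # 论文未说明两种奖励如何加权,这里简单相加仅作演示
    return accuracy_reward(response, ground_truth) + format_reward(response)

resp = "<think>7*6=42</think> 最终答案是 \\boxed{42}"
print(rule_based_reward(resp, "42"))  # 2.0
```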

2.2.3. Training Template - 2.2.3. 训练模板

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases—such as mandating reflective reasoning or promoting particular problem-solving strategies—to ensure that we can accurately observe the model’s natural progression during the RL process.

为了训练DeepSeek-R1-Zero,我们首先设计一个简单的模板,以指导基础模型遵循我们指定的指令。 如表1所示,该模板要求DeepSeek-R1-Zero首先产生推理过程,然后给出最终答案。

我们有意将约束限制在这种结构格式上,避免任何内容特定的偏见——例如强制反思性推理或提倡特定的问题解决策略——以确保我们能够准确观察模型在RL过程中的自然进展。


A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:


Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning question during training.


用户与助手之间的对话。用户提出一个问题,助手解决它。助手首先在心中思考推理过程,然后向用户提供答案。推理过程和答案分别被包含在 <think> </think> 和 <answer> </answer> 标签中,即 <think> 推理过程在这里 </think> <answer> 答案在这里 </answer>。用户:提示。助手:

表1 |DeepSeek-R1-Zero的模板。提示将在训练期间替换为具体的推理问题。
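可以用一小段 Python 把表 1 的模板拼成训练时的提示词;以下仅为示意,函数名与变量名为假设:

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {prompt}. Assistant:"
)

def build_prompt(question: str) -> str:
    """训练时把具体的推理问题填入模板中 prompt 的位置(对应表 1 的说明)。"""
    return R1_ZERO_TEMPLATE.format(prompt=question)

print(build_prompt("What is 12 * 13?"))
```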

2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero - 2.2.4. DeepSeek-R1-Zero的性能、自我进化过程和顿悟时刻

Performance of DeepSeek-R1-Zero Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model’s performance over time.

DeepSeek-R1-Zero 的性能 图 2 描绘了 DeepSeek-R1-Zero 在整个 RL 训练过程中于 AIME 2024 基准测试上的性能轨迹。如图所示,DeepSeek-R1-Zero 在 RL 训练过程中表现出稳定且持续的性能提升。值得注意的是,AIME 2024 的平均 pass@1 分数显示出显著的增长,从最初的 15.6% 跃升至令人印象深刻的 71.0%,达到了与 OpenAI-o1-0912 相当的性能水平。这一显著的改进突显了我们的强化学习算法在优化模型性能方面的有效性。

(图 2:DeepSeek-R1-Zero 在 RL 训练过程中于 AIME 2024 上的准确率曲线,原文插图略)

Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI’s o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeekR1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero’s performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks.

表 2 提供了 DeepSeek-R1-Zero 和 OpenAI 的 o1-0912 模型在多种推理相关基准上的比较分析。 研究结果表明,强化学习赋能DeepSeek-R1-Zero在没有任何监督微调数据的情况下实现了强大的推理能力。 这是一个值得注意的成就,因为它强调了模型通过强化学习单独学习和有效概括的能力。 此外,通过应用多数投票,DeepSeek-R1-Zero的性能可以进一步增强。 例如,当在AIME基准上采用多数投票时,
DeepSeek-R1-Zero的性能从71.0%提升至86.7%,从而超越了OpenAI-o1-0912的表现。DeepSeek-R1-Zero在有无多数投票的情况下都能实现如此具有竞争力的表现,突显了其强大的基础能力及在推理任务中进一步发展的潜力。
(表 2:DeepSeek-R1-Zero 与 OpenAI o1 系列模型在推理相关基准上的对比,原文表格略)

Self-evolution Process of DeepSeek-R1-Zero The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model’s progression without the influence of the supervised fine-tuning stage. This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks.

DeepSeek-R1-Zero的自我进化过程 DeepSeek-R1-Zero的自我进化过程是一个引人入胜的示范,展示了强化学习如何驱动模型自主提升其推理能力。 通过直接从基础模型启动强化学习,我们可以在没有监督微调阶段影响的情况下,密切监控模型的进展。 这种方法清晰地展示了模型随时间演变的过程,特别是在处理复杂推理任务的能力方面。

As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improve-ment throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.

如图3所示,在整个训练过程中 DeepSeek-R1-Zero的思考时间表现出持续的改善。 这种改善不是外部调整的结果,而是模型内部的内在发展。 DeepSeek-R1-Zero通过利用扩展的测试时间计算,自然获得了解决日益复杂的推理任务的能力。 这种计算范围从生成数百到数千个推理标记,使模型能够更深入地探索和完善其思维过程。
(图 3:RL 训练过程中 DeepSeek-R1-Zero 的平均响应长度变化,原文插图略)

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.

这种自我演变中最显著的方面之一是随着测试时间计算的增加,复杂行为的出现。 例如反思——模型重新审视和重新评估其先前步骤的过程——以及自发探索替代问题解决方法的行为。 这些行为并不是显式编程的,而是模型与强化学习环境互动的结果。 这种自发的发展显著增强了DeepSeek-R1-Zero的推理能力,使其能够以更高的效率和准确性应对更具挑战性的任务。

Aha Moment of DeepSeek-R1-Zero A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase,
DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

DeepSeek-R1-Zero 的顿悟时刻 在 DeepSeek-R1-Zero 的训练过程中,观察到的一个特别引人注目的现象是“顿悟时刻”的发生。这一时刻如表 3 所示,发生在模型的一个中间版本中。在这个阶段,DeepSeek-R1-Zero 通过重新评估其初始方法,学会为一个问题分配更多的思考时间。这种行为不仅证明了模型日益增长的推理能力,也是强化学习如何带来意想不到且复杂结果的迷人例子。
(表 3:DeepSeek-R1-Zero 某个中间版本出现“顿悟时刻”的示例,原文表格略)

This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.

这一时刻不仅是模型的“顿悟时刻”,也是观察其行为的研究人员的“顿悟时刻”。它强调了强化学习的力量和美妙:我们并不是明确教导模型如何解决问题,而是简单地为它提供正确的激励,它便能自主发展出先进的问题解决策略。“顿悟时刻”强有力地提醒我们,强化学习有潜力解锁人工系统的新智能水平,为未来更自主、更具适应性的模型铺平道路。

Drawback of DeepSeek-R1-Zero Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.

DeepSeek-R1-Zero 的缺点 尽管 DeepSeek-R1-Zero 展现了强大的推理能力,并自主发展出意想不到且强大的推理行为,但它仍面临着几个问题。例如,DeepSeek-R1-Zero 存在可读性差和语言混合等问题。为了使推理过程更具可读性并与开放社区分享,我们探索了 DeepSeek-R1,这是一种利用对人类友好的冷启动数据进行强化学习的方法。

2.3. DeepSeek-R1: Reinforcement Learning with Cold Start - 2.3. DeepSeek-R1:带冷启动的强化学习

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The pipeline consists of four stages, outlined as follows.

受到DeepSeek-R1-Zero良好结果的启发,两个自然的问题出现了:1)通过引入少量高质量数据作为冷启动,推理性能是否可以进一步提高或收敛加速?2)我们如何训练一个用户友好的模型,不仅能产生清晰连贯的思维链(CoT),还展示出强大的通用能力?为了解决这些问题,我们设计了一个训练DeepSeek-R1的流程。该流程分为四个阶段,概述如下。

2.3.1. Cold Start - 2.3.1. 冷启动

Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

与 DeepSeek-R1-Zero 不同,为了避免从基座模型开始 RL 训练时早期不稳定的冷启动阶段,对于 DeepSeek-R1,我们构建并收集了少量长 CoT 数据来微调模型,作为初始的 RL actor。为了收集这些数据,我们探索了几种方法:使用以长 CoT 为示例的少样本提示、直接提示模型生成带有反思和验证的详细答案、以可读格式收集 DeepSeek-R1-Zero 的输出,以及通过人工标注者的后处理来精炼结果。

In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include:
Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results.
Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero. We believe the iterative training is a better way for reasoning models.

在这项工作中,**我们收集了数千个冷启动数据,以微调DeepSeek-V3-Base作为强化学习的起点。**与DeepSeek-R1-Zero相比,冷启动数据的优势包含:
可读性: DeepSeek-R1-Zero 的一个关键限制是其内容通常不适合阅读。响应可能混合多种语言,或缺乏用于为用户突出显示答案的 Markdown 格式。相比之下,在为 DeepSeek-R1 创建冷启动数据时,我们设计了一种可读的模式,在每个响应的末尾包含一个摘要,并过滤掉不适合读者的响应。在这里,我们将输出格式定义为 |special_token|<推理过程>|special_token|<摘要>,其中推理过程是针对查询的 CoT,而摘要用于总结推理结果。
潜力: 通过精心设计带有人类先验的冷启动数据模式,我们观察到相较于DeepSeek-R1-Zero的更好表现。 我们相信迭代训练是推理模型的更好方法。
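对于上面提到的 |special_token|<reasoning_process>|special_token|<summary> 输出格式,可以用如下示意代码做拆分;special token 的具体写法论文并未公开,这里用 "|special_token|" 占位,仅说明思路:

```python
SPECIAL = "|special_token|"

def split_cold_start_output(text: str):
    """把冷启动格式的输出拆成 (推理过程, 摘要);格式不符合时返回 None(示意实现)。"""
    parts = text.split(SPECIAL)
    # 期望形如: "" + SPECIAL + 推理过程 + SPECIAL + 摘要
    if len(parts) != 3:
        return None
    reasoning, summary = parts[1].strip(), parts[2].strip()
    return reasoning, summary

sample = "|special_token|先化简方程,再逐步求根……|special_token|最终解为 x = 2。"
print(split_cold_start_output(sample))
```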

2.3.2. Reasoning-oriented Reinforcement Learning - 2.3.2. 面向推理的强化学习

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward. We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks.

在对DeepSeek-V3-Base进行冷启动数据的微调后,我们应用与DeepSeek-R1-Zero中采用的相同的大规模强化学习训练过程。此阶段专注于增强模型的推理能力,特别是在编码、数学、科学和逻辑推理等推理密集型任务中,这些任务涉及具有明确解决方案的明确定义的问题。 在训练过程中,我们观察到 CoT 经常表现出语言混合,特别是在 RL 提示涉及多种语言时。 为了缓解语言混合的问题,我们在 RL 训练中引入了语言一致性奖励,该奖励的计算方式是 CoT 中目标语言单词的比例。 尽管消融实验表明,这种对齐会导致模型性能略微下降,但该奖励与人类偏好一致,使其更具可读性。 最后,我们通过直接相加推理任务的准确性和语言一致性奖励来形成最终奖励。然后,我们在微调后的模型上应用 RL 训练,直到其在推理任务上达到收敛。
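语言一致性奖励被定义为 CoT 中目标语言词汇的占比,并与推理准确性奖励直接相加。下面是一个粗糙的 Python 示意(这里用 Unicode 区间粗略判断中文字符,真实实现可能使用更精细的语言识别,阈值与切分方式均为假设):

```python
import re

def language_consistency_reward(cot: str, target_lang: str = "zh") -> float:
    """按目标语言字符/词在 CoT 中的占比给出 0~1 的奖励(粗略示意)。"""
    tokens = re.findall(r"[\u4e00-\u9fff]|[A-Za-z]+", cot)  # 中文按字、英文按词粗略切分
    if not tokens:
        return 0.0
    if target_lang == "zh":
        hits = sum(1 for t in tokens if re.match(r"[\u4e00-\u9fff]", t))
    else:
        hits = sum(1 for t in tokens if re.match(r"[A-Za-z]+", t))
    return hits / len(tokens)

def final_reward(accuracy: float, cot: str, target_lang: str = "zh") -> float:
    # 论文说明:最终奖励 = 推理准确性奖励 + 语言一致性奖励(直接相加)
    return accuracy + language_consistency_reward(cot, target_lang)

print(final_reward(1.0, "首先 simplify 这个 equation,然后求解。", "zh"))
```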

2.3.3. Rejection Sampling and Supervised Fine-Tuning - 2.3.3. 拒绝采样与监督微调

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.

当面向推理的 RL 收敛时,我们利用生成的检查点收集 SFT(监督微调)数据以进行后续轮次。 与最初的冷启动数据主要关注推理不同,这一阶段结合了来自其他领域的数据,以增强模型在写作、角色扮演和其他通用任务中的能力。 具体而言,我们生成数据并按照下面的描述对模型进行微调。

Reasoning data We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chain-of-thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning related training samples.

推理数据 我们策划推理提示,并通过从上述强化学习训练的检查点进行拒绝采样来生成推理轨迹。 在前一个阶段,我们只包含可以使用基于规则的奖励进行评估的数据。 然而,在这个阶段,我们通过纳入额外数据来扩展数据集,其中一些数据使用生成奖励模型,通过将真实值和模型预测输入DeepSeek-V3进行判断。此外,由于模型输出有时混乱且难以阅读,我们过滤掉了混合语言的思维链、长段落和代码块。 对于每个提示,我们采样多个响应,仅保留正确的响应。 总的来说,我们收集了大约60万条与推理相关的训练样本。
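这一步"采样多条、只留正确且可读样本"的流程,可以用如下示意代码概括;其中生成函数、判分函数与过滤阈值均为假设,仅用于说明思路:

```python
import re

def looks_messy(cot: str, max_paragraph_chars: int = 2000) -> bool:
    """过滤混合语言、超长段落和包含代码块的思维链(启发式规则与阈值均为假设)。"""
    code_fence = "`" * 3
    has_code_block = code_fence in cot
    has_mixed_lang = bool(re.search(r"[\u4e00-\u9fff]", cot)) and bool(re.search(r"[A-Za-z]{4,}", cot))
    has_long_paragraph = any(len(p) > max_paragraph_chars for p in cot.split("\n"))
    return has_code_block or has_mixed_lang or has_long_paragraph

def rejection_sample(prompt, generate_fn, is_correct_fn, n_samples: int = 16):
    """对同一提示采样多条回答,只保留判定为正确且可读的样本(示意)。

    generate_fn(prompt) -> (cot, answer);is_correct_fn(prompt, answer) -> bool。
    后者既可以是基于规则的校验,也可以像文中那样把标准答案和预测交给 DeepSeek-V3 做生成式判分。
    """
    kept = []
    for _ in range(n_samples):
        cot, answer = generate_fn(prompt)
        if is_correct_fn(prompt, answer) and not looks_messy(cot):
            kept.append({"prompt": prompt, "cot": cot, "answer": answer})
    return kept
```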

Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting. However, for simpler queries, such as “hello” we do not provide a CoT in response. In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning.

非推理数据 对于非推理数据,如写作、事实问答、自我认知和翻译,我们采用 DeepSeek-V3 的流程(pipeline),并重用 DeepSeek-V3 的部分 SFT 数据集。对于某些非推理任务,我们通过提示调用 DeepSeek-V3,在回答问题之前生成潜在的思维链。然而,对于更简单的查询,例如“你好”,我们不提供 CoT 作为回应。最终,我们收集了大约 20 万个与推理无关的训练样本。

We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples.

我们使用上述约80万个样本的精心策划数据集对DeepSeek-V3-Base进行了两轮微调。

2.3.4. Reinforcement Learning for all Scenarios - 2.3.4. 所有场景的强化学习

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.

为了进一步使模型与人类偏好对齐,我们实施了第二个强化学习阶段,旨在提高模型的有用性和无害性,同时提升其推理能力。具体而言,我们使用奖励信号和多样化提示分布的组合来训练模型。对于推理数据,我们遵循 DeepSeek-R1-Zero 中概述的方法,该方法利用基于规则的奖励来指导数学、代码和逻辑推理领域的学习过程。对于一般数据,我们依赖奖励模型来捕捉复杂和微妙场景中的人类偏好。我们基于 DeepSeek-V3 的流程,并采用类似的偏好对和训练提示的分布。为了有用性,我们只关注最终摘要,确保评估强调响应对用户的实用性和相关性,同时最小化对底层推理过程的干扰。为了无害性,我们评估模型的整个响应,包括推理过程和摘要,以识别和减轻在生成过程中可能出现的任何潜在风险、偏见或有害内容。最终,奖励信号和多样化数据分布的整合使我们能够训练出在推理方面表现出色,同时优先考虑有用性和无害性的模型。
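第二阶段 RL 中不同数据使用不同奖励信号:推理数据沿用规则奖励,通用数据使用奖励模型,其中有用性只看最终摘要、无害性看完整输出。下面的 Python 示意展示了这种奖励路由;helpful_rm、harmless_rm 等接口以及两者的组合方式都是假设,论文未给出细节:

```python
def combined_reward(sample, response, rule_reward_fn, helpful_rm, harmless_rm):
    """第二阶段 RL 的奖励路由示意。

    sample["type"] 区分推理数据与通用数据;helpful_rm / harmless_rm 是假设的奖励模型接口,
    分别对"最终摘要"和"完整回复(推理过程 + 摘要)"打分。
    """
    summary = response["summary"]                          # 最终摘要
    full_text = response["reasoning"] + "\n" + summary     # 推理过程 + 摘要

    if sample["type"] == "reasoning":
        # 数学 / 代码 / 逻辑推理:沿用 DeepSeek-R1-Zero 的基于规则的奖励
        return rule_reward_fn(response, sample["ground_truth"])

    # 通用数据:有用性只评最终摘要,无害性评完整输出;两者如何组合论文未说明,这里直接相加作演示
    return helpful_rm(sample["prompt"], summary) + harmless_rm(sample["prompt"], full_text)
```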

2.4. Distillation: Empower Small Models with Reasoning Capability - 2.4. 蒸馏:赋予小型模型推理能力

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than that of Llama-3.1.

For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community

为了使更高效的小型模型具备类似 DeepSeek-R1 的推理能力,我们直接使用 §2.3.3 中介绍的、由 DeepSeek-R1 整理的 80 万条样本,对 Qwen(Qwen, 2024b)和 Llama(AI@Meta, 2024)等开源模型进行了微调。我们的研究结果表明,这种简单的蒸馏方法显著增强了小型模型的推理能力。我们在这里使用的基础模型是 Qwen2.5-Math-1.5B、Qwen2.5-Math-7B、Qwen2.5-14B、Qwen2.5-32B、Llama-3.1-8B 和 Llama-3.3-70B-Instruct。我们选择 Llama-3.3,因为它的推理能力略优于 Llama-3.1。

对于蒸馏模型,我们仅应用 SFT,而不包括 RL 阶段,尽管引入 RL 可能会显著提升模型性能。 我们在这里的主要目标是展示蒸馏技术的有效性,将 RL 阶段的探索留给更广泛的研究社区。
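在实现层面,这里的蒸馏就是用教师模型(DeepSeek-R1)生成的约 80 万条样本对小模型做标准 SFT。下面是一个基于 Hugging Face transformers 的极简示意;数据文件名、学习率等超参数均为假设,并省略了分布式训练等工程细节:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# 假设蒸馏数据已整理成 {"text": 完整的 prompt + CoT + 答案} 的 JSONL 文件(文件名为假设)
dataset = load_dataset("json", data_files="r1_distill_800k.jsonl", split="train")

model_name = "Qwen/Qwen2.5-Math-7B"  # 论文中使用的基座模型之一
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="r1-distill-qwen-7b",
    num_train_epochs=2,              # 论文:用约 80 万样本微调两个 epoch
    per_device_train_batch_size=1,
    learning_rate=1e-5,              # 假设值,论文未给出
    bf16=True,
)

# 仅做 SFT,不包含 RL 阶段;collator 负责 padding 与构造语言建模标签
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```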

3. Experiment - 3. 实验

Benchmarks We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider, LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.

基准测试 我们在以下基准上评估模型:MMLU(Hendrycks 等,2020)、MMLU-Redux(Gema 等,2024)、MMLU-Pro(Wang 等,2024)、C-Eval(Huang 等,2023)、CMMLU(Li 等,2023)、IFEval(Zhou 等,2023)、FRAMES(Krishna 等,2024)、GPQA Diamond(Rein 等,2023)、SimpleQA(OpenAI,2024c)、C-SimpleQA(He 等,2024)、SWE-Bench Verified(OpenAI,2024d)、Aider、LiveCodeBench(Jain 等,2024)(2024-08 至 2025-01)、Codeforces、中国高中数学奥林匹克(CNMO 2024)以及 2024 年美国数学邀请赛(AIME 2024)(MAA,2024)。除了标准基准测试,我们还使用 LLM 作为评审,在开放式生成任务上评估我们的模型。具体而言,我们遵循 AlpacaEval 2.0(Dubois 等,2024)和 Arena-Hard(Li 等,2024)的原始配置,它们利用 GPT-4-Turbo-1106 作为成对比较的评审。在这里,我们仅将最终摘要提供给评估,以避免长度偏差。对于蒸馏模型,我们报告 AIME 2024、MATH-500、GPQA Diamond、Codeforces 和 LiveCodeBench 上的代表性结果。

Evaluation Prompts Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP, GPQA Diamond, and SimpleQA are evaluated using prompts from the simpleevals framework. For MMLU-Redux, we adopt the Zero-Eval prompt format (Lin, 2024) in a zero-shot setting. In terms of MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts are few-shot, we slightly modify the prompt to the zero-shot setting. The CoT in few-shot may hurt the performance of DeepSeek-R1. Other datasets follow their original evaluation protocols with default prompts provided by their creators. For code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-Bench verified results are obtained via the agentless framework (Xia et al., 2024). AIDER-related benchmarks are measured using a “diff” format. DeepSeek-R1 outputs are capped at a maximum of 32,768 tokens for each benchmark.

评估提示 根据 DeepSeek-V3 的设置,使用来自 simple-evals 框架的提示评估标准基准,如 MMLU、DROP、GPQA Diamond 和 SimpleQA。 对于 MMLU-Redux,我们在零样本设置中采用 Zero-Eval 提示格式(Lin, 2024)。 在 MMLU-Pro、C-Eval 和 CLUE-WSC 中,由于原始提示是少样本的,我们稍微修改提示以适应零样本设置。 少样本中的 CoT 可能会损害 DeepSeek-R1 的性能。其他数据集遵循其原始评估协议,使用其创建者提供的默认提示。 对于代码和数学基准,HumanEval-Mul 数据集涵盖八种主流编程语言(Python、Java、C++、C#、JavaScript、TypeScript、PHP 和 Bash)。模型在 LiveCodeBench 上的表现使用 CoT 格式进行评估,数据收集时间为 2024 年 8 月至 2025 年 1 月。 Codeforces 数据集使用来自 10 个 Div.2 竞赛的问题以及专家设计的测试用例进行评估,之后计算预期的评分和竞争者的百分比。 SWE-Bench Verified 的结果通过 agentless 框架获得(Xia 等,2024)。 与 Aider 相关的基准使用 “diff” 格式进行测量。 DeepSeek-R1 的输出在每个基准上限制为最多 32,768 个token。

Baselines We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare the open-source model QwQ-32B-Preview (Qwen, 2024a).

基线 我们对多个强基线进行全面评估,包括 DeepSeek-V3、Claude-Sonnet-3.5-1022、GPT-4o-0513、OpenAI-o1-mini 和 OpenAI-o1-1217。由于在中国大陆访问 OpenAI-o1-1217 API 较为困难,我们根据其官方报告来报告性能。对于蒸馏模型,我们还比较了开源模型 QwQ-32B-Preview(Qwen,2024a)。

Evaluation Setup We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, we default to pass@𝑘 evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling temperature of 0.6 and a top-𝑝 value of 0.95 to generate 𝑘responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as

$$\text{pass@1}=\frac{1}{k}\sum_{i=1}^{k}p_i$$

where 𝑝𝑖 denotes the correctness of the 𝑖-th response. This method provides more reliable performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wanget al., 2022) using 64 samples, denoted as cons@64.

评估设置 我们将模型的最大生成长度设置为 32,768 个 token。我们发现,使用贪婪解码来评估长输出推理模型会导致更高的重复率,并且在不同检查点之间存在显著的波动。因此,我们默认采用 pass@𝑘 评估(Chen et al., 2021),并使用非零温度报告 pass@1。具体而言,我们使用 0.6 的采样温度和 0.95 的 top-𝑝 值,为每个问题生成 𝑘 个响应(通常在 4 到 64 之间,具体取决于测试集的大小)。然后按如下方式计算 pass@1:

$$\text{pass@1}=\frac{1}{k}\sum_{i=1}^{k}p_i$$

其中 𝑝𝑖表示第 𝑖个响应的正确性。 该方法提供了更可靠的性能估计。 对于 AIME 2024,我们还报告了共识(多数投票)结果(Wang et al., 2022),使用 64 个样本,记作 cons@64。
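按上面的定义,pass@1 与 cons@64(多数投票)可以这样计算;下面是一个最小的 Python 示意,答案样例为虚构数据:

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1 = (1/k) * Σ p_i,其中 p_i ∈ {0, 1} 表示第 i 个采样回答是否正确。"""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(answers, ground_truth):
    """cons@k(多数投票):取 k 个采样答案中出现次数最多者,再与标准答案比对。"""
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority == ground_truth else 0.0

# 示例:某道题采样 k=8 次(数据为虚构)
answers = ["042", "042", "017", "042", "042", "035", "042", "042"]
flags = [a == "042" for a in answers]
print(pass_at_1(flags))              # 0.75
print(cons_at_k(answers, "042"))     # 1.0
```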

3.1. DeepSeek-R1 Evaluation - 3.1. DeepSeek-R1 评估

(表 4:DeepSeek-R1 与其他代表性模型的对比结果,原文表格略)

For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1 excels on FRAMES, a long-context-dependent QA task, showcasing its strong document analysis capabilities. This highlights the potential of reasoning models in AI-driven search and data analysis tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%.

对于以教育为导向的知识基准,如 MMLU、MMLU-Pro 和 GPQA Diamond,DeepSeek-R1 的表现优于 DeepSeek-V3。这一改进主要归因于在 STEM 相关问题上的准确性提升,通过大规模强化学习实现了显著的进步。 此外,DeepSeek-R1 在 FRAMES 这一长上下文依赖的问答任务上表现出色,展示了其强大的文档分析能力。 这突显了推理模型在 AI 驱动的搜索和数据分析任务中的潜力。 在事实基准测试SimpleQA上,DeepSeek-R1的表现优于DeepSeek-V3,展示了其处理基于事实查询的能力。 在这一基准上也观察到类似的趋势:OpenAI-o1 超越了 GPT-4o。 然而,DeepSeek-R1 在中文 SimpleQA 基准上的表现不如 DeepSeek-V3,主要是由于其在安全强化学习后倾向于拒绝回答某些查询。如果没有安全强化学习,DeepSeek-R1 的准确率可以超过 70%。

DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model’s ability to follow format instructions. These improvements can be linked to the inclusion of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard, indicating DeepSeek-R1’s strengths in writing tasks and open-domain question answering. Its significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale RL, which not only boosts reasoning capabilities but also improves performance across diverse domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying its robustness across multiple tasks.

DeepSeek-R1 在 IF-Eval 上也取得了令人印象深刻的结果,这是一个旨在评估模型遵循格式指令能力的基准。 这些改进可以归因于在监督微调(SFT)和强化学习训练的最后阶段纳入了遵循指令的数据。 此外,在 AlpacaEval2.0 和 ArenaHard 上观察到显著的性能,这表明 DeepSeek-R1 在写作任务和开放领域问答方面的优势。 其显著超越 DeepSeek-V3 的表现突显了大规模强化学习的泛化优势,这不仅提升了推理能力,还改善了在不同领域的表现。 此外,DeepSeek-R1 生成的摘要长度简洁,在 ArenaHard 上平均为 689 个标记,在 AlpacaEval 2.0 上为 2,218 个字符。这表明 DeepSeek-R1 在基于 GPT 的评估中避免了引入长度偏差,进一步巩固了其在多项任务中的稳健性。

On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing other models by a large margin. A similar trend is observed on coding algorithm tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1 on Aider but achieves comparable performance on SWE Verified. We believe the engineering performance of DeepSeek-R1 will improve in the next version, as the amount of related RL training data currently remains very limited.

在数学任务上,DeepSeek-R1 的表现与 OpenAI-o1-1217 相当,大幅超越其他模型。在编码算法任务上,如 LiveCodeBench 和 Codeforces,也观察到类似的趋势,以推理为重点的模型主导了这些基准。在工程导向的编码任务中,OpenAI-o1-1217 在 Aider 上优于 DeepSeek-R1,但在 SWE Verified 上表现相当。我们相信 DeepSeek-R1 的工程性能将在下一个版本中得到改善,因为目前相关的强化学习训练数据仍然非常有限。

3.2. Distilled Model Evaluation - 3.2. 蒸馏模型评估

(表 5:DeepSeek-R1 蒸馏模型与其他可比模型在推理相关基准上的对比,原文表格略)

As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.

如表 5 所示,简单地蒸馏 DeepSeek-R1 的输出,就能使高效的 DeepSeek-R1-7B(即 DeepSeek-R1-Distill-Qwen-7B,下文类似缩写)在各方面超越 GPT-4o-0513 等非推理模型。DeepSeek-R1-14B 在所有评估指标上超过 QwQ-32B-Preview,而 DeepSeek-R1-32B 和 DeepSeek-R1-70B 在大多数基准测试中显著超过 o1-mini。这些结果展示了蒸馏的强大潜力。此外,我们发现将强化学习应用于这些蒸馏模型还能带来显著的进一步提升。我们认为这值得进一步探索,因此在此仅呈现简单 SFT 蒸馏模型的结果。

4. Discussion - 4. 讨论

4.1. Distillation v.s. Reinforcement Learning - 4.1. 蒸馏与强化学习

In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive results. However, there is still one question left: can the model achieve comparable performance through the large-scale RL training discussed in the paper without distillation?

在第 3.2 节中,我们可以看到通过蒸馏 DeepSeek-R1,小模型可以取得令人印象深刻的结果。 然而,仍然有一个问题:该模型是否可以通过本文讨论的大规模 RL 训练在没有蒸馏的情况下实现可比的性能?

To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math, code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale RL training, achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks.

为了回答这个问题,我们在 Qwen-32B-Base 上使用数学、代码和 STEM 数据进行大规模 RL 训练,训练超过 10K 步,最终得到 DeepSeek-R1-Zero-Qwen-32B。实验结果如表 6 所示,表明 32B 基础模型经过大规模强化学习训练后,达到了与 QwQ-32B-Preview 相当的性能。然而,从 DeepSeek-R1 蒸馏而来的 DeepSeek-R1-Distill-Qwen-32B 在所有基准测试中表现显著优于 DeepSeek-R1-Zero-Qwen-32B。
(表 6:蒸馏模型与 RL 训练模型在推理相关基准上的对比,原文表格略)

Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and largerscale reinforcement learning.

因此,我们可以得出两个结论:首先,将更强大的模型蒸馏成更小的模型可以获得优秀的结果,而依赖于本文提到的大规模强化学习的小模型则需要巨大的计算能力,甚至可能无法达到蒸馏的性能。 其次,尽管蒸馏策略既经济又有效,但超越智能的边界可能仍需要更强大的基础模型和更大规模的强化学习。

4.2. Unsuccessful Attempts - 4.2. 不成功的尝试

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models.

在开发 DeepSeek-R1 的早期阶段,我们也遇到了失败和挫折。我们在这里分享我们的失败经验,以提供见解,但这并不意味着这些方法无法开发出有效的推理模型。

Process Reward Model (PRM) PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

过程奖励模型(PRM)PRM 是一种合理的方法,可以引导模型朝着更好的解决推理任务的方法前进(Lightman 等,2023;Uesato 等,2022;Wang 等,2023)。 然而,在实践中,PRM 存在三个主要限制,这可能会阻碍其最终成功。 首先,很难明确地定义一般推理中的细粒度步骤。 其次,确定当前中间步骤是否正确是一项具有挑战性的任务。 使用模型进行自动标注可能无法产生令人满意的结果,而手动标注又不利于规模化。第三,一旦引入基于模型的 PRM,就不可避免地会导致奖励黑客行为(Gao 等,2022),而重新训练奖励模型需要额外的训练资源,并且使整个训练流程变得复杂。 总之,尽管 PRM 在重新排序模型生成的前 N 个响应或辅助引导搜索方面表现出良好的能力(Snell 等,2024),但与其在我们实验中引入的大规模强化学习过程中的额外计算开销相比,其优势是有限的。

Monte Carlo Tree Search (MCTS) Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

蒙特卡罗树搜索 (MCTS) 受到AlphaGo (Silver et al., 2017b) 和 AlphaZero (Silver et al., 2017a)的启发,我们探索了使用蒙特卡罗树搜索 (MCTS) 来增强测试时的计算可扩展性。 这种方法涉及将答案分解为更小的部分,以便模型能够系统地探索解决方案空间。 为此,我们提示模型生成多个标签,这些标签对应于搜索所需的特定推理步骤。 在训练过程中,我们首先使用收集到的提示通过MCTS找到答案,并由预训练的价值模型进行指导。 随后,我们使用生成的问题-答案对来训练 actor 模型和 value 模型,迭代地完善这一过程。

However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo’s core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.

然而,这种方法在扩大训练规模时遇到了几个挑战。 首先,与棋类游戏不同,棋类游戏的搜索空间相对明确,token 生成呈现出一个指数级更大的搜索空间。 为了解决这个问题,我们为每个节点设置了最大扩展限制,但这可能导致模型陷入局部最优解。 其次,价值模型直接影响生成的质量,因为它指导搜索过程的每一步。训练一个细粒度的价值模型本质上是困难的,这使得模型难以进行迭代改进。 虽然AlphaGo的核心成功依赖于训练一个价值模型以逐步提升其性能,但由于token生成的复杂性,这一原则在我们的设置中难以复制。

In conclusion, while MCTS can improve performance during inference when paired with a pre-trained value model, iteratively boosting model performance through self-search remains a significant challenge.

总之,尽管MCTS在与预训练价值模型配对时可以提高推理过程中的性能,但通过自我搜索迭代提升模型性能仍然是一个重大挑战。

5. Conclusion, Limitations, and Future Work - 5. 结论、局限性与未来工作

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful,
leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

在这项工作中,我们分享了通过强化学习增强模型推理能力的过程。 DeepSeek-R1-Zero代表了一种纯粹的强化学习方法,无需依赖冷启动数据,在各种任务中取得了强劲的表现。 DeepSeek-R1更强大,利用冷启动数据以及迭代的强化学习微调。 最终,DeepSeek-R1在一系列任务上的表现与OpenAI-o1-1217相当。

We further explore distilling the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. Other dense models also achieve impressive results, significantly outperforming other instruction-tuned models based on the same underlying checkpoints.

我们进一步探索将推理能力蒸馏到小型密集模型中。我们使用 DeepSeek-R1 作为教师模型生成 80 万条训练样本,并微调了多个小型密集模型。结果令人鼓舞:DeepSeek-R1-Distill-Qwen-1.5B 在数学基准上超越了 GPT-4o 和 Claude-3.5-Sonnet,在 AIME 上取得 28.9%、在 MATH 上取得 83.9% 的成绩。其他密集模型也取得了令人印象深刻的结果,显著超越了基于相同底层检查点的其他指令微调模型。

In the future, we plan to invest in research across the following directions for DeepSeek-R1.
• General Capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.
• Language Mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.
• Prompting Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results.
• Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.

未来,我们计划在以下方向上对 DeepSeek-R1 展开研究。
• 通用能力: 目前,DeepSeek-R1 在函数调用、多轮对话、复杂角色扮演和 JSON 输出等任务上的能力仍不及 DeepSeek-V3。展望未来,我们计划探索如何利用长 CoT 来增强这些领域的任务。
• 语言混合: DeepSeek-R1 目前针对中文和英文进行了优化,这可能导致在处理其他语言查询时出现语言混合问题。例如,即使查询使用的既不是英语也不是中文,DeepSeek-R1 也可能用英语进行推理和回复。我们旨在在未来的更新中解决这一限制。
• 提示工程: 在评估 DeepSeek-R1 时,我们观察到它对提示非常敏感。少样本提示会持续降低其性能。因此,我们建议用户在零样本设置下直接描述问题并指定输出格式,以获得最佳结果。
• 软件工程任务: 由于评估时间较长,影响了强化学习过程的效率,大规模强化学习尚未在软件工程任务中得到广泛应用。因此,DeepSeek-R1 在软件工程基准测试中并未表现出相较于 DeepSeek-V3 的巨大改进。未来的版本将通过在软件工程数据上实施拒绝采样,或在强化学习过程中引入异步评估来提高效率。
