r1-interpretability: Two Open-Source Sparse Autoencoders That Shed Light on How DeepSeek-R1 Reasons



r1-interpretability: Open source interpretability artefacts for R1. Project repository: https://gitcode.com/gh_mirrors/r1/r1-interpretability

Project Overview

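Interpretability has long been a key research topic in deep learning. To advance the field, the Goodfire team has open-sourced two state-of-the-art sparse autoencoders (SAEs) trained on DeepSeek-R1, intended to help researchers and developers understand the inner workings of large reasoning models such as DeepSeek-R1. According to the project, these SAEs are the first publicly released interpreter models trained on a true reasoning model, and the first trained on any model of this scale.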

Technical Analysis

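DeepSeek-R1 is a large model with 671B parameters. Its complex internal structure and sheer size make it hard for independent researchers to run and analyze. To address this, the Goodfire team released two SAEs, one aimed at general reasoning and one at mathematical reasoning. Each SAE decomposes DeepSeek-R1's activations into features, which helps researchers see which features the model uses when it solves complex problems.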
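To make the idea of decomposing activations into features concrete, here is a minimal sketch of a generic sparse autoencoder forward pass in NumPy. It is not the architecture of the released SAEs: the class name `ToySparseAutoencoder`, the ReLU encoder, the random weights, and the tiny dimensions are all assumptions chosen for illustration.

```python
import numpy as np

class ToySparseAutoencoder:
    """Toy SAE (hypothetical, for illustration only).

    features = ReLU((x - b_dec) @ W_enc + b_enc)
    x_hat    = features @ W_dec + b_dec
    """

    def __init__(self, d_model, d_features, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights stand in for trained parameters.
        self.W_enc = rng.normal(0.0, 0.1, size=(d_model, d_features))
        self.b_enc = np.zeros(d_features)
        self.W_dec = rng.normal(0.0, 0.1, size=(d_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # Map model activations to non-negative feature activations.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, features):
        # Reconstruct the original activations from the features.
        return features @ self.W_dec + self.b_dec

# Pretend these are activations for 4 tokens from a model with hidden size 16.
toy_sae = ToySparseAutoencoder(d_model=16, d_features=64)
activations = np.random.default_rng(1).normal(size=(4, 16))

features = toy_sae.encode(activations)
reconstruction = toy_sae.decode(features)
mse = float(np.mean((activations - reconstruction) ** 2))
print(features.shape, reconstruction.shape)
```

In a trained SAE the feature dimension is typically much larger than the model's hidden size, and training encourages only a few features to be active per token, which is what makes individual features easier to interpret.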

The following example loads one of the SAEs:

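# load_math_sae is assumed to come from the sae module in the r1-interpretability repository
from sae import load_math_sae
from huggingface_hub import hf_hub_download

# Download the math SAE checkpoint from the Hugging Face Hub and get its local path
file_path = hf_hub_download(
    repo_id="Goodfire/DeepSeek-R1-SAE-l37",
    filename="math/DeepSeek-R1-SAE-l37.pt",
    repo_type="model"
)
# Load the SAE weights onto the chosen device (CPU here)
device = "cpu"
math_sae = load_math_sae(file_path, device)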

With this in place, researchers can load the SAEs in a few lines of code and use them for further analysis and inference.
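One common form of further analysis is to look at which features are most active on each token. The source does not show how the loaded SAE exposes feature activations, so the sketch below works on a hand-written array instead; the helper name `top_k_features` and the example values are hypothetical.

```python
import numpy as np

def top_k_features(feature_acts, k=3):
    """Return (indices, values) of the k most active features for each token.

    feature_acts: array of shape (n_tokens, n_features).
    """
    # Sort features by descending activation and keep the first k per token.
    idx = np.argsort(-feature_acts, axis=-1)[:, :k]
    vals = np.take_along_axis(feature_acts, idx, axis=-1)
    return idx, vals

# Hand-made feature activations for 2 tokens and 5 features (illustrative only).
feature_acts = np.array([
    [0.0, 2.5, 0.0, 0.1, 0.0],
    [1.2, 0.0, 0.0, 0.0, 3.4],
])

idx, vals = top_k_features(feature_acts, k=2)
print(idx.tolist())   # [[1, 3], [4, 0]]
print(vals.tolist())  # [[2.5, 0.1], [3.4, 1.2]]
```

With a real SAE, the same ranking would be applied to the encoder's output for a prompt, and the resulting feature indices would then be inspected across many examples to see what each feature responds to.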