[Paper Translation] Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Abstract

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

1 Introduction

Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs are often sequence models, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics (Brown et al. 2020; Dosovitskiy et al. 2020; Ismail Fawaz et al. 2019; Oord et al. 2016; Poli et al. 2023; Sutskever, Vinyals, and Quoc V Le 2014). While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer (Vaswani et al. 2017) and its core attention layer (Bahdanau, Cho, and Bengio 2015).

The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks (Tay, Dehghani, Bahri, et al. 2022), but often at the expense of the very properties that make attention effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.

Recently, structured state space sequence models (SSMs) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). This class of models can be computed very efficiently as either a recurrence or convolution, with linear or near-linear scaling in sequence length. Additionally, they have principled mechanisms for modeling long-range dependencies (Gu, Dao, et al. 2020) in certain data modalities, and have dominated benchmarks such as the Long Range Arena (Tay, Dehghani, Abnar, et al. 2021). Many flavors of SSMs (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Y. Li et al. 2023; Ma et al. 2023; Orvieto et al. 2023; Smith, Warrington, and Linderman 2023) have been successful in domains involving continuous signal data such as audio and vision (Goel et al. 2022; Nguyen, Goel, et al. 2022; Saon, Gupta, and Cui 2023). However, they have been less effective at modeling discrete and information-dense data such as text.

We propose a new class of selective state space models that improves on prior work along several axes to achieve the modeling power of Transformers while scaling linearly in sequence length.

Selection Mechanism. First, we identify a key limitation of prior models: the ability to efficiently select data in an input-dependent manner (i.e. focus on or ignore particular inputs). Building on intuition based on important synthetic tasks such as selective copy and induction heads, we design a simple selection mechanism by parameterizing the SSM parameters based on the input. This allows the model to filter out irrelevant information and remember relevant information indefinitely.
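
As an illustration of this parameterization only, the following NumPy sketch (with hypothetical projection weights `W_delta`, `W_B`, `W_C` and a diagonal $A$) shows the core idea: $\Delta$, $B$, and $C$ are computed from the current input rather than being fixed, so each step's state update depends on the token being processed. This is a minimal sketch of the mechanism, not the paper's optimized implementation.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm_step(x_t, h, A, W_delta, W_B, W_C):
    """One recurrent step of a single-channel selective SSM (illustrative sketch).

    x_t                : scalar input at the current step
    h                  : (N,) hidden state
    A                  : (N,) diagonal state matrix (kept input-independent)
    W_delta, W_B, W_C  : hypothetical projection weights that make
                         delta, B, and C functions of the input x_t
    """
    delta = softplus(W_delta * x_t)   # input-dependent step size
    B = W_B * x_t                     # (N,) input-dependent input matrix
    C = W_C * x_t                     # (N,) input-dependent output matrix
    A_bar = np.exp(delta * A)         # zero-order-hold discretization of a diagonal A
    B_bar = delta * B                 # simple Euler-style discretization of B
    h = A_bar * h + B_bar * x_t       # selective state update: keep or overwrite information
    y_t = np.dot(C, h)                # read out the output from the state
    return y_t, h
```

Because $\Delta$, $B$, and $C$ vary with $x_t$, the model can, for example, use a near-zero step size to ignore an irrelevant token or a large one to reset the state, which is the filtering behavior described above.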

Hardware-aware Algorithm. This simple change poses a technical challenge for the computation of the model; in fact, all prior SSM models must be time- and input-invariant in order to be computationally efficient. We overcome this with a hardware-aware algorithm that computes the model recurrently with a scan instead of convolution, but does not materialize the expanded state in order to avoid IO access between different levels of the GPU memory hierarchy. The resulting implementation is faster than previous methods both in theory (scaling linearly in sequence length, compared to pseudo-linear for all convolution-based SSMs) and on modern hardware (up to 3× faster on A100 GPUs).
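
The scan-based recurrent computation can be sketched in a few lines. The fused, IO-aware GPU kernel itself is not reproduced here; the NumPy sketch below (a hypothetical helper, written sequentially for clarity) only illustrates why a scan applies: each step is an affine map of the state, and composing affine maps is associative, so the loop can be reorganized into a parallel prefix scan.

```python
import numpy as np

def scan_combine(left, right):
    """Associative combine for the linear recurrence h_t = a_t * h_{t-1} + b_t.

    Each element is a pair (a, b) representing the affine map h -> a * h + b;
    composing two such maps yields another affine map, which is exactly the
    property a parallel prefix scan needs.
    """
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def linear_recurrence_by_scan(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t (with h_{-1} = 0) for a, b of shape (L, N).

    Written as a sequential fold here; because `scan_combine` is associative,
    the same result can be computed with a work-efficient parallel scan.
    """
    acc = (a[0], b[0])
    states = [acc[1]]
    for t in range(1, len(a)):
        acc = scan_combine(acc, (a[t], b[t]))
        states.append(acc[1])
    return np.stack(states)
```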

Architecture. We simplify prior deep sequence model architectures by combining the design of prior SSM architectures (Dao, Fu, Saab, et al. 2023) with the MLP block of Transformers into a single block, leading to a simple and homogeneous architecture design (Mamba) incorporating selective state spaces.
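
As a rough, hypothetical sketch of what a single homogeneous block can look like (the class name `GatedSSMBlock` and the placeholder `ssm_fn` are ours, and the released Mamba block additionally contains a causal convolution and other details omitted here), one input projection feeds both an SSM branch and a gating branch, so no separate attention or MLP block is needed:

```python
import torch
import torch.nn as nn

class GatedSSMBlock(nn.Module):
    """Simplified gated-SSM block: sequence mixing and gating folded into one unit."""

    def __init__(self, d_model, expand=2, ssm_fn=None):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)  # expand into an SSM path and a gate path
        self.act = nn.SiLU()
        self.ssm_fn = ssm_fn or (lambda u: u)           # placeholder for a selective SSM
        self.out_proj = nn.Linear(d_inner, d_model)     # project back to the model width

    def forward(self, x):                               # x: (batch, length, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        y = self.ssm_fn(self.act(u))                    # sequence mixing on one branch
        y = y * self.act(gate)                          # multiplicative gating replaces a separate MLP block
        return self.out_proj(y)
```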

Selective SSMs, and by extension the Mamba architecture, are fully recurrent models with key properties that make them suitable as the backbone of general foundation models operating on sequences. (i) High quality: selectivity brings strong performance on dense modalities such as language and genomics. (ii) Fast training and inference: computation and memory scales linearly in sequence length during training, and unrolling the model autoregressively during inference requires only constant time per step since it does not require a cache of previous elements. (iii) Long context: the quality and efficiency together yield performance improvements on real data up to sequence length 1M.

We empirically validate Mamba’s potential as a general sequence FM backbone, in both pretraining quality and domain-specific task performance, on several types of modalities and settings:

  • Synthetics. On important synthetic tasks such as copying and induction heads that have been proposed as being key to large language models, Mamba not only solves them easily but can extrapolate solutions indefinitely long (>1M tokens).

  • Audio and Genomics. Mamba outperforms prior state-of-the-art models such as SaShiMi, Hyena, and Transformers on modeling audio waveforms and DNA sequences, both in pretraining quality and downstream metrics (e.g. reducing FID on a challenging speech generation dataset by more than half). In both settings, its performance improves with longer context up to million-length sequences.

  • Language Modeling. Mamba is the first linear-time sequence model that truly achieves Transformer-quality performance, both in pretraining perplexity and downstream evaluations. With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). Our Mamba language model has 5× generation throughput compared to Transformers of similar size, and Mamba-3B’s quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B).

Model code and pre-trained checkpoints are open-sourced at https://github.com/state-spaces/mamba.

2 State Space Models

Structured state space sequence models (S4) are a recent class of sequence models for deep learning that are broadly related to RNNs, CNNs, and classical state space models. They are inspired by a particular continuous system (1) that maps a 1-dimensional function or sequence $x(t) \in \mathbb{R} \mapsto y(t) \in \mathbb{R}$ through an implicit latent state $h(t) \in \mathbb{R}^{N}$.

Concretely, S4 models are defined with four parameters $(\Delta, A, B, C)$, which define a sequence-to-sequence transformation in two stages: the continuous parameters are first discretized (using $\Delta$), and the resulting discrete model is then computed either as a linear recurrence or as a global convolution.

$$h'(t) = A\,h(t) + B\,x(t) \quad (1a)$$
$$y(t) = C\,h(t) \quad (1b)$$

  • Equations (1a) and (1b) describe the state update and output in continuous time: $h'(t)$ is the time derivative of the latent state $h(t)$, capturing how the state evolves, and $y(t)$ is the output computed from the current state $h(t)$.

$$h_t = \overline{A}\,h_{t-1} + \overline{B}\,x_t \quad (2a)$$
$$y_t = C\,h_t \quad (2b)$$

  • Equations (2a) and (2b) are the discrete-time counterpart: $h_t$ is the state at step $t$, updated from the previous state $h_{t-1}$ and the current input $x_t$, and $y_t$ is the output at step $t$.

$$\overline{K} = (C\overline{B},\ C\overline{AB},\ \ldots,\ C\overline{A}^{k}\overline{B},\ \ldots) \quad (3a)$$
$$y = x * \overline{K} \quad (3b)$$

  • Equations (3a) and (3b) describe S4 from the convolutional viewpoint: $\overline{K}$ is the (implicit) convolution kernel obtained by unrolling the recurrence, so the entire output sequence $y$ can be computed as a single convolution of the input $x$ with $\overline{K}$.
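
To make the two computation modes concrete, here is a minimal NumPy sketch (all parameter values below are small, arbitrary examples, not values from the paper) that checks that unrolling the recurrence (2) and convolving the input with the kernel $\overline{K}$ from (3) produce the same outputs for a fixed, time-invariant SSM.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 8                                    # state size, sequence length
A_bar = np.diag(rng.uniform(0.1, 0.9, N))      # example discrete parameters
B_bar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# Recurrent mode, eq. (2a)-(2b): step through the sequence, carrying the state h.
h = np.zeros((N, 1))
y_rec = []
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec.append((C @ h).item())

# Convolutional mode, eq. (3a)-(3b): build K_bar[k] = C A_bar^k B_bar and convolve.
K_bar = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = [np.dot(K_bar[: t + 1][::-1], x[: t + 1]) for t in range(L)]   # causal convolution

assert np.allclose(y_rec, y_conv)   # both modes give the same sequence-to-sequence map
```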