Do Efficient Transformers Really Save Computation?


Original paper: https://arxiv.org/pdf/2402.13934


Kai Yang$^{1}$, Jan Ackermann$^{2}$, Zhenyu He$^{3}$, Guhao Feng$^{1}$, Bohang Zhang$^{3}$, Yunzhen Feng$^{4}$, Qiwei Ye$^{5}$, Di He$^{3}$, Liwei Wang$^{3,6}$


Abstract


As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers’ practical strengths and weaknesses.


1. Introduction


The Transformer architecture, introduced in the seminal work of Vaswani et al. (2017), has demonstrated remarkable performance in numerous applications ranging from natural language processing to computer vision and speech. A significant advancement has recently been made by scaling up Transformers to build Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023). These LLMs, exemplified by models like GPT and LLaMa, typically have billions of parameters and are trained on datasets containing trillions of tokens. Given the substantial computational demands, enhancing LLMs' efficiency has become a pivotal research focus in both academic and industrial contexts.


The primary computational bottleneck in Transformers arises from the self-attention module, whose complexity scales quadratically with the sequence length. The cost becomes particularly noticeable in tasks that require long sequence generation, such as coherent story generation or reasoning with Chain-of-Thought prompts (Wei et al., 2022b; Kojima et al., 2022; Nye et al., 2022; Zhou et al., 2023). Given these practical needs, a large body of work seeks to develop efficient Transformers that reduce the quadratic complexity of self-attention (Tay et al., 2022), typically by introducing sparsity into the architectural design (Child et al., 2019; Beltagy et al., 2020; Qiu et al., 2020; Kitaev et al., 2020; Vyas et al., 2020; Roy et al., 2021) or by employing low-rank or kernel-based approximations to accelerate the computation (Katharopoulos et al., 2020; Choromanski et al., 2021; Peng et al., 2021; Wang et al., 2020; Luo et al., 2021). However, there is generally a lack of understanding of the capabilities of these efficient Transformers.


In this work, we take a step towards theoretically understanding the capability of efficient Transformers. In particular, we focus on the models' reasoning ability, a fundamental aspect of human intelligence that plays a vital role in problem-solving, decision-making, and planning. Inspired by a recent study of Feng et al. (2023), we model reasoning as a dynamic programming (DP) process, as it closely resembles the way Chain-of-Thought prompts are executed. The output sequence consists of answers to a series of intermediate steps, each corresponding to solving a subproblem represented by a DP state. Feng et al. (2023) proved that all reasoning problems fitting within this framework can be solved by a standard autoregressive Transformer of constant size (independent of the problem scale), thus achieving a computational complexity of $\Theta(L^{2})$, where $L$ is the length of the output sequence.

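To see where the quadratic cost comes from: with a constant-size model, generating the $t$-th token requires attending over a prefix of length $\Theta(t)$, so the total cost of producing an $L$-token chain of thought is (a standard counting argument, not a quotation from the paper)

$$\sum_{t=1}^{L} \Theta(t) \;=\; \Theta\!\left(\frac{L(L+1)}{2}\right) \;=\; \Theta(L^{2}).$$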


$^{1}$School of EECS, Peking University $^{2}$ETH Zürich $^{3}$National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University $^{4}$New York University $^{5}$Beijing Academy of Artificial Intelligence $^{6}$Center for Machine Learning Research, Peking University. Correspondence to: Di He <dihe@pku.edu.cn>, Liwei Wang <wanglw@pku.edu.cn>, Bohang Zhang <zhangbohang@pku.edu.cn>.


Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).





Table 1. Complexity of the Transformer variants on different tasks.


In our work, we focus on two representative (Tay et al., 2022) and successful (Tay et al., 2020b; Brown et al., 2020) variants of efficient Transformers: the Sparse Transformer (Child et al., 2019) and the Linear Transformer (Katharopoulos et al., 2020). (In the following we will refer to these two as efficient Transformers.) Our analysis shows that both architectures possess the necessary expressiveness for all problems within this DP framework, despite their complexities scaling only as $\Theta(L\sqrt{L})$ and $\Theta(L)$, respectively. Although this positive result might lead one to believe that we can supplant the standard Transformer with others of lower complexity, the situation is more complicated: our main result highlights that both the Sparse Transformer and the Linear Transformer require a model size that grows with the problem scale $L$, in contrast to the constant size of standard Transformers. Specifically, under mild assumptions, we prove that neither architecture can generate the DP solution unless the hidden dimension of the network layers scales as $\widetilde{\Omega}(\sqrt{L})$. This scaling results in a total computational complexity of $\widetilde{\Omega}(L^{2})$, matching the vanilla Transformer's complexity. But this result introduces a paradox: when tackling general DP problems, the touted efficiency of these "efficient" Transformers appears to dissolve, leaving them no more efficient than standard Transformers.

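Plugging the $\widetilde{\Omega}(\sqrt{L})$ requirement on the hidden dimension $D$ into the running-time expressions recalled in Section 3 makes the collapse explicit; the following is a back-of-the-envelope check rather than a statement from the paper (the number of layers $M$ is treated as a constant):

$$M\big(L\sqrt{L}\,D + LD^{2}\big)\Big|_{D=\widetilde{\Omega}(\sqrt{L})} = \widetilde{\Omega}(L^{2}) \quad \text{(Sparse)}, \qquad MLD^{2}\Big|_{D=\widetilde{\Omega}(\sqrt{L})} = \widetilde{\Omega}(L^{2}) \quad \text{(Linear)}.$$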

The above findings about general DP problems raise the question: for which problems are efficient Transformers actually efficient? To answer this question, we start by studying a fundamental reasoning task: evaluating arithmetic expressions (Feng et al., 2023). Notably, we find that the complexity lower bound can be improved to $\widetilde{\Omega}(L\sqrt{L})$ for both architectures, and that this lower bound is attained by the Sparse Transformer with a constant hidden dimension. Motivated by this finding, we then identify a general condition that unlocks the efficiency of efficient Transformers, called the locality assumption. Intuitively, this assumption states that each step in the reasoning process depends only on the outcomes of the most recent $m$ reasoning steps, where $m$ is far smaller than $L$, i.e., $m = o(L)$. Under this assumption, we show that the complexity lower bound can be improved for both the Sparse and the Linear Transformer. We summarize our theoretical results in Table 1.

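As an illustrative instance of this assumption (our own example, not one from the paper), a reasoning chain whose $i$-th step only reads the two preceding steps,

$$\mathrm{dp}(i) = f\big(\mathrm{dp}(i-1),\, \mathrm{dp}(i-2),\, s_{h(i)}\big),$$

has locality parameter $m = 2$. By contrast, a recurrence that may read arbitrarily old states, such as the textbook Longest Increasing Subsequence transition $\mathrm{dp}(i) = 1 + \max\{\mathrm{dp}(j) : j < i,\ s_{j} < s_{i}\}$, has $m = \Theta(L)$ and falls outside the assumption, consistent with the experimental split described below.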

We complement our theoretical findings with an extensive set of experiments. Following Feng et al. (2023), we focus on the Arithmetic task and two additional DP problems: the Longest Increasing Subsequence (LIS) and the Edit Distance (ED). Notably, the ED task satisfies the locality assumption, whereas the LIS task does not. For each task, we systematically investigate how variations in the problem size (i.e., the sequence length $L$) and the hidden dimension of the Transformer models impact the models' performance. Empirical evidence confirms that, for both efficient Transformers, the required hidden dimension increases as the problem size grows in most scenarios, while this is not the case for the standard Transformer. Moreover, the dependency between hidden dimension and problem scale is more pronounced in LIS than in ED. These results validate our theory and offer practical insights into the strengths and weaknesses of efficient Transformers.


Notations. We adopt the big-O notation throughout this paper. Specifically, given two functions $f, g : \mathcal{X} \rightarrow [0, \infty)$ where $\mathcal{X}$ can be any set, we write $f = \mathrm{O}(g)$ if there exists a constant $c > 0$ such that $f(x) \leq c\,g(x)$ for all $x \in \mathcal{X}$. We also write $f = \Omega(g)$ if $g = \mathrm{O}(f)$, and write $f = \Theta(g)$ if both $f = \mathrm{O}(g)$ and $f = \Omega(g)$ hold. Moreover, given two functions $f, g : \mathbb{N}_{+}^{d} \rightarrow [0, \infty)$, we write $f = \widetilde{\mathrm{O}}(g)$ if there exist constants $c, k > 0$ such that $f(x) \leq c\,g(x)\log^{k}(x_{1}\cdots x_{d})$ for all $x \in \mathbb{N}_{+}^{d}$. The notations $\widetilde{\Omega}(\cdot)$ and $\widetilde{\Theta}(\cdot)$ can be defined similarly.

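For instance (our own illustrative example, not one from the paper), a running time of $T(L) = 3L\sqrt{L}\log^{2}L + 5L$ satisfies

$$T(L) = \widetilde{\mathrm{O}}(L\sqrt{L}), \qquad T(L) = \widetilde{\Omega}(L\sqrt{L}), \qquad \text{hence } T(L) = \widetilde{\Theta}(L\sqrt{L}),$$

since the polylogarithmic factor $\log^{2}L$ is absorbed by the tilde.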

2. Related Work


Transformers and Large Language Models have received significant attention due to their unprecedented success across various domains. A considerable body of literature has emerged to establish a deeper theoretical understanding of their strengths and constraints.


Universal Approximation. Initially, the theoretical focus was on the capacity of Transformers to approximate diverse functions. Yun et al. (2019) showed that adequately sized Transformers can universally approximate any continuous sequence-to-sequence function within certain bounds. A parallel line of work first showed that Transformers with infinite precision are Turing-complete (Pérez et al., 2019; 2021), and later Wei et al. (2022a) established that Transformers with finite precision are approximately Turing-complete. Recently, Alberti et al. (2023) proved that Linear Transformers are also universal approximators. Whereas these results approach expressiveness by proving computational capacity, we complement our expressiveness results with complexity lower bounds for practical settings.


Formal Language Learning. Additionally, the Transformer's expressivity has been studied in the context of formal language learning. Bhattamishra et al. (2020) constructed a Transformer that detects counter languages, and Yao et al. (2021) showed how to detect Dyck languages. Liu et al. (2022) showed that shallow Transformers can learn finite state automata and simulate them for a number of steps that scales with the model size. Conversely, Hahn (2020) showed that Transformers cannot learn certain distributions over languages. Other works use classical techniques from circuit complexity (Furst et al., 1984) to prove that Transformers can simulate classes of circuits (Hao et al., 2022; Merrill et al., 2022; Merrill & Sabharwal, 2023).


Measuring Complexity. Weiss et al. (2021) introduce a programming language that maps to learnable Transformer encoders and facilitates analyzing the complexity of problems in terms of layers and attention heads. Sanford et al. (2023) introduce a sparse averaging task on which recurrent and feed-forward networks must scale linearly, whereas the Transformer only needs to scale logarithmically. These works are similar to ours in that both establish concrete relationships between model complexity and the solvability of the posed problems; our work, however, deals with autoregressive efficient Transformers equipped with Chain-of-Thought.


In-context learning. A recent line of work demonstrates the in-context learning ability of LLMs (Garg et al., 2022; Brown et al., 2020). Following this, theoretical results (Dai et al., 2023; Von Oswald et al., 2023; Akyürek et al., 2022) show that in-context learning can implement gradient descent. Another line of work explains in-context learning via induction heads (Elhage et al., 2021; Olsson et al., 2022). Similarly, Feng et al. (2023) show that autoregressive Transformers can learn to perform dynamic programming when equipped with Chain-of-Thought. While working in the same setting as Feng et al., we investigate efficient Transformers and present a class of problems on which they can indeed be efficient.


Efficient Transformer. Due to the high complexity of the attention layer, many more efficient methods have been proposed. A first series of ideas exploits fixed attention patterns (Child et al., 2019; Beltagy et al., 2020; Qiu et al., 2020). Another line of work approximates the attention with low-rank matrices or kernels (Katharopoulos et al., 2020; Wang et al., 2020; Choromanski et al., 2021), and further works use learned patterns (Kitaev et al., 2020; Tay et al., 2020a; Roy et al., 2021). A last set of works moves away from Transformers entirely (Sun et al.; Gu & Dao, 2023). Two recent works study when standard attention can be efficient (Alman & Song, 2023) and how to approximate standard attention in linear time (Keles et al., 2023). In contrast to their work, we give theoretical analyses for existing and popular efficient Transformers.


3. Efficient Transformers


The autoregressive Transformer, also called the decoder-only Transformer (Radford et al., 2019; Dai et al., 2019), is a sequence-to-sequence neural network defined as follows. Given an input sequence $s$ of length $n$, it first transforms each input token $s_{i}$ ($i \in [n]$) into a $D$-dimensional vector $\mathbf{x}_{i}^{(0)} = \operatorname{Embed}(s_{i}) + \mathbf{p}_{i} \in \mathbb{R}^{D}$, where $\operatorname{Embed}(\cdot)$ is the token embedding layer and $\mathbf{p}_{i}$ is a learnable positional embedding. Then, $M$ Transformer blocks follow, the $l$-th of which has the following form:


Here, $\operatorname{Attn}^{(l)}$ and $\mathrm{FFN}^{(l)}$ denote the multi-head self-attention layer and the feed-forward network of the $l$-th Transformer block, respectively:


where $\mathbf{W}_{\mathrm{Q}}^{(l,h)}, \mathbf{W}_{\mathrm{K}}^{(l,h)}, \mathbf{W}_{\mathrm{V}}^{(l,h)}, \mathbf{W}_{\mathrm{O}}^{(l,h)} \in \mathbb{R}^{\lceil D/H \rceil \times D}$ are the query, key, value, and output matrices of the $h$-th head in the $l$-th layer, respectively, and $\mathbf{W}_{1}^{(l)}, \mathbf{W}_{2}^{(l)} \in \mathbb{R}^{D \times D}$ are the weight matrices of the FFN. The activation $\sigma$ is chosen as GeLU (Hendrycks & Gimpel, 2016), following Radford et al. (2019) and Devlin et al. (2019). The computed embedding $\mathbf{x}_{n}^{(M)}$ is used to predict the next token $s_{n+1}$, which is then appended to the input to continue the sequence generation process. The process stops when an End-of-Sentence token is generated.

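The following minimal NumPy sketch illustrates the computation described above for a single Transformer block with causal attention. It is an illustration of the standard architecture rather than the authors' reference implementation; details such as layer normalization and the exact handling of the head dimension are simplified.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GeLU activation (Hendrycks & Gimpel, 2016).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, W_O):
    """One causal self-attention head.
    X: (n, D) token embeddings; W_Q, W_K, W_V, W_O: (D_h, D) with D_h = ceil(D / H).
    Returns an (n, D) contribution added to the residual stream."""
    Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T            # (n, D_h) each
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n, n) attention scores
    mask = np.triu(np.ones_like(scores, dtype=bool), 1)   # forbid attending to future tokens
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V @ W_O                      # (n, D_h) @ (D_h, D) -> (n, D)

def transformer_block(X, heads, W_1, W_2):
    """heads: list of (W_Q, W_K, W_V, W_O) tuples; W_1, W_2: (D, D) FFN weights."""
    X = X + sum(attention_head(X, *h) for h in heads)     # multi-head attention + residual
    X = X + gelu(X @ W_1.T) @ W_2.T                       # two-layer FFN with GeLU + residual
    return X

# Toy usage: n = 6 tokens, model width D = 8, H = 2 heads.
rng = np.random.default_rng(0)
n, D, H, D_h = 6, 8, 2, 4
X = rng.normal(size=(n, D))                               # embeddings + positional encodings
heads = [tuple(rng.normal(size=(D_h, D)) * 0.1 for _ in range(4)) for _ in range(H)]
W_1, W_2 = rng.normal(size=(D, D)) * 0.1, rng.normal(size=(D, D)) * 0.1
print(transformer_block(X, heads, W_1, W_2).shape)        # (6, 8)
```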

Based on Equations (1) to (3) and (5), it is easy to see that the computational complexity of an autoregressive Transformer is $\Theta(M(L^{2}D + LD^{2}))$, where $L$ is the sequence length. This quadratic dependency on $L$ limits the application of Transformers to long text, in particular for complex reasoning tasks. To address this, researchers have proposed various efficient Transformers that reduce the complexity. In our work, we investigate the Sparse Transformer and the Linear Transformer. Below, we describe these two architectures.

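The two terms can be read off from the block structure: each of the $M$ blocks computes an $L \times L$ matrix of attention scores, each score being a $D$-dimensional inner product, and applies its $D \times D$ projection and FFN matrices to all $L$ positions. A rough per-block accounting (our own bookkeeping, not quoted from the paper):

$$\underbrace{L^{2} \cdot D}_{\text{attention scores and weighted sums}} \;+\; \underbrace{L \cdot D^{2}}_{\text{linear projections and FFN}} \;\Longrightarrow\; \Theta\big(M(L^{2}D + LD^{2})\big) \text{ over } M \text{ blocks.}$$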

Sparse Transformer. Unlike the standard Transformer, where each token $\mathbf{x}_{i}^{(l)}$ can attend to all previous positions $\{\mathbf{x}_{j}^{(l)} : j \in [i]\}$ (see Equation (1)), in a Sparse Transformer it attends only to a subset of previous tokens $\{\mathbf{x}_{j}^{(l)} : j \in \mathcal{I}_{i}\}$. In this paper, we study a standard design paradigm proposed by Child et al. (2019), which employs a block-wise pattern, as shown in the following:


where $B$ is called the block size and $k, c$ are constant integers. When $B = \Theta(\sqrt{L})$, the Sparse Transformer achieves a minimal complexity of $\Theta(M(L\sqrt{L}D + LD^{2}))$. We note that GPT-3 adopted the above design paradigm (Brown et al., 2020).
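Because the exact index-set formula is not reproduced above, the sketch below implements one common block-wise causal pattern in the spirit of Child et al. (2019): each position attends to all earlier positions in its own block of size $B$, plus the last $c$ positions of each of the $k$ most recent previous blocks. It is meant only to make the $\Theta(L\sqrt{L})$ size of the attended set concrete, not to reproduce the paper's exact pattern.

```python
import math

def sparse_index_set(i, B, k=1, c=1):
    """Indices I_i attended to by position i (0-based) under a block-wise causal pattern:
    - all earlier positions in i's own block of size B,
    - the last c positions of each of the k preceding blocks.
    This is an illustrative pattern, not the exact one from the paper."""
    block_start = (i // B) * B
    local = set(range(block_start, i + 1))                      # within-block (causal)
    summary = set()
    for b in range(max(0, i // B - k), i // B):                 # k previous blocks
        end = (b + 1) * B
        summary.update(range(max(0, end - c), end))             # last c positions of block b
    return sorted(local | summary)

# With B = Theta(sqrt(L)), each position attends to O(sqrt(L)) indices,
# so the total attention cost over L positions is O(L * sqrt(L)).
L = 64
B = int(math.isqrt(L))                                          # B = 8 = Theta(sqrt(L))
total = sum(len(sparse_index_set(i, B, k=1, c=1)) for i in range(L))
print(total, "attended pairs, vs", L * (L + 1) // 2, "for full causal attention")
```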

Linear Transformer. Another line of work proposed to accelerate the attention computation (Equation (3)) using kernel-based approximations. A representative approach is the Linear Transformer (Katharopoulos et al., 2020), which approximates $\operatorname{Attn}^{(l)}$ with the following formula:


where they choose $\phi(\mathbf{x}) = \operatorname{elu}(\mathbf{x}) + 1$. The above computation can be accelerated by rearranging the order of computation so that the intermediate results $\sum_{\mathbf{z} \in \mathcal{S}} (\mathbf{W}_{\mathrm{V}}^{(l,h)}\mathbf{z})\,\phi(\mathbf{W}_{\mathrm{K}}^{(l,h)}\mathbf{z})^{\top}$ and $\sum_{\mathbf{z} \in \mathcal{S}} \phi(\mathbf{W}_{\mathrm{K}}^{(l,h)}\mathbf{z})^{\top}$ associated with different $\mathcal{S}$ can be jointly computed using a prefix sum, finally yielding a complexity of $\Theta(MLD^{2})$, which is linear in $L$.

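A minimal single-head sketch of this prefix-sum trick (again an illustration rather than the authors' code): maintaining the running sums $\sum_{j \le i}\phi(\mathbf{k}_{j})\mathbf{v}_{j}^{\top}$ and $\sum_{j \le i}\phi(\mathbf{k}_{j})$ lets each new query be answered in $\mathrm{O}(D^{2})$ time, independent of the prefix length.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def phi(x):
    # Feature map used by the Linear Transformer: phi(x) = elu(x) + 1 (elementwise, positive).
    return elu(x) + 1.0

def causal_linear_attention(Q, K, V):
    """Single-head causal linear attention in O(n * d^2) total time.
    Q, K: (n, d_k); V: (n, d_v). Returns (n, d_v)."""
    n, d_v = V.shape
    d_k = K.shape[1]
    S = np.zeros((d_k, d_v))            # running sum of phi(k_j) v_j^T over the prefix
    z = np.zeros(d_k)                   # running sum of phi(k_j) over the prefix
    out = np.empty((n, d_v))
    for i in range(n):
        fk = phi(K[i])
        S += np.outer(fk, V[i])         # prefix-sum update: O(d_k * d_v)
        z += fk
        fq = phi(Q[i])
        out[i] = (S.T @ fq) / (z @ fq)  # numerator / normalizer, O(d_k * d_v) per position
    return out

# Toy usage: the per-step cost does not grow with the prefix length.
rng = np.random.default_rng(0)
n, d = 16, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (16, 4)
```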

4. Expressiveness of Efficient Transformers in Reasoning Tasks


Reasoning constitutes a fundamental aspect of human intelligence and plays a vital role in problem-solving, decision-making, and planning. Recently, Transformer-based LLMs have demonstrated remarkable reasoning abilities (OpenAI, 2023; Touvron et al., 2023). This has sparked a series of studies aimed at theoretically understanding how powerful these models are. In particular, Feng et al. (2023) recently revealed that autoregressive Transformers are capable of solving a general class of reasoning problems formalized as Dynamic Programming (DP). In this section, we extend this finding by investigating how things change when moving to various types of efficient Transformers.


4.1. Problem formulation


Dynamic programming decomposes a complex reasoning problem into a sequence of reasoning steps, each of which corresponds to a subproblem and is called a DP state. Different subproblems depend on each other because they can be efficiently solved based on the answers of previously solved subproblems. Formally, denoting by $\mathrm{dp}(i)$ the answer of subproblem $i$, the relation between subproblems can be characterized using a transition function:

$$\mathrm{dp}(i) = f\big(\mathrm{dp}(g_{1}(i)), \cdots, \mathrm{dp}(g_{J}(i)), s_{h_{1}(i)}, \cdots, s_{h_{K}(i)}\big), \qquad (9)$$


where $s$ is the input sequence, and $f, g_{1}, \cdots, g_{J}, h_{1}, \cdots, h_{K}$ are functions that depend on the problem. In other words, the answer of each subproblem is fully determined by the answers of a finite number of previous subproblems plus a finite number of input tokens. Based on Equation (9), we can solve all subproblems sequentially, one by one. After solving all subproblems, the final answer can be computed as $u(\operatorname{dp}(i_{N}))$, where $i_{N}$ is the last DP state and $u$ is a problem-dependent function. By defining our problem this generally, we also cover CoT problems. We assume that the functions $f, \mathbf{g}, \mathbf{h}$, and $u$ above can be approximated by an MLP with GeLU activation of constant size. We also assume that during the CoT generation process, the next state can be obtained from the current state by an MLP. One can refer to Appendix B for a formal description. We argue that these assumptions are mild and that they have been used in previous work (Feng et al., 2023).

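To make the framework concrete, here is a small illustrative example (our own, not taken from the paper): the Edit Distance DP, one of the tasks studied later, written so that each state $i = (a, b)$ reads a constant number of previous DP values and input tokens.

```python
def edit_distance_dp(s, t):
    """Edit distance between strings s and t, phrased in the dp/transition-function form:
    each state (a, b) depends on at most J = 3 previous states and K = 2 input tokens."""
    dp = {}

    def transition(a, b):
        # f(dp(g_1(i)), dp(g_2(i)), dp(g_3(i)), s_{h_1(i)}, s_{h_2(i)}) for state i = (a, b).
        if a == 0:
            return b                       # boundary states
        if b == 0:
            return a
        substitution_cost = 0 if s[a - 1] == t[b - 1] else 1
        return min(dp[(a - 1, b)] + 1,     # delete
                   dp[(a, b - 1)] + 1,     # insert
                   dp[(a - 1, b - 1)] + substitution_cost)  # match / substitute

    # Solve subproblems in an order consistent with the dependencies.
    for a in range(len(s) + 1):
        for b in range(len(t) + 1):
            dp[(a, b)] = transition(a, b)

    # Final answer: u(dp(i_N)) with u the identity and i_N = (len(s), len(t)).
    return dp[(len(s), len(t))]

print(edit_distance_dp("kitten", "sitting"))   # 3
```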

In our subsequent analysis, without loss of generality, we assume that each input element $s_{j}$ is an integer, and that each state $i$, each DP value $\mathrm{dp}(i)$, and the final answer can all be represented as vectors of integers. The domain of these integers may grow polynomially with respect to the length $L$.


Output format. Following Feng et al. (2023), given a DP task and an input sequence $s$, an autoregressive Transformer generates the answer together with all intermediate steps in the following form:


Here, the subsequence ending at the special token "|" is the input to the Transformer, and the remainder is generated autoregressively. The output at each position is split into four parts that store the input, the state, the DP value, and the final answer, respectively. We denote by $i_{1}, \cdots, i_{N}$ the sequence of DP states representing all subproblems in order. We consider the regression setting, where the output at each position is obtained from the embedding of the last Transformer layer by rounding each dimension to the nearest integer. Similarly, each generated output directly serves as the input of the next position (without using a token embedding layer).

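As an illustration of such a serialization (the paper's exact token layout is not reproduced above, so the four-part layout below is only a plausible instantiation), one can pack the input, the current state, its DP value, and a final-answer slot into a fixed-width record per position:

```python
def serialize_cot(tokens, states, dp_values, final_answer, pad=0):
    """Build the chain-of-thought sequence: input positions, a '|' separator,
    then one position per DP state carrying (input, state, dp value, answer) parts.
    The exact layout is an illustrative choice, not the paper's specification."""
    sep = "|"
    seq = []
    for tok in tokens:                                   # prompt part: the raw input
        seq.append({"input": tok, "state": pad, "dp": pad, "answer": pad})
    seq.append(sep)                                      # end of the prompt
    for n, (state, value) in enumerate(zip(states, dp_values)):
        is_last = (n == len(states) - 1)
        seq.append({
            "input": pad,
            "state": state,                              # which subproblem is being solved
            "dp": value,                                 # its DP value dp(i_n)
            "answer": final_answer if is_last else pad,  # u(dp(i_N)) only at the end
        })
    return seq

# Toy usage with the edit-distance example above (only a few states, for brevity).
tokens = list("kitten") + list("sitting")
states = [(1, 1), (1, 2), (2, 1), (2, 2)]
dp_values = [1, 2, 2, 1]
for position in serialize_cot(tokens, states, dp_values, final_answer=3):
    print(position)
```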

Log-precision Transformers. We adopt a realistic and widely-used setting in which all internal neurons of the Transformer can only store floating-point numbers of finite precision, with the number of bits bounded by $\mathrm{O}(\log L)$ for input sequences of length $L$.

