Attention Is All You Need

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.

The best performing models also connect the encoder and decoder through an attention mechanism.

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

1 Introduction

Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5].

Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.

Recurrent models typically factor computation along the symbol positions of the input and output sequences.

Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in case of the latter.

The fundamental constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 16].

In all but a few cases [22], however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.

The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.

In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.

This makes it more difficult to learn dependencies between distant positions [11].

In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 22, 23, 19].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [28].

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [14, 15] and [8].

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29].

Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn).

Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.

At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
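
As an illustration (not from the paper itself), a minimal sketch of this auto-regressive generation loop in Python; `encode`, `decode_step`, `bos_id` and `eos_id` are hypothetical placeholders for the encoder, a single decoder step and the start/end symbols, and greedy argmax decoding is used purely for simplicity.

```python
import numpy as np

def greedy_decode(encode, decode_step, src_ids, bos_id, eos_id, max_len=50):
    """Greedy auto-regressive decoding: the encoder runs once, and each new
    symbol is predicted from the encoder output z plus all previously
    generated symbols."""
    z = encode(src_ids)                        # continuous representations (z1, ..., zn)
    out = [bos_id]                             # decoder input starts from a start symbol
    for _ in range(max_len):
        logits = decode_step(z, np.array(out)) # scores for the next symbol, shape (vocab,)
        next_id = int(np.argmax(logits))       # greedily pick the most probable symbol
        out.append(next_id)
        if next_id == eos_id:                  # stop once the end symbol is produced
            break
    return out[1:]                             # generated output sequence (y1, ..., ym)
```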

3.1 Encoder and Decoder Stacks

Encoder:

The encoder is composed of a stack of N = 6 identical layers.

Each layer has two sub-layers.

The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1].

That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_{model} = 512.
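
As an illustrative sketch (not from the paper), the sub-layer wrapper LayerNorm(x + Sublayer(x)) can be written as follows; the learned gain and bias of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's d_model-dimensional vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Output of each sub-layer is LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Example: 10 positions, each with d_model = 512 features, around a dummy identity sub-layer.
x = np.random.randn(10, 512)
print(residual_sublayer(x, lambda t: t).shape)  # (10, 512)
```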

Decoder:

The decoder is also composed of a stack of N = 6 identical layers.

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.

This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2).

The input consists of queries and keys of dimension d_k, and values of dimension d_v.

We compute the dot products of the query with all keys, divide each by \sqrt{d_k}, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q.

The keys and values are also packed together into matrices K and V.

We compute the matrix of outputs as:

Attention(Q, K, V) = softmax(\dfrac{QK^T}{\sqrt{d_k}})V

Here QK^T is simply the matrix of dot products of the queries with all keys.
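
As an illustration, a minimal NumPy sketch of the formula above, without masking or batching:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # dot products of each query with all keys
    return softmax(scores, axis=-1) @ V          # weighted sum of the values

# 4 queries, 6 key-value pairs, d_k = d_v = 8.
Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```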

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention.

Dot-product attention is identical to our algorithm, except for the scaling factor of \dfrac{1}{\sqrt{d_k}}.

Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.

While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d_k [3].

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.

To counteract this effect, we scale the dot products by \dfrac{1}{\sqrt{d_k}}.
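
To see why the dot products grow with d_k: assuming, as the paper does, that the components of q and k are independent random variables with mean 0 and variance 1, their dot product q\cdot k=\sum_{i=1}^{d_k}q_i k_i has mean 0 and variance d_k, so its typical magnitude grows like \sqrt{d_k}. Scaling by \dfrac{1}{\sqrt{d_k}} restores unit variance before the softmax.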

3.2.2 Multi-Head Attention

Instead of performing a single attention function with d_{model}-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively.

On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values.

These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

With a single attention head, averaging inhibits this.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V).

Where the projections are parameter matrices W_i^Q\in\mathbb{R}^{d_{model}\times d_k}, W_i^K\in \mathbb{R}^{d_{model}\times d_k}, W_i^V\in \mathbb{R}^{d_{model}\times d_v} and W^O\in \mathbb{R}^{hd_v\times d_{model}}.

In this work we employ h = 8 parallel attention layers, or heads.

For each of these we use d_k=d_v=d_{model}/h=64.

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
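
As an illustrative sketch (not the authors' implementation), multi-head self-attention can be written as follows; the weight arrays `W_q`, `W_k`, `W_v`, `W_o` and their shapes are assumptions chosen to mirror the projections W_i^Q, W_i^K, W_i^V and W^O above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """Multi-head self-attention over X of shape (seq_len, d_model).
    W_q, W_k, W_v have shape (h, d_model, d_k); W_o has shape (h*d_v, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]   # project to d_k / d_v dimensions
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))      # scaled dot-product attention per head
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1) @ W_o        # concatenate heads, project back

# Shapes as in the paper: d_model = 512, h = 8, d_k = d_v = 64.
d_model, h, d_k = 512, 8, 64
X = np.random.randn(10, d_model)
W_q, W_k, W_v = (np.random.randn(h, d_model, d_k) for _ in range(3))
W_o = np.random.randn(h * d_k, d_model)
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (10, 512)
```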

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

This allows every position in the decoder to attend over all positions in the input sequence.

This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [31, 2, 8].

• The encoder contains self-attention layers.

In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder.

Each position in the encoder can attend to all positions in the previous layer of the encoder.

• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.

We need to prevent leftward information flow in the decoder to preserve the auto-regressive property.

We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
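
As an illustration, a minimal sketch of this masking; the function name and shapes are assumptions, but the mask follows the description above: future positions are set to −∞ before the softmax, so they receive zero weight.

```python
import numpy as np

def causal_mask_scores(Q, K):
    """Decoder self-attention scores with illegal (future) connections set to
    -inf before the softmax, so position i only attends to positions <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) raw scores
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores[future] = -np.inf                             # masked out -> zero weight after softmax
    return scores
```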

3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.

This consists of two linear transformations with a ReLU activation in between.

While the linear transformations are the same across different positions, they use different parameters from layer to layer.

Another way of describing this is as two convolutions with kernel size 1.

The dimensionality of input and output is d_{model} = 512, and the inner-layer has dimensionality d_{ff} = 2048.
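
The two transformations are FFN(x) = max(0, xW_1 + b_1)W_2 + b_2. A minimal illustrative sketch with the dimensions above (the random weights are placeholders):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# d_model = 512, d_ff = 2048, 10 positions.
d_model, d_ff = 512, 2048
x = np.random.randn(10, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```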

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_{model}.

We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.

The softmax function is also known as the normalized exponential function.

It maps arbitrary real-valued inputs to values in the range (0, 1) that sum to 1, so the outputs can be interpreted as probabilities over the classes.

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [24].

In the embedding layers, we multiply those weights by \sqrt{d_{model}}.
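
A minimal sketch of this weight sharing and scaling (the matrix `E` and its initialization are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

vocab_size, d_model = 37000, 512
E = np.random.randn(vocab_size, d_model) * 0.01   # shared weight matrix (illustrative init)

def embed(token_ids):
    # Embedding lookup, multiplied by sqrt(d_model) as described above.
    return E[np.asarray(token_ids)] * np.sqrt(d_model)

def output_logits(decoder_output):
    # The pre-softmax linear transformation reuses the same (tied) weight matrix.
    return decoder_output @ E.T
```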

3.5 Positional Encoding 

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.

To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.

The positional encodings have the same dimension d_{model} as the embeddings, so that the two can be summed.

There are many choices of positional encodings, learned and fixed [8].

In this work, we use sine and cosine functions of different frequencies:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})

PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})

where pos is the position and i is the dimension.

That is, each dimension of the positional encoding corresponds to a sinusoid.

The wavelengths form a geometric progression from 2π to 10000 · 2π.

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.

We also experimented with using learned positional embeddings [8] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)).

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
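
As an illustration, a sketch that computes the sinusoidal encodings defined above (assuming an even d_{model}):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # one frequency per pair of dimensions
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine
    return pe

print(positional_encoding(100, 512).shape)  # (100, 512), added to the input embeddings
```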

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x1, ..., xn) to another sequence of equal length (z1, ..., zn), with x_i,z_i\in \mathbb{R}^d, such as a hidden layer in a typical sequence transduction encoder or decoder.

Motivating our use of self-attention we consider three desiderata.

One is the total computational complexity per layer.

Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks.

One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.

The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [11].

Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r is the size of the neighborhood in restricted self-attention.

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [31] and byte-pair [25] representations.
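
As a rough worked example (using the per-layer costs listed in Table 1, O(n^2\cdot d) for self-attention and O(n\cdot d^2) for a recurrent layer, with an illustrative sentence length): for n = 50 word-piece tokens and d = 512, n^2\cdot d \approx 1.3\times 10^6 while n\cdot d^2 \approx 1.3\times 10^7, roughly an order of magnitude more.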

To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.

This would increase the maximum path length to O(n/r).

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions.

Kernel width of a convolutional layer

Definition

In a convolutional layer, the kernel width is one dimension of the convolution kernel.

The kernel is a small matrix (in the two-dimensional case) that slides over the input data to perform the convolution.

For example, for two-dimensional image data (say a single-channel grayscale image), if the kernel has size k\times k (here k is the kernel width and also the kernel height, since the kernel is square; a non-square kernel has different width and height values), the kernel slides over the image with a certain stride, and at each step computes the sum of the element-wise products between the kernel and the input region it covers, producing one output value.

Role

Affects the receptive field: the kernel width determines how wide a span of the input each convolution covers. A larger kernel width covers a wider input region at each step and can capture broader features, but it also increases the amount of computation and the risk of overfitting.

Controls feature extraction: different kernel widths extract features at different scales. Smaller kernels are suited to capturing local, fine-grained features, while larger kernels tend to capture more global, coarse-grained structure. In image recognition, for example, a 3×3 kernel may be used to extract local features such as edges, while a 7×7 or larger kernel may capture more global structure in the image.

Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions [15], increasing the length of the longest paths between any two positions in the network.

Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions [15], increasing the length of the longest paths between any two positions in the network.

Convolutional layers are generally more expensive than recurrent layers, by a factor of k.

Separable convolutions [6], however, decrease the complexity considerably, to O(k\cdot n\cdot d+n\cdot d^2).

Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

As a side benefit, self-attention could yield more interpretable models.

We inspect attention distributions from our models and present and discuss examples in the appendix.

Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

5 Training

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs.

Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens.

For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31].

Sentence pairs were batched together by approximate sequence length.

Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
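
As an illustration (the authors' exact batching code is not described here), a rough sketch of grouping sentence pairs by approximate length under a per-batch token budget:

```python
def batch_by_tokens(sentence_pairs, max_tokens=25000):
    """Group (src_tokens, tgt_tokens) pairs, sorted by approximate source length,
    into batches of roughly max_tokens source and target tokens each."""
    batches, batch, src_count, tgt_count = [], [], 0, 0
    for src, tgt in sorted(sentence_pairs, key=lambda p: len(p[0])):
        if batch and (src_count + len(src) > max_tokens or
                      tgt_count + len(tgt) > max_tokens):
            batches.append(batch)                 # close the current batch
            batch, src_count, tgt_count = [], 0, 0
        batch.append((src, tgt))
        src_count += len(src)
        tgt_count += len(tgt)
    if batch:
        batches.append(batch)
    return batches
```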

5.2 Hardware and Schedule

5.3 Optimizer

5.4 Regularization

6 Results

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

Recurrent layers in the encoder-decoder architecture

Definition

In an encoder-decoder architecture, a recurrent layer is a neural network layer for processing sequence data. Common recurrent layers include long short-term memory (LSTM) networks and gated recurrent units (GRU).

Taking LSTM as an example, it contains several gates (an input gate, a forget gate, an output gate, etc.) that control the flow of information. At each time step, the LSTM cell takes the current input together with the hidden state from the previous step and computes the hidden state for the current step through these gates.

A GRU is a simplified recurrent unit with only an update gate and a reset gate, which control how information is passed on and how the hidden state is updated.

Role in the encoder-decoder architecture

In the encoder

In the encoder, recurrent layers process the input sequence. In machine translation, for example, the source sentence (a sequence of words) is processed word by word; as the time steps advance, the recurrent layer keeps updating its hidden state, which gradually encodes the information in the input sequence. By the time the whole input has been processed, the encoder's final hidden state summarizes the semantics of the entire input sequence.

In the decoder

In the decoder, recurrent layers generate the target sentence from the information passed on by the encoder and from the outputs generated so far. The decoder also operates one time step at a time, generating one word (or other output unit) per step, and each step depends on the previous outputs and on the encoder's information. In translation, for instance, the decoder starts from a start symbol and generates the words of the target sentence one by one until it produces an end symbol.

Characteristics

Handles sequence order: recurrent layers handle the order information in sequence data well, because the computation at each time step depends on the state of the previous step.

Limitations on long sequences: very long sequences can lead to vanishing or exploding gradients. During backpropagation, as the number of time steps grows, gradients may become extremely small (vanishing) or extremely large (exploding), which degrades training.

Relatively low computational efficiency: because every time step must be computed sequentially, recurrent layers cannot be parallelized on the scale of convolutional or self-attention layers, so they can be slow on long sequences.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.

On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art.

In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks.

We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.

Making generation less sequential is another research goal of ours.
