Paper Walkthrough | Transformer Principles in Plain Terms

The Transformer Explained in Detail

License: CC 4.0 BY-SA

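This article takes an in-depth look at the Transformer model, covering its core mechanisms (Self-Attention, Multi-Head Attention, residual connections, and Layer Normalization) as well as its applications in NLP.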

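The Attention mechanism was proposed by Bengio's team in 2014 and has since been widely applied across deep learning. BERT, proposed by Google for producing word representations, achieved substantial improvements on 11 NLP tasks, and BERT is built on the bidirectional Transformer.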

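The Transformer is the first model that relies entirely on Self-Attention to compute representations of its input and output, without using sequence-aligned RNNs or CNNs. More precisely, the Transformer is built solely from Self-Attention and Feed Forward Neural Networks. A trainable Transformer-based network is constructed by stacking Transformer layers. In the authors' experiments the encoder and decoder each have 6 layers, giving a 12-layer Encoder-Decoder that set a new best BLEU score in machine translation.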

Transformer Architecture

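Let us walk through the architecture diagram above. The Transformer also uses the classic Encoder-Decoder architecture.

Each Encoder layer consists of Multi-Head Self-Attention and a position-wise feed-forward network. The Encoder input is the sum of the Input Embedding and the Positional Embedding.

Each Decoder layer consists of Masked Multi-Head Self-Attention, Multi-Head Context-Attention (attention over the Encoder output), and a position-wise feed-forward network. The initial Decoder input is the sum of the Output Embedding and the Positional Embedding.

The block marked Nx on the left half of the diagram is one Encoder layer; the Transformer has 6 Encoder layers.

The block marked Nx on the right half of the diagram is one Decoder layer; the Transformer has 6 Decoder layers.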

Encoder

The Encoder consists of 6 identical layers, each with 2 parts:

Multi-Head Self-Attention
Position-Wise Feed-Forward Network (a fully connected layer)

Both parts are wrapped in a residual connection, followed by Layer Normalization.
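
To make the "residual connection, then Layer Normalization" pattern concrete, here is a minimal NumPy sketch. The helper names `layer_norm` and `add_and_norm`, the tensor sizes, and the linear map standing in for the sub-layer are all assumptions made for illustration, and the learnable gain and bias of Layer Normalization are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # The learnable gain and bias are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection: add the sub-layer output back onto its input,
    # then apply Layer Normalization to the sum.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 positions, model dimension 8 (hypothetical sizes)
w = rng.normal(size=(8, 8))   # hypothetical weights of a stand-in sub-layer

# A plain linear map stands in for Multi-Head Self-Attention or the feed-forward network.
out = add_and_norm(x, lambda h: h @ w)
```

The output keeps the input shape, which is what allows identical layers to be stacked.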

The Encoder input is the sum of the Input Embedding and the Positional Embedding.
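
The sum of the two embeddings can be sketched as follows. This assumes the fixed sinusoidal positional encoding described in the original paper; the article itself does not say which positional scheme it has in mind. The vocabulary size, model dimension, token ids, and the random table standing in for learned word embeddings are hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # One row per position, one column per feature; d_model is assumed to be even.
    pos = np.arange(seq_len)[:, None]
    even_idx = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, even_idx / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even feature columns
    pe[:, 1::2] = np.cos(angles)   # odd feature columns
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16                              # hypothetical sizes
embedding_table = rng.normal(size=(vocab_size, d_model))   # stands in for learned embeddings
token_ids = np.array([5, 42, 7, 99])                       # hypothetical ids for a 4-word input

input_embedding = embedding_table[token_ids]
# The Encoder input is the element-wise sum of the two.
encoder_input = input_embedding + sinusoidal_positions(len(token_ids), d_model)
```

Because the two terms are added rather than concatenated, the positional information does not change the model dimension.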

If you are new to the Transformer, you may be wondering:

What is Multi-Head Self-Attention?
What is a residual connection?
What is Layer Normalization?

Each of these is answered later on, so read on.

Decoder

Like the Encoder, the Decoder consists of 6 identical layers, each with 3 parts:

Multi-Head Self-Attention
Multi-Head Context-Attention
Position-Wise Feed-Forward Network

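All three parts are wrapped in a residual connection, followed by Layer Normalization.

Compared with the Encoder, the Decoder adds Multi-Head Context-Attention. Once Multi-Head Self-Attention is understood, this part is easy to follow. Both kinds of Attention are discussed below.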

The Self-Attention Mechanism

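Two kinds of Attention are in common use: Additive Attention and Dot-product Attention. The paper uses Dot-product Attention, which is faster and more space-efficient than Additive Attention.

Self-Attention is the core of the Transformer, yet the authors do not explain it in much detail.

Take the sentence below as the input to be translated and consider how Attention represents it.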

The animal didn’t cross the street because it was too tired

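Consider the question: what does “it” refer to, “street” or “animal”? A human reader easily sees that it is “animal”, but for an algorithm this is far less obvious.

When the model processes the word “it”, Attention lets it associate “it” with “animal”. At each position, Attention assigns different weights to the other positions so that the word at the current position is encoded better. If you are familiar with RNNs, this plays a role similar to the way an RNN uses its previous hidden state to encode the current word.

That is, when encoding “it”, part of the Attention focuses on “the animal” and merges its representation into the encoding of “it”.

An RNN has to recurse step by step to gather global information, so a bidirectional RNN is usually needed for good results, and each time step depends on the preceding ones. A CNN only captures local information and enlarges its receptive field by stacking layers. Attention takes the most direct approach: it obtains global information in a single step.

The Transformer uses Self-Attention. Put simply: the similarity between Q and K determines how V is selected!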

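Self-Attention has several advantages:

Low per-layer complexity:
- When the sequence length n is smaller than the representation dimension d, Self-Attention has the lower per-layer time complexity.
- For large n, the authors also offer a remedy: instead of attending to every word, each word attends only to a restricted set of r words.
Parallelism: like a CNN, Multi-Head Attention does not depend on the computation of the previous time step, so it parallelizes well and beats an RNN in this respect.
Long-range dependencies: because Self-Attention computes Attention between every word and all other words, the maximum path length between any two positions is 1 no matter how far apart they are, which lets the model capture long-range dependencies.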
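
The complexity point can be checked with a little arithmetic. The sketch below uses the per-layer orders of growth given in the original paper (n²·d for Self-Attention, n·d² for a recurrent layer, r·n·d for restricted Self-Attention) and ignores constant factors; the function names and the sample values of n, d, and r are assumptions chosen for illustration.

```python
def self_attention_cost(n, d):
    # Every position attends to every other position: n * n dot products of size d.
    return n * n * d

def recurrent_cost(n, d):
    # One d x d matrix-vector product per time step, over n steps.
    return n * d * d

def restricted_attention_cost(n, d, r):
    # Each position attends only to r positions instead of all n.
    return r * n * d

# Short sequence: n = 50 words, d = 512 features, so n < d.
short_attn = self_attention_cost(50, 512)
short_rnn = recurrent_cost(50, 512)

# Long sequence: n = 2000, with attention restricted to r = 100 positions.
long_attn = self_attention_cost(2000, 512)
long_rnn = recurrent_cost(2000, 512)
long_restricted = restricted_attention_cost(2000, 512, 100)
```

With these numbers Self-Attention is cheaper than the recurrent layer for the short sequence, more expensive for the long one, and the restricted variant brings the long-sequence cost back below both.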

As noted above, the Decoder has two kinds of Attention: Self-Attention and Context-Attention.

Context-Attention is the Attention between the Encoder and the Decoder, also called Encoder-Decoder Attention.

For both Self-Attention and Context-Attention, there are many options for computing the Attention scores:

additive attention
location-based
general
dot-product
scaled dot-product

Which one does the Transformer use? The answer: scaled dot-product attention.
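
For reference, scaled dot-product attention as defined in the original paper can be written as below, where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the keys.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The only difference from plain dot-product attention is the division by \sqrt{d_k}.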

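Why scale by 1/√d_k? The paper explains: when d_k is small, additive attention and dot-product attention perform similarly, but when d_k is large the dot products grow large in magnitude, and without scaling, dot-product attention performs worse than additive attention. In addition, overly large dot products push softmax into regions where its gradient is very small, which hinders backpropagation. Scaling the dot products counteracts this.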
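
The effect of the scaling can be seen in a small NumPy experiment. This is a minimal sketch, not the paper's implementation: the function names, the random inputs, and the sizes n = 6 and d_k = 512 are assumptions, and the learned projection matrices are left out.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # similarity of every query with every key
    weights = softmax(scores)         # each row sums to 1
    return weights @ v, weights       # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k = 6, 512                       # hypothetical sizes
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
v = rng.normal(size=(n, d_k))

out, weights = scaled_dot_product_attention(q, k, v)

# Same scores without the 1/sqrt(d_k) factor, for comparison.
unscaled_weights = softmax(q @ k.T)

print(weights.max(axis=-1))           # largest weight per row, with scaling
print(unscaled_weights.max(axis=-1))  # largest weight per row, without scaling
```

Without scaling, each row of weights is pushed much closer to a one-hot vector, which is the saturated softmax regime where gradients become very small.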

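First, a brief note on what Q, K, and V are:

In Encoder Self-Attention, Q, K, and V all come from the same place (they are equal): the output of the previous Encoder layer. For the first Encoder layer, they are the input obtained by summing the Word Embedding and the Positional Embedding.
In Decoder Self-Attention, Q, K, and V also all come from the same place (they are equal): the output of the previous Decoder layer. For the first Decoder layer, they are the input obtained by summing the Word Embedding and the Positional Embedding. The Decoder, however, must not see the next time step (information from the future), so Sequence masking is required.
In Encoder-Decoder Attention, Q comes from the output of the previous Decoder layer, while K and V come from the Encoder output; K and V are identical.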
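
The two Decoder cases can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: a single head, no learned projection matrices (so Q, K, and V are used exactly as described above), random arrays standing in for real layer outputs, and a lower-triangular boolean matrix as one common way to implement Sequence masking.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Positions where mask is False get a very negative score,
        # so softmax assigns them a weight of zero.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(5, 8))   # stand-in Encoder output: 5 source positions
dec_in = rng.normal(size=(3, 8))    # stand-in previous Decoder output: 3 target positions

# Decoder Self-Attention: Q = K = V, and position i may only see positions <= i.
causal_mask = np.tril(np.ones((3, 3), dtype=bool))
self_out, self_weights = attention(dec_in, dec_in, dec_in, mask=causal_mask)

# Encoder-Decoder Attention: Q from the Decoder, K and V from the Encoder, no mask.
cross_out, cross_weights = attention(dec_in, enc_out, enc_out)
```

In `self_weights` everything above the diagonal is zero, so no target position draws on later positions, while `cross_weights` has one row per target position and one column per source position.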