On Layer Normalization in the Transformer Architecture
Ruibin Xiong$^{\dagger * 1 2}$ Yunchang Yang$^{* 3}$ Di He$^{4 5}$ Kai Zheng$^{4}$ Shuxin Zheng$^{5}$ Chen Xing$^{6}$ Huishuai Zhang$^{5}$ Yanyan Lan$^{1 2}$ Liwei Wang$^{4 3}$ Tie-Yan Liu$^{5}$
Abstract
The Transformer is widely used in natural language processing tasks. However, to train a Transformer, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down the optimization and brings more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
1. Introduction
The Transformer (Vaswani et al., 2017) is one of the most commonly used neural network architectures in natural language processing. Layer normalization (Lei Ba et al., 2016) plays a key role in Transformer’s success. The originally designed Transformer places the layer normalization between the residual blocks, which is usually referred to as the Transformer with Post-Layer Normalization (Post-LN) (Wang et al., 2019). This architecture has achieved state-of-the-art performance in many tasks including language modeling (Dai et al., 2019; Al-Rfou et al., 2018) and machine translation (Dehghani et al., 2018; Edunov et al., 2018). Unsupervised pre-trained models based on the Post-LN Transformer architecture also show impressive performance in many downstream tasks (Radford et al., 2019; Devlin et al., 2018; Yang et al., 2019b).
Despite its great success, people usually need to deal with the optimization of the Post-LN Transformer more carefully than that of convolutional networks or other sequence-to-sequence models (Popel & Bojar, 2018). In particular, to train the model from scratch, any gradient-based optimization approach requires a learning rate warm-up stage (Vaswani et al., 2017; Liu et al., 2019a): the optimization starts with an extremely small learning rate, and then gradually increases it to a pre-defined maximum value over a pre-defined number of iterations. Such a warm-up stage not only slows down the optimization process but also brings more hyper-parameter tuning. Popel & Bojar (2018) have shown that the final model performance is quite sensitive to the value of the maximum learning rate and the number of warm-up iterations. Tuning such sensitive hyper-parameters is costly when training large-scale models, e.g., BERT (Devlin et al., 2018) or XLNet (Yang et al., 2019b).
In this paper, we try to alleviate this problem by finding ways to safely remove the learning rate warm-up stage. As the warm-up stage happens in the first several iterations, we investigate the optimization behavior at initialization using mean field theory (Lee et al., 2017; Xiao et al., 2018; Yang et al., 2019a; Yang, 2019; Lee et al., 2019; Zhang et al., 2019). According to our theoretical analysis, when putting the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large.
$^{*}$Equal contribution. $^{\dagger}$Work done while interning at Microsoft Research Asia. $^{1}$CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences $^{2}$University of Chinese Academy of Sciences $^{3}$Center for Data Science, Peking University, Beijing Institute of Big Data Research $^{4}$Key Laboratory of Machine Perception, MOE, School of EECS, Peking University $^{5}$Microsoft Research $^{6}$College of Computer Science, Nankai University. Correspondence to: Shuxin Zheng <shuxin.zheng@microsoft.com>, Di He <dihe@microsoft.com>.
Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).
Figure 1. (a) Post-LN Transformer layer; (b) Pre-LN Transformer layer.
Therefore, without the warm-up stage, directly using a large learning rate on those parameters can make the optimization process unstable. Using a warm-up stage and training the model with small learning rates practically avoids this problem. Extensive experiments are provided to support our theoretical findings.
Our theory also shows that the layer normalization plays a crucial role in controlling the gradient scales. This motivates us to investigate whether there are other ways of positioning the layer normalization that lead to well-behaved gradients. In particular, we study another variant, the Transformer with Pre-Layer Normalization (Pre-LN) (Baevski & Auli, 2018; Child et al., 2019; Wang et al., 2019). The Pre-LN Transformer puts the layer normalization inside the residual connection and is equipped with an additional final-layer normalization before prediction (please see Figure 1 for the differences between the two variants of the Transformer architecture). We show, both theoretically and empirically, that at initialization the gradients of the Pre-LN Transformer are well-behaved without any exploding or vanishing.
Given that the gradients are well-behaved in the Pre-LN Transformer, it is natural to consider removing the learning rate warm-up stage during training. We conduct a variety of experiments, including IWSLT14 German-English translation, WMT14 English-German translation, and BERT pre-training tasks. We show that, in all tasks, the learning rate warm-up stage can be safely removed, and thus the number of hyper-parameters is reduced. Furthermore, we observe that the loss decays faster for the Pre-LN Transformer model. It can achieve comparable final performance while using much less training time. This is particularly important for training large-scale models on large-scale datasets.
Our contributions are summarized as follows:
- We investigate two Transformer variants, the Post-LN Transformer and the Pre-LN Transformer, using mean field theory. By studying the gradients at initialization, we provide evidence to show why the learning rate warm-up stage is essential in training the Post-LN Transformer.

- We are the first to show that the learning-rate warm-up stage can be removed for the Pre-LN Transformer, which eases the hyper-parameter tuning. We further show that by using proper learning rate schedulers, the training time can be largely reduced on a wide range of applications.
2. Related work
Gradient descent-based methods (Kingma & Ba, 2014; Zeiler, 2012; Duchi et al., 2011; Tieleman & Hinton, 2012) are popularly used in optimizing deep neural networks. For convolutional neural networks and recurrent neural networks, a relatively large learning rate is usually set in the beginning, and then decreased along with the optimization process (He et al., 2016; 2017; Sutskever et al., 2014; Gehring et al., 2017; He et al., 2019). The learning rate warm-up stage has only been shown essential in dealing with some very specific problems, e.g., the large-batch training. Goyal et al. (2017); He et al. (2019); You et al. (2018) showed that a learning rate warm-up stage is preferred when training neural networks with extremely large batch sizes.
However, the learning rate warm-up stage is essential and critical when optimizing the Transformer models in a majority of scenarios (Vaswani et al., 2017; Devlin et al., 2018; Dai et al., 2019; Radford et al., 2019; Lu et al., 2019). Popel & Bojar (2018) investigated the influence of different warmup strategies for the optimization of the Post-LN Transformer model and found that without or with relatively less warm-up iterations, the optimization diverges. The Pre-LN Transformer has been proposed in several recent works (Baevski & Auli, 2018; Child et al., 2019; Wang et al., 2019) to alleviate some optimization issues when training deeper models, but the troublesome warm-up stage still remains in their training pipelines.
Liu et al. (2019a) claimed that the benefit of the warm-up stage comes from reducing the variance of the adaptive learning rate in the Adam optimizer (Kingma & Ba, 2014). They proposed to rectify the variance of the adaptive learning rate with a new variant of Adam called RAdam. However, we find that the learning rate warm-up stage also helps quite a lot for optimizers other than Adam. This may indicate that Adam is not the prerequisite for the necessity of the warm-up stage. In a concurrent and independent work, Nguyen & Salazar (2019) also empirically observed that the Pre-LN Transformer can be trained without a learning rate warm-up stage. Our work provides a more comprehensive study regarding this, together with a theoretical analysis.
3. Optimization for the Transformer
3.1. Transformer with Post-Layer Normalization
The Transformer architecture usually consists of stacked Transformer layers (Vaswani et al., 2017; Devlin et al., 2018), each of which takes a sequence of vectors as input and outputs a new sequence of vectors with the same shape. A Transformer layer has two sub-layers: the (multi-head) self-attention sub-layer and the position-wise feed-forward network sub-layer. Residual connection (He et al., 2016) and layer normalization (Lei Ba et al., 2016) are applied for both sub-layers individually. We first introduce each component of the Transformer layer and then present the entire architecture.
Self-attention sub-layer An attention function can be formulated as querying an entry with key-value pairs (Vaswani et al., 2017). The self-attention sub-layer uses scaled dot-product attention, which is defined as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$, where $d$ is the dimensionality of the hidden representations, and $Q$ (Query), $K$ (Key), $V$ (Value) are specified as the hidden representations of the previous layer. The multi-head variant of the self-attention sub-layer is popularly used, as it allows the model to jointly attend to information from different representation sub-spaces; it is defined as

$$\mathrm{Multi\text{-}head}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \cdots, \mathrm{head}_{H})W^{O}$$

$$\mathrm{head}_{k} = \mathrm{Attention}(QW_{k}^{Q}, KW_{k}^{K}, VW_{k}^{V}),$$

where $W_{k}^{Q} \in \mathbb{R}^{d \times d_{K}}$, $W_{k}^{K} \in \mathbb{R}^{d \times d_{K}}$, $W_{k}^{V} \in \mathbb{R}^{d \times d_{V}}$, and $W^{O} \in \mathbb{R}^{Hd_{V} \times d}$ are projection parameter matrices, and $H$ is the number of heads. $d_{K}$ and $d_{V}$ are the dimensionalities of Key and Value. Without any confusion, given a sequence of vectors $(x_{1}, \ldots, x_{n})$, we use $\mathrm{MultiHeadAtt}(x_{i}, [x_{1}, x_{2}, \cdots, x_{n}])$ to denote the multi-head self-attention mechanism at position $i$, which considers the attention from $x_{i}$ to the entire sequence, i.e., $\mathrm{MultiHeadAtt}(x_{i}, [x_{1}, x_{2}, \cdots, x_{n}]) = \mathrm{Multi\text{-}head}(x_{i}, [x_{1}, \ldots, x_{n}], [x_{1}, \ldots, x_{n}])$.
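To make the definitions above concrete, below is a minimal NumPy sketch of scaled dot-product attention and its multi-head variant. The shapes follow the formulas in the text; the sizes $n$, $d$, $H$, $d_K$, $d_V$ and the random parameter values are placeholders, not settings from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, scaling by the query/key dimension
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, W_Q, W_K, W_V, W_O):
    # Concatenate the H heads and project back to dimension d with W_O.
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

n, d, H, d_K, d_V = 5, 16, 4, 4, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))                      # hidden representations of the previous layer
W_Q = [rng.standard_normal((d, d_K)) for _ in range(H)]
W_K = [rng.standard_normal((d, d_K)) for _ in range(H)]
W_V = [rng.standard_normal((d, d_V)) for _ in range(H)]
W_O = rng.standard_normal((H * d_V, d))
print(multi_head(X, W_Q, W_K, W_V, W_O).shape)       # (5, 16): same shape as the input sequence
```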
Position-wise FFN sub-layer In addition to the self-attention sub-layer, each Transformer layer contains a fully connected network, which is applied to each position separately and identically. This sub-layer is a two-layer feed-forward network with a ReLU activation function. Given a sequence of vectors $h_{1}, \ldots, h_{n}$, the computation of a position-wise FFN sub-layer on any $h_{i}$ is defined as:

$$\mathrm{FFN}(h_{i}) = \mathrm{ReLU}(h_{i}W^{1} + b^{1})W^{2} + b^{2},$$

where $W^{1}$, $W^{2}$, $b^{1}$ and $b^{2}$ are parameters.
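As a companion to the formula above, here is a minimal NumPy sketch of the position-wise FFN; since the same parameters are applied to every position, a whole sequence can be processed as one matrix. The sizes and values are placeholders.

```python
import numpy as np

def ffn(h, W1, b1, W2, b2):
    # FFN(h_i) = ReLU(h_i W^1 + b^1) W^2 + b^2, applied row-wise to a sequence.
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2

d, d_ff, n = 16, 64, 5
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)
H = rng.standard_normal((n, d))          # h_1, ..., h_n stacked as rows
print(ffn(H, W1, b1, W2, b2).shape)      # (5, 16)
```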
Residual connection and layer normalization Besides the two sub-layers described above, the residual connection and layer normalization are also key components of the Transformer. For any vector $v$, the layer normalization is computed as $\mathrm{LayerNorm}(v) = \gamma\frac{v - \mu}{\sigma} + \beta$, in which $\mu, \sigma$ are the mean and standard deviation of the elements in $v$, i.e., $\mu = \frac{1}{d}\sum_{k=1}^{d} v_{k}$ and $\sigma^{2} = \frac{1}{d}\sum_{k=1}^{d}(v_{k} - \mu)^{2}$. Scale $\gamma$ and bias vector $\beta$ are parameters.
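The following minimal NumPy sketch implements the layer normalization formula above for a single vector $v$ (no numerical-stability epsilon, matching the formula as written); with $\gamma = 1$ and $\beta = 0$ the output has zero mean and unit standard deviation.

```python
import numpy as np

def layer_norm(v, gamma, beta):
    # LayerNorm(v) = gamma * (v - mu) / sigma + beta, with mu, sigma over the d entries of v.
    mu = v.mean()
    sigma = np.sqrt(((v - mu) ** 2).mean())
    return gamma * (v - mu) / sigma + beta

d = 16
rng = np.random.default_rng(0)
v = rng.standard_normal(d)
out = layer_norm(v, gamma=np.ones(d), beta=np.zeros(d))
print(round(out.mean(), 6), round(out.std(), 6))   # ~0.0 and ~1.0
```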
Different orders of the sub-layers, residual connection and layer normalization in a Transformer layer lead to variants of Transformer architectures. One of the original and most popularly used architectures for the Transformer and BERT (Vaswani et al., 2017; Devlin et al., 2018) follows "self-attention (FFN) sub-layer $\rightarrow$ residual connection $\rightarrow$ layer normalization", which we call the Transformer with Post-Layer Normalization (Post-LN Transformer), as illustrated in Figure 1.

Post-LN Transformer Denote $x_{l,i}$ as the input of the $l$-th Transformer layer at position $i$, where $x_{l,i}$ is a real-valued vector of dimension $d$, $i = 1, 2, \ldots, n$, $l = 1, 2, \ldots, L$. $n$ is the length of the sequence and $L$ is the number of layers. For completeness, we define $x_{0,i}$ as the input embedding at position $i$, which is usually a combination of word embedding and positional embedding. The computations inside the $l$-th layer are composed of several steps, and we use super-scripts on $x$ to denote the input (output) of different steps, as in Table 1 (left), where $W^{1,l}$, $W^{2,l}$, $b^{1,l}$ and $b^{2,l}$ are parameters of the FFN sub-layer in the $l$-th layer.
3.2. The learning rate warm-up stage

We are interested in the learning rate warm-up stage in the optimization of the Post-LN Transformer. Different from the optimization of many other architectures in which the learning rate starts from a relatively large value and then decays (Bahdanau et al., 2017; Dauphin et al., 2017), a learning rate warm-up stage for the Post-LN Transformer seems critical (Popel & Bojar, 2018). We denote the learning rate of the $t$-th iteration as $\mathrm{lr}(t)$ and the maximum learning rate during training as $\mathrm{lr}_{\max}$. Given a predefined time frame $T_{\mathrm{warmup}}$, the learning rate scheduler for the first $T_{\mathrm{warmup}}$ iterations (Vaswani et al., 2018) is defined as
$$\mathrm{lr}(t) = \frac{t}{T_{\mathrm{warmup}}}\,\mathrm{lr}_{\max}, \quad t \leq T_{\mathrm{warmup}}. \tag{1}$$

After this warm-up stage, the learning rate will be set by classical learning rate schedulers, such as the linear decay, the inverse square-root decay, or forced decay at particular iterations. We conduct experiments to show that this learning rate warm-up stage is essential for training Post-LN Transformer models.
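A minimal sketch of how the warm-up in Eq. (1) is typically combined with the inverse square-root decay mentioned above; $\mathrm{lr}_{\max} = 5e^{-4}$ and $T_{\mathrm{warmup}} = 4000$ are just the example values used later in this section, and the post-warm-up branch is one common choice rather than the only one.

```python
def lr_schedule(t, lr_max=5e-4, T_warmup=4000):
    if t <= T_warmup:
        return lr_max * t / T_warmup          # linear warm-up, Eq. (1)
    return lr_max * (T_warmup / t) ** 0.5     # inverse square-root decay afterwards

for t in (1, 1000, 4000, 16000, 64000):
    print(t, round(lr_schedule(t), 7))
```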
Experimental setting We conduct experiments on the IWSLT14 German-to-English (De-En) machine translation task. We mainly investigate two aspects: whether the learning rate warm-up stage is essential and whether the final model performance is sensitive to the value of $T_{\mathrm{warmup}}$. To study the first aspect, we train the model with the Adam optimizer (Kingma & Ba, 2014) and the vanilla SGD optimizer (Ruder, 2016) respectively. For both optimizers, we check whether the warm-up stage can be removed. We follow Vaswani et al. (2017) and set the hyper-parameter $\beta$ to $(0.9, 0.98)$ in Adam. We also test different values of $\mathrm{lr}_{\max}$ for both optimizers. For Adam, we set $\mathrm{lr}_{\max} = 5e^{-4}$ or $1e^{-3}$, and for SGD, we set $\mathrm{lr}_{\max} = 5e^{-3}$ or $1e^{-3}$. When the warm-up stage is used, we set $T_{\mathrm{warmup}} = 4000$ as suggested by the original paper (Vaswani et al., 2017). To study the second aspect, we set $T_{\mathrm{warmup}}$ to be 1/500/4000 ("1" refers to the no warm-up setting) and use $\mathrm{lr}_{\max} = 5e^{-4}$ or $1e^{-3}$ with Adam. For all experiments, the same inverse square-root learning rate scheduler is used after the warm-up stage. We use both validation loss and BLEU (Papineni et al., 2002) as the evaluation measures of the model performance.
Table 1. Post-LN Transformer v.s. Pre-LN Transformer
Figure 2. Performances of the models optimized by Adam and SGD on the IWSLT14 De-En task.
Results and discussions We record the model checkpoints for every epoch during training and calculate the validation loss and BLEU score. The performance of the models is plotted in Figure 2(a) and Figure 2(b). The x-axis is the epoch number and the y-axis is the BLEU score/validation loss. "w/o warm-up" indicates "without the warm-up stage" while "w/ warm-up" indicates "with the warm-up stage".

First, we can see that for both optimizers, the learning rate warm-up stage is essential. Without the warm-up stage, the BLEU score of the model trained with the Adam optimizer can only reach 8.45. As a comparison, the model trained using the warm-up stage can achieve around 34 in terms of BLEU score. The same trend can also be observed on the validation loss curves. Although the performance of the model trained with SGD is significantly worse than that of Adam, we can still see similar phenomena. Without the warm-up stage, the BLEU score is still just above zero after 15 epochs.

Second, we can see that the optimization process is sensitive to the value of $T_{\mathrm{warmup}}$, which means that $T_{\mathrm{warmup}}$ is an important hyper-parameter in training the Post-LN Transformer. For example, when setting $T_{\mathrm{warmup}} = 500$, the models trained with Adam achieve only 31.16 and 2.77 in terms of BLEU score for $\mathrm{lr}_{\max} = 5e^{-4}$ and $1e^{-3}$ respectively.
Such a warm-up stage has several disadvantages. First, its configuration significantly affects the final performance. Practitioners need careful hyper-parameter tuning, which is computationally expensive for large-scale NLP tasks. Second, the warm-up stage could slow down the optimization. Standard optimization algorithms usually start with a large learning rate for fast convergence. However, when using the warm-up stage, the learning rate has to gradually increase from zero, which may make the training inefficient. Liu et al. (2019a) suggest that the warm-up stage plays a role in reducing the undesirably large variance of Adam in the early stage of model training. However, according to our results, the warm-up stage also helps the training of SGD. This suggests that the benefit of the warm-up stage may not be specific to a particular optimizer.
3.3. Understanding the Transformer at initialization
We can see that the Post-LN Transformer cannot be trained with a large learning rate from scratch. This motivates us to investigate what happens at the model initialization. We first introduce the parameter initialization setting for our theoretical analysis and then present our theoretical findings.
Notations We denote $\mathcal{L}(\cdot)$ as the loss function of one position, $\widetilde{\mathcal{L}}(\cdot)$ as the loss function of the whole sequence, $\|\cdot\|_{2}$ and $\|\cdot\|_{F}$ as the $l_{2}$ norm (spectral norm) and the Frobenius norm, $\mathrm{LN}(x)$ as the standard layer normalization with scale $\gamma = 1$ and bias $\beta = 0$, and $\mathbf{J}_{LN}(x) = \frac{\partial \mathrm{LN}(x)}{\partial x}$ as the Jacobian matrix of $\mathrm{LN}(x)$. Let $\mathcal{O}(\cdot)$ denote the standard Big-O notation that suppresses multiplicative constants.
Parameter Initialization The parameter matrices in each Transformer layer are usually initialized by the Xavier initialization (Glorot & Bengio, 2010). Given a matrix of size $n_{in} \times n_{out}$, the Xavier initialization sets the value of each element by independently sampling from the Gaussian distribution $N(0, \frac{2}{n_{in} + n_{out}})$. The bias vectors are usually initialized as zero vectors. The scale $\gamma$ in the layer normalization is set to one.
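A one-function sketch of the Xavier initialization described above; the 512 x 512 size is an arbitrary example, and the print compares the empirical standard deviation with the target $\sqrt{2/(n_{in} + n_{out})}$.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Each entry is drawn i.i.d. from N(0, 2 / (n_in + n_out)).
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

rng = np.random.default_rng(0)
W = xavier_init(512, 512, rng)
print(round(W.std(), 4), round(np.sqrt(2.0 / 1024), 4))   # empirical vs. target std
```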
For theoretical analysis, we study a simpler setting. First, we focus on single-head attention instead of the multi-head variant, and for all layers, we set the shape of $W^{Q,l}$, $W^{K,l}$, $W^{V,l}$, $W^{1,l}$, $W^{2,l}$ to be $d \times d$. Second, we initialize the parameter matrices in the self-attention sub-layer, $W^{Q,l}$ and $W^{K,l}$, to be zero matrices. In this setting, the attention is a uniform distribution at initialization and $\mathrm{MultiHeadAtt}(x_{l,i}^{1}, [x_{l,1}^{1}, x_{l,2}^{1}, \cdots, x_{l,n}^{1}])$ can be simplified as $\frac{1}{n}\sum_{j=1}^{n} x_{l,j}W^{V,l}$. Third, we assume the input vectors are also sampled from the same Gaussian distribution. This is reasonable since the inputs are linear combinations of word embeddings and learnable positional embeddings, both of which are initialized by Gaussian distributions.
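A quick numerical check of the simplification above: with $W^{Q,l} = W^{K,l} = 0$, all attention scores are zero, the softmax is uniform, and single-head attention at every position reduces to $\frac{1}{n}\sum_{j=1}^{n} x_{j}W^{V}$. The sizes and random values are placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d = 6, 16
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W_Q, W_K = np.zeros((d, d)), np.zeros((d, d))    # zero-initialized query/key projections
W_V = rng.standard_normal((d, d))

scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)    # all zeros
attn_out = softmax(scores) @ (X @ W_V)           # uniform attention weights
uniform_avg = X.mean(axis=0) @ W_V               # (1/n) * sum_j x_j W^V
print(np.allclose(attn_out, uniform_avg))        # True: every position gets the same output
```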
Post-LN Transformer v.s. Pre-LN Transformer We compare the Post-LN Transformer with another variant of the Transformer architecture, the Transformer with Pre-Layer Normalization (Pre-LN). The Pre-LN Transformer was implemented in several systems (Vaswani et al., 2018; Klein et al., 2018; Liu et al., 2019b). Wang et al. (2019) suggested that the Pre-LN Transformer outperforms the Post-LN Transformer when the number of layers increases. Different from the Post-LN Transformer that puts the layer normalization between the residual blocks, the Pre-LN Transformer puts the layer normalization inside the residual connection and places it before all other non-linear transformations. Additionally, the Pre-LN Transformer uses a final layer normalization right before the prediction. We provide the mathematical formulations and visualizations of the Post-LN/Pre-LN Transformer in Table 1 and Figure 1.
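To make the structural difference concrete, here is a minimal PyTorch sketch of one layer in each variant, following the ordering described above and in Figure 1 (the Pre-LN model's extra final layer normalization before prediction is not shown). It is an illustrative sketch with placeholder sizes, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d=512, n_heads=8, d_ff=2048, pre_ln=False):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        if self.pre_ln:
            # Pre-LN: LayerNorm inside the residual branch, before attention / FFN.
            h = self.ln1(x)
            x = x + self.attn(h, h, h)[0]
            x = x + self.ffn(self.ln2(x))
        else:
            # Post-LN: LayerNorm after each residual addition (between residual blocks).
            x = self.ln1(x + self.attn(x, x, x)[0])
            x = self.ln2(x + self.ffn(x))
        return x

x = torch.randn(2, 10, 512)   # (batch, sequence length, d)
print(TransformerLayer(pre_ln=False)(x).shape, TransformerLayer(pre_ln=True)(x).shape)
```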
For both architectures, each $x_{L,i}$ passes through a softmax layer to produce a distribution over the dictionary $V$. The loss function is defined on the softmax distribution. For example, in sequence prediction, the loss function is defined as $\mathcal{L}(x_{L+1,i}^{post}) = -\log(\mathrm{softmax}_{y_{i}}(W^{emb}x_{L+1,i}^{post}))$ for the Post-LN Transformer and $\mathcal{L}(x_{Final,i}^{pre}) = -\log(\mathrm{softmax}_{y_{i}}(W^{emb}x_{Final,i}^{pre}))$ for the Pre-LN Transformer, where $\mathrm{softmax}_{y_{i}}$ is the probability of the ground truth token $y_{i}$ outputted by the softmax distribution and $W^{emb}$ is the word embedding matrix. The loss of the whole sequence is an average of the loss at each position. Without loss of generality, we assume that all the derivatives are bounded. We introduce the following concentration property of random variables, which will be further used in the theorem.
Definition 1. A random variable $Z \geq 0$ is called $(\epsilon, \delta)$-bounded if with probability at least $1 - \delta$, $\frac{Z - \mathbb{E}Z}{\mathbb{E}Z} \leq \epsilon$, where $\epsilon > 0$ and $0 < \delta < 1$.
Intuitively, if the random variable $Z$ is $(\epsilon, \delta)$-bounded, then with high probability its realization will not get too far away from its expectation. For example, if $Y$ is a $d$-dimensional standard Gaussian random vector, then $Z = \|Y\|_{2}^{2}$ is $(\epsilon, \delta)$-bounded with $\delta = \exp(-d\epsilon^{2}/8)$, $0 < \epsilon < 1$ (see the supplementary material for details). As the parameter matrices in the self-attention sub-layers and the FFN sub-layers are initialized by Gaussian distributions, if the norm of the hidden states in the Transformer satisfies the concentration condition above, we have the following theorem to characterize the scale of the gradients.
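A quick Monte Carlo illustration (ours, not from the paper) of the Gaussian example above: for $Y \sim N(0, \mathbf{I}_{d})$, $Z = \|Y\|_{2}^{2}$ has expectation $d$, and the empirical tail probability $P\big(\frac{Z - \mathbb{E}Z}{\mathbb{E}Z} > \epsilon\big)$ stays below the stated bound $\exp(-d\epsilon^{2}/8)$.

```python
import numpy as np

d, eps, trials = 512, 0.1, 200_000
rng = np.random.default_rng(0)
# ||Y||_2^2 for a d-dimensional standard Gaussian Y is chi-square distributed with d degrees of freedom.
Z = rng.chisquare(d, size=trials)
tail = np.mean((Z - d) / d > eps)
print(round(tail, 4), round(np.exp(-d * eps ** 2 / 8), 4))   # empirical tail vs. the stated bound
```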
Theorem 1 (Gradients of the last layer in the Transformer). Assume that $\|x_{L,i}^{post,5}\|_{2}^{2}$ and $\|x_{L+1,i}^{pre}\|_{2}^{2}$ are $(\epsilon, \delta)$-bounded for all $i$, where $\epsilon$ and $\delta = \delta(\epsilon)$ are small numbers. Then with probability at least $0.99 - \delta - \frac{\epsilon}{0.9 + \epsilon}$, for the Post-LN Transformer with $L$ layers, the gradient of the parameters of the last layer satisfies

$$\left\|\frac{\partial \widetilde{\mathcal{L}}}{\partial W^{2,L}}\right\|_{F} \leq \mathcal{O}\left(d\sqrt{\ln d}\right)$$

and for the Pre-LN Transformer with $L$ layers,

$$\left\|\frac{\partial \widetilde{\mathcal{L}}}{\partial W^{2,L}}\right\|_{F} \leq \mathcal{O}\left(d\sqrt{\frac{\ln d}{L}}\right).$$

From Theorem 1, we can see that for the Post-LN Transformer, the scale of the gradients to the last FFN layer is of order $\mathcal{O}(d\sqrt{\ln d})$, which is independent of $L$. For the Pre-LN Transformer, the scale of the gradients is much smaller. We first study the forward propagation of the Post-LN Transformer and the Pre-LN Transformer. Lemma 1 will serve as a basic tool to prove the main theorem and other lemmas.
Lemma 1. If $X \in \mathbb{R}^{d}$ is a Gaussian vector, $X \sim N(0, \sigma^{2}\mathbf{I}_{d})$, then $\mathbb{E}(\|\mathrm{ReLU}(X)\|_{2}^{2}) = \frac{1}{2}\sigma^{2}d$.
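A quick Monte Carlo check (ours) of Lemma 1: each coordinate of $X$ contributes $\frac{1}{2}\sigma^{2}$ to $\mathbb{E}\|\mathrm{ReLU}(X)\|_{2}^{2}$, since ReLU zeroes out the negative half of a symmetric distribution, so the total is $\frac{1}{2}\sigma^{2}d$.

```python
import numpy as np

d, sigma, trials = 64, 0.3, 100_000
rng = np.random.default_rng(0)
X = sigma * rng.standard_normal((trials, d))
empirical = (np.maximum(X, 0.0) ** 2).sum(axis=1).mean()
print(round(empirical, 3), round(0.5 * sigma ** 2 * d, 3))   # empirical estimate vs. sigma^2 d / 2
```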
Based on Lemma 1, we have the following lemma to estimate the scale of the hidden states in different layers for the Post-LN Transformer and the Pre-LN Transformer.
Lemma 2. At initialization, for the Post-LN Transformer, $\mathbb{E}(\|x_{l,i}^{post,5}\|_{2}^{2}) = \frac{3}{2}d$ for all $l > 0$ and $i$. For the Pre-LN Transformer, $(1 + \frac{l}{2})d \leq \mathbb{E}(\|x_{l,i}^{pre}\|_{2}^{2}) \leq (1 + \frac{3l}{2})d$ for all $l > 0$ and $i$. Expectations are taken over the input and the randomness of the initialization.

Lemma 2 studies the expected norm of the hidden states in both the Post-LN and the Pre-LN Transformer. It is obvious that in the Post-LN Transformer, the norm of $x_{l,i}^{post}$ is $\sqrt{d}$, and thus we study the norm of $x_{l,i}^{post,5}$ instead. As we can see from Lemma 2, the scale of the hidden states in the Post-LN Transformer remains the same in expectation, while the scale of the hidden states in the Pre-LN Transformer grows linearly with the depth. The next lemma shows that the scale of the hidden states is closely related to the scale of the gradient in architectures using layer normalization.
Lemma 3. For $x \in \mathbb{R}^{d}$, we have $\|\mathbf{J}_{LN}(x)\|_{2} = \mathcal{O}\left(\frac{\sqrt{d}}{\|x\|_{2}}\right)$, in which $\mathbf{J}_{LN}(x) = \frac{\partial \mathrm{LN}(x)}{\partial x}$.
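The following is a short derivation sketch (ours, not the paper's supplementary proof) of the explicit form behind Lemma 3, with $\gamma = 1$, $\beta = 0$, and $y = (x - \mu\mathbf{1})/\sigma$:

$$\mathbf{J}_{LN}(x) = \frac{\partial\,\mathrm{LN}(x)}{\partial x} = \frac{1}{\sigma}\left(\mathbf{I}_{d} - \frac{1}{d}\mathbf{1}\mathbf{1}^{T} - \frac{1}{d}\,y\,y^{T}\right).$$

Since $\mathbf{1}^{T}y = 0$ and $\|y\|_{2}^{2} = d$, the matrix in parentheses is an orthogonal projection with spectral norm at most one, so $\|\mathbf{J}_{LN}(x)\|_{2} \leq \frac{1}{\sigma} = \frac{\sqrt{d}}{\|x - \mu\mathbf{1}\|_{2}}$; when the mean of $x$ is small relative to its norm, as for the initialized hidden states considered here, this matches the $\mathcal{O}\big(\frac{\sqrt{d}}{\|x\|_{2}}\big)$ bound of Lemma 3.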
The proofs of Lemma 1, Lemma 2, Lemma 3, and Theorem 1 can be found in the supplementary material. The main idea is that the layer normalization normalizes the gradients. In the Post-LN Transformer, the scale of the inputs to the layer normalization is independent of $L$, and thus the gradients of the parameters in the last layer are independent of $L$. In the Pre-LN Transformer, in contrast, the scale of the input to the final layer normalization is linear in $L$, and thus the gradients of all parameters will be normalized by $\sqrt{L}$.
Extended theory to other layers/parameters We have provided a formal proof for the gradients of the last FFN sub-layer as above. In order to fully understand the optimization, we also make some preliminary analysis for other layers and other parameters. Our main result is that the gradient norm in the Post-LN Transformer is large for the parameters near the output and is likely to decay as the layer index $l$ decreases. On the contrary, the gradient norm in the Pre-LN Transformer is likely to stay the same for any layer $l$. All the preliminary theoretical results are provided in the supplementary material.
3.4. Empirical verification of the theory and discussion
As our theory is derived based on several simplifications of the problem, we conduct experiments to study whether our theoretical insights are consistent with what we observe in real scenarios. The general model and training configuration exactly follow Section 3.2. The experiments are repeated ten times using different random seeds.
On the concentration property Given an initialized model, we record the hidden states in the Post-LN/Pre-LN Transformer across batches and find that the norm of the hidden states satisfies the $(0.1, 0.125)$-bounded property.

On Theorem 1 Theorem 1 suggests that for any size of the Post-LN Transformer, the scale of the gradient norm in the last FFN sub-layer remains the same. On the contrary, that of the Pre-LN Transformer decreases as the size of the model grows. We calculate and record the gradient norm in the last FFN sub-layer of 6-6/8-8/10-10/12-12/14-14 Post-LN/Pre-LN Transformer models at initialization. The results are plotted in Figure 3(c) and 3(d). The x-axis is the size of the model, and the y-axis is the value of the gradient norm of $W^{2}$ in the final FFN sub-layer. The figures show that when the number of layers grows, the gradient norm stays around 1.6 in the Post-LN Transformer and decreases in the Pre-LN Transformer. This observation is consistent with our theory.
On the extended theory We calculate the gradient norm of each parameter matrix in the 6-6 Post-LN/Pre-LN Transformer. We record the gradient of each parameter for different mini-batches. For the elements in a parameter matrix, we calculate their expected gradients and use the Frobenius norm of those values as the scale of the expected gradient of the matrix. Figures 3(a) and 3(b) show these statistics for the FFN sub-layers. The x-axis indexes different Transformer layers. It can be seen from the figure that the scale of the expected gradients grows along with the layer index for the Post-LN Transformer. On the contrary, the scale stays almost the same for different layers in the Pre-LN Transformer. These observations are consistent with our theoretical findings.
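A minimal PyTorch sketch (ours) of the kind of measurement described above: run one mini-batch through a model at initialization, backpropagate, and record the Frobenius norm of each weight matrix's gradient. The two-layer module and squared-error loss are stand-ins; the paper's experiments use the actual Transformer models and translation loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))  # stand-in for an FFN sub-layer
x = torch.randn(128, 512)                      # one random mini-batch
loss = model(x).pow(2).mean()                  # stand-in loss
loss.backward()

for name, p in model.named_parameters():
    if p.dim() == 2 and p.grad is not None:    # weight matrices only
        print(name, float(p.grad.norm(p='fro')))
```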
The critical warm-up stage for Post-LN Transformer Given the analysis above, we hypothesize that the gradient scale is one of the reasons that the Post-LN Transformer needs a careful learning rate scheduling. Since the gradients are large for some layers, using a large learning rate without warm-up may make the training unstable.
To verify this argument, first, we study the gradient statistics of the Post-LN Transformer after the warm-up stage with Adam. It can be seen from Figures 3(a) and 3(b) that the scale of the gradients becomes very small, and the model can then be trained with large learning rates. Second, we conduct an experiment to train the Post-LN Transformer from scratch using a fixed small learning rate, i.e., $1e^{-4}$, to verify whether using small-step updates mitigates the issue. The details are provided in the supplementary material. In general, using a very small and fixed learning rate can mitigate the problem and optimize the Post-LN Transformer to a certain extent, but the convergence is significantly slower. Both experiments above support our claim.

Figure 3. The norm of gradients of (1) different layers in the 6-6 Transformer (a, b); (2) $W^{2,L}$ in Transformers of different sizes (c, d).