Training Very Deep Networks: Paper Notes

cool whidpers

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/xiongchengluo1129/article/details/85757632

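This paper proposes a new neural network architecture, the "highway network", to address the obstruction of information flow that arises when training deep neural networks. The architecture borrows from Long Short-Term Memory (LSTM) networks and uses adaptive gating units to regulate information flow, so that information passes smoothly across many layers. Even at depths of hundreds of layers, the network can be trained effectively with simple gradient descent.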


Abstract
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.

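Abstract (translation)
Theoretical and empirical evidence shows that the depth of a neural network is crucial to its performance. However, training becomes harder as depth increases, and training very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. We call it the highway network; it lets information flow unimpeded across the many layers of the network. It is inspired by LSTM and uses adaptive gating units to regulate the information flow. Even with hundreds of layers, a highway network can be trained directly with simple gradient descent. This makes it possible to study extremely deep and efficient architectures.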

2 Highway Networks
Notation. We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an identity matrix. The function σ(x) is defined as $\sigma(x)=\frac{1}{1+e^{-x}}$, $x \in \mathbb{R}$. The dot operator (·) is used to denote element-wise multiplication.
A plain feedforward neural network typically consists of L layers where the $l^{th}$ layer ($l \in \{1,2,\dots,L\}$) applies a non-linear transformation H (parameterized by $W_{H,l}$) on its input $x_{l}$ to produce its output $y_{l}$. Thus, $x_{1}$ is the input to the network and $y_{L}$ is the network's output. Omitting the layer index and biases for clarity,
$y=H(x,W_{H})$ (1)
H is usually an affine transform followed by a non-linear activation function, but in general it may take other forms, possibly convolutional or recurrent. For a highway network, we additionally define two non-linear transforms $T(x,W_{T})$ and $C(x,W_{C})$ such that
$y=H(x,W_{H})\cdot T(x,W_{T})+x\cdot C(x,W_{C})$ (2)
We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T, giving
$y=H(x,W_{H})\cdot T(x,W_{T})+x\cdot (1-T(x,W_{T}))$ (3)
The dimensionality of $x$, $y$, $H(x,W_{H})$ and $T(x,W_{T})$ must be the same for Equation 3 to be valid.
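As a minimal sketch of Equation 3, the layer below uses only the Python standard library. The choice of an affine map followed by tanh for H and an affine map followed by the sigmoid σ for T is an assumption made for illustration, as are the names `highway_layer`, `affine`, `b_H` and `b_T`; biases are included even though the equations above omit them. The shape check reflects the dimensionality requirement just stated.

```python
import math
import random


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows), bias b and vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def highway_layer(x, W_H, b_H, W_T, b_T):
    """Equation 3: y = H(x) * T(x) + x * (1 - T(x)), element-wise."""
    n = len(x)
    # x, H(x) and T(x) must share one dimensionality, so both weight
    # matrices must be n x n and both biases must have length n.
    for W, b in ((W_H, b_H), (W_T, b_T)):
        if len(W) != n or len(b) != n or any(len(row) != n for row in W):
            raise ValueError("weights must be n x n and biases length n, n = len(x)")
    h = [math.tanh(a) for a in affine(W_H, b_H, x)]  # H: affine + tanh (assumed)
    t = [sigmoid(a) for a in affine(W_T, b_T, x)]    # T: affine + sigmoid (assumed)
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]


# Hypothetical usage with random weights; the values carry no meaning.
random.seed(0)
n = 4
x = [0.5, -1.0, 2.0, 0.0]
W_H = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
W_T = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
y = highway_layer(x, W_H, [0.0] * n, W_T, [0.0] * n)
print(y)
```

Because each output component is a convex combination of $H_{i}(x)$ and $x_{i}$, it always lies between those two values.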
Note that this layer transformation is much more flexible than Equation 1. In particular, observe that for particular values of T,
$y=\begin{cases} x, & \text{if } T(x,W_{T})=0 \\ H(x,W_{H}), & \text{if } T(x,W_{T})=1 \end{cases}$ (4)
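The two limiting cases in Equation 4 can be checked numerically. The sketch below takes the values of H(x) and T(x) as given rather than computing them; the helper name `highway_combine` and the numbers in `h` are hypothetical stand-ins chosen for illustration.

```python
def highway_combine(h, t, x):
    """Element-wise Equation 3, given precomputed h = H(x) and t = T(x)."""
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]


x = [1.0, -2.0, 3.0]
h = [0.1, 0.2, 0.3]  # stand-in values for H(x, W_H), not from the paper

carry_all = highway_combine(h, [0.0] * 3, x)      # T = 0: the input is carried
transform_all = highway_combine(h, [1.0] * 3, x)  # T = 1: only H(x) remains
mixed = highway_combine(h, [0.5] * 3, x)          # in between: a blend of both

print(carry_all == x)      # True
print(transform_all == h)  # True
print(mixed)
```

Intermediate gate values give a per-component blend of the carried input and the transformed input.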

Similarly, for the Jacobian of the layer transform,
$\frac{dy}{dx}=\begin{cases} I, & \text{if } T(x,W_{T})=0 \\ H'(x,W_{H}), & \text{if } T(x,W_{T})=1 \end{cases}$ (5)
Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of H and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the $i^{th}$ unit computes $y_{i}=H_{i}(x)$, a highway network consists of multiple blocks such that the $i^{th}$ block computes a block state $H_{i}(x)$ and transform gate output $T_{i}(x)$. Finally, it produces the block output $y_{i}=H_{i}(x)\ast T_{i}(x)+x_{i}\ast (1-T_{i}(x))$, which is connected to the next layer.
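Equation 5 lists the Jacobian only at the two extreme gate values. As a worked step, and assuming H and T are differentiable and the products in Equation 3 are element-wise, the product rule gives the Jacobian for intermediate gate values. Here $H'$ and $T'$ denote the Jacobians of H and T with respect to x, and $\mathrm{diag}(\cdot)$ places a vector on the diagonal of a matrix; this notation is introduced for the sketch and does not appear in the text above.

$\frac{dy}{dx}=\mathrm{diag}(T)\,H'+\mathrm{diag}(1-T)+\mathrm{diag}(H-x)\,T'$

If T is built from the sigmoid σ (an assumption here), $T'$ goes to zero as the gate saturates at 0 or 1, so the last term drops out and the expression reduces to $I$ or $H'$ respectively, in agreement with Equation 5.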