Recurrent Neural Networks Tutorial, Part 3: Notes

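This post walks through the backpropagation through time (BPTT) algorithm for recurrent neural networks (RNNs) and how it differs from standard backpropagation. It also examines the vanishing gradient problem and introduces LSTM and GRU, two models that address it effectively.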

Note: RECURRENT NEURAL NETWORKS TUTORIAL, PART 3 – BACKPROPAGATION THROUGH TIME AND VANISHING GRADIENTS
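The tutorial series consists of the following parts:
1. Introduction To RNNs
2. Implementing a RNN using Python and Theano
3. Understanding the Backpropagation Through Time (BPTT) algorithm and the vanishing gradient problem
4. Implementing a GRU/LSTM RNN
After finishing this translation I found that the post at http://blog.youkuaiyun.com/rtygbwwwerr/article/details/51012699 gives a particularly clear derivation, so I plan to reorganize these notes later.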

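This post covers the basic principles of BPTT and how it differs from traditional backpropagation. It then discusses the vanishing gradient problem, which motivates LSTMs and GRUs (two workhorses of NLP). The vanishing gradient problem was first discovered by Sepp Hochreiter in 1991 and has received renewed attention as deep architectures came into wide use.
Following this tutorial requires familiarity with partial derivatives and basic backpropagation; the original post links to three introductory tutorials on these topics.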

BPTT

A small change of notation: $o$ becomes $\hat{y}$.
$\begin{aligned} s_t &= \tanh(Ux_t + Ws_{t-1}) \\ \hat{y}_t &= \mathrm{softmax} (Vs_t) \end{aligned}$
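The two equations above can be sketched in NumPy as follows. This is a minimal illustration, not the tutorial's own implementation: the function names `softmax` and `forward`, the toy dimensions, and the convention of storing the zero initial state in the last row of `s` are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, U, V, W):
    """Run the RNN over a sequence of word indices x.

    Returns y_hat (T x vocab) and s (T+1 x hidden). The extra last row of s
    stays zero and acts as the initial state s_{-1} (an assumed convention).
    """
    T = len(x)
    s = np.zeros((T + 1, U.shape[0]))
    y_hat = np.zeros((T, V.shape[0]))
    for t in range(T):
        # With a one-hot x_t, U x_t is simply column x[t] of U.
        s[t] = np.tanh(U[:, x[t]] + W.dot(s[t - 1]))
        y_hat[t] = softmax(V.dot(s[t]))
    return y_hat, s

# Toy example with made-up sizes: 5-word vocabulary, 3 hidden units.
rng = np.random.default_rng(0)
vocab, hidden = 5, 3
U = rng.uniform(-0.5, 0.5, (hidden, vocab))
V = rng.uniform(-0.5, 0.5, (vocab, hidden))
W = rng.uniform(-0.5, 0.5, (hidden, hidden))
y_hat, s = forward([0, 3, 1, 4], U, V, W)
```

Each row of `y_hat` is a probability distribution over the vocabulary, and each row of `s` is the hidden state at one time step.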
We define the loss (error) as the cross-entropy loss:
$\begin{aligned} E_t(y_t, \hat{y}_t) &= - y_{t} \log \hat{y}_{t} \\ E(y, \hat{y}) &=\sum\limits_{t} E_t(y_t, \hat{y}_t) \\ &=-\sum\limits_{t} y_{t} \log \hat{y}_{t} \end{aligned}$
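Here $y_t$ is the correct word at time step $t$ and $\hat{y}_{t}$ is our prediction. We treat a whole sentence as one training example, so the total error is the sum of the errors at each time step.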
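As a small illustration of this loss, assuming each $y_t$ is one-hot so that $-y_t \log \hat{y}_t$ reduces to minus the log-probability assigned to the correct word, the sentence-level error might be computed like this. The function name `sequence_loss` and the toy probabilities are made up for the example.

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Sum of per-step cross-entropy losses.

    y_hat: (T x vocab) predicted distributions.
    y:     length-T sequence of correct word indices.
    """
    y = np.asarray(y)
    # Pick the predicted probability of the correct word at each step.
    correct_probs = y_hat[np.arange(len(y)), y]
    return -np.sum(np.log(correct_probs))

# Two time steps, three-word vocabulary (illustrative numbers only).
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = [0, 1]
loss = sequence_loss(y_hat, y)  # equals -(log 0.7 + log 0.8)
```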
The gradients are summed over time steps in the same way:
$\frac{\partial E}{\partial W} = \sum\limits_{t} \frac{\partial E_t}{\partial W}$
We apply the chain rule of differentiation, using $E_3$ as an example.
$\begin{aligned} \frac{\partial E_3}{\partial V} &= \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial{V}}\\ &=\frac{\partial E_3}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial{z_3}}\frac{\partial z_3}{\partial{V}} \\ &= (\hat{y}_3 - y_3) \otimes s_3 \end{aligned}$
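In the above, $z_3=Vs_3$ and $\otimes$ denotes the outer product of two vectors. $\frac{\partial E_3}{\partial V}$ depends only on values at the current time step, namely $\hat{y}_3$, $y_3$, and $s_3$, so computing the gradient for $V$ is a simple matrix multiplication (here, a single outer product).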
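One way to convince yourself of the result $(\hat{y}_3 - y_3) \otimes s_3$ is to compare it against a finite-difference estimate. The sketch below does this for a single time step; the helper names, the random toy values, and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_V(V, s3, target):
    """Cross-entropy E_3 as a function of V, with s_3 held fixed."""
    return -np.log(softmax(V.dot(s3))[target])

rng = np.random.default_rng(1)
vocab, hidden, target = 4, 3, 2
V = rng.normal(size=(vocab, hidden))
s3 = np.tanh(rng.normal(size=hidden))

# Analytic gradient: (y_hat_3 - y_3) outer s_3, with y_3 one-hot.
y3 = np.zeros(vocab)
y3[target] = 1.0
y_hat3 = softmax(V.dot(s3))
analytic = np.outer(y_hat3 - y3, s3)

# Numerical gradient by central finite differences.
numeric = np.zeros_like(V)
eps = 1e-6
for i in range(vocab):
    for j in range(hidden):
        Vp, Vm = V.copy(), V.copy()
        Vp[i, j] += eps
        Vm[i, j] -= eps
        numeric[i, j] = (loss_V(Vp, s3, target) - loss_V(Vm, s3, target)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
```

If the formula is right, `max_diff` should be tiny compared with the gradient entries themselves.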
The derivatives with respect to $W$ and $U$ are a different story.
$\begin{aligned} \frac{\partial E_3}{\partial W} &=\frac{\partial E_3}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial{s_3}}\frac{\partial s_3}{\partial{W}} \end{aligned}$
Note that $s_3= \tanh(Ux_3 + Ws_2)$ depends on $s_2$, which in turn depends on $W$ and $s_1$, and so on.
$\begin{aligned} \frac{\partial E_3}{\partial W} &=\sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial{s_3}}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial{W}} \end{aligned}$
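We sum the contributions of every time step to the gradient. In other words, backpropagating the gradient means going all the way back to $t=0$.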
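To make the sum concrete, note that $\frac{\partial s_3}{\partial s_k}$ is itself a chain of one-step derivatives, and that $\frac{\partial s_k}{\partial W}$ here means the direct dependence of $s_k$ on $W$ with $s_{k-1}$ held fixed. Written out (a worked expansion added for illustration, with the empty product for $k=3$ taken as the identity):
$\begin{aligned} \frac{\partial s_3}{\partial s_k} &= \prod\limits_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}} \\ \frac{\partial E_3}{\partial W} &= \frac{\partial E_3}{\partial s_3}\left( \frac{\partial s_3}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W} \right) \end{aligned}$
where $\frac{\partial E_3}{\partial s_3} = \frac{\partial E_3}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial{s_3}}$.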
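We define a delta vector; the original post links to a detailed derivation of this part.
$\delta_2^{(3)} = \frac{\partial E_3}{\partial z_2} = \frac{\partial E_3}{\partial s_3} \frac{\partial s_3}{\partial s_2} \frac{\partial s_2}{\partial z_2}$
where $z_2=Ux_2+Ws_1$ (that is, $z_2$ is the input to the hidden state $s_2$ before the $\tanh$). The same chain-rule equations then apply.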
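Under the model defined earlier, the delta vectors can be written as a backward recursion. The following is a sketch derived from $s_t = \tanh(z_t)$, whose derivative is $1 - s_t^2$; the symbol $\odot$ for the elementwise product and the zero initial state $s_{-1} = 0$ are notation assumed here, not taken from the original post:
$\begin{aligned} \delta_3^{(3)} &= \left(V^\top(\hat{y}_3 - y_3)\right) \odot (1 - s_3^2) \\ \delta_{k-1}^{(3)} &= \left(W^\top \delta_k^{(3)}\right) \odot (1 - s_{k-1}^2) \\ \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \delta_k^{(3)} \otimes s_{k-1} \end{aligned}$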
Here is the code:

import numpy as np

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    # Gradient of the loss w.r.t. the softmax input: y_hat - y.
    # Note that delta_o aliases o; o is not used again below.
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation: dL/dz
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # Add to gradients at each previous step.
            # For bptt_step == 0, s[-1] is read; this assumes forward_propagation
            # stores the initial hidden state in the last row of s.
            dLdW += np.outer(delta_t, s[bptt_step-1])
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step dL/dz at t-1
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]
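Since `bptt` is a method that relies on `self.forward_propagation`, `self.U`, `self.V`, `self.W`, and `self.bptt_truncate`, it cannot be run on its own. The sketch below wraps the same logic in a hypothetical `TinyRNN` class (the class name, initialization, and toy sizes are assumptions, not the tutorial's code) and compares the BPTT gradients with central finite differences. Truncation is set larger than the sequence length so that the two should agree.

```python
import numpy as np

class TinyRNN:
    """Hypothetical minimal model exposing the attributes bptt expects."""

    def __init__(self, vocab, hidden, bptt_truncate=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.uniform(-0.5, 0.5, (hidden, vocab))
        self.V = rng.uniform(-0.5, 0.5, (vocab, hidden))
        self.W = rng.uniform(-0.5, 0.5, (hidden, hidden))
        self.bptt_truncate = bptt_truncate

    def forward_propagation(self, x):
        T = len(x)
        s = np.zeros((T + 1, self.U.shape[0]))  # s[-1] is the zero initial state
        o = np.zeros((T, self.V.shape[0]))
        for t in range(T):
            s[t] = np.tanh(self.U[:, x[t]] + self.W.dot(s[t - 1]))
            z = self.V.dot(s[t])
            e = np.exp(z - z.max())
            o[t] = e / e.sum()
        return o, s

    def loss(self, x, y):
        o, _ = self.forward_propagation(x)
        return -np.sum(np.log(o[np.arange(len(y)), y]))

    def bptt(self, x, y):
        T = len(y)
        o, s = self.forward_propagation(x)
        dLdU = np.zeros(self.U.shape)
        dLdV = np.zeros(self.V.shape)
        dLdW = np.zeros(self.W.shape)
        delta_o = o.copy()
        delta_o[np.arange(T), y] -= 1.0
        for t in range(T - 1, -1, -1):
            dLdV += np.outer(delta_o[t], s[t])
            delta_t = self.V.T.dot(delta_o[t]) * (1 - s[t] ** 2)
            for step in range(t, max(0, t - self.bptt_truncate) - 1, -1):
                dLdW += np.outer(delta_t, s[step - 1])
                dLdU[:, x[step]] += delta_t
                delta_t = self.W.T.dot(delta_t) * (1 - s[step - 1] ** 2)
        return dLdU, dLdV, dLdW

def numerical_grad(model, name, x, y, eps=1e-5):
    """Central finite-difference gradient of the loss w.r.t. one parameter matrix."""
    param = getattr(model, name)
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        plus = model.loss(x, y)
        param[idx] = original - eps
        minus = model.loss(x, y)
        param[idx] = original  # restore the parameter
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

model = TinyRNN(vocab=6, hidden=4)
x, y = [0, 1, 2, 3], [1, 2, 3, 4]
analytic = dict(zip("UVW", model.bptt(x, y)))
max_err = max(
    np.max(np.abs(analytic[n] - numerical_grad(model, n, x, y))) for n in "UVW"
)
```

A small `max_err` suggests the backward pass matches the loss. With a `bptt_truncate` shorter than the sequence, the gradients for `U` and `W` are only approximations, so a larger discrepancy would be expected there.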