RNN and Its Formula Derivation
RNN stands for recurrent neural network. There are many kinds of recurrent neural networks; this article only derives the most basic one. I was quite confused by the derivation at first, and after reading a good deal of material I organized it as follows.
Structure
The figure above shows the most basic recurrent neural network. The diagram is abstract, though: drawing only one circle does not mean there is only one hidden unit. If the recurrent connection is removed, each timestep is just a fully connected neural network. $x$ is a vector representing the input layer. $s$ is also a vector, representing the hidden layer; note that this layer has multiple nodes, and the number of nodes equals the dimension of $s$. $U$ is the weight matrix from the input layer to the hidden layer, just like the weight matrix of a fully connected network. $o$ is a vector representing the output layer, and $V$ is the weight matrix from the hidden layer to the output layer.
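For concreteness, the forward computation of this basic RNN can be written out explicitly. This is the standard vanilla-RNN formulation; $W$ (the hidden-to-hidden, i.e. recurrent, weight matrix), the bias $b$, and the activation functions $f$ and $g$ are not labeled in the figure above:

$$s_t = f(U x_t + W s_{t-1} + b), \qquad o_t = g(V s_t)$$

In the code at the end of this article, $f$ is $\tanh$, $U$ corresponds to Wx, and $W$ corresponds to Wh.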
Formula Derivation
The key to the BPTT algorithm for RNNs lies in understanding the figure above. The forward pass is easy to understand and follows directly from the structural definition of the RNN; below is the derivation of the backward pass. Its basic principle is the same as that of the BP algorithm, and it likewise consists of three steps:
1. Forward-compute the output value of every neuron.
2. Backward-compute the error term $\delta_j$ of every neuron, where the error term is defined as the partial derivative of the error function $E$ with respect to the neuron's weighted input $\text{net}_j$.
3. Compute the gradient of every weight, then update the weights with an optimization algorithm (a concrete single-training-step sketch that walks through these three steps is given after the code at the end of this article).
Error Term Computation
Let us return to the notation of the first figure and use the vector $\text{net}_t$ to denote the weighted input of the neurons at time $t$, so that $s_t = f(\text{net}_t)$. The error term propagates along two directions. One direction is backward along the time axis to the initial time step, giving $\delta_1$; this part depends only on the recurrent (hidden-to-hidden) weight matrix $W$. The other direction passes the error down to the previous layer of the network, giving $\delta_t^{l-1}$; this part depends only on the weight matrix $U$.
The first direction: along the time axis. Using the same notation, $\text{net}_t = U x_t + W s_{t-1} + b$ and $s_{t-1} = f(\text{net}_{t-1})$, so by the chain rule

$$\frac{\partial E}{\partial \text{net}_{t-1}} = \frac{\partial E}{\partial \text{net}_t}\,\frac{\partial \text{net}_t}{\partial s_{t-1}}\,\frac{\partial s_{t-1}}{\partial \text{net}_{t-1}}$$

The first term: $\dfrac{\partial \text{net}_t}{\partial s_{t-1}} = W$.

The second term: since $f$ is applied element-wise, $\dfrac{\partial s_{t-1}}{\partial \text{net}_{t-1}} = \text{diag}\!\left[f'(\text{net}_{t-1})\right]$.

Finally we obtain

$$\delta_{t-1}^{T} = \delta_t^{T}\,W\,\text{diag}\!\left[f'(\text{net}_{t-1})\right]$$

and, applying this repeatedly, the error term at any earlier time $k$:

$$\delta_k^{T} = \delta_t^{T} \prod_{i=k}^{t-1} W\,\text{diag}\!\left[f'(\text{net}_i)\right]$$

The second direction: passing the error down one layer, which works exactly as in standard BP. Therefore

$$\left(\delta_t^{l-1}\right)^{T} = \left(\delta_t^{l}\right)^{T} U\,\text{diag}\!\left[f'^{\,l-1}(\text{net}_t^{l-1})\right]$$

At this point the computation of the error terms has been fully explained.
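As a bridge from the formula to the implementation below, here is a minimal NumPy sketch of the time-direction step in the batched, row-vector convention used by the code ($\text{net}_t = x_t W_x + s_{t-1} W_h + b$, $f = \tanh$, so $f'(\text{net}_{t-1}) = 1 - s_{t-1}^2$). The function name and arguments are hypothetical, chosen only to mirror the formula:

import numpy as np

def propagate_error_back_in_time(delta_t, s_prev, Wh):
    """One step of delta_{t-1}^T = delta_t^T W diag[f'(net_{t-1})] for a tanh RNN.

    delta_t: dE/dnet_t for a minibatch, shape (N, H)
    s_prev:  previous hidden state s_{t-1} = tanh(net_{t-1}), shape (N, H)
    Wh:      hidden-to-hidden (recurrent) weight matrix, shape (H, H)
    """
    # delta_t.dot(Wh.T) is dE/ds_{t-1}; multiplying element-wise by
    # 1 - s_{t-1}^2 applies diag[f'(net_{t-1})] for f = tanh
    return delta_t.dot(Wh.T) * (1.0 - s_prev ** 2)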
Weight Update
The update of a single weight $w_{ji}$ depends only on the corresponding weighted input $\text{net}_t^{j}$, so the gradient of the recurrent weight matrix $W$ at time $t$ is

$$\nabla_{W_t} E = \delta_t\, s_{t-1}^{T}$$

The final gradient is the sum of the gradients at every time step; the proof is omitted here and can be found in the referenced article, so only the conclusion is given:

$$\nabla_W E = \sum_{k=1}^{t} \nabla_{W_k} E = \sum_{k=1}^{t} \delta_k\, s_{k-1}^{T}$$

The gradient of $U$ is computed in the same way:

$$\nabla_U E = \sum_{k=1}^{t} \delta_k\, x_k^{T}$$

This summation over time steps is exactly what the accumulation loop in rnn_backward below implements.
Finally, here is a code implementation, which should help build a better understanding:
import numpy as np


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    # s_t = tanh(net_t) with net_t = x_t Wx + s_{t-1} Wh + b
    next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)
    cache = (x, Wx, Wh, prev_h, next_h)
    return next_h, cache
def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    x, Wx, Wh, prev_h, next_h = cache
    # d tanh(z)/dz = 1 - tanh(z)^2, i.e. the diag[f'(net_t)] factor in the derivation
    dtanh = 1 - next_h ** 2                   # (N, H)
    dx = (dnext_h * dtanh).dot(Wx.T)          # (N, D)
    dprev_h = (dnext_h * dtanh).dot(Wh.T)     # (N, H), error passed back in time
    dWx = x.T.dot(dnext_h * dtanh)            # (D, H)
    dWh = prev_h.T.dot(dnext_h * dtanh)       # (H, H)
    db = np.sum(dnext_h * dtanh, axis=0)      # (H,)
    return dx, dprev_h, dWx, dWh, db
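Before moving on to full sequences, a quick finite-difference comparison can sanity-check the single-step backward pass against the derivation above. This is only a sketch: the array sizes, the random seed, the scalar test loss sum(next_h * dnext_h), and the checked index (i, j) are arbitrary illustrative choices:

def gradient_check_step(eps=1e-5):
    """Compare the analytic dWh from rnn_step_backward with a numerical estimate."""
    np.random.seed(0)
    N, D, H = 3, 4, 5
    x = np.random.randn(N, D)
    prev_h = np.random.randn(N, H)
    Wx = np.random.randn(D, H)
    Wh = np.random.randn(H, H)
    b = np.random.randn(H)

    next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
    dnext_h = np.random.randn(N, H)              # pretend upstream gradient
    _, _, _, dWh, _ = rnn_step_backward(dnext_h, cache)

    # numerical gradient of the scalar loss sum(next_h * dnext_h)
    # with respect to a single entry Wh[i, j]
    i, j = 1, 2
    Wh_plus, Wh_minus = Wh.copy(), Wh.copy()
    Wh_plus[i, j] += eps
    Wh_minus[i, j] -= eps
    out_plus, _ = rnn_step_forward(x, prev_h, Wx, Wh_plus, b)
    out_minus, _ = rnn_step_forward(x, prev_h, Wx, Wh_minus, b)
    num_grad = np.sum((out_plus - out_minus) * dnext_h) / (2 * eps)
    print(dWh[i, j], num_grad)   # the two numbers should agree to several decimals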
def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    h_interm = h0
    cache = []
    for i in range(T):
        # step forward one timestep, feeding the previous hidden state back in
        h[:, i, :], cache_sub = rnn_step_forward(x[:, i, :], h_interm, Wx, Wh, b)
        h_interm = h[:, i, :]
        cache.append(cache_sub)
    return h, cache
def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    x, Wx, Wh, prev_h, next_h = cache[-1]
    _, D = x.shape
    N, T, H = dh.shape
    dx = np.zeros((N, T, D))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h_ = np.zeros((N, H))
    for i in range(T - 1, -1, -1):
        # the gradient flowing into timestep i is the upstream gradient dh[:, i, :]
        # plus the gradient carried back through time from step i + 1
        dx_, dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(dh[:, i, :] + dprev_h_, cache.pop())
        dx[:, i, :] = dx_
        dWx += dWx_      # weight gradients are summed over all timesteps
        dWh += dWh_
        db += db_
    dh0 = dprev_h_       # gradient with respect to the initial hidden state
    return dx, dh0, dWx, dWh, db
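To tie the implementation back to the three steps listed at the start of the derivation, here is a minimal sketch of a single training iteration. The loss (a plain sum of squares over all hidden states), the learning rate, and the array sizes are arbitrary illustrative choices, not part of the implementation above:

# a minimal sketch of one training step: forward, backward (BPTT), parameter update
np.random.seed(1)
N, T, D, H = 2, 6, 3, 4
x = np.random.randn(N, T, D)
h0 = np.zeros((N, H))
Wx = np.random.randn(D, H) * 0.1
Wh = np.random.randn(H, H) * 0.1
b = np.zeros(H)

# 1. forward-compute every hidden state
h, cache = rnn_forward(x, h0, Wx, Wh, b)

# 2. backward-compute the error terms; with loss = 0.5 * sum(h ** 2), dL/dh = h
dh = h
dx, dh0, dWx, dWh, db = rnn_backward(dh, cache)

# 3. update every weight with its gradient (plain gradient descent, lr = 0.01)
lr = 0.01
Wx -= lr * dWx
Wh -= lr * dWh
b -= lr * db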
This article has walked through the basic structure of the recurrent neural network (RNN) and the derivation of its formulas, covering the key steps of the forward pass and of backpropagation through time (BPTT). By analyzing the relationship between the RNN's hidden layer and output layer, together with the computation of the error terms and the weight updates, it aims to help readers build a deeper understanding of how RNNs work.