RNN and Its Formula Derivation
RNN stands for recurrent neural network. There are many kinds of recurrent neural networks; this article only derives the most basic one. I was quite confused by the derivation at first, and after reading a good deal of material I organized it as follows.
Structure
The figure above shows the most basic recurrent neural network. The diagram is abstract, though: drawing only one circle does not mean there is only one hidden unit. If the recurrent connection is removed, each timestep is just a fully connected neural network. $x$ is a vector representing the input layer. $s$ is also a vector, representing the hidden layer; note that this layer has multiple nodes, and the number of nodes equals the dimension of $s$. $U$ is the weight matrix from the input layer to the hidden layer, just like the weight matrix of a fully connected network. $o$ is a vector representing the output layer, and $V$ is the weight matrix from the hidden layer to the output layer.
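For concreteness, the forward computation of this basic RNN can be written out explicitly. This is the standard vanilla-RNN formulation; $W$ (the hidden-to-hidden, i.e. recurrent, weight matrix), the bias $b$, and the activation functions $f$ and $g$ are not labeled in the figure above:

$$s_t = f(U x_t + W s_{t-1} + b), \qquad o_t = g(V s_t)$$

In the code at the end of this article, $f$ is $\tanh$, $U$ corresponds to Wx, and $W$ corresponds to Wh.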
Formula Derivation
The key to the BPTT algorithm for RNNs lies in understanding the figure above. The forward pass is easy to understand and follows directly from the structural definition of the RNN; below is the derivation of the backward pass. Its basic principle is the same as that of the BP algorithm, and it likewise consists of three steps:
1. Forward-compute the output value of every neuron.
2. Backward-compute the error term $\delta_j$ of every neuron, where the error term is defined as the partial derivative of the error function $E$ with respect to the neuron's weighted input $\text{net}_j$.
3. Compute the gradient of every weight, then update the weights with an optimization algorithm (a concrete single-training-step sketch that walks through these three steps is given after the code at the end of this article).
Error Term Computation
Let us return to the notation of the first figure and use the vector $\text{net}_t$ to denote the weighted input of the neurons at time $t$, so that $s_t = f(\text{net}_t)$. The error term propagates along two directions. One direction is backward along the time axis to the initial time step, giving $\delta_1$; this part depends only on the recurrent (hidden-to-hidden) weight matrix $W$. The other direction passes the error down to the previous layer of the network, giving $\delta_t^{l-1}$; this part depends only on the weight matrix $U$.
The first direction: along the time axis. Using the same notation, $\text{net}_t = U x_t + W s_{t-1} + b$ and $s_{t-1} = f(\text{net}_{t-1})$, so by the chain rule

$$\frac{\partial E}{\partial \text{net}_{t-1}} = \frac{\partial E}{\partial \text{net}_t}\,\frac{\partial \text{net}_t}{\partial s_{t-1}}\,\frac{\partial s_{t-1}}{\partial \text{net}_{t-1}}$$

The first term: $\dfrac{\partial \text{net}_t}{\partial s_{t-1}} = W$.

The second term: since $f$ is applied element-wise, $\dfrac{\partial s_{t-1}}{\partial \text{net}_{t-1}} = \text{diag}\!\left[f'(\text{net}_{t-1})\right]$.

Finally we obtain

$$\delta_{t-1}^{T} = \delta_t^{T}\,W\,\text{diag}\!\left[f'(\text{net}_{t-1})\right]$$

and, applying this repeatedly, the error term at any earlier time $k$:

$$\delta_k^{T} = \delta_t^{T} \prod_{i=k}^{t-1} W\,\text{diag}\!\left[f'(\text{net}_i)\right]$$

The second direction: passing the error down one layer, which works exactly as in standard BP. Therefore

$$\left(\delta_t^{l-1}\right)^{T} = \left(\delta_t^{l}\right)^{T} U\,\text{diag}\!\left[f'^{\,l-1}(\text{net}_t^{l-1})\right]$$

At this point the computation of the error terms has been fully explained.
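As a bridge from the formula to the implementation below, here is a minimal NumPy sketch of the time-direction step in the batched, row-vector convention used by the code ($\text{net}_t = x_t W_x + s_{t-1} W_h + b$, $f = \tanh$, so $f'(\text{net}_{t-1}) = 1 - s_{t-1}^2$). The function name and arguments are hypothetical, chosen only to mirror the formula:

import numpy as np

def propagate_error_back_in_time(delta_t, s_prev, Wh):
    """One step of delta_{t-1}^T = delta_t^T W diag[f'(net_{t-1})] for a tanh RNN.

    delta_t: dE/dnet_t for a minibatch, shape (N, H)
    s_prev:  previous hidden state s_{t-1} = tanh(net_{t-1}), shape (N, H)
    Wh:      hidden-to-hidden (recurrent) weight matrix, shape (H, H)
    """
    # delta_t.dot(Wh.T) is dE/ds_{t-1}; multiplying element-wise by
    # 1 - s_{t-1}^2 applies diag[f'(net_{t-1})] for f = tanh
    return delta_t.dot(Wh.T) * (1.0 - s_prev ** 2)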
Weight Update
The update of a single weight $w_{ji}$ depends only on the corresponding weighted input $\text{net}_t^{j}$, so the gradient of the recurrent weight matrix $W$ at time $t$ is

$$\nabla_{W_t} E = \delta_t\, s_{t-1}^{T}$$

The final gradient is the sum of the gradients at every time step; the proof is omitted here and can be found in the referenced article, so only the conclusion is given:

$$\nabla_W E = \sum_{k=1}^{t} \nabla_{W_k} E = \sum_{k=1}^{t} \delta_k\, s_{k-1}^{T}$$

The gradient of $U$ is computed in the same way:

$$\nabla_U E = \sum_{k=1}^{t} \delta_k\, x_k^{T}$$

This summation over time steps is exactly what the accumulation loop in rnn_backward below implements.
Finally, here is a code implementation, which should help build a better understanding:
import numpy as np


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    # s_t = tanh(net_t) with net_t = x_t Wx + s_{t-1} Wh + b
    next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)
    cache = (x, Wx, Wh, prev_h, next_h)
    return next_h, cache
def rnn_step_backward(dnext_h, cache):
    """
    Backward pass for a single timestep of a vanilla RNN.

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    x, Wx, Wh, prev_h, next_h = cache
    # d tanh(z)/dz = 1 - tanh(z)^2, i.e. the diag[f'(net_t)] factor in the derivation
    dtanh = 1 - next_h ** 2                   # (N, H)
    dx = (dnext_h * dtanh).dot(Wx.T)          # (N, D)
    dprev_h = (dnext_h * dtanh).dot(Wh.T)     # (N, H), error passed back in time
    dWx = x.T.dot(dnext_h * dtanh)            # (D, H)
    dWh = prev_h.T.dot(dnext_h * dtanh)       # (H, H)
    db = np.sum(dnext_h * dtanh, axis=0)      # (H,)
    return dx, dprev_h, dWx, dWh, db
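Before moving on to full sequences, a quick finite-difference comparison can sanity-check the single-step backward pass against the derivation above. This is only a sketch: the array sizes, the random seed, the scalar test loss sum(next_h * dnext_h), and the checked index (i, j) are arbitrary illustrative choices:

def gradient_check_step(eps=1e-5):
    """Compare the analytic dWh from rnn_step_backward with a numerical estimate."""
    np.random.seed(0)
    N, D, H = 3, 4, 5
    x = np.random.randn(N, D)
    prev_h = np.random.randn(N, H)
    Wx = np.random.randn(D, H)
    Wh = np.random.randn(H, H)
    b = np.random.randn(H)

    next_h, cache = rnn_step_forward(x, prev_h, Wx, Wh, b)
    dnext_h = np.random.randn(N, H)              # pretend upstream gradient
    _, _, _, dWh, _ = rnn_step_backward(dnext_h, cache)

    # numerical gradient of the scalar loss sum(next_h * dnext_h)
    # with respect to a single entry Wh[i, j]
    i, j = 1, 2
    Wh_plus, Wh_minus = Wh.copy(), Wh.copy()
    Wh_plus[i, j] += eps
    Wh_minus[i, j] -= eps
    out_plus, _ = rnn_step_forward(x, prev_h, Wx, Wh_plus, b)
    out_minus, _ = rnn_step_forward(x, prev_h, Wx, Wh_minus, b)
    num_grad = np.sum((out_plus - out_minus) * dnext_h) / (2 * eps)
    print(dWh[i, j], num_grad)   # the two numbers should agree to several decimals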
def rnn_forward(x, h0, Wx, Wh, b):
    """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    h_interm = h0
    cache = []
    for i in range(T):
        # step forward one timestep, feeding the previous hidden state back in
        h[:, i, :], cache_sub = rnn_step_forward(x[:, i, :], h_interm, Wx, Wh, b)
        h_interm = h[:, i, :]
        cache.append(cache_sub)
    return h, cache
def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    x, Wx, Wh, prev_h, next_h = cache[-1]
    _, D = x.shape
    N, T, H = dh.shape
    dx = np.zeros((N, T, D))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h_ = np.zeros((N, H))
    for i in range(T - 1, -1, -1):
        # the gradient flowing into timestep i is the upstream gradient dh[:, i, :]
        # plus the gradient carried back through time from step i + 1
        dx_, dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(dh[:, i, :] + dprev_h_, cache.pop())
        dx[:, i, :] = dx_
        dWx += dWx_      # weight gradients are summed over all timesteps
        dWh += dWh_
        db += db_
    dh0 = dprev_h_       # gradient with respect to the initial hidden state
    return dx, dh0, dWx, dWh, db
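To tie the implementation back to the three steps listed at the start of the derivation, here is a minimal sketch of a single training iteration. The loss (a plain sum of squares over all hidden states), the learning rate, and the array sizes are arbitrary illustrative choices, not part of the implementation above:

# a minimal sketch of one training step: forward, backward (BPTT), parameter update
np.random.seed(1)
N, T, D, H = 2, 6, 3, 4
x = np.random.randn(N, T, D)
h0 = np.zeros((N, H))
Wx = np.random.randn(D, H) * 0.1
Wh = np.random.randn(H, H) * 0.1
b = np.zeros(H)

# 1. forward-compute every hidden state
h, cache = rnn_forward(x, h0, Wx, Wh, b)

# 2. backward-compute the error terms; with loss = 0.5 * sum(h ** 2), dL/dh = h
dh = h
dx, dh0, dWx, dWh, db = rnn_backward(dh, cache)

# 3. update every weight with its gradient (plain gradient descent, lr = 0.01)
lr = 0.01
Wx -= lr * dWx
Wh -= lr * dWh
b -= lr * db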
This article has walked through the basic structure of the recurrent neural network (RNN) and the derivation of its formulas, covering the key steps of the forward pass and of backpropagation through time (BPTT). By analyzing the relationship between the RNN's hidden layer and output layer, together with the computation of the error terms and the weight updates, it aims to help readers build a deeper understanding of how RNNs work.