Background
If you search for the structure of a GRU, you basically always get this same diagram. It is indeed a bit hard to understand from the picture alone, so it is worth walking through it with code.
A simple GRUCell
The computation inside PyTorch's GRUCell follows these formulas (from the PyTorch docs):

r = sigmoid(W_ir x + b_ir + W_hr h + b_hr)
z = sigmoid(W_iz x + b_iz + W_hz h + b_hz)
n = tanh(W_in x + b_in + r * (W_hn h + b_hn))
h' = (1 - z) * n + z * h

We don't need to worry about the formulas for now; let's first run a simple demo.
A simple demo
The following code lets us see the previous hidden state and the current hidden state:
import torch
import torch.nn as nn

input_size = 1
hidden_size = 1
batch_size = 1
gru = nn.GRUCell(input_size, hidden_size)
input = torch.randn(6, batch_size, input_size)
hx = torch.randn(batch_size, hidden_size)
hx, gru(input[0], hx)
Extracting the model parameters
weight_ih, weight_hh, bias_ih, bias_hh = list(gru.parameters())
weight_ih, weight_hh, bias_ih, bias_hh
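Since hidden_size is 1 here, indexing single rows of the parameters works; for a larger hidden_size, each parameter stacks the reset/update/candidate blocks along dim 0, so torch.chunk is the more general way to pull them apart (a sketch; the r, z, n block order follows the PyTorch docs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRUCell(input_size=4, hidden_size=3)

# weight_ih has shape (3*hidden_size, input_size); the three row-blocks
# correspond to the reset, update and candidate gates, in that order.
w_ir, w_iz, w_in = gru.weight_ih.chunk(3, dim=0)
w_hr, w_hz, w_hn = gru.weight_hh.chunk(3, dim=0)
b_ir, b_iz, b_in = gru.bias_ih.chunk(3, dim=0)
b_hr, b_hz, b_hn = gru.bias_hh.chunk(3, dim=0)

print(w_ir.shape, w_hr.shape, b_ir.shape)
```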
Reproducing the computation by hand
r - the reset gate

- Its output range is [0, 1], and the basic computation is the same as in RNNCell.
weight_ir_l0 = weight_ih[0]
bias_ir_l0 = bias_ih[0]
weight_hr_l0 = weight_hh[0]
bias_hr_l0 = bias_hh[0]
r = torch.sigmoid(weight_ir_l0 * input[0] + bias_ir_l0 + weight_hr_l0 * hx + bias_hr_l0)
r
z - the update gate
- Its output range is [0, 1]; it is computed the same way as the reset gate r, just used for a different purpose.
weight_iz_l0 = weight_ih[1]
bias_iz_l0 = bias_ih[1]
weight_hz_l0 = weight_hh[1]
bias_hz_l0 = bias_hh[1]
z = torch.sigmoid(weight_iz_l0 * input[0] + bias_iz_l0 + weight_hz_l0 * hx + bias_hz_l0)
z
n - the candidate hidden state, i.e. the state computed from the current input
weight_in_l0 = weight_ih[2]
bias_in_l0 = bias_ih[2]
weight_hn_l0 = weight_hh[2]
bias_hn_l0 = bias_hh[2]
n = torch.tanh(weight_in_l0 * input[0] + bias_in_l0 + r * (weight_hn_l0 * hx + bias_hn_l0))
n
h' - the final hidden state
h1 = (1 - z) * n + z * hx
hx, h1
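Putting the four equations together, we can check that the hand-rolled step matches nn.GRUCell (a self-contained sketch; the manual_gru_step helper name is mine, not part of the PyTorch API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRUCell(input_size=1, hidden_size=1)
weight_ih, weight_hh, bias_ih, bias_hh = (
    gru.weight_ih, gru.weight_hh, gru.bias_ih, gru.bias_hh)

def manual_gru_step(x, h):
    # split the stacked parameters into the r / z / n blocks
    w_ir, w_iz, w_in = weight_ih.chunk(3, 0)
    w_hr, w_hz, w_hn = weight_hh.chunk(3, 0)
    b_ir, b_iz, b_in = bias_ih.chunk(3, 0)
    b_hr, b_hz, b_hn = bias_hh.chunk(3, 0)
    r = torch.sigmoid(x @ w_ir.T + b_ir + h @ w_hr.T + b_hr)    # reset gate
    z = torch.sigmoid(x @ w_iz.T + b_iz + h @ w_hz.T + b_hz)    # update gate
    n = torch.tanh(x @ w_in.T + b_in + r * (h @ w_hn.T + b_hn)) # candidate
    return (1 - z) * n + z * h                                  # final state

x = torch.randn(1, 1)
h = torch.randn(1, 1)
print(torch.allclose(manual_gru_step(x, h), gru(x, h)))  # True
```

Because it uses matrix multiplications and chunked parameters, the same function also works unchanged for larger input_size and hidden_size.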
Analysis
As expected, the result h1 we obtained matches the output of GRUCell above. Next, let's analyze why the computation is structured this way.
From the formula for h' we can see that (1 - z) and z are weighting coefficients in [0, 1], and their magnitudes determine how much each component contributes to the final state h'.
- We can think of h as long-term memory (i.e. the previous hidden state), while n is the current hidden state computed RNNCell-style, i.e. the short-term dependency. So z (the update gate) controls the balance between long-term and short-term dependencies.
- Careful readers may have noticed that the computation of n uses the reset gate r, and r controls the proportion of the previous hidden state h. Doesn't that conflict with the computation of the final hidden state h', since both formulas contain a weight controlling h?
I had the same question when I first read this; the original paper offers the following explanation:
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.
On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013).
As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active.
In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.
To summarize: r and z are two independent units that learn independently. r controls whether to drop long-term dependency information and keep only short-term dependencies, while z controls the proportion of long-term dependencies to retain, so the two do not conflict. It is somewhat like y = wx + b, which is controlled by w and b with a single neuron; if we refine it to y = (w x1 + b1) x2 + b, the weight component becomes more complex (two neurons instead of one), which brings more flexibility. I'm not sure whether this understanding is correct.
Network training is a black box: we may need some neurons to drop irrelevant long-term dependencies and keep only the short-term dependencies relevant to the current input, while other neurons need to retain the relevant long-term dependencies; both kinds contribute to the final result.
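The reset-gate behavior quoted above can also be checked numerically: forcing r toward 0 makes the candidate n depend only on the current input, regardless of the previous hidden state (a sketch with hand-picked weights, not a trained network):

```python
import torch

# hand-picked candidate-gate parameters for a 1-unit GRU (illustrative only)
w_in, b_in = torch.tensor([2.0]), torch.tensor([0.0])
w_hn, b_hn = torch.tensor([5.0]), torch.tensor([0.0])
x = torch.tensor([0.3])  # current input

def candidate(r, h):
    # n = tanh(W_in x + b_in + r * (W_hn h + b_hn))
    return torch.tanh(w_in * x + b_in + r * (w_hn * h + b_hn))

# With r = 0 the previous hidden state is ignored entirely:
# wildly different h values give the same candidate.
print(candidate(torch.tensor([0.0]), torch.tensor([10.0])))
print(candidate(torch.tensor([0.0]), torch.tensor([-10.0])))
# With r = 1 the previous hidden state flows through and dominates:
print(candidate(torch.tensor([1.0]), torch.tensor([10.0])))
```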
Summary
This article reproduced the GRU black box with simple, easy-to-understand formulas and code in order to deepen understanding. Suggestions are welcome.