Attention Mechanism Model
Model: it is divided into an Encoder layer, an Attention layer, and a Decoder layer.
The previous hidden state s⟨t−1⟩ of the Decoder (post-attention) LSTM is copied Tx times and concatenated with all Tx Encoder activations a⟨1⟩, …, a⟨Tx⟩; this is the input to the Attention layer. The Attention layer then computes Tx weights α, one per input time step. Each activation a⟨t′⟩ is multiplied by its weight, i.e. α⋅a, and the sum of these products is the Attention layer's output, which is fed into the Decoder layer at one output time step for the subsequent computation.
The main idea is to weight each input time step's activation by its own attention weight, which expresses how much the current output should attend to that input word, and then combine the weighted activations.
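As a minimal numeric sketch of this weighted-sum idea (plain NumPy; the toy numbers and array names are illustrative only, not from the assignment):

```python
import numpy as np

# Toy example: Tx = 3 encoder time steps, each activation has 2 units.
a = np.array([[0.1, 0.2],    # a<1>
              [0.4, 0.3],    # a<2>
              [0.9, 0.5]])   # a<3>

# Attention weights for one output step t; non-negative and summing to 1.
alpha = np.array([0.1, 0.2, 0.7])

# context<t> = sum over t' of alpha<t,t'> * a<t'>
context = (alpha[:, None] * a).sum(axis=0)
print(context)  # [0.72 0.43] -- dominated by a<3>, which has the largest weight
```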
The figure below shows the Attention-to-context portion of the figure above; it is the implementation of the attention mechanism:
Process:
There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it the pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through Tx time steps; the post-attention LSTM goes through Ty time steps.
Translation: the model has two LSTM layers; the main differences are whether they sit before or after the attention mechanism and which time steps they connect to. The lower one is called the pre-attention Bi-LSTM and the upper one the post-attention LSTM; they belong to the Encoder and Decoder parts of a Seq2Seq model, respectively. Bi-LSTM means bidirectional LSTM. The pre-attention Bi-LSTM connects to the Tx input time steps, and the post-attention LSTM connects to the Ty output time steps.
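A short Keras sketch of the pre-attention Bi-LSTM (the sizes Tx, n_a, vocab_in are assumptions for illustration):

```python
from tensorflow.keras.layers import Input, LSTM, Bidirectional

Tx, n_a, vocab_in = 30, 32, 37   # assumed input length, LSTM units, one-hot size

X = Input(shape=(Tx, vocab_in))
# Encoder: a Bi-LSTM over all Tx input steps; return_sequences=True keeps one
# activation a<t'> per input step for the attention mechanism to weight later.
a = Bidirectional(LSTM(n_a, return_sequences=True))(X)   # shape (m, Tx, 2*n_a)
```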
The post-attention LSTM passes s⟨t⟩, c⟨t⟩ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-activation sequence model, so the state was captured by the RNN output activation s⟨t⟩. But since we are using an LSTM here, the LSTM has both the output activation s⟨t⟩ and the hidden cell state c⟨t⟩. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-activation LSTM at time t will not take the specific generated y⟨t−1⟩ as input; it only takes s⟨t⟩ and c⟨t⟩ as input. We have designed the model this way, because (unlike language generation where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
Translation: s refers to the hidden-state activation of the post-attention LSTM and c to its memory cell state; both are passed from one output time step to the next. Unlike the earlier text-generation examples, the post-attention LSTM does not take the previously generated character y⟨t−1⟩ as input, because in this task (dates) adjacent output characters are not strongly correlated. The initial values of the activation s and the memory cell c are the same as for an ordinary LSTM (usually zeros), while the per-step input comes from the Attention layer's computation.
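A hedged Keras sketch of how s⟨t⟩ and c⟨t⟩ are carried from one output step to the next; the layer sizes and tensor names here are assumptions, not the notebook's exact code:

```python
from tensorflow.keras.layers import Input, LSTM

n_s = 64          # assumed post-attention LSTM state size
n_context = 64    # assumed size of one context vector (2 * n_a)

# One shared LSTM layer; return_state=True makes it hand back both the output
# activation s<t> and the memory cell c<t>, so they can be passed to the next step.
post_attention_lstm = LSTM(n_s, return_state=True)

s0 = Input(shape=(n_s,))             # initial s (typically zeros)
c0 = Input(shape=(n_s,))             # initial c (typically zeros)
ctx1 = Input(shape=(1, n_context))   # context<1> from the attention layer
ctx2 = Input(shape=(1, n_context))   # context<2> from the attention layer

# Step 1: only the context is fed in -- the previous prediction y<t-1> is not.
s1, _, c1 = post_attention_lstm(ctx1, initial_state=[s0, c0])
# Step 2: the state (s1, c1) from step 1 carries over.
s2, _, c2 = post_attention_lstm(ctx2, initial_state=[s1, c1])
```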
We use a⟨t⟩ = [a→⟨t⟩; a←⟨t⟩] to represent the concatenation of the activations of both the forward-direction and backward-direction of the pre-attention Bi-LSTM.
Translation: this is the notation for the bidirectional RNN; a⟨t⟩ concatenates the forward and backward activations of the pre-attention Bi-LSTM.
The diagram on the right uses a RepeatVector node to copy s⟨t−1⟩'s value Tx times, and then Concatenation to concatenate s⟨t−1⟩ and a⟨t⟩ to compute e⟨t,t′⟩, which is then passed through a softmax to compute α⟨t,t′⟩. We'll explain how to use RepeatVector and Concatenation in Keras below.
Translation: in the Attention layer, s⟨t−1⟩ is the previous hidden state of the post-attention (Decoder) LSTM. It is copied Tx times so that it can be concatenated (similar to forming an augmented matrix) with the outputs a of the Encoder layer's Tx time steps, and the result is the input to the Attention layer.
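A quick sketch of what RepeatVector and Concatenate do to the tensor shapes (the sizes Tx, n_a, n_s are assumed for illustration):

```python
from tensorflow.keras.layers import Input, RepeatVector, Concatenate

Tx, n_a, n_s = 30, 32, 64   # assumed: input length, Bi-LSTM units, decoder state size

a = Input(shape=(Tx, 2 * n_a))   # all pre-attention Bi-LSTM activations, (Tx, 2*n_a)
s_prev = Input(shape=(n_s,))     # previous decoder hidden state s<t-1>, (n_s,)

s_rep = RepeatVector(Tx)(s_prev)            # (Tx, n_s): the same s<t-1> copied Tx times
concat = Concatenate(axis=-1)([a, s_rep])   # (Tx, 2*n_a + n_s): one row per input step
print(concat.shape)                         # (None, 30, 128)
```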
Model implementation: implement two functions, one_step_attention() and model().
one_step_attention(): At step t, given all the hidden states of the Bi-LSTM ([a⟨1⟩, a⟨2⟩, ..., a⟨Tx⟩]) and the previous hidden state of the second LSTM (s⟨t−1⟩), one_step_attention() will compute the attention weights ([α⟨t,1⟩, α⟨t,2⟩, ..., α⟨t,Tx⟩]) and output the context vector (see Figure 1 (right) for details):
context⟨t⟩ = ∑_{t′=0}^{Tx} α⟨t,t′⟩ ⋅ a⟨t′⟩        (1)
Note that we are denoting the attention in this notebook context⟨t⟩. In the lecture videos, the context was denoted c⟨t⟩, but here we are calling it context⟨t⟩ to avoid confusion with the (post-attention) LSTM's internal memory cell variable, which is sometimes also denoted c⟨t⟩.
Translation: at each output time step t, the previous hidden state s⟨t−1⟩ of the post-attention (Decoder) LSTM is copied Tx times so that it can be concatenated with all of the Encoder layer's outputs as the input to the Attention layer. The computation yields Tx weights α; each α is multiplied by the output of the corresponding time step (the activation weighted by its attention weight), and the products are summed to give the final context, which is the Decoder layer's input at that step.
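Putting the pieces together, a sketch of what one_step_attention() could look like in Keras; the small two-layer "energy" network and the layer sizes are typical choices rather than a definitive implementation:

```python
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

Tx = 30  # assumed input sequence length

# Shared layers: defined once, reused at every output time step.
repeator     = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1      = Dense(10, activation="tanh")   # small hidden layer for the energies
densor2      = Dense(1, activation="relu")    # one energy e<t,t'> per input step
attention    = Softmax(axis=1)                # normalize over the Tx input steps
dotor        = Dot(axes=1)                    # weighted sum of the a<t'>

def one_step_attention(a, s_prev):
    """a: (m, Tx, 2*n_a) Bi-LSTM activations; s_prev: (m, n_s) previous decoder state.
    Returns context: (m, 1, 2*n_a)."""
    s_prev = repeator(s_prev)              # (m, Tx, n_s)
    concat = concatenator([a, s_prev])     # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                    # (m, Tx, 10)
    energies = densor2(e)                  # (m, Tx, 1)
    alphas = attention(energies)           # (m, Tx, 1), sums to 1 over the Tx axis
    context = dotor([alphas, a])           # (m, 1, 2*n_a) = sum_t' alpha<t,t'> * a<t'>
    return context
```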
model(): Implements the entire model. It first runs the input through a Bi-LSTM to get back [a⟨1⟩, a⟨2⟩, ..., a⟨Tx⟩]. Then, it calls one_step_attention() Ty times (for loop). At each iteration of this loop, it gives the computed context vector c⟨t⟩ to the second LSTM, and runs the output of the LSTM through a dense layer with softmax activation to generate a prediction ŷ⟨t⟩.
Translation: in the model() function, the Encoder layer is computed first, and then its cached outputs are used for the Decoder-layer computation. After Ty time steps we obtain the final outputs; at every output time step the Attention layer is called once, and each call uses the cached Encoder outputs of all Tx time steps, as sketched below.
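A matching sketch of model(): run the Bi-LSTM once, then loop Ty times over the attention step and the post-attention LSTM. Names and sizes are illustrative, and it reuses the one_step_attention() sketch above:

```python
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense
from tensorflow.keras.models import Model

Tx, Ty = 30, 10                 # assumed input / output sequence lengths
n_a, n_s = 32, 64               # assumed Bi-LSTM units and post-attention LSTM units
vocab_in, vocab_out = 37, 11    # assumed vocabulary sizes (one-hot inputs / outputs)

post_lstm    = LSTM(n_s, return_state=True)
output_layer = Dense(vocab_out, activation="softmax")

def model():
    X  = Input(shape=(Tx, vocab_in))
    s0 = Input(shape=(n_s,))
    c0 = Input(shape=(n_s,))
    s, c = s0, c0

    # Encoder: run once over all Tx input steps.
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)   # (m, Tx, 2*n_a)

    outputs = []
    # Decoder: Ty steps, each one re-attends over all Tx cached encoder outputs.
    for t in range(Ty):
        context = one_step_attention(a, s)                   # from the sketch above
        s, _, c = post_lstm(context, initial_state=[s, c])   # (m, n_s)
        outputs.append(output_layer(s))                      # (m, vocab_out)

    return Model(inputs=[X, s0, c0], outputs=outputs)
```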