Attention Mechanism Model
Model: it is divided into an Encoder layer, an Attention layer, and a Decoder layer.
The previous hidden state s⟨t−1⟩ of the Decoder (post-attention) LSTM is copied Tx times and concatenated with all Tx Encoder activations a⟨1⟩, …, a⟨Tx⟩; this is the input to the Attention layer. The Attention layer then computes Tx weights α, one per input time step. Each activation a⟨t′⟩ is multiplied by its weight, i.e. α⋅a, and the sum of these products is the Attention layer's output, which is fed into the Decoder layer at one output time step for the subsequent computation.
The main idea is to weight each input time step's activation by its own attention weight, which expresses how much the current output should attend to that input word, and then combine the weighted activations.
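As a minimal numeric sketch of this weighted-sum idea (plain NumPy; the toy numbers and array names are illustrative only, not from the assignment):

```python
import numpy as np

# Toy example: Tx = 3 encoder time steps, each activation has 2 units.
a = np.array([[0.1, 0.2],    # a<1>
              [0.4, 0.3],    # a<2>
              [0.9, 0.5]])   # a<3>

# Attention weights for one output step t; non-negative and summing to 1.
alpha = np.array([0.1, 0.2, 0.7])

# context<t> = sum over t' of alpha<t,t'> * a<t'>
context = (alpha[:, None] * a).sum(axis=0)
print(context)  # [0.72 0.43] -- dominated by a<3>, which has the largest weight
```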
The figure below shows the Attention-to-context portion of the figure above; it is the implementation of the attention mechanism:
Process:
There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it the pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through Tx time steps; the post-attention LSTM goes through Ty time steps.
Translation: the model has two LSTM layers; the main differences are whether they sit before or after the attention mechanism and which time steps they connect to. The lower one is called the pre-attention Bi-LSTM and the upper one the post-attention LSTM; they belong to the Encoder and Decoder parts of a Seq2Seq model, respectively. Bi-LSTM means bidirectional LSTM. The pre-attention Bi-LSTM connects to the Tx input time steps, and the post-attention LSTM connects to the Ty output time steps.
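A short Keras sketch of the pre-attention Bi-LSTM (the sizes Tx, n_a, vocab_in are assumptions for illustration):

```python
from tensorflow.keras.layers import Input, LSTM, Bidirectional

Tx, n_a, vocab_in = 30, 32, 37   # assumed input length, LSTM units, one-hot size

X = Input(shape=(Tx, vocab_in))
# Encoder: a Bi-LSTM over all Tx input steps; return_sequences=True keeps one
# activation a<t'> per input step for the attention mechanism to weight later.
a = Bidirectional(LSTM(n_a, return_sequences=True))(X)   # shape (m, Tx, 2*n_a)
```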
The post-attention LSTM passes s⟨t⟩, c⟨t⟩ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-activation sequence model, so the state was captured by the RNN output activation s⟨t⟩. But since we are using an LSTM here, the LSTM has both the output activation s⟨t⟩ and the hidden cell state c⟨t⟩. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-activation LSTM at time t will not take the specific generated y⟨t−1⟩ as input; it only takes s⟨t⟩ and c⟨t⟩ as input. We have designed the model this way, because (unlike language generation where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
Translation: s refers to the hidden-state activation of the post-attention LSTM and c to its memory cell state; both are passed from one output time step to the next. Unlike the earlier text-generation examples, the post-attention LSTM does not take the previously generated character y⟨t−1⟩ as input, because in this task (dates) adjacent output characters are not strongly correlated. The initial values of the activation s and the memory cell c are the same as for an ordinary LSTM (usually zeros), while the per-step input comes from the Attention layer's computation.
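A hedged Keras sketch of how s⟨t⟩ and c⟨t⟩ are carried from one output step to the next; the layer sizes and tensor names here are assumptions, not the notebook's exact code:

```python
from tensorflow.keras.layers import Input, LSTM

n_s = 64          # assumed post-attention LSTM state size
n_context = 64    # assumed size of one context vector (2 * n_a)

# One shared LSTM layer; return_state=True makes it hand back both the output
# activation s<t> and the memory cell c<t>, so they can be passed to the next step.
post_attention_lstm = LSTM(n_s, return_state=True)

s0 = Input(shape=(n_s,))             # initial s (typically zeros)
c0 = Input(shape=(n_s,))             # initial c (typically zeros)
ctx1 = Input(shape=(1, n_context))   # context<1> from the attention layer
ctx2 = Input(shape=(1, n_context))   # context<2> from the attention layer

# Step 1: only the context is fed in -- the previous prediction y<t-1> is not.
s1, _, c1 = post_attention_lstm(ctx1, initial_state=[s0, c0])
# Step 2: the state (s1, c1) from step 1 carries over.
s2, _, c2 = post_attention_lstm(ctx2, initial_state=[s1, c1])
```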
We use a⟨t⟩ = [a→⟨t⟩; a←⟨t⟩] to represent the concatenation of the activations of both the forward-direction and backward-direction of the pre-attention Bi-LSTM.
Translation: this is the notation for the bidirectional RNN; a⟨t⟩ concatenates the forward and backward activations of the pre-attention Bi-LSTM.
The diagram on the right uses a RepeatVector node to copy s⟨t−1⟩'s value Tx times, and then Concatenation to concatenate s⟨t−1⟩ and a⟨t⟩ to compute e⟨t,t′⟩, which is then passed through a softmax to compute α⟨t,t′⟩. We'll explain how to use RepeatVector and Concatenation in Keras below.
Translation: in the Attention layer, s⟨t−1⟩ is the previous hidden state of the post-attention (Decoder) LSTM. It is copied Tx times so that it can be concatenated (similar to forming an augmented matrix) with the outputs a of the Encoder layer's Tx time steps, and the result is the input to the Attention layer.
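A quick sketch of what RepeatVector and Concatenate do to the tensor shapes (the sizes Tx, n_a, n_s are assumed for illustration):

```python
from tensorflow.keras.layers import Input, RepeatVector, Concatenate

Tx, n_a, n_s = 30, 32, 64   # assumed: input length, Bi-LSTM units, decoder state size

a = Input(shape=(Tx, 2 * n_a))   # all pre-attention Bi-LSTM activations, (Tx, 2*n_a)
s_prev = Input(shape=(n_s,))     # previous decoder hidden state s<t-1>, (n_s,)

s_rep = RepeatVector(Tx)(s_prev)            # (Tx, n_s): the same s<t-1> copied Tx times
concat = Concatenate(axis=-1)([a, s_rep])   # (Tx, 2*n_a + n_s): one row per input step
print(concat.shape)                         # (None, 30, 128)
```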
Model implementation: implement two functions, one_step_attention() and model().
one_step_attention(): At step t, given all the hidden states of the Bi-LSTM ([a⟨1⟩, a⟨2⟩, ..., a⟨Tx⟩]) and the previous hidden state of the second LSTM (s⟨t−1⟩), one_step_attention() will compute the attention weights ([α⟨t,1⟩, α⟨t,2⟩, ..., α⟨t,Tx⟩]) and output the context vector (see Figure 1 (right) for details):
context⟨t⟩ = ∑_{t′=0}^{Tx} α⟨t,t′⟩ ⋅ a⟨t′⟩        (1)
Note that we are denoting the attention in this notebook context⟨t⟩. In the lecture videos, the context was denoted c⟨t⟩, but here we are calling it context⟨t⟩ to avoid confusion with the (post-attention) LSTM's internal memory cell variable, which is sometimes also denoted c⟨t⟩.
Translation: at each output time step t, the previous hidden state s⟨t−1⟩ of the post-attention (Decoder) LSTM is copied Tx times so that it can be concatenated with all of the Encoder layer's outputs as the input to the Attention layer. The computation yields Tx weights α; each α is multiplied by the output of the corresponding time step (the activation weighted by its attention weight), and the products are summed to give the final context, which is the Decoder layer's input at that step.
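Putting the pieces together, a sketch of what one_step_attention() could look like in Keras; the small two-layer "energy" network and the layer sizes are typical choices rather than a definitive implementation:

```python
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

Tx = 30  # assumed input sequence length

# Shared layers: defined once, reused at every output time step.
repeator     = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1      = Dense(10, activation="tanh")   # small hidden layer for the energies
densor2      = Dense(1, activation="relu")    # one energy e<t,t'> per input step
attention    = Softmax(axis=1)                # normalize over the Tx input steps
dotor        = Dot(axes=1)                    # weighted sum of the a<t'>

def one_step_attention(a, s_prev):
    """a: (m, Tx, 2*n_a) Bi-LSTM activations; s_prev: (m, n_s) previous decoder state.
    Returns context: (m, 1, 2*n_a)."""
    s_prev = repeator(s_prev)              # (m, Tx, n_s)
    concat = concatenator([a, s_prev])     # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                    # (m, Tx, 10)
    energies = densor2(e)                  # (m, Tx, 1)
    alphas = attention(energies)           # (m, Tx, 1), sums to 1 over the Tx axis
    context = dotor([alphas, a])           # (m, 1, 2*n_a) = sum_t' alpha<t,t'> * a<t'>
    return context
```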
model(): Implements the entire model. It first runs the input through a Bi-LSTM to get back [a⟨1⟩, a⟨2⟩, ..., a⟨Tx⟩]. Then, it calls one_step_attention() Ty times (for loop). At each iteration of this loop, it gives the computed context vector c⟨t⟩ to the second LSTM, and runs the output of the LSTM through a dense layer with softmax activation to generate a prediction ŷ⟨t⟩.
Translation: in the model() function, the Encoder layer is computed first, and then its cached outputs are used for the Decoder-layer computation. After Ty time steps we obtain the final outputs; at every output time step the Attention layer is called once, and each call uses the cached Encoder outputs of all Tx time steps, as sketched below.
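A matching sketch of model(): run the Bi-LSTM once, then loop Ty times over the attention step and the post-attention LSTM. Names and sizes are illustrative, and it reuses the one_step_attention() sketch above:

```python
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense
from tensorflow.keras.models import Model

Tx, Ty = 30, 10                 # assumed input / output sequence lengths
n_a, n_s = 32, 64               # assumed Bi-LSTM units and post-attention LSTM units
vocab_in, vocab_out = 37, 11    # assumed vocabulary sizes (one-hot inputs / outputs)

post_lstm    = LSTM(n_s, return_state=True)
output_layer = Dense(vocab_out, activation="softmax")

def model():
    X  = Input(shape=(Tx, vocab_in))
    s0 = Input(shape=(n_s,))
    c0 = Input(shape=(n_s,))
    s, c = s0, c0

    # Encoder: run once over all Tx input steps.
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)   # (m, Tx, 2*n_a)

    outputs = []
    # Decoder: Ty steps, each one re-attends over all Tx cached encoder outputs.
    for t in range(Ty):
        context = one_step_attention(a, s)                   # from the sketch above
        s, _, c = post_lstm(context, initial_state=[s, c])   # (m, n_s)
        outputs.append(output_layer(s))                      # (m, vocab_out)

    return Model(inputs=[X, s0, c0], outputs=outputs)
```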