Attention Model

Latest recommended article published on 2023-03-16 15:04:43

License: CC 4.0 BY-SA

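This article introduces the Encoder-Decoder model for sequence data, in particular the use of a Bi-RNN as the encoder. The decoder section explains the attention mechanism: for each output y(t), the attention weight alpha(t, t') is determined jointly by the previous decoder hidden state s(t-1) and the encoder hidden state a'(t'). A neural network is trained to compute these weights, and a softmax ensures that they sum to 1. Different papers propose different network structures to improve how attention is allocated and, with it, the quality of the output.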

1. Encoder: Bi-RNN

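[Figure: Bi-RNN encoder (image not available)]
x(1:T): input sequence of length T
a(1:T): hidden state / activation value / history vector; it stores the information from the current and all previous time steps
a(0), a(end): the hidden states at the start and end tokens
a'(1:T): the concatenation of the forward and backward history vectors, →a(t) and ←a(t)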
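
The definitions above can be sketched in NumPy. This is a minimal illustration rather than the original author's code: the plain tanh RNN cell, the parameter names (W_x, W_a, b), the helper init_params, and all dimensions are assumptions chosen for the example. It shows how a'(t) is formed by concatenating the forward state →a(t) with the backward state ←a(t) at the same time step.

```python
import numpy as np

def rnn_pass(x, W_x, W_a, b):
    """Run a simple tanh RNN over x (shape T x d_in); return all states (T x d_hid)."""
    T = x.shape[0]
    d_hid = b.shape[0]
    a_prev = np.zeros(d_hid)            # a(0): initial hidden state (assumed zero)
    states = np.zeros((T, d_hid))
    for t in range(T):
        a_prev = np.tanh(W_x @ x[t] + W_a @ a_prev + b)
        states[t] = a_prev
    return states

def bi_rnn_encode(x, fwd_params, bwd_params):
    """Return a'(1:T): forward and backward states concatenated per time step."""
    a_fwd = rnn_pass(x, *fwd_params)
    # Backward direction: read the sequence right-to-left, then flip the
    # states back so that row t again corresponds to input position t.
    a_bwd = rnn_pass(x[::-1], *bwd_params)[::-1]
    return np.concatenate([a_fwd, a_bwd], axis=1)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, d_in, d_hid = 5, 3, 4

def init_params():
    return (rng.normal(size=(d_hid, d_in)),
            rng.normal(size=(d_hid, d_hid)),
            np.zeros(d_hid))

x = rng.normal(size=(T, d_in))
a_prime = bi_rnn_encode(x, init_params(), init_params())
print(a_prime.shape)  # (5, 8): T rows, each of size 2 * d_hid
```

Each row of a_prime therefore sees the whole sentence: the first half summarizes x(1:t), the second half summarizes x(t:T).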

2. Decoder:

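[Figure: decoder with attention (image not available)]
s: the decoder hidden state, written s to distinguish it from the encoder's a
y^(1:T): output sequence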

The input marked "?" is crucial: it determines which parts of the Bi-RNN underneath the output y(t) can pay attention to. For each y(t), we have:
[Figure: image not available]
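
The image for this step did not survive. Assuming it showed the usual attention context vector, the relation can be written as below. The symbol c(t) for the "?" input is an assumption, not notation taken from the original text; alpha(t, t') and a'(t') are the quantities defined in this article.

```latex
c^{(t)} = \sum_{t'=1}^{T} \alpha^{(t,t')} \, {a'}^{(t')},
\qquad
\sum_{t'=1}^{T} \alpha^{(t,t')} = 1,
\qquad
\alpha^{(t,t')} \ge 0
```

In words: the input fed to decoder step t is a weighted average of all encoder states, and the weights for a fixed t form a probability distribution over the input positions t'.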

**Key question:** how do we obtain the attention weight alpha(t, t') for each y(t), i.e. the amount of attention that y(t) pays to each a'(t')?

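What does the attention weight alpha(t, t') of each y(t) depend on?
It depends on the decoder's hidden state from the previous output step, s(t-1), and on the encoder hidden state a'(t') at the input position t' that is being attended to.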

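Hence, we train a small neural network that takes s(t-1) and a'(t') as inputs and outputs a score e(t, t'). The weights are then alpha(t, t') = softmax(e(t, t')), where the softmax is taken over t' and yields a vector of length T, the sequence length. It holds one attention weight for each word in the sequence 1:T, and the weights sum to 1, which is why we use softmax. Different papers propose different designs for this network so that attention is concentrated better and the output improves.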
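
One common design for this scoring network is an additive (one hidden layer, tanh) form. The sketch below assumes that form; it is only one of the designs found in the literature, not necessarily the one in the referenced lecture, and the parameter names (W_s, W_a, v) and all sizes are hypothetical. It computes e(t, t') for every t', applies softmax to get alpha(t, t'), and forms the weighted sum of the encoder states.

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over a 1-D score vector."""
    w = np.exp(e - np.max(e))
    return w / w.sum()

def attention_step(s_prev, a_prime, W_s, W_a, v):
    """One decoder step of additive attention (assumed scoring network).

    s_prev  : (d_s,)    previous decoder state s(t-1)
    a_prime : (T, d_a)  encoder states a'(1:T)
    Returns alpha (T,) and the context vector (d_a,).
    """
    # Score every input position t' from the pair (s(t-1), a'(t')).
    hidden = np.tanh(a_prime @ W_a.T + W_s @ s_prev)   # (T, d_h)
    e = hidden @ v                                     # (T,)  scores e(t, t')
    alpha = softmax(e)                                 # (T,)  weights, sum to 1
    context = alpha @ a_prime                          # (d_a,) weighted sum of a'(t')
    return alpha, context

# Hypothetical sizes and random (untrained) parameters, for illustration only.
rng = np.random.default_rng(0)
T, d_a, d_s, d_h = 5, 8, 6, 10
a_prime = rng.normal(size=(T, d_a))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_h, d_s))
W_a = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_h)

alpha, context = attention_step(s_prev, a_prime, W_s, W_a, v)
print(alpha)          # one weight per input position
print(alpha.sum())    # equals 1 up to floating-point rounding
```

In a real model W_s, W_a and v would be learned jointly with the encoder and decoder by backpropagation; here they are random, so the weights carry no meaning beyond showing the shapes and the normalization.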

reference:

[1] https://www.bilibili.com/video/BV1cb411W7w9
[2] 4F10 notes: Deep Learning for Sequence Data