Text Understanding with the Attention Sum Reader Network

This article introduces a simple machine reading comprehension model, the Attention Sum Reader. Using word embeddings, bidirectional GRU encoders, and an attention mechanism, the model handles the CNN/Daily Mail and CBT datasets effectively. The experiments show that even such a simple model can achieve excellent results.

This is the fourth post in the machine reading series. The model in this paper frequently appears in the related-work sections of recent machine reading papers and serves as the base model for many stronger ones, so it is well worth a close look: seeing far usually comes not from being tall, but from standing on high ground. The paper is titled Text Understanding with the Attention Sum Reader Network, its first author is Rudolf Kadlec, a researcher at IBM Watson, and it was first submitted to arXiv on March 4, 2016.

The model in this paper is called the Attention Sum Reader, shown in the figure below:

[Figure: the Attention Sum Reader model architecture]

Step 1: An embedding layer maps each word in the document and the query to a vector.

Step 2: A single-layer bidirectional GRU encodes the document into contextual representations; each word is represented by the concatenation of the forward and backward hidden states at its time step.

Step 3: A single-layer bidirectional GRU encodes the query; the concatenation of the last states of the two directions represents the query.

Step 4: The dot product of each contextual word vector with the query vector, normalized with a softmax, gives the attention weights, which measure the relevance between the query and each word in the document.

Step 5: Finally, the attention weights of identical words are merged by summation, giving one probability per candidate word; the word with the highest probability is the answer (see the sketch below).
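To make the five steps concrete, here is a minimal PyTorch sketch of the Attention Sum Reader. It is not the authors' implementation; the layer sizes, variable names, and toy inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSumReader(nn.Module):
    """Minimal sketch of the Attention Sum Reader (sizes are illustrative)."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        # Step 1: shared embedding layer for document and query words
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Step 2: single-layer bidirectional GRU as the document encoder
        self.doc_gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Step 3: single-layer bidirectional GRU as the query encoder
        self.qry_gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, doc_ids, qry_ids):
        # doc_ids: (batch, doc_len), qry_ids: (batch, qry_len)
        doc_h, _ = self.doc_gru(self.embed(doc_ids))      # (batch, doc_len, 2*hidden)
        _, qry_last = self.qry_gru(self.embed(qry_ids))   # (2, batch, hidden)
        # Query vector = concatenation of the last forward and backward states
        q = torch.cat([qry_last[0], qry_last[1]], dim=-1)  # (batch, 2*hidden)

        # Step 4: dot product of every contextual word vector with the query,
        # normalized with softmax, gives attention weights over document positions
        scores = torch.bmm(doc_h, q.unsqueeze(-1)).squeeze(-1)  # (batch, doc_len)
        attn = F.softmax(scores, dim=-1)

        # Step 5: "attention sum" -- weights of identical words are added together,
        # so each candidate word gets a single probability; the argmax is the answer
        word_probs = torch.zeros(doc_ids.size(0), self.embed.num_embeddings,
                                 device=doc_ids.device)
        word_probs.scatter_add_(1, doc_ids, attn)
        return word_probs

# Toy usage with random token ids
model = AttentionSumReader(vocab_size=1000)
doc = torch.randint(0, 1000, (2, 50))
qry = torch.randint(0, 1000, (2, 10))
print(model(doc, qry).shape)  # torch.Size([2, 1000])
```

The `scatter_add_` call is what gives the model its name: the attention weights at every position where the same word occurs are summed into a single score for that word.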

The model was evaluated on CNN/Daily Mail and on the CBT Named Entities and Common Nouns subsets, and achieved state-of-the-art results at the time. The experiments also yield some interesting observations: on CNN/Daily Mail, test accuracy drops as the document length grows, while on CBT the opposite trend appears. This shows that the two datasets have different characteristics and were constructed differently, so the same model can exhibit different trends on each.

Compared with the Attentive Reader and the Impatient Reader, this model is much simpler: it has no elaborate attention computation and simply uses dot products as the weights, yet it achieves better results than the Attentive Reader. This suggests that a more complex model with a more involved computation is not necessarily better; quite often, the simpler approach works better.

In addition, the related-work sections of these papers all mention using Memory Networks to tackle this problem. The next post will cover how Memory Networks are applied to machine reading comprehension; stay tuned.




Source: paperweekly



### Efficient Attention Mechanism with Linear Time Complexity

Efficient attention mechanisms have gained prominence because they reduce computational overhead while maintaining or improving performance. Traditional attention mechanisms build pairwise attention matrices, which scale quadratically with the spatial and temporal dimensions of the input. Some algorithms avoid this quadratic scaling by using strategies whose complexity is linear in the input size.

One such approach factorizes the attention mechanism into smaller components or approximates it with kernel functions. For instance, some methods decompose the dot-product attention computation into several lower-dimensional operations, which reduces memory requirements and speeds up processing. These techniques let models handle much longer sequences without the prohibitive cost of full self-attention. They also fit the broader supervised-learning setting, in which models learn mappings from given inputs to desired outputs. By restricting attention to relevant tokens rather than all possible pairs, such models learn more efficiently both in theory (smaller asymptotic bounds) and in practice across hardware platforms such as GPUs and TPUs.

Below is an example using one such variant, the Performer, a fast Transformer built on positive orthogonal random features:

```python
import torch
from performer_pytorch import PerformerLM

model = PerformerLM(
    num_tokens=20000,
    max_seq_len=512,    # maximum sequence length, used for positional embeddings
    dim=512,
    depth=6,
    heads=8,
    causal=True,
    nb_features=None,   # number of random features; None uses the library default
)

input_sequence = torch.randint(0, 20000, (1, 512))
output = model(input_sequence)
print(output.shape)  # (1, 512, 20000): per-position scores over the vocabulary
```

In summary, efficient attention schemes achieve comparable or better results at a fraction of the resource cost of conventional full attention, which makes them well suited to large datasets and real-time natural language understanding tasks.
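To complement the library call above, here is a minimal sketch of the kernel idea in its simplest form. The elu+1 feature map and the tensor shapes are illustrative assumptions (this is not the Performer's actual random-feature map): by applying a feature map to queries and keys and reordering the matrix products, the key-value summary is computed once, so the cost grows linearly with sequence length instead of quadratically.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention sketch: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), so the cost is linear in sequence length.
    phi is a simple elu+1 feature map, chosen only for illustration."""
    phi = lambda x: torch.nn.functional.elu(x) + 1            # positive feature map
    q, k = phi(q), phi(k)                                      # (batch, seq, dim)
    kv = torch.einsum('bnd,bne->bde', k, v)                    # (batch, dim, dim) summary
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)  # row normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)           # (batch, seq, dim)

# Toy usage: a 4096-token sequence with 64-dim heads,
# processed without ever forming a 4096 x 4096 attention matrix
q = torch.randn(1, 4096, 64)
k = torch.randn(1, 4096, 64)
v = torch.randn(1, 4096, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([1, 4096, 64])
```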