"Attention Is All You Need"

Proposal

The paper proposes the Transformer, a model architecture that eschews recurrence and instead relies entirely on attention mechanisms to draw global dependencies between input and output.

Contributions

1. The Transformer allows for significantly more parallelization.

2. The Transformer reaches a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

3. The Transformer is more interpretable: its attention distributions can be inspected directly.

4. The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Architecture

Attention function

In the paper's figure, the left panel shows Scaled Dot-Product Attention and the right panel shows Multi-Head Attention, which consists of several attention layers running in parallel.
Note that the total computational cost of multi-head attention is similar to that of single-head attention with full dimensionality.
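
To make that cost claim concrete, here is a rough back-of-the-envelope sketch (my own illustrative numbers, using the base model's d_model = 512 and h = 8): splitting attention into h heads of dimension d_k = d_model / h uses the same number of projection parameters as a single head with full dimensionality.

```python
# Illustrative parameter count behind the "similar total cost" claim.
d_model = 512          # model dimension of the base Transformer
h = 8                  # number of attention heads
d_k = d_model // h     # 64 dimensions per head

# Single-head attention with full dimensionality:
# one d_model x d_model projection each for Q, K, V.
single_head_params = 3 * d_model * d_model

# Multi-head attention: h heads, each projecting Q, K, V down to d_k,
# so the per-head projections again add up to d_model x d_model each.
multi_head_params = 3 * h * (d_model * d_k)

print(single_head_params, multi_head_params)  # 786432 786432 -- identical
```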

Positional Encoding

Positional encodings inject information about the relative or absolute position of the tokens in the sequence, which the model otherwise lacks because it contains no recurrence and no convolution.
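
A minimal sketch of the fixed sinusoidal encodings the paper uses, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the toy shapes below are mine.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sine/cosine positional encodings added to the token embeddings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # torch.Size([50, 512])
```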

Key Takeaways

Honestly, the whole paper feels like one long list of key points.

### On the "Attention Is All You Need" paper and its attention mechanism

#### What is an attention mechanism?

An attention mechanism is a technique modeled on human visual attention: when processing sequence data, it dynamically assigns weights to different parts of the input. This lets the model focus on the most relevant context and improves performance[^1].

#### Definition and role of self-attention

Self-attention is a particular form of attention that captures global dependencies by relating different positions within the same sequence. It is the core building block of the Transformer architecture. Compared with traditional RNN- or CNN-based approaches, self-attention offers a more efficient way to model long-range dependencies[^4].

#### Additive attention and other variants

Besides dot-product attention, there are other compatibility functions for scoring how well two vectors match. Additive attention, for example, computes the compatibility between a query and a key with a single-layer feed-forward network[^2]. Each variant has its own characteristics and preferred use cases, so the choice depends on the application.

#### Dropout against overfitting

To reduce the overfitting that complex models are prone to, dropout is a standard technique when training large deep learning architectures: it randomly drops units and their connections during training to improve generalization. Srivastava et al. showed that this simple method significantly improves test-set performance[^3].

```python
import copy
import math

import torch
import torch.nn as nn


def clones(module, N):
    """Produce N identical copies of a module (helper included so the snippet is self-contained)."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


def attention(query, key, value, mask=None, dropout=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions receive a large negative score and vanish after softmax.
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn


class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k.
        self.d_k = d_model // h
        self.h = h
        # Four linear layers: Q, K, V projections plus the final output projection.
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k.
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        del query, key, value
        return self.linears[-1](x)
```

The code above builds a multi-head self-attention module, one of the core structures of the architecture described in "Attention Is All You Need" (a short usage sketch follows below).

#### Factors affecting performance

In the paper's experiments, varying the number of heads and the key/value dimensions in multi-head attention noticeably affects translation quality. Increasing the number of heads helps up to a point, but beyond a certain value the BLEU score drops[^4].
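
As a quick sanity check on the MultiHeadedAttention module defined above (assuming those definitions are in scope; the toy batch and sequence sizes are arbitrary):

```python
mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
out = mha(x, x, x)            # self-attention: query = key = value
print(out.shape)              # torch.Size([2, 10, 512])
print(mha.attn.shape)         # torch.Size([2, 8, 10, 10]) -- per-head attention weights
```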