Attention Is All You Need (Transformer Model)

1 Title

        Attention Is All You Need (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin)

2 Conclusion

        This work presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. Previously, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks comprising an encoder and a decoder, with the best-performing models also connecting the encoder and decoder through an attention mechanism. The Transformer is instead a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

3 Good Sentences

1. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. (The shortcomings of related work)
2. To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. (The contribution of this work)
3. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours. (Future challenges for the Transformer model)


The Transformer is a model that uses the attention mechanism to speed up training. It is a deep learning model based entirely on self-attention; because it lends itself to parallel computation and has ample model capacity, it surpasses the previously dominant recurrent neural networks (RNNs) in both accuracy and performance.
So many people have already written explanations that I only skimmed the paper once before turning to those write-ups; I recommend the Zhihu article "Transformer模型详解(图解最完整版)" (zhihu.com).

### Attention Is All You Need: Model Architecture

The Transformer relies entirely on self-attention, doing away with the traditional recurrent (RNN) and convolutional (CNN) layers, which makes training parallelizable and improves both efficiency and performance. The heart of the model is its encoder-decoder structure combined with multi-head self-attention.

#### Stacked Encoders and Decoders

The model consists of a stack of identical encoder layers and a stack of identical decoder layers. Each encoder layer contains two sub-layers: a multi-head self-attention sub-layer and a position-wise fully connected feed-forward network; each decoder layer adds a third sub-layer that performs multi-head attention over the encoder output. This design lets each block process information independently and capture relationships between positions regardless of how far apart they are.

#### Multi-Head Self-Attention

Multi-head attention extends ordinary single-head attention to strengthen the model's expressive power. In each attention computation, several parallel "heads" are created, each of which can learn a different feature representation; the results are then concatenated and passed through a linear transformation to produce the final output. This increases both flexibility and model capacity. (A sketch of this computation is given at the end of this section.)

#### Residual Connections and Normalization in the Forward Pass

To keep information from being lost in a deep network, a skip connection is added around every sub-layer: the sub-layer's input is added directly to its output, forming a residual. Layer normalization is then applied to the sum to stabilize the value range and speed up convergence.

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Layer normalization over the last (feature) dimension, with learnable scale and shift."""

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))   # learnable scale
        self.b_2 = nn.Parameter(torch.zeros(features))  # learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
```

#### Input Embeddings and Positional Encoding

Because the raw sequence of embeddings carries no order information, an absolute or relative positional encoding has to be added to the word vectors X ∈ R^{L×d_model}, where L is the sequence length. The paper uses a family of sine and cosine functions, which preserves the relative distances between words and also extends conveniently to longer sequences.

```python
import math

import torch


def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding, returned with shape (1, max_len, d_model)."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1)
    # Wavelengths grow geometrically across the embedding dimensions.
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position.float() * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position.float() * div_term)  # odd dimensions: cosine
    return pe.unsqueeze(0)
```
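As a follow-up, here is a minimal usage sketch, assuming the `positional_encoding` function defined above is in scope, showing how the sinusoidal encoding can be added to token embeddings scaled by sqrt(d_model) as described in the paper; the hyperparameter values and variable names below are illustrative only.

```python
import math

import torch
import torch.nn as nn

# Assumed hyperparameters, chosen only for illustration.
vocab_size, d_model, max_len = 10000, 512, 100

embedding = nn.Embedding(vocab_size, d_model)
pe = positional_encoding(max_len, d_model)   # function defined above

tokens = torch.randint(0, vocab_size, (2, 20))   # batch of 2 sequences, 20 tokens each
x = embedding(tokens) * math.sqrt(d_model)       # scale embeddings as in the paper
x = x + pe[:, : tokens.size(1)]                  # add positional information
print(x.shape)  # torch.Size([2, 20, 512])
```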
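Finally, to make the multi-head self-attention computation described above concrete, the following is a minimal PyTorch sketch of scaled dot-product attention split across several heads; the class name `MultiHeadAttention`, its constructor arguments, and the example shapes are assumptions for illustration, not code from the paper or its reference implementation.

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Illustrative multi-head scaled dot-product attention (not the reference implementation)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads   # per-head dimension
        self.num_heads = num_heads
        # Learned projections for queries, keys, values, and the final output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project and split into heads: (batch, heads, seq_len, d_k).
        def split_heads(x, proj):
            return proj(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.w_q)
        k = split_heads(key, self.w_k)
        v = split_heads(value, self.w_v)

        # Scaled dot-product attention within each head.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)

        # Concatenate the heads and apply the final linear transformation.
        out = torch.matmul(attn, v).transpose(1, 2).contiguous()
        out = out.view(batch_size, -1, self.num_heads * self.d_k)
        return self.w_o(out)


# Example: a batch of 2 sequences of 10 tokens, d_model = 512, 8 heads.
x = torch.randn(2, 10, 512)
mha = MultiHeadAttention(d_model=512, num_heads=8)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])
```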