Attention Is All You Need
Abstract
Task: machine translation
Conventional approach: RNNs/CNNs, possibly combined with an attention mechanism
This paper's approach: the Transformer (the robot? the electrical device?), using attention only
Introduction
RNNs' inherently sequential nature precludes parallelization within training examples.
In prior work, attention mechanisms are used in conjunction with a recurrent network.
Transformer: relying entirely on an attention mechanism to draw global dependencies between input and output
Background
Self-attention (previously covered in the SAGAN notes): intra-attention, relating different positions of a single sequence in order to compute a representation of the sequence
Model Architecture
encoder-decoder structure
encoder:
$(x_1, \dots, x_n) \mapsto (z_1, \dots, z_n)$
auto-regressive: when generating each symbol, the decoder consumes the previously generated symbols as additional input
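
A minimal sketch of what auto-regressive (greedy) decoding looks like in practice, assuming a hypothetical `model.encode` / `model.decode` interface and placeholder `bos_id` / `eos_id` token ids (none of these names come from the paper):

```python
def greedy_decode(model, src_tokens, max_len=128, bos_id=1, eos_id=2):
    """Generate output symbols one at a time, feeding each prediction back in."""
    memory = model.encode(src_tokens)        # the encoder's (z1, ..., zn)
    out = [bos_id]
    for _ in range(max_len):
        logits = model.decode(memory, out)   # conditioned on all symbols generated so far
        next_id = int(logits[-1].argmax())   # greedily pick the most likely next symbol
        out.append(next_id)
        if next_id == eos_id:
            break
    return out
```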

Scaled Dot-Product Attention: queries $Q$, keys $K$, values $V$, dimension of keys $d_k$
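
A minimal NumPy sketch of scaled dot-product attention, $\mathrm{softmax}(QK^T/\sqrt{d_k})\,V$; the `mask` argument is my own addition to cover the decoder's masked variant:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compare every query with every key
    if mask is not None:                          # mask: True where attention is allowed
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                            # weighted sum of the values
```

Scaling by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$, which would otherwise push the softmax into regions with very small gradients.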

Multi-Head Attention: concatenate, project
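
A sketch of multi-head attention on top of the previous function: project, split $d_{\text{model}}$ into `num_heads` slices, attend in each head in parallel, concatenate, and project back. Using one big projection matrix per role and slicing it is a simplification of the per-head $W_i^Q, W_i^K, W_i^V$ in the paper:

```python
import numpy as np

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, num_heads):
    """X_q: (n_q, d_model), X_kv: (n_kv, d_model); all W_* are (d_model, d_model)."""
    d_model = X_q.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v   # project inputs to queries/keys/values
    heads = []
    for h in range(num_heads):                    # each head attends over its own slice
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate the heads, then project
```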

fully connected feed-forward network: two fully connected layers with a ReLU in between, applied to each position identically
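
A sketch of the position-wise feed-forward network, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (n, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model).
    FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2
```
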
Positional Encoding: injects information about the order of the sequence, i.e., the relative or absolute position of the tokens in the sequence
allow the model to easily learn to attend by relative positions
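
The sinusoidal encoding from the paper ($pos$ is the position, $i$ the dimension index):

$$
PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)
$$

For any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, which is what makes relative positions easy to attend to.
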
Why Self-Attention
- the total computational complexity per layer (see the comparison after this list)
- the amount of computation that can be parallelized
- the path length between long-range dependencies in the network
- yield more interpretable models
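
For reference, the per-layer comparison from the paper's Table 1 (quoted from memory; $n$ = sequence length, $d$ = representation dimension, $k$ = convolution kernel width):

$$
\begin{aligned}
\text{Self-Attention:} &\quad O(n^2 \cdot d) \text{ per layer}, \quad O(1) \text{ sequential ops}, \quad O(1) \text{ max path length} \\
\text{Recurrent:} &\quad O(n \cdot d^2), \quad O(n), \quad O(n) \\
\text{Convolutional:} &\quad O(k \cdot n \cdot d^2), \quad O(1), \quad O(\log_k n)
\end{aligned}
$$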