Attention Is All You Need
Abstract
Task: machine translation
Conventional approach: RNNs/CNNs, possibly combined with an attention mechanism
This paper's approach: the Transformer (the robot? the electrical device?), using attention only
Introduction
RNNs' inherently sequential nature precludes parallelization within training examples.
In prior work, attention mechanisms are used in conjunction with a recurrent network.
Transformer: relying entirely on an attention mechanism to draw global dependencies between input and output
Background
Self-attention (previously covered in the SAGAN notes): intra-attention, relating different positions of a single sequence in order to compute a representation of the sequence
Model Architecture
encoder-decoder structure
encoder:
$(x_1, \dots, x_n) \mapsto (z_1, \dots, z_n)$
auto-regressive: when generating each symbol, the decoder consumes the previously generated symbols as additional input
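
A minimal sketch of what auto-regressive (greedy) decoding looks like in practice, assuming a hypothetical `model.encode` / `model.decode` interface and placeholder `bos_id` / `eos_id` token ids (none of these names come from the paper):

```python
def greedy_decode(model, src_tokens, max_len=128, bos_id=1, eos_id=2):
    """Generate output symbols one at a time, feeding each prediction back in."""
    memory = model.encode(src_tokens)        # the encoder's (z1, ..., zn)
    out = [bos_id]
    for _ in range(max_len):
        logits = model.decode(memory, out)   # conditioned on all symbols generated so far
        next_id = int(logits[-1].argmax())   # greedily pick the most likely next symbol
        out.append(next_id)
        if next_id == eos_id:
            break
    return out
```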

Scaled Dot-Product Attention: queries $Q$, keys $K$, values $V$, dimension of keys $d_k$
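
A minimal NumPy sketch of scaled dot-product attention, $\mathrm{softmax}(QK^T/\sqrt{d_k})\,V$; the `mask` argument is my own addition to cover the decoder's masked variant:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compare every query with every key
    if mask is not None:                          # mask: True where attention is allowed
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                            # weighted sum of the values
```

Scaling by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$, which would otherwise push the softmax into regions with very small gradients.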

Multi-Head Attention: concatenate, project
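
A sketch of multi-head attention on top of the previous function: project, split $d_{\text{model}}$ into `num_heads` slices, attend in each head in parallel, concatenate, and project back. Using one big projection matrix per role and slicing it is a simplification of the per-head $W_i^Q, W_i^K, W_i^V$ in the paper:

```python
import numpy as np

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, num_heads):
    """X_q: (n_q, d_model), X_kv: (n_kv, d_model); all W_* are (d_model, d_model)."""
    d_model = X_q.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v   # project inputs to queries/keys/values
    heads = []
    for h in range(num_heads):                    # each head attends over its own slice
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate the heads, then project
```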

fully connected feed-forward network: two fully connected layers with a ReLU in between, applied to each position identically
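
A sketch of the position-wise feed-forward network, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (n, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model).
    FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2
```
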
Positional Encoding: injects information about the order of the sequence, i.e., the relative or absolute position of the tokens in the sequence
allow the model to easily learn to attend by relative positions
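
The sinusoidal encoding from the paper ($pos$ is the position, $i$ the dimension index):

$$
PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)
$$

For any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, which is what makes relative positions easy to attend to.
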
Why Self-Attention
- the total computational complexity per layer (see the comparison after this list)
- the amount of computation that can be parallelized
- the path length between long-range dependencies in the network
- yield more interpretable models
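
For reference, the per-layer comparison from the paper's Table 1 (quoted from memory; $n$ = sequence length, $d$ = representation dimension, $k$ = convolution kernel width):

$$
\begin{aligned}
\text{Self-Attention:} &\quad O(n^2 \cdot d) \text{ per layer}, \quad O(1) \text{ sequential ops}, \quad O(1) \text{ max path length} \\
\text{Recurrent:} &\quad O(n \cdot d^2), \quad O(n), \quad O(n) \\
\text{Convolutional:} &\quad O(k \cdot n \cdot d^2), \quad O(1), \quad O(\log_k n)
\end{aligned}
$$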