The Annotated Transformer

This article provides an in-depth look at the Transformer, a sequence-to-sequence model based entirely on self-attention. By dispensing with recurrent and convolutional networks, it achieved major advances on a range of natural language processing tasks. The article walks through the architecture in detail, covering the encoder and decoder stacks, multi-head self-attention, positional encoding, and other key components, and provides an implementation from scratch.

Reposted from the Harvard NLP group's The Annotated Transformer:
http://nlp.seas.harvard.edu/2018/04/03/attention.html


Apr 3, 2018

from IPython.display import Image
Image(filename='images/aiayn.png')


The Transformer from “Attention is All You Need” has been on a lot of people’s minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks. The paper itself is very clearly written, but the conventional wisdom has been that it is quite difficult to implement correctly.

In this post I present an “annotated” version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation. In total there are 400 lines of library code which can process 27,000 tokens per second on 4 GPUs.

To follow along you will first need to install PyTorch. The complete notebook is also available on github or on Google Colab with free GPUs.

Note this is merely a starting point for researchers and interested developers. The code here is based heavily on our OpenNMT packages. (If helpful feel free to cite.) For other full-service implementations of the model check out Tensor2Tensor (tensorflow) and Sockeye (mxnet).

  • Alexander Rush (@harvardnlp or srush@seas.harvard.edu), with help from Vincent Nguyen and Guillaume Klein

Prelims

# !pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib spacy torchtext seaborn 
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline

Table of Contents

  • Prelims
  • Background
  • Model Architecture
    • Encoder and Decoder Stacks
      • Encoder
      • Decoder
      • Attention
      • Applications of Attention in our Model
    • Position-wise Feed-Forward Networks
    • Embeddings and Softmax
    • Positional Encoding
    • Full Model
  • Training
    • Batches and Masking
    • Training Loop
    • Training Data and Batching
    • Hardware and Schedule
    • Optimizer
    • Regularization
      • Label Smoothing
  • A First Example
    • Synthetic Data
    • Loss Computation
    • Greedy Decoding
  • A Real World Example
    • Data Loading
    • Iterators
    • Multi-GPU Training
    • Training the System
  • Additional Components: BPE, Search, Averaging
  • Results
    • Attention Visualization
  • Conclusion

My comments are blockquoted. The main text is all from the paper itself.

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution.

Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure (cite). Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive (cite), consuming the previously generated symbols as additional input when generating the next.

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many 
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">tgt</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="s">"Take in and process masked src and target sequences."</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">),</span> <span class="n">src_mask</span><span class="p">,</span>
                        <span class="n">tgt</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">encoder</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">src_embed</span><span class="p">(</span><span class="n">src</span><span class="p">),</span> <span class="n">src_mask</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">decode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">decoder</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">tgt_embed</span><span class="p">(</span><span class="n">tgt</span><span class="p">),</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">)</span></code></pre></figure>
class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">proj</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span></code></pre></figure>

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

Image(filename='images/ModalNet-21.png')

[Figure 1: The Transformer model architecture.]

Encoder and Decoder Stacks

Encoder

The encoder is composed of a stack of N=6 identical layers.

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">):</span>
    <span class="s">"Pass the input (and mask) through each layer in turn."</span>
    <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

We employ a residual connection (cite) around each of the two sub-layers, followed by layer normalization (cite).

class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">mean</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">std</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">a_2</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">std</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">eps</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b_2</span></code></pre></figure>

That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">sublayer</span><span class="p">):</span>
    <span class="s">"Apply residual connection to any sublayer with the same size."</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">sublayer</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span></code></pre></figure>

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed- forward network.

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">):</span>
    <span class="s">"Follow Figure 1 (left) for connections."</span>
    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">0</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">self_attn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">))</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">1</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">feed_forward</span><span class="p">)</span></code></pre></figure>

Decoder

The decoder is also composed of a stack of N=6 identical layers.

class Decoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="s">"Follow Figure 1 (right) for connections."</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">memory</span>
    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">0</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">self_attn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">))</span>
    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">1</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">src_attn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">))</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">2</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">feed_forward</span><span class="p">)</span></code></pre></figure>

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0

Below the attention mask shows the position each tgt word (row) is allowed to look at (column). Words are blocked from attending to future words during training.

plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0])
None


Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention “Scaled Dot-Product Attention”. The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.

Image(filename='images/ModalNet-19.png')

[Figure 2 (left): Scaled Dot-Product Attention.]

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

The two most commonly used attention functions are additive attention (cite), and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of d_k (cite). We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (To see why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = Σ_i q_i k_i, has mean 0 and variance d_k.) To counteract this effect, we scale the dot products by 1/√d_k.
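A quick numeric check (my addition, not from the original post): with unit-variance query and key components, the standard deviation of the raw dot products grows like √d_k, while the scaled scores stay roughly unit-variance.

import math
import torch

torch.manual_seed(0)
for d_k in [4, 64, 1024]:
    q = torch.randn(10000, d_k)        # queries with unit-variance components
    k = torch.randn(10000, d_k)        # keys with unit-variance components
    raw = (q * k).sum(-1)              # unscaled dot products: std grows like sqrt(d_k)
    scaled = raw / math.sqrt(d_k)      # scaled dot products: std stays near 1
    print(d_k, raw.std().item(), scaled.std().item())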

Image(filename='images/ModalNet-20.png')

[Figure 2 (right): Multi-Head Attention, several attention layers running in parallel.]

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Where the projections are parameter matrices W_i^Q ∈ ℝ^(d_model × d_k), W_i^K ∈ ℝ^(d_model × d_k), W_i^V ∈ ℝ^(d_model × d_v) and W^O ∈ ℝ^(h·d_v × d_model). In this work we employ h=8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="s">"Implements Figure 2"</span>
    <span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="c"># Same mask applied to all h heads.</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">mask</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">nbatches</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c"># 1) Do all the linear projections in batch from d_model =&gt; h x d_k </span>
    <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="o">=</span> \
        <span class="p">[</span><span class="n">l</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">nbatches</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">h</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_k</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
         <span class="k">for</span> <span class="n">l</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">linears</span><span class="p">,</span> <span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">))]</span>
    
    <span class="c"># 2) Apply attention on all the projected vectors in batch. </span>
    <span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">attn</span> <span class="o">=</span> <span class="n">attention</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">,</span> 
                             <span class="n">dropout</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">)</span>
    
    <span class="c"># 3) "Concat" using a view and apply a final linear. </span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">contiguous</span><span class="p">()</span> \
         <span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">nbatches</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">h</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_k</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">linears</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">](</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways: 1) In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as (cite).

2) The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

3) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
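As a quick sanity check (my addition), the subsequent_mask helper and the attention function defined above can be combined to see this masking in action:

# Causal masking demo (not from the original post).
q = k = v = torch.randn(1, 5, 8)                  # (batch, positions, d_k)
_, p_attn = attention(q, k, v, mask=subsequent_mask(5))
print(p_attn[0])                                  # row i places zero weight on columns j > i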

Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">w_2</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">w_1</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span></code></pre></figure>

Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by √d_model.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">lut</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">d_model</span><span class="p">)</span></code></pre></figure>

Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodeldmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (cite).

In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\mathrm{model}}})$$

$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\mathrm{model}}})$$

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.

class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
    <span class="c"># Compute the positional encodings once in log space.</span>
    <span class="n">pe</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
    <span class="n">position</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_len</span><span class="p">)</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">div_term</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span>
                         <span class="o">-</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mf">10000.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">d_model</span><span class="p">))</span>
    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
    <span class="n">pe</span> <span class="o">=</span> <span class="n">pe</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'pe'</span><span class="p">,</span> <span class="n">pe</span><span class="p">)</span>
    
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">Variable</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pe</span><span class="p">[:,</span> <span class="p">:</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)],</span> 
                     <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

Below the positional encoding will add in a sine wave based on position. The frequency and offset of the wave is different for each dimension.

plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d"%p for p in [4,5,6,7]])
None


We also experimented with using learned positional embeddings (cite) instead, and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
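For reference, a learned variant might look roughly like the following (my sketch, not code from the post); unlike the sinusoidal version, it cannot represent positions beyond max_len:

class LearnedPositionalEncoding(nn.Module):
    "Sketch (my addition): learned positional embeddings instead of fixed sinusoids."
    def __init__(self, d_model, dropout, max_len=5000):
        super(LearnedPositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        positions = torch.arange(0, x.size(1)).long().unsqueeze(0)  # (1, seq_len)
        return self.dropout(x + self.pe(positions))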

Full Model

Here we define a function that takes in hyperparameters and produces a full model.
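A sketch of such a helper, assembled from the components defined above (my reconstruction; the default hyperparameters follow the paper's base model):

def make_model(src_vocab, tgt_vocab, N=6,
               d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Sketch (my reconstruction): build a full model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))

    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

# Example: a tiny model over a 10-symbol vocabulary.
tmp_model = make_model(10, 10, N=2)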
