The Annotated Transformer

This post walks through the Transformer, a sequence-to-sequence model built entirely on self-attention that dispenses with recurrent and convolutional networks and brought major improvements to a range of NLP tasks. It covers the architecture in detail, including the encoder and decoder stacks, multi-head self-attention, and positional encoding, and provides a from-scratch implementation.


Reposted from the Harvard NLP group's "The Annotated Transformer":
http://nlp.seas.harvard.edu/2018/04/03/attention.html

<header class="site-header">
<span><img width="30px" style="margin-top:5px" src="/SEAS.png"></span><span><a class="site-title" href="/">harvardnlp</a></span>

<nav class="site-nav">
  <a href="#" class="menu-icon">
    <svg viewBox="0 0 18 15">
      <path fill="#424242" d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.031C17.335,0,18,0.665,18,1.484L18,1.484z"></path>
      <path fill="#424242" d="M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0c0-0.82,0.665-1.484,1.484-1.484 h15.031C17.335,6.031,18,6.696,18,7.516L18,7.516z"></path>
      <path fill="#424242" d="M18,13.516C18,14.335,17.335,15,16.516,15H1.484C0.665,15,0,14.335,0,13.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.031C17.335,12.031,18,12.696,18,13.516L18,13.516z"></path>
    </svg>
  </a>

  <div class="trigger">
    
    
      
    
      
    
      
    
      
    
      
    
      
      <a class="page-link" href="/members/">Members</a>
      
    
      
      <a class="page-link" href="/rush.html">PI</a>
      
    
      
      <a class="page-link" href="/code/">Code</a>
      
    
      
      <a class="page-link" href="/papers/">Publications</a>
      
    
  </div>
</nav>
<div class="page-content">
  <div class="wrapper">
    <article class="post" itemscope="" itemtype="http://schema.org/BlogPosting">

The Annotated Transformer

Apr 3, 2018

from IPython.display import Image
Image(filename='images/aiayn.png')

[Image: aiayn.png — "Attention is All You Need"]

The Transformer from “Attention is All You Need” has been on a lot of people’s minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks. The paper itself is very clearly written, but the conventional wisdom has been that it is quite difficult to implement correctly.

In this post I present an “annotated” version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation. In total there are 400 lines of library code which can process 27,000 tokens per second on 4 GPUs.

To follow along you will first need to install PyTorch. The complete notebook is also available on github or on Google Colab with free GPUs.

Note this is merely a starting point for researchers and interested developers. The code here is based heavily on our OpenNMT packages. (If helpful feel free to cite.) For other full-service implementations of the model check out Tensor2Tensor (tensorflow) and Sockeye (mxnet).

  • Alexander Rush (@harvardnlp or srush@seas.harvard.edu), with help from Vincent Nguyen and Guillaume Klein

Prelims

# !pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib spacy torchtext seaborn 
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline

Table of Contents

  • Prelims
  • Background
  • Model Architecture
    • Encoder and Decoder Stacks
      • Encoder
      • Decoder
      • Attention
      • Applications of Attention in our Model
    • Position-wise Feed-Forward Networks
    • Embeddings and Softmax
    • Positional Encoding
    • Full Model
  • Training
    • Batches and Masking
    • Training Loop
    • Training Data and Batching
    • Hardware and Schedule
    • Optimizer
    • Regularization
      • Label Smoothing
  • A First Example
    • Synthetic Data
    • Loss Computation
    • Greedy Decoding
  • A Real World Example
    • Data Loading
    • Iterators
    • Multi-GPU Training
    • Training the System
  • Additional Components: BPE, Search, Averaging
  • Results
    • Attention Visualization
  • Conclusion

My comments are blockquoted. The main text is all from the paper itself.

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as a basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure (cite). Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive (cite), consuming the previously generated symbols as additional input when generating the next.

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many 
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">tgt</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="s">"Take in and process masked src and target sequences."</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">),</span> <span class="n">src_mask</span><span class="p">,</span>
                        <span class="n">tgt</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">encoder</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">src_embed</span><span class="p">(</span><span class="n">src</span><span class="p">),</span> <span class="n">src_mask</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">decode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">decoder</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">tgt_embed</span><span class="p">(</span><span class="n">tgt</span><span class="p">),</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">)</span></code></pre></figure>
class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">proj</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span></code></pre></figure>

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

Image(filename='images/ModalNet-21.png')

[Image: ModalNet-21.png — Figure 1, the Transformer model architecture]

Encoder and Decoder Stacks

Encoder

The encoder is composed of a stack of N=6 identical layers.

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">):</span>
    <span class="s">"Pass the input (and mask) through each layer in turn."</span>
    <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

We employ a residual connection (cite) around each of the two sub-layers, followed by layer normalization (cite).

class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">mean</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">std</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">a_2</span> <span class="o">*</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">std</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">eps</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b_2</span></code></pre></figure>

That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">sublayer</span><span class="p">):</span>
    <span class="s">"Apply residual connection to any sublayer with the same size."</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">sublayer</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span></code></pre></figure>

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed- forward network.

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">):</span>
    <span class="s">"Follow Figure 1 (left) for connections."</span>
    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">0</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">self_attn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">))</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">1</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">feed_forward</span><span class="p">)</span></code></pre></figure>

Decoder

The decoder is also composed of a stack of N=6 identical layers.

class Decoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">):</span>
    <span class="s">"Follow Figure 1 (right) for connections."</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">memory</span>
    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">0</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">self_attn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">tgt_mask</span><span class="p">))</span>
    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">1</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">src_attn</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">))</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">sublayer</span><span class="p">[</span><span class="mi">2</span><span class="p">](</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">feed_forward</span><span class="p">)</span></code></pre></figure>

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0
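
As a quick check (a small snippet of my own, not part of the original post), printing the mask for a short sequence shows the lower-triangular structure directly:

# Hypothetical sanity check: row i allows attention only to positions <= i.
print(subsequent_mask(4))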

The attention mask below shows the position each tgt word (row) is allowed to look at (column). Words are blocked from attending to future words during training.

plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0])
None

[Plot: visualization of subsequent_mask(20)]

Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.

Image(filename='images/ModalNet-19.png')

[Image: ModalNet-19.png — Figure 2 (left), Scaled Dot-Product Attention]

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
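
As a quick shape check (a small sketch of my own, not part of the original post), the function maps query/key/value tensors of shape (batch, heads, positions, d_k) to an output of the same shape, plus the attention weights:

# Hypothetical shape check for the attention function above.
q = torch.rand(2, 8, 10, 64)   # (batch, heads, query positions, d_k)
k = torch.rand(2, 8, 10, 64)
v = torch.rand(2, 8, 10, 64)
out, p_attn = attention(q, k, v)
print(out.size())     # (2, 8, 10, 64): one weighted sum of values per query
print(p_attn.size())  # (2, 8, 10, 10): softmax weights over key positions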

The two most commonly used attention functions are additive attention (cite), and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d_k (cite). We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q·k = Σ q_i k_i, has mean 0 and variance d_k.) To counteract this effect, we scale the dot products by 1/√d_k.
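
The growth of the dot products can be checked numerically; a small illustration of my own (not from the original post), using the numpy import from the prelims:

# For unit-variance components, q . k has std about sqrt(d_k);
# dividing by sqrt(d_k) brings the logits back to unit scale.
for d_k in [16, 64, 256]:
    q = np.random.randn(10000, d_k)
    k = np.random.randn(10000, d_k)
    dots = (q * k).sum(axis=1)
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())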

Image(filename='images/ModalNet-20.png')

[Image: ModalNet-20.png — Figure 2 (right), Multi-Head Attention]

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O \quad \text{where}\ \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In this work we employ h=8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="s">"Implements Figure 2"</span>
    <span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="c"># Same mask applied to all h heads.</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">mask</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">nbatches</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c"># 1) Do all the linear projections in batch from d_model =&gt; h x d_k </span>
    <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="o">=</span> \
        <span class="p">[</span><span class="n">l</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">nbatches</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">h</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_k</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
         <span class="k">for</span> <span class="n">l</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">linears</span><span class="p">,</span> <span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">))]</span>
    
    <span class="c"># 2) Apply attention on all the projected vectors in batch. </span>
    <span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">attn</span> <span class="o">=</span> <span class="n">attention</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">,</span> 
                             <span class="n">dropout</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">)</span>
    
    <span class="c"># 3) "Concat" using a view and apply a final linear. </span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">contiguous</span><span class="p">()</span> \
         <span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">nbatches</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">h</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_k</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">linears</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">](</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways: 1) In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as (cite).

2) The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

3) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">w_2</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">w_1</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span></code></pre></figure>

Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In the embedding layers, we multiply the weights by √d_model.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">lut</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">d_model</span><span class="p">)</span></code></pre></figure>

Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (cite).

In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

where pos is the position and i is the dimension, so each dimension of the positional encoding corresponds to a sinusoid.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.

class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
    <span class="c"># Compute the positional encodings once in log space.</span>
    <span class="n">pe</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
    <span class="n">position</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_len</span><span class="p">)</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">div_term</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span>
                         <span class="o">-</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mf">10000.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">d_model</span><span class="p">))</span>
    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
    <span class="n">pe</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">position</span> <span class="o">*</span> <span class="n">div_term</span><span class="p">)</span>
    <span class="n">pe</span> <span class="o">=</span> <span class="n">pe</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'pe'</span><span class="p">,</span> <span class="n">pe</span><span class="p">)</span>
    
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">Variable</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pe</span><span class="p">[:,</span> <span class="p">:</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)],</span> 
                     <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></code></pre></figure>

Below, the positional encoding adds a sine wave based on position. The frequency and offset of the wave differ for each dimension.

plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d"%p for p in [4,5,6,7]])
None

[Plot: positional encoding values over positions 0-99 for dimensions 4-7]

We also experimented with using learned positional embeddings (cite) instead, and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
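
For comparison, the learned variant would simply look positions up in an embedding table. A minimal sketch of my own (not from the original post), assuming the same interface as PositionalEncoding above:

class LearnedPositionalEncoding(nn.Module):
    "Hypothetical learned alternative to the sinusoidal encoding."
    def __init__(self, d_model, dropout, max_len=5000):
        super(LearnedPositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        positions = Variable(torch.arange(0, x.size(1)).long())
        return self.dropout(x + self.pe(positions))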

Full Model

Here we define a function that takes in hyperparameters and produces a full model.
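
The constructor itself is cut off in this repost; below is a sketch of my own that assembles the components defined above with the base hyperparameters from the paper (N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: construct a full model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))

    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform(p)
    return model

# Example: a small 2-layer model over a 10-token source/target vocabulary.
tmp_model = make_model(10, 10, N=2)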
