Practical PyTorch in Practice: An Attention-Based Seq2Seq Translation Model Explained

Project: practical-pytorch (this repo is deprecated and no longer maintained; go to https://github.com/pytorch/tutorials instead). Project address: https://gitcode.com/gh_mirrors/pr/practical-pytorch

[Figure: Seq2Seq translation model diagram]

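Project Overview

This article walks through a PyTorch implementation of a sequence-to-sequence (Seq2Seq) neural translation model that translates French into English. The model uses an attention mechanism, which noticeably improves translation quality over a plain encoder-decoder. Some sample translations from the model:

[KEY: > input, = target output, < model output]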

> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?

> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .

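How Sequence-to-Sequence Learning Works

Basic Concepts

A sequence-to-sequence (Seq2Seq) network consists of two separate RNNs: an encoder and a decoder.

Encoder: reads the input sequence item by item and emits a vector at each step. The final output serves as the context vector.
Decoder: uses the context vector to generate the output sequence step by step.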

[Figure: Seq2Seq architecture diagram]

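Why Seq2Seq?

A single RNN that emits one output per input token faces two main challenges in translation:

Word order differences: e.g. French "chat noir" corresponds to English "black cat"
Length differences: e.g. the French "ne ... pas" construction corresponds to the single English negation "not"

Seq2Seq addresses both by encoding the whole input into a single vector and then decoding an output of any length from that vector, so input and output do not have to align word for word. The encoded vector can be viewed as a "semantic representation" of the sentence.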

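The Attention Mechanism in Detail

What Attention Is For

The bottleneck of plain Seq2Seq is that the input sentence, however long, is compressed into one fixed-length context vector. As a result:

Information is lost
Long sentences are translated poorly
Subtle differences are hard to capture

How It Works

The attention mechanism lets the decoder "focus" on a different part of the input sequence at every step:

Compute attention weights from the current hidden state and all encoder outputs
Take the weighted sum of the encoder outputs to obtain a context vector
Combine the context vector with the hidden state to predict the next output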
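
The three steps above can be sketched for a single decoding step as follows. This is a minimal illustration, not the article's decoder: it assumes dot-product scoring (the model in this article scores with a linear layer instead), and the function name `attention_step` and the tensor shapes are made up for the example.

```python
import torch
import torch.nn.functional as F

def attention_step(decoder_hidden, encoder_outputs):
    """One attention step with dot-product scoring (an assumed scoring function).

    decoder_hidden:  (batch, hidden)           current decoder state
    encoder_outputs: (batch, src_len, hidden)  one vector per source token
    """
    # Step 1: score every encoder output against the hidden state,
    # then normalise the scores into weights that sum to 1 per sentence
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)                      # (batch, src_len)
    # Step 2: weighted sum of encoder outputs gives the context vector
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    # Step 3: concatenate context and hidden state for the output layer
    combined = torch.cat((context, decoder_hidden), dim=1)  # (batch, 2 * hidden)
    return weights, context, combined

torch.manual_seed(0)
weights, context, combined = attention_step(torch.randn(2, 8), torch.randn(2, 5, 8))
print(weights.shape, context.shape, combined.shape)
```

Each row of `weights` is a probability distribution over the source positions, which is what the attention heatmaps later in the article visualize.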

[Figure: attention mechanism diagram]

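Environment Setup

Dependencies

import re            # used by normalize_string
import unicodedata   # used for Unicode-to-ASCII conversion

import torch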
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import matplotlib.pyplot as plt
import numpy as np

GPU Setup

USE_CUDA = torch.cuda.is_available()  # False automatically when no GPU is available
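
As an alternative to a boolean flag, current PyTorch code usually carries a `torch.device` object around. The sketch below assumes nothing beyond PyTorch itself; the variable names are illustrative and do not come from the original project.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors and modules are moved with .to(device), so the same code runs on either
x = torch.zeros(2, 3).to(device)
layer = torch.nn.Linear(3, 4).to(device)
y = layer(x)
print(y.device.type)
```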

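Data Processing Pipeline

1. Loading and Cleaning the Data

We use a dataset of French-English translation pairs. The processing steps are:

Convert Unicode to ASCII
Lowercase everything
Remove special characters
Normalize punctuation

def normalize_string(s):
    # Lowercase, trim, and strip accents (unicode_to_ascii is a helper not shown here)
    s = unicode_to_ascii(s.lower().strip())
    # Put spaces around punctuation so each mark becomes its own token
    s = re.sub(r"([,.!?])", r" \1 ", s)
    # Replace any run of other non-letter characters with a single space
    s = re.sub(r"[^a-zA-Z,.!?]+", r" ", s)
    # Collapse repeated whitespace
    s = re.sub(r"\s+", r" ", s).strip()
    return s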
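
The function above relies on `unicode_to_ascii`, which the article does not show. A self-contained version might look like the sketch below, where `unicode_to_ascii` is an assumed implementation built on the standard `unicodedata` module (decompose characters, then drop the combining accent marks).

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Assumed helper: decompose accented characters (NFD) and
    # drop the combining marks (Unicode category "Mn")
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([,.!?])", r" \1 ", s)
    s = re.sub(r"[^a-zA-Z,.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

print(normalize_string("Elle n'est pas poète mais romancière."))
# elle n est pas poete mais romanciere .
```

The apostrophe is not in the allowed character set, so "n'est" becomes "n est", which matches the form of the sample sentences at the top of the article.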

2. Building the Vocabulary

A Lang class manages the vocabulary:

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "PAD", 1: "SOS", 2: "EOS"}
        self.n_words = 3  # counts the three special tokens

    def index_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
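
A short usage sketch may help show how the class turns sentences into index lists. The class is repeated so the example runs on its own; `indexes_from_sentence` and the token constants are hypothetical names introduced for illustration.

```python
PAD_token, SOS_token, EOS_token = 0, 1, 2  # indices of the special tokens

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "PAD", 1: "SOS", 2: "EOS"}
        self.n_words = 3

    def index_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

def indexes_from_sentence(lang, sentence):
    # Hypothetical helper: map words to indices and append the EOS marker
    return [lang.word2index[w] for w in sentence.split(" ")] + [EOS_token]

eng = Lang("eng")
for word in "he is painting a picture .".split(" "):
    eng.index_word(word)
eng.index_word("he")  # a repeated word only increments its count

print(eng.n_words)                                   # 9
print(indexes_from_sentence(eng, "he is painting ."))  # [3, 4, 5, 8, 2]
```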

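3. Data Filtering Strategy

Length filter: keep sentences of 3 to 25 words
Frequency filter: drop low-frequency words (MIN_COUNT=5)
Pair filter: keep a pair only if every input and output word is in the vocabulary

# Example result after filtering
Trimmed from 25706 pairs to 15896, 0.6184 of total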
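
The three filters could be written roughly as below. This is a sketch under assumptions: the function names are invented, the length bounds are treated as inclusive, and a word counts as frequent when it appears at least MIN_COUNT times. The demo lowers the count threshold to 2 so that a four-pair toy corpus shows an effect.

```python
from collections import Counter

MIN_LENGTH, MAX_LENGTH, MIN_COUNT = 3, 25, 5  # thresholds from the list above

def filter_pairs(pairs, min_length=MIN_LENGTH, max_length=MAX_LENGTH):
    # Length filter: both sentences must have min_length..max_length words
    def ok(sentence):
        return min_length <= len(sentence.split(" ")) <= max_length
    return [pair for pair in pairs if ok(pair[0]) and ok(pair[1])]

def frequent_words(sentences, min_count=MIN_COUNT):
    # Frequency filter: vocabulary of words seen at least min_count times
    counts = Counter(w for s in sentences for w in s.split(" "))
    return {w for w, c in counts.items() if c >= min_count}

def keep_known_pairs(pairs, input_vocab, output_vocab):
    # Pair filter: every word on both sides must be in its vocabulary
    return [
        (src, tgt) for src, tgt in pairs
        if all(w in input_vocab for w in src.split(" "))
        and all(w in output_vocab for w in tgt.split(" "))
    ]

pairs = [
    ("je suis la .", "i am here ."),
    ("oui .", "yes ."),
    ("je suis rare .", "i am rare ."),
    ("je suis .", "i am ."),
]
pairs = filter_pairs(pairs)  # drops the two-word pair
src_vocab = frequent_words([p[0] for p in pairs], min_count=2)
tgt_vocab = frequent_words([p[1] for p in pairs], min_count=2)
kept = keep_known_pairs(pairs, src_vocab, tgt_vocab)
print(kept)  # [('je suis .', 'i am .')]
```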

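Model Implementation

Encoder Design

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        # Map word indices to dense vectors of size hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        # Pass n_layers through so the GRU depth matches self.n_layers
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=n_layers, batch_first=True)

    def forward(self, input, hidden=None):
        # input: (batch, seq_len) word indices; hidden=None starts from zeros
        embedded = self.embedding(input)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden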

Decoder with Attention

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout_p = dropout_p

        self.embedding = nn.Embedding(output_size, hidden_size)
        # Scores one weight per source position; MAX_LENGTH is a module-level constant
        self.attn = nn.Linear(hidden_size * 2, MAX_LENGTH)
        # Merges the embedded input with the attention context
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.dropout = nn.Dropout(dropout_p)
        # Pass n_layers through so the GRU depth matches self.n_layers
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=n_layers, batch_first=True)
        # Projects the GRU output onto the output vocabulary
        self.out = nn.Linear(hidden_size, output_size)
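
The class above only lists its layers; the forward pass is not shown in the article. One plausible way to wire those layers together for a single batched decoding step is sketched below. Everything here is an assumption for illustration: the class name `AttnDecoderSketch`, the value of `MAX_LENGTH`, the single-layer GRU, and the requirement that encoder outputs are zero-padded to `MAX_LENGTH` positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10  # assumed value; the article does not state it

class AttnDecoderSketch(nn.Module):
    def __init__(self, hidden_size, output_size, max_length=MAX_LENGTH, dropout_p=0.1):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.dropout = nn.Dropout(dropout_p)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        # input: (batch,) previous word indices
        # hidden: (1, batch, hidden) current decoder state
        # encoder_outputs: (batch, max_length, hidden), zero-padded
        embedded = self.dropout(self.embedding(input))            # (batch, hidden)
        # Attention weights from the embedded input and the hidden state
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded, hidden[-1]), dim=1)), dim=1)
        # Context vector: weighted sum of the encoder outputs
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        # Merge input and context, then run one GRU step
        gru_input = F.relu(self.attn_combine(torch.cat((embedded, context), dim=1)))
        output, hidden = self.gru(gru_input.unsqueeze(1), hidden)
        # Log-probabilities over the output vocabulary
        output = F.log_softmax(self.out(output.squeeze(1)), dim=1)
        return output, hidden, attn_weights

torch.manual_seed(0)
decoder = AttnDecoderSketch(hidden_size=16, output_size=20)
decoder.eval()  # disable dropout for a deterministic check
step_input = torch.tensor([1, 1, 1])          # SOS for a batch of three
hidden = torch.zeros(1, 3, 16)
encoder_outputs = torch.randn(3, MAX_LENGTH, 16)
output, hidden, attn_weights = decoder(step_input, hidden, encoder_outputs)
print(output.shape, hidden.shape, attn_weights.shape)
```

Because the attention layer has a fixed output width of `MAX_LENGTH`, this design can only attend over sentences up to that length, which is one reason the data is filtered by length beforehand.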

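Training Techniques

Batch Processing Optimizations

Use pack_padded_sequence to handle variable-length sequences
Compute the loss with a custom masked_cross_entropy
Adjust the learning rate dynamically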
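
To make the first point concrete, here is a small sketch of packing a padded batch before it reaches a GRU. The layer sizes and the toy batch are arbitrary; the key detail is that packing is applied to the embedded vectors, after the embedding lookup, with the sequences sorted by decreasing length.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
embedding = nn.Embedding(10, 4, padding_idx=0)
gru = nn.GRU(4, 6, batch_first=True)

# Two sentences padded with index 0, sorted by decreasing length
batch = torch.tensor([[5, 6, 7, 2],
                      [8, 2, 0, 0]])
lengths = [4, 2]

embedded = embedding(batch)                                  # (2, 4, 4)
packed = pack_padded_sequence(embedded, lengths, batch_first=True)
packed_out, hidden = gru(packed)                             # padding is skipped
outputs, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(outputs.shape)         # torch.Size([2, 4, 6])
print(out_lengths.tolist())  # [4, 2]
```

With packing, the final hidden state of the shorter sentence comes from its last real token rather than from a padding position, and the padded positions in `outputs` are filled with zeros.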

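def train(input_batches, input_lengths, target_batches, target_lengths,
          encoder, decoder, encoder_optimizer, decoder_optimizer):

    # Clear gradients left over from the previous step
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    batch_size = input_batches.size(0)  # batch_first layout: (batch, seq_len)
    max_target_length = max(target_lengths)

    # Encoder forward pass on the padded index batch. pack_padded_sequence
    # applies to embedded vectors, not raw indices, so packing with
    # input_lengths belongs inside the encoder, after the embedding lookup.
    encoder_outputs, encoder_hidden = encoder(input_batches, None)

    # Decoder initialization: every sequence starts with SOS
    decoder_input = torch.LongTensor([SOS_token] * batch_size)
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Decode one step at a time with teacher forcing. The decoder's forward
    # signature is not shown in this article; the call below assumes
    # (input, hidden, encoder_outputs) -> (output, hidden, attn_weights).
    decoder_outputs = []
    for t in range(max_target_length):
        decoder_output, decoder_hidden, _ = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        decoder_outputs.append(decoder_output)
        decoder_input = target_batches[:, t]  # feed the ground-truth word
    decoder_outputs = torch.stack(decoder_outputs, dim=1)

    # Compute the loss with a mask so padded positions are ignored
    loss = masked_cross_entropy(
        decoder_outputs, target_batches, target_lengths)

    # Backpropagate and update both networks
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()
    return loss.item()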
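
The training function calls `masked_cross_entropy`, which the article does not define. A minimal sketch of such a loss is given below; the argument layout (scores of shape batch x max_len x vocab, padded targets, and a list of true lengths) is an assumption chosen to match the call above.

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, target, lengths):
    """Cross-entropy averaged over real (non-padded) target tokens only.

    logits:  (batch, max_len, vocab) scores or log-probabilities
    target:  (batch, max_len) padded word indices
    lengths: true length of each target sequence
    """
    lengths = torch.as_tensor(lengths, device=logits.device)
    max_len = target.size(1)
    # mask[b, t] is True where position t is a real token of sequence b
    mask = torch.arange(max_len, device=logits.device).unsqueeze(0) < lengths.unsqueeze(1)
    log_probs = F.log_softmax(logits, dim=2)
    # Negative log-likelihood of the target word at every position
    nll = -log_probs.gather(2, target.unsqueeze(2)).squeeze(2)
    # Zero out padded positions, then average over real tokens
    return (nll * mask.float()).sum() / mask.sum()

torch.manual_seed(0)
logits = torch.randn(2, 3, 5)
target = torch.tensor([[1, 2, 3],
                       [4, 0, 0]])
loss = masked_cross_entropy(logits, target, [3, 1])
print(loss.item())
```

Since only the first token of the second sequence is real, the result equals the ordinary cross-entropy over the four real tokens, and changing the scores at padded positions leaves it unchanged.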

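Visualizing the Results

During training you can plot:

Loss curve
Attention weight heatmap
Side-by-side translation examples

def show_attention(input_sentence, output_words, attentions):
    input_words = input_sentence.split(' ')
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # attentions: tensor with one row per output word, one column per input word
    cax = ax.matshow(attentions.detach().cpu().numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set the axes: pin one tick per token first so labels line up with cells
    ax.set_xticks(range(len(input_words)))
    ax.set_yticks(range(len(output_words)))
    ax.set_xticklabels(input_words, rotation=90)
    ax.set_yticklabels(output_words)
    plt.show()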