PyTorch Deep Learning in Practice: From Classification to Generation - Building a Character-Level RNN Classifier with PyTorch

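Character-Level Recurrent Neural Network (RNN) Classification

Learning Objectives

In this character-level RNN classification lab, learners build and train an RNN model for a text classification task. The goal is to understand how RNNs work, how to preprocess data for natural language processing, and how to build and tune a model, while developing problem-solving and systems-thinking skills.

Learning Content

1 Character-Level RNN Classification

1.1 Embedding matplotlib figures directly in Jupyter Notebook output cells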

%matplotlib inline

1.2 Downloading the data

!wget https://model-community-picture.obs.cn-north-4.myhuaweicloud.com/ascend-zone/notebook_datasets/8c58bad0e78911ef93b6fa163edcddae/data.zip --no-check-certificate

!unzip data.zip

1.3 Importing libraries

%pip install transformers==4.37.1
%pip install torchvision

from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os
import torch_npu
from torch_npu.contrib import transfer_to_npu

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Convert a Unicode string to plain ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary: a list of names per language
category_lines = {}
all_categories = []

# Read a file and split it into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

We now have category_lines, a dictionary that maps each category (language) to a list of lines (names). We also keep all_categories (simply a list of languages) and n_categories for later use.

print(category_lines['Italian'][:5])

Out:

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']
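
The same dictionary can be built without touching the file system, which makes the normalization step easier to inspect. The following is a minimal sketch, assuming a hypothetical in-memory `raw_files` mapping in place of data/names/*.txt; the sample names and the snake_case function name are illustrative and not taken from the dataset or the code above.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

def unicode_to_ascii(s):
    # Decompose accented characters (NFD), drop combining marks ('Mn'),
    # and keep only characters that appear in all_letters.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

# Hypothetical stand-in for the contents of data/names/*.txt.
raw_files = {
    'French': "Béringer\nCôté\n",
    'Polish': "Ślusàrski\n",
}

category_lines = {}
all_categories = []
for category, text in raw_files.items():
    all_categories.append(category)
    category_lines[category] = [
        unicode_to_ascii(line) for line in text.strip().split('\n')
    ]

print(category_lines)
```

Each value is a plain list of ASCII names, so indexing such as category_lines['French'][:5] works the same way as with the file-based version.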

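1.4 Converting names to tensors

Now that all the names are organized, we need to convert them into tensors to make use of them. To represent a single letter, we use a "one-hot vector" of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter; for example, "b" = <0 1 0 0 0 ...>. To represent a word, we stack these vectors into a tensor of shape <line_length x 1 x n_letters>. The extra dimension of size 1 is there because PyTorch assumes everything comes in batches, and here we use a batch size of 1.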

import torch

# Find the index of a letter in all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# For demonstration only: turn a letter into a <1 x n_letters> tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters> tensor,
# i.e. an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())

Out:

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0.]])
torch.Size([5, 1, 57])
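
To see where the 1 lands and why the shape is <5 x 1 x 57>, the encoding can be reproduced with nested Python lists. This is a rough sketch without PyTorch; `line_to_one_hot` is a hypothetical helper name, and the input is assumed to be already cleaned to characters in all_letters (str.find returns -1 for anything else, which would silently mark the last position).

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # 52 ASCII letters + 5 punctuation characters

def letter_to_index(letter):
    return all_letters.find(letter)

def line_to_one_hot(line):
    # Nested lists shaped [line_length][1][n_letters]
    rows = []
    for letter in line:
        vec = [0.0] * n_letters
        vec[letter_to_index(letter)] = 1.0
        rows.append([vec])  # the inner list is the batch dimension of size 1
    return rows

encoded = line_to_one_hot('Jones')
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
j_index = encoded[0][0].index(1.0)
print(shape, j_index)
```

Lowercase letters occupy indices 0 to 25 and uppercase letters 26 to 51, so 'J' maps to index 35, which matches the position of the 1 in the printed tensor.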

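1.5 Creating the network

Before autograd, creating a recurrent neural network in Torch meant cloning the parameters of a layer across several time steps. The layers held the hidden state and gradients, which are now handled entirely by the computation graph itself. This means you can implement an RNN in a very "pure" way, just like a regular feed-forward layer.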

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

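To run one step of this network, we pass in an input (in our case, the tensor for the current letter) and the previous hidden state (which we initialize to zeros at first). We get back the output (the log-probability of each language) and the next hidden state (which we keep for the next step).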

input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)
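
The arithmetic of a single step can be traced by hand on a toy example. The sketch below mirrors the structure of the RNN class above (concatenate input and hidden state, apply two linear maps, log-softmax the output) in plain Python. The sizes and weight values are hypothetical and chosen only to keep the numbers readable; as in the class above, no nonlinearity is applied to the hidden state.

```python
import math

def linear(weights, bias, x):
    # weights holds one row per output unit
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def log_softmax(z):
    m = max(z)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

def rnn_step(x, hidden, params):
    combined = x + hidden  # list concatenation, like torch.cat(..., 1)
    new_hidden = linear(params['W_i2h'], params['b_i2h'], combined)
    output = log_softmax(linear(params['W_i2o'], params['b_i2o'], combined))
    return output, new_hidden

# Hypothetical toy parameters: 3 input features, 2 hidden units, 2 classes.
params = {
    'W_i2h': [[0.1, 0.0, -0.1, 0.2, 0.0],
              [0.0, 0.3, 0.1, 0.0, -0.2]],
    'b_i2h': [0.0, 0.0],
    'W_i2o': [[0.2, -0.1, 0.0, 0.1, 0.1],
              [-0.2, 0.1, 0.0, 0.0, 0.3]],
    'b_i2o': [0.0, 0.0],
}

hidden = [0.0, 0.0]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):  # two one-hot "letters"
    output, hidden = rnn_step(x, hidden, params)

print(output, hidden)
```

The hidden state returned by one step is fed into the next, which is how information about earlier letters reaches the final output.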

For efficiency we don't want to create a new tensor for every step, so we use lineToTensor instead of letterToTensor and take slices. This could be optimized further by precomputing batches of tensors.

input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)

Out:

tensor([[-2.9193, -2.9728, -2.9281, -2.9440, -2.9318, -2.8012, -2.9719, -2.7966,
-2.9204, -2.8357, -2.9645, -2.9498, -2.8709, -2.8290, -2.8656, -2.8376,
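-2.8890, -2.8291]], grad_fn=)

As you can see, the output is a tensor of shape <1 x n_categories>, where each entry is the log-likelihood of the corresponding category (the higher the value, the more likely the category).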
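
Because the last layer is a log-softmax, exponentiating the printed values should recover a probability distribution. The following quick check uses the 18 numbers copied from the output above (rounded to four decimals, so the sum is only approximately 1).

```python
import math

# Values copied from the printed output above.
log_probs = [
    -2.9193, -2.9728, -2.9281, -2.9440, -2.9318, -2.8012, -2.9719, -2.7966,
    -2.9204, -2.8357, -2.9645, -2.9498, -2.8709, -2.8290, -2.8656, -2.8376,
    -2.8890, -2.8291,
]

probs = [math.exp(v) for v in log_probs]
total = sum(probs)

print(round(total, 3))
print(min(probs), max(probs))
```

All probabilities sit close to 1/18, which is what one would expect from a network that has not been trained yet.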

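1.6 Training

1.6.1 Preparing for training

Before training we should write a few helper functions. The first interprets the output of the network, which we know to be a likelihood for each category. We can use Tensor.topk to get the index of the largest value: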

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))

Out:

('Czech', 7)
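
For readers unfamiliar with topk, its behaviour on a single row can be imitated with a sort. This is only a plain-Python analogue for illustration, not how PyTorch implements it; the category subset and the scores are hypothetical.

```python
def topk(values, k):
    # Return (values, indices) of the k largest entries, largest first.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order

all_categories = ['Arabic', 'Chinese', 'Czech']  # hypothetical subset
output = [-1.20, -0.70, -1.60]                   # hypothetical log-probabilities

top_v, top_i = topk(output, 1)
print(all_categories[top_i[0]], top_i[0])
```

With k=1 this reduces to an argmax, which is all categoryFromOutput needs.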

We will also want a quick way to get a training example (a name and its language):

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

Out:

category = English / line = Jonhson
category = Czech / line = Nekuza
category = Greek / line = Strilakos
category = Portuguese / line = Araullo
category = Irish / line = Sluaghadhan
category = English / line = Buxton
category = Vietnamese / line = Thai
category = Portuguese / line = Araullo
category = English / line = Lowe
category = Japanese / line = Hamacho

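1.6.2 Training the network

Now all it takes to train this network is to show it a bunch of examples, have it make guesses, and tell it whether it was wrong.

For the loss function, nn.NLLLoss (negative log-likelihood loss) is appropriate, since the last layer of the RNN is nn.LogSoftmax.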

criterion = nn.NLLLoss()
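
For a single sample, the negative log-likelihood is simply minus the log-probability that the network assigned to the correct class. A minimal sketch with hypothetical probabilities follows; with a batch of size 1, as used here, the default mean reduction of nn.NLLLoss gives the same number.

```python
import math

def nll_loss(log_probs, target):
    # Minus the log-probability assigned to the correct class.
    return -log_probs[target]

# Hypothetical log-softmax output over three classes.
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]

loss_confident = nll_loss(log_probs, 0)  # correct class had p = 0.7
loss_wrong = nll_loss(log_probs, 2)      # correct class had p = 0.1

print(loss_confident, loss_wrong)
```

The loss grows as the probability of the correct class shrinks, which is why pairing it with a log-softmax output layer works.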

import torch
device = torch.device("npu" if torch.npu.is_available() else "cpu")
print(f"Current device: {device}")

Out:
Current device: npu

You can also use Apple silicon via MPS (which is what I use) or a GPU; in that case you need to adapt the torch device-selection code yourself.

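Each training iteration will:

Create the input and target tensors
Create a zeroed initial hidden state
Read in each letter, keeping the hidden state for the next letter
Compare the final output to the target
Back-propagate
Return the output and loss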

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn.


rnn = rnn.to(device)
criterion = criterion.to(device)

def train(category_tensor, line_tensor):
    
    hidden = rnn.initHidden().to(device)

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Add each parameter's gradient to its value, scaled by -learning_rate (a plain SGD step)
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
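
The manual update in train() is plain gradient descent, and the effect of the learning rate is easy to see on a one-parameter toy problem. The sketch below minimizes f(w) = w**2 with three step sizes; the objective and the learning rates are hypothetical and unrelated to the RNN, and serve only to illustrate "too high explodes, too low does not learn".

```python
def sgd(lr, steps=50, w=1.0):
    for _ in range(steps):
        grad = 2.0 * w   # derivative of f(w) = w**2
        w += -lr * grad  # same form as p.data.add_(p.grad.data, alpha=-lr)
    return w

good = sgd(0.1)       # shrinks towards the minimum at 0
too_high = sgd(1.1)   # overshoots further on every step and blows up
too_low = sgd(1e-6)   # barely moves away from the starting point

print(good, too_high, too_low)
```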

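Now we just have to run that with a bunch of examples. Since the train function returns both the output and the loss, we can print its guesses and also keep track of the loss for plotting. Since there are thousands of examples, we print only every print_every examples and record the average loss over every plot_every examples.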

import time
import math
import torch
from tqdm import tqdm

# Check whether an NPU is available
if torch.npu.is_available():
    device = torch.device("npu")
else:
    raise RuntimeError("NPU is not available.")

n_iters = 10000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

# Wrap the loop with tqdm
try:
    for iter in tqdm(range(1, n_iters + 1), desc="Training Progress"):
        # randomTrainingExample is defined earlier
        category, line, category_tensor, line_tensor = randomTrainingExample()
        # Move the tensors to the NPU device
        category_tensor = category_tensor.to(device)
        line_tensor = line_tensor.to(device)

        # train is defined earlier
        output, loss = train(category_tensor, line_tensor)

        current_loss += loss

        # Print the iteration number, loss, name, and guess
        if iter % print_every == 0:
            # categoryFromOutput is defined earlier
            guess, guess_i = categoryFromOutput(output)
            correct = '✓' if guess == category else '✗ (%s)' % category
            print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

        # Append the current average loss to the list of losses
        if iter % plot_every == 0:
            all_losses.append(current_loss / plot_every)
            current_loss = 0
except NameError as ne:
    print(f"Undefined function or variable: {ne}")
except RuntimeError as re:
    print(f"Runtime error: {re}")
except Exception as e:
    print(f"Unexpected error: {e}")
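
The loss bookkeeping inside the loop is a bucketed average: losses are summed, and every plot_every steps the mean is stored and the running sum is reset. This can be isolated as a small sketch; `bucket_means` is a hypothetical helper name and the loss values are made up.

```python
def bucket_means(losses, plot_every):
    means, running = [], 0.0
    for i, loss in enumerate(losses, start=1):
        running += loss
        if i % plot_every == 0:
            means.append(running / plot_every)
            running = 0.0
    # A trailing partial bucket is dropped, as in the loop above.
    return means

losses = [4.0, 2.0, 3.0, 1.0, 2.0, 0.0, 9.0]  # hypothetical per-step losses
averaged = bucket_means(losses, plot_every=2)
print(averaged)
```

With n_iters = 10000 and plot_every = 1000 as set above, all_losses ends up with 10 points.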

1.6.3 Plotting the results

Plotting the historical loss from all_losses shows the network learning:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)


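1.6.4 Evaluating the results

To see how well the network performs on different categories, we create a confusion matrix, indicating for every actual language (rows) which language the network guesses (columns). To compute the confusion matrix, a large number of samples are run through the network with evaluate(), which is the same as train() minus the backward pass.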

import torch
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from tqdm import tqdm

# Check whether an NPU is available
if torch.npu.is_available():
    device = torch.device("npu")
else:
    raise RuntimeError("NPU is not available.")

# Move rnn to the NPU device
rnn = rnn.to(device)

# Keep track of correct guesses in a confusion matrix
n_categories = len(all_categories)  # all_categories is defined earlier
confusion = torch.zeros(n_categories, n_categories).to(device)
n_confusion = 10000

# Just return the output for a given line
def evaluate(line_tensor):
    hidden = rnn.initHidden().to(device)
    line_tensor = line_tensor.to(device)

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which ones are guessed correctly
for i in tqdm(range(n_confusion), desc="Evaluating for Confusion Matrix"):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    category_tensor = category_tensor.to(device)
    line_tensor = line_tensor.to(device)

    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in tqdm(range(n_categories), desc="Normalizing Confusion Matrix"):
    confusion[i] = confusion[i] / confusion[i].sum()

# Move the confusion matrix to the CPU for plotting
confusion = confusion.cpu()

# Set up the plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up the axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force a label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()


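You can pick out bright spots off the main diagonal that show which languages it guesses incorrectly, e.g. Chinese for Korean, and Spanish for Italian. It seems to do very well with Greek, and very poorly with English (perhaps because of overlap with other languages).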
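
Instead of reading the bright spots off the plot, the largest off-diagonal entries can be listed directly. The sketch below row-normalizes a small count matrix and ranks the confusions; the labels and counts are hypothetical, and the helper names are not part of the code above. It also guards against an all-zero row, a case the normalization loop above does not handle.

```python
def normalize_rows(matrix):
    out = []
    for row in matrix:
        total = sum(row)
        # Guard against empty rows (a class that was never sampled).
        out.append([v / total if total else 0.0 for v in row])
    return out

def worst_confusions(matrix, labels, top=2):
    # Collect (rate, actual, guessed) for every off-diagonal cell.
    pairs = [(matrix[i][j], labels[i], labels[j])
             for i in range(len(labels))
             for j in range(len(labels)) if i != j]
    return sorted(pairs, reverse=True)[:top]

labels = ['Greek', 'Italian', 'Spanish']  # hypothetical subset
counts = [[9, 1, 0],
          [1, 6, 3],
          [0, 5, 5]]                      # hypothetical counts: rows = actual

normalized = normalize_rows(counts)
worst = worst_confusions(normalized, labels)
print(worst)
```

Each tuple reads as "this fraction of names from the actual language was guessed as the other language".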

1.6.5 Running on user input

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get the top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')