Table of Contents
Import required packages
Get the parallel dataset
Prepare the tokenizers
Build the TorchText Vocab objects and convert the sentences into Torch tensors
Create the DataLoader object to be iterated during training
Sequence-to-sequence Transformer
Start training
Try translating a Japanese sentence using the trained model
Save the Vocab objects and trained model
Import required packages
First, let's make sure we have the packages below installed on our system. If you find that some packages are missing, make sure to install them.
import math
import torchtext
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from collections import Counter
from torchtext.vocab import Vocab
from torch.nn import TransformerEncoder, TransformerDecoder, TransformerEncoderLayer, TransformerDecoderLayer
import io
import time
import pandas as pd
import numpy as np
import pickle
import tqdm
import sentencepiece as spm
torch.manual_seed(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# print(torch.cuda.get_device_name(0)) ## If you have a GPU, try running this code on your own machine
device
Output:
device(type='cpu')
Get the parallel dataset
In this tutorial, we will use the Japanese-English parallel dataset downloaded from JParaCrawl [http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl], which is described as the "largest publicly available English-Japanese parallel corpus created by NTT. It was created by largely crawling the web and automatically aligning parallel sentences." You can also see the paper here.
# Read the data from the 'zh-ja.bicleaner05.txt' file, using tab as the separator and the more flexible 'python' engine
df = pd.read_csv('zh-ja.bicleaner05.txt', sep='\\t', engine='python', header=None)
# Extract column 2 of df as a Python list and assign it to trainen
trainen = df[2].values.tolist()#[:10000]
# Extract column 3 of df as a Python list and assign it to trainja
trainja = df[3].values.tolist()#[:10000]
# trainen.pop(5972)
# trainja.pop(5972)
After importing all the Japanese sentences and their English counterparts, I deleted the last entry in the dataset because it has a missing value. In total, the number of sentences in both trainen and trainja is 5,973,071. However, for learning purposes, it is often recommended to sample the data and make sure everything is working as intended before using all the data at once, to save time.
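For example, a minimal way to work with a smaller sample first (SAMPLE_SIZE is just an illustrative value):
# Optional: take a small slice of the data to verify the whole pipeline quickly.
# SAMPLE_SIZE is an illustrative value; skip this step to train on the full corpus.
SAMPLE_SIZE = 10000
trainen_small = trainen[:SAMPLE_SIZE]
trainja_small = trainja[:SAMPLE_SIZE]
print(len(trainen_small), len(trainja_small))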
Here is an example of a sentence contained in the dataset.
print(trainen[500])
print(trainja[500])
Output:
Chinese HS Code Harmonized Code System < HS编码 2905 无环醇及其卤化、磺化、硝化或亚硝化衍生物 HS Code List (Harmonized System Code) for US, UK, EU, China, India, France, Japan, Russia, Germany, Korea, Canada ...
Japanese HS Code Harmonized Code System < HSコード 2905 非環式アルコール並びにそのハロゲン化誘導体、スルホン化誘導体、ニトロ化誘導体及びニトロソ化誘導体 HS Code List (Harmonized System Code) for US, UK, EU, China, India, France, Japan, Russia, Germany, Korea, Canada ...
We can also use different parallel datasets to follow along with this article; just make sure that we can process the data into the two lists of strings as shown above, containing the Japanese and English sentences.
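For reference, here is a minimal sketch of such a loader, assuming a hypothetical tab-separated file 'my_parallel_corpus.tsv' with the English sentence in the first column and the Japanese sentence in the second; adapt the file name and column order to your own data.
# Minimal sketch for a hypothetical tab-separated corpus: read it line by line
# into two aligned lists of strings, one English and one Japanese.
trainen_alt, trainja_alt = [], []
with io.open('my_parallel_corpus.tsv', encoding='utf-8') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        if len(cols) >= 2:
            trainen_alt.append(cols[0])
            trainja_alt.append(cols[1])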
Prepare the tokenizers
Unlike English and other alphabetical languages, a Japanese sentence does not contain whitespace to separate the words. We can use the tokenizers provided by JParaCrawl, which were created using SentencePiece for both Japanese and English; you can visit the JParaCrawl website to download them, or click here.
# Load the pre-trained English model file 'spm.en.nopretok.model' with SentencePieceProcessor to create the en_tokenizer object
en_tokenizer = spm.SentencePieceProcessor(model_file='spm.en.nopretok.model')
# Load the pre-trained Japanese model file 'spm.ja.nopretok.model' with SentencePieceProcessor to create the ja_tokenizer object
ja_tokenizer = spm.SentencePieceProcessor(model_file='spm.ja.nopretok.model')
After the tokenizers are loaded, you can test them, for example, by executing the code below.
# Use en_tokenizer to encode (tokenize) the input English text
en_tokenizer.encode("All residents aged 20 to 59 years who live in Japan must enroll in public pension system.", out_type='str')
Output:
['▁All',
'▁residents',
'▁aged',
'▁20',
'▁to',
'▁59',
'▁years',
'▁who',
'▁live',
'▁in',
'▁Japan',
'▁must',
'▁enroll',
'▁in',
'▁public',
'▁pension',
'▁system',
'.']
# Use ja_tokenizer to encode (tokenize) the input Japanese text
ja_tokenizer.encode("年金 日本に住んでいる20歳~60歳の全ての人は、公的年金制度に加入しなければなりません。", out_type='str')
Output:
['▁',
'年',
'金',
'▁日本',
'に住んでいる',
'20',
'歳',
'~',
'60',
'歳の',
'全ての',
'人は',
'、',
'公的',
'年',
'金',
'制度',
'に',
'加入',
'しなければなりません',
'。']
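The same tokenizers can also emit integer subword ids directly and reconstruct the original text, which makes for a handy sanity check (assuming a reasonably recent sentencepiece version):
# Encode to subword ids instead of string pieces, then decode back to text.
ids = ja_tokenizer.encode("日本に住んでいる", out_type=int)
print(ids)
print(ja_tokenizer.decode(ids))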
Build the TorchText Vocab objects and convert the sentences into Torch tensors
Using the tokenizers and the raw sentences, we then build the Vocab object imported from TorchText. This process can take a few seconds or minutes depending on the size of our dataset and computing power. Different tokenizers can also affect the time needed to build the vocab; I tried several other tokenizers for Japanese, but SentencePiece seems to work well and fast enough for me.
def build_vocab(sentences, tokenizer):
    counter = Counter()
    for sentence in sentences:
        # Encode (tokenize) each sentence with the tokenizer and update the counter
        counter.update(tokenizer.encode(sentence, out_type=str))
    # Build a Vocab object from the Counter, specifying the special tokens
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
# Build the Japanese (ja_vocab) and English (en_vocab) vocabularies with build_vocab
ja_vocab = build_vocab(trainja, ja_tokenizer)
en_vocab = build_vocab(trainen, en_tokenizer)
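A quick, optional sanity check on the resulting vocabularies; this relies on the same legacy torchtext Vocab interface used throughout the rest of this article:
# Check vocabulary sizes and the indices assigned to the special tokens.
print(len(ja_vocab), len(en_vocab))
print(ja_vocab['<unk>'], ja_vocab['<pad>'], ja_vocab['<bos>'], ja_vocab['<eos>'])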
After we have the vocabulary objects, we can then use the vocab and the tokenizer objects to build the tensors for our training data.
def data_process(ja, en):
    data = []
    for (raw_ja, raw_en) in zip(ja, en):
        # Process each pair of raw Japanese and English text
        # Process the Japanese text
        ja_tensor_ = torch.tensor([ja_vocab[token] for token in ja_tokenizer.encode(raw_ja.rstrip("\n"), out_type=str)],
                                  dtype=torch.long)
        # Process the English text
        en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer.encode(raw_en.rstrip("\n"), out_type=str)],
                                  dtype=torch.long)
        # Append the processed Japanese and English tensors to the data list as a tuple
        data.append((ja_tensor_, en_tensor_))
    return data
# Call data_process on the training data trainja and trainen to obtain train_data
train_data = data_process(trainja, trainen)
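Each entry of train_data is a pair of 1-D LongTensors of token indices (the <bos>/<eos> tokens are added later by the collate function). A quick, optional check:
# Inspect the first (Japanese, English) tensor pair.
sample_ja, sample_en = train_data[0]
print(sample_ja.shape, sample_en.shape)
print(sample_ja[:10])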
Create the DataLoader object to be iterated during training
Here, I set a small BATCH_SIZE to prevent "cuda out of memory", but this depends on various things such as your machine's memory capacity, the size of the data, etc., so feel free to change the batch size according to your needs (note: the tutorial from PyTorch sets the batch size to 128 using the Multi30k German-English dataset).
BATCH_SIZE = 8
# PAD_IDX, BOS_IDX, EOS_IDX: the indices of the special tokens <pad>, <bos> (begin-of-sentence), and <eos> (end-of-sentence) in the Japanese vocabulary.
PAD_IDX = ja_vocab['<pad>']
BOS_IDX = ja_vocab['<bos>']
EOS_IDX = ja_vocab['<eos>']
def generate_batch(data_batch):
    ja_batch, en_batch = [], []
    for (ja_item, en_item) in data_batch:
        # Add the begin- and end-of-sentence tokens to both the Japanese and English sentences
        ja_batch.append(torch.cat([torch.tensor([BOS_IDX]), ja_item, torch.tensor([EOS_IDX])], dim=0))
        en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
    # Pad the Japanese and English sentences in each batch so that all sentences in a batch have the same length
    ja_batch = pad_sequence(ja_batch, padding_value=PAD_IDX)
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
    return ja_batch, en_batch
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
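Optionally, pull a single batch to confirm the layout: pad_sequence defaults to batch_first=False, so both tensors come out as (sequence_length, batch_size).
# Fetch one batch to verify the (sequence_length, batch_size) layout.
src_sample, tgt_sample = next(iter(train_iter))
print(src_sample.shape, tgt_sample.shape)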
Sequence-to-sequence Transformer
The next few code blocks and text explanations (written in italics) are taken from the original PyTorch tutorial [https://pytorch.org/tutorials/beginner/translation_transformer.html]. I did not make any changes except for BATCH_SIZE and the word de_vocab, which is changed to ja_vocab.
Transformer is a Seq2Seq model introduced in the "Attention is all you need" paper for solving the machine translation task. The Transformer model consists of an encoder and a decoder block, each containing a fixed number of layers.
The encoder processes the input sequence by propagating it through a series of Multi-head Attention and Feed-forward network layers. The output from the encoder, referred to as memory, is fed to the decoder along with the target tensors. The encoder and decoder are trained in an end-to-end fashion using the teacher forcing technique.
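Concretely, teacher forcing means the decoder is fed the ground-truth target shifted by one position: it sees tokens up to position t and is trained to predict the token at position t+1. This is exactly the tgt[:-1, :] / tgt[1:, :] split used in the training loop further below; a tiny illustration on a dummy tensor:
# Dummy target of shape (sequence_length, batch_size) to illustrate the shift.
dummy_tgt = torch.arange(10).reshape(5, 2)
tgt_input = dummy_tgt[:-1, :]  # what the decoder is fed
tgt_out = dummy_tgt[1:, :]     # what the decoder learns to predict
print(tgt_input.shape, tgt_out.shape)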
from torch.nn import (TransformerEncoder, TransformerDecoder,
                      TransformerEncoderLayer, TransformerDecoderLayer)
class Seq2SeqTransformer(nn.Module):
    # num_encoder_layers and num_decoder_layers specify the number of encoder and decoder layers. emb_size is the dimension of the token embeddings.
    # src_vocab_size and tgt_vocab_size are the sizes of the source and target vocabularies. dim_feedforward is the dimension of the feed-forward layers in the Transformer.
    # dropout is the dropout rate of the model.
    def __init__(self, num_encoder_layers: int, num_decoder_layers: int,
                 emb_size: int, src_vocab_size: int, tgt_vocab_size: int,
                 dim_feedforward: int = 512, dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        encoder_layer = TransformerEncoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
        decoder_layer = TransformerDecoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_decoder = TransformerDecoder(decoder_layer, num_layers=num_decoder_layers)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)
    # Forward pass
    def forward(self, src: Tensor, trg: Tensor, src_mask: Tensor,
                tgt_mask: Tensor, src_padding_mask: Tensor,
                tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        memory = self.transformer_encoder(src_emb, src_mask, src_padding_mask)
        outs = self.transformer_decoder(tgt_emb, memory, tgt_mask, None,
                                        tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)
    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer_encoder(self.positional_encoding(
            self.src_tok_emb(src)), src_mask)
    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer_decoder(self.positional_encoding(
            self.tgt_tok_emb(tgt)), memory,
            tgt_mask)
Text tokens are represented by using token embeddings. Positional encoding is added to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout, maxlen: int = 5000):
        # PositionalEncoding takes the embedding size emb_size, the dropout rate, and the maximum sequence length maxlen as parameters.
        # It first computes the term den used in the sine and cosine functions, then generates the positional encoding matrix pos_embedding
        # following the method proposed in the Transformer paper.
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)
    def forward(self, token_embedding: Tensor):
        # token_embedding is the output of the TokenEmbedding class, of shape (seq_length, batch_size, emb_size)
        # Returns the token embedding with the positional encoding added, with the same shape
        return self.dropout(token_embedding +
                            self.pos_embedding[:token_embedding.size(0), :])
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size
    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
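A quick shape check with illustrative values: token indices of shape (sequence_length, batch_size) become embeddings of shape (sequence_length, batch_size, emb_size), and the positional encoding is broadcast along the batch dimension.
# Illustrative shape check with dummy sizes (not the real vocabulary).
dummy_emb = TokenEmbedding(vocab_size=1000, emb_size=512)
dummy_pos = PositionalEncoding(emb_size=512, dropout=0.1)
dummy_tokens = torch.randint(0, 1000, (7, 2))    # (sequence_length, batch_size)
print(dummy_pos(dummy_emb(dummy_tokens)).shape)  # torch.Size([7, 2, 512])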
We create a subsequent-word mask to stop a target word from attending to its subsequent words. We also create masks for masking source and target padding tokens.
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]
    # Generate the target sequence (subsequent-word) mask
    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    # Create an all-zeros (all-False) source sequence mask
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=device).type(torch.bool)
    # Create the source and target padding masks
    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
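To see what these masks look like, here is a small illustration on dummy tensors (5 and 7 are just arbitrary non-padding token ids):
# The subsequent-word mask puts -inf above the diagonal and 0 elsewhere.
print(generate_square_subsequent_mask(4))
# Dummy source/target in the (sequence_length, batch_size) layout, padded at the end.
dummy_src = torch.full((6, 2), PAD_IDX); dummy_src[:4, :] = 5
dummy_tgt = torch.full((5, 2), PAD_IDX); dummy_tgt[:3, :] = 7
m_src, m_tgt, pm_src, pm_tgt = create_mask(dummy_src, dummy_tgt)
print(m_src.shape, m_tgt.shape, pm_src.shape, pm_tgt.shape)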
SRC_VOCAB_SIZE = len(ja_vocab)  # Size of the source-language vocabulary
TGT_VOCAB_SIZE = len(en_vocab)  # Size of the target-language vocabulary
EMB_SIZE = 512  # Embedding dimension
NHEAD = 8  # Number of attention heads
FFN_HID_DIM = 512  # Hidden dimension of the feed-forward network
BATCH_SIZE = 16  # Batch size
NUM_ENCODER_LAYERS = 3  # Number of encoder layers
NUM_DECODER_LAYERS = 3  # Number of decoder layers
NUM_EPOCHS = 16  # Number of training epochs
# Initialize the Transformer model
transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                                 EMB_SIZE, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE,
                                 FFN_HID_DIM)
# Initialize the model parameters with Xavier initialization
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
# Move the model to the selected device
transformer = transformer.to(device)
# Define the loss function: cross-entropy loss that ignores the padding token
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
# Define the optimizer: Adam
optimizer = torch.optim.Adam(
    transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9
)
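Optionally, count the trainable parameters to get a feel for the model size before starting training:
# Count the trainable parameters of the Transformer.
num_params = sum(p.numel() for p in transformer.parameters() if p.requires_grad)
print(f'Trainable parameters: {num_params:,}')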
def train_epoch(model, train_iter, optimizer):
    """
    Train the model for one epoch.
    Args:
    - model (Seq2SeqTransformer): the Transformer model
    - train_iter (DataLoader): the training data iterator
    - optimizer (torch.optim.Optimizer): the optimizer
    Returns:
    - float: the average training loss
    """
    model.train()
    losses = 0
    for idx, (src, tgt) in enumerate(train_iter):
        src = src.to(device)
        tgt = tgt.to(device)
        tgt_input = tgt[:-1, :]
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)
        optimizer.zero_grad()
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        optimizer.step()
        losses += loss.item()
    return losses / len(train_iter)
def evaluate(model, val_iter):
    model.eval()
    losses = 0
    for idx, (src, tgt) in enumerate(val_iter):
        src = src.to(device)
        tgt = tgt.to(device)
        tgt_input = tgt[:-1, :]
        # Create the masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        # Forward pass
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()
    return losses / len(val_iter)
Start training
Finally, after preparing the necessary classes and functions, we are ready to train our model. This goes without saying, but the time needed to finish training could vary greatly depending on a lot of things such as computing power, parameters, and the size of the datasets.
When I trained the model using the complete list of sentences from JParaCrawl which has around 5.9 million sentences for each language, it took around 5 hours per epoch using a single NVIDIA GeForce RTX 3070 GPU.
Here is the code:
for epoch in tqdm.tqdm(range(1, NUM_EPOCHS + 1)):
    start_time = time.time()
    train_loss = train_epoch(transformer, train_iter, optimizer)
    end_time = time.time()
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, "
           f"Epoch time = {(end_time - start_time):.3f}s"))
# This lets you monitor each epoch's training progress and loss in real time, which helps with debugging and optimizing the training process.
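The evaluate function defined earlier is not called in this article, since we train on the full data without a held-out split. If you do carve out a validation set, it can be used as in the sketch below, where valid_ja, valid_en, and valid_iter are hypothetical held-out data and a DataLoader built exactly like train_iter.
# Hypothetical usage of evaluate() with a held-out validation split.
# valid_data = data_process(valid_ja, valid_en)
# valid_iter = DataLoader(valid_data, batch_size=BATCH_SIZE,
#                         shuffle=False, collate_fn=generate_batch)
# val_loss = evaluate(transformer, valid_iter)
# print(f"Validation loss: {val_loss:.3f}")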
Try translating a Japanese sentence using the trained model
First, we create the functions to translate a new sentence, including steps such as getting the Japanese sentence, tokenizing it, converting it to tensors, running inference, and then decoding the result back into a sentence, but this time in English.
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)
    # Encode the source sentence once, then decode one token at a time
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    for i in range(max_len - 1):
        memory = memory.to(device)
        memory_mask = torch.zeros(ys.shape[0], memory.shape[0]).to(device).type(torch.bool)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(device)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        # Greedily pick the most probable next token and append it to the output
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
def translate(model, src, src_vocab, tgt_vocab, src_tokenizer):
    model.eval()
    # Tokenize the source sentence, wrap it with <bos>/<eos>, and decode greedily
    tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]
    num_tokens = len(tokens)
    src = (torch.LongTensor(tokens).reshape(num_tokens, 1))
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join([tgt_vocab.itos[tok] for tok in tgt_tokens]).replace("<bos>", "").replace("<eos>", "")
Then, we can just call the translate function and pass the required parameters.
translate(transformer, "HSコード 8515 はんだ付け用、ろう付け用又は溶接用の機器(電気式(電気加熱ガス式を含む。)", ja_vocab, en_vocab, ja_tokenizer)
trainen.pop(5)
trainja.pop(5)
Save the Vocab objects and trained model
Finally, after the training has finished, we will save the Vocab objects (en_vocab and ja_vocab) first, using Pickle.
import pickle
# open a file where you want to store the data
file = open('en_vocab.pkl', 'wb')
# dump information to that file
pickle.dump(en_vocab, file)
file.close()  # close the file once the data has been written
file = open('ja_vocab.pkl', 'wb')  # open a file in binary write mode ('wb') to store the Japanese vocabulary
pickle.dump(ja_vocab, file)  # serialize the ja_vocab object with pickle.dump() and write it to the file
file.close()  # close the file once the data has been written
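Later, the pickled vocabularies can be loaded back the same way (a minimal sketch):
# Load the pickled Vocab objects back when needed.
with open('en_vocab.pkl', 'rb') as f:
    en_vocab = pickle.load(f)
with open('ja_vocab.pkl', 'rb') as f:
    ja_vocab = pickle.load(f)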
Lastly, we can also save the model for later use with PyTorch's save and load functions. Generally, there are two ways to save the model, depending on what we want to use it for later. The first one is for inference only: we can load the model later and use it to translate from Japanese to English.
# save model for inference
torch.save(transformer.state_dict(), 'inference_model')
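To use the saved weights later, rebuild the model with the same hyperparameters and load the state dict; a minimal sketch (map_location lets a GPU-trained model load on a CPU-only machine):
# Rebuild the architecture with the same hyperparameters, then load the weights.
model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                           EMB_SIZE, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE,
                           FFN_HID_DIM)
model.load_state_dict(torch.load('inference_model', map_location=device))
model = model.to(device)
model.eval()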
The second one is for inference too, but also for when we want to load the model later and resume training.
# save model + checkpoint to resume training later
torch.save({
    'epoch': NUM_EPOCHS,  # the current training epoch
    'model_state_dict': transformer.state_dict(),  # the model's state dict, containing all parameters
    'optimizer_state_dict': optimizer.state_dict(),  # the optimizer's state dict, containing the optimizer state
    'loss': train_loss,  # the latest training loss
}, 'model_checkpoint.tar')  # save all of this to a file named 'model_checkpoint.tar'
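To resume training from this checkpoint later, load the dictionary and restore both the model and the optimizer state (a minimal sketch):
# Restore the model and optimizer state from the checkpoint to resume training.
checkpoint = torch.load('model_checkpoint.tar', map_location=device)
transformer.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
last_loss = checkpoint['loss']
transformer.train()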