Sequence Generation Model Evaluation and a Neural Machine Translation Implementation

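1. Evaluation Metrics for Sequence Generation Models

In sequence generation tasks, several evaluation metrics can be used to measure how close the generated output is to a reference.

1.1 N-gram Overlap Metrics

N-gram overlap metrics measure how close an output is to a reference by computing n-gram overlap statistics. Common metrics of this kind include BLEU, ROUGE, and METEOR. Of these, BLEU (BiLingual Evaluation Understudy) has stood the test of time in machine translation and has become the metric of choice there. In practice, BLEU scores can be computed with toolkits such as NLTK or SacreBLEU; when reference data is available, the computation is quick and simple.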
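To make "n-gram overlap statistics" concrete, here is a minimal sketch of the clipped n-gram precision that BLEU builds on. It is not a full BLEU implementation: it handles a single reference, has no brevity penalty, and does not combine n-gram orders. The helper names `ngrams` and `clipped_precision` are made up for this illustration. For real evaluation, use NLTK or SacreBLEU.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return overlap / total


reference = "the cat is on the mat".split()
candidate = "the the the cat".split()

# "the" occurs three times in the candidate but only twice in the
# reference, so only two of those occurrences are credited.
unigram_p = clipped_precision(candidate, reference, 1)
bigram_p = clipped_precision(candidate, reference, 2)
print(unigram_p, bigram_p)
```

The clipping is what stops a degenerate output that repeats a common word from scoring well.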

1.2 Perplexity

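Perplexity is another automatic evaluation metric, grounded in information theory, and it applies whenever the probability of an output sequence can be measured. For a sequence $x$ of $N$ tokens to which the model assigns probability $P(x)$, perplexity is commonly defined as:

$$\mathrm{Perplexity}(x) = 2^{-\frac{1}{N}\log_2 P(x)} = P(x)^{-1/N}$$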

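Perplexity gives a simple way to compare sequence generation models: measure each model's perplexity on a held-out dataset. However, perplexity has several problems as a sequence generation metric:
- Metric inflation: the expression for perplexity involves exponentiation, so a small difference in model performance (likelihood) can produce a large difference in perplexity, giving the illusion of significant progress.
- Mismatch with error rate: a change in perplexity may not translate into a corresponding change in the model's error rate as observed through other metrics.
- Mismatch with human perception: as with BLEU and other n-gram-based metrics, an improvement in perplexity may not translate into an improvement that humans can perceive.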
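The inflation effect can be seen with a few lines of arithmetic. The sketch below assumes the common per-token definition, perplexity = exp(mean negative log-likelihood), and two hypothetical models whose per-token probabilities differ by only 0.001. The numbers are invented for illustration, not measured results.

```python
import math


def perplexity(token_log_probs):
    """Per-token perplexity from natural-log token probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Two hypothetical models scoring the same 4-token held-out sequence.
model_a = [math.log(0.010)] * 4   # assigns probability 0.010 to each token
model_b = [math.log(0.011)] * 4   # assigns probability 0.011 to each token

ppl_a = perplexity(model_a)
ppl_b = perplexity(model_b)
print(f"model A: {ppl_a:.1f}, model B: {ppl_b:.1f}")
```

A shift of one thousandth in per-token probability moves perplexity by roughly nine points here, which is why raw perplexity gaps should be read with care.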

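2. A Neural Machine Translation Example

With the rise of deep learning in the early 2010s, translating between two languages with word embeddings and recurrent neural networks (RNNs) became a powerful approach. The introduction of attention mechanisms improved machine translation models further.

2.1 The Machine Translation Dataset

This example uses a dataset of English-French sentence pairs from the Tatoeba project. The preprocessing steps are:
1. Convert all sentences to lowercase.
2. Apply NLTK's English and French tokenizers to each sentence pair.
3. Apply NLTK's language-specific word tokenizer to produce a list of tokens; this list is the preprocessed dataset.

To simplify the learning problem, a subset of the data is selected using a fixed list of syntactic patterns: only pairs whose English sentence begins with "i am", "he is", "she is", "they are", "you are", or "we are" are kept. This reduces the dataset from 135,842 sentence pairs to 13,062. Finally, the remaining pairs are split into 70% training, 15% validation, and 15% test. To keep the proportion of each sentence opening constant across the splits, the pairs are grouped by opening, each group is split separately, and the per-group splits are merged.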
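The subset selection and the per-group 70/15/15 split might look roughly like the sketch below. It is plain Python on toy data, not the original preprocessing script. The function names, the whole-word prefix match (prefix followed by a space), and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

PREFIXES = ("i am", "he is", "she is", "they are", "you are", "we are")


def matching_prefix(english_sentence):
    """Return the prefix the lowercased sentence starts with, or None."""
    lowered = english_sentence.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix + " "):
            return prefix
    return None


def stratified_split(pairs, seed=0, train_pct=70, val_pct=15):
    """Split (english, french) pairs separately within each prefix group."""
    groups = defaultdict(list)
    for english, french in pairs:
        prefix = matching_prefix(english)
        if prefix is not None:            # pairs outside the subset are dropped
            groups[prefix].append((english, french))

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for prefix in sorted(groups):
        group = groups[prefix]
        rng.shuffle(group)
        n_train = len(group) * train_pct // 100
        n_val = len(group) * val_pct // 100
        splits["train"].extend(group[:n_train])
        splits["val"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])   # remainder
    return splits


# Toy data: 20 pairs per prefix plus one pair that should be filtered out.
toy_pairs = [(f"{p} example {i}", f"exemple {i}")
             for p in PREFIXES for i in range(20)]
toy_pairs.append(("this sentence is dropped", "cette phrase est ignorée"))

splits = stratified_split(toy_pairs)
print({name: len(items) for name, items in splits.items()})
```

Because each group is split on its own, every prefix contributes the same 14/3/3 share to the three partitions in this toy run.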

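2.2 The Vectorization Pipeline

Vectorizing the source English and target French sentences requires a more complex pipeline than in earlier examples, for two reasons:
- Source and target sequences are handled differently: the source sequence gets a BEGIN-OF-SEQUENCE token inserted at the start and an END-OF-SEQUENCE token appended at the end. The target sequence is vectorized as two copies offset by one token: the first copy (the decoder input) needs the BEGIN-OF-SEQUENCE token, and the second copy (the prediction target) needs the END-OF-SEQUENCE token.
- The PackedSequence data structure is used: encoding the source sequence with a bidirectional gated recurrent unit (bi-GRU) relies on PyTorch's PackedSequence data structure, which requires each minibatch to be sorted by source sentence length.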
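The offset between the two target copies is easier to see on tokens than on integer indices. The sketch below works on strings instead of vocabulary indices. The token spellings `<BEGIN>`, `<END>`, `<MASK>` and the helper names are placeholders chosen for this illustration.

```python
BOS, EOS, MASK = "<BEGIN>", "<END>", "<MASK>"


def pad(tokens, length):
    """Right-pad a token list with MASK up to the given length."""
    return tokens + [MASK] * (length - len(tokens))


def make_example(source_text, target_text, max_source_len, max_target_len):
    source = source_text.split(" ")
    target = target_text.split(" ")
    return {
        # source: BOS + tokens + EOS, hence the maximum length + 2
        "source": pad([BOS] + source + [EOS], max_source_len + 2),
        # decoder input: BOS + tokens, hence the maximum length + 1
        "target_x": pad([BOS] + target, max_target_len + 1),
        # decoder target: tokens + EOS, one step ahead of target_x
        "target_y": pad(target + [EOS], max_target_len + 1),
        "source_length": len(source) + 2,
    }


example = make_example("i am here", "je suis ici",
                       max_source_len=4, max_target_len=4)
for name, value in example.items():
    print(name, value)
```

At every position, `target_y` holds the token the decoder should predict after reading the token at the same position in `target_x`.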

The NMTVectorizer implementation is as follows:

import numpy as np

# SequenceVocabulary is assumed to be defined elsewhere; it is not shown here.


class NMTVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, source_vocab, target_vocab, max_source_length, 
                 max_target_length):
        """
        Args:
            source_vocab (SequenceVocabulary): maps source words to integers
            target_vocab (SequenceVocabulary): maps target words to integers
            max_source_length (int): the longest sequence in the source dataset
            max_target_length (int): the longest sequence in the target dataset
        """
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab

        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    @classmethod
    def from_dataframe(cls, bitext_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            bitext_df (pandas.DataFrame): the parallel text dataset
        Returns:
            an instance of the NMTVectorizer
        """
        source_vocab = SequenceVocabulary()
        target_vocab = SequenceVocabulary()
        max_source_length, max_target_length = 0, 0
        for _, row in bitext_df.iterrows():
            source_tokens = row["source_language"].split(" ")
            if len(source_tokens) > max_source_length:
                max_source_length = len(source_tokens)
            for token in source_tokens:
                source_vocab.add_token(token)

            target_tokens = row["target_language"].split(" ")
            if len(target_tokens) > max_target_length:
                max_target_length = len(target_tokens)
            for token in target_tokens:
                target_vocab.add_token(token)

        return cls(source_vocab, target_vocab, max_source_length,
                   max_target_length)

    def _vectorize(self, indices, vector_length=-1, mask_index=0):
        """Vectorize the provided indices
        Args:
            indices (list): a list of integers that represent a sequence
            vector_length (int): forces the length of the index vector
            mask_index (int): the mask_index to use; almost always 0
        """
        if vector_length < 0:
            vector_length = len(indices)
        vector = np.zeros(vector_length, dtype=np.int64)
        vector[:len(indices)] = indices
        vector[len(indices):] = mask_index
        return vector

    def _get_source_indices(self, text):
        """Return the vectorized source text
        Args:
            text (str): the source text; tokens should be separated by spaces
        Returns:
            indices (list): list of integers representing the text
        """
        indices = [self.source_vocab.begin_seq_index]
        indices.extend(self.source_vocab.lookup_token(token) 
                       for token in text.split(" "))
        indices.append(self.source_vocab.end_seq_index)
        return indices

    def _get_target_indices(self, text):
        """Return the vectorized target text
        Args:
            text (str): the target text; tokens should be separated by spaces
        Returns:
            a tuple: (x_indices, y_indices)
                x_indices (list): list of ints; observations in target decoder
                y_indices (list): list of ints; predictions in target decoder
        """
        indices = [self.target_vocab.lookup_token(token) 
                   for token in text.split(" ")]
        x_indices = [self.target_vocab.begin_seq_index] + indices
        y_indices = indices + [self.target_vocab.end_seq_index]
        return x_indices, y_indices

    def vectorize(self, source_text, target_text, use_dataset_max_lengths=True):
        """Return the vectorized source and target text
        Args:
            source_text (str): text from the source language
            target_text (str): text from the target language
            use_dataset_max_lengths (bool): whether to use the max vector lengths
        Returns:
            The vectorized data point as a dictionary with the keys:
                source_vector, target_x_vector, target_y_vector, source_length
        """
        source_vector_length = -1
        target_vector_length = -1
        if use_dataset_max_lengths:
            source_vector_length = self.max_source_length + 2
            target_vector_length = self.max_target_length + 1
        source_indices = self._get_source_indices(source_text)
        source_vector = self._vectorize(source_indices, 
                                        vector_length=source_vector_length, 
                                        mask_index=self.source_vocab.mask_index)

        target_x_indices, target_y_indices = self._get_target_indices(target_text)
        target_x_vector = self._vectorize(target_x_indices,
                                        vector_length=target_vector_length,
                                        mask_index=self.target_vocab.mask_index)
        target_y_vector = self._vectorize(target_y_indices,
                                        vector_length=target_vector_length,
                                        mask_index=self.target_vocab.mask_index)
        return {"source_vector": source_vector,
                "target_x_vector": target_x_vector, 
                "target_y_vector": target_y_vector, 
                "source_length": len(source_indices)}

To generate minibatches for NMT, the generate_batches() function is modified into generate_nmt_batches():

from torch.utils.data import DataLoader


def generate_nmt_batches(dataset, batch_size, shuffle=True,
                         drop_last=True, device="cpu"):
    """A generator function which wraps the PyTorch DataLoader; NMT version """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        # Sort the batch by source length, longest first, which is the order
        # pack_padded_sequence expects by default.
        # Assumes the dataset yields the dictionary from
        # NMTVectorizer.vectorize() unchanged, hence the key 'source_length'.
        lengths = data_dict['source_length'].numpy()
        sorted_length_indices = lengths.argsort()[::-1].tolist()

        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = tensor[sorted_length_indices].to(device)
        yield out_data_dict
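The sorting step can be checked in isolation. The sketch below builds a toy batch by hand, with made-up indices and key names chosen for the example, and applies the same argsort-then-reverse idiom to every tensor in the dictionary.

```python
import torch

# A hand-made batch of three padded source sequences (0 is the padding index).
batch = {
    "source_vector": torch.tensor([[2, 5, 3, 0, 0],
                                   [2, 6, 7, 8, 3],
                                   [2, 9, 4, 3, 0]]),
    "source_length": torch.tensor([3, 5, 4]),
}

# argsort is ascending, so reversing it gives longest-first order.
order = batch["source_length"].numpy().argsort()[::-1].tolist()

# Index every tensor with the same order so rows stay aligned across keys.
sorted_batch = {name: tensor[order] for name, tensor in batch.items()}

print(order)
print(sorted_batch["source_length"].tolist())
```

Applying one shared index list to all tensors is what keeps each source sentence paired with its own length and its own targets.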

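3. Encoding and Decoding in the Neural Machine Translation Model

Neural machine translation models typically use an encoder-decoder architecture.

3.1 The Encoder

The encoder maps the input integer sequence to a feature vector for each position. The encoder in this example uses a bidirectional gated recurrent unit (bi-GRU), implemented as follows: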

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class NMTEncoder(nn.Module):
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size):
        """
        Args:
            num_embeddings (int): size of source vocabulary
            embedding_size (int): size of the embedding vectors
            rnn_hidden_size (int): size of the RNN hidden state vectors 
        """
        super(NMTEncoder, self).__init__()

        self.source_embedding = nn.Embedding(num_embeddings, embedding_size,
                                             padding_idx=0)
        self.birnn = nn.GRU(embedding_size, rnn_hidden_size, bidirectional=True,
                            batch_first=True)

    def forward(self, x_source, x_lengths):
        """The forward pass of the model
        Args:
            x_source (torch.Tensor): the input data tensor
                x_source.shape is (batch, seq_size)
            x_lengths (torch.Tensor): vector of lengths for each item in batch
        Returns:
            a tuple: x_unpacked (torch.Tensor), x_birnn_h (torch.Tensor)
                x_unpacked.shape = (batch, seq_size, rnn_hidden_size * 2)
                x_birnn_h.shape = (batch, rnn_hidden_size * 2)
        """
        x_embedded = self.source_embedding(x_source)
        x_lengths = x_lengths.detach().cpu().numpy()
        x_packed = pack_padded_sequence(x_embedded, x_lengths, batch_first=True)
        x_birnn_out, x_birnn_h  = self.birnn(x_packed)
        x_birnn_h = x_birnn_h.permute(1, 0, 2)
        x_birnn_h = x_birnn_h.contiguous().view(x_birnn_h.size(0), -1)
        x_unpacked, _ = pad_packed_sequence(x_birnn_out, batch_first=True)
        return x_unpacked, x_birnn_h

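The encoder works as follows:
1. Embed the input sequence with the embedding layer.
2. To handle the masked (padding) positions of variable-length sequences, use PyTorch's PackedSequence data structure: pass the embedded sequence and the sequence lengths to pack_padded_sequence() to obtain a PackedSequence object.
3. Feed the PackedSequence object to the bi-GRU to obtain the outputs and the final hidden state.
4. Permute the dimensions of the final hidden state and flatten it so that each batch item has a single vector containing both directions.
5. Use pad_packed_sequence() to unpack the bi-GRU outputs into a full padded tensor.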
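The pack, run, unpack round trip can be tried on random data without the rest of the model. The sketch below uses arbitrary small sizes and random "embeddings" in place of a real embedding layer; it only illustrates the shapes involved in steps 2 to 5.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

batch_size, max_len, embedding_size, hidden_size = 3, 5, 4, 6
x_embedded = torch.randn(batch_size, max_len, embedding_size)
x_lengths = [5, 3, 2]                       # already sorted, longest first

birnn = nn.GRU(embedding_size, hidden_size,
               bidirectional=True, batch_first=True)

# Step 2: pack, so padded positions are never fed to the GRU.
x_packed = pack_padded_sequence(x_embedded, x_lengths, batch_first=True)

# Step 3: h_n has shape (2 directions, batch, hidden).
packed_out, h_n = birnn(x_packed)

# Step 4: (directions, batch, hidden) -> (batch, directions * hidden).
h_flat = h_n.permute(1, 0, 2).contiguous().view(batch_size, -1)

# Step 5: back to a dense tensor; positions past each length are zeros.
x_unpacked, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(tuple(x_unpacked.shape), tuple(h_flat.shape), out_lengths.tolist())
```

With packing, the forward-direction final hidden state of each item is the state at its last real token, not at the last padded position. Newer PyTorch versions also accept `enforce_sorted=False` in `pack_padded_sequence`, which removes the need to sort the batch beforehand.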

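3.2 The Decoder

The decoder uses the encoder's final hidden state as its initial hidden state and uses an attention mechanism to select different information from the source sequence while generating the output sequence. The decoder is implemented as follows: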

import torch
import torch.nn as nn


class NMTDecoder(nn.Module):
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size, bos_index):
        """
        Args:
            num_embeddings (int): number of embeddings; also the number of 
                unique words in the target vocabulary 
            embedding_size (int): size of the embedding vector
            rnn_hidden_size (int): size of the hidden RNN state
            bos_index(int): BEGIN-OF-SEQUENCE index
        """
        super(NMTDecoder, self).__init__()
        self._rnn_hidden_size = rnn_hidden_size
        self.target_embedding = nn.Embedding(num_embeddings=num_embeddings, 
                                             embedding_dim=embedding_size,
                                             padding_idx=0)
        self.gru_cell = nn.GRUCell(embedding_size + rnn_hidden_size,
                                   rnn_hidden_size)
        self.hidden_map = nn.Linear(rnn_hidden_size, rnn_hidden_size)
        self.classifier = nn.Linear(rnn_hidden_size * 2, num_embeddings)
        self.bos_index = bos_index

    def _init_indices(self, batch_size):
        """ return the BEGIN-OF-SEQUENCE index vector """
        return torch.ones(batch_size, dtype=torch.int64) * self.bos_index

    def _init_context_vectors(self, batch_size):
        """ return a zeros vector for initializing the context """
        return torch.zeros(batch_size, self._rnn_hidden_size)

    def forward(self, encoder_state, initial_hidden_state, target_sequence):
        """The forward pass of the model
        Args:
            encoder_state (torch.Tensor): output of the NMTEncoder
            initial_hidden_state (torch.Tensor): last hidden state in the NMTEncoder
            target_sequence (torch.Tensor): target text data tensor
        Returns:
            output_vectors (torch.Tensor): prediction vectors at each output step
        """
        target_sequence = target_sequence.permute(1, 0)
        h_t = self.hidden_map(initial_hidden_state)
        batch_size = encoder_state.size(0)
        context_vectors = self._init_context_vectors(batch_size)
        y_t_index = self._init_indices(batch_size)

        h_t = h_t.to(encoder_state.device)
        y_t_index = y_t_index.to(encoder_state.device)
        context_vectors = context_vectors.to(encoder_state.device)

        output_vectors = []
        self._cached_p_attn = []
        self._cached_ht = []
        self._cached_decoder_state = encoder_state.cpu().detach().numpy()

        output_sequence_size = target_sequence.size(0)
        for i in range(output_sequence_size):
            y_input_vector = self.target_embedding(target_sequence[i])
            rnn_input = torch.cat([y_input_vector, context_vectors], dim=1)
            h_t = self.gru_cell(rnn_input, h_t)
            self._cached_ht.append(h_t.cpu().data.numpy())
            # verbose_attention returns (context_vectors, vector_probabilities)
            context_vectors, p_attn = \
                verbose_attention(encoder_state_vectors=encoder_state,
                                  query_vector=h_t)
            self._cached_p_attn.append(p_attn.cpu().detach().numpy())
            prediction_vector = torch.cat((context_vectors, h_t), dim=1)
            score_for_y_t_index = self.classifier(prediction_vector)
            output_vectors.append(score_for_y_t_index)
        return torch.stack(output_vectors, dim=1)

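The decoder works as follows:
1. Permute the target sequence so that it can be iterated over by time step.
2. Map the encoder's final hidden state through a linear layer and use the result as the initial hidden state.
3. Initialize the context vector to zeros and the first input word index to the BEGIN-OF-SEQUENCE token.
4. At each time step:
- Embed the current input word and concatenate it with the context vector from the previous time step.
- Compute the new hidden state with the GRUCell.
- Use the current hidden state as the query vector to create a new context vector through the attention mechanism.
- Concatenate the context vector and the hidden state and pass them through the classifier to obtain the prediction vector.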
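A single pass through step 4 can be traced with standalone layers to see how the shapes fit together. The sketch below uses arbitrary sizes, random encoder states, and a made-up BEGIN-OF-SEQUENCE index of 1. It mirrors the structure of the loop body and is not a replacement for NMTDecoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, src_len = 2, 4
embedding_size, hidden_size, vocab_size = 8, 6, 10

target_embedding = nn.Embedding(vocab_size, embedding_size, padding_idx=0)
gru_cell = nn.GRUCell(embedding_size + hidden_size, hidden_size)
classifier = nn.Linear(hidden_size * 2, vocab_size)

# Encoder states must have the same width as the decoder hidden state
# for the dot product below to be defined.
encoder_state = torch.randn(batch_size, src_len, hidden_size)
h_t = torch.zeros(batch_size, hidden_size)
context = torch.zeros(batch_size, hidden_size)   # zero context at the start
y_t = torch.tensor([1, 1])                       # hypothetical BOS index

# Embed the input word and concatenate the previous context vector.
rnn_input = torch.cat([target_embedding(y_t), context], dim=1)
# One GRU step gives the new hidden state.
h_t = gru_cell(rnn_input, h_t)
# Dot-product attention over the encoder states with h_t as the query.
scores = torch.bmm(encoder_state, h_t.unsqueeze(2)).squeeze(2)
p_attn = torch.softmax(scores, dim=1)
context = torch.bmm(p_attn.unsqueeze(1), encoder_state).squeeze(1)
# Classify from the concatenated [context; hidden] vector.
logits = classifier(torch.cat([context, h_t], dim=1))

print(tuple(rnn_input.shape), tuple(p_attn.shape), tuple(logits.shape))
```

The context vector computed at the end of one step is the one concatenated to the embedded input at the start of the next.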

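4. The Attention Mechanism in Detail

The attention mechanism plays a key role in neural machine translation. In this example, the decoder's hidden state serves as the query vector, and the encoder's state vectors serve as both the key vectors and the value vectors. The weights are computed with a dot-product scoring function, implemented as follows: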

import torch
import torch.nn.functional as F


def verbose_attention(encoder_state_vectors, query_vector):
    """
    encoder_state_vectors: 3dim tensor from bi-GRU in encoder
    query_vector: hidden state in decoder GRU
    """
    batch_size, num_vectors, vector_size = encoder_state_vectors.size()
    vector_scores = \
        torch.sum(encoder_state_vectors * query_vector.view(batch_size, 1,
                                                            vector_size), 
                  dim=2)
    vector_probabilities = F.softmax(vector_scores, dim=1)
    weighted_vectors = \
        encoder_state_vectors * vector_probabilities.view(batch_size, 
                                                          num_vectors, 1)
    context_vectors = torch.sum(weighted_vectors, dim=1)
    return context_vectors, vector_probabilities

def terse_attention(encoder_state_vectors, query_vector):
    """
    encoder_state_vectors: 3dim tensor from bi-GRU in encoder
    query_vector: hidden state
    """
    # Squeeze only the trailing singleton dimension so that a batch of
    # size 1 keeps its batch dimension.
    vector_scores = torch.matmul(encoder_state_vectors,
                                 query_vector.unsqueeze(dim=2)).squeeze(dim=2)
    vector_probabilities = F.softmax(vector_scores, dim=-1)
    context_vectors = torch.matmul(encoder_state_vectors.transpose(-2, -1), 
                                   vector_probabilities.unsqueeze(dim=2)).squeeze(dim=2)
    return context_vectors, vector_probabilities

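The attention mechanism works as follows:
1. Compute the dot product of the decoder hidden state with each encoder state vector, giving one scalar per item in the encoded sequence.
2. Use the softmax function to turn these scalars into a probability distribution over the encoder state vectors.
3. Use these probabilities to compute a weighted sum of the encoder state vectors, giving a single vector per batch item.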
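The three steps can be run on random tensors and cross-checked against an independent formulation. The sketch below writes the computation once with explicit broadcasting and once with `torch.einsum`. The sizes are arbitrary and the tensors are random stand-ins for real encoder and decoder states.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, num_vectors, vector_size = 2, 3, 4
encoder_states = torch.randn(batch_size, num_vectors, vector_size)
query = torch.randn(batch_size, vector_size)

# Step 1: one dot-product score per encoder position -> (batch, num_vectors).
scores = torch.sum(encoder_states * query.unsqueeze(1), dim=2)
# Step 2: normalize the scores into a distribution over positions.
probs = F.softmax(scores, dim=1)
# Step 3: probability-weighted sum of encoder states -> (batch, vector_size).
context = torch.sum(encoder_states * probs.unsqueeze(2), dim=1)

# The same computation written with einsum, as a cross-check.
scores_e = torch.einsum("bnv,bv->bn", encoder_states, query)
context_e = torch.einsum("bn,bnv->bv",
                         F.softmax(scores_e, dim=1), encoder_states)

print(probs.sum(dim=1))
print(torch.allclose(context, context_e, atol=1e-6))
```

Each row of `probs` sums to one, so the context vector is a convex combination of the encoder states for that batch item.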

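5. Learning to Search and Scheduled Sampling

The model so far assumes that the target sequence is provided as input at every decoder time step. At test time, however, the model cannot know in advance the sequence it is supposed to generate. To address this, techniques such as "learning to search" and "scheduled sampling" let the model use its own predictions during training.

The decoder code is modified as follows: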

class NMTDecoder(nn.Module):
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size, bos_index):
        super(NMTDecoder, self).__init__()
        # ... other init code here ...
        # arbitrarily set; any small constant will be fine
        self._sampling_temperature = 3

    def forward(self, encoder_state, initial_hidden_state, target_sequence, 
               sample_probability=0.0):
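        if target_sequence is None:
            # No targets at inference time: always feed back the model's own
            # predictions. NOTE: output_sequence_size must then be set some
            # other way (for example a maximum decoding length); the source
            # does not show this.
            sample_probability = 1.0
        else:
            target_sequence = target_sequence.permute(1, 0)
            output_sequence_size = target_sequence.size(0)

        # ... nothing changes from the other implementation
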
        for i in range(output_sequence_size):
            use_sample = np.random.random() < sample_probability
            if not use_sample:
                y_t_index = target_sequence[i]
            # Step 1: Embed word and concat with previous context 
            # ... code omitted for space
            # Step 2: Make a GRU step, getting a new hidden vector
            # ... code omitted for space
            # Step 3: Use current hidden vector to attend to the encoder state
            # ... code omitted for space
            # Step 4: Use current hidden and context vectors
            #         to make a prediction about the next word
            prediction_vector = torch.cat((context_vectors, h_t), dim=1)
            score_for_y_t_index = self.classifier(prediction_vector)
            if use_sample:
                p_y_t_index = F.softmax(score_for_y_t_index *
                                        self._sampling_temperature, dim=1)
                # method 1: choose most likely word
                # _, y_t_index = torch.max(p_y_t_index, 1)
                # method 2: sample from the distribution
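                y_t_index = torch.multinomial(p_y_t_index, 1).squeeze(dim=1)
            # ... collect score_for_y_t_index in output_vectors and return the
            #     stacked result, as in the previous implementation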

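The main changes are:
1. The initial index is made explicitly the BEGIN-OF-SEQUENCE token index.
2. At each step of the generation loop a random number is drawn; if it is smaller than the sampling probability, the model's own prediction is used as input for that iteration.
3. The actual sampling happens under the if use_sample condition.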
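The choice between the gold token and a sampled token can be isolated in a small helper. The sketch below is a simplification with invented scores; `choose_next_input` and `sharpening` are names made up for the example. Note that the decoder multiplies the scores by its "temperature" constant, so a value above 1 sharpens the distribution instead of flattening it.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Invented decoder scores for a batch of 2 over a 4-word vocabulary.
scores = torch.tensor([[2.0, 1.0, 0.5, 0.1],
                       [0.2, 0.1, 3.0, 0.4]])
sharpening = 3          # plays the role of _sampling_temperature

p_plain = F.softmax(scores, dim=1)
p_sharp = F.softmax(scores * sharpening, dim=1)


def choose_next_input(gold_index, scores, sample_probability):
    """Return the gold tokens, or tokens sampled from the model's own
    (sharpened) distribution with probability sample_probability."""
    if torch.rand(1).item() < sample_probability:
        p = F.softmax(scores * sharpening, dim=1)
        return torch.multinomial(p, 1).squeeze(dim=1)
    return gold_index


gold = torch.tensor([0, 2])
always_gold = choose_next_input(gold, scores, sample_probability=0.0)
always_model = choose_next_input(gold, scores, sample_probability=1.0)

print(p_plain.max(dim=1).values, p_sharp.max(dim=1).values)
print(always_gold.tolist(), always_model.tolist())
```

Setting the probability to 0.0 reproduces plain teacher forcing, and 1.0 corresponds to the inference case where no target sequence is available.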

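With the steps above, we have implemented an attention-based neural machine translation model and used learning to search and scheduled sampling to improve its performance at test time.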

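6. Training Procedure and Result Analysis

Once the model is built, the next step is to train it and analyze the results. Training mainly consists of defining the loss function and the optimizer, then iterating over the training data to update the model parameters.

6.1 Loss Function and Optimizer

Cross-entropy loss is typically used to measure the difference between the model's predictions and the true targets. The optimizer can be stochastic gradient descent (SGD), Adam, or similar. A simple example: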

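import torch.nn as nn
import torch.optim as optim

# Define the model. NMTModel (the wrapper around NMTEncoder and NMTDecoder),
# the size hyperparameters, learning_rate, and device are assumed to be
# defined elsewhere; they are not shown here.
model = NMTModel(source_vocab_size, source_embedding_size, 
                 target_vocab_size, target_embedding_size, encoding_size, 
                 target_bos_index)
model = model.to(device)

# Define the loss function; padded positions (mask_index) are ignored
criterion = nn.CrossEntropyLoss(ignore_index=target_vocab.mask_index)

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)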

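6.2 The Training Loop

The main steps of the training loop are:
1. Get a minibatch from the dataset.
2. Feed the data through the model in a forward pass to obtain predictions.
3. Compute the loss between the predictions and the true targets.
4. Backpropagate to compute the gradients.
5. Update the model parameters with the optimizer.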

for epoch in range(num_epochs):
    running_loss = 0.0
    num_batches = 0
    for batch in generate_nmt_batches(dataset, batch_size, shuffle=True,
                                      drop_last=True, device=device):
        # generate_nmt_batches has already moved every tensor to `device`
        source_vector = batch["source_vector"]
        source_length = batch["source_length"]
        target_x_vector = batch["target_x_vector"]
        target_y_vector = batch["target_y_vector"]

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        decoded_states = model(source_vector, source_length, target_x_vector)

        # Compute the loss over all (batch, time step) positions
        loss = criterion(decoded_states.view(-1, target_vocab_size),
                         target_y_vector.view(-1))

        # Backward pass
        loss.backward()

        # Update the parameters
        optimizer.step()

        running_loss += loss.item()
        num_batches += 1

    # Average over the batches actually seen in this epoch
    print(f'Epoch {epoch + 1}, Loss: {running_loss / max(num_batches, 1)}')
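The flattening in the loss computation and the effect of `ignore_index` can be verified on a toy tensor. The sketch below uses random logits and hand-written targets, with 0 assumed to be the mask index, and compares the built-in result with a manual average over the non-padding positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

mask_index = 0                                   # assumed padding index
criterion = nn.CrossEntropyLoss(ignore_index=mask_index)

batch_size, seq_len, vocab_size = 2, 3, 5
logits = torch.randn(batch_size, seq_len, vocab_size)
targets = torch.tensor([[4, 2, 0],               # last position is padding
                        [3, 0, 0]])              # last two are padding

# (batch, seq, vocab) -> (batch * seq, vocab); (batch, seq) -> (batch * seq,)
flat_logits = logits.reshape(-1, vocab_size)
flat_targets = targets.reshape(-1)
loss = criterion(flat_logits, flat_targets)

# Manual reference: per-token losses averaged over real tokens only.
per_token = F.cross_entropy(flat_logits, flat_targets, reduction="none")
keep = flat_targets != mask_index
manual = per_token[keep].mean()

print(loss.item(), manual.item())
```

Only the three real tokens contribute, so padding a batch to a longer maximum length does not change the loss.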

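6.3 Result Analysis

After training, the model's performance needs to be evaluated. The metrics discussed earlier, such as BLEU and perplexity, can be used. Inspecting the model's translations by hand also gives a direct sense of its quality.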

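7. Directions for Improvement

The current model already works to some extent, but there is still plenty of room for improvement. Some possible directions are listed below.

7.1 Model Architecture

- Try a different recurrent unit: for example a long short-term memory network (LSTM), which may handle dependencies in long sequences better.
- Introduce multi-head attention: multi-head attention lets the model attend to different information in different representation subspaces, increasing its expressive power.

7.2 Data Augmentation

- Add training data: training on more English-French sentence pairs can improve the model's generalization.
- Perturb the data: apply random perturbations to the training data, such as randomly replacing or inserting words, to increase its diversity.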
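As a rough idea of what such perturbations could look like, the sketch below replaces or inserts one random word in a tokenized sentence. The function names and the tiny vocabulary are invented for the example. Perturbing one side of a sentence pair can also break its correspondence with the translation, so this is an illustration of the mechanics, not a recommended recipe.

```python
import random


def replace_random_token(tokens, vocabulary, rng):
    """Return a copy of tokens with one position replaced by a random word."""
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(vocabulary)
    return out


def insert_random_token(tokens, vocabulary, rng):
    """Return a copy of tokens with one random word inserted somewhere."""
    out = list(tokens)
    out.insert(rng.randrange(len(out) + 1), rng.choice(vocabulary))
    return out


rng = random.Random(13)                  # fixed seed for repeatable runs
vocabulary = ["very", "quite", "really", "not"]
tokens = "i am happy".split()

replaced = replace_random_token(tokens, vocabulary, rng)
inserted = insert_random_token(tokens, vocabulary, rng)

print(replaced)
print(inserted)
```

Both helpers copy the input list, so the original training example is left untouched and can still be used as is.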

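7.3 Hyperparameter Tuning

- Tune the learning rate: a suitable learning rate helps the model converge faster and more stably.
- Tune the batch size: different batch sizes can affect both training speed and model performance.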

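8. Summary

This article covered evaluation metrics for sequence generation models, including n-gram overlap metrics (such as BLEU) and perplexity, and discussed their strengths and weaknesses. Through a neural machine translation example, it then showed how to build a model on the encoder-decoder architecture: preprocessing the dataset, building the vectorization pipeline, encoding and decoding, and applying the attention mechanism. It also introduced "learning to search" and "scheduled sampling" as ways to improve the model's performance at test time. Finally, it discussed the training procedure, result analysis, and possible directions for improvement.

After working through this article, readers should understand the basic methods for evaluating and implementing sequence generation models and how to apply them to practical machine translation problems. In practice, the model can be optimized and improved further according to the specific requirements and data at hand.

The following simple flowchart shows the overall flow of the neural machine translation model: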

graph LR
    A[Data preprocessing] --> B[Vectorization]
    B --> C[Encoder]
    C --> D[Decoder]
    D --> E[Attention mechanism]
    E --> F[Predictions]
    F --> G[Compute loss]
    G --> H[Backpropagation]
    H --> I[Update parameters]
    I --> C

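To show the model's components and flow more clearly, the table below summarizes them:

| Step | Description | Implementation |
| ---- | ---- | ---- |
| Data preprocessing | Lowercase the sentences, tokenize them, and select a subset of the data | Described in Section 2.1 |
| Vectorization | Vectorize the source and target sequences | NMTVectorizer class |
| Encoder | Encode the source sequence with a bi-GRU | NMTEncoder class |
| Decoder | Start from the encoder's hidden state and generate the target sequence using attention | NMTDecoder class |
| Attention mechanism | Compute attention weights between the decoder hidden state and the encoder state vectors | verbose_attention and terse_attention functions |
| Training | Define the loss function and optimizer, then iterate over the training data to update the parameters | Training loop code |
| Directions for improvement | Model architecture, data augmentation, hyperparameter tuning | See Section 7 |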