Fine-tuning a BERT Model for Arbitrarily Long Texts

AI大模型_学习君

Last modified: 2025-01-15 12:11:02

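License: CC 4.0 BY-SA

Tags: bert, artificial intelligence, deep learning, large AI models, introduction to large models, large-model fine-tuning, large-model applications

First published: 2025-01-15 12:10:50

Article link: https://blog.youkuaiyun.com/python12345678_/article/details/145157669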

BERT For Longer Texts: Fine-tuning a BERT Model for Arbitrarily Long Texts

Reference blog posts:

Part 1: https://www.mim.ai/fine-tuning-bert-model-for-arbitrarily-long-texts-part-1/
Part 2: https://www.mim.ai/fine-tuning-bert-model-for-arbitrarily-long-texts-part-2/
BELT (BERT For Longer Texts) code: https://github.com/mim-solutions/bert_for_longer_texts

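Models based on the transformer architecture have become the state-of-the-art solution in NLP. The word "transformer" is indeed what the letter "T" stands for in the names of the famous BERT, GPT-3, and the now hugely popular ChatGPT. A common obstacle when applying these models is the limit on input length. For example, the BERT model cannot process texts longer than 512 tokens (roughly speaking, one token corresponds to one word).

Jacob Devlin (one of the authors of BERT) proposed a way to overcome this problem in a discussion. In this article, we describe in detail how to modify the process of fine-tuning a pre-trained BERT model for a classification task. The code is available as open source in the BELT repository linked above.

1. Overview of BERT classification

Let us first describe the three stages in the life cycle of a BERT classifier model:

Model pre-training.
Model fine-tuning.
Model application.

1.1 Model pre-training

In the first stage, BERT is pre-trained on a large amount of data in a self-supervised way. That is, the training data consists of raw text only, with no human labeling. The model is trained on two objectives: guessing the masked words in a sentence and predicting whether one sentence follows another.

Note that both tasks look only at individual sentences rather than at the whole context. Longer texts therefore do not need to be truncated. Although the book In Search of Lost Time has more than 1.2 million words, it can be used during pre-training; it is simply processed sentence by sentence.

We can load the pre-trained base BERT model with the transformers library: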

from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")

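The warning tells us that the downloaded model must be fine-tuned on a downstream task (in our case, binary classification of sequences). This step is described in the next subsection.
We obtain the tokenizer in a similar way: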

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer

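Note the parameter model_max_length=512 listed in the output above. This is the main obstacle we address in this article. Indeed, applying the model without modification simply truncates every text to 512 tokens. All information and context in the rest of the document is discarded in both the fine-tuning and prediction stages.

The most direct and natural idea is to split the text into smaller chunks and feed them to the model separately. This is our strategy; however, as we will see, the devil is in the details.

1.2 Model fine-tuning

Clearly, having read many books and the whole of Wikipedia, the downloaded pre-trained model is knowledgeable. However, its knowledge is very general.
Suppose we only need to predict whether a movie review is positive or negative; the model's vast and sophisticated wisdom about quantum mechanics and Proust is of little use here. More importantly, we need to adapt the model to our specific task of binary sequence classification.
For this we use supervised learning. More precisely, we prepare a training set of reviews manually labeled as positive or negative, add an extra classification layer on top of the model, and train the model on this set.
Modifying the fine-tuning step so that it looks at the whole text and not just the first 512 tokens is not trivial and is described in detail later.

1.3 Model application

The final stage is to apply the trained model to new data and obtain classifications.

2. Using a fine-tuned classifier on longer texts

It is instructive to first describe the more straightforward procedure: modifying an already fine-tuned BERT classifier so that it can be applied to longer texts. This section is largely based on the excellent tutorial article "How to Apply Transformers to Any Length of Text".

The main difference in our approach is that we allow the text chunks to overlap.

2.1 Finding a long review

Next, we consider the well-known IMDB dataset of movie reviews. We want to classify the reviews by sentiment, that is, as positive or negative.

After some basic exploration, we load the dataset from huggingface and find a very long review of David Lynch's Mulholland Drive: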

from datasets import load_dataset

imdb = load_dataset("imdb")

long_review = imdb["test"]["text"][21132]
number_of_words = len(long_review.split())
#print(f"The review: {long_review.split()}")
print(f"The review has {number_of_words} words.")

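We can see that the review is quite elaborate and contains 2278 words. We want to split it into chunks small enough to fit within BERT's input limit of 512 tokens.

2.1 Loading an already fine-tuned BERT classifier

In this section, we assume that we already have a fine-tuned BERT classifier. Let us download a classifier trained on the IMDB dataset from huggingface: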

from transformers import BertForSequenceClassification, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('fabriceyhc/bert-base-uncased-imdb')
model = BertForSequenceClassification.from_pretrained('fabriceyhc/bert-base-uncased-imdb')

2.2 Tokenizing the whole text

Now we tokenize the whole review:

tokens = tokenizer(long_review, add_special_tokens=False, truncation=False, return_tensors="pt")

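Note the following:

We set add_special_tokens to False because we will add the special tokens at the beginning and end manually after the splitting procedure.
We set truncation to False because we do not want to discard any part of the text.
We set return_tensors to "pt" to get the result as a torch Tensor.

The warning tells us that the tokenized sequence is too long (tokenization yields 3155 tokens, considerably more than the number of words). If we simply put such a tensor into the model, it will not work.
Indeed, let us try: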

try:
    prediction = model(**tokens)
except RuntimeError as e:
    print(e)

2.3 What are tokens?

Now let us look at what exactly these tokens are.

example = ["the man went to the store and bought a gallon of milk"]
tokens = tokenizer(example, add_special_tokens=False, truncation=False, return_tensors="pt")
tokens

example = ["the man went to the store and bought a gallon of milk"]
tokens = tokenizer(example, add_special_tokens=True, truncation=False, return_tensors="pt")
tokens

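We can see that the tokenized text is equivalent to a Python dictionary with the following keys:

input_ids: this is the crucial part; it encodes the tokens as integers. It can also contain special tokens that mark the beginning (value 101) and the end (value 102) of the text. We will add them manually after the splitting procedure.
token_type_ids: this binary tensor is used in some specific applications of BERT to separate the question from the answer. Because we are only interested in classification, we can ignore it.
attention_mask: this binary tensor marks real tokens with 1 and padding positions with 0. Later we will manually append zeros there so that every chunk has the required size of 512.

2.4 Splitting the tokens

To fit the tokens into the model, we need to split them into chunks of 512 tokens or fewer. However, we also need to put 2 special tokens at the beginning and end, so the upper bound is 510.

Three parameters determine the splitting process: chunk_size, stride, and minimal_chunk_length. Their meaning is as follows:

The parameter chunk_size defines the length of each chunk. More precisely, it may be impossible to split the tokens into equal parts, so the last chunks may be shorter than chunk_size.
The parameter stride sets how far we move along the list of tokens between consecutive chunks (analogous to the stride in convolutional neural networks). A stride smaller than chunk_size makes the chunks overlap.
The parameter minimal_chunk_length sets the minimum size of a chunk. As already mentioned, after splitting the list of tokens we may be left with some remainder at the end that is too small to contain any meaningful information; such chunks are discarded.

For clarity, we demonstrate the procedure on a few examples: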

from torch import Tensor

def split_overlapping(tensor: Tensor, chunk_size: int, stride: int, minimal_chunk_length: int) -> list[Tensor]:
    """Helper function for dividing 1-dimensional tensors into overlapping chunks."""
    result = [tensor[i : i + chunk_size] for i in range(0, len(tensor), stride)]
    if len(result) > 1:
        # ignore chunks with less than minimal_length number of tokens
        result = [x for x in result if len(x) >= minimal_chunk_length]
    return result

example_tensor = tokens["input_ids"][0]
example_tensor

splitted = split_overlapping(example_tensor, chunk_size=5, stride=5, minimal_chunk_length=5)
splitted

splitted = split_overlapping(example_tensor, chunk_size=5, stride=3, minimal_chunk_length=5)
splitted

splitted = split_overlapping(example_tensor, chunk_size=5, stride=3, minimal_chunk_length=3)
splitted
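
To see the effect of the three parameters without loading a tokenizer, the same slicing logic can be run on a plain Python list. This is only a minimal sketch: the name split_overlapping_list and the 13 integer stand-in tokens are illustrative assumptions, not part of the BELT code.

```python
def split_overlapping_list(
    tokens: list[int], chunk_size: int, stride: int, minimal_chunk_length: int
) -> list[list[int]]:
    """List-based mirror of split_overlapping, for illustration only."""
    result = [tokens[i : i + chunk_size] for i in range(0, len(tokens), stride)]
    if len(result) > 1:
        # drop chunks shorter than minimal_chunk_length (only when there is more than one chunk)
        result = [chunk for chunk in result if len(chunk) >= minimal_chunk_length]
    return result


stand_in_tokens = list(range(13))  # 13 hypothetical token ids: 0, 1, ..., 12

# stride == chunk_size: no overlap; the 3-token remainder [10, 11, 12] is dropped
no_overlap = split_overlapping_list(stand_in_tokens, chunk_size=5, stride=5, minimal_chunk_length=5)

# stride < chunk_size: consecutive chunks share chunk_size - stride = 2 tokens
overlap_strict = split_overlapping_list(stand_in_tokens, chunk_size=5, stride=3, minimal_chunk_length=5)

# a lower minimal length additionally keeps the shorter chunk [9, 10, 11, 12]
overlap_relaxed = split_overlapping_list(stand_in_tokens, chunk_size=5, stride=3, minimal_chunk_length=3)

print(no_overlap)
print(overlap_strict)
print(overlap_relaxed)
```

With stride=3 the windows start at positions 0, 3, 6, 9 and 12; which of the trailing short windows survive depends only on minimal_chunk_length.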

2.5 Adding special tokens

After splitting into smaller chunks, we must add the special tokens at the beginning and end:

def add_special_tokens_at_beginning_and_end(input_id_chunks: list[Tensor], mask_chunks: list[Tensor]) -> None:
    """
    Adds special CLS token (token id = 101) at the beginning.
    Adds SEP token (token id = 102) at the end of each chunk.
    Adds corresponding attention masks equal to 1 (attention mask is boolean).
    """
    for i in range(len(input_id_chunks)):
        # adding CLS (token id 101) and SEP (token id 102) tokens
        input_id_chunks[i] = torch.cat([Tensor([101]), input_id_chunks[i], Tensor([102])])
        # adding attention masks  corresponding to special tokens
        mask_chunks[i] = torch.cat([Tensor([1]), mask_chunks[i], Tensor([1])])

Next, we must add padding tokens to make sure that every chunk has exactly 512 tokens:

def add_padding_tokens(input_id_chunks: list[Tensor], mask_chunks: list[Tensor]) -> None:
    """Adds padding tokens (token id = 0) at the end to make sure that all chunks have exactly 512 tokens."""
    for i in range(len(input_id_chunks)):
        # get required padding length
        pad_len = 512 - input_id_chunks[i].shape[0]
        # check if tensor length satisfies required chunk size
        if pad_len > 0:
            # if padding length is more than 0, we must add padding
            input_id_chunks[i] = torch.cat([input_id_chunks[i], Tensor([0] * pad_len)])
            mask_chunks[i] = torch.cat([mask_chunks[i], Tensor([0] * pad_len)])
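
The two steps above (wrapping a chunk in CLS/SEP and padding it to 512) can be traced on plain lists. The following is a rough sketch, not the BELT implementation: the helper name wrap_and_pad and the 95 made-up token ids are assumptions; the ids 101, 102 and 0 are the ones quoted in the text.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids quoted in the text for bert-base-uncased
MODEL_INPUT_LENGTH = 512


def wrap_and_pad(chunk: list[int]) -> tuple[list[int], list[int]]:
    """Return (input_ids, attention_mask), both of length 512, for one chunk of at most 510 ids."""
    if len(chunk) > MODEL_INPUT_LENGTH - 2:
        raise ValueError("chunk must leave room for the CLS and SEP tokens")
    input_ids = [CLS_ID] + chunk + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens, including CLS and SEP
    pad_len = MODEL_INPUT_LENGTH - len(input_ids)
    input_ids += [PAD_ID] * pad_len
    attention_mask += [0] * pad_len  # 0 marks padding positions
    return input_ids, attention_mask


ids, mask = wrap_and_pad([2000 + i for i in range(95)])  # 95 made-up token ids
print(len(ids), len(mask), sum(mask))
```

For a 95-token chunk, 97 positions of the mask are set to 1 (the chunk plus the two special tokens) and the remaining 415 are padding.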

from transformers import PreTrainedTokenizerBase
from typing import Any, Optional, Union

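2.6 Stacking the tensors

After applying this procedure to a single text, input_ids is a list of K tensors of size 512, where K is the number of chunks. To feed it to the BERT model, we must stack these K tensors into a single tensor of size K x 512 and make sure the tensor values have the appropriate type: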

def stack_tokens_from_all_chunks(input_id_chunks: list[Tensor], mask_chunks: list[Tensor]) -> tuple[Tensor, Tensor]:
    """Reshapes data to a form compatible with BERT model input."""
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)

    return input_ids.long(), attention_mask.int()

2.7 Wrapping it into one function

For convenience, we can wrap all the previous steps into a single function:

from typing import Optional
from transformers import BatchEncoding, PreTrainedTokenizerBase
# reference: https://github.com/mim-solutions/bert_for_longer_texts/blob/main/belt_nlp/splitting.py#L67

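def tokenize_whole_text(text: str, tokenizer: PreTrainedTokenizerBase) -> BatchEncoding:
    """Tokenizes the entire text without truncation and without special tokens."""
    tokens = tokenizer(text, add_special_tokens=False, truncation=False, return_tensors="pt")
    return tokens

# transform_single_text below calls this helper, but the listing never defines it.
# This is a minimal sketch that mirrors tokenize_whole_text (see the referenced splitting.py).
def tokenize_text_with_truncation(
    text: str, tokenizer: PreTrainedTokenizerBase, maximal_text_length: int
) -> BatchEncoding:
    """Tokenizes the text with truncation to maximal_text_length and without special tokens."""
    tokens = tokenizer(
        text, add_special_tokens=False, max_length=maximal_text_length, truncation=True, return_tensors="pt"
    )
    return tokens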

def split_tokens_into_smaller_chunks(
    tokens: BatchEncoding,
    chunk_size: int,
    stride: int,
    minimal_chunk_length: int,
) -> tuple[list[Tensor], list[Tensor]]:
    """Splits tokens into overlapping chunks with given size and stride."""
    input_id_chunks = split_overlapping(tokens["input_ids"][0], chunk_size, stride, minimal_chunk_length)
    mask_chunks = split_overlapping(tokens["attention_mask"][0], chunk_size, stride, minimal_chunk_length)
    return input_id_chunks, mask_chunks
    
def transform_single_text(
    text: str,
    tokenizer: PreTrainedTokenizerBase,
    chunk_size: int,
    stride: int,
    minimal_chunk_length: int,
    maximal_text_length: Optional[int],
) -> tuple[Tensor, Tensor]:
    """Transforms (the entire) text to model input of BERT model."""
    if maximal_text_length:
        tokens = tokenize_text_with_truncation(text, tokenizer, maximal_text_length)
    else:
        tokens = tokenize_whole_text(text, tokenizer)
    input_id_chunks, mask_chunks = split_tokens_into_smaller_chunks(tokens, chunk_size, stride, minimal_chunk_length)
    add_special_tokens_at_beginning_and_end(input_id_chunks, mask_chunks)
    add_padding_tokens(input_id_chunks, mask_chunks)
    input_ids, attention_mask = stack_tokens_from_all_chunks(input_id_chunks, mask_chunks)
    return input_ids, attention_mask

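2.8 Processing the selected long review

Now let us combine all the steps above and apply them to our example long review. We use the parameters chunk_size=510, stride=510, and minimal_chunk_length=1, which means we simply split the text into non-overlapping parts: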

input_ids, attention_mask = transform_single_text(long_review, tokenizer, 510, 510, 1, None)

input_ids, attention_mask

input_ids.shape

So the review was split into 7 parts.
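
The count of 7 chunks follows directly from the token count reported in section 2.2. A quick arithmetic check, assuming the 3155 tokens quoted there and non-overlapping chunks of 510 tokens (the variable names are illustrative):

```python
import math

n_tokens = 3155   # token count reported for the long review
chunk_size = 510  # 512 minus the CLS and SEP tokens
stride = 510      # equal to chunk_size, so chunks do not overlap

n_chunks = math.ceil(n_tokens / stride)
last_chunk_length = n_tokens - (n_chunks - 1) * stride

print(n_chunks)           # 7
print(last_chunk_length)  # 95
```

The last chunk holds only 95 tokens; it is kept because minimal_chunk_length=1, and padding brings it up to 512 positions like the other six.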

2.9 Using the fine-tuned model on the prepared data

The prepared data is now ready to be plugged into our fine-tuned classifier:

model_output = model(input_ids, attention_mask)
model_output

probs = torch.nn.functional.softmax(model_output[0], dim=-1)
probs

probabilities = probs[:,1]
probabilities

probabilities.mean()

probabilities.max()

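Let us summarize:

The fine-tuned model returned logits for each chunk.
We applied the softmax function and slicing to obtain the probability that the review is positive.
We obtained the list of probabilities, one per chunk: [0.9997, 0.9996, 0.5399, 0.9994, 0.9995, 0.9975, 0.9987].
Finally, we can apply a pooling function (mean or maximum) to obtain a single aggregated probability for the whole review.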
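
The pooling step can be reproduced by hand from the seven probabilities listed above. The sketch below uses only the standard library; the softmax helper and the example logits are illustrative assumptions, while the probabilities are the ones reported in the text.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Per-chunk probabilities of the positive class, as listed in the summary.
chunk_probabilities = [0.9997, 0.9996, 0.5399, 0.9994, 0.9995, 0.9975, 0.9987]

mean_pooled = sum(chunk_probabilities) / len(chunk_probabilities)
max_pooled = max(chunk_probabilities)

print(round(mean_pooled, 4))  # 0.9335
print(max_pooled)             # 0.9997

# softmax turns a pair of logits [negative, positive] into two probabilities that sum to 1
example = softmax([-1.0, 1.0])  # made-up logits for one chunk
print(example)
```

Mean pooling lets the one uncertain chunk (0.5399) pull the aggregate down to about 0.93, while max pooling ignores it entirely; which behavior is preferable depends on the task.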

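2.10 Conclusion

In this part, I showed how to use an already fine-tuned BERT on texts of arbitrary length. But what should we do when we want to fine-tune it ourselves? I answer this question in the second part of the series, which follows as section 3.

3. Fine-tuning a pre-trained BERT on longer texts

Now it is time to address a problem with the previous approach. We were lucky to find a model already fine-tuned on the IMDB dataset. More often, however, we have a labeled dataset and need to fine-tune the classifier ourselves. In that case, we use a supervised approach: we download a general pre-trained model, put a classification head on top, and train it on the labeled data.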

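3.1 Three roads to take

The process of fine-tuning a classifier model is described in detail in the excellent huggingface tutorial.
Let us briefly summarize the main steps. First, we tokenize the texts. Again, the standard approach is to truncate every text to 512 tokens. After this preprocessing stage, there are three ways to fine-tune the model:

Use a black-box approach through the huggingface Trainer API.
Use the TensorFlow framework with Keras.
Train the model in native PyTorch.

We follow the last approach because it is the most explicit and lets us adapt the procedure to our needs. The main goal is to modify the procedure so that longer texts are not truncated.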

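3.2 The main idea

The problem of using and fine-tuning BERT on longer texts has been raised in a public discussion. The main idea for solving it is described in a comment by Jacob Devlin, one of the authors of BERT.

Let us highlight the following part of that comment:

So from the perspective of BertModel, this is a 3×6 minibatch

It tells us the key difference between what we did in the previous section to apply the model and what we need to do now.

Recall that to apply the fine-tuned classifier to a single long text, we first tokenize the whole sequence, then split it into chunks, get the model prediction for each chunk, and compute the mean/max of the predictions. There is no problem in doing this sequentially, that is:

Put the 1st chunk into the model and get the 1st prediction.
Put the 2nd chunk into the model and get the 2nd prediction.
And so on.
Take the mean/max of these predictions and stop.

However, training on each chunk sequentially raises countless problems and questions:

Put the 1st chunk of the 1st text into the model, compute the loss between the prediction and the label...
Which label?
The whole text has only one binary label... and then maybe run backpropagation? But when?
Should we really update the model weights after every chunk?

Instead, we must do everything at once by putting all the chunks into one minibatch. This resolves all of these questions:

From the K chunks obtained for the first text, create 1 minibatch and get K predictions.
Pool the predictions with the mean/max function to obtain a single prediction for the whole text.
Compute the loss between this single prediction and the single label.
Run backpropagation. Before running loss.backward(), make sure that all tensor operations are performed on tensors with attached gradients.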
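
The order of operations in the list above (pool first, then compute one loss against one label) can be illustrated with plain arithmetic. This is only a sketch under stated assumptions: the function names and the three chunk predictions are made up, binary cross-entropy is used as the loss, and real training would perform the same steps on torch tensors so that gradients can flow.

```python
import math


def binary_cross_entropy(probability: float, label: int) -> float:
    """Binary cross-entropy between one predicted probability and one binary label."""
    eps = 1e-12  # guards against log(0)
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def loss_for_one_text(chunk_probabilities: list[float], label: int, pooling: str = "mean") -> float:
    """Pool the K chunk predictions first, then compare with the single label."""
    if pooling == "mean":
        pooled = sum(chunk_probabilities) / len(chunk_probabilities)
    elif pooling == "max":
        pooled = max(chunk_probabilities)
    else:
        raise ValueError("Unknown pooling strategy!")
    return binary_cross_entropy(pooled, label)


# K = 3 made-up chunk predictions for one text labeled positive (1)
loss_mean = loss_for_one_text([0.9, 0.2, 0.4], label=1, pooling="mean")
loss_max = loss_for_one_text([0.9, 0.2, 0.4], label=1, pooling="max")
print(loss_mean, loss_max)
```

There is exactly one loss value per text regardless of K, which is what removes the "which label?" question raised for the sequential approach.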

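3.3 Usual fine-tuning

We now outline how to modify the procedure from the tutorial. The basic steps of fine-tuning are:

Tokenize the texts of the training set with truncation. Roughly speaking, the tokenized set is a dictionary with the keys input_ids and attention_mask whose values are tensors of size exactly 512.
Create a Dataloader object with the selected batch_size. This lets us iterate over batches of data. Say batch_size=N.
During the training loop for batch in train_dataloader, we obtain the object batch. Here batch is again a dictionary with the keys input_ids and attention_mask, but this time its values are stacked tensors of size N x 512.
Put each loaded batch into the model with outputs = model(**batch), compute the loss with loss = outputs.loss, and run backpropagation with loss.backward().

Next, we describe how to change each of these stages.

3.4 Tokenization with splitting

Recall the function transform_single_text that we used to tokenize a single text. Now we want a vectorized version that handles a list of texts; it is the function transform_list_of_texts defined in the code below.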

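As always, it is instructive to work through an example. Let us look at the result of the splitting for one short review and one long review, and compare it with the usual truncation approach: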

from datasets import load_dataset

imdb = load_dataset("imdb")

long_review = imdb["test"]["text"][21132]
number_of_words = len(long_review.split())
print(f"The review has {number_of_words} words.")

short_review = imdb["test"]["text"][0]
number_of_words = len(short_review.split())
print(f"The review has {number_of_words} words.")

def tokenize_truncated(list_of_texts):
    return tokenizer(list_of_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

def transform_list_of_texts(
    texts: list[str],
    tokenizer: PreTrainedTokenizerBase,
    chunk_size: int,
    stride: int,
    minimal_chunk_length: int,
    maximal_text_length: Optional[int] = None,
) -> BatchEncoding:
    model_inputs = [
        transform_single_text(text, tokenizer, chunk_size, stride, minimal_chunk_length, maximal_text_length)
        for text in texts
    ]
    input_ids = [model_input[0] for model_input in model_inputs]
    attention_mask = [model_input[1] for model_input in model_inputs]
    tokens = {"input_ids": input_ids, "attention_mask": attention_mask}
    return BatchEncoding(tokens)

from transformers import AutoModel
from transformers import AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens_splitted = transform_list_of_texts([short_review, long_review], tokenizer, 510, 510, 1, None)
tokens_truncated = tokenize_truncated([short_review, long_review])

First, the usual truncation approach:

type(tokens_truncated["input_ids"])
tokens_truncated['input_ids'].shape

We can see that the result is a stacked tensor of size 2 x 512. Next, let us look at the result of splitting:

type(tokens_splitted["input_ids"])
[tensor.shape for tensor in tokens_splitted['input_ids']]

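This is a list of stacked tensors of size K x 512, where K is the number of chunks of the text. Because the texts can have different lengths, K can differ from text to text, so we cannot convert this list of tensors into one stacked tensor.

The key observation here is that our tokenization returns a list of tensors of different sizes. Unfortunately, tensors of different sizes cannot be stacked together, just as two vectors of different lengths cannot be joined into a rectangular matrix (as we all remember from kindergarten).

From now on, we must be very careful not to commit what philosophers call a category error and what ordinary programmers call a type error.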
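
The shape mismatch can be made concrete without torch. The sketch below is illustrative only: chunked_shape and can_stack are hypothetical helpers, the short text length of 200 tokens is made up, and the chunk count assumes non-overlapping chunks with a minimal chunk length of 1.

```python
import math


def chunked_shape(n_tokens: int, chunk_size: int = 510, stride: int = 510) -> tuple[int, int]:
    """Shape (K, 512) of the stacked chunks for a text of n_tokens tokens."""
    k = math.ceil(n_tokens / stride)
    return (k, chunk_size + 2)  # + 2 for the CLS and SEP tokens


def can_stack(shapes: list[tuple[int, ...]]) -> bool:
    """Stacking into one tensor requires every element to have the same shape."""
    return len(set(shapes)) == 1


truncated_shapes = [(512,), (512,)]                       # each text cut or padded to 512 ids
split_shapes = [chunked_shape(200), chunked_shape(3155)]  # 200 is a made-up short length

print(can_stack(truncated_shapes))  # True
print(split_shapes)                 # [(1, 512), (7, 512)]
print(can_stack(split_shapes))      # False
```

Truncation always yields rows of identical length, so a batch stacks cleanly; splitting yields one K x 512 block per text, and K differs between the two reviews.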

3.5 Creating the dataset and the dataloader

The next step is to put the tokenized texts into a torch Dataset object. We define it as follows:

from torch.utils.data import Dataset

class TokenizedDataset(Dataset):
    """Dataset for tokens with optional labels."""

    def __init__(self, tokens: BatchEncoding, labels: Optional[list] = None):
        self.input_ids = tokens["input_ids"]
        self.attention_mask = tokens["attention_mask"]
        self.labels = labels

    def __len__(self) -> int:
        return len(self.input_ids)

    def __getitem__(self, idx: int) -> Union[tuple[Tensor, Tensor, Any], tuple[Tensor, Tensor]]:
        if self.labels:
            return self.input_ids[idx], self.attention_mask[idx], self.labels[idx]
        return self.input_ids[idx], self.attention_mask[idx]

Again, let us try it on our toy example with two reviews:

from torch.utils.data import Dataset, RandomSampler, DataLoader

dataset_truncated = TokenizedDataset(tokens_truncated, [0,1])
dataset_splitted = TokenizedDataset(tokens_splitted, [0,1])
train_dataloader_truncated = DataLoader(dataset_truncated, sampler=RandomSampler(dataset_truncated), batch_size=2)
train_dataloader_splitted = DataLoader(dataset_splitted, sampler=RandomSampler(dataset_splitted), batch_size=2)

So far so good. No errors in either case. However, let us try to use the prepared dataloaders as iterators:

for batch in train_dataloader_truncated:
    break

The truncation approach works without problems. Now let us look at the splitting approach:

try:
    for batch in train_dataloader_splitted:
        break
except RuntimeError as e:
    print(e)

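We can see that the torch Dataloader automatically stacks all tensors in a batch, which is impossible when the tensors have different sizes.

That will not do. It turns out that the default behavior of the torch Dataloader forbids input tensors of different sizes. After some googling, we found a discussion of exactly this issue.

3.6 Overriding the default dataloader

After analyzing that discussion, we decided to override the default behavior of the Dataloader by creating a custom collate_fn function. Let us look at the code again: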

from torch import Tensor

def collate_fn_pooled_tokens(data):
    input_ids = [data[i][0] for i in range(len(data))]
    attention_mask = [data[i][1] for i in range(len(data))]
    if len(data[0]) == 2:
        collated = [input_ids, attention_mask]
    else:
        labels = Tensor([data[i][2] for i in range(len(data))])
        collated = [input_ids, attention_mask, labels]
    return collated

train_dataloader_splitted = DataLoader(dataset_splitted, sampler=RandomSampler(dataset_splitted), batch_size=2, collate_fn=collate_fn_pooled_tokens)

try:
    for batch in train_dataloader_splitted:
        break
except RuntimeError as e:
    print(e)
else:
    # runs only when iterating over the dataloader raised no RuntimeError
    print("It works now!")

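The custom function collate_fn_pooled_tokens simply forces torch to treat each batch as a list of (possibly differently sized) tensors and stops it from trying to stack them.

We are finally ready to look at the training loop.

3.7 Modifying the training loop

The standard torch training loop for a classifier model looks like this:

# Full code reference: https://github.com/mim-solutions/bert_for_longer_texts/blob/main/belt_nlp/bert.py#L111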
'''
    def _train_single_epoch(self, dataloader: DataLoader, optimizer: Optimizer) -> None:
        self.neural_network.train()
        cross_entropy = BCELoss()

        for step, batch in enumerate(dataloader):
            optimizer.zero_grad()
            labels = batch[-1].float().cpu()
            predictions = self._evaluate_single_batch(batch)

            loss = cross_entropy(predictions, labels)
            loss.backward()
            optimizer.step()
'''

where the key method _evaluate_single_batch is defined as follows:

# Full code reference: https://github.com/mim-solutions/bert_for_longer_texts/blob/main/belt_nlp/bert_truncated.py#L55
'''
    def _evaluate_single_batch(self, batch: tuple[Tensor]) -> Tensor:
        batch = [t.to(self.device) for t in batch]
        model_input = batch[:2]

        predictions = self.neural_network(*model_input)
        predictions = torch.flatten(predictions).cpu()
        return predictions
'''

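Here self.neural_network is the classifier model that returns a single probability.

To adapt the training loop to the case where each batch is a list of tensors of different sizes, we need to make some adjustments:

## Full code reference: https://github.com/mim-solutions/bert_for_longer_texts/blob/main/belt_nlp/bert_with_pooling.py#L103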
'''
    def _evaluate_single_batch(self, batch: tuple[Tensor]) -> Tensor:
        input_ids = batch[0]
        attention_mask = batch[1]
        number_of_chunks = [len(x) for x in input_ids]

        # concatenate all input_ids into one batch

        input_ids_combined = []
        for x in input_ids:
            input_ids_combined.extend(x.tolist())

        input_ids_combined_tensors = torch.stack([torch.tensor(x).to(self.device) for x in input_ids_combined])

        # concatenate all attention masks into one batch

        attention_mask_combined = []
        for x in attention_mask:
            attention_mask_combined.extend(x.tolist())

        attention_mask_combined_tensors = torch.stack(
            [torch.tensor(x).to(self.device) for x in attention_mask_combined]
        )

        # get model predictions for the combined batch
        preds = self.neural_network(input_ids_combined_tensors, attention_mask_combined_tensors)

        preds = preds.flatten().cpu()

        # split result preds into chunks

        preds_split = preds.split(number_of_chunks)

        # pooling
        if self.pooling_strategy == "mean":
            pooled_preds = torch.cat([torch.mean(x).reshape(1) for x in preds_split])
        elif self.pooling_strategy == "max":
            pooled_preds = torch.cat([torch.max(x).reshape(1) for x in preds_split])
        else:
            raise ValueError("Unknown pooling strategy!")

        return pooled_preds
'''

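A few remarks:

During training, we perform essentially the same steps as during prediction. The crucial point is that all operations of type cat/stack/split/mean/max are performed on tensors with attached gradients.
For this we use the built-in torch tensor transformations. The predictions must never be converted to lists or arrays along the way; otherwise the key backpropagation command loss.backward() will not work. (Converting the input ids and attention masks with tolist() is harmless, because the inputs carry no gradients.)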
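
The bookkeeping behind preds.split(number_of_chunks) followed by pooling can be traced on plain lists. This sketch is for illustration only: the helper names and the prediction values are made up, and because it works on Python lists it could not be used for training, where the same steps must stay torch tensor operations so that gradients are preserved.

```python
def split_by_counts(values: list[float], counts: list[int]) -> list[list[float]]:
    """List analogue of Tensor.split(counts): cut a flat list into consecutive groups."""
    if sum(counts) != len(values):
        raise ValueError("counts must add up to the number of values")
    groups, start = [], 0
    for count in counts:
        groups.append(values[start : start + count])
        start += count
    return groups


def pool(groups: list[list[float]], strategy: str) -> list[float]:
    """One pooled prediction per text."""
    if strategy == "mean":
        return [sum(g) / len(g) for g in groups]
    if strategy == "max":
        return [max(g) for g in groups]
    raise ValueError("Unknown pooling strategy!")


# Two texts in one batch: the first has 1 chunk, the second has 3 (made-up predictions).
number_of_chunks = [1, 3]
flat_predictions = [0.8, 0.1, 0.3, 0.5]

per_text = split_by_counts(flat_predictions, number_of_chunks)
max_pooled = pool(per_text, "max")
mean_pooled = pool(per_text, "mean")

print(per_text)    # [[0.8], [0.1, 0.3, 0.5]]
print(max_pooled)  # [0.8, 0.5]
print(mean_pooled)
```

The model sees one combined batch of 4 chunks, and number_of_chunks is what allows the 4 predictions to be mapped back to the 2 texts and their 2 labels.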

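3.8 Conclusion

In this article, we learned how to extend the input of BERT, both when applying the model and when fine-tuning it. I invite you to check out our repository, where you can find all the code used in this tutorial.

If you have any questions, please contact me via LinkedIn.

4. A long-text classification example using bert_for_longer_texts

Because models cannot be downloaded directly from HuggingFace in mainland China, you can download all the files manually into a folder and then point the loader at that directory.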
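
Before pointing the loader at a local folder, it can help to check that the manual download is complete. The sketch below is one hypothetical way to do that: the helper name missing_model_files and the lists of expected file names are assumptions based on a typical bert-base-uncased download, not something provided by the BELT library.

```python
import tempfile
from pathlib import Path

# Assumed file names; the exact set depends on which files you downloaded
# (weights may be stored as model.safetensors or as pytorch_model.bin).
EXPECTED_FILES = ["config.json", "vocab.txt"]
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]


def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of expected files that are absent from model_dir."""
    directory = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (directory / name).is_file()]
    if not any((directory / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


# Demo on a temporary folder that contains the config and vocabulary but no weights.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "vocab.txt"):
        (Path(tmp) / name).write_text("{}")
    demo_missing = missing_model_files(tmp)

print(demo_missing)
```

An empty result means the folder contains every file on the assumed list; it does not verify that the files themselves are valid.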

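IMDB dataset download: https://huggingface.co/datasets/stanfordnlp/imdb
BERT model download: https://huggingface.co/google-bert/bert-base-uncased

## Conventional truncation strategy
import os
import torch  # needed for torch.save / torch.load below
from datasets import load_dataset
import numpy as np

# Module path as given in the source; the links in section 3.7 point to belt_nlp/bert_truncated.py,
# so adjust this import if your installed belt_nlp version uses that module name.
from belt_nlp.bert_classifier_truncated import BertClassifierTruncated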

model_path="/public/home/jialh/metaHiC/LLMs/bert_for_longer_texts/bert-base-uncased"
dataset = load_dataset("/home1/jialh/metaHiC/LLMs/bert_for_longer_texts/imdb")

print(f"dataset: {dataset}")

X_train = dataset["train"]["text"]
y_train = dataset["train"]["label"]
X_test = dataset["test"]["text"]
y_test = dataset["test"]["label"]

MODEL_PARAMS = {
    "num_labels": 2,
    "batch_size": 32,
    "learning_rate": 5e-5,
    "epochs": 3,
    "device": "cuda",
    "many_gpus": True,
    "pretrained_model_name_or_path": model_path
}

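# separate file name so that this model is not later loaded in place of the pooling model from the next script
model_file="/public/home/jialh/metaHiC/LLMs/bert_for_longer_texts/results/imdb_truncated.pt"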
if not os.path.exists(model_file):
    model = BertClassifierTruncated(**MODEL_PARAMS)
    model.fit(X_train, y_train, epochs=3)
    torch.save(model, model_file)
else:
    model = torch.load(model_file, weights_only=False)

classes = model.predict(X_test).detach().cpu()
probabilities = model.predict_scores(X_test)

accurate = sum(classes == np.array(y_test))
accuracy = accurate / len(y_test)

print(f"Test accuracy: {accuracy}")

## Using the bert_for_longer_texts strategy
import os
import torch
from datasets import load_dataset
import numpy as np

# from belt_nlp.bert_classifier_truncated import BertClassifierTruncated
from belt_nlp.bert_classifier_with_pooling import BertClassifierWithPooling

model_path="/public/home/jialh/metaHiC/LLMs/bert_for_longer_texts/bert-base-uncased"
dataset = load_dataset("/home1/jialh/metaHiC/LLMs/bert_for_longer_texts/imdb")

print(f"dataset: {dataset}")

X_train = dataset["train"]["text"]
y_train = dataset["train"]["label"]
X_test = dataset["test"]["text"]
y_test = dataset["test"]["label"]

MODEL_PARAMS = {
    "num_labels": 2,
    "batch_size": 16,
    "learning_rate": 5e-5,
    "epochs": 3,
    "chunk_size": 510,
    "stride": 510,
    "minimal_chunk_length": 510,
    "maximal_text_length": 510 * 4,
    "pooling_strategy": "mean",
    "device": "cuda",
    "many_gpus": True,
    "pretrained_model_name_or_path": model_path
}
model_file="/public/home/jialh/metaHiC/LLMs/bert_for_longer_texts/results/imdb_belt.pt"
if not os.path.exists(model_file):
    model = BertClassifierWithPooling(**MODEL_PARAMS)
    model.fit(X_train, y_train, epochs=3)
    torch.save(model, model_file)
else:
    model = torch.load(model_file, weights_only=False)

classes = model.predict(X_test).detach().cpu()
probabilities = model.predict_scores(X_test)

accurate = sum(classes == np.array(y_test))
accuracy = accurate / len(y_test)

print(f"Test accuracy: {accuracy}")
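
It is worth checking what the splitting parameters in the second MODEL_PARAMS imply for texts of different lengths. The sketch below assumes that the classifier applies the same truncate-then-split logic as the functions shown in section 2 (this is an assumption about the library's behavior); the function name chunk_lengths and the token counts 300 and 700 are illustrative.

```python
from typing import Optional


def chunk_lengths(
    n_tokens: int,
    chunk_size: int,
    stride: int,
    minimal_chunk_length: int,
    maximal_text_length: Optional[int] = None,
) -> list[int]:
    """Lengths of the chunks produced by the split_overlapping logic from section 2.4."""
    if maximal_text_length:
        n_tokens = min(n_tokens, maximal_text_length)  # truncation happens before splitting
    lengths = [min(chunk_size, n_tokens - start) for start in range(0, n_tokens, stride)]
    if len(lengths) > 1:
        # short chunks are dropped only when more than one chunk exists
        lengths = [length for length in lengths if length >= minimal_chunk_length]
    return lengths


params = dict(chunk_size=510, stride=510, minimal_chunk_length=510, maximal_text_length=510 * 4)

short_text = chunk_lengths(300, **params)
medium_text = chunk_lengths(700, **params)
long_text = chunk_lengths(3155, **params)

print(short_text)   # [300]
print(medium_text)  # [510]
print(long_text)    # [510, 510, 510, 510]
```

Under these assumptions, a text is never split into more than 4 chunks, and with minimal_chunk_length=510 the trailing remainder of a multi-chunk text is discarded (190 tokens in the 700-token case). A smaller minimal_chunk_length would keep those tails.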