How to Train a BERT Model with the Transformers Trainer

Background:

This example was put together to study the training and application of large models in a more systematic way. It is meant as a reference for interested beginners and as a record for myself.

This article uses hotel-booking review data from Qunar as training material. The goal is that, given a Chinese review, the model automatically decides whether it is a positive or a negative review.

Materials:

1. Prepare the data (provide your own; the content is structured as follows)

label,review
0,"距离川沙公路较近,但是公交指示不对,如果是""蔡陆线""的话,会非常麻烦.建议用别的路线.房间较为简单."
1,商务大床房,房间很大,床有2M宽,整体感觉经济实惠不错!
0,早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。房间本身很好。
1,宾馆在小街道上,不大好找,但还好北京热心同胞很多~宾馆设施跟介绍的差不多,房间很小,确实挺小,但加上低价位因素,还是无超所值的;环境不错,就在小胡同内,安静整洁,暖气好足-_-||。。。呵还有一大优势就是从宾馆出发,步行不到十分钟就可以到梅兰芳故居等等,京味小胡同,北海距离好近呢。总之,不错。推荐给节约消费的自助游朋友~比较划算,附近特色小吃很多~
0,"CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风"
1,总的来说,这样的酒店配这样的价格还算可以,希望他赶快装修,给我的客人留些好的印象
0,价格比比较不错的酒店。这次免费升级了,感谢前台服务员。房子还好,地毯是新的,比上次的好些。早餐的人很多要早去些。
1,不错,在同等档次酒店中应该是值得推荐的!

2. Download a pretrained model. Training a model from scratch is too costly in time and compute, so we fine-tune an existing pretrained model. The pretrained bert-base-chinese model can be downloaded from the Hugging Face model hub (https://huggingface.co); search for the model there and download it. (Alternatively, from_pretrained('bert-base-chinese') will download and cache the files automatically if they are not present locally.)

3. Prepare a Python environment (Python version > 3.8); a sample install command is given below.
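
The exact package set depends on your environment; as a rough guide, the libraries used in this article can typically be installed with:

pip install transformers datasets scikit-learn pandas torch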

Implementation:

Model training

1. Imports

import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer, EvalPrediction
from sklearn.metrics import f1_score, accuracy_score

2. Load and preprocess the training data

# Load the raw CSV data
def get_trans_data(file_path):
    data = []
    with open(file_path, 'r', encoding='UTF-8') as file:
        for line in file:
            line = line.strip()
            if not line or line.startswith('label,'):
                # skip empty lines and the CSV header
                continue
            label, review = line.split(',', 1)
            # map the label to an integer; non-numeric labels fall back to 0
            data.append({'text': review.strip(), 'label': int(label.strip()) if label.strip().isdigit() else 0})
    return data
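
For reference, a hypothetical call looks like this (the file path is an assumption; point it at wherever the CSV actually lives):

raw_data = get_trans_data('E:/bertTest/hotel_reviews.csv')
# each element looks like {'text': '...review text...', 'label': 0 or 1}
print(len(raw_data), raw_data[0])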

3. Split the data into training and test sets and, with the datasets library, convert it into Dataset objects that the Trainer can consume

# Split the raw data and wrap it in Dataset objects for training
def change_data(sourceData):
    # Convert to a pandas DataFrame so that sklearn's train_test_split can be used
    data_df = pd.DataFrame(sourceData)
    train_data, test_data = train_test_split(data_df, test_size=0.2, random_state=50)
    # Convert the DataFrames into Hugging Face Dataset objects
    train_dataset = Dataset.from_pandas(train_data)
    test_dataset = Dataset.from_pandas(test_data)
    datasetDict = DatasetDict({'train_dataset': train_dataset, 'test_dataset': test_dataset})
    return datasetDict
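
A minimal sketch wiring the two helpers together (variable names follow the example above):

datasetDict = change_data(raw_data)
print(datasetDict)  # DatasetDict with 'train_dataset' and 'test_dataset' splits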

4. Train and evaluate the model

# Load the pretrained model, tokenize the data, and run the training loop
def train_model(datasetDict):
    # Load the tokenizer and the classification model; BertForSequenceClassification
    # already puts a linear classification head on top of the BERT encoder.
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    # The sample data only uses labels 0 and 1, so num_labels=2 would also suffice.
    model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)
    encoded_dataset = datasetDict.map(lambda examples: tokenizer(examples['text'], truncation=True, padding=True, max_length=512), batched=True)
    # Data collator that pads each batch to a uniform sequence length
    data_collator = DataCollatorWithPadding(tokenizer)
    # Training arguments: output directory, evaluation strategy, learning rate,
    # batch sizes, number of epochs, weight decay
    training_args = TrainingArguments(
        output_dir='E:/bertTest/results',
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=10,
        weight_decay=0.01
    )
    # Create the Trainer, which manages the whole training loop
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded_dataset['train_dataset'],
        eval_dataset=encoded_dataset['test_dataset'],
        data_collator=data_collator,
        tokenizer=tokenizer
    )
    # Start training
    print('Starting training')
    trainer.train()
    # Save the model and tokenizer
    print('Training finished')
    trainer.save_model("E:/bertTest/bert")
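
Putting the pieces together, a hypothetical entry point might look like this (the CSV path is an assumption):

if __name__ == '__main__':
    raw_data = get_trans_data('E:/bertTest/hotel_reviews.csv')
    datasetDict = change_data(raw_data)
    train_model(datasetDict)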

Model evaluation

1. Evaluate the model with sklearn.metrics (this example uses accuracy_score and f1_score)

# Metrics on the validation set
def multi_label_metrics(predictions, labels, threshold=0.5):
    # Despite the name, this performs multi-class evaluation: take the argmax over the logits.
    # The threshold argument is unused; it would only matter for genuine multi-label outputs.
    probs = np.argmax(predictions, -1)
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=probs, average='micro')
    accuracy = accuracy_score(y_true, probs)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'accuracy': accuracy}
    return metrics
 
def compute_metrics(p: EvalPrediction):
    print(p)  # optional: inspect the raw EvalPrediction object
    # p.predictions holds the raw logits, p.label_ids the gold labels
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(predictions=preds, labels=p.label_ids)
    return result
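
To sanity-check the metric function in isolation, it can be fed hand-made logits (the values below are fabricated purely for illustration):

dummy_logits = np.array([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]])
dummy_labels = np.array([0, 1])
print(multi_label_metrics(dummy_logits, dummy_labels))  # {'f1': 1.0, 'accuracy': 1.0}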

2. Plug the metric function into the training loop above

# Load the pretrained model, tokenize the data, and run training with evaluation
def train_model(datasetDict):
    # Load the tokenizer and the classification model (linear head included)
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)
    encoded_dataset = datasetDict.map(lambda examples: tokenizer(examples['text'], truncation=True, padding=True, max_length=512), batched=True)
    # Data collator that pads each batch to a uniform sequence length
    data_collator = DataCollatorWithPadding(tokenizer)
    # Training arguments: output directory, evaluation strategy, learning rate,
    # batch sizes, number of epochs, weight decay
    training_args = TrainingArguments(
        output_dir='E:/bertTest/results',
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=10,
        weight_decay=0.01
    )
    # Create the Trainer, now with the metric function attached
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded_dataset['train_dataset'],
        eval_dataset=encoded_dataset['test_dataset'],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    # Start training
    print('Starting training')
    trainer.train()
    # Evaluate the model
    trainer.evaluate()
    # Save the model and tokenizer
    print('Training finished')
    trainer.save_model("E:/bertTest/bert")

Model validation

1. Validate the model with the predict function

predictions = trainer.predict(encoded_dataset["test_dataset"])
print(predictions)
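
The returned object contains the raw logits; a hedged sketch of turning them into predicted class ids (numpy is imported as np above):

pred_labels = np.argmax(predictions.predictions, axis=-1)
print(pred_labels)          # predicted class ids for the test split
print(predictions.metrics)  # evaluation metrics, since compute_metrics is attached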

Trainer callbacks

Callbacks are the place to hook custom logic into the training loop or to implement more advanced behaviour such as early stopping, learning-rate adjustment, or custom checkpointing; a small example with the built-in early-stopping callback is sketched below.
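
For instance, transformers ships a ready-made EarlyStoppingCallback. A minimal sketch of enabling it (the patience value and the monitored metric are assumptions; it also requires load_best_model_at_end=True and matching evaluation/save strategies in TrainingArguments):

from transformers import EarlyStoppingCallback

# Stop training if the monitored metric has not improved for 3 evaluations.
# Assumes TrainingArguments(..., evaluation_strategy="epoch", save_strategy="epoch",
#                           load_best_model_at_end=True, metric_for_best_model="eval_loss")
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
# then pass it to the Trainer via callbacks=[early_stopping]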

1. Define a callback (on_log is just one of the overridable methods of TrainerCallback)

from transformers import TrainerCallback

# Custom callback: print the trainer state and the latest logs on every logging event
class LossCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        print('on_log', state)
        if state.is_local_process_zero:
            print(logs)

2. Register the callback with the training job

# Custom callback: print the trainer state and the latest logs on every logging event
class LossCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        print('on_log', state)
        if state.is_local_process_zero:
            print(logs)

# Load the pretrained model, tokenize the data, and train with metrics and a callback
def train_model(datasetDict):
    # Load the tokenizer and the classification model (linear head included)
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)
    encoded_dataset = datasetDict.map(lambda examples: tokenizer(examples['text'], truncation=True, padding=True, max_length=512), batched=True)
    # Data collator that pads each batch to a uniform sequence length
    data_collator = DataCollatorWithPadding(tokenizer)

    loss_callback = LossCallback()
    # Training arguments: output directory, evaluation strategy, learning rate,
    # batch sizes, number of epochs, weight decay
    training_args = TrainingArguments(
        output_dir='E:/bertTest/results',
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=10,
        weight_decay=0.01
    )
    # Create the Trainer with the metric function and the callback attached
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded_dataset['train_dataset'],
        eval_dataset=encoded_dataset['test_dataset'],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[loss_callback]
    )
    # Start training
    print('Starting training')
    trainer.train()
    # Evaluate the model
    trainer.evaluate()
    # Save the model and tokenizer
    print('Training finished')
    trainer.save_model("E:/bertTest/bert")

3. What other callback hooks are available? (from the official TrainerCallback definition)

    def on_init_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the end of the initialization of the [`Trainer`].
        """
        pass

    def on_train_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the beginning of training.
        """
        pass

    def on_train_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the end of training.
        """
        pass

    def on_epoch_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the beginning of an epoch.
        """
        pass

    def on_epoch_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the end of an epoch.
        """
        pass

    def on_step_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the beginning of a training step. If using gradient accumulation, one training step might take
        several inputs.
        """
        pass

    def on_pre_optimizer_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called before the optimizer step but after gradient clipping. Useful for monitoring gradients.
        """
        pass

    def on_optimizer_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called after the optimizer step but before gradients are zeroed out. Useful for monitoring gradients.
        """
        pass

    def on_substep_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the end of an substep during gradient accumulation.
        """
        pass

    def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called at the end of a training step. If using gradient accumulation, one training step might take
        several inputs.
        """
        pass

    def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called after an evaluation phase.
        """
        pass

    def on_predict(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, metrics, **kwargs):
        """
        Event called after a successful prediction.
        """
        pass

    def on_save(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called after a checkpoint save.
        """
        pass

    def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called after logging the last logs.
        """
        pass

    def on_prediction_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """
        Event called after a prediction step.
        """
        pass

Saved model files

1. With the configuration above, the trained model ends up under E:/bertTest/results in checkpoint directories (e.g. checkpoint-10, written during training) and under E:/bertTest/bert (the copy written by trainer.save_model). If you call trainer.save_model() without an argument, the model is saved to the output_dir configured in TrainingArguments.

Using the model

In normal production use, the model is not invoked through the training code above; it is loaded and called for inference as follows.

1. Imports

from transformers import BertTokenizer, AutoModelForSequenceClassification
import torch

2. Load the model files and the tokenizer (the tokenizer must belong to the same model family as the trained model)

# Pick a device and load the fine-tuned model and its tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "E:/bertTest/bert"
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.eval()
tokenizer = BertTokenizer.from_pretrained(model_name)

3. Run a test input and print the result

# Tokenize a test sentence, move it to the same device as the model, and run inference
encoding = tokenizer("酒店应该重视一下这个问题了", return_tensors="pt").to(device)
with torch.no_grad():
    res = model(**encoding)
# The class with the highest logit is the predicted label
predicted_label_classes = res.logits.argmax(-1)
print(predicted_label_classes)
outputs = predicted_label_classes.tolist()
print(outputs)
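
Since the training data uses label 0 for negative and 1 for positive reviews, a hedged post-processing step could map the predicted id back to a readable label (the mapping dict is an assumption based on the sample CSV):

id2label = {0: "negative", 1: "positive"}
print(id2label.get(outputs[0], "unknown"))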

Results:

1. Training output (in the original screenshot, note the highlighted f1 and accuracy columns; these two metrics only appear when the compute_metrics function is supplied during training)

2. Prediction output (compare with the example above)

Notes:

1. BertForSequenceClassification is a subclass of BertPreTrainedModel; it adds a dropout layer and a linear classification head on top of the BERT encoder:

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

2. AutoModelForSequenceClassification is a transformers model class dedicated to sequence classification. It hides the architecture-specific classes behind a convenient interface, so a pretrained sequence-classification model can be loaded and deployed without knowing its exact architecture. (It is the class most often used at inference time.)

3. Trainer greatly lowers the barrier to training a model, but it comes with many configuration options and parameters that deserve careful study.

How to optimize and speed up BERT training

Several techniques can be combined to make BERT training more efficient and faster:

1. Data-level optimization. Efficient sampling strategies can noticeably reduce compute: when it does not hurt final model quality, a representative subset of the data can be used for training instead of the full corpus.

2. Mixed-precision training. FP16 mixed precision sharply cuts memory usage and floating-point work while keeping the same convergence behaviour and accuracy, because modern GPUs execute FP16 much faster while the numerically sensitive parts stay in FP32 for stability.

3. Hyperparameter tuning. Adjusting the learning rate, batch size and similar hyperparameters can shorten the path to the target loss and reduce the number of epochs needed; sensible settings also help prevent overfitting and improve generalization.

4. Parallel and distributed training. Synchronous or asynchronous SGD updates across a multi-machine, multi-GPU cluster spread the load of a single node and scale close to linearly, which is especially useful for finishing pre-training quickly on large corpora.

from transformers import BertForMaskedLM, Trainer, TrainingArguments

model = BertForMaskedLM.from_pretrained('bert-base-uncased')

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # a tokenized dataset prepared beforehand (not shown here)
)

# When launched with torchrun across several GPUs or nodes, Trainer sets up
# distributed training itself; no manual init_process_group call is needed.
trainer.train()
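
As a concrete illustration of the mixed-precision point above, FP16 can be switched on through TrainingArguments; a minimal sketch, reusing the values from earlier and assuming a CUDA-capable GPU:

training_args = TrainingArguments(
    output_dir='E:/bertTest/results',
    fp16=True,  # mixed-precision training; only takes effect on supported GPUs
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=10,
)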