Background:
This walkthrough exists so I could study large-model training and application systematically; it is shared as a reference for interested beginners and kept as my own notes.
Using Qunar hotel-booking review data as training material, we fine-tune a model so that, given a Chinese review, it classifies the text as a positive or a negative review.
Materials:
1. Prepare the data yourself; the expected structure is shown below
label,review
0,"距离川沙公路较近,但是公交指示不对,如果是""蔡陆线""的话,会非常麻烦.建议用别的路线.房间较为简单."
1,商务大床房,房间很大,床有2M宽,整体感觉经济实惠不错!
0,早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。房间本身很好。
1,宾馆在小街道上,不大好找,但还好北京热心同胞很多~宾馆设施跟介绍的差不多,房间很小,确实挺小,但加上低价位因素,还是无超所值的;环境不错,就在小胡同内,安静整洁,暖气好足-_-||。。。呵还有一大优势就是从宾馆出发,步行不到十分钟就可以到梅兰芳故居等等,京味小胡同,北海距离好近呢。总之,不错。推荐给节约消费的自助游朋友~比较划算,附近特色小吃很多~
0,"CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风"
1,总的来说,这样的酒店配这样的价格还算可以,希望他赶快装修,给我的客人留些好的印象
0,价格比比较不错的酒店。这次免费升级了,感谢前台服务员。房子还好,地毯是新的,比上次的好些。早餐的人很多要早去些。
1,不错,在同等档次酒店中应该是值得推荐的!
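Note that the review text itself may contain commas, so such fields are wrapped in double quotes (with "" as the inner escape). As an aside, pandas can parse this quoting directly; a minimal sketch with a hypothetical file path:

import pandas as pd

# read_csv handles the quoted fields and the "" escapes out of the box
df = pd.read_csv('E:/bertTest/reviews.csv')  # hypothetical path; columns: label, review
print(df.head())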
2. Download a pretrained model. Training a model from scratch costs too much time and compute, so we fine-tune an existing checkpoint instead. The pretrained bert-base-chinese model can be downloaded from Hugging Face (https://huggingface.co): search for the model name and download it.
3. Prepare the Python environment (Python newer than 3.8)
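The dependencies used below can be installed with pip; this package list is inferred from the imports in this walkthrough (versions are not pinned):

pip install torch transformers datasets pandas scikit-learn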
Implementation:
Model training
1. Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
import numpy as np
from transformers import (BertTokenizer, BertForSequenceClassification, DataCollatorWithPadding,
                          TrainingArguments, Trainer, TrainerCallback, EvalPrediction)
from sklearn.metrics import f1_score, accuracy_score
2. Load and preprocess the training data

# Read the raw data; each line has the form "label,review"
def get_trans_data(file_path):
    data = []
    with open(file_path, 'r', encoding='UTF-8') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            label, review = line.split(',', 1)
            label = label.strip()
            # skip the header row (and any line whose label is not a digit)
            if not label.isdigit():
                continue
            # strip the surrounding quotes from quoted reviews
            data.append({'text': review.strip().strip('"'), 'label': int(label)})
    return data
3. Split the data into a training set and a test set, and use the datasets library to convert them into the Dataset structures the Trainer consumes.

# Turn the raw records into a DatasetDict ready for training
def change_data(sourceData):
    # convert to a pandas DataFrame so we can reuse sklearn's train_test_split
    data_df = pd.DataFrame(sourceData)
    train_data, test_data = train_test_split(data_df, test_size=0.2, random_state=50)
    # build a Hugging Face Dataset from each split
    # (preserve_index=False keeps the pandas index out of the dataset columns)
    train_dataset = Dataset.from_pandas(train_data, preserve_index=False)
    test_dataset = Dataset.from_pandas(test_data, preserve_index=False)
    datasetDict = DatasetDict({'train_dataset': train_dataset, 'test_dataset': test_dataset})
    return datasetDict
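A quick usage sketch wiring the two helpers together (the CSV path is hypothetical):

raw_data = get_trans_data('E:/bertTest/reviews.csv')  # hypothetical path
dataset_dict = change_data(raw_data)
print(dataset_dict)  # shows both splits and their row counts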
4. Train and evaluate the model

# Load the pretrained model, tokenize the text, and run the training loop
def train_model(datasetDict):
    # load the tokenizer; the linear classification head is added by
    # BertForSequenceClassification itself
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    # the review data has two classes: 0 = negative, 1 = positive
    model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)
    # tokenize the text; per-batch padding is handled by the data collator below
    encoded_dataset = datasetDict.map(
        lambda examples: tokenizer(examples['text'], truncation=True, max_length=512),
        batched=True)
    # data collator that pads every batch to a uniform length
    data_collator = DataCollatorWithPadding(tokenizer)
    # training arguments: output directory, evaluation strategy, learning rate,
    # batch sizes, number of epochs, weight decay
    training_args = TrainingArguments(
        output_dir='E:/bertTest/results',
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=10,
        weight_decay=0.01
    )
    # the Trainer manages the whole training loop
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded_dataset['train_dataset'],
        eval_dataset=encoded_dataset['test_dataset'],
        data_collator=data_collator,
        tokenizer=tokenizer
    )
    # start training
    print('Starting training')
    trainer.train()
    # save the model and tokenizer
    print('Training finished')
    trainer.save_model("E:/bertTest/bert")
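Putting the pieces together, a minimal entry point could look like this (the CSV path is hypothetical):

if __name__ == '__main__':
    raw_data = get_trans_data('E:/bertTest/reviews.csv')  # hypothetical path
    train_model(change_data(raw_data))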
Model evaluation
1. Evaluate the model with helpers from sklearn.metrics (this example uses accuracy_score and f1_score)
# Metrics computed on the validation split
def multi_label_metrics(predictions, labels):
    # the model outputs one logit per class; argmax gives the predicted class
    preds = np.argmax(predictions, -1)
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=preds, average='micro')
    accuracy = accuracy_score(y_true, preds)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(predictions=preds, labels=p.label_ids)
    return result
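A quick sanity check of the metrics function with hand-made logits (dummy values; two classes, four samples):

logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.5], [0.1, 0.9]])
labels = np.array([0, 1, 1, 1])
# argmax predictions are [0, 1, 0, 1], so 3 of the 4 samples are correct
print(multi_label_metrics(logits, labels))  # {'f1': 0.75, 'accuracy': 0.75}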
2. Wire the metrics function into the training flow above. Only two changes are needed inside train_model: pass compute_metrics when constructing the Trainer, and run an explicit evaluation pass after training.

# identical to train_model above except for the lines marked below
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train_dataset'],
    eval_dataset=encoded_dataset['test_dataset'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics  # newly added
)
# start training
print('Starting training')
trainer.train()
# evaluate the final model on the test split (newly added)
trainer.evaluate()
# save the model and tokenizer
print('Training finished')
trainer.save_model("E:/bertTest/bert")
Model verification
1. Use the predict function to check the model on the test split
predictions = trainer.predict(encoded_dataset["test_dataset"])
print(predictions)
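predict returns the raw logits together with the label ids and the computed metrics. To turn the logits into class predictions, take an argmax over the last axis:

# predictions.predictions holds the logits produced by the model
pred_labels = np.argmax(predictions.predictions, axis=-1)
print(pred_labels[:10])  # the first ten predicted class ids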
Trainer callbacks
Callbacks let you inject custom logic into the training loop or add higher-level behavior such as early stopping, learning-rate adjustment, or custom model saving.
1. Define a callback (on_log is just one of the overridable methods of TrainerCallback)

# Hook into the training loop via a callback
class LossCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # only print from the main process when training runs distributed
        if state.is_local_process_zero:
            print(logs)
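Early stopping, mentioned above, does not even require a hand-written callback: transformers ships an EarlyStoppingCallback. A minimal sketch (the patience value is arbitrary; it also requires load_best_model_at_end=True and a metric_for_best_model set in TrainingArguments):

from transformers import EarlyStoppingCallback

# stop training if the monitored metric fails to improve for 3 consecutive evaluations
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
# pass callbacks=[early_stop] to the Trainer, and add to TrainingArguments:
#   load_best_model_at_end=True, metric_for_best_model='eval_loss'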
2. Register the callback with the training job. Again, only the Trainer construction inside train_model changes:

loss_callback = LossCallback()
# the Trainer from the previous section, now with the callback attached
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train_dataset'],
    eval_dataset=encoded_dataset['test_dataset'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[loss_callback]  # newly added
)
3. What other hooks does TrainerCallback provide? (from the official source)
def on_init_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the end of the initialization of the [`Trainer`].
    """
    pass

def on_train_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the beginning of training.
    """
    pass

def on_train_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the end of training.
    """
    pass

def on_epoch_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the beginning of an epoch.
    """
    pass

def on_epoch_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the end of an epoch.
    """
    pass

def on_step_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the beginning of a training step. If using gradient accumulation, one training step
    might take several inputs.
    """
    pass

def on_pre_optimizer_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called before the optimizer step but after gradient clipping. Useful for monitoring gradients.
    """
    pass

def on_optimizer_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called after the optimizer step but before gradients are zeroed out. Useful for monitoring gradients.
    """
    pass

def on_substep_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the end of a substep during gradient accumulation.
    """
    pass

def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called at the end of a training step. If using gradient accumulation, one training step
    might take several inputs.
    """
    pass

def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called after an evaluation phase.
    """
    pass

def on_predict(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, metrics, **kwargs):
    """
    Event called after a successful prediction.
    """
    pass

def on_save(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called after a checkpoint save.
    """
    pass

def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called after logging the last logs.
    """
    pass

def on_prediction_step(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
    """
    Event called after a prediction step.
    """
    pass
Model artifacts
1. With the configuration above, training checkpoints land under E:/bertTest/results (the checkpoint-* directories, e.g. checkpoint-10 in this run), and the final model under E:/bertTest/bert. Note that trainer.save_model('bert') writes to the path you pass (a relative path is resolved against the working directory), whereas trainer.save_model() with no argument writes to the output_dir configured in TrainingArguments.
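As a rough orientation (the exact file set varies with the transformers version), the saved directory usually contains config.json, the model weights (pytorch_model.bin or model.safetensors), training_args.bin, and the tokenizer files such as vocab.txt, tokenizer_config.json and special_tokens_map.json.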
Using the model
In normal business use, you load the saved model directly for inference rather than going back through the training code above.
1. Import the libraries
from transformers import BertTokenizer, AutoModelForSequenceClassification
import torch
2. Load the saved model and its tokenizer (the tokenizer must belong to the same model family as the fine-tuned checkpoint)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "E:/bertTest/bert"
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
tokenizer = BertTokenizer.from_pretrained(model_name)
3. Run a test input and print the prediction

# tokenize a sample review and move it to the same device as the model
encoding = tokenizer("酒店应该重视一下这个问题了", return_tensors="pt").to(device)
model.eval()
with torch.no_grad():
    res = model(**encoding)
# the class with the highest logit is the predicted label
predicted_label_classes = res.logits.argmax(-1)
print(predicted_label_classes)
outputs = predicted_label_classes.tolist()
print(outputs)
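To turn the numeric class id into a human-readable verdict, map it through this dataset's label convention (0 = negative review, 1 = positive review; the dict below is illustrative):

id2label = {0: 'negative (差评)', 1: 'positive (好评)'}
print([id2label[i] for i in outputs])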
Results:
1. Training output: watch the evaluation log for the eval_f1 and eval_accuracy fields; these two values appear only when a compute_metrics function was supplied during training.
2. Prediction output: compare with the printed tensors from the test code above.
Notes:
1. BertForSequenceClassification is a subclass of BertPreTrainedModel; it wraps BertModel and adds a dropout layer plus a linear classification head:
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()
2. AutoModelForSequenceClassification is a model class in the transformers library dedicated to sequence-classification tasks. It wraps the architecture-specific model classes behind one convenient interface, so users can load and deploy a pretrained checkpoint for sequence classification without naming the exact architecture; it is the class usually chosen at inference time.
3. The Trainer utility greatly lowers the barrier to training a model, but it exposes many configuration options and parameters, so take the time to understand them.