Fine-Tuning BERT for Text Classification with the Data Science on AWS Project

富晓微Erik

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/gitblog_01015/article/details/148578393

data-science-on-aws: AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker. Project repository: https://gitcode.com/gh_mirrors/da/data-science-on-aws

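Introduction

In natural language processing (NLP), BERT (Bidirectional Encoder Representations from Transformers) has become one of the most widely used pretrained language models. This article walks through how the Data Science on AWS project fine-tunes a BERT model with TensorFlow and the Transformers library to build a text classifier that predicts a product's star rating from its review text.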

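Preparation

Before training the model, make sure the following steps are complete:

Data preprocessing: convert the raw review text into the input format BERT accepts
Feature engineering: generate text embeddings with the pretrained BERT model
Dataset split: divide the data into training, validation, and test sets
Format conversion: save the datasets as TFRecord files to improve TensorFlow training throughput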
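
The first preparation step can be illustrated without any tokenizer library. The sketch below is not the project's preprocessing code: the whitespace tokenizer and the tiny `TOY_VOCAB` are hypothetical stand-ins for the real WordPiece tokenizer, and `max_seq_length=8` is an assumed value. It only shows how a review becomes fixed-length `input_ids` plus an `input_mask` that marks real tokens with 1 and padding with 0.

```python
# Hypothetical toy vocabulary; a real BERT tokenizer uses a WordPiece vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "this": 4, "product": 5, "is": 6, "terrible": 7, "great": 8}


def encode(text, max_seq_length=8):
    """Return (input_ids, input_mask), both of length max_seq_length."""
    words = text.lower().replace(".", " ").split()
    # Reserve two positions for the [CLS] and [SEP] markers, truncating if needed.
    words = words[: max_seq_length - 2]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    ids.append(TOY_VOCAB["[SEP]"])
    mask = [1] * len(ids)  # 1 marks a real token
    padding = max_seq_length - len(ids)
    # Pad both lists to the fixed length; 0 in the mask marks padding.
    return ids + [TOY_VOCAB["[PAD]"]] * padding, mask + [0] * padding


input_ids, input_mask = encode("This product is terrible.")
print(input_ids)
print(input_mask)
```

Whatever sequence length is chosen at this stage has to match the length the model's input layers expect later on.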

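Architecture Overview

The core architecture of the project is shown in the figure below:

[Figure: BERT training architecture]

We use DistilBERT, a lightweight variant of BERT that keeps accuracy high while significantly reducing memory and compute requirements. Using knowledge distillation, DistilBERT is about 40% smaller than the original BERT model while losing only about 3% of its performance.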

Code Walkthrough

1. Import the required libraries

First, import all the Python libraries we need:

import time
import random
import pandas as pd
from glob import glob
import tensorflow as tf
from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification
from transformers import DistilBertConfig

2. Data loading and preprocessing

We define several key functions for handling data in TFRecord format:

def select_data_and_label_from_record(record):
    x = {
        "input_ids": record["input_ids"],
        "input_mask": record["input_mask"],
    }
    y = record["label_ids"]
    return (x, y)

def file_based_input_dataset_builder(channel, input_filenames, pipe_mode, is_training, drop_remainder):
    # Data loading and preprocessing logic (elided)
    ...

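These functions are responsible for:

Extracting features and labels from the TFRecords
Building an efficient data pipeline
Batching and caching the data
Shuffling the training data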
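
The batching and shuffling responsibilities listed above can be shown in plain Python. This is a hypothetical analogue, not the elided body of `file_based_input_dataset_builder`: the names `batch_records`, `batch_size`, and `seed` are illustrative, and a real pipeline would use `tf.data` operations instead of lists.

```python
import random


def batch_records(records, batch_size, is_training, drop_remainder, seed=0):
    """Group records into batches, shuffling first when training."""
    records = list(records)
    if is_training:
        # Shuffle only the training data; evaluation order stays deterministic.
        random.Random(seed).shuffle(records)
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        # Drop the final short batch so every batch has the same shape.
        batches.pop()
    return batches


records = list(range(10))
eval_batches = batch_records(records, batch_size=4, is_training=False, drop_remainder=False)
train_batches = batch_records(records, batch_size=4, is_training=True, drop_remainder=True)
print(eval_batches)        # three batches; the last holds the 2 leftover records
print(len(train_batches))  # only the full batches of 4 remain
```

Dropping the remainder trades a few records per pass for batches of a fixed shape, which is why it is usually tied to training rather than evaluation.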

3. Load the datasets

We load the training, validation, and test datasets separately:

train_data = "./data-tfrecord/bert-train"
train_data_filenames = glob("{}/*.tfrecord".format(train_data))
train_dataset = file_based_input_dataset_builder(...)

validation_data = "./data-tfrecord/bert-validation"
validation_data_filenames = glob("{}/*.tfrecord".format(validation_data))
validation_dataset = file_based_input_dataset_builder(...)

test_data = "./data-tfrecord/bert-test"
test_data_filenames = glob("{}/*.tfrecord".format(test_data))
test_dataset = file_based_input_dataset_builder(...)

4. Model configuration and construction

We use DistilBERT as the base model and add a custom classification head on top:

CLASSES = [1, 2, 3, 4, 5]  # star ratings, 1 to 5

config = DistilBertConfig.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(CLASSES),
    id2label={0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    label2id={1: 0, 2: 1, 3: 2, 4: 3, 5: 4},
)

transformer_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", config=config)

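# Build the full model architecture
max_seq_length = 64  # assumed value; must match the sequence length used when the TFRecords were written
input_ids = tf.keras.layers.Input(shape=(max_seq_length,), name="input_ids", dtype="int32")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), name="input_mask", dtype="int32")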

embedding_layer = transformer_model.distilbert(input_ids, attention_mask=input_mask)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation="relu")(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(len(CLASSES), activation="softmax")(X)

model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=X)
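
Because the classifier outputs class indices 0 to 4 while the ratings run from 1 to 5, the two mappings in the configuration must be exact inverses. A small sketch, assuming nothing beyond the `CLASSES` list, that derives both directions from the same source instead of writing them by hand:

```python
CLASSES = [1, 2, 3, 4, 5]  # star ratings, as in the configuration above

# Derive both directions from the same list so they cannot drift apart.
id2label = {index: label for index, label in enumerate(CLASSES)}
label2id = {label: index for index, label in enumerate(CLASSES)}

print(id2label)
print(label2id)
```

The resulting dictionaries could be passed to `DistilBertConfig.from_pretrained` in place of the literal ones.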

5. Training configuration

We configure the loss function, optimizer, and evaluation metric used for training:

# The final Dense layer already applies softmax, so the loss receives probabilities, not logits
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
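
The `from_logits` flag tells the loss whether it should apply softmax itself. The plain-Python sketch below (helper names are illustrative, not Keras APIs, and the scores are made up) computes the cross-entropy for one example both ways and shows that applying softmax to values that are already probabilities flattens the distribution and distorts the loss.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probabilities[label])


logits = [0.1, 0.2, 0.3, 0.4, 4.0]  # made-up raw scores for the 5 classes
label = 4                           # true class index (5 stars)

probs = softmax(logits)
correct_loss = cross_entropy(probs, label)          # softmax applied once
double_loss = cross_entropy(softmax(probs), label)  # softmax applied twice

print(round(correct_loss, 3), round(double_loss, 3))
```

The second value is noticeably larger even though the prediction is the same, which is the symptom to expect when a softmax output layer is paired with a loss that also applies softmax.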

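6. Training and evaluation

Train the model and evaluate it on the validation set. With epochs=1 and steps_per_epoch=10, this run covers only ten batches, so it works as a quick smoke test rather than a full training run: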

callbacks = [tf.keras.callbacks.TensorBoard(log_dir="./tmp/tensorboard/")]

history = model.fit(
    train_dataset,
    shuffle=True,  # ignored by Keras when the input is a tf.data.Dataset
    epochs=1,
    steps_per_epoch=10,
    validation_data=validation_dataset,
    validation_steps=10,
    callbacks=callbacks,
)

# Evaluate the model on the test set; returns [loss, accuracy]
test_history = model.evaluate(test_dataset, steps=10, callbacks=callbacks)
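
To judge how much data a run with `steps_per_epoch=10` actually touches, the arithmetic can be sketched as follows. The batch size of 128 and the dataset size of 10,000 are assumptions for illustration only; the article does not state either value.

```python
import math


def examples_seen(steps_per_epoch, batch_size, epochs=1):
    """Number of training examples consumed by a capped run."""
    return steps_per_epoch * batch_size * epochs


def steps_for_full_pass(num_examples, batch_size):
    """Steps needed for one complete pass, counting a final partial batch."""
    return math.ceil(num_examples / batch_size)


batch_size = 128       # assumed, not taken from the article
num_examples = 10_000  # hypothetical training-set size

seen = examples_seen(steps_per_epoch=10, batch_size=batch_size)
full = steps_for_full_pass(num_examples, batch_size)
print(seen, full)
```

Comparing the two numbers shows how far a ten-step run is from a full epoch under these assumed sizes.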

7. Saving the model and making predictions

After training, we save the model and run a prediction:

# Save the model
model.save("./tmp/tensorflow/", include_optimizer=False, overwrite=True)

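# Run a prediction with the model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")  # tokenizer matching the base model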
sample_review_body = "This product is terrible."
encode_plus_tokens = tokenizer.encode_plus(
    sample_review_body, padding='max_length', max_length=max_seq_length, truncation=True, return_tensors="tf"
)

outputs = model.predict(x=(encode_plus_tokens["input_ids"], encode_plus_tokens["attention_mask"]))
prediction = [{"label": config.id2label[item.argmax()], "score": item.max().item()} for item in outputs]

print('Predicted star_rating "{}" for review_body "{}"'.format(prediction[0]["label"], sample_review_body))
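
The list comprehension above picks, for each row of class probabilities, the most likely class and its score. The same logic in plain Python, using a made-up probability row rather than real model output (the helper name `to_predictions` is illustrative):

```python
id2label = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}  # class index -> star rating


def to_predictions(probability_rows):
    """Map each row of class probabilities to its best label and score."""
    predictions = []
    for row in probability_rows:
        # Index of the largest probability in this row (argmax).
        best_index = max(range(len(row)), key=row.__getitem__)
        predictions.append({"label": id2label[best_index], "score": row[best_index]})
    return predictions


outputs = [[0.70, 0.15, 0.08, 0.04, 0.03]]  # made-up values, not model output
prediction = to_predictions(outputs)
print(prediction[0]["label"], prediction[0]["score"])
```

Keeping the score alongside the label makes it possible to flag low-confidence predictions downstream.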