BERT Text Classification in Practice

冰__蓝

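License: CC 4.0 BY-SA

Category: NLP Technology. Tags: BERT, text classification

This is an original article by the author and may not be reposted without the author's permission.

Article link: https://blog.youkuaiyun.com/ling620/article/details/97749908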

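The previous article covered how to install BERT and use it for a text similarity task, including how to modify the code for training and testing. Building on that, this article shows how to handle a text classification task.

For the text similarity task, see: BERT介绍及中文文本相似度任务实践 (An Introduction to BERT and a Chinese Text Similarity Task in Practice)

The two tasks differ in how the dataset is prepared and in how the data processor class in run_classifier.py is written.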

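0. Preparation

To fine-tune on your own dataset, first download a pre-trained model. Because the text being processed is Chinese, download the corresponding Chinese pre-trained model.

BERT Git repository: google-research/bert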

BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

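The file is named chinese_L-12_H-768_A-12.zip. Unzip it into the bert folder; it contains the following three kinds of files:

Config file (bert_config.json): specifies the model's hyperparameters
Vocabulary file (vocab.txt): maps WordPiece tokens to word ids
TensorFlow checkpoint (bert_model.ckpt): contains the weights of the pre-trained model (it actually consists of three files)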
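
Before fine-tuning, it can help to confirm that the archive was unpacked as expected. The following is a minimal sketch and not part of the BERT repository: the helper name check_bert_dir is hypothetical, and it only assumes that the checkpoint files share the bert_model.ckpt prefix.

```python
import os


def check_bert_dir(model_dir):
    """Return the expected BERT files that are missing from model_dir.

    Hypothetical helper for illustration; not part of google-research/bert.
    """
    missing = []
    for name in ("bert_config.json", "vocab.txt"):
        if not os.path.isfile(os.path.join(model_dir, name)):
            missing.append(name)

    # The checkpoint is stored as several files sharing one prefix,
    # so look for any file whose name starts with that prefix.
    entries = os.listdir(model_dir) if os.path.isdir(model_dir) else []
    if not any(entry.startswith("bert_model.ckpt") for entry in entries):
        missing.append("bert_model.ckpt*")
    return missing
```

An empty list means all three kinds of files were found; otherwise the returned names show what still needs to be unpacked.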

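1. Preparing the dataset

For a text classification task, each line of the dataset has the form label, text, where the label can be a Chinese string or a number.
For example: 天气, 一会好像要下雨了 (label "weather", text "it looks like it is about to rain") or 0, 一会好像要下雨了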
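
As a concrete illustration of this layout, the sketch below writes a tiny dataset file with one sample per line. The helper name write_samples and the second sample row are made up for this example, and UTF-8 encoding is an assumption.

```python
def write_samples(path, samples):
    """Write (label, text) pairs as 'label,text' lines, one sample per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in samples:
            f.write("{},{}\n".format(label, text))


# Illustrative rows only; the second label and sentence are invented.
samples = [
    ("天气", "一会好像要下雨了"),
    ("音乐", "放一首轻松的歌"),
]
```

Calling write_samples("train.csv", samples) would then produce a two-line file in the format described above.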

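Store the prepared data in a text file such as .txt or .csv. The file name and extension do not matter, as long as they match the name used in the data processor class.
For example, in the get_train_examples method of the data processor class in run_classifier.py, the training file defaults to train.csv; change it to your own file name if needed.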

    def get_train_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'train.csv')

2. Adding a custom data processor class

Name the new data processor class for text classification TextClassifierProcessor, as follows:

class TextClassifierProcessor(DataProcessor):

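Override the four methods of its parent class to implement data loading.

get_train_examples: returns the collection of InputExample objects for the training set
get_dev_examples: the same for the validation set
get_test_examples: the same for the test set
get_labels: returns the list of class labels in the dataset

The InputExample class represents a single training/test example for sequence classification. Each InputExample holds guid, text_a, text_b, and label.
It is defined as follows: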

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.

        Args:
          guid: Unique id for the example.
          text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

When overriding get_train_examples for a text classification task, each example needs only a label and one piece of text, so only text_a has to be assigned.
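
The difference from a sentence-pair task can be sketched as follows. The InputExample class here is a trimmed stand-in for the definition in run_classifier.py so that the snippet runs on its own; the guid values, the second sentence, and the label "1" are illustrative assumptions.

```python
class InputExample(object):
    """Trimmed stand-in for InputExample so this sketch runs on its own."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


# Single-sentence classification: only text_a carries text, text_b stays None.
classification_example = InputExample(
    guid="train-0", text_a="一会好像要下雨了", label="天气")

# Sentence-pair task (such as text similarity): both sequences are set.
# The second sentence and the label "1" are made up for illustration.
similarity_example = InputExample(
    guid="train-0", text_a="一会好像要下雨了", text_b="等下可能会下雨", label="1")
```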

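Because the label and the text in the prepared dataset are separated by a comma, the straightforward approach is to split each line on the comma: split_line[0] is the label and is assigned to label, and split_line[1] is the text and is assigned to text_a.

However, the text itself may well contain ASCII commas too, in which case splitting on every comma truncates the text. To avoid this, use str.find(',') to locate the first comma, so that label = line[:line.find(',')].strip(), and take everything after that position as the text.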
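
A minimal sketch of this first-comma split is shown below. The function name parse_line and the sample sentence are assumptions made for illustration; only the ASCII comma is treated as the separator.

```python
def parse_line(line):
    """Split a 'label,text' line at the first ASCII comma only."""
    idx = line.find(",")
    if idx == -1:
        raise ValueError("no comma separator found: {!r}".format(line))
    label = line[:idx].strip()
    text = line[idx + 1:].strip()
    return label, text


# Made-up sample whose text contains a comma of its own.
line = "天气, 今天有雨,记得带伞"
print(line.split(","))   # three pieces: the text was cut at its own comma
print(parse_line(line))  # ('天气', '今天有雨,记得带伞')
```

Raising an error on lines without a comma is one possible design choice; skipping such lines would work as well.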

The validation and test sets are handled the same way.

    def get_train_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'train.csv')
        examples
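
The listing above breaks off in the source. As a rough sketch of how such a processor might look under the conventions described in this article, and not the author's actual code, consider the following. InputExample and DataProcessor are stand-ins for the classes in run_classifier.py; the file names dev.csv and test.csv, the helper _read_examples, the LABELS placeholder, the guid format, and the use of plain open() with UTF-8 are all assumptions.

```python
import os


class InputExample(object):
    """Stand-in for the InputExample class in run_classifier.py."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class DataProcessor(object):
    """Stand-in for the DataProcessor base class in run_classifier.py."""


class TextClassifierProcessor(DataProcessor):
    """Reads 'label,text' lines for single-sentence classification."""

    # Placeholder label set; replace with the labels used in your data.
    LABELS = ["天气", "音乐"]

    def _read_examples(self, file_path, set_type):
        """Hypothetical helper shared by the three get_*_examples methods."""
        examples = []
        with open(file_path, "r", encoding="utf-8") as f:
            for index, line in enumerate(f):
                line = line.strip()
                idx = line.find(",")  # first comma separates label and text
                if not line or idx == -1:
                    continue  # skip blank or malformed lines
                guid = "%s-%d" % (set_type, index)
                label = line[:idx].strip()
                text_a = line[idx + 1:].strip()
                examples.append(
                    InputExample(guid=guid, text_a=text_a, label=label))
        return examples

    def get_train_examples(self, data_dir):
        return self._read_examples(os.path.join(data_dir, "train.csv"), "train")

    def get_dev_examples(self, data_dir):
        return self._read_examples(os.path.join(data_dir, "dev.csv"), "dev")

    def get_test_examples(self, data_dir):
        return self._read_examples(os.path.join(data_dir, "test.csv"), "test")

    def get_labels(self):
        return list(self.LABELS)
```

The constructor takes no arguments so that the class can be instantiated without parameters, and the label list is kept as a class attribute for the same reason.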
