Background
This article fine-tunes BERT; the task is sentiment analysis.
Prerequisites
# Install the required packages
pip install datasets    # my version is 3.2.0
pip install accelerate  # 1.2.1
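To confirm the installation, you can print the installed versions (a quick sanity check; your version numbers may differ):

import datasets, accelerate, transformers
print(datasets.__version__)     # e.g. 3.2.0
print(accelerate.__version__)   # e.g. 1.2.1
print(transformers.__version__)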
Steps
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import numpy as np
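# Optional: fine-tuning is much faster on a GPU; check which device is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)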
# 1. Load the dataset
dataset = load_dataset('imdb')
print(dataset)
# Output:
'''
DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})
'''
# 2. Create the training and test sets
train_set = dataset['train']
test_set = dataset['test']
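# Optional: peek at one training example. In the IMDB dataset, label 0 is negative
# and label 1 is positive.
print(train_set[0]['label'], train_set[0]['text'][:100])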
# 3. Download and load the pretrained bert-base-uncased model and tokenizer.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# The tokenizer is created with the BertTokenizerFast class rather than BertTokenizer.
# Compared with BertTokenizer, the fast (Rust-backed) class has several advantages,
# e.g. it is noticeably faster and supports offset mappings.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
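# A quick illustration of one fast-tokenizer feature: offset mappings are only
# supported by the "fast" tokenizers, not by the plain BertTokenizer.
enc = tokenizer('I love Paris', return_offsets_mapping=True)
print(enc['input_ids'])        # token ids, including the added [CLS] and [SEP]
print(enc['offset_mapping'])   # character span of each token in the original string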
'''
The following warning appears; this is expected: BertForSequenceClassification adds an extra classification head on top of BERT to map the output to class labels, and that head is randomly initialized.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
'''
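# The extra head the warning refers to is a single linear layer on top of BERT's
# pooled [CLS] output; printing it shows a freshly initialized Linear(768 -> 2) classifier.
print(model.classifier)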
# 4. Preprocess the training and test sets; see the end of this article for how the tokenizer call works
def preprocess(data):
    return tokenizer(data['text'], padding=True, truncation=True)
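# Quick check of what preprocess returns for a small batch (illustrative only):
# the tokenizer output contains input_ids, token_type_ids and attention_mask.
sample = preprocess(train_set[:2])
print(sample.keys())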