Fine-tuning a pretrained model for text classification with huggingface.transformers.AutoModelForSequenceClassification

This post shows how to use Hugging Face's transformers library to fine-tune a pretrained BERT model for text classification with PyTorch. It covers training with the Trainer API, custom data handling, and a native PyTorch implementation, along with the key code snippets.

诸神缄默不语 – personal CSDN blog post index

This post is part of my series of study notes on the full huggingface.transformers documentation.
Series link: huggingface transformers documentation study notes (continuously updated…)

Tutorial page covered here: https://huggingface.co/docs/transformers/main/en/training
Using text classification as the example task, this part explains how to fine-tune a pretrained model with transformers.
Since I mainly work with PyTorch, this post only covers fine-tuning with transformers.Trainer (docs: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) and with native PyTorch.
Because the code in the tutorial is scattered across the page, I provide a complete script at the end of each of these two parts.
In addition, ① I need to use my own datasets, and ② my server cannot easily go through a proxy, which makes the datasets library inconvenient to use. So this post also spends some space on how to achieve the same functionality without datasets (while still covering the datasets usage mentioned in this part of the documentation).

Also note: some of the code was run in a Jupyter notebook and some as scripts, and the environment changed in between, so the outputs shown may not all come from the same setup.

A Python environment in which the code in this post works: Python 3.8, PyTorch 1.8.1, cudatoolkit 10.2, transformers 4.18.0, datasets 2, scikit-learn 1.0.2.
(From what I have seen, other versions should also work; the exact versions matter little.)

Training a pretrained model further on a dataset for a specific task is called fine-tuning.

1. Loading the dataset with the datasets package

I wrote a separate post on how to use the datasets package; see that post: huggingface.datasets usage notes.

In the end, a small portion of the Yelp Reviews dataset (dataset = load_dataset("yelp_review_full")) is used as the dataset for this experiment:
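For reference, this is how the dataset is loaded from the Hub (it requires access to huggingface.co; in my full scripts below I load a local copy with datasets.load_from_disk instead):

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")  # has "train" and "test" splits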

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mypath/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"],padding="max_length",truncation=True,max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)



small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

2. Fine-tuning with Trainer (using PyTorch as the backend)


2.1 Defining the classification model

This dataset has 5 label classes.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("mypath/bert-base-cased", num_labels=5)

Output:

Some weights of the model checkpoint at mypath/bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at mypath/bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

(For an explanation of this output, see my earlier CSDN post: Some weights of the model checkpoint at mypath/bert-base-chinese were not used when initializing Ber…)

The same code can also be written like this:

from transformers import AutoConfig,AutoModelForSequenceClassification

model_path="mypath/bert-base-cased"
config=AutoConfig.from_pretrained(model_path,num_labels=5)
model=AutoModelForSequenceClassification.from_pretrained(model_path,config=config)

2.2 Training hyperparameters

The TrainingArguments class (docs: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) holds all tunable hyperparameters and training settings. This tutorial uses the default hyperparameters.

Define where checkpoints are stored:

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")
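The tutorial sticks to the defaults, but this is also where you would override common hyperparameters if you wanted to. The sketch below just spells out a few of the defaults explicitly (these values are not something the tutorial changes):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_trainer",
    num_train_epochs=3,               # default: 3
    per_device_train_batch_size=8,    # default: 8
    learning_rate=5e-5,               # default: 5e-5
    weight_decay=0.0,                 # default: 0.0
)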

2.3 Metrics

Trainer does not evaluate the model automatically, so you need to pass it a function that computes and reports metrics.
More on metrics: https://huggingface.co/docs/datasets/metrics.html

The accuracy metric also has an official page on the Hugging Face Hub.

Load the accuracy metric:

import datasets
import numpy as np

metric=datasets.load_metric("accuracy")

Calling compute() on the metric computes the accuracy of the predictions (derived from the logits in the model output).

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

If you want to monitor metrics during fine-tuning, set the evaluation_strategy hyperparameter in TrainingArguments so that the metric on the evaluation set is reported at the end of every epoch:

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

2.4 Trainer

Create the Trainer object:

from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Start training:

trainer.train()

Output when run as a script (you can already see here that the text column is not passed to the model):

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
myenv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 96
  0%|                                                                                  | 0/96 [00:00<?, ?it/s]myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 33%|████████████████████████▎                                                | 32/96 [00:19<00:23,  2.73it/s]The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.219325304031372, 'eval_accuracy': 0.487, 'eval_runtime': 5.219, 'eval_samples_per_second': 191.609, 'eval_steps_per_second': 6.131, 'epoch': 1.0}                                                           
 33%|████████████████████████▎                                                | 32/96 [00:24<00:23,  2.73it/smyenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 67%|████████████████████████████████████████████████▋                        | 64/96 [00:37<00:11,  2.87it/s]The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.0443027019500732, 'eval_accuracy': 0.57, 'eval_runtime': 5.1937, 'eval_samples_per_second': 192.539, 'eval_steps_per_second': 6.161, 'epoch': 2.0}                                                          
 67%|████████████████████████████████████████████████▋                        | 64/96 [00:42<00:11,  2.87it/smyenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [00:55<00:00,  2.87it/s]The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 0.9776290655136108, 'eval_accuracy': 0.598, 'eval_runtime': 5.2137, 'eval_samples_per_second': 191.803, 'eval_steps_per_second': 6.138, 'epoch': 3.0}                                                         
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [01:00<00:00,  2.87it/s]
                                                                                                              
Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 60.8009, 'train_samples_per_second': 49.341, 'train_steps_per_second': 1.579, 'train_loss': 1.0931960741678874, 'epoch': 3.0}
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [01:00<00:00,  1.58it/s]

The output in a Jupyter notebook, which looks a bit cleaner than the script output:

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
myenv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 96
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '

(Screenshot: per-epoch training progress and evaluation metrics as rendered in the notebook)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)


TrainOutput(global_step=96, training_loss=1.1009167830149333, metrics={'train_runtime': 60.9212, 'train_samples_per_second': 49.244, 'train_steps_per_second': 1.576, 'total_flos': 789354427392000.0, 'train_loss': 1.1009167830149333, 'epoch': 3.0})



Since I also ran the code on Colab for debugging, here is the Colab output as well (I used a GPU there too, yet it was much slower than on my local machine, and I am not sure why; I have 4 GPUs locally, but the slowdown is clearly more than 4×):

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375

(Screenshot: training progress and per-epoch metrics on Colab)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)


TrainOutput(global_step=375, training_loss=1.2140440266927084, metrics={'train_runtime': 780.671, 'train_samples_per_second': 3.843, 'train_steps_per_second': 0.48, 'total_flos': 789354427392000.0, 'train_loss': 1.2140440266927084, 'epoch': 3.0})

(Note also the torch.nn.parallel warning: it does not appear on Colab. I suspect it is either because Colab has only one GPU or because of the PyTorch version (I use PyTorch 1.8.1 locally, Colab has PyTorch 1.10). This is hard to verify, so it is only a guess.)

2.5 Complete script

import datasets
import numpy as np
from transformers import AutoTokenizer,AutoModelForSequenceClassification,TrainingArguments,Trainer

dataset=datasets.load_from_disk("datasets/yelp_full_review_disk")

tokenizer = AutoTokenizer.from_pretrained("pretrained_models/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"],padding="max_length",truncation=True,max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

model = AutoModelForSequenceClassification.from_pretrained("pretrained_models/bert-base-cased",
                                                            num_labels=5)

training_args = TrainingArguments(output_dir="pt_save_pretrained",evaluation_strategy="epoch")

metric=datasets.load_metric('datasets/accuracy.py')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

3. Fine-tuning with native PyTorch

Trainer is convenient, but it does a lot of things behind your back and is hard to debug, so sometimes it is easier to just write the loop in native PyTorch.

For background on this part, see my earlier CSDN post of study notes on Deep Learning with PyTorch: A 60 Minute Blitz.

One training loop:
feed the training data to the model to get predictions → compute the loss → compute the gradients → update the parameters → feed the training data to the model again to get new predictions
(Figure: training loop diagram from the original tutorial)

If you keep running in the same notebook after the previous code, it is a good idea to delete the earlier model and Trainer and clear the CUDA cache first to free memory, or simply restart the notebook:

import torch

del model
del trainer
torch.cuda.empty_cache()

3.1 Dataset

Preprocess the dataset (how to build the required dataset from native Python objects is covered below):

from torch.utils.data import DataLoader

tokenized_datasets = tokenized_datasets.remove_columns(["text"])
#drop the text column, which the model does not use

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
#rename the label column to labels, the argument name expected by the model's forward()
#(Trainer got away with a column named label because its default data collator renames it to labels automatically)

tokenized_datasets.set_format("torch")  #convert the values to torch.Tensor objects

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
#sample a small subset so the tutorial finishes quickly

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
#wrap the datasets in DataLoaders; each batch is a dict mapping argument names to tensors, which is later passed to the model via **batch
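To see what actually gets unpacked into the model via **batch, you can peek at the first batch; a quick check along these lines (the example shapes assume the padding settings used above):

batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})
# e.g. {'labels': torch.Size([8]), 'input_ids': torch.Size([8, 512]),
#       'token_type_ids': torch.Size([8, 512]), 'attention_mask': torch.Size([8, 512])}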

Using your own dataset:

The example data dict is built like this:

example_dict={'labels':dataset['train']['label'],'text':dataset['train']['text']}

Inspect the data:

print(type(example_dict['labels']))
print(example_dict['labels'][12345])
print(type(example_dict['text']))
print(example_dict['text'][12345])

Output:

<class 'list'>
2
<class 'list'>
I went here in search of a crepe with Nutella and I got a really good crepe. I wouldn't exactly say this place is authentic French because you've got Americans cooking the food,  but my crepe was still good. \n\nIt doesn't taste like the ones I had in France, Carmon's puts a twist on (or maybe it was just overcooked) theirs by making the crepe more firm. \n\nThe whipped cream was also made fresh and delightful. The prices were horrid though.\n\nCrepes don't cost that much to make, so they're clearly overpricing here. Price is the only reason I won't come back so often.

① Using torch's Dataset and DataLoader classes (which ends up with essentially the same thing as the datasets.Dataset pipeline above):

import torch
from torch.utils.data import Dataset,DataLoader

#define the Dataset
class YelpDataset(Dataset):
    def __init__(self,dict_data) -> None:
        """
        dict_data: data as a dict; the key labels maps to a list of label values (ints),
        the key text maps to a list of texts
        """
        super(YelpDataset,self).__init__()

        self.data=dict_data
    
    def __getitem__(self, index):
        return [self.data['text'][index],self.data['labels'][index]]
        #return a list: the first element is the text, the second is the label
    
    def __len__(self):
        return len(self.data['text'])

#define the collate function
def collate_fn(batch):
    pt_batch=tokenizer([b[0] for b in batch],padding=True,truncation=True,max_length=512,
                        return_tensors='pt')
    labels=torch.tensor([b[1] for b in batch])
    return {'labels':labels,'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask']}

train_data=YelpDataset(example_dict)

train_dataloader=DataLoader(train_data,batch_size=8,shuffle=True,collate_fn=collate_fn)

② Writing the batching by hand:
In each training loop, iterate over batches like this (most variables should be self-explanatory from their names, so I will not describe them in detail; the omitted training step is sketched right after this block):

#training part
#(the evaluation part works the same way)
batch_size=8  #example value, matching the batch size used elsewhere in this post
train_data_length=len(example_dict['labels'])

if train_data_length%batch_size==0:
    batch_num=int(train_data_length/batch_size)
else:
    batch_num=int(train_data_length/batch_size)+1

for b in range(batch_num):
    index_begin=b*batch_size
    index_end=min(train_data_length,index_begin+batch_size)

    this_batch_text=example_dict['text'][index_begin:index_end]
    this_batch_labels=example_dict['labels'][index_begin:index_end]

    pt_batch=tokenizer(this_batch_text,padding=True,truncation=True,max_length=512,return_tensors='pt')
    #I will not bother unpacking pt_batch key by key; the training step that goes here is
    #analogous to the DataLoader version and is therefore omitted
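For completeness, the omitted training step inside that loop might look roughly like the following. This is my own filler, not code from the tutorial, and it assumes the model, optimizer, lr_scheduler and device defined in sections 3.2–3.4:

    # (still inside the for-loop over batches)
    pt_batch = {k: v.to(device) for k, v in pt_batch.items()}
    labels = torch.tensor(this_batch_labels).to(device)

    outputs = model(**pt_batch, labels=labels)  # the loss is computed internally when labels are given
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()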

3.2 The neural network model

Define the classification model:

from transformers import AutoModelForSequenceClassification

model=AutoModelForSequenceClassification.from_pretrained("mypath/bert-base-cased",
                                                        num_labels=5)

3.3 Optimizer and learning rate scheduler

As seen above, transformers' Trainer uses transformers' own AdamW optimizer by default, which triggers this warning:
FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning

So do not use the old AdamW any more; use PyTorch's official AdamW optimizer instead:

from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

Create the same default learning rate scheduler that Trainer uses:

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

3.4 Device

Specify the device (single-GPU case) and move the model onto it:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

3.5 Training Loop

tqdm website: https://tqdm.github.io/

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Output:
(Screenshot: tqdm progress bar for the training loop)
In real code you would also add things such as early stopping and saving the model with the best validation metric; a sketch is given below.
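As a rough sketch of those two features (my own addition, not part of the tutorial), early stopping plus keeping the best checkpoint could be wired around the loop above like this; the patience value and file name are arbitrary, and it reuses the variables defined in sections 3.1–3.4:

import copy
import torch

def evaluate_accuracy(model, dataloader, device):
    """Compute accuracy on a dataloader; mirrors the evaluation code in section 3.6."""
    model.eval()
    correct, total = 0, 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            logits = model(**batch).logits
        predictions = torch.argmax(logits, dim=-1)
        correct += (predictions == batch["labels"]).sum().item()
        total += batch["labels"].size(0)
    return correct / total

best_acc = 0.0
patience, bad_epochs = 2, 0          # stop after 2 epochs without improvement
best_state = None

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    acc = evaluate_accuracy(model, eval_dataloader, device)
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0
        best_state = copy.deepcopy(model.state_dict())   # keep the best weights in memory
        torch.save(best_state, "best_model.pt")          # and persist them to disk
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break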

3.6 Metrics

As with Trainer, metrics are computed with the datasets package's Metric.
Here the evaluation runs after training finishes, accumulating all batches with Metric's add_batch() method (docs: https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=add_batch#datasets.Metric.add_batch).

from datasets import load_metric

metric = load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

Output: {'accuracy': 0.588}

3.7 Complete script

from tqdm.auto import tqdm

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

import datasets
from transformers import AutoTokenizer,AutoModelForSequenceClassification,get_scheduler

dataset=datasets.load_from_disk("datasets/yelp_full_review_disk")

tokenizer = AutoTokenizer.from_pretrained("pretrained_models/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",truncation=True,max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

#Postprocess dataset
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
#drop the text column, which the model does not use

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
#rename the label column to labels, the argument name expected by the model's forward()

tokenized_datasets.set_format("torch")  #convert the values to torch.Tensor objects

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

model=AutoModelForSequenceClassification.from_pretrained("pretrained_models/bert-base-cased",
                                                         num_labels=5)

optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

metric=datasets.load_metric('datasets/accuracy.py')
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print(metric.compute())

4. Other learning resources given in the tutorial

  1. 🤗 Transformers Examples: I plan to write study-note posts about these.
  2. 🤗 Transformers Notebooks: I may write study notes about these as well.