HuggingFace SetFit Quick Start: Hands-On Few-Shot Text Classification (CSDN blog)

Project: notebooks (Notebooks using the Hugging Face libraries 🤗). Repository: https://gitcode.com/gh_mirrors/note/notebooks

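Introduction

Text classification is a fundamental task in natural language processing. Conventional deep learning approaches typically need large amounts of labeled data to perform well, whereas SetFit is a framework that achieves strong classification performance with only a handful of labeled examples per class. This article walks through its core usage.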

Environment Setup

First, install the base SetFit package:

pip install setfit

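If your machine has an NVIDIA GPU with CUDA support, install a CUDA-enabled PyTorch build to speed up training and inference. The command below targets CUDA 11.8; use the index URL that matches your CUDA version: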

pip install torch --index-url https://download.pytorch.org/whl/cu118
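After installing, it can be useful to confirm which device will actually be used. The following is a minimal sketch; the helper name `detect_device` is an illustrative assumption, and it relies only on `torch.cuda.is_available()`, falling back to the CPU when PyTorch is missing or no GPU is visible.

```python
def detect_device() -> str:
    """Return "cuda" if a CUDA-enabled PyTorch build can see a GPU, else "cpu"."""
    try:
        # Imported lazily so the check also works before PyTorch is installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
print(f"Training would run on: {device}")
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed PyTorch build is probably CPU-only or does not match the local CUDA driver.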

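SetFit Core Concepts

SetFit is an efficient few-shot text classification framework. Its main advantages are:

Good results from only a few labeled examples
Fast training and low inference latency
Built on Sentence Transformer models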
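One reason a few examples go a long way is that SetFit fine-tunes the sentence embedding model contrastively on pairs of sentences, so a small labeled set yields many training pairs. The sketch below is a simplified illustration of that idea, not SetFit's actual sampler; the function name `make_contrastive_pairs` and the toy sentences are assumptions made for this example.

```python
from itertools import combinations


def make_contrastive_pairs(texts, labels):
    """Build (text_a, text_b, target) triples from a small labeled set.

    target is 1.0 when both texts share a label (pull embeddings together)
    and 0.0 otherwise (push them apart).
    """
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        target = 1.0 if labels[i] == labels[j] else 0.0
        pairs.append((texts[i], texts[j], target))
    return pairs


# Toy data, invented for illustration.
texts = ["great film", "loved it", "dull plot", "waste of time"]
labels = [1, 1, 0, 0]

pairs = make_contrastive_pairs(texts, labels)
print(len(pairs))  # 6 unique pairs from 4 sentences
```

With 8 examples for each of two classes (16 sentences), the same enumeration gives 120 unique pairs, which is why the pair-based stage has far more training signal than the raw example count suggests.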

Complete Workflow

1. Model Initialization

Choose a suitable Sentence Transformer model as the base. Here we use BAAI/bge-small-en-v1.5, a compact English embedding model:

from setfit import SetFitModel

model = SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5")

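2. Data Preparation

Loading the dataset

We use the SST-2 sentiment analysis dataset, which contains movie review sentences labeled as negative or positive: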

from datasets import load_dataset

dataset = load_dataset("SetFit/sst2")

Few-shot sampling

In real-world scenarios labeled data is often scarce, so we sample only 8 examples per class:

from setfit import sample_dataset

train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
test_dataset = dataset["test"]
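To make the sampling step concrete, here is a pure-Python sketch of per-class sampling on a list of dictionaries. It is an illustration of the idea only and not the implementation of `sample_dataset`; the helper `sample_per_class` and the toy rows are assumptions for this example.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Pick up to `num_samples` rows for each distinct label, then shuffle."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the few-shot split reproducible
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Classes with fewer rows than requested contribute everything they have.
        sampled.extend(rng.sample(group, min(num_samples, len(group))))
    rng.shuffle(sampled)
    return sampled


# Toy dataset: 100 rows alternating between labels 0 and 1.
toy = [{"text": f"review {i}", "label": i % 2} for i in range(100)]
few_shot = sample_per_class(toy, num_samples=8)
print(len(few_shot))  # 16 rows: 8 per class
```

Fixing the seed matters in few-shot experiments, because results can vary noticeably depending on which 8 examples per class happen to be drawn.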

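Setting the label mapping

Give the model human-readable label names, so that predictions come back as strings rather than integer class ids: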

model.labels = ["negative", "positive"]
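The order of the list matters: the position of each name is taken to correspond to the integer class id in the dataset (0 for negative, 1 for positive in SST-2). The sketch below shows that correspondence in plain Python; `id2label`, `label2id`, and `decode` are illustrative names and not part of the SetFit API.

```python
labels = ["negative", "positive"]

# Position in the list is the integer class id used in the dataset.
id2label = dict(enumerate(labels))
label2id = {name: idx for idx, name in id2label.items()}


def decode(class_ids):
    """Translate integer class ids into their readable names."""
    return [id2label[i] for i in class_ids]


print(decode([1, 0, 1]))  # ['positive', 'negative', 'positive']
```

Swapping the two names in the list would silently invert every prediction, so it is worth checking the order against a few known examples.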

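3. Training Configuration

SetFit training runs in two phases: fine-tuning the embeddings, then training the classification head. Several training arguments, including batch_size and num_epochs, accept either a single value that applies to both phases or a tuple with one value per phase: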

from setfit import TrainingArguments

args = TrainingArguments(
    batch_size=32,
    num_epochs=10,
)
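The single-value-or-tuple convention can be pictured with a small helper. This is a hypothetical sketch of how such an argument might be expanded into per-phase values, assuming the order (embedding phase, classifier phase); `per_phase` is not a SetFit function.

```python
from typing import Tuple, Union


def per_phase(value: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    """Expand a setting into (embedding_phase, classifier_phase) values."""
    if isinstance(value, tuple):
        embedding_value, classifier_value = value
        return embedding_value, classifier_value
    # A single number is shared by both phases.
    return value, value


print(per_phase(32))       # (32, 32)
print(per_phase((1, 16)))  # (1, 16)
```

Under this convention, writing something like batch_size=(32, 16) would set the two phases independently, while batch_size=32 as in the configuration above applies the same value to both.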

4. Training and Evaluation

Initialize the trainer and start training:

from setfit import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()

Evaluate the model on the test set:

metrics = trainer.evaluate(test_dataset)
print(metrics)  # prints something like: {'accuracy': 0.851}
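The reported accuracy is the fraction of test examples whose predicted label matches the reference label. A minimal sketch of that computation, using invented toy predictions rather than real model output:

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the corresponding reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy values, invented for illustration: 3 of 4 predictions are correct.
toy_preds = [1, 0, 1, 1]
toy_gold = [1, 0, 0, 1]
toy_metrics = {"accuracy": accuracy(toy_preds, toy_gold)}
print(toy_metrics)  # {'accuracy': 0.75}
```

Accuracy is a reasonable summary for a roughly balanced two-class task like SST-2; for heavily imbalanced data a metric such as F1 is usually more informative.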

5. Saving and Loading the Model

Save to a local directory:

model.save_pretrained("setfit-bge-small-v1.5-sst2-8-shot")

Or push to the Hugging Face Hub (this requires being logged in to a Hub account):

model.push_to_hub("your-username/setfit-model")

Load the model:

# Load from the Hugging Face Hub
model = SetFitModel.from_pretrained("your-username/setfit-model")
# Or load from a local directory
model = SetFitModel.from_pretrained("local-path")

6. Inference

Use the trained model to make predictions:

preds = model.predict([
    "The movie was absolutely fantastic!",
    "I found the plot quite boring.",
    "An average film with some good moments."
])
print(preds)  # prints something like: ["positive", "negative", "positive"]; exact results depend on the trained model