Quick Start with Hugging Face's Transformers Library (1): Out-of-the-Box pipelines

liu_chengwei

Last modified 2022-10-20 15:49:01

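License: CC 4.0 BY-SA

Tags: natural language processing, pytorch, transformer, machine learning

First published 2022-10-19 17:40:25

Article link: https://blog.youkuaiyun.com/liu_chengwei/article/details/126487671

Note: this tutorial series is for learning purposes only. It is reposted from 小昇's blog with the original author's authorization.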

Contents

Preface
Out-of-the-box pipelines
What happens behind these pipelines?
Summary

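Preface

Transformers is an NLP package developed by Hugging Face that supports loading the vast majority of today's pretrained models. With the rise of large-scale language models such as BERT and GPT, more and more companies and researchers are adopting the Transformers library to build NLP applications, so it is well worth becoming familiar with how to use it.

Note: this tutorial series focuses only on text processing; for multimodal methods, please consult the relevant documentation.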

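Out-of-the-box pipelines

The Transformers library groups current NLP tasks into the following categories:

Text classification: for example, sentiment analysis and sentence-pair relation classification;
Classifying the words in a text: for example, part-of-speech (POS) tagging and named entity recognition (NER);
Text generation: for example, completing a preset template (prompt) or predicting the masked words in a text;
Extracting answers from text: for example, extracting the answer to a given question from a passage;
Generating new sentences from input text: for example, translation and automatic summarization.

The most basic object in the Transformers library is the pipeline() function, which wraps a pretrained model together with its preprocessing and postprocessing steps. We only need to supply the input text to get the expected answer. Commonly used pipelines currently include: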

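feature-extraction (obtain vector representations of text)
fill-mask (fill in masked words or spans)
ner (named entity recognition)
question-answering (question answering)
sentiment-analysis (sentiment analysis)
summarization (automatic summarization)
text-generation (text generation)
translation (machine translation)
zero-shot-classification (zero-shot classification)

Below, we use several common NLP tasks as examples to show how to call these pipelines.

Sentiment analysis

With the sentiment analysis pipeline, we only need to supply the text to get its sentiment label (positive/negative) and the corresponding probability: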

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I've been waiting for a HuggingFace course my whole life.")
print(result)
results = classifier(
  ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
print(results)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)

[{'label': 'POSITIVE', 'score': 0.9598048329353333}]
[{'label': 'POSITIVE', 'score': 0.9598048329353333}, {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
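
Because the pipeline returns a plain list of dictionaries, one per input sentence, the predictions can be consumed with ordinary Python. The sketch below pairs each input with its prediction and keeps only the confident ones. The `results` literal is copied from the output above; the helper name `confident_predictions` and the 0.99 threshold are illustrative assumptions, not part of the library.

```python
# Inputs and output copied from the run above: one dict per input sentence.
texts = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
results = [
    {"label": "POSITIVE", "score": 0.9598048329353333},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def confident_predictions(texts, results, threshold=0.99):
    """Pair each text with its label, keeping predictions scoring >= threshold.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    return [
        (text, pred["label"])
        for text, pred in zip(texts, results)
        if pred["score"] >= threshold
    ]


print(confident_predictions(texts, results))
# → [('I hate this so much!', 'NEGATIVE')]
```

With the default threshold only the second sentence passes, since the first prediction scores about 0.96.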

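A pipeline automatically performs the following three steps:

Preprocess the text into a format the model can understand;
Feed the preprocessed text into the model;
Postprocess the model's predictions into a human-readable format.

The pipeline automatically selects a suitable pretrained model for the task. For sentiment analysis, for example, it defaults to distilbert-base-uncased-finetuned-sst-2-english, a model fine-tuned for English sentiment analysis.

The Transformers library downloads and caches the model when the pipeline object is created. The download only happens the first time a model is loaded; subsequent calls use the cached copy directly.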
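
As a rough illustration of the third step, the sketch below converts raw model scores (logits) into the label/score format shown earlier by applying a softmax and picking the most probable class. The logits, the `id2label` mapping, and the function names are assumptions made for illustration; a real pipeline obtains these from the model and its configuration.

```python
import math

# Hypothetical raw model outputs (logits) for one sentence; real values
# would come from the model in step 2.
logits = [-1.5, 1.6]
# Assumed label order for a two-class sentiment model.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(values):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


print(postprocess(logits, id2label))
```

The larger the gap between the two logits, the closer the reported score gets to 1.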

Zero-shot classification

The zero-shot classification pipeline lets us define our own classification labels without providing any labeled data.

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445973992347717, 0.11197526752948761, 0.043427325785160065]}

As we can see, the pipeline automatically selected the pretrained facebook/bart-large-mnli model for this task.
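
facebook/bart-large-mnli is a natural language inference (NLI) model. As far as we understand, the zero-shot pipeline reuses it by turning each candidate label into a hypothesis such as "This example is education." and comparing how strongly the input text entails each one. The sketch below mimics only that scoring logic in plain Python. The entailment logits are made-up numbers, the template is assumed to be the default, and `build_hypotheses` and `rank_labels` are hypothetical helpers, so treat it as an approximation of the idea rather than the library's implementation.

```python
import math

HYPOTHESIS_TEMPLATE = "This example is {}."  # assumed default template


def build_hypotheses(candidate_labels, template=HYPOTHESIS_TEMPLATE):
    """Build one NLI hypothesis per candidate label."""
    return [template.format(label) for label in candidate_labels]


def rank_labels(candidate_labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = sorted(
        zip(candidate_labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


labels = ["education", "politics", "business"]
print(build_hypotheses(labels))
# Made-up entailment logits, one per hypothesis, standing in for model output.
print(rank_labels(labels, [2.0, -1.0, 0.0]))
```

This also suggests why the returned labels come back sorted by score rather than in the order they were passed in.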

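Text generation

We first construct a template (prompt) according to the task and then feed it to the model to generate the continuation. Note that text generation involves randomness, so each run will generally produce a different result.

This kind of template is called a prefix prompt (Prefix Prompt); for more details, see "Introduction to Prompt Methods" (《Prompt 方法简介》).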

from transformers import pipeline

generator = pipeline("text-generation")
results = generator("In this course, we will teach you how to")
print(results)
results = generator(
    "In this course, we will teach you how to",
    num_return_sequences=2,
    max_length=50
) 
print(results)

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)

[{'generated_text': "In this course, we will teach you how to use data and models that can be applied in any real-world, everyday situation. In most cases, the following will work better than other courses I've offered for an undergrad or student. In order"