[NLP | Transfer Learning-04] Introduction to the Transformers Library and Using pipeline

Latest recommended article published on 2025-04-25 14:36:56

爱学习不掉头发

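Categories: Natural Language Processing (NLP), Deep Learning. Tags: natural language processing, transfer learning, artificial intelligence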

Article link: https://blog.youkuaiyun.com/weixin_51385258/article/details/144327240

Table of Contents

1 Introduction to the Transformers Library
2 Using pipeline for Different Tasks

1 Introduction to the Transformers Library

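1.1 Overview of the Transformers Library

Transformers is an open-source library of pretrained language models built on the transformer architecture.

Transformers provides a large number of state-of-the-art pretrained language models for NLP, together with a framework for calling them.
For example, it ships many SOTA pretrained models such as BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, and CTRL.
It supports PyTorch and TensorFlow 2.0, and models can be converted between the two frameworks.

Hugging Face is a startup headquartered in New York that focuses on natural language processing, artificial intelligence, and distributed systems.

Its chatbot technology has long been popular, but the company is better known for its contributions to the open-source NLP community.
Hugging Face has consistently worked to democratize NLP, hoping that everyone can use the most advanced (SOTA, state-of-the-art) NLP techniques instead of being held back by a lack of training resources.
Within the Hugging Face open-source community, the pretrained model library Transformers, open-sourced on GitHub, has become an industry bellwether.
The Hugging Face community is available at https://huggingface.co/
- Click the Models link to browse and download pretrained models.
- Click the Datasets link to browse and download datasets.
- Click the Docs link to read the programming documentation for the pretrained models, which is very convenient.
SOTA (state-of-the-art) refers to the algorithm or technique that currently performs best on a given task.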

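1.2 The Three-Layer Usage Structure of the Transformers Library

Pipeline approach: a highly integrated, minimal way of using the library; a few lines of code are enough to run an NLP task.
- For example: the API functions are heavily wrapped and simple to use, so beginners can get started right away.
Auto model (AutoModel) approach: loads and uses models from the BERTology family.
- For example: models are called by task type, such as classification or summarization, and the input and output formats are already fixed for each task type.
Specific model (SpecificModel) approach: you must name the exact model and call it with the parameters specific to that BERTology model. This approach is more complex but offers more flexibility.
- For example: working directly with the BERT model, which suits experienced practitioners.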
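The three layers can be sketched side by side as follows. This is only a rough illustration, assuming a local copy of bert-base-chinese at ./bert-base-chinese and a PyTorch backend; the function names are made up for this sketch, and the imports sit inside each function so that defining them does not load the library.

```python
MODEL_DIR = "./bert-base-chinese"   # assumption: the model was cloned to this local path
TEXT = "我想明天去[MASK]家吃饭。"


def run_pipeline():
    """Layer 1: pipeline hides tokenization, the model call and decoding."""
    from transformers import pipeline
    fill_mask = pipeline(task="fill-mask", model=MODEL_DIR)
    return fill_mask(TEXT)              # list of candidate fillings with scores


def run_auto_model():
    """Layer 2: an Auto class picks the architecture for a given task type."""
    from transformers import AutoModelForMaskedLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_DIR)
    inputs = tokenizer(TEXT, return_tensors="pt")
    return model(**inputs).logits.shape  # (batch, sequence length, vocabulary size)


def run_specific_model():
    """Layer 3: the concrete BERT classes are named explicitly."""
    from transformers import BertModel, BertTokenizer
    tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
    model = BertModel.from_pretrained(MODEL_DIR)
    inputs = tokenizer(TEXT, return_tensors="pt")
    return model(**inputs).last_hidden_state.shape  # (batch, sequence length, hidden size)
```

Calling any of the three functions requires transformers, PyTorch and the model files to be present. The lower the layer, the more of the pre- and post-processing you write yourself.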

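1.3 Installing the Transformers Library

Download and install the transformers library and the datasets library

# Before cloning, check which directory you are in, for example $HOME/nlpdev/
# Clone the Hugging Face transformers repository
git clone https://github.com/huggingface/transformers.git

# Enter the transformers directory
cd transformers

# Check out the specified version of transformers
git checkout v4.19.0

# Install the transformers package
pip install .

Install the datasets library
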
# First run `conda env list` to see the available Python virtual environments, then switch to the one you want with `conda activate xxx`
pip install datasets
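After installing, it can help to confirm which versions ended up in the active environment. The following is a small sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("transformers", "datasets"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If the checkout above succeeded, the transformers line should report the version that was checked out.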

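2 Using pipeline for Different Tasks

What the pipeline tool does: it provides a simple and efficient way to run a variety of natural language processing tasks.

2.1 Text (Sentiment) Classification

Text classification means that the model assigns a category to a text based on its content, for example classifying sentiment or classifying products.
Text classification models are usually obtained through supervised training. The concrete categories depend on the sample labels used during training.

Import the pipeline tool from the transformers library
Instantiate an object with pipeline
- task - specifies the task type
- model - specifies the model path

Pass the text to the model and get the output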

import torch
from transformers import pipeline
import numpy as np

def dm01_test_classification():
    # 1 Use the Chinese pretrained model chinese_sentiment
    # Download it with: git clone https://huggingface.co/techthiyanes/chinese_sentiment

    # 2 Instantiate the pipeline object
    # my_model = pipeline(task='', model='')
    # The path must point to a model fine-tuned for classification;
    # bert-base-chinese alone has no trained classification head
    my_model = pipeline(task='sentiment-analysis', model='./chinese_sentiment')
    print('my_model-->', my_model)
    
    # 3 Pass the text to the model for classification
    output = my_model('我爱北京天安门，天安门上太阳升。')
    print('output-->', output)

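Note: the pipeline function can download a pretrained model from the Hugging Face hub automatically, or load a pretrained model from a local path.
A single task can be solved with several different models.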
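To make the role of the training labels concrete, here is a minimal sketch of what a classification head does with its raw scores: softmax turns them into probabilities, and the best index is mapped to a label name. The logits and the `id2label` mapping below are hypothetical values chosen for illustration, not output from the model above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, id2label):
    """Return the best label and its probability as a dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {'label': id2label[best], 'score': probs[best]}


id2label = {0: 'negative', 1: 'positive'}   # hypothetical label set fixed at training time
result = classify([-1.0, 2.0], id2label)    # hypothetical logits for one sentence
print(result)
```

The label names come entirely from the training data, which is why two sentiment models can return different label strings for the same text.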

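2.2 Feature Extraction

The feature-extraction task returns only the features computed from the text, which is the core job of a pretrained model. Its output is meant to be used together with other models.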

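def dm02_test_feature_extraction():
    # 1 Use the Chinese pretrained model bert-base-chinese
    # 2 Instantiate the pipeline object
    model = pipeline(task="feature-extraction", model="./bert-base-chinese")
    # 3 Pass the text to the model to extract features
    output = model("莫道桑榆晚,为霞尚满天")

    # The feature-extraction output is a nested list
    # Its shape is (1, 13, 768): one sample, sequence length 13, 768 features per token
    # The input sentence has 11 characters; the sequence length is 13 because a start marker and an end marker are added during encoding
    # Start marker: [CLS], id 101 in the vocabulary
    # End marker: [SEP], id 102 in the vocabulary
    # After the markers are added, the text is converted to ids and the markers become their ids as well
    # The ids are then fed to the model to extract sentence features, which are later combined with other models
    print("output-->", type(output), np.array(output).shape)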

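For text passed to the model, start and end markers are added first, and the text is then converted to numeric ids.
Output without a task head: feature extraction has no task head. For the 11-character input, bert-base-chinese returns one 768-dimensional feature vector per token (13 vectors once [CLS] and [SEP] are counted).
Output with a task head: tasks with a specified type, such as text classification or cloze, have a task head and produce different results depending on the task type.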
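The length bookkeeping can be reproduced without loading any model. The sketch below is a toy character-level encoder, not the real BERT tokenizer: only the [CLS] and [SEP] ids (101 and 102) come from the text above, while the per-character ids and the fallback id are made up for illustration.

```python
CLS_ID = 101   # id of [CLS], as stated above
SEP_ID = 102   # id of [SEP], as stated above


def encode_chars(text, vocab, unk_id=100):
    """Wrap character ids with the start and end markers.

    unk_id is an assumed fallback id for characters missing from the vocabulary.
    """
    ids = [CLS_ID]
    ids.extend(vocab.get(ch, unk_id) for ch in text)
    ids.append(SEP_ID)
    return ids


sentence = "莫道桑榆晚,为霞尚满天"
# Hypothetical toy vocabulary: these ids are NOT the real bert-base-chinese ids
toy_vocab = {ch: 1000 + i for i, ch in enumerate(dict.fromkeys(sentence))}

ids = encode_chars(sentence, toy_vocab)
print(len(sentence), len(ids))   # 11 characters become 13 ids
```

With a real model, each of those 13 positions would be mapped to a 768-dimensional vector, which gives the (1, 13, 768) shape printed earlier.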

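2.3 Cloze (Fill-Mask)

The cloze task is also called "masked language modeling"; it is one of the sub-tasks used when training BERT. Below we run a cloze example in Chinese.

One character in the sentence is replaced with the [MASK] token to produce a training sample; the target value y for that sentence is the character hidden by [MASK].
The BERT model tokenizes Chinese text character by character.
When the text is converted to tensors, [MASK] is converted to the id 103.

input = '我想明天去[MASK]家吃饭。'
For this sample:

x - 我想明天去[MASK]家吃饭。
y - 他 (assuming this is the correct answer)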
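Building such an (x, y) pair from a complete sentence takes only a few lines. This is a minimal sketch of the sample construction described above, with a made-up helper name; real BERT pretraining picks positions at random and applies additional replacement rules that are left out here.

```python
MASK_TOKEN = "[MASK]"


def make_masked_sample(sentence: str, position: int):
    """Replace the character at `position` with [MASK]; return (x, y)."""
    if not 0 <= position < len(sentence):
        raise IndexError("position is outside the sentence")
    target = sentence[position]
    masked = sentence[:position] + MASK_TOKEN + sentence[position + 1:]
    return masked, target


x, y = make_masked_sample("我想明天去他家吃饭。", 5)
print(x)   # the model input
print(y)   # the label the model should predict
```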

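# Cloze task, implementation outline for dm03_test_fill_mask():
# 1 Use the Chinese pretrained model chinese-bert-wwm
# Download it with: git clone https://huggingface.co/hfl/chinese-bert-wwm  (whole-word-masking model)
# 2 Instantiate the pipeline object
# my_model = pipeline(task='', model='')
# 3 Pass the text to the model to fill in the masked position
# output = my_model('xxxx')
def dm03_test_fill_mask():
    # 1 Instantiate the pipeline with the model downloaded above
    mymodel = pipeline(task='fill-mask', model='./chinese-bert-wwm')

    # 2 Prepare the data and feed it to the model
    text = '我想明天去[MASK]家吃饭。'
    output = mymodel(text)

    # 3 Print the output, then each candidate on its own line
    print('output-->', output)
    for i in output:
        print(i)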

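2.4 Reading Comprehension

Reading comprehension is also called "extractive question answering": given a passage of text and a question, the model returns the span of the passage that answers the question.
- Model input parameter: context - the passage (a text sequence)
- Model input parameter: question - a question, or a list of questions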

# 4 Reading comprehension (extractive QA), implementation outline for dm04_test_question_answering():
# 1 Use the Chinese pretrained model chinese_pretrain_mrc_roberta_wwm_ext_large
# Download it with: git clone https://huggingface.co/luhua/chinese_pretrain_mrc_roberta_wwm_ext_large
# 2 Instantiate the pipeline object
# my_model = pipeline('question-answering', model='./chinese_pretrain_mrc_roberta_wwm_ext_large')
# 3 Pass the context and questions to the model
# output = model(context=context, question=questions)
def dm04_test_question_answering():

    # Context and questions
    context = '我叫张三，我是一个程序员，我的喜好是打篮球。'
    questions = ['我是谁？', '我是做什么的？', '我的爱好是什么？']

    # 1 Download the model: git clone https://huggingface.co/luhua/chinese_pretrain_mrc_roberta_wwm_ext_large

    # 2 Instantiate the pipeline, which returns the model
    # A model fine-tuned for question answering is required;
    # bert-base-chinese alone has no trained QA head
    model = pipeline('question-answering', model='./chinese_pretrain_mrc_roberta_wwm_ext_large')

    # 3 Pass the data to the model and print the predictions
    print(model(context=context, question=questions))

    # Output
    '''
    [{'score': 1.2071758523357623e-12, 'start': 2, 'end': 4, 'answer': '张三'},
     {'score': 2.60890374192968e-06, 'start': 9, 'end': 12, 'answer': '程序员'},
     {'score': 4.1686924134864967e-08, 'start': 18, 'end': 21, 'answer': '打篮球'}]
    '''
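The word "extractive" can be checked directly against the output above: each answer is just the slice of the context between start and end. The sketch below copies the context and the start/end/answer fields from that output (scores omitted); the helper name is made up for illustration.

```python
context = '我叫张三，我是一个程序员，我的喜好是打篮球。'

# start/end/answer values copied from the output shown above
predictions = [
    {'start': 2, 'end': 4, 'answer': '张三'},
    {'start': 9, 'end': 12, 'answer': '程序员'},
    {'start': 18, 'end': 21, 'answer': '打篮球'},
]


def span_text(context: str, start: int, end: int) -> str:
    """Return the characters of the context in the half-open range [start, end)."""
    return context[start:end]


for p in predictions:
    extracted = span_text(context, p['start'], p['end'])
    print(extracted, extracted == p['answer'])
```

Because the answer is always a slice of the context, an extractive model cannot answer a question whose answer does not literally appear in the passage.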

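2.5 Text Summarization

Summarization: given a passage of text, output a short, condensed version of it.
Summarization methods fall into two categories:
- Abstractive summarization
- Extractive summarization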

# 5 Text summarization, implementation outline for dm05_test_summarization():
# 1 Use the pretrained model distilbart-cnn-12-6 (an English summarization model)
# Download it with: git clone https://huggingface.co/sshleifer/distilbart-cnn-12-6
# 2 Instantiate the pipeline object, which returns the model
# my_model = pipeline(task='', model='')
# 3 Pass the text to the model to generate a summary
# output = my_model('xxxx')
def dm05_test_summarization():

    # 1 Download the model: git clone https://huggingface.co/sshleifer/distilbart-cnn-12-6

    # 2 Instantiate the pipeline, which returns the model
    # A sequence-to-sequence model is required; bert-base-chinese cannot generate summaries
    my_model = pipeline(task='summarization', model="./distilbart-cnn-12-6")

    # 3 Prepare the text and pass it to the model
    text = "BERT is a transformers model pretrained on a large corpus of English data " \
           "in a self-supervised fashion. This means it was pretrained on the raw texts " \
           "only, with no humans labelling them in any way (which is why it can use lots " \
           "of publicly available data) with an automatic process to generate inputs and " \
           "labels from those texts. More precisely, it was pretrained with two objectives:Masked " \
           "language modeling (MLM): taking a sentence, the model randomly masks 15% of the " \
           "words in the input then run the entire masked sentence through the model and has " \
           "to predict the masked words. This is different from traditional recurrent neural " \
           "networks (RNNs) that usually see the words one after the other, or from autoregressive " \
           "models like GPT which internally mask the future tokens. It allows the model to learn " \
           "a bidirectional representation of the sentence.Next sentence prediction (NSP): the models" \
           " concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to " \
           "sentences that were next to each other in the original text, sometimes not. The model then " \
           "has to predict if the two sentences were following each other or not."
    output = my_model(text)

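    # 4 Print the summary
    print('output--->', output)

    # Extractive summarization - picks important sentences directly from the source text
    # Abstractive summarization - understands the source text and generates new sentences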
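To contrast with the abstractive model used above, here is a toy extractive summarizer. It is a crude word-frequency heuristic written for illustration only, not what distilbart or any library does; the function names and the sample text are made up.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def extractive_summary(text, num_sentences=1):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])          # restore the original sentence order
    return ' '.join(sentences[i] for i in keep)


sample = "BERT is a pretrained model. The model learns from raw text. Cats sleep."
print(extractive_summary(sample, 1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of the extractive approach; an abstractive model is free to produce wording that never appears in the source.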

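2.6 Named Entity Recognition (NER)

Named entity recognition (NER) is a fundamental NLP task.
It identifies person names (PER), locations (LOC), organizations (ORG), and other entities (MISC) in text.
- Example: (王 B-PER) (小 I-PER) (明 I-PER) (在 O) (办 B-LOC) (公 I-LOC) (室 I-LOC).
- Here O marks a non-entity token, B marks the beginning of an entity, and I marks the inside of an entity.
NER is essentially a classification task applied to every token (also called sequence labeling). It is the foundation of syntactic analysis, which in turn is central to NLP tasks.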
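Per-token B/I/O tags still have to be merged into whole entities. A minimal decoding sketch for the example above follows; the function name is made up, and an I tag that does not continue the current entity is treated leniently as the start of a new one.

```python
def decode_bio(tokens, tags):
    """Merge per-token BIO tags into (entity_text, entity_type) pairs."""
    entities = []
    current_chars, current_type = [], None

    def flush():
        # Store the entity collected so far, if any
        if current_chars:
            entities.append((''.join(current_chars), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition('-')    # 'B-PER' -> ('B', '-', 'PER'); 'O' -> ('O', '', '')
        if prefix == 'B' or (prefix == 'I' and ent_type != current_type):
            flush()                                 # a new entity starts here
            current_chars, current_type = [token], ent_type
        elif prefix == 'I':
            current_chars.append(token)             # continue the current entity
        else:
            flush()                                 # 'O' closes any open entity
            current_chars, current_type = [], None
    flush()
    return entities


tokens = list('王小明在办公室')
tags = ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
print(decode_bio(tokens, tags))
```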

# 6 NER, implementation outline for dm06_test_ner():
# 1 Use the Chinese pretrained model roberta-base-finetuned-cluener2020-chinese
# Download it with: git clone https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese
# 2 Instantiate the pipeline object, which returns the model
# my_model = pipeline(task='', model='')
# 3 Pass the text to the model to tag each token
# output = my_model('xxxx')
def dm06_test_ner():

    # 1 Download the model: git clone https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese

    # 2 Instantiate the pipeline, which returns the model
    model = pipeline('ner', model='./roberta-base-finetuned-cluener2020-chinese')

    # 3 Pass the data to the model and print the NER result
    print(model('我爱北京天安门，天安门上太阳升。'))

    '''
    [{'entity': 'B-address', 'score': 0.8838121, 'index': 3, 'word': '北', 'start': 2, 'end': 3},
     {'entity': 'I-address', 'score': 0.83543754, 'index': 4, 'word': '京', 'start': 3, 'end': 4},
     {'entity': 'I-address', 'score': 0.4240591, 'index': 5, 'word': '天', 'start': 4, 'end': 5},
     {'entity': 'I-address', 'score': 0.7524443, 'index': 6, 'word': '安', 'start': 5, 'end': 6},
     {'entity': 'I-address', 'score': 0.6949866, 'index': 7, 'word': '门', 'start': 6, 'end': 7},
     {'entity': 'B-address', 'score': 0.65552264, 'index': 9, 'word': '天', 'start': 8, 'end': 9},
     {'entity': 'I-address', 'score': 0.5376768, 'index': 10, 'word': '安', 'start': 9, 'end': 10},
     {'entity': 'I-address', 'score': 0.510813, 'index': 11, 'word': '门', 'start': 10, 'end': 11}]
    '''

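2.7 Summary

Workflow for the pipeline tool
- Download the model
- Load the model with pipeline
- Pass data to the model and get the prediction
Common NLP tasks
- sentiment-analysis - text sentiment classification
- feature-extraction - feature extraction
- fill-mask - cloze (masked token prediction)
- question-answering - reading comprehension
- summarization - text summarization
- ner - named entity recognition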