HuggingFace Transformers Tutorial: A Complete Walkthrough of Text Processing with TensorFlow



Project: notebooks (Notebooks using the Hugging Face libraries 🤗). Repository: https://gitcode.com/gh_mirrors/note/notebooks

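Introduction

In natural language processing (NLP), HuggingFace's Transformers library has become the de facto standard tool. This article explains how to use Transformers together with TensorFlow to run a complete text processing pipeline, from basic tokenization through to model inference.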

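Environment Setup

First, install the required Python libraries:

pip install datasets evaluate "transformers[sentencepiece]"

This command installs three core components:

transformers: HuggingFace's core NLP library (the sentencepiece extra adds SentencePiece tokenizer support)
datasets: for loading and processing datasets
evaluate: for model evaluation

Note that this command does not install TensorFlow itself. The inference example at the end of this article assumes TensorFlow is already available in the environment.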

Tokenizer Basics

Initializing the Tokenizer

The first step in using a pretrained model is to load its matching tokenizer:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

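Here we use a DistilBERT model that was pretrained on lowercased English text and then fine-tuned on the SST-2 sentiment analysis task.

Tokenizing a Single Sentence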

sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)

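The tokenizer converts the text into a numeric form the model can consume (token IDs) and automatically adds special tokens such as [CLS] and [SEP].

Processing Multiple Sentences

A Transformers tokenizer can process several sentences in one call: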

sequences = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "So have I!"
]
model_inputs = tokenizer(sequences)

Advanced Tokenization Options

Padding

When processing a batch, sentences of different lengths must be padded to a common length:

# Pad to the longest sentence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Pad to the model's maximum length (e.g. 512 for BERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad to an explicitly specified maximum length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
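To make the effect of padding concrete, here is a minimal pure-Python sketch of what padding to the longest sequence amounts to. The helper `pad_batch` and the toy ID lists are hypothetical, and the pad ID of 0 is an assumption that matches BERT-style vocabularies. The real tokenizer does this work internally and also returns the attention mask, which tells the model which positions are padding.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one (hypothetical helper)."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy ID lists of different lengths, for illustration only.
batch = [[101, 2061, 2031, 1045, 999, 102], [101, 7592, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

After padding, every row has the same length, so the batch can be converted into a rectangular tensor.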

Truncation

Text that exceeds the length limit must be truncated:

# Truncate to the model's maximum length
model_inputs = tokenizer(sequences, truncation=True)

# Truncate to an explicitly specified maximum length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
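As a rough illustration of what truncation to `max_length=8` means for a single sentence, the sketch below shortens a list of token IDs while keeping the opening [CLS] and closing [SEP] tokens. The helper `truncate` is hypothetical and simplified; it is not the library's implementation. The IDs are the ones the tokenizer produces for the first example sentence, as printed in the tokenization walkthrough of this article.

```python
def truncate(ids, max_length):
    """Keep at most max_length IDs, preserving the first ([CLS]) and last ([SEP]).

    Simplified sketch: assumes max_length >= 2 and a single-sentence input.
    """
    if len(ids) <= max_length:
        return ids
    # Keep [CLS] plus the leading content tokens, then re-attach [SEP].
    return ids[: max_length - 1] + ids[-1:]


ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
short = truncate(ids, max_length=8)
print(short)  # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
```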

Return Tensor Types

You can choose the type of tensor returned to suit different frameworks:

# PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

A Closer Look at Tokenization

Let's look at what happens during tokenization step by step:

sequence = "I've been waiting for a HuggingFace course my whole life."

# Full tokenization pipeline
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])
# Output: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

# Step-by-step processing
tokens = tokenizer.tokenize(sequence)  # split the text into tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to IDs
print(ids)
# Output: [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

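Note that the full pipeline produces two more tokens than the manual steps (101 and 102). They are the special tokens [CLS] and [SEP], respectively.

Decoding

Token IDs can be decoded back into text: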
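The relationship between the two outputs can be checked with a few lines of plain Python. This is only a sketch: the constants `CLS_ID` and `SEP_ID` are hard-coded from the output shown above, whereas with a loaded tokenizer one would normally read them from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
# IDs copied from the outputs shown above.
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] for this checkpoint's vocabulary

ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]
full = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
        17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

# Wrapping the manually produced IDs reproduces the full pipeline's result.
with_special = [CLS_ID] + ids + [SEP_ID]
print(with_special == full)  # True
```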

print(tokenizer.decode(model_inputs["input_ids"]))
# Output: [CLS] i've been waiting for a huggingface course my whole life. [SEP]

print(tokenizer.decode(ids))
# Output: i've been waiting for a huggingface course my whole life.
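In the decoded text, "huggingface" appears as a single word even though it occupies two IDs (17662 and 12172). WordPiece tokenizers such as the one used by DistilBERT mark word continuations with a `##` prefix, and decoding glues those pieces back together. The sketch below illustrates only that merging step; the helper `merge_wordpieces` and the short token list are hypothetical, and the real decoder additionally cleans up spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, attaching '##' continuation pieces to the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: append to the previous word without a space.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


# Illustrative token strings (assumed, not printed by the original article).
tokens = ["a", "hugging", "##face", "course"]
print(merge_wordpieces(tokens))  # a huggingface course
```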

Complete Model Inference

Putting it all together with TensorFlow for end-to-end text classification:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Load the model and tokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

# Prepare the input data
sequences = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "So have I!"
]

# Tokenize the inputs
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")

# Run model inference
output = model(**tokens)
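The model returns raw scores (logits) in `output.logits`, not probabilities. A common next step is to apply a softmax and pick the most likely class; in TensorFlow this would typically be `tf.nn.softmax(output.logits, axis=-1)`. The sketch below shows the same computation in plain Python so it runs without downloading the model. The logit values are made up for illustration, and the label mapping is an assumption; in practice, read it from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sentences (not actual model output).
logits = [[-1.5, 1.6], [2.0, -1.8]]
# Assumed label mapping; use model.config.id2label with a real model.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    predictions.append(id2label[best])
    print(id2label[best], round(probs[best], 3))
```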