Notes on some details of calling HuggingFace

COHREZ

License: CC 4.0 BY-SA

Tags: python, deep learning, artificial intelligence

Article link: https://blog.youkuaiyun.com/qq_39448884/article/details/127396548


A few small details written down for myself. I don't write code every day, so I keep forgetting them.

Read the docs more!!!

Model Input

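Use the tokenizer to split the text into tokens, add special tokens such as [CLS] and [SEP], build the attention mask, and so on.

Mind the difference between calling the tokenizer object directly and calling one of its methods, okay?? I keep bouncing back and forth between tokenizer() and tokenizer.encode() and never learn.

(All examples here use BERT; other pretrained models work the same way.)

Calling the object directly, tokenizer():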

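from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint; the original does not name it
encoded_dict = tokenizer("你是谁？", "我是你妈。")
print(encoded_dict)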

Output:

{'input_ids': [101, 872, 3221, 6443, 8043, 102, 2769, 3221, 872, 1968, 511, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
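To make the three fields concrete, here is a minimal pure-Python sketch that rebuilds them from the token ids of the two sentences. The helper build_pair_encoding is hypothetical and is not the library's implementation; it only mirrors the [CLS] A [SEP] B [SEP] layout visible in the output above, with 101 and 102 read off that output as the [CLS] and [SEP] ids.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP], as seen in the output above


def build_pair_encoding(ids_a, ids_b):
    """Hypothetical helper: lay out a sentence pair as [CLS] A [SEP] B [SEP]."""
    first = [CLS_ID] + ids_a + [SEP_ID]   # segment 0: [CLS] A [SEP]
    second = ids_b + [SEP_ID]             # segment 1: B [SEP]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is attended to
    }


ids_a = [872, 3221, 6443, 8043]        # "你是谁？"
ids_b = [2769, 3221, 872, 1968, 511]   # "我是你妈。"
encoded = build_pair_encoding(ids_a, ids_b)
print(encoded)
```

So token_type_ids marks which sentence each position belongs to, and attention_mask is all ones here only because nothing was padded.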

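Note: when the result is going into a forward pass, just always pass return_tensors="pt" so you don't have to convert the lists to tensors yourself.

Calling the method tokenizer.encode() returns only the input_ids: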
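For context, this is roughly the conversion that return_tensors="pt" saves you from. It is a sketch that assumes PyTorch and transformers are installed. The tiny, randomly initialised BertConfig is an assumption made only so the snippet runs without downloading weights, and vocab_size=21128 is chosen simply to cover the ids in the output above.

```python
import torch
from transformers import BertConfig, BertModel

# Plain Python lists, as returned by tokenizer(...) without return_tensors
encoded_dict = {
    "input_ids": [101, 872, 3221, 6443, 8043, 102, 2769, 3221, 872, 1968, 511, 102],
    "token_type_ids": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}

# Manual conversion: wrap each list in a batch dimension -> shape (1, seq_len)
inputs = {name: torch.tensor([values]) for name, values in encoded_dict.items()}

# Tiny random model (assumption: stands in for a real pretrained checkpoint)
config = BertConfig(vocab_size=21128, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config).eval()

with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"].shape)        # (batch_size, seq_len)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```

With return_tensors="pt" the tokenizer returns tensors that already have the batch dimension, so the dictionary can be passed to the model as is.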

input_ids = tokenizer.encode("你是谁？", "我是你妈。")
print(input_ids)

Output:

[101, 872, 3221, 6443, 8043, 102, 2769, 3221, 872, 1968, 511, 102]

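For details on model input such as single vs. paired inputs and padding/alignment, see the official docs. They cover everything, as if worried I wouldn't know how to use the interface. Much appreciated.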

https://huggingface.co/docs/transformers/v4.23.1/en/glossary#feed-forward-chunking
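As a rough illustration of what padding ("alignment") does to a batch, here is a pure-Python sketch. pad_batch is a hypothetical helper, not a library function, and the pad id 0 is an assumption that matches the usual [PAD] id in BERT vocabularies. In practice the tokenizer does this for you when called with padding=True.

```python
PAD_ID = 0  # assumption: BERT vocabularies typically map [PAD] to id 0


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Hypothetical helper: right-pad every sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding to be ignored
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 872, 3221, 6443, 8043, 102],        # [CLS] 你 是 谁 ？ [SEP]
    [101, 2769, 3221, 872, 1968, 511, 102],   # [CLS] 我 是 你 妈 。 [SEP]
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence gets one trailing pad id, and its attention_mask ends in 0 so the model ignores that position.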

The documentation for the tokenizer base class is just as thorough:

https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizer

Model Forward

BertModel

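Function parameters: [figure]

Return values: [figure]

In practice only last_hidden_state is usually needed; its shape is (batch_size, sequence_length, hidden_size).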

The official example is clear enough, so here is a copy:
https://huggingface.co/docs/transformers/v4.23.1/en/model_doc/bert#transformers.BertModel

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")  # returns PyTorch tensors directly
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state  # take last_hidden_state from the returned output

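Pay attention to pooler_output: it is the representation of [CLS], but after an extra linear layer and tanh, so it is not identical to last_hidden_state[:, 0].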

pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
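The relationship described in that docstring can be checked by hand. This is a sketch that assumes PyTorch and transformers are installed. It uses a tiny, randomly initialised BertModel (an assumption made to avoid downloading weights), so the values are meaningless and only the relationship between the two outputs matters.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config).eval()  # random weights; eval() disables dropout

input_ids = torch.tensor([[1, 5, 6, 7, 2]])  # arbitrary ids; position 0 plays the role of [CLS]

with torch.no_grad():
    outputs = model(input_ids=input_ids)
    cls_hidden = outputs.last_hidden_state[:, 0]  # (batch_size, hidden_size)
    # Recompute the pooled output: linear layer followed by tanh
    manual_pooled = torch.tanh(model.pooler.dense(cls_hidden))

print(outputs.pooler_output.shape)
print(torch.allclose(manual_pooled, outputs.pooler_output, atol=1e-6))
```

If you want the raw [CLS] vector instead of the pooled one, index last_hidden_state[:, 0] yourself.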

Once more: read the docs carefully!