First, install transformers:
pip install transformers
Taking bert-base-uncased as an example, open https://huggingface.co/bert-base-uncased/tree/main, which lists all of the model's files, and download these three:
- config.json
- pytorch_model.bin
- vocab.txt
Put the three files in a local folder named bert_model.
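Before loading the model, it can help to confirm that the folder really contains the three files. The following is a minimal sketch using only the standard library; the helper name `missing_model_files` is hypothetical and not part of transformers, and the default folder name bert_model simply follows the layout described above.

```python
from pathlib import Path

# The three files downloaded above.
REQUIRED_FILES = ("config.json", "pytorch_model.bin", "vocab.txt")


def missing_model_files(model_dir="bert_model"):
    """Return the required files that are absent from model_dir.

    Hypothetical helper for illustration; not part of transformers.
    """
    folder = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing files:", missing)
    else:
        print("All three files are in place.")
```

An empty list means the folder is ready to be passed to from_pretrained.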
Loading the pretrained model
Next, create a Python file in the same directory as the bert_model folder and test that the model loads:
from transformers import BertModel
bert_model = BertModel.from_pretrained("bert_model")
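Basic Huggingface classes
In practice, only the following three Huggingface classes are used most of the time:
- configuration: https://huggingface.co/transformers/main_classes/configuration.html, mainly used to load configuration files
- models: https://huggingface.co/transformers/main_classes/model.html, the model base class
- tokenizer: https://huggingface.co/transformers/main_classes/tokenizer.html, the tool that converts text into encoded IDs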
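To give a feel for what the configuration class loads, the sketch below reads config.json with Python's standard json module rather than the transformers configuration class. The helper name `summarize_config` is hypothetical, and the keys it looks up are assumed to appear in a BERT config.json; any key that is absent is reported as None.

```python
import json
from pathlib import Path

# Keys assumed to appear in a BERT config.json; missing ones map to None.
KEYS_OF_INTEREST = (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "vocab_size",
)


def summarize_config(config_path="bert_model/config.json"):
    """Read a model's config.json and return a few architecture settings.

    Hypothetical helper for illustration; not part of transformers.
    """
    with Path(config_path).open(encoding="utf-8") as f:
        config = json.load(f)
    return {key: config.get(key) for key in KEYS_OF_INTEREST}


if __name__ == "__main__":
    path = Path("bert_model/config.json")
    if path.is_file():
        for key, value in summarize_config(path).items():
            print(f"{key}: {value}")
    else:
        print("config.json not found; download it first.")
```

The configuration class reads the same file and exposes these settings as attributes, which the model class then uses to build the network.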
Using the Tokenizer
The tokenizer's main job is to convert text into vocabulary indices.
from transformers import BertTokenizer
def get_tokenizer():
    # Adjust the path to wherever the bert_model folder sits relative to this script.
    tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path="../bert_model")
    sentences = "Natural language processing is the ability of a computer program to understand human language."
    # Encoding: sentence -> indices
    # Encoding method 1: tokenize, add the special tokens by hand, then look up the IDs
    words_list = tokenizer.tokenize(sentences)  # split the sentence into tokens
    print("Tokens:", words_list)
    tokens = ["[CLS]"] + words_list + ["[SEP]"]
    encoded_1 = tokenizer.convert_tokens_to_ids(tokens)
    print("Encoding method 1:", encoded_1)
    # Encoding method 2: call the tokenizer directly
    encoded_2 = tokenizer(sentences)
    print("Encoding method 2:", encoded_2['input_ids'])
    # Decoding: indices -> sentence
    decode_string = tokenizer.decode(encoded_2['input_ids'])
    print("Decoded, indices -> sentence:", decode_string)
    # Fix the input length: truncate longer sentences, pad shorter ones
    sentence_index = tokenizer(["natural", "language is ability and nlp is useful"],
                               truncation=True,  # truncate anything beyond max_length
                               padding=True,  # pad shorter sequences to the longest one in the batch
                               max_length=5,  # maximum length
                               add_special_tokens=True)  # add the special tokens [CLS] and [SEP]
    print("Fixed-length input:", sentence_index['input_ids'])  # convenient to feed to the model
    return tokenizer
if __name__ == '__main__':
    get_tokenizer()
The output is:
Tokens: ['natural', 'language', 'processing', 'is', 'the', 'ability', 'of', 'a', 'computer', 'program', 'to', 'understand', 'human', 'language', '.']
Encoding method 1: [101, 3019, 2653, 6364, 2003, 1996, 3754, 1997, 1037, 3274, 2565, 2000, 3305, 2529, 2653, 1012, 102]
Encoding method 2: [101, 3019, 2653, 6364, 2003, 1996, 3754, 1997, 1037, 3274, 2565, 2000, 3305, 2529, 2653, 1012, 102]
Decoded, indices -> sentence: [CLS] natural language processing is the ability of a computer program to understand human language. [SEP]
Fixed-length input: [[101, 3019, 102, 0, 0], [101, 2653, 2003, 3754, 102]]
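What the tokenizer does to produce the fixed-length result can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `encode_fixed_length` and the toy vocabulary are hypothetical, the word IDs are copied from the output above, the padding ID 0 is inferred from the padded result, and the [UNK] ID of 100 is an assumption. The sketch splits on whitespace only and skips WordPiece sub-token splitting.

```python
# Toy vocabulary: word IDs copied from the output above. [PAD] = 0 is inferred
# from the padded result; [UNK] = 100 is an assumption for this sketch.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "natural": 3019, "language": 2653, "is": 2003, "ability": 3754,
}


def encode_fixed_length(sentences, max_length):
    """Encode sentences to ID lists, truncating and padding the batch."""
    batch = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Truncation: keep room for the [CLS] and [SEP] special tokens.
        tokens = tokens[: max_length - 2]
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        # Lookup: words outside the vocabulary map to [UNK].
        batch.append([VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens])
    # Padding: extend every sequence to the longest one in the batch.
    longest = max(len(ids) for ids in batch)
    return [ids + [VOCAB["[PAD]"]] * (longest - len(ids)) for ids in batch]


if __name__ == "__main__":
    batch = ["natural", "language is ability and nlp is useful"]
    print(encode_fixed_length(batch, max_length=5))
    # prints [[101, 3019, 102, 0, 0], [101, 2653, 2003, 3754, 102]]
```

Truncating before the lookup means the words cut off from the second sentence never need an ID, which is why the toy vocabulary can stay this small.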
Reference links
- Huggingface GitHub page: https://github.com/huggingface/transformers
- Huggingface model hub: https://huggingface.co/models
- Huggingface Quick start documentation: https://huggingface.co/transformers/quicktour.html