HuggingFace Study Notes 1: Learning the tokenizer and encoding text to a fixed length (PyTorch)

呆萌的代Ma

Last modified: 2022-03-28 15:45:03

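License: CC 4.0 BY-SA

Category: Natural Language Processing. Tags: python, NLP

First published: 2022-03-28 14:32:50

This is an original article by CSDN blogger "呆萌的代Ma". When reposting, please include the blog link: https://blog.youkuaiyun.com/weixin_35757704/

Article link: https://blog.youkuaiyun.com/weixin_35757704/article/details/123794160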

First, install transformers (loading BertModel later on also requires PyTorch to be installed):

pip install transformers

Taking bert-base-uncased as the example, open https://huggingface.co/bert-base-uncased/tree/main to see all the files of this model, and download these three:

config.json
pytorch_model.bin
vocab.txt
Put the three files in a local folder named bert_model.
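Before loading anything, it can help to confirm that the folder really contains the three files. The following is a minimal sketch; the helper name missing_model_files is made up for illustration and is not part of transformers.

```python
from pathlib import Path

# File names taken from the download list above
REQUIRED_FILES = ("config.json", "pytorch_model.bin", "vocab.txt")


def missing_model_files(model_dir="bert_model"):
    """Return the required files that are not present in model_dir.

    Hypothetical helper for illustration only.
    """
    folder = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("bert_model looks complete")
```

An empty list means all three files were found in the folder.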

Loading the pretrained model

Next, create a Python file in the same directory as the bert_model folder and test that the model loads:

from transformers import BertModel

bert_model = BertModel.from_pretrained("bert_model")

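Basic Hugging Face classes

In practice, Hugging Face transformers is used mainly through three kinds of classes:

configuration: https://huggingface.co/transformers/main_classes/configuration.html, mainly used to load the configuration file
models: https://huggingface.co/transformers/main_classes/model.html, the model base classes
tokenizer: https://huggingface.co/transformers/main_classes/tokenizer.html, the tool that converts text into encoded indices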
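To see the three kinds of classes working together without downloading anything, here is a rough sketch that builds a tiny BERT from scratch. The toy vocabulary and the small config values are assumptions chosen for illustration. The weights are randomly initialized, so the output vectors carry no meaning; only the shapes are of interest.

```python
import os
import tempfile

import torch
from transformers import BertConfig, BertModel, BertTokenizer

# Hypothetical toy vocabulary standing in for the real vocab.txt
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
             "natural", "language", "processing"]


def build_toy_pipeline():
    """Create a tokenizer, a config and a randomly initialized model."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # tokenizer: loaded from a folder that contains vocab.txt
        with open(os.path.join(tmp_dir, "vocab.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(TOY_VOCAB) + "\n")
        tokenizer = BertTokenizer.from_pretrained(tmp_dir)

    # configuration: describes the architecture (tiny values, assumed for the demo)
    config = BertConfig(vocab_size=len(TOY_VOCAB), hidden_size=32,
                        num_hidden_layers=2, num_attention_heads=2,
                        intermediate_size=64)

    # model: built from the config, weights are random (not pretrained)
    model = BertModel(config)
    model.eval()
    return tokenizer, config, model


def encode_and_run(text):
    """Return the token ids and the shape of the model's last hidden state."""
    tokenizer, _, model = build_toy_pipeline()
    inputs = tokenizer(text, return_tensors="pt")  # PyTorch tensors
    with torch.no_grad():
        outputs = model(**inputs)
    return inputs["input_ids"][0].tolist(), tuple(outputs.last_hidden_state.shape)


if __name__ == "__main__":
    ids, shape = encode_and_run("natural language processing")
    print(ids, shape)
```

The last hidden state has shape (batch size, sequence length, hidden_size), so its final dimension matches the hidden_size set in the config.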

Using the Tokenizer

The tokenizer's main job is to convert text into token indices (IDs).
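To make "text to index" concrete, here is a simplified pure-Python sketch of the idea behind BERT's WordPiece lookup: each word is split greedily into the longest pieces found in the vocabulary, and each piece is replaced by its position in the vocabulary. The toy vocabulary is made up for illustration, and the sketch skips steps the real tokenizer performs, such as punctuation splitting.

```python
# Hypothetical toy vocabulary; the real vocab.txt has about 30,000 entries
TOY_VOCAB = {"[UNK]": 0, "token": 1, "##izer": 2, "##s": 3,
             "text": 4, "to": 5, "index": 6}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def text_to_ids(text, vocab):
    """Lowercase, split on whitespace, then map every piece to its index."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens, [vocab[t] for t in tokens]


if __name__ == "__main__":
    print(text_to_ids("Tokenizers text to index", TOY_VOCAB))
```

With this toy vocabulary, "Tokenizers" is split into token, ##izer and ##s, which is the same kind of sub-word split that the real tokenizer applies to rare words.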

from transformers import BertTokenizer


def get_tokenizer():
    # Load the tokenizer from the same local folder used for BertModel above
    # (the path is relative to the working directory; adjust it if your layout differs)
    tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path="bert_model")
    sentence = "Natural language processing is the ability of a computer program to understand human language."

    # Encoding: sentence -> indices
    # Encoding method 1: tokenize, add the special tokens by hand, then look up the ids
    words_list = tokenizer.tokenize(sentence)  # split the sentence into tokens
    print("Tokens:", words_list)
    tokens = ["[CLS]"] + words_list + ["[SEP]"]
    encoded_1 = tokenizer.convert_tokens_to_ids(tokens)
    print("Encoding method 1:", encoded_1)

    # Encoding method 2: call the tokenizer directly
    encoded_2 = tokenizer(sentence)  # encoding: sentence -> indices
    print("Encoding method 2:", encoded_2['input_ids'])

    # Decoding: indices -> sentence
    decode_string = tokenizer.decode(encoded_2['input_ids'])
    print("Decoded (indices -> sentence):", decode_string)

    # Force every input to a fixed length: truncate if too long, pad if too short
    sentence_index = tokenizer(["natural", "language is ability and nlp is useful"],
                               truncation=True,  # truncate anything beyond max_length
                               padding="max_length",  # pad shorter inputs up to max_length (padding=True only pads to the longest input in the batch)
                               max_length=5,  # maximum length, special tokens included
                               add_special_tokens=True)  # add [CLS] and [SEP]
    print("Fixed-length input:", sentence_index['input_ids'])  # ready to be fed to the model
    return tokenizer


if __name__ == '__main__':
    get_tokenizer()

The output is:

Tokens: ['natural', 'language', 'processing', 'is', 'the', 'ability', 'of', 'a', 'computer', 'program', 'to', 'understand', 'human', 'language', '.']
Encoding method 1: [101, 3019, 2653, 6364, 2003, 1996, 3754, 1997, 1037, 3274, 2565, 2000, 3305, 2529, 2653, 1012, 102]
Encoding method 2: [101, 3019, 2653, 6364, 2003, 1996, 3754, 1997, 1037, 3274, 2565, 2000, 3305, 2529, 2653, 1012, 102]
Decoded (indices -> sentence): [CLS] natural language processing is the ability of a computer program to understand human language. [SEP]
Fixed-length input: [[101, 3019, 102, 0, 0], [101, 2653, 2003, 3754, 102]]
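The truncate-and-pad step can also be written out by hand, which makes the last output line easier to read. The sketch below is only an imitation of that behaviour in plain Python, not the library's implementation. It uses the special-token ids visible in the output above (101 for [CLS], 102 for [SEP], 0 for padding) and also builds an attention mask of the kind the tokenizer returns under 'attention_mask', marking real tokens with 1 and padding with 0.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in the output above


def to_fixed_length(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then right-pad with [PAD].

    Assumes max_length >= 2 so that both special tokens fit.
    """
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * pad, attention_mask + [0] * pad  # 0 = padding


if __name__ == "__main__":
    # 3019 is the id of "natural" in the output above
    print(to_fixed_length([3019], 5))
    # the first five word ids of the example sentence
    print(to_fixed_length([3019, 2653, 6364, 2003, 1996], 5))
```

The short input is padded with zeros at the end, and the long one keeps only its first three word ids so that [CLS] and [SEP] still fit within the five positions.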