HuggingFace: A Brief Note on Tokenizers

License: CC 4.0 BY-SA

Tags:

#HuggingFace #Pytorch #python #transformers #tokenizer

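This article centers on the Tokenizer. It introduces the tokenization tools used by different pretrained models, explains how to download and use them, and shows their outputs. It describes the Tokenizer workflow, including Normalization and Pre-tokenization, and explores the underlying tokenization algorithms such as BPE and WordPiece. Finally, it covers training a new tokenizer from an existing one.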

Tokenizer [ Chinese Course | API | Detailed documentation ]

Downloading and usage

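When AutoTokenizer loads a tokenizer from an online repository, it first needs to fetch the tokenizer configuration file, as this excerpt from the transformers source shows: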

    commit_hash = kwargs.get("_commit_hash", None)
    resolved_config_file = cached_file(
        pretrained_model_name_or_path,
        TOKENIZER_CONFIG_FILE,
        cache_dir=cache_dir,
        force_download=force_download,
        resume_download=resume_download,
        proxies=proxies,
        use_auth_token=use_auth_token,
        revision=revision,
        local_files_only=local_files_only,
        _raise_exceptions_for_missing_entries=False,
        _raise_exceptions_for_connection_errors=False,
        _commit_hash=commit_hash,
    )
    if resolved_config_file is None:
        logger.info("Could not locate the tokenizer configuration file, will try to use the model config instead.")
        return {}
    commit_hash = extract_commit_hash(resolved_config_file, commit_hash)

    with open(resolved_config_file, encoding="utf-8") as reader:
        result = json.load(reader)  # load "tokenizer_config.json"
    result["_commit_hash"] = commit_hash
    return result

Here TOKENIZER_CONFIG_FILE refers to:

# Slow tokenizers used to be saved in three separated files
SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"
ADDED_TOKENS_FILE = "added_tokens.json"
TOKENIZER_CONFIG_FILE = "tokenizer_config.json"
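
To make the lookup above concrete, here is a minimal sketch that reproduces only its local-file part using the standard library. The helper name `load_local_tokenizer_config` and the sample config values are hypothetical and are not part of transformers; the real code goes through `cached_file`, which also handles downloading and caching.

```python
import json
import tempfile
from pathlib import Path

TOKENIZER_CONFIG_FILE = "tokenizer_config.json"


def load_local_tokenizer_config(model_dir):
    """Hypothetical helper: read tokenizer_config.json from a local directory."""
    config_path = Path(model_dir) / TOKENIZER_CONFIG_FILE
    if not config_path.is_file():
        # Like the excerpt, treat a missing file as "no config", not as an error.
        return {}
    with open(config_path, encoding="utf-8") as reader:
        return json.load(reader)


with tempfile.TemporaryDirectory() as tmp:
    # First call: the directory is empty, so we get the empty fallback.
    missing = load_local_tokenizer_config(tmp)

    # Write a made-up config file, then read it back.
    sample = {"tokenizer_class": "BertTokenizer", "do_lower_case": True}
    (Path(tmp) / TOKENIZER_CONFIG_FILE).write_text(json.dumps(sample), encoding="utf-8")
    found = load_local_tokenizer_config(tmp)

print(missing)
print(found["tokenizer_class"])
```

The empty-dict fallback matters because the caller can then fall back to the model config instead of failing outright.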

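Different pretrained models use different tokenization algorithms, for example:

Byte-level BPE, used by GPT-2;
WordPiece, used by BERT;
SentencePiece or Unigram, used by several multilingual models.

In general, the algorithm in use is configured in the tokenizer.json file in the repo. For example, the tokenizer.json of the hfl/roberta-ext model contains the following configuration: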

"model": {
    "type": "WordPiece",
    "unk_token": "[UNK]",
    "continuing_subword_prefix": "##",
    "max_input_chars_per_word": 100,
    "vocab": {
        "[PAD]": 0,
        ...
    }
}
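
The fields in this configuration map directly onto the greedy longest-match-first procedure that WordPiece applies to each word. The sketch below is a toy illustration only: the vocab entries other than "[PAD]" are invented, `wordpiece_split` is a hypothetical helper, and the real implementation lives in the tokenizers library rather than in Python code like this.

```python
import json

# Toy tokenizer.json fragment. Vocab entries besides "[PAD]" are made up.
TOKENIZER_JSON = """
{
  "model": {
    "type": "WordPiece",
    "unk_token": "[UNK]",
    "continuing_subword_prefix": "##",
    "max_input_chars_per_word": 100,
    "vocab": {"[PAD]": 0, "[UNK]": 1, "token": 2, "##izer": 3, "##s": 4}
  }
}
"""


def wordpiece_split(word, model):
    """Greedy longest-match-first split of a single word (toy version)."""
    vocab = model["vocab"]
    prefix = model["continuing_subword_prefix"]
    unk = model["unk_token"]
    if len(word) > model["max_input_chars_per_word"]:
        return [unk]
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


model = json.loads(TOKENIZER_JSON)["model"]
print(model["type"])
print(wordpiece_split("tokenizers", model))
print(wordpiece_split("xyz", model))
```

With this toy vocab, "tokenizers" splits into "token", "##izer", "##s", while "xyz" has no matching piece and collapses to the unknown token.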

The official recommendation is to use the Auto* classes, because they are designed to be independent of the model architecture.
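
One way to picture what architecture-independent means: the caller never names a concrete tokenizer class, and the choice is driven by the saved configuration (AutoTokenizer reads a `tokenizer_class` entry from tokenizer_config.json, among other sources). The sketch below is a toy model of that dispatch; every class and function in it is hypothetical, and the real resolution logic in transformers is considerably more involved.

```python
class ToyWordTokenizer:
    """Hypothetical stand-in for one architecture's tokenizer."""

    def tokenize(self, text):
        return text.lower().split()


class ToyCharTokenizer:
    """Hypothetical stand-in for a different architecture's tokenizer."""

    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]


# Toy registry: config value -> concrete class.
TOY_REGISTRY = {
    "ToyWordTokenizer": ToyWordTokenizer,
    "ToyCharTokenizer": ToyCharTokenizer,
}


def toy_auto_tokenizer(tokenizer_config):
    """Pick the concrete class from the config, so callers stay generic."""
    name = tokenizer_config.get("tokenizer_class")
    if name not in TOY_REGISTRY:
        raise ValueError(f"Unknown tokenizer_class: {name!r}")
    return TOY_REGISTRY[name]()


# The calling code is identical for both configs.
for config in ({"tokenizer_class": "ToyWordTokenizer"},
               {"tokenizer_class": "ToyCharTokenizer"}):
    tok = toy_auto_tokenizer(config)
    print(type(tok).__name__, tok.tokenize("Hello world"))
```

Swapping the checkpoint then only changes the config, not the code that loads it.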

Some example Tokenizer outputs

tokenizer = AutoTokenizer.from_pretrained
