Applying an Extended Chinese Tokenizer to Your Own LLM (Modifying the Embedding Parameters)

落难Coder

Published 2024-08-06 21:44:11


Categories: LLMs, NLP. Tags: embedding, deep learning

Article link: https://blog.youkuaiyun.com/u014297502/article/details/140967041


In "A Hands-On Guide to Understanding and Extending the LLaMA Tokenizer (Adding Chinese Tokens)", we covered in detail how to train and adjust a tokenizer model on your own data.

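This article continues from there. We have extended the LLaMA vocabulary with SentencePiece, so what is the next step? How should the newly added tokens be initialized in the model's embedding layer and lm_head layer?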
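Before touching the full model, the core idea can be sketched on a toy embedding and output head. This is a minimal sketch under stated assumptions, not necessarily the initialization this article ends up using: it assumes the original token ids are unchanged and the new tokens are appended at the end, and it uses mean initialization (each new row is set to the mean of the existing rows), which is one common choice. The helper `extend_rows_with_mean` and the toy sizes are hypothetical names introduced for illustration.

```python
import torch
import torch.nn as nn


def extend_rows_with_mean(weight: torch.Tensor, num_new: int) -> torch.Tensor:
    """Return a copy of `weight` with `num_new` extra rows appended.

    Each appended row is the mean of the existing rows (an assumed
    initialization strategy, chosen here only for illustration).
    """
    mean_row = weight.mean(dim=0, keepdim=True)           # shape (1, hidden)
    return torch.cat([weight, mean_row.expand(num_new, -1)], dim=0)


torch.manual_seed(0)
old_vocab, hidden, num_new = 10, 4, 3                     # toy sizes, not LLaMA's

# Stand-ins for model.embed_tokens and lm_head.
embed = nn.Embedding(old_vocab, hidden)
lm_head = nn.Linear(hidden, old_vocab, bias=False)        # weight shape: (vocab, hidden)

# Allocate larger layers and copy the old rows plus the mean-initialized new rows.
new_embed = nn.Embedding(old_vocab + num_new, hidden)
new_head = nn.Linear(hidden, old_vocab + num_new, bias=False)
with torch.no_grad():
    new_embed.weight.copy_(extend_rows_with_mean(embed.weight, num_new))
    new_head.weight.copy_(extend_rows_with_mean(lm_head.weight, num_new))

print(new_embed.weight.shape, new_head.weight.shape)
```

Both layers are handled the same way because in each weight matrix one row corresponds to one token. On a Hugging Face model, `model.resize_token_embeddings(len(new_tokenizer))` performs the resizing step; how the new rows should then be initialized is a separate choice.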

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "../sentencepiece/llama2-7b-hf" # path to the llama2 model
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

new_tokenizer = AutoTokenizer.from_pretrained("../sentencepiece/merged_tokenizer_hf_test") # path to the newly trained (merged) tokenizer
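With both tokenizers loaded, it can help to check which tokens were actually added and whether the original ids are preserved. The following is a small sketch that works on plain token-to-id dictionaries such as those returned by `get_vocab()`. The helper names `find_added_tokens` and `ids_preserved` and the toy vocabularies are assumptions made for illustration only.

```python
def find_added_tokens(old_vocab: dict, new_vocab: dict) -> list:
    """Tokens present in new_vocab but not in old_vocab, ordered by their new id."""
    added = [tok for tok in new_vocab if tok not in old_vocab]
    return sorted(added, key=new_vocab.get)


def ids_preserved(old_vocab: dict, new_vocab: dict) -> bool:
    """True if every original token keeps the same id in the new vocabulary."""
    return all(new_vocab.get(tok) == idx for tok, idx in old_vocab.items())


# Toy vocabularies standing in for the real tokenizers.
old = {"<unk>": 0, "<s>": 1, "</s>": 2, "▁hello": 3}
new = {**old, "你好": 4, "模型": 5}

added = find_added_tokens(old, new)
print(len(added), added, ids_preserved(old, new))
```

With the objects loaded above, the same check would be `find_added_tokens(tokenizer.get_vocab(), new_tokenizer.get_vocab())`. If `ids_preserved` returns False, simply appending rows to the embedding matrix would misalign existing tokens.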

After loading the model, the original tokenizer, and the new tokenizer, let's look at the model structure:

model
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v