C# AI Series (7): LLM from Scratch, Implementing a Tokenizer

Reposted; published 2025-12-21 12:30:02

License: CC 4.0 BY-SA

Original link: https://www.cnblogs.com/luojin765/p/19378939

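I. Introduction: What Is a Token

An LLM does exactly one thing: it consumes tokens and emits tokens. Tokens are the basic building blocks of an LLM (large language model). The relationship between tokens and an LLM is like that between LEGO bricks and the LEGO factory, or between Minecraft blocks and the game Minecraft. So what exactly is a token? In Chinese some render it as 令牌 and others as 词元. It is easier to think of a token as the smallest unit of operation and of information, where "smallest" is relative to the raw text the LLM has to process.

For example, when a sentence is entered into a computer, it is already naturally split at the character level. If we do not split or combine any further, we can use a mapping to convert these characters into an integer array; this is called encoding. The elements of the encoded array are the tokens, and the value of each element is the value of the token. The LLM consumes this token array and emits a new array. Applying the same mapping in reverse to the new array is called decoding, and it yields text that humans can read.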

// Original sentence
"我有一个 apple."

// Sentence split into characters
["我","有","一","个"," ","a","p","p","l","e",".","\0"]

// Encoded as an integer array
[1,2,3,4,5,6,7,7,8,9,10,11]
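
The character-level mapping above can be sketched in a few lines of C#. This is only an illustration under stated assumptions: the class name CharTokenizer is made up, and the vocabulary is built on the fly in order of first appearance (a real tokenizer loads a fixed vocabulary), with ids starting at 1 to match the listing.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical character-level tokenizer, for illustration only.
public sealed class CharTokenizer
{
    private readonly Dictionary<char, int> _charToId = new Dictionary<char, int>();
    private readonly List<char> _idToChar = new List<char>();

    // Encoding: map every UTF-16 char to an integer id.
    // Unseen chars get the next free id (assumption: ids start at 1).
    public int[] Encode(string text)
    {
        var ids = new int[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (!_charToId.TryGetValue(c, out int id))
            {
                id = _idToChar.Count + 1;
                _charToId[c] = id;
                _idToChar.Add(c);
            }
            ids[i] = id;
        }
        return ids;
    }

    // Decoding: apply the same mapping in reverse.
    public string Decode(IEnumerable<int> ids)
    {
        var sb = new StringBuilder();
        foreach (int id in ids)
        {
            sb.Append(_idToChar[id - 1]);
        }
        return sb.ToString();
    }
}
```

Calling Encode on "我有一个 apple.\0" with a fresh instance gives 1,2,3,4,5,6,7,7,8,9,10,11, and Decode turns that array back into the original string. Note that this sketch works on UTF-16 code units, so a character outside the Basic Multilingual Plane would take two ids.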

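In practice, mainstream LLMs almost never use pure character-level splitting. For better results they use subword algorithms such as BPE, WordPiece, or SentencePiece. With these, "hello" is most likely 1 or 2 tokens rather than 5. For Chinese, "我有一个" may be split as "我/有/一/个" or as "我有/一个", depending on the vocabulary. With subword algorithms, a single token taken in isolation may not be interpretable, because it is only a fragment of a word.

Whatever the method, what goes into and comes out of an LLM is always an integer array, and the number of elements in that array is the token count (which is also what LLM services bill by).

Looking at practice again, nearly all mainstream LLMs build their tokenizer with BPE or BBPE (byte-level BPE). The next section looks at BPE in more detail.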

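II. BPE (Byte Pair Encoding)

Byte pair encoding is a simple form of data compression: the most frequent pair of consecutive bytes in the data is replaced with a byte that does not occur in the data. A table of these replacements is needed to rebuild the original data. The encoding process looks like this: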

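// BPE example based on the one on Wikipedia
"aaabdaaabac": "aa"=>"Z" // "aa" occurs most often; replace it with "Z", which does not appear in the data
"ZabdZabac": "aa"=>"Z", "Za"=>"Y" // same again; extend the replacement table
"YbdYbac": "aa"=>"Z", "Za"=>"Y", "Yb"=>"X" // same again; extend the replacement table
"XdXac":"aa"=>"Z", "Za"=>"Y", "Yb"=>"X" // no pair occurs more than once, so nothing is left to replace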

We have encoded "aaabdaaabac" as "XdXac" with BPE. To decode, apply the accompanying replacement table ("aa"=>"Z", "Za"=>"Y", "Yb"=>"X") in reverse order, and the original data comes back.
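
The walkthrough can be reproduced with a small C# sketch. It is a toy under stated assumptions, not how a production tokenizer is trained: the names ToyBpe, Compress and Decompress are invented here, the unused replacement symbols are passed in by the caller, pairs are counted with overlap, and ties are broken in favour of the pair that appears first in the text (which is what makes "Za" win over "ab" in the second step).

```csharp
using System;
using System.Collections.Generic;

// Toy BPE compression on a string, for illustration only.
public static class ToyBpe
{
    public static (string Text, List<(string Pair, char Symbol)> Table) Compress(
        string text, string freshSymbols)
    {
        var table = new List<(string Pair, char Symbol)>();
        foreach (char symbol in freshSymbols)
        {
            // Count every adjacent pair (overlapping occurrences included).
            var counts = new Dictionary<string, int>();
            for (int i = 0; i + 1 < text.Length; i++)
            {
                string pair = text.Substring(i, 2);
                counts[pair] = counts.TryGetValue(pair, out int c) ? c + 1 : 1;
            }

            // Pick the most frequent pair; on a tie the earliest one in the text wins.
            string best = "";
            int bestCount = 1; // a pair must occur more than once to be worth replacing
            for (int i = 0; i + 1 < text.Length; i++)
            {
                string pair = text.Substring(i, 2);
                if (counts[pair] > bestCount)
                {
                    best = pair;
                    bestCount = counts[pair];
                }
            }
            if (best.Length == 0) break; // nothing left to replace

            text = text.Replace(best, symbol.ToString());
            table.Add((best, symbol));
        }
        return (text, table);
    }

    // Decoding: undo the replacements in reverse order.
    public static string Decompress(string text, List<(string Pair, char Symbol)> table)
    {
        for (int i = table.Count - 1; i >= 0; i--)
        {
            text = text.Replace(table[i].Symbol.ToString(), table[i].Pair);
        }
        return text;
    }
}
```

ToyBpe.Compress("aaabdaaabac", "ZYX") returns "XdXac" together with the three-entry table from the example, and ToyBpe.Decompress restores "aaabdaaabac".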

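BPE uses subwords, "larger than a character, smaller than a word", as its building blocks. It became popular mainly because the number of tokens after encoding is moderate, sitting between character-level splitting and whole-word splitting. Compared with whole-word splitting, subword splitting caps the vocabulary size and avoids vocabulary bloat; it can fall back to bytes or characters at the finest and keep whole words at the coarsest, so granularity scales with frequency. Newly encountered words are not a problem either, so there is no out-of-vocabulary (OOV) issue. The algorithm is also language-independent: English, Arabic, and other languages are all handled the same way. In addition, the vocabulary is fairly readable, and the computational cost is low for a given level of quality.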
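
To make the "fall back to bytes" point concrete, the following sketch shows the base symbols a byte-level BPE starts from before any merge is applied, namely the UTF-8 bytes of the text. The class and method names are made up for illustration; only Encoding.UTF8.GetBytes is a real .NET call.

```csharp
using System;
using System.Text;

// Illustration only: the finest granularity of byte-level BPE is one symbol per UTF-8 byte.
public static class ByteFallbackDemo
{
    public static byte[] ToBaseSymbols(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Renders the bytes as hex so they are easy to inspect.
    public static string ToHex(string text)
    {
        return BitConverter.ToString(ToBaseSymbols(text));
    }
}
```

ToBaseSymbols("apple") yields 5 bytes, while ToBaseSymbols("你好") yields 6 bytes (E4-BD-A0-E5-A5-BD), three per character. Any input therefore has a representation, and frequent byte sequences are then merged back into larger tokens according to the vocabulary.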

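III. BPE Tokenizer

A BPE tokenizer has two main functions: (1) training to obtain the vocabulary, and (2) encoding and decoding. Training was sketched above, so the rest of this article focuses on encoding and decoding.

The data of a trained BPE tokenizer consists of three main parts:

vocab.json: a dictionary from symbol to id;
merges.txt: the merge pairs, listed in the order in which they were merged;
tokenizer_config.json: pre-processing rules (the regex text) and special tokens.

Another common file is tokenizer.json, in which the Hugging Face ecosystem packs the three originally separate files into a single JSON file. A typical structure is shown below (depending on the version, each entry in merges may be stored either as a string or as an array, so the parser has to handle both):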

// cl100k_base
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [ // special tokens
    {
      "id": 100257,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ],
  "normalizer": null,
  "pre_tokenizer": {  // present in some files and absent in others, so a fallback regex has to be hard-coded in advance
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",  // pre-tokenization split, a "guard rail" that keeps merges from crossing piece boundaries
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
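        "behavior": "Removed",   // usually hard-coded; together with "invert", the pieces that match the regex are kept and the rest is dropped
        "invert": true // usually hard-coded; flips matched/unmatched, so only the pieces captured by the regex above survive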
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },
  "post_processor": null,
  "decoder": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": true,
    "use_regex": true
  },
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false,
    "vocab": {
      "!": 0,
      "<|endofprompt|>": 100276
    },
    "merges": [
      "ĠCon veyor"  // or, in the array form: ["Ġp","ain"]
    ]
  }
}
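
Because merges can appear in either of the two forms, a loader has to accept both. The following is a minimal sketch using System.Text.Json, not the LumTokenizer implementation; the class and method names are hypothetical, and it assumes the string form separates the two symbols with a single space (byte-level symbols use "Ġ" instead of a raw space, so splitting at the first space is safe in that case).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: reads model.merges from a tokenizer.json string.
public static class MergeParser
{
    public static List<(string Left, string Right)> ParseMerges(string tokenizerJson)
    {
        // Tolerate // comments and trailing commas, as in the annotated listing above.
        var options = new JsonDocumentOptions
        {
            CommentHandling = JsonCommentHandling.Skip,
            AllowTrailingCommas = true
        };
        using JsonDocument doc = JsonDocument.Parse(tokenizerJson, options);
        JsonElement merges = doc.RootElement.GetProperty("model").GetProperty("merges");

        var result = new List<(string Left, string Right)>();
        foreach (JsonElement item in merges.EnumerateArray())
        {
            if (item.ValueKind == JsonValueKind.String)
            {
                // String form: "left right"
                string s = item.GetString() ?? throw new FormatException("null merge entry");
                int space = s.IndexOf(' ');
                if (space <= 0 || space == s.Length - 1)
                {
                    throw new FormatException("malformed merge entry: " + s);
                }
                result.Add((s.Substring(0, space), s.Substring(space + 1)));
            }
            else if (item.ValueKind == JsonValueKind.Array && item.GetArrayLength() == 2)
            {
                // Array form: ["left", "right"]
                string left = item[0].GetString() ?? throw new FormatException("null merge symbol");
                string right = item[1].GetString() ?? throw new FormatException("null merge symbol");
                result.Add((left, right));
            }
            else
            {
                throw new FormatException("unsupported merge entry kind: " + item.ValueKind);
            }
        }
        return result;
    }
}
```

The position of each pair in the returned list is its merge rank: earlier entries were learned first and take priority during encoding.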

Once this pre-trained data has been loaded, the BPE tokenizer is ready to use. Its core functions are encoding and decoding, i.e. Encode and Decode.
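
The heart of Encode is the merge loop: within one pre-tokenized piece, repeatedly merge the adjacent pair with the lowest rank (the earliest position in merges) until no known pair remains. The sketch below is a deliberately simple quadratic version for illustration, with invented names; it is not LumTokenizer's optimized code, and it leaves out pre-tokenization, byte mapping, and the final vocabulary lookup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, unoptimized BPE merge loop for a single pre-tokenized piece.
public static class BpeMerge
{
    public static List<string> Apply(
        IEnumerable<string> symbols,
        IReadOnlyDictionary<(string, string), int> ranks)
    {
        var parts = new List<string>(symbols);
        while (parts.Count > 1)
        {
            // Find the adjacent pair with the lowest merge rank (leftmost on ties).
            int bestIndex = -1;
            int bestRank = int.MaxValue;
            for (int i = 0; i + 1 < parts.Count; i++)
            {
                if (ranks.TryGetValue((parts[i], parts[i + 1]), out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) break; // no mergeable pair left

            // Merge the pair into one symbol.
            parts[bestIndex] = parts[bestIndex] + parts[bestIndex + 1];
            parts.RemoveAt(bestIndex + 1);
        }
        return parts;
    }
}
```

With ranks ("l","o") = 0, ("lo","w") = 1 and ("e","r") = 2, the symbols l, o, w, e, r are merged into "low" and "er". Each resulting symbol is then looked up in the vocabulary to obtain its id; Decode does the reverse by concatenating the symbols for the given ids.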

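IV. Implementing the Tokenizer in C#

In Python you can load local tokenizer files directly with Hugging Face's AutoTokenizer. In C# you can pull in SharpToken (2.0.4) and TiktokenSharp (1.2.0) to count tokens.

However, if you want to develop an LLM yourself in C# (even though few people do), a good tokenizer matters a lot, and more customization is needed: support for Hugging Face tokenizer.json data, flexible handling of special tokens, and thorough optimization.

That is how the LumTokenizer project came about.

Its main features are: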

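Reading tokenizer.json data: if the file contains no regex, one of three built-in pre-tokenization regexes is used:
Regex50KBase: ≈ the roughly 50K base vocabulary of GPT-2;
RegexCl100KBase: ≈ the roughly 100K cl100k_base vocabulary used by OpenAI GPT-3.5 / GPT-4;
RegexO200KBase: ≈ the roughly 200K o200k_base vocabulary used by OpenAI GPT-4o.
Efficient special-token splitting: when a tokenizer is used for model training, it has to handle special tokens separately and efficiently. Special tokens are meant to appear in ordinary text as rarely as possible, so they are usually not part of the regular vocabulary and must be recognized and split off by a dedicated mechanism.
Efficient caching: for the tokenization stage, LumTokenizer uses a custom SpanDictionary to make slice lookups fast. A string can be cut into multiple slices with .NET's Span feature, and SpanDictionary matches keys directly against a Span (a Span cannot be used as a generic type argument of the regular Dictionary), which saves most of the cost of converting substrings to string.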
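
The special-token step can be pictured as a pass that runs before BPE: cut the input into ordinary segments and special-token segments, send only the ordinary ones through the regex and merge loop, and map each special token straight to its id. The sketch below is a simple regex-based version with invented names, written under the assumption that special tokens are matched literally; it does not reflect how LumTokenizer implements the step and makes no attempt at its performance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical pre-pass that separates special tokens from ordinary text.
public static class SpecialTokenSplitter
{
    public static List<(string Text, bool IsSpecial)> Split(
        string input, IEnumerable<string> specialTokens)
    {
        var result = new List<(string Text, bool IsSpecial)>();

        // Longest tokens first, so a token is never shadowed by one of its prefixes.
        var escaped = specialTokens
            .Where(t => t.Length > 0)
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t))
            .ToList();
        if (escaped.Count == 0)
        {
            if (input.Length > 0) result.Add((input, false));
            return result;
        }

        var pattern = new Regex(string.Join("|", escaped));
        int pos = 0;
        foreach (Match m in pattern.Matches(input))
        {
            if (m.Index > pos)
            {
                result.Add((input.Substring(pos, m.Index - pos), false)); // ordinary text
            }
            result.Add((m.Value, true)); // special token, bypasses BPE
            pos = m.Index + m.Length;
        }
        if (pos < input.Length)
        {
            result.Add((input.Substring(pos), false));
        }
        return result;
    }
}
```

For the input "<|im_start|>hello<|im_end|>" with the two special tokens "<|im_start|>" and "<|im_end|>", this yields three segments: a special one, the ordinary text "hello", and another special one.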

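The benchmark results are shown below. Each measurement is one Encode plus one Decode of a long text (around 500 characters). On the Chinese sample, which consists of multi-byte characters, LumTokenizer is about 4.5 to 5.7 times faster than the other two libraries, and on the mixed sample about 2.7 to 3.8 times faster. On the English sample it is on par with SharpToken and slower than TiktokenSharp (ratio 0.77).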

| Method | text | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SharpToken_cl100k_base | Chinese | 122.99 us | 2.314 us | 2.273 us | 5.71 | 0.12 | 0.7324 | 9.1 KB | 1.19 |
| TiktokenSharp_cl100k_base | Chinese | 96.00 us | 1.829 us | 2.106 us | 4.45 | 0.11 | 0.4883 | 6.34 KB | 0.83 |
| LumTokenizer_cl100k_base | Chinese | 21.56 us | 0.268 us | 0.251 us | 1.00 | 0.02 | 0.6104 | 7.63 KB | 1.00 |
| SharpToken_cl100k_base | English | 26.77 us | 0.520 us | 0.639 us | 1.02 | 0.03 | 0.6714 | 8.38 KB | 0.74 |
| TiktokenSharp_cl100k_base | English | 20.21 us | 0.383 us | 0.376 us | 0.77 | 0.02 | 0.4272 | 5.51 KB | 0.49 |
| LumTokenizer_cl100k_base | English | 26.13 us | 0.495 us | 0.509 us | 1.00 | 0.03 | 0.9155 | 11.31 KB | 1.00 |
| SharpToken_cl100k_base | Mixed | 90.97 us | 1.580 us | 1.478 us | 3.78 | 0.09 | 0.8545 | 10.9 KB | 1.23 |
| TiktokenSharp_cl100k_base | Mixed | 63.85 us | 1.274 us | 1.564 us | 2.65 | 0.08 | 0.4883 | 6.74 KB | 0.76 |
| LumTokenizer_cl100k_base | Mixed | 24.08 us | 0.465 us | 0.435 us | 1.00 | 0.03 | 0.7019 | 8.83 KB | 1.00 |

See the repository for the detailed code. The benchmark class is:

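  using System.Collections.Generic;
  using BenchmarkDotNet.Attributes;
  using SharpToken;      // GptEncoding
  using TiktokenSharp;   // TikToken
  // The LumTokenizer namespace (BPETokenizer, RegexType) is not shown in the article; import it as declared in the repository.

  [MemoryDiagnoser]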
  public class CompareBenchmark
  {
      internal GptEncoding _sharpToken;
      internal TikToken _tikToken;
      internal BPETokenizer _tokenizer1;
      internal BPETokenizer _tokenizer2;

      [GlobalSetup]
      public void Setup()
      {
          _sharpToken = GptEncoding.GetEncoding("cl100k_base");
          _tikToken = TikToken.GetEncodingAsync("cl100k_base").ConfigureAwait(false).GetAwaiter().GetResult();
          _tokenizer1 = BPETokenizer.CreateTokenizer(
              @"D:\Data\Personal\AI\llm\tokenizer\cl100k.txt", true, RegexType.RegexCl100KBase);
          _tokenizer2 = BPETokenizer.CreateTokenizer(
              @"D:\Data\Personal\AI\llm\tokenizer\qw_tokenizer.json", false, RegexType.RegexCl100KBase);
      }

      // ====== 1. Declare the argument source ======
      public IEnumerable<string> TextSamples()
      {
          yield return TextCatalog.English;
          yield return TextCatalog.Chinese;
          yield return TextCatalog.Mixed;
      }

      // ====== 2. Every benchmark method takes the text as a parameter ======
      [Benchmark]
      [ArgumentsSource(nameof(TextSamples))]
      public int SharpToken_cl100k_base(string text)
      {
          var encoded = _sharpToken.Encode(text);
          var decoded = _sharpToken.Decode(encoded);
          return encoded.Count;
      }

      [Benchmark]
      [ArgumentsSource(nameof(TextSamples))]
      public int TiktokenSharp_cl100k_base(string text)
      {
          var encoded = _tikToken.Encode(text);
          var decoded = _tikToken.Decode(encoded);
          return encoded.Count;
      }

      [Benchmark(Baseline =true)]
      [ArgumentsSource(nameof(TextSamples))]
      public int LumTokenizer_cl100k_base(string text)
      {
          var encoded = _tokenizer1.Encode(text, false);
          var decoded = _tokenizer1.Decode(encoded, false);
          return encoded.Count;
      }
            
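      // No [Benchmark] attribute here, so BenchmarkDotNet does not run this method
      // (it does not appear in the results table either).
      public int LumTokenizer_qwen150k(string text)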
      {
          var encoded = _tokenizer2.Encode(text, false);
          var decoded = _tokenizer2.Decode(encoded, false);
          return encoded.Count;
      }
  }

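V. Unit Tests

Unit tests are becoming more and more important, because only with a thorough test suite can you safely let an AI optimize and modify existing code.
The unit tests of this BPE tokenizer project fall into 5 categories.

P0_BasicTest: basic tests covering the main functionality, such as encoding/decoding, data loading, and vocabulary completeness;
P1_RobustnessTests: robustness tests for edge cases, such as whitespace-only input, special-character-only input, very long text, and out-of-range ids;
P2_VocabBpeTests: encoding/decoding accuracy; the text must be split correctly and encoded exactly, with a set of specific cases acting as a safety net;
P3_ChineseSubwordTests: Chinese text tests, which also check the token compression ratio. The main concern is bugs introduced while coding that prevent bytes from being merged correctly for trailing bytes or for particular mixed-script input;
P4_EnglishSubwordTests: English text tests with the same purpose. With some bugs, decoding still works, but encoding may fall short of expectations (some merge steps are skipped, so the text is encoded into more tokens than it should be).

Encoding/decoding accuracy compared with commonly used libraries (for each library, the token ids are followed by the decoded text; the three cl100k_base implementations produce the same ids):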

LumTokenizer_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|>

SharpToken_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|>

TiktokenSharp_cl100k_base
34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|>


LumTokenizer_qwen150k
33555,59978,11,825,315,41382,594,75969,323,1429,72035,11088,11,10742,279,3364,315,279,45237,323,12011,12681,59978,11,879,64828,806,25079,11,438,566,1558,806,3527,37605,11,4092,311,51571,323,73471,13,59978,748,7901,438,264,6981,2922,360,3848,5561,323,806,1879,304,62255,323,30826,13,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645
King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.<|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|><|im_start|>hello  你好<|im_end|>

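VI. Closing

LumTokenizer is currently at version 1.0.6.1. Overall it works well and is fast and stable, and I use it for training my own models. Some common conventions are still hard-coded, but anyone who needs to can adapt and extend it. MiniGPT and MiniMind are both good Python projects for getting started with LLMs, but C# has essentially nothing comparable. The tokenizer is an essential part of LLM development in C#; unfortunately the .NET ecosystem still lags far behind, resources are scarce, AI-generated content all looks the same, and many existing libraries are updated slowly. Building an LLM in C# really is an uphill battle (and probably nobody does it).

If you found this useful, please support the series. Thank you again for reading. The code for this example, along with a more complete set of machine learning model examples, is fully open source; new readers can follow the official account (公众号) and reply "Tokenizer" to get the repository address and the complete code.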