You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>

This post covers how to handle the tokenizer's legacy mode and fast mode when using the mT5 model, how the two differ when encoding special tokens (such as </s>) and digits, and why you should adjust the legacy and use_fast parameters to fit your situation.



This is a warning message that comes up when loading the mT5 tokenizer.

The original code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small")

The full text of the warning:
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565

env_path/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.

Roughly, this means that loading the tokenizer this way uses the old (legacy) tokenizer, which has a bug: an extra space gets inserted after special tokens such as </s>. To get the fixed behavior you need to load it with legacy=False. In addition, the fast tokenizer does not contain the fix either, so if you want the fixed version you need to pass both legacy=False and use_fast=False.

If you are calling mT5 directly for inference, I recommend using the same tokenizer settings that were used at training time. If you are fine-tuning it yourself, I don't think it makes much difference, and I recommend legacy=False, use_fast=False. Whether there really is a difference depends on whether special tokens such as </s> appear in your text; if they don't, there is no difference (especially for Chinese text there should be none; see the tests below, where the settings only differ at the encoding stage).
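
If you want to verify this on your own data, here is a minimal sketch of such a check (the corpus list is just a placeholder for your own texts; the model path is the local mT5-small checkpoint used throughout this post). It checks whether any special-token strings appear literally in your texts, and whether the fast tokenizer produces <unk> tokens, which is what the byte-fallback warning is about:

from transformers import AutoTokenizer

# Minimal sketch: decide whether the legacy / use_fast choice matters for your data.
# The corpus below is a placeholder; replace it with your own texts.
slow_tokenizer = AutoTokenizer.from_pretrained(
    "/data/pretrained_models/mt5_small", legacy=False, use_fast=False
)
fast_tokenizer = AutoTokenizer.from_pretrained(
    "/data/pretrained_models/mt5_small", legacy=False, use_fast=True
)

corpus = [
    "没有任何特殊符号的部分",
    "加上有特殊符号比如</s>、数字202401111356的部分",
]

for text in corpus:
    # 1. Do any special-token strings (e.g. </s>) literally appear in the text?
    hits = [tok for tok in slow_tokenizer.all_special_tokens if tok in text]
    if hits:
        print(f"special tokens {hits} found in: {text!r}")
    # 2. Does the fast tokenizer fall back to <unk> on this text?
    if fast_tokenizer.unk_token_id in fast_tokenizer.encode(text):
        print(f"fast tokenizer produced <unk> for: {text!r}")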

Testing the output of the different settings:

tokenizer = AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small", legacy=True, use_fast=False)

a_sentence = "没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 259, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '▁', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
tokenizer = AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small", legacy=True, use_fast=True)

a_sentence = "没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
env_path/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 259, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '▁', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
tokenizer = AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small", legacy=False, use_fast=False)

a_sentence = "没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 292, 39542, 79806, 122631, 176372, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '、', '数字', '2024', '0111', '1356', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
tokenizer = AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small", legacy=False, use_fast=True)

a_sentence = "没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

env_path/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 259, 292, 39542, 79806, 122631, 176372, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '▁', '、', '数字', '2024', '0111', '1356', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
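
To reproduce the four comparisons above in one go, here is a compact sketch that loops over all (legacy, use_fast) combinations, using the same local model path and test sentence as above:

from itertools import product
from transformers import AutoTokenizer

a_sentence = "没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"

# Loop over the four combinations tested above and print their encodings.
for legacy, use_fast in product([True, False], [False, True]):
    tokenizer = AutoTokenizer.from_pretrained(
        "/data/pretrained_models/mt5_small", legacy=legacy, use_fast=use_fast
    )
    print(f"legacy={legacy}, use_fast={use_fast}")
    print(tokenizer.encode(a_sentence))
    print(tokenizer.tokenize(a_sentence))
    print(tokenizer.decode(tokenizer.encode(a_sentence)))
    print()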

References:

  1. ⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️ by ArthurZucker · Pull Request #24565 · huggingface/transformers
  2. Using transformers legacy tokenizer · Issue #305 · OpenAccess-AI-Collective/axolotl
  3. Slow Tokenizer adds whitespace after special token · Issue #25073 · huggingface/transformers