2025-07-07 16:19:59.098 | INFO | __main__:prepare_tokenizer:22 - Loading tokenizer from HuggingFace...
2it [00:00, 4999.17it/s]
0%| | 0/2 [00:05<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "D:\software\python\Lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "D:\DATAJUICER\data-juicer-main\tools\postprocess\count_token.py", line 16, in count_token_single
num += len(TOKENIZER.tokenize(sample[key]))
^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'tokenize'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\DATAJUICER\data-juicer-main\tools\postprocess\count_token.py", line 61, in <module>
fire.Fire(main)
File "D:\software\python\Lib\site-packages\fire\core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\fire\core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\fire\core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "D:\DATAJUICER\data-juicer-main\tools\postprocess\count_token.py", line 55, in main
token_count += res.get()
^^^^^^^^^
File "D:\software\python\Lib\multiprocessing\pool.py", line 774, in get
raise self._value
AttributeError: 'NoneType' object has no attribute 'tokenize'
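The `'NoneType' object has no attribute 'tokenize'` raised inside the pool worker suggests the module-level `TOKENIZER` set by `prepare_tokenizer` in the parent process is not visible to the workers: on Windows, `multiprocessing` uses the `spawn` start method, so each worker re-imports the module and sees the initial `TOKENIZER = None`. One possible fix is to load the tokenizer in a pool `initializer` so every worker gets its own copy. A minimal sketch of the pattern — `str.split` here is a stand-in for the real `AutoTokenizer.from_pretrained` call, and the path is illustrative:

```python
import multiprocessing as mp

TOKENIZER = None  # set per-process by the pool initializer, never inherited


def init_worker(tokenizer_path):
    # Runs once in each worker process. Under the "spawn" start method
    # (the default on Windows), workers re-import this module, so a
    # global assigned in the parent is NOT inherited -- each worker
    # must load its own tokenizer here.
    global TOKENIZER
    # Stand-in for: AutoTokenizer.from_pretrained(tokenizer_path, ...)
    TOKENIZER = str.split


def count_tokens(text):
    # Safe to use TOKENIZER: init_worker ran before any task in this process.
    return len(TOKENIZER(text))


def count_all(texts, num_proc=2):
    with mp.Pool(num_proc, initializer=init_worker,
                 initargs=("D:/DATAJUICER/gpt2",)) as pool:
        return sum(pool.map(count_tokens, texts))


if __name__ == "__main__":
    print(count_all(["a b c", "d e"]))  # 5
```

Passing `initializer`/`initargs` to `Pool` is the standard way to set up per-process state; relying on module globals assigned after import only works under the `fork` start method on Unix.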
PS D:\DATAJUICER> python data-juicer-main/tools/postprocess/count_token.py `
>> --data_path token.jsonl `
>> --text_keys text `
>> --tokenizer_method ' D:/DATAJUICER/gpt2' `
>> --num_proc 1
2025-07-07 16:21:56.465 | INFO | __main__:prepare_tokenizer:22 - Loading tokenizer from HuggingFace...
Traceback (most recent call last):
File "D:\software\python\Lib\site-packages\transformers\utils\hub.py", line 470, in cached_files
hf_hub_download(
File "D:\software\python\Lib\site-packages\huggingface_hub\utils\_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "D:\software\python\Lib\site-packages\huggingface_hub\utils\_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': ' D:/DATAJUICER/gpt2'. Use `repo_type` argument if needed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\DATAJUICER\data-juicer-main\tools\postprocess\count_token.py", line 61, in <module>
fire.Fire(main)
File "D:\software\python\Lib\site-packages\fire\core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\fire\core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\fire\core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "D:\DATAJUICER\data-juicer-main\tools\postprocess\count_token.py", line 35, in main
prepare_tokenizer(tokenizer_method)
File "D:\DATAJUICER\data-juicer-main\tools\postprocess\count_token.py", line 23, in prepare_tokenizer
TOKENIZER = AutoTokenizer.from_pretrained(tokenizer_method, trust_remote_code=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 982, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 814, in get_tokenizer_config
resolved_config_file = cached_file(
^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\transformers\utils\hub.py", line 312, in cached_file
file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\transformers\utils\hub.py", line 523, in cached_files
_get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\transformers\utils\hub.py", line 140, in _get_cache_file_to_return
resolved_file = try_to_load_from_cache(path_or_repo_id, full_filename, cache_dir=cache_dir, revision=revision)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\software\python\Lib\site-packages\huggingface_hub\utils\_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "D:\software\python\Lib\site-packages\huggingface_hub\utils\_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': ' D:/DATAJUICER/gpt2'. Use `repo_type` argument if needed.
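The `HFValidationError` in the second run comes from the leading space inside the quotes: `' D:/DATAJUICER/gpt2'` is not an existing directory, so transformers falls back to treating the string as a Hub repo id, which `validate_repo_id` rejects. Re-running with `--tokenizer_method 'D:/DATAJUICER/gpt2'` (no leading space) should avoid this. A defensive sketch that normalizes the argument before it reaches `from_pretrained` — the helper name is hypothetical, not part of data-juicer:

```python
def normalize_tokenizer_method(tokenizer_method: str) -> str:
    # ' D:/DATAJUICER/gpt2' (note the leading space) fails the local
    # os.path.isdir check inside transformers, so the string is sent to
    # the Hub as a repo id and validate_repo_id raises HFValidationError.
    # Stripping whitespace lets the local-directory branch match.
    return tokenizer_method.strip()


print(normalize_tokenizer_method(" D:/DATAJUICER/gpt2"))  # D:/DATAJUICER/gpt2
```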