/home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
W0703 16:30:36.069853 3914856 torch/distributed/run.py:766]
W0703 16:30:36.069853 3914856 torch/distributed/run.py:766] *****************************************
W0703 16:30:36.069853 3914856 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0703 16:30:36.069853 3914856 torch/distributed/run.py:766] *****************************************
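torchrun pins OMP_NUM_THREADS to 1 for every worker unless the variable is already set, so CPU-bound steps (tokenization, dataset conversion) run single-threaded. A minimal sketch of overriding it, assuming the 4 local workers used in this run; set it before torch and friends are imported, or export it in the shell that launches llamafactory-cli:

```python
import os

# Illustrative tuning only: give each of the 4 local ranks a share of the cores
# instead of torchrun's default of 1 OpenMP thread per process.
workers = 4  # matches --nproc_per_node in the torchrun command at the end of this log
os.environ.setdefault("OMP_NUM_THREADS", str(max(1, (os.cpu_count() or 1) // workers)))
```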
/home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,321 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file chat_template.jinja
/home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
/home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
/home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
[INFO|tokenization_utils_base.py:2313] 2025-07-03 16:30:43,904 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-07-03 16:30:43,913 >> loading configuration file /mnt/data1/models/1.5B/config.json
[INFO|configuration_utils.py:771] 2025-07-03 16:30:43,919 >> Model config Qwen2Config {
"_name_or_path": "/mnt/data1/models/1.5B",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 1536,
"initializer_range": 0.02,
"intermediate_size": 8960,
"max_position_embeddings": 131072,
"max_window_layers": 21,
"model_type": "qwen2",
"num_attention_heads": 12,
"num_hidden_layers": 28,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000,
"sliding_window": 4096,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.49.0",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 151936
}
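The local checkpoint at /mnt/data1/models/1.5B resolves to the Qwen2 1.5B configuration printed above. To sanity-check the model directory outside of the training run, a short sketch with the standard transformers loaders (only the path comes from the log; the rest is illustrative):

```python
from transformers import AutoConfig, AutoTokenizer

model_path = "/mnt/data1/models/1.5B"  # path taken from the log above

config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Should echo the config printed above: qwen2, hidden_size 1536, 28 layers.
print(config.model_type, config.hidden_size, config.num_hidden_layers)
print("vocab_size:", config.vocab_size, "tokenizer length:", len(tokenizer))
```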
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-07-03 16:30:44,493 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[rank1]:[W703 16:30:45.102845887 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[rank2]:[W703 16:30:45.126706430 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[rank3]:[W703 16:30:45.136836682 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 120 examples [00:00, 6525.39 examples/s]
Converting format of dataset (num_proc=16): 0%| | 0/120 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16): 0%| | 0/120 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16): 0%| | 0/120 [00:00<?, ? examples/s]
/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[rank0]:[W703 16:31:05.679961201 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
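Each rank prints this NCCL notice because the process group is created by the training stack before a CUDA device has been bound to the process; it is only a warning, not the failure that follows. The fix the message itself suggests, passing device_id to init_process_group, looks roughly like this sketch (assuming a PyTorch version that accepts device_id and the LOCAL_RANK variable torchrun sets for each worker):

```python
import os
import torch
import torch.distributed as dist

# Sketch of the warning's own suggestion: bind the rank to its GPU up front.
local_rank = int(os.environ["LOCAL_RANK"])  # provided by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),
)
```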
[rank0]: multiprocess.pool.RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 688, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3501, in _map_single
[rank0]: for i, example in iter_outputs(shard_iterable):
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3475, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3398, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/converter.py", line 94, in __call__
[rank0]: if self.dataset_attr.prompt and example[self.dataset_attr.prompt]:
[rank0]: ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 278, in __getitem__
[rank0]: value = self.data[key]
[rank0]: ~~~~~~~~~^^^^^
[rank0]: KeyError: 'instruction'
[rank0]: """
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/train/tuner.py", line 72, in _training_function
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/loader.py", line 304, in get_dataset
[rank0]: dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/loader.py", line 182, in _get_merged_dataset
[rank0]: datasets[dataset_name] = _load_single_dataset(dataset_attr, model_args, data_args, training_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/loader.py", line 162, in _load_single_dataset
[rank0]: return align_dataset(dataset, dataset_attr, data_args, training_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/converter.py", line 279, in align_dataset
[rank0]: return dataset.map(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3171, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 728, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 728, in <listcomp>
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
[rank0]: raise self._value
[rank0]: KeyError: 'instruction'
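This KeyError is the real failure: LLaMA-Factory's dataset converter (converter.py, line 94 in the trace) indexes each example by the column registered as the prompt, which defaults to "instruction" for alpaca-style data, and the dataset being loaded has no such column. A quick, illustrative way to confirm; the file path and the expected key set below are assumptions to adapt to the dataset you registered:

```python
import json

data_path = "/home/wiseatc/LLaMA-Factory/data/my_sft_data.json"  # hypothetical file
expected = {"instruction", "input", "output"}  # default alpaca-style columns

with open(data_path, encoding="utf-8") as f:
    records = json.load(f)

# Report any record that lacks the columns the converter will index.
for i, rec in enumerate(records):
    missing = expected - rec.keys()
    if missing:
        print(f"record {i} missing keys: {sorted(missing)}")
```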
[rank0]:[W703 16:31:06.912491219 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0703 16:31:07.960560 3914856 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3914916 closing signal SIGTERM
W0703 16:31:07.961188 3914856 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3914917 closing signal SIGTERM
W0703 16:31:07.961536 3914856 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3914918 closing signal SIGTERM
E0703 16:31:08.371267 3914856 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 3914915) of binary: /usr/bin/python3.11
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-07-03_16:31:07
host : wiseatc-Super-Server
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3914915)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
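The summary reports error_file: <N/A>, so no structured traceback was captured for the failed rank; the linked elastic errors page describes decorating the per-rank entrypoint with @record for that. A self-contained illustration (the functions below are placeholders, not LLaMA-Factory code):

```python
from torch.distributed.elastic.multiprocessing.errors import record


def run_training() -> None:
    # Placeholder for the real per-rank work (e.g. LLaMA-Factory's run_exp()).
    raise KeyError("instruction")  # stand-in failure to show what gets recorded


@record  # writes this rank's exception to an error file that torchrun reports
def main() -> None:
    run_training()


if __name__ == "__main__":
    main()
```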
Traceback (most recent call last):
File "/home/wiseatc/.local/bin/llamafactory-cli", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/wiseatc/LLaMA-Factory/src/llamafactory/cli.py", line 130, in main
process = subprocess.run(
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 569, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '4', '--master_addr', '127.0.0.1', '--master_port', '41919', '/home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py', 'saves/DeepSeek-R1-1.5B-Distill/lora/train_2025-07-03-16-29-46/training_args.yaml']' returned non-zero exit status 1.
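Given the root cause, the fix is on the data side rather than in the launch command: either rename the dataset fields to instruction/input/output, or register a column mapping for the dataset in LLaMA-Factory's data/dataset_info.json. A hedged sketch of adding such a mapping; the dataset name, file name, and right-hand column names are assumptions about your data:

```python
import json
from pathlib import Path

info_path = Path("/home/wiseatc/LLaMA-Factory/data/dataset_info.json")
info = json.loads(info_path.read_text(encoding="utf-8"))

# Map LLaMA-Factory's alpaca roles onto the columns actually present in the file.
info["my_sft_data"] = {
    "file_name": "my_sft_data.json",
    "columns": {
        "prompt": "question",   # column holding the instruction text
        "query": "context",     # optional extra-input column
        "response": "answer",   # column holding the target output
    },
}

info_path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
```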