Importing a PyTorch wheel built against an older CUDA version than the GPU requires triggers a warning at import time; the console output ends with the source line that raised it (the 9000 here stands for the required CUDA 9.0):

warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
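
To confirm which CUDA toolkit the currently installed wheel was compiled against, a minimal diagnostic is to print the compile-time CUDA string (getattr is used defensively, since torch.version.cuda may not be present in every old build):

python -c "import torch; print(torch.__version__, getattr(torch.version, 'cuda', 'unknown'))"

For the default pip wheel this shows 0.3.1 with a CUDA 8.0 string, which is what causes the warning above on a machine that needs CUDA 9.0.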

This post describes how to install PyTorch 0.3.1 for CUDA 9.0 with pip: first uninstall any existing PyTorch, then install the wheel for your Python version (here Python 3.6) directly from its URL, which keeps the build compatible with that interpreter.


Refer to this link.

Running pip install torch==0.3.1 directly installs the CUDA 8.0 build; to get the CUDA 9.0 build, use the following command instead:
pip install http://download.pytorch.org/whl/cu90/torch-0.3.1-cp35-cp35m-linux_x86_64.whl

Before installing, remember to uninstall the previously installed torch, then install. In cp35-cp35m, the 35 refers to the Python version. I am using Python 3.6, so I changed it to cp36-cp36m.
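
Putting it together for Python 3.6, the full sequence looks like this (the cp36 URL is simply the cp35 URL above with the Python tag swapped, as described; check download.pytorch.org if you need a different tag):

pip uninstall torch  # remove the CUDA 8.0 build pulled in by pip install torch==0.3.1
pip install http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl  # CUDA 9.0 build for Python 3.6

After installation, re-running the diagnostic from the top of the post should show a 9.x CUDA string, and torch.cuda.is_available() should return True on a machine with a working CUDA 9.0 driver:

python -c "import torch; print(torch.__version__, getattr(torch.version, 'cuda', 'unknown'), torch.cuda.is_available())"

If the CUDA-version warning no longer appears when importing torch, the cu90 wheel was picked up correctly.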
