Analyze the cause of the following error when resuming InternVL training from a checkpoint, and provide a fix. Full torchrun log:
[INFO|deepspeed.py:400] 2025-11-06 08:03:17,906 >> Attempting to resume from work_dirs/InternVL_1b_remote_sensing_ViTP_03NTL_WAS_digit_weight=2n+1_loss/checkpoint-6000
[2025-11-06 08:03:17,907] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from work_dirs/InternVL_1b_remote_sensing_ViTP_03NTL_WAS_digit_weight=2n+1_loss/checkpoint-6000/global_step6000/mp_rank_00_model_states.pt...
[2025-11-06 08:03:21,497] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from work_dirs/InternVL_1b_remote_sensing_ViTP_03NTL_WAS_digit_weight=2n+1_loss/checkpoint-6000/global_step6000/mp_rank_00_model_states.pt.
[2025-11-06 08:03:21,631] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from work_dirs/InternVL_1b_remote_sensing_ViTP_03NTL_WAS_digit_weight=2n+1_loss/checkpoint-6000/global_step6000/mp_rank_00_model_states.pt...
[2025-11-06 08:03:22,980] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from work_dirs/InternVL_1b_remote_sensing_ViTP_03NTL_WAS_digit_weight=2n+1_loss/checkpoint-6000/global_step6000/mp_rank_00_model_states.pt.
[2025-11-06 08:03:23,347] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from work_dirs/InternVL_1b_remote_sensing_ViTP_03NTL_WAS_digit_weight=2n+1_loss/checkpoint-6000/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[rank1]: Traceback (most recent call last):
[rank1]: File "/filesdir/ZZH/ViTP/ViTP/internvl/train/internvl_chat_finetune.py", line 1161, in <module>
[rank1]: main()
[rank1]: File "/filesdir/ZZH/ViTP/ViTP/internvl/train/internvl_chat_finetune.py", line 1146, in main
[rank1]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
[rank1]: return inner_training_loop(
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
[rank1]: deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
[rank1]: load_path, _ = deepspeed_engine.load_checkpoint(
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2770, in load_checkpoint
[rank1]: success = self._load_zero_checkpoint(load_dir, tag, load_optimizer_states=load_optimizer_states)
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2955, in _load_zero_checkpoint
[rank1]: zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3030, in _get_all_zero_checkpoints
[rank1]: return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3009, in _get_all_zero_checkpoint_state_dicts
[rank1]: _state = self.checkpoint_engine.load(
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 28, in load
[rank1]: partition = torch.load(path, map_location=map_location)
[rank1]: File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/torch/serialization.py", line 1529, in load
[rank1]: raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank1]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
[rank1]: (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank1]: (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank1]: WeightsUnpickler error: Unsupported global: GLOBAL deepspeed.runtime.fp16.loss_scaler.LossScaler was not an allowed global by default. Please use `torch.serialization.add_safe_globals([deepspeed.runtime.fp16.loss_scaler.LossScaler])` or the `torch.serialization.safe_globals([deepspeed.runtime.fp16.loss_scaler.LossScaler])` context manager to allowlist this global if you trust this class/function.
[rank1]: Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
[rank0], [rank2], [rank3]: (identical traceback and _pickle.UnpicklingError as rank1 above, repeated once per rank)
[rank0]:[W1106 08:03:25.600984408 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1106 08:03:26.770491 1735 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1813 closing signal SIGTERM
W1106 08:03:26.771584 1735 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1814 closing signal SIGTERM
W1106 08:03:26.771985 1735 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1815 closing signal SIGTERM
E1106 08:03:27.887712 1735 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 1812) of binary: /root/miniconda3/envs/ViTP/bin/python3.9
Traceback (most recent call last):
File "/root/miniconda3/envs/ViTP/bin/torchrun", line 7, in <module>
sys.exit(main())
File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/ViTP/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
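Cause: every rank dies inside DeepSpeed's torch_checkpoint_engine while reading the ZeRO optimizer shard (bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt). As the error message itself states, PyTorch 2.6 flipped the default of `torch.load` from `weights_only=False` to `weights_only=True`, and in weights-only mode the unpickler only accepts an allowlisted set of types. The optimizer shard contains a pickled `deepspeed.runtime.fp16.loss_scaler.LossScaler` object, which is not on that allowlist, so `torch.load` raises `_pickle.UnpicklingError` and torchrun tears the job down.

Fix option 1 (keep `weights_only=True`): allowlist the class the error names before training resumes. A minimal sketch, placed near the top of internvl/train/internvl_chat_finetune.py (anywhere that runs before `trainer.train(resume_from_checkpoint=checkpoint)`); if loading then fails on further "Unsupported global" errors, add those classes to the list the same way:

```python
import torch
from deepspeed.runtime.fp16.loss_scaler import LossScaler

# Allowlist the DeepSpeed class named in the UnpicklingError so that
# torch.load(..., weights_only=True) is permitted to unpickle the ZeRO
# optimizer shard that contains a serialized LossScaler instance.
torch.serialization.add_safe_globals([LossScaler])
```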
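Fix option 2 (restore the pre-2.6 behavior): since the checkpoint under work_dirs/ was written by this very training run, it can reasonably be treated as trusted, and `torch.load` can be forced back to `weights_only=False`. A hypothetical monkeypatch sketch, also placed before `trainer.train()` runs; note it changes the default for the whole process, so prefer option 1 if you only want to touch checkpoint loading:

```python
import functools
import torch

# The checkpoint was produced by this training run, so we treat it as
# trusted and restore the pre-PyTorch-2.6 default. functools.partial only
# supplies a default: callers that pass weights_only explicitly override it.
torch.load = functools.partial(torch.load, weights_only=False)
```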
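A version-level alternative, if patching the training script is undesirable: upgrading DeepSpeed may resolve this on its own, since more recent releases account for the PyTorch 2.6 `weights_only` default when loading their own checkpoints. For a narrower scope than option 2, PyTorch also offers the `torch.serialization.safe_globals([...])` context manager mentioned in the error, which limits the allowlist to a single load.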