(rgb) zq@DESKTOP-57JG85J:~/MiLNet$ CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=2 train_mf.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2025-06-28 17:27:24 | Conf | use logdir run/2025-06-28-17-27(irseg-M0)
/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/mmcv/__init__.py:21: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
'On January 1, 2023, MMCV will release v2.0.0, in which it will remove '
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb7337d5457 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb73379f3ec in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7fb75e845c64 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e3e5 (0x7fb75e81d3e5 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7fb75e820054 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4f6823 (0x7fb789725823 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7fb7337b59e0 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb7337b5af9 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #8: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7fb789727d1b in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xe0847 (0x7fb779ce2847 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: <unknown function> + 0x5216f35 (0x7fb763b13f35 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x521caba (0x7fb763b19aba in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: c10d::ops::allgather(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10d::AllgatherOptions const&) + 0x144 (0x7fb763b17524 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<std::weak_ptr<c10d::Logger> > const&) + 0x310 (0x7fb763b752f0 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xbdac6d (0x7fb789e09c6d in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3e5a3a (0x7fb789614a3a in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #16: _PyMethodDef_RawFastCallKeywords + 0x237 (0x4bb4e7 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #17: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4baf40]
frame #18: _PyEval_EvalFrameDefault + 0x469a (0x4b793a in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #19: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x29c (0x4c638c in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #21: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4bae2f]
frame #22: _PyEval_EvalFrameDefault + 0x971 (0x4b3c11 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #23: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #24: _PyFunction_FastCallDict + 0x2d6 (0x4cd006 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #25: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4d162e]
frame #26: _PyObject_FastCallKeywords + 0x19d (0x4c39ed in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #27: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4baf69]
frame #28: _PyEval_EvalFrameDefault + 0x15d2 (0x4b4872 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #29: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #30: _PyFunction_FastCallKeywords + 0x29c (0x4c638c in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #31: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4bae2f]
frame #32: _PyEval_EvalFrameDefault + 0x971 (0x4b3c11 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #33: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #34: PyEval_EvalCodeEx + 0x39 (0x4b1e39 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #35: PyEval_EvalCode + 0x1b (0x5537fb in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #36: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x56cfa3]
frame #37: PyRun_FileExFlags + 0x97 (0x573107 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #38: PyRun_SimpleFileExFlags + 0x184 (0x572974 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #39: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x549353]
frame #40: _Py_UnixMain + 0x3c (0x548fec in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #41: __libc_start_main + 0xf3 (0x7fb7aa7a4083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x548e9e]
what(): CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f60e305f457 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f60e30293ec in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f610e0cfc64 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e3e5 (0x7f610e0a73e5 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f610e0aa054 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4f6823 (0x7f6138faf823 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f60e303f9e0 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f60e303faf9 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #8: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7f6138fb1d1b in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xe0847 (0x7f612956c847 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: <unknown function> + 0x5216f35 (0x7f611339df35 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x521caba (0x7f61133a3aba in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: c10d::ops::allgather(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10d::AllgatherOptions const&) + 0x144 (0x7f61133a1524 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<std::weak_ptr<c10d::Logger> > const&) + 0x310 (0x7f61133ff2f0 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xbdac6d (0x7f6139693c6d in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3e5a3a (0x7f6138e9ea3a in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #16: _PyMethodDef_RawFastCallKeywords + 0x237 (0x4bb4e7 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #17: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4baf40]
frame #18: _PyEval_EvalFrameDefault + 0x469a (0x4b793a in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #19: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x29c (0x4c638c in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #21: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4bae2f]
frame #22: _PyEval_EvalFrameDefault + 0x971 (0x4b3c11 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #23: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #24: _PyFunction_FastCallDict + 0x2d6 (0x4cd006 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #25: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4d162e]
frame #26: _PyObject_FastCallKeywords + 0x19d (0x4c39ed in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #27: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4baf69]
frame #28: _PyEval_EvalFrameDefault + 0x15d2 (0x4b4872 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #29: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #30: _PyFunction_FastCallKeywords + 0x29c (0x4c638c in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #31: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4bae2f]
frame #32: _PyEval_EvalFrameDefault + 0x971 (0x4b3c11 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #33: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #34: PyEval_EvalCodeEx + 0x39 (0x4b1e39 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #35: PyEval_EvalCode + 0x1b (0x5537fb in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #36: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x56cfa3]
frame #37: PyRun_FileExFlags + 0x97 (0x573107 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #38: PyRun_SimpleFileExFlags + 0x184 (0x572974 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #39: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x549353]
frame #40: _Py_UnixMain + 0x3c (0x548fec in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #41: __libc_start_main + 0xf3 (0x7f615a02e083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x548e9e]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 227528) of binary: /home/zq/anaconda3/envs/rgb/bin/python3.7
Traceback (most recent call last):
  File "/home/zq/anaconda3/envs/rgb/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_mf.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2025-06-28_17:27:32
  host      : DESKTOP-57JG85J.
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 227529)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 227529
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-06-28_17:27:32
  host      : DESKTOP-57JG85J.
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 227528)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 227528
=======================================================