Traceback (most recent call last):
File "/data16/jiugan/code/DEIM-514/train.py", line 93, in <module>
main(args)
File "/data16/jiugan/code/DEIM-514/train.py", line 64, in main
solver.fit(cfg_str)
File "/data16/jiugan/code/DEIM-514/engine/solver/det_solver.py", line 86, in fit
train_stats = train_one_epoch(
^^^^^^^^^^^^^^^^
File "/data16/jiugan/code/DEIM-514/engine/solver/det_engine.py", line 112, in train_one_epoch
outputs = model(samples, targets=targets)  # forward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
else self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/jiugan/code/DEIM-514/engine/deim/deim.py", line 29, in forward
x = self.decoder(x, targets)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/jiugan/code/DEIM-514/engine/deim/dfine_decoder.py", line 827, in forward
self._get_decoder_input(memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/jiugan/code/DEIM-514/engine/deim/dfine_decoder.py", line 745, in _get_decoder_input
anchors, valid_mask = self._generate_anchors(spatial_shapes, device=memory.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/jiugan/code/DEIM-514/engine/deim/dfine_decoder.py", line 731, in _generate_anchors
anchors = torch.concat(anchors, dim=1).to(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
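
The Python traceback ends here; everything below is fallout from the same failure. Two things are worth noting. First, as the message itself says, CUDA kernel errors are reported asynchronously, so the frame blamed (torch.concat inside _generate_anchors) is often not where the assert actually fired; rerunning with CUDA_LAUNCH_BLOCKING=1 set in the environment makes the stack trace point at the real op. Second, in DETR-style detectors a device-side assert during training is very commonly an out-of-range class index in the targets (a label < 0 or >= num_classes). Below is a minimal pre-flight check, assuming each target is a dict carrying a 'labels' LongTensor and that num_classes is known from the config; both are assumptions about DEIM's data format, so adapt as needed:

    # Hypothetical sanity check: call on each batch before the forward pass.
    # Assumes targets is a list of dicts with a 'labels' tensor (an assumption
    # about the DEIM dataloader output, not its confirmed format).
    def check_target_labels(targets, num_classes):
        for i, t in enumerate(targets):
            labels = t['labels']
            if labels.numel() == 0:
                continue
            lo, hi = labels.min().item(), labels.max().item()
            if lo < 0 or hi >= num_classes:
                raise ValueError(
                    f'sample {i}: labels out of range [0, {num_classes}): '
                    f'min={lo}, max={hi}'
                )

If this check raises, the fix is in the dataset's category mapping, not in the decoder where the assert happens to surface.
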
[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1711403408687/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46a4f80d87 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f46a4f3175f in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f46a60628a8 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f46416909ec in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4641694b08 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f464169823a in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f4641698e79 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd8198 (0x7f469e6eb198 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7f46a7094b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7f46a7126a00 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1711403408687/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46a4f80d87 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f46a4f3175f in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f46a60628a8 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f46416909ec in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4641694b08 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f464169823a in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f4641698e79 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd8198 (0x7f469e6eb198 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7f46a7094b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7f46a7126a00 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1711403408687/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46a4f80d87 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdef733 (0x7f46413ef733 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd8198 (0x7f469e6eb198 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: <unknown function> + 0x94b43 (0x7f46a7094b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126a00 (0x7f46a7126a00 in /lib/x86_64-linux-gnu/libc.so.6)
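
The ProcessGroupNCCL frames above are a secondary symptom: once the device-side assert poisons the CUDA context, the NCCL watchdog's completion check hits the same error and terminates the process, which produces the SIGABRT (exit code -6) reported below. NCCL and DDP are not the root cause, so a quick way to get a cleaner traceback is to reproduce on a single GPU without torchrun, e.g.

    CUDA_LAUNCH_BLOCKING=1 python train.py <same arguments>

(assuming train.py can be launched directly with the same arguments, which is an assumption about this repo's CLI). If the assert still fires on one GPU, the problem is in the data or model, not the distributed setup.
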
[2025-05-28 13:49:53,599] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 639748) of binary: /data16/home/zjl/miniconda3/envs/deim/bin/python
Traceback (most recent call last):
File "/data16/home/zjl/miniconda3/envs/deim/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-05-28_13:49:53
host : cv147cv
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 639748)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 639748
=======================================================报错
最新发布
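
If the synchronous rerun still leaves the failing op ambiguous, one more option is to push a single batch through the model on CPU, where the same out-of-range index surfaces as an ordinary Python IndexError with an exact message instead of a device-side assert. A sketch, assuming samples is a tensor and targets is a list of dicts of tensors (assumptions about DEIM's batch format):

    # Hypothetical CPU repro for one batch: run in place of the normal
    # forward pass inside the training loop, then read the Python error.
    # model.module unwraps the DDP wrapper seen in the traceback above.
    model_cpu = model.module.cpu() if hasattr(model, 'module') else model.cpu()
    samples_cpu = samples.cpu()
    targets_cpu = [{k: (v.cpu() if hasattr(v, 'cpu') else v) for k, v in t.items()}
                   for t in targets]
    outputs = model_cpu(samples_cpu, targets=targets_cpu)  # fails with a readable error
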