Build PyRun/Python from source

Quite easy on Ubuntu.

 

1: Build and install SQLite first (use the latest sqlite-autoconf source tarball).

2: apt-get install libbz2-dev

3: apt-get install zlib1g-dev

4: apt-get install libssl-dev

5: apt-get install libffi-dev

6: apt-get install libreadline-dev

7: apt-get install libncurses5-dev

 

For PyRun:

cd into the PyRun directory and run make.

 

For Python:

./configure

make 

make install

Pass --with-pydebug to ./configure for a debug build.
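After installing, it is worth confirming that the optional modules were actually built, since configure silently skips any whose dev package it cannot find. A minimal check, assuming the freshly built interpreter is the one you run it with:

```python
# Each of these stdlib modules depends on one of the dev packages above:
# sqlite3 -> sqlite, bz2 -> libbz2-dev, zlib -> zlib1g-dev, ssl -> libssl-dev,
# ctypes -> libffi-dev, readline -> libreadline-dev, curses -> libncurses5-dev.
import importlib

for mod in ("sqlite3", "bz2", "zlib", "ssl", "ctypes", "readline", "curses"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as exc:
        print(f"{mod}: MISSING ({exc})")
```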

 

enjoy.

A separate note (06-29): a two-GPU torchrun launch of train_mf.py aborts with a CUDA illegal memory access:

```
(rgb) zq@DESKTOP-57JG85J:~/MiLNet$ CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=2 train_mf.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2025-06-28 17:27:24 | Conf | use logdir run/2025-06-28-17-27(irseg-M0)
/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/mmcv/__init__.py:21: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  'On January 1, 2023, MMCV will release v2.0.0, in which it will remove '
[the same UserWarning is printed by the second rank]
terminate called after throwing an instance of 'c10::Error'   (printed by both ranks, output interleaved)
  what():  CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb7337d5457 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb73379f3ec in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7fb75e845c64 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e3e5 (0x7fb75e81d3e5 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7fb75e820054 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4f6823 (0x7fb789725823 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7fb7337b59e0 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb7337b5af9 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #8: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7fb789727d1b in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xe0847 (0x7fb779ce2847 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: <unknown function> + 0x5216f35 (0x7fb763b13f35 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x521caba (0x7fb763b19aba in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: c10d::ops::allgather(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10d::AllgatherOptions const&) + 0x144 (0x7fb763b17524 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<std::weak_ptr<c10d::Logger> > const&) + 0x310 (0x7fb763b752f0 in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xbdac6d (0x7fb789e09c6d in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3e5a3a (0x7fb789614a3a in /home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #16: _PyMethodDef_RawFastCallKeywords + 0x237 (0x4bb4e7 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #17: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4baf40]
frame #18: _PyEval_EvalFrameDefault + 0x469a (0x4b793a in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #19: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x29c (0x4c638c in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #21: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4bae2f]
frame #22: _PyEval_EvalFrameDefault + 0x971 (0x4b3c11 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #23: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #24: _PyFunction_FastCallDict + 0x2d6 (0x4cd006 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #25: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4d162e]
frame #26: _PyObject_FastCallKeywords + 0x19d (0x4c39ed in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #27: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4baf69]
frame #28: _PyEval_EvalFrameDefault + 0x15d2 (0x4b4872 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #29: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #30: _PyFunction_FastCallKeywords + 0x29c (0x4c638c in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #31: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x4bae2f]
frame #32: _PyEval_EvalFrameDefault + 0x971 (0x4b3c11 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #33: _PyEval_EvalCodeWithName + 0x201 (0x4b2041 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #34: PyEval_EvalCodeEx + 0x39 (0x4b1e39 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #35: PyEval_EvalCode + 0x1b (0x5537fb in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #36: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x56cfa3]
frame #37: PyRun_FileExFlags + 0x97 (0x573107 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #38: PyRun_SimpleFileExFlags + 0x184 (0x572974 in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #39: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x549353]
frame #40: _Py_UnixMain + 0x3c (0x548fec in /home/zq/anaconda3/envs/rgb/bin/python3.7)
frame #41: __libc_start_main + 0xf3 (0x7fb7aa7a4083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: /home/zq/anaconda3/envs/rgb/bin/python3.7() [0x548e9e]

[the second rank aborts with the same c10::Error and an essentially identical stack trace]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 227528) of binary: /home/zq/anaconda3/envs/rgb/bin/python3.7
Traceback (most recent call last):
  File "/home/zq/anaconda3/envs/rgb/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zq/anaconda3/envs/rgb/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_mf.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2025-06-28_17:27:32
  host      : DESKTOP-57JG85J.
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 227529)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 227529
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-06-28_17:27:32
  host      : DESKTOP-57JG85J.
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 227528)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 227528
=======================================================
```
In PyTorch distributed training, the error **CUDA error: an illegal memory access was encountered** usually means the program tried to touch an invalid or unallocated region of GPU memory. It can have several root causes, including badly managed tensor lifetimes, incorrect distributed initialization, faulty data-loading logic, and wrapping the model for parallelism in the wrong order.

### Illegal memory access in CUDA tensor operations

Illegal memory access often surfaces during tensor operations, for example when some parameters have no `.grad` during backpropagation. In that case `clip_grad_norm_` or other gradient-processing code can fail when it calls `torch.norm(p.grad.detach(), norm_type)`[^1]. To narrow this down, check after the backward pass (and before clipping) that every parameter actually received a gradient:

```python
# Run after loss.backward(): list parameters that received no gradient.
for name, p in model.named_parameters():
    if p.grad is None:
        print(f"Parameter {name} has no gradient")
```

A small debugging helper such as the `printTensor` function below can also be used to print a tensor's shape, device, and first few values, to confirm that dimensions and device placement are what you expect[^3]:

```python
def printTensor(t, tag: str):
    # Walk down to the innermost dimension and show at most three values.
    sz = t.size()
    p = t
    for _ in range(len(sz) - 1):
        p = p[0]
    if len(p) > 3:
        p = p[:3]
    print('\t%s.size' % tag, t.size(), ' dev :', t.device, ": ", p.data)
```

### Distributed initialization and process-group configuration

With `DistributedDataParallel`, a badly initialized process group can also end in illegal memory access. Make sure every process gets the correct `rank` and `world_size`, that `dist.init_process_group` is called before the model is wrapped, and that each rank is pinned to its own GPU:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)           # pin this rank to its own GPU
dist.init_process_group(backend='nccl')
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(local_rank), device_ids=[local_rank])
```

Also verify that the NCCL backend is available, to rule out a communication-backend problem:

```python
import torch.distributed as dist

if dist.is_available() and dist.is_nccl_available():
    print("NCCL backend is available")
else:
    print("NCCL backend not available")
```

### Data loader and preprocessing logic

Faulty transforms in the data pipeline (out-of-range crops, wrong normalization, label values outside the expected class range) can make later GPU tensor operations fail. Check that the transforms in the dataset and data loader are applied correctly, e.g. that `ToTensor()` and `Normalize()` are used as intended. It is also worth setting `num_workers` to 0 to test single-process data loading and rule out worker-related issues:

```python
train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0  # single-process loading for debugging
)
```

### Order of parallel wrapping

If the model is wrapped in `DataParallel` or `DistributedDataParallel` in the wrong order, internal tensors can end up in an inconsistent layout and trigger illegal memory access. Wrap the model only after it has been fully constructed and moved to its device, not while it is still being built[^4]. For example:

```python
model = MyModel().cuda(local_rank)   # build and place the model first
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])  # then wrap it
```

### CUDA version and driver compatibility

Illegal memory access can also come from an outdated CUDA driver or a mismatch between the driver and the CUDA version PyTorch was built against. Check the driver with `nvidia-smi` and confirm that the installed PyTorch build supports that CUDA version; the compatibility matrix is on the [PyTorch website](https://pytorch.org/).
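As a quick cross-check from the Python side, the snippet below (a minimal sketch using standard `torch` attributes; the values printed depend on the local install) shows the versions PyTorch itself reports, which can then be compared with what `nvidia-smi` shows:

```python
import torch

print("torch:", torch.__version__)               # installed PyTorch build
print("built with CUDA:", torch.version.cuda)    # CUDA version PyTorch was compiled against
print("CUDA available:", torch.cuda.is_available())
print("cuDNN:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```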