PyTorch distributed training errors: 1. non-zero exit status, 2. cuDNN error: CUDNN_STATUS_INTERNAL_, 3. CUDA error: illegal memory

xiangyong58

Published 2021-07-26 15:53:54

License: CC 4.0 BY-SA

Category: Machine & Deep Learning. Tags: pytorch, DL

Article link: https://blog.youkuaiyun.com/xiangyong58/article/details/119111327

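This post covers three errors encountered during PyTorch distributed training: 1) non-zero exit status 1; 2) cuDNN error: CUDNN_STATUS_INTERNAL_ERROR; 3) CUDA error: illegal memory access. The fix for the first is to set find_unused_parameters=True in DistributedDataParallel. The second is caused by a mismatch between the torch version and the CUDA version, and is resolved by updating or rebuilding the conda environment so that the two match. The third is likewise a torch/CUDA incompatibility and needs the same kind of version adjustment.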
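
The fixes summarized above can be sketched in a few lines. This is a minimal illustration, not the author's code: `ToyModel`, `params_without_grad`, `wrap_ddp`, and `describe_versions` are hypothetical names introduced here. The first helper lists parameters that received no gradient (the ones DDP complains about), the second wraps a model with `find_unused_parameters=True` and assumes the process group has already been initialized, and the third prints the torch, CUDA, and cuDNN versions that are worth comparing when errors 2 and 3 show up.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    """Hypothetical model with a layer that never contributes to the loss."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # defined but never called in forward

    def forward(self, x):
        return self.used(x)


def params_without_grad(model, loss):
    """Run one backward pass and return names of parameters with no gradient."""
    for p in model.parameters():
        p.grad = None  # clear stale gradients so None means "not reached"
    loss.backward()
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]


def wrap_ddp(model, local_rank):
    """Wrap a model for DDP, tolerating unused parameters.

    Assumes torch.distributed.init_process_group() was already called and
    that local_rank is the GPU index assigned to this process.
    """
    model = model.to(local_rank)
    return DDP(model, device_ids=[local_rank], find_unused_parameters=True)


def describe_versions():
    """Collect the version info relevant to torch/CUDA mismatch errors."""
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,          # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "cudnn": torch.backends.cudnn.version(),   # None if cuDNN is unavailable
    }


if __name__ == "__main__":
    model = ToyModel()
    loss = model(torch.randn(3, 4)).sum()
    print(params_without_grad(model, loss))
    print(describe_versions())
```

Running the diagnostic on a single process without DDP is usually the quickest way to see which parameters are unused. If the list is non-empty, either remove those parameters from the model or pass `find_unused_parameters=True` as in `wrap_ddp`; the flag adds some overhead per iteration, so fixing the model is preferable when that is practical.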

1. returned non-zero exit status 1.

The error appears after one epoch, with the following message:

RuntimeError: Expected to have finished reduction in the prior iteration before 
starting a new one. This error indicates that your module has parameters that were 
not used in producing loss. You can enable unused parameter detection by (1) 
passing the keyword argument `find_unused_parameters=True` to 
`torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function 
outputs participate in calculating loss. If you already have done the above two 
steps, then the distributed data parallel module wasn't able to locate the output 
tensors in the return value of your module's `forward` funct