RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

Latest recommended article published on 2025-04-05 14:52:45

豆奶泡油条



Tags: python, deep learning, pytorch

Article link: https://blog.youkuaiyun.com/u014295602/article/details/130781392


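During multi-GPU distributed training, a RuntimeError is raised indicating that the model has parameters that are never used in the forward function. The fix is to make sure that every module defined in __init__ is actually called in forward, and to comment out any module that forward never uses. With no unused parameters left, DistributedDataParallel can complete gradient reduction for every parameter in each iteration, so the error no longer occurs.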
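To find which modules to comment out, one option is to run a single forward and backward pass without DistributedDataParallel and list the parameters that received no gradient. The sketch below is only an illustration: the `Net` model, its `extra` layer, and the helper `find_unused_parameter_names` are hypothetical names, not part of the original post.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical model: `extra` is defined in __init__ but never called in forward."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.extra = nn.Linear(4, 2)  # unused in forward, so it never gets a gradient

    def forward(self, x):
        return self.fc(x)


def find_unused_parameter_names(model, loss):
    """Return names of trainable parameters that received no gradient from `loss`."""
    model.zero_grad(set_to_none=True)  # make sure stale gradients do not hide the problem
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = Net()
loss = model(torch.randn(3, 4)).sum()
unused = find_unused_parameter_names(model, loss)
print(unused)  # ['extra.weight', 'extra.bias']
```

Any module whose parameters show up in this list is a candidate for removal from __init__ (or for being wired into forward, if it was meant to be used).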


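Problem: training works fine without distributed training, but fails with the error below when using multi-GPU distributed training.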

Error message
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
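Option (1) from the error message can be sketched as follows. This is a minimal illustration under several assumptions: it uses a single-process "gloo" group on CPU with a temporary file store (on a Linux-style path) purely so that it runs without a launcher, and the `Net` model and `run_demo` helper are hypothetical. A real multi-GPU job would normally be started with a launcher such as torchrun and would pass device_ids for the local GPU.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class Net(nn.Module):
    """Hypothetical model with a module that forward never calls."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.extra = nn.Linear(4, 2)  # never used in forward

    def forward(self, x):
        return self.fc(x)


def run_demo(steps=2):
    """Train for `steps` iterations under DDP and return the number of completed steps."""
    # Assumption: one CPU process with a file-based store, only to keep the sketch runnable.
    store_file = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        backend="gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )
    try:
        # find_unused_parameters=True lets DDP tolerate parameters that got no gradient.
        model = DDP(Net(), find_unused_parameters=True)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        completed = 0
        for _ in range(steps):
            optimizer.zero_grad()
            loss = model(torch.randn(3, 4)).sum()
            loss.backward()
            optimizer.step()
            completed += 1
        return completed
    finally:
        dist.destroy_process_group()
```

Calling `run_demo(steps=2)` runs two consecutive iterations, which is the point at which the error would otherwise be reported. The flag adds some overhead on every iteration, so removing the unused modules, as the post recommends, is usually the cleaner fix when it is possible.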