PyTorch DistributedDataParallel error: parameters not used in producing the loss

Error message:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

Solution:

Do not return anything from `forward` that does not participate in the loss computation!

For example, if you do need to keep outputs that are not used in the loss, wrap the model with `find_unused_parameters=True` (the option mentioned in the error message). The original snippet was cut off after the comma; a plausible completion is:

```python
model = nn.parallel.DistributedDataParallel(model,
                                            device_ids=[config.args.local_rank],
                                            find_unused_parameters=True)
```
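As a concrete illustration of the failure mode (a sketch, not the poster's actual model): when `forward` returns an output whose parameters never contribute to the loss, those parameters receive no gradient and DDP raises the error above. The class and attribute names below are illustrative only.

```python
import torch
import torch.nn as nn


class NetWithAuxOutput(nn.Module):
    """Illustrative model: the auxiliary head's output is returned but never
    used in the loss, so its parameters get no gradient under default DDP."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(10, 10)
        self.head = nn.Linear(10, 5)
        self.aux_head = nn.Linear(10, 5)   # parameters never reach the loss

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        out = self.head(feat)
        # Problematic: aux is returned, but if the training loop only computes
        # the loss on `out`, self.aux_head receives no gradients and DDP's
        # gradient reduction never finishes for those parameters.
        aux = self.aux_head(feat)
        return out, aux

    # Fix option 1: return only what the loss actually needs.
    # def forward(self, x):
    #     feat = torch.relu(self.backbone(x))
    #     return self.head(feat)
```

Fix option 2 is to keep the extra output and construct DDP with `find_unused_parameters=True`, as in the wrapping call above; note that this adds an extra traversal of the autograd graph on every iteration.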
### PyTorch Single-Machine Multi-GPU Training with DistributedDataParallel

#### Environment setup

For efficient single-machine multi-GPU training, set the appropriate environment variables before launching the program to control which GPUs are visible and which distributed communication backend is used. `CUDA_VISIBLE_DEVICES` controls the list of visible GPUs.

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3  # assuming four GPUs are used
```

#### Import the required modules and initialize the process group

After importing `torch.distributed` and the other required components in the Python script, call `init_process_group()` to establish the connection between all participating processes.

```python
import os

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    torch.distributed.init_process_group(
        backend='nccl', rank=rank, world_size=world_size)

    # set the device for this process
    torch.cuda.set_device(rank)
```

#### Build the model and wrap it with DDP

Create the network object and pass it to the `DistributedDataParallel` constructor to enable synchronized gradient updates across devices. Each worker then runs an identical copy of the computation graph while keeping parameters consistent and communication efficient.

```python
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        x = self.relu(self.net1(x))
        return self.net2(x)


local_rank = int(os.getenv("LOCAL_RANK"))
model = ToyModel().to(f'cuda:{local_rank}')
ddp_model = DDP(model, device_ids=[local_rank])
```

#### Data loader setup

For large datasets, use `DistributedSampler` instead of the default sampling strategy so that the samples read by different processes do not overlap and are evenly distributed over the whole dataset. Also tune the per-process batch size to the available hardware.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

dataset = MNIST(root='./data', train=True, transform=ToTensor(), download=True)
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=8, sampler=sampler)
```

#### Training loop

Finally, write the full training loop: forward pass, loss computation, backward pass, and optimizer step. Note that `sampler.set_epoch(epoch)` should be called at the start of every epoch so that the shuffling order differs from epoch to epoch.

```python
criterion = nn.CrossEntropyLoss().cuda(local_rank)
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

num_epochs = 10  # choose a value appropriate for your task

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle the data for this epoch
    running_loss = 0.0
    for i, data in enumerate(dataloader, start=0):
        inputs = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)

        optimizer.zero_grad()
        outputs = ddp_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print(f'[Epoch {epoch + 1}] Loss: {running_loss / (i + 1)}')
```
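The snippets above read `LOCAL_RANK` from the environment but do not show how the worker processes are started. Below is a minimal sketch of one way to launch them from a single script with `torch.multiprocessing.spawn`; the `run_worker` wrapper and the manual setting of `LOCAL_RANK` are illustrative choices, not part of the original post (when launching with `torchrun`, these environment variables are set for you).

```python
import os

import torch
import torch.multiprocessing as mp


def run_worker(rank, world_size):
    # Illustrative wrapper: mp.spawn passes the process index as `rank`.
    # Set LOCAL_RANK so the snippets above that read it keep working.
    os.environ["LOCAL_RANK"] = str(rank)
    setup(rank, world_size)  # init_process_group + set_device, defined above
    # ... build ddp_model and dataloader, then run the training loop above ...
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. 4 with CUDA_VISIBLE_DEVICES=0,1,2,3
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
```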