PyTorch DistributedDataParallel error: parameters not used in producing the loss

Error message:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

Solution:

Do not return anything from `forward` that does not participate in the loss computation!

For example, if you do need to keep outputs that are not used in the loss, wrap the model with `find_unused_parameters=True` (the option mentioned in the error message). The original snippet was cut off after the comma; a plausible completion is:

```python
model = nn.parallel.DistributedDataParallel(model,
                                            device_ids=[config.args.local_rank],
                                            find_unused_parameters=True)
```
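As a concrete illustration of the failure mode (a sketch, not the poster's actual model): when `forward` returns an output whose parameters never contribute to the loss, those parameters receive no gradient and DDP raises the error above. The class and attribute names below are illustrative only.

```python
import torch
import torch.nn as nn


class NetWithAuxOutput(nn.Module):
    """Illustrative model: the auxiliary head's output is returned but never
    used in the loss, so its parameters get no gradient under default DDP."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(10, 10)
        self.head = nn.Linear(10, 5)
        self.aux_head = nn.Linear(10, 5)   # parameters never reach the loss

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        out = self.head(feat)
        # Problematic: aux is returned, but if the training loop only computes
        # the loss on `out`, self.aux_head receives no gradients and DDP's
        # gradient reduction never finishes for those parameters.
        aux = self.aux_head(feat)
        return out, aux

    # Fix option 1: return only what the loss actually needs.
    # def forward(self, x):
    #     feat = torch.relu(self.backbone(x))
    #     return self.head(feat)
```

Fix option 2 is to keep the extra output and construct DDP with `find_unused_parameters=True`, as in the wrapping call above; note that this adds an extra traversal of the autograd graph on every iteration.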
### PyTorch Single-Machine Multi-GPU Training with DistributedDataParallel

#### Environment setup

For efficient single-machine multi-GPU training, set the appropriate environment variables before launching the program to control which GPUs are visible and which distributed communication backend is used. `CUDA_VISIBLE_DEVICES` controls the list of visible GPUs.

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3  # assuming four GPUs are used
```

#### Import the required modules and initialize the process group

After importing `torch.distributed` and the other required components in the Python script, call `init_process_group()` to establish the connection between all participating processes.

```python
import os

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    torch.distributed.init_process_group(
        backend='nccl', rank=rank, world_size=world_size)

    # set the device for this process
    torch.cuda.set_device(rank)
```

#### Build the model and wrap it with DDP

Create the network object and pass it to the `DistributedDataParallel` constructor to enable synchronized gradient updates across devices. Each worker then runs an identical copy of the computation graph while keeping parameters consistent and communication efficient.

```python
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        x = self.relu(self.net1(x))
        return self.net2(x)


local_rank = int(os.getenv("LOCAL_RANK"))
model = ToyModel().to(f'cuda:{local_rank}')
ddp_model = DDP(model, device_ids=[local_rank])
```

#### Data loader setup

For large datasets, use `DistributedSampler` instead of the default sampling strategy so that the samples read by different processes do not overlap and are evenly distributed over the whole dataset. Also tune the per-process batch size to the available hardware.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

dataset = MNIST(root='./data', train=True, transform=ToTensor(), download=True)
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=8, sampler=sampler)
```

#### Training loop

Finally, write the full training loop: forward pass, loss computation, backward pass, and optimizer step. Note that `sampler.set_epoch(epoch)` should be called at the start of every epoch so that the shuffling order differs from epoch to epoch.

```python
criterion = nn.CrossEntropyLoss().cuda(local_rank)
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

num_epochs = 10  # choose a value appropriate for your task

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle the data for this epoch
    running_loss = 0.0
    for i, data in enumerate(dataloader, start=0):
        inputs = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)

        optimizer.zero_grad()
        outputs = ddp_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print(f'[Epoch {epoch + 1}] Loss: {running_loss / (i + 1)}')
```
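The snippets above read `LOCAL_RANK` from the environment but do not show how the worker processes are started. Below is a minimal sketch of one way to launch them from a single script with `torch.multiprocessing.spawn`; the `run_worker` wrapper and the manual setting of `LOCAL_RANK` are illustrative choices, not part of the original post (when launching with `torchrun`, these environment variables are set for you).

```python
import os

import torch
import torch.multiprocessing as mp


def run_worker(rank, world_size):
    # Illustrative wrapper: mp.spawn passes the process index as `rank`.
    # Set LOCAL_RANK so the snippets above that read it keep working.
    os.environ["LOCAL_RANK"] = str(rank)
    setup(rank, world_size)  # init_process_group + set_device, defined above
    # ... build ddp_model and dataloader, then run the training loop above ...
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. 4 with CUDA_VISIBLE_DEVICES=0,1,2,3
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
```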