Problem Description
The following error is raised while training a network:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
Cause Analysis
According to this issue:
I met the same issue.
But i solved it.
The reason is that in my model class, I define a fpn module with 5 level output feature maps in the init function,
but in forward function I only use 4 of them.
When I use all of them, the problem was solved.
This is my supposed conclusion: you should use all output of each module in forward function.
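Put simply, a module is defined in the network but never used.
Of course, this module can be the simplest network layer, or it can be a more complex structure.
Solution
Find the module that is defined but not used, then either delete it or use it.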
update 20220716:
If the unused parameters are hard to find by reading the code, you can list them all by checking their gradients:
for name, param in runner.model.named_parameters():
    # A parameter whose gradient is still None did not take part in the loss.
    if param.grad is None:
        print(name)
This code should be added between loss.backward() and optimizer.step().
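
To make the check concrete, here is a minimal single-process sketch that needs no distributed setup. TinyNet, its layer names, and the helper find_unused_parameters are hypothetical names made up for this illustration; they do not come from the original issue or from any library. The model defines two layers but only calls one of them in forward, so the other layer's parameters never receive a gradient.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical model: `unused` is defined in __init__ but never called in forward."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # defined but never used in forward

    def forward(self, x):
        return self.used(x)


def find_unused_parameters(model):
    """Return names of trainable parameters whose gradient is still None."""
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = TinyNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

optimizer.zero_grad()
loss = model(torch.randn(8, 4)).sum()
loss.backward()
# Run the check between loss.backward() and optimizer.step().
unused_names = find_unused_parameters(model)
print(unused_names)  # ['unused.weight', 'unused.bias']
optimizer.step()
```

The check is probably most reliable on the first iteration: depending on the PyTorch version and on how zero_grad is called, gradients may later be reset to zero tensors rather than None, and a zero tensor would not be reported.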