Problem description:
Training with PyTorch 1.10.0 fails with the error:
ConnectionResetError: [Errno 104] Connection reset by peer
Analysis
See the related PyTorch issue:
I believe the issue is only triggered for the case that both persistent_workers and pin_memory are turned on and iteration is terminated at the time that worker is sending data to queue. First, persistent worker would keep iterator with workers running without proper cleaning up (using __del__ in _MultiProcessingDataLoaderIter). And, if any background worker (daemon process) is terminated when it is sending data to the _worker_result_queue, such Error would be triggered as the pin_memory_thread wants to get such data from Queue. I can send a PR.
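A minimal sketch of the scenario the issue describes, assuming a toy in-memory dataset (all names and sizes here are illustrative). pin_memory is deliberately left off so the snippet runs safely on CPU-only machines instead of reproducing the race:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration.
dataset = TensorDataset(torch.arange(64).float())

# Per the issue, the bug needs BOTH persistent_workers=True AND
# pin_memory=True; pin_memory is off here so this sketch stays safe.
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    persistent_workers=True, pin_memory=False)

it = iter(loader)
first = next(it)  # workers are now streaming batches into the result queue
# Stopping iteration here (e.g. early stopping) just drops the local
# reference; with persistent_workers=True the loader keeps the workers
# alive, and real cleanup only happens in
# _MultiProcessingDataLoaderIter.__del__. If pin_memory were also on,
# a worker killed while mid-send to _worker_result_queue can surface as
# ConnectionResetError in the pin_memory_thread.
print(first[0].shape[0])  # batch size of the first batch
```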
Solution
The current workaround is to increase the batch size; you can also try the other approaches mentioned in the issue:
I have experienced this issue as well, where the dataloader exits with a ConnectionResetError: [Errno 104] Connection reset by peer error. I observed that this error goes away with either a) adding a sleep, or b) using larger batch sizes. I suspect there is a race condition that is triggered if the dataloader completes very quickly. I am running PyTorch 1.10.
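The workarounds above can be sketched as a DataLoader configuration; the dataset and numbers are illustrative assumptions, not from the issue. Disabling either persistent_workers or pin_memory avoids the problematic combination, and a larger batch size means fewer queue handoffs for the race to hit:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset for illustration.
dataset = TensorDataset(torch.arange(100).float())

loader = DataLoader(
    dataset,
    batch_size=25,           # larger batches -> fewer worker-to-queue handoffs
    num_workers=2,
    persistent_workers=True,
    pin_memory=False,        # break the persistent_workers + pin_memory combo
)

# Iterating to completion (no early break) also avoids the reported
# trigger of terminating iteration while a worker is mid-send.
n_batches = sum(1 for _ in loader)
print(n_batches)  # 100 / 25 = 4
```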