Problem description:
Training with PyTorch 1.10.0 fails with the error:
ConnectionResetError: [Errno 104] Connection reset by peer
Analysis
See the related PyTorch issue:
I believe the issue is only triggered for the case that both persistent_workers and pin_memory are turned on and iteration is terminated at the time that worker is sending data to queue. First, persistent worker would keep iterator with workers running without proper cleaning up (using __del__ in _MultiProcessingDataLoaderIter). And, if any background worker (daemon process) is terminated when it is sending data to the _worker_result_queue, such Error would be triggered as the pin_memory_thread wants to get such data from Queue. I can send a PR.
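A minimal sketch of the scenario the issue describes, assuming a toy in-memory dataset (all names and sizes here are illustrative). pin_memory is deliberately left off so the snippet runs safely on CPU-only machines instead of reproducing the race:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration.
dataset = TensorDataset(torch.arange(64).float())

# Per the issue, the bug needs BOTH persistent_workers=True AND
# pin_memory=True; pin_memory is off here so this sketch stays safe.
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    persistent_workers=True, pin_memory=False)

it = iter(loader)
first = next(it)  # workers are now streaming batches into the result queue
# Stopping iteration here (e.g. early stopping) just drops the local
# reference; with persistent_workers=True the loader keeps the workers
# alive, and real cleanup only happens in
# _MultiProcessingDataLoaderIter.__del__. If pin_memory were also on,
# a worker killed while mid-send to _worker_result_queue can surface as
# ConnectionResetError in the pin_memory_thread.
print(first[0].shape[0])  # batch size of the first batch
```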
Solution
The current workaround is to increase the batch size; you can also try the other approaches mentioned in the issue:
I have experienced this issue as well, where the dataloader exits with a ConnectionResetError: [Errno 104] Connection reset by peer error. I observed that this error goes away with either a) adding a sleep, or b) using larger batch sizes. I suspect there is a race condition that is triggered if the dataloader completes very quickly. I am running PyTorch 1.10.
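The workarounds above can be sketched as a DataLoader configuration; the dataset and numbers are illustrative assumptions, not from the issue. Disabling either persistent_workers or pin_memory avoids the problematic combination, and a larger batch size means fewer queue handoffs for the race to hit:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset for illustration.
dataset = TensorDataset(torch.arange(100).float())

loader = DataLoader(
    dataset,
    batch_size=25,           # larger batches -> fewer worker-to-queue handoffs
    num_workers=2,
    persistent_workers=True,
    pin_memory=False,        # break the persistent_workers + pin_memory combo
)

# Iterating to completion (no early break) also avoids the reported
# trigger of terminating iteration while a worker is mid-send.
n_batches = sum(1 for _ in loader)
print(n_batches)  # 100 / 25 = 4
```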