As a workaround, you might try the following change to python/pyspark/worker.py.
Add the following two lines to the end of the process function defined inside the main function:
    for obj in iterator:
        pass
... so the process function now looks like this (in Spark 1.5.2 at least):
    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        # Drain whatever the executor sent but func never consumed.
        for obj in iterator:
            pass
After making this change you will need to rebuild pyspark.zip in the python/lib folder so that it includes the patched worker.py.
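If it helps, here is a rough sketch of one way to rebuild the archive; the SPARK_HOME-based paths are assumptions about a standard Spark layout, so adjust them for your install (the usual zip command-line tool works just as well):

    # Sketch: repackage python/pyspark into python/lib/pyspark.zip.
    # Paths are assumptions; point spark_home at your own installation.
    import os
    import zipfile

    spark_home = os.environ.get("SPARK_HOME", "/path/to/spark-1.5.2")
    src_dir = os.path.join(spark_home, "python")
    target = os.path.join(spark_home, "python", "lib", "pyspark.zip")

    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(os.path.join(src_dir, "pyspark")):
            for name in files:
                path = os.path.join(root, name)
                # Store entries as pyspark/... so the package imports cleanly.
                zf.write(path, os.path.relpath(path, src_dir))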
The issue may be that the worker process is completing before the executor has finished writing all of its data to it. The thread writing the data down the socket throws an exception, and if this happens before the executor marks the task as complete it will cause trouble. The idea is to get the worker to pull all the data from the executor even if it's not needed to lazily compute the output. This is of course very inefficient, so it is a workaround rather than a proper solution.
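To make the idea concrete, here is a minimal standalone sketch (not Spark code; all names are made up for illustration) of how a lazily evaluated func can leave input unread unless the iterator is drained afterwards:

    import itertools

    def lazy_func(split_index, iterator):
        # Stands in for a user function that only needs part of its input,
        # e.g. a take()/limit()-style operation.
        return itertools.islice(iterator, 3)

    def process(records, func, split_index=0):
        iterator = iter(records)
        output = list(func(split_index, iterator))  # stands in for dump_stream
        # Without this loop the remaining records are never read; in the real
        # worker that can leave the executor's writer thread failing while the
        # task is otherwise done.
        for obj in iterator:
            pass
        return output

    print(process(range(10), lazy_func))  # prints [0, 1, 2]; all 10 inputs consumed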