PySpark on Windows: "Connection reset by peer: socket write error" when loading a dataset to train a model


As a workaround, you might try the following change to `python/pyspark/worker.py`.

Add the following two lines to the end of the `process` function defined inside the `main` function:

    for obj in iterator:
        pass

... so the `process` function now looks like this (in Spark 1.5.2 at least):

    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        for obj in iterator:
            pass

After making the change you will need to rebuild the pyspark.zip in the python/lib folder so that it includes the modified worker.py.
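If you would rather not repackage the archive by hand, it can be rebuilt with Python's standard library. A minimal sketch, assuming the standard Spark source layout where the package lives in `$SPARK_HOME/python/pyspark` and the archive in `$SPARK_HOME/python/lib/pyspark.zip` (adjust the paths if your distribution differs):

```python
import os
import zipfile

def rebuild_pyspark_zip(spark_home):
    """Repackage the pyspark sources into python/lib/pyspark.zip.

    Assumes the standard Spark directory layout; the entries are
    stored relative to python/ so the archive contains a top-level
    pyspark/ package, matching the original zip.
    """
    src_dir = os.path.join(spark_home, "python", "pyspark")
    zip_path = os.path.join(spark_home, "python", "lib", "pyspark.zip")
    base = os.path.join(spark_home, "python")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Archive path relative to python/, e.g. pyspark/worker.py
                zf.write(full, os.path.relpath(full, base))
```

Restart your Spark application afterwards so the workers pick up the new archive.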

The issue may be that the worker process completes before the executor has finished writing all the data to it. The thread writing the data down the socket then throws an exception, and if this happens before the executor marks the task as complete it causes the failure. The idea is to force the worker to pull all the data from the executor, even if it is not needed to lazily compute the output. This is of course very inefficient, so it is a workaround rather than a proper solution.
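The lazy-iterator behaviour the fix targets can be illustrated outside Spark. A toy sketch (all names here are made up for illustration, not Spark internals): a transform whose output does not depend on its input never advances the input iterator, so the producer's items are only consumed when the rest of the stream is drained explicitly, as the added `for obj in iterator: pass` loop does.

```python
def producer(items, consumed):
    # Stands in for the executor streaming rows to the worker;
    # `consumed` records how far the worker actually read.
    for item in items:
        consumed.append(item)
        yield item

def take_none(iterator):
    # Stands in for a func() that short-circuits and never
    # touches its input iterator.
    return iter([])

consumed = []
iterator = producer([1, 2, 3], consumed)
list(take_none(iterator))   # lazy: nothing has been read yet
assert consumed == []

# The workaround: drain whatever func() left unread, so the
# producing side finishes cleanly instead of failing mid-write.
for obj in iterator:
    pass
assert consumed == [1, 2, 3]
```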
