As a workaround, you might try the following change to python/pyspark/worker.py.
Add the following two lines to the end of the process function defined inside the main function:
    for obj in iterator:
        pass
... so the process function now looks like this (in Spark 1.5.2 at least):
    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        # Drain whatever the executor sent but func never consumed.
        for obj in iterator:
            pass
After making this change you will need to rebuild pyspark.zip in the python/lib folder so that it includes the patched worker.py.
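If it helps, here is a rough sketch of one way to rebuild the archive; the SPARK_HOME-based paths are assumptions about a standard Spark layout, so adjust them for your install (the usual zip command-line tool works just as well):

    # Sketch: repackage python/pyspark into python/lib/pyspark.zip.
    # Paths are assumptions; point spark_home at your own installation.
    import os
    import zipfile

    spark_home = os.environ.get("SPARK_HOME", "/path/to/spark-1.5.2")
    src_dir = os.path.join(spark_home, "python")
    target = os.path.join(spark_home, "python", "lib", "pyspark.zip")

    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(os.path.join(src_dir, "pyspark")):
            for name in files:
                path = os.path.join(root, name)
                # Store entries as pyspark/... so the package imports cleanly.
                zf.write(path, os.path.relpath(path, src_dir))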
The issue may be that the worker process is completing before the executor has finished writing all of its data to it. The thread writing the data down the socket throws an exception, and if this happens before the executor marks the task as complete it will cause trouble. The idea is to get the worker to pull all the data from the executor even if it's not needed to lazily compute the output. This is of course very inefficient, so it is a workaround rather than a proper solution.
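To make the idea concrete, here is a minimal standalone sketch (not Spark code; all names are made up for illustration) of how a lazily evaluated func can leave input unread unless the iterator is drained afterwards:

    import itertools

    def lazy_func(split_index, iterator):
        # Stands in for a user function that only needs part of its input,
        # e.g. a take()/limit()-style operation.
        return itertools.islice(iterator, 3)

    def process(records, func, split_index=0):
        iterator = iter(records)
        output = list(func(split_index, iterator))  # stands in for dump_stream
        # Without this loop the remaining records are never read; in the real
        # worker that can leave the executor's writer thread failing while the
        # task is otherwise done.
        for obj in iterator:
            pass
        return output

    print(process(range(10), lazy_func))  # prints [0, 1, 2]; all 10 inputs consumed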