from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('sparkApp1').setMaster("local")
sc = SparkContext.getOrCreate(conf)

# parallelize the list into 7 partitions, then inspect the partition layout with glom()
rdd01 = sc.parallelize([1, 2, 3, 4, 5, 6], 7)
print(rdd01.collect())
print(rdd01.glom().collect())
Running it throws the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 2.0 failed 1 times, most recent failure: Lost task 5.0 in stage 2.0 (TID 19) : org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:188)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:108)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:121)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:162)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:175)
    ... 14 more
I'm a complete beginner just trying to finish an elective course assignment. Following online tutorials I installed JDK 1.8, Hadoop 2.7, and Spark 3.0.1.
The error appears as soon as I run the script, even though the environment configuration looks fine and spark-shell runs normally in cmd.
I even reinstalled everything at first, but it still failed. In fact, if spark-shell works, the problem is most likely on the pyspark side.
I still don't know the exact root cause, but in the end these two steps fixed it:
1. Use findspark.
Add the following at the very top of the script:
import findspark
findspark.init()
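
Putting the fix together, here is a minimal sketch of the corrected script (it reuses the values from the original code; passing an explicit Spark home path to findspark.init() is only an assumption for machines where SPARK_HOME is not set):

import findspark
findspark.init()   # must run before any pyspark import; optionally findspark.init("C:/spark") if SPARK_HOME is not set

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('sparkApp1').setMaster("local")
sc = SparkContext.getOrCreate(conf)

rdd01 = sc.parallelize([1, 2, 3, 4, 5, 6], 7)
print(rdd01.collect())         # all elements: [1, 2, 3, 4, 5, 6]
print(rdd01.glom().collect())  # elements grouped by partition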
2. Downgrade the pyspark version.
A plain pip install pyspark had pulled a 3.x.y release by default; apparently versions 3 and above changed some interface and are incompatible with something on the Java side, so I switched to 2.3.2:
pip uninstall pyspark
pip install pyspark==2.3.2
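
As a quick sanity check (my own suggestion, not part of the original steps), you can confirm which version is actually active after the reinstall:

import pyspark
print(pyspark.__version__)   # should print 2.3.2 after the downgrade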
Summary: on Windows, a pyspark job failed with the error above; it was resolved by initializing with findspark and downgrading pyspark to 2.3.2. The likely cause is an interface incompatibility introduced in pyspark 3.x.





