Spark Error Summary
- Spark SQL
  - 1. Spark SQL throws "mismatched input '<=' expecting" when dropping a partition
  - 2. Spark SQL reads a Hive Parquet table and every column is NULL, while Hive and Spark 1.x query it correctly
  - 3. PySpark on Windows fails to start with "ModuleNotFoundError: No module named 'resource'" or "Python worker failed to connect back"
  - 4. A PySpark UDF that returns a collection fails with "missing 1 required positional argument: 'elementType'"
  - 5. count(distinct xx) over(partition by xxx) fails with "Distinct window functions are not supported: count(distinct _w0# ) windowspecdefinition"
Spark SQL
1. Spark SQL throws "mismatched input '<=' expecting" when dropping a partition
- Date recorded
2021-11-23 20:11:07
- Spark version: 2.1.0
- Error details
```
Traceback (most recent call last):
  File "/tmp/voldemort/0000003358711/resource.test.py", line 28, in <module>
    spark.sql(sql_drop_part)
  File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera4-1.cdh5.13.3.p0.818552/lib/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 545, in sql
  File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera4-1.cdh5.13.3.p0.818552/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera4-1.cdh5.13.3.p0.818552/lib/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
pyspark.sql.utils.ParseException: u"\nmismatched input '<=' expecting {')', ','}(line 1, pos 85)\n\n== SQL ==\nalter table ods.test_table_name drop if exists partition(dt <= '20211102')\n-------------------------------------------------------------------------------------^^^\n"
```
This is a Spark version limitation: ALTER TABLE ... DROP PARTITION cannot match '>' or '<' in this version. For details, see: url
- Current workaround
```python
# Use '=' instead of a range comparison
sql = """ alter table ods.test_table_name drop if exists partition(dt = '20211102') """
```
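If the real goal is to drop every partition up to a cutoff date, a workaround is to enumerate the partitions and drop each one with an equality predicate. A minimal sketch (the table name and cutoff reuse the placeholders from the error above, and it assumes dt values are yyyyMMdd strings):

```python
# Sketch: emulate "drop partition(dt <= cutoff)" by enumerating partitions
# and issuing one equality-based drop per matching partition.
cutoff = "20211102"
for row in spark.sql("show partitions ods.test_table_name").collect():
    # each row looks like 'dt=20211101'
    dt_value = row[0].split("=")[1]
    if dt_value <= cutoff:
        spark.sql(
            "alter table ods.test_table_name drop if exists partition(dt = '{}')".format(dt_value)
        )
```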
2. Spark SQL reads a Hive Parquet table and every column is NULL, while Hive and Spark 1.x query the table correctly
- Date recorded
2022-07-15 18:05:07
- Spark version: 2.1.0
- Current workaround
```
# Can be run directly in spark-shell
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

# Read the current value back
spark.conf.get("spark.sql.hive.convertMetastoreParquet")
```
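The NULL columns are commonly caused by a schema mismatch (for example, column-name casing) between the Hive metastore and the Parquet files; disabling `spark.sql.hive.convertMetastoreParquet` makes Spark read the table through the Hive SerDe instead of its built-in Parquet reader. To apply the workaround to a whole job rather than an interactive shell, the setting can also be passed when the session is built; a minimal PySpark sketch (the app name and table name are illustrative):

```python
from pyspark.sql import SparkSession

# Build a session that reads Hive Parquet tables through the Hive SerDe.
spark = (
    SparkSession.builder
    .appName("read_hive_parquet_table")  # illustrative app name
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("select * from ods.test_parquet_table limit 10").show()  # illustrative table
```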
3. PySpark on Windows fails to start with "ModuleNotFoundError: No module named 'resource'" or "Python worker failed to connect back"
- Date recorded
2022-10-25 17:52:11
- Spark version: 2.4.0
- Error details
File "D:\dataTools\conda\lib\runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "D:\dataTools\conda\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "D:\dataTools\conda\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module> ModuleNotFoundError: No module named 'resource' 2022-10-25 16:22:11 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.SocketTimeoutException: Accept timed out at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method) at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164) ... 
35 more 2022-10-25 16:22:11 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.SocketTimeoutException: Accept timed out at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method) at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164) ... 35 more
- Current workaround
Spark 2.4.0 has a known bug on Windows: `worker.py` imports the Unix-only `resource` module, so the Python worker crashes on startup and the JVM then reports "Python worker failed to connect back".
Fix 1: upgrade to a newer Spark release, where the import is guarded.
Fix 2: patch the bug in the installed version (see the sketch below). Here are some reference documents.
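A sketch of what Fix 2 can look like: guard the Unix-only `resource` import in the local copy of `pyspark/worker.py` (this mirrors the guard used in later Spark releases; patch at your own risk, and any code further down that uses `resource` must also respect the flag):

```python
# In pyspark/worker.py: make the Unix-only `resource` import optional so the
# Python worker can start on Windows.
try:
    import resource
    has_resource_module = True
except ImportError:
    # `resource` does not exist on Windows
    has_resource_module = False
```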
4. A PySpark UDF that returns a collection fails with "missing 1 required positional argument: 'elementType'"
- Date recorded
2022-11-22 19:22:11
- Spark version: 2.4.0
- Error details
```
TypeError: __init__() missing 1 required positional argument: 'elementType'
```
- Code details
The goal is to flatten the nested arrays in column B, drop the highest value, and return the remaining values as an array.
```python
# Data in pyspark
"""
+---+------------------------------+
|A  |B                             |
+---+------------------------------+
|a  |[[95.0], [25.0, 25.0], [40.0]]|
|a  |[[95.0], [20.0, 80.0]]        |
|a  |[[95.0], [25.0, 75.0]]        |
|b  |[[95.0], [25.0, 75.0]]        |
|b  |[[95.0], [12.0, 88.0]]        |
+---+------------------------------+
"""
# result
"""
+---+------------------------------+
|A  |B                             |
+---+------------------------------+
|a  |[25.0, 25.0, 40.0]            |
|a  |[20.0, 80.0]                  |
|a  |[25.0, 75.0]                  |
|b  |[25.0, 75.0]                  |
|b  |[12.0, 88.0]                  |
+---+------------------------------+
"""
from pyspark.sql.types import ArrayType, FloatType
import pyspark.sql.functions as F
import numpy as np

def remove_highest(col):
    # Flatten the nested lists, sort, and drop the largest value
    return np.sort(
        np.asarray([item for sublist in col for item in sublist])
    )[:-1]

udf_remove_highest = F.udf(remove_highest, ArrayType())  # raises the TypeError above
```
- Current workaround
The error says a required positional argument is missing: the UDF returns an array, so the element type inside the ArrayType must be specified as well.
```python
udf_remove_highest = F.udf(remove_highest, ArrayType(FloatType()))
```
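One caveat: `remove_highest` above returns a NumPy array of `numpy.float64` values, which some Spark versions cannot serialize into an `ArrayType(FloatType())` column (you may see nulls or a type error). A safer sketch returns plain Python floats:

```python
from pyspark.sql.types import ArrayType, FloatType
import pyspark.sql.functions as F

def remove_highest(col):
    # Flatten the nested lists, sort, drop the largest value, return plain floats
    flat = sorted(float(x) for sublist in col for x in sublist)
    return flat[:-1]

udf_remove_highest = F.udf(remove_highest, ArrayType(FloatType()))
```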
5. count(distinct xx) over(partition by xxx) fails with "Distinct window functions are not supported: count(distinct _w0# ) windowspecdefinition"
- Date recorded
2024-07-01 18:22:11
- Spark version: 2.4.0
- Error details
```
Distinct window functions are not supported: count(distinct _w0# ) windowspecdefinition
```
- Code details
The goal is to count the distinct values of column B within each group of column A.
```python
# Data in pyspark
"""
+---+----+
|A  |B   |
+---+----+
|a  |15  |
|a  |12  |
|a  |12  |
|b  |12  |
|b  |12  |
+---+----+
"""
# result
"""
+---+----+----+
|A  |B   |B_  |
+---+----+----+
|a  |15  |2   |
|a  |12  |2   |
|a  |12  |2   |
|b  |12  |1   |
|b  |12  |1   |
+---+----+----+
"""
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

# assumes an existing SparkSession `spark` and a registered temp view `temp`
sel_sql = """
select a, b, count(distinct b) over(partition by a) as B_
from temp
"""
spark.sql(sel_sql).show()  # fails with the error above
```
- Current workaround
Cause: Spark does not support count(distinct xx) over(partition by xxx). Combine collect_set with size over the same window instead.
```python
# size(collect_set(xx) over(partition by xxx))
sel_sql = """
select a, b, size(collect_set(b) over(partition by a)) as B_
from temp
"""
spark.sql(sel_sql).show()
```
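For reference, the same workaround expressed with the DataFrame API (a self-contained sketch using the sample data from the table above):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("distinct_count_over_window").getOrCreate()  # illustrative app name

df = spark.createDataFrame(
    [("a", 15), ("a", 12), ("a", 12), ("b", 12), ("b", 12)], ["a", "b"]
)

# size(collect_set(b)) over the window == count(distinct b) per group of a
w = Window.partitionBy("a")
df.withColumn("B_", F.size(F.collect_set("b").over(w))).show()
```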