Summary of Spark version bugs

Spark SQL

1. Spark SQL: dropping a partition fails with mismatched input '<=' expecting

  • Date recorded
    2021-11-23 20:11:07
  • Spark version: 2.1.0
  • Error details
    Traceback (most recent call last):
    File "/tmp/voldemort/0000003358711/resource.test.py", line 28, in <module>
    spark.sql(sql_drop_part)
    File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera4-1.cdh5.13.3.p0.818552/lib/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", 	line 545, in sql
    File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera4-1.cdh5.13.3.p0.818552/lib/spark2/python/lib/py4j-0.10.7-		src.zip/py4j/java_gateway.py", line 1257, in __call__
    File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera4-1.cdh5.13.3.p0.818552/lib/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
    pyspark.sql.utils.ParseException: u"\nmismatched input '<=' expecting {')', ','}(line 1, pos 85)\n\n== SQL ==\nalter table ods.test_table_name drop if exists partition(dt <= '20211102')\n-------------------------------------------------------------------------------------^^^\n"
    
    This is a Spark version issue: the partition spec in DROP PARTITION cannot be matched with '>' or '<'; see url for details.
  • Current workaround
    # use = instead of a range comparison (a sketch for dropping a whole range this way follows below)
    sql = """
    alter table ods.test_table_name drop if exists partition(dt = '20211102')
    """
    

2. Spark SQL reads a Hive Parquet table and every column comes back as null, while the same query returns data in Hive and Spark 1.x.

  • Date recorded
    2022-07-15 18:05:07
  • Spark version: 2.1.0
  • Current workaround (the same setting can also be applied when the session is built, see the sketch below)
    # can be run directly in spark-shell
    spark.conf.set("spark.sql.hive.convertMetastoreParquet","false")
    
    # read the current value back
    spark.conf.get("spark.sql.hive.convertMetastoreParquet")
    

3. Launching PySpark on Windows fails with ModuleNotFoundError: No module named 'resource' or Python worker failed to connect back.

  • Date recorded
    2022-10-25 17:52:11

  • Spark version: 2.4.0

  • Error details

      File "D:\dataTools\conda\lib\runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "D:\dataTools\conda\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "D:\dataTools\conda\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module>
    ModuleNotFoundError: No module named 'resource'
    2022-10-25 16:22:11 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
    org.apache.spark.SparkException: Python worker failed to connect back.
            at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
            at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
            at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
            at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
            at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
            at org.apache.spark.scheduler.Task.run(Task.scala:121)
            at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
            at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketTimeoutException: Accept timed out
            at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
            at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
            at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
            at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
            at java.net.ServerSocket.implAccept(ServerSocket.java:545)
            at java.net.ServerSocket.accept(ServerSocket.java:513)
            at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
            ... 35 more
    2022-10-25 16:22:11 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
            at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
            at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
            at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
            at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
            at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
            at org.apache.spark.scheduler.Task.run(Task.scala:121)
            at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
            at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketTimeoutException: Accept timed out
            at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
            at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
            at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
            at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
            at java.net.ServerSocket.implAccept(ServerSocket.java:545)
            at java.net.ServerSocket.accept(ServerSocket.java:513)
            at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
            ... 35 more
    
  • Current workaround

    # Spark 2.4.0 has a bug on Windows
    # Option 1: upgrade the Spark version (see the sketch after the references)
    

    Option 2: patch the error in the installed version. Some reference documents:

    Reference 1: Stack Overflow
    Reference 2: GitHub
    Reference 3: the Spark documentation
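    A minimal sketch for option 1, assuming a pip-installed PySpark on the Windows machine: confirm which build is actually being loaded, then upgrade it. The assumption here is that releases after 2.4.0 no longer import the Unix-only resource module unconditionally in worker.py.

    import pyspark
    print(pyspark.__version__)   # '2.4.0' is the affected build in this note

    # then, from a shell (not inside Python):
    #   pip install --upgrade "pyspark>2.4.0"
    # re-run the version check above and retry the failing job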

4. A PySpark UDF declared to return a collection fails with missing 1 required positional argument: 'elementType'

  • Date recorded
    2022-11-22 19:22:11

  • Spark version: 2.4.0

  • Error details

    TypeError: __init__() missing 1 required positional argument: 'elementType'
    
  • Code details

    The goal is to flatten column B, remove its highest value, and return the rest as an array.

    # input data in pyspark
    """
    +---+------------------------------+
    |A  |B                             |
    +---+------------------------------+
    |a  |[[95.0], [25.0, 25.0], [40.0]]|
    |a  |[[95.0], [20.0, 80.0]]        |
    |a  |[[95.0], [25.0, 75.0]]        |
    |b  |[[95.0], [25.0, 75.0]]        |
    |b  |[[95.0], [12.0, 88.0]]        |
    +---+------------------------------+
    """
    # expected result
    """
    +---+------------------------------+
    |A  |B                             |
    +---+------------------------------+
    |a  |[25.0, 25.0, 40.0]            |
    |a  |[20.0, 80.0]                  |
    |a  |[25.0, 75.0]                  |
    |b  |[25.0, 75.0]                  |
    |b  |[12.0, 88.0]                  |
    +---+------------------------------+
    """
    from pyspark.sql.types import ArrayType, FloatType
    import pyspark.sql.functions as F
    import numpy as np

    def remove_highest(col):
        # flatten the nested lists, sort ascending, drop the largest value,
        # and return plain Python floats rather than a numpy array
        return np.sort(np.asarray([item for sublist in col for item in sublist]))[:-1].tolist()
    
    # this raises the error above: ArrayType requires an element type
    udf_remove_highest = F.udf(remove_highest, ArrayType())
    
  • Current workaround

    The error means a positional argument is missing: since the UDF returns an array, the element type inside the array must be specified as well (a runnable sketch follows below).

    udf_remove_highest = F.udf(remove_highest, ArrayType(FloatType()))
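    A minimal end-to-end sketch of the fix, with sample rows copied from the tables above (the session and DataFrame construction are illustrative):

    import numpy as np
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import ArrayType, FloatType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", [[95.0], [25.0, 25.0], [40.0]]),
         ("a", [[95.0], [20.0, 80.0]]),
         ("b", [[95.0], [12.0, 88.0]])],
        ["A", "B"],
    )

    def remove_highest(col):
        # flatten, sort ascending, drop the largest value, return plain floats
        return np.sort(np.asarray([x for sub in col for x in sub]))[:-1].tolist()

    udf_remove_highest = F.udf(remove_highest, ArrayType(FloatType()))
    df.withColumn("B", udf_remove_highest(F.col("B"))).show(truncate=False)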
    

5. Spark: count(distinct xx) over(partition by xxx) fails with Distinct window functions are not supported: count(distinct _w0# ) windowspecdefinition

  • Date recorded
    2024-07-01 18:22:11

  • Spark version: 2.4.0

  • Error details

    Distinct window functions are not supported: count(distinct _w0# ) windowspecdefinition
    
  • Code details

    The goal is to count the number of distinct values of column B within each group of column A.

    # input data in pyspark
    """
    +---+----------+
    |A  |B         |
    +---+----------+
    |a  |15        |
    |a  |12        |
    |a  |12        |
    |b  |12        |
    |b  |12        |
    +---+----------+
    """
    # expected result
    """
    +---+----------+----------+
    |A  |B         |B_        |
    +---+----------+----------+
    |a  |15        |2         |
    |a  |12        |2         |
    |a  |12        |2         |
    |b  |12        |1         |
    |b  |12        |1         |
    +---+----------+----------+
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # assumes the data above is registered as a temporary view named 'temp'
    sel_sql = """
    select 
       a,b,count(distinct b) over(partition by a ) as B_
    from temp
    """
    # this raises: Distinct window functions are not supported
    spark.sql(sel_sql).show()
    
  • Current workaround

    Cause: Spark does not yet support count(distinct xx) over(partition by xxx); combine collect_set with size instead (a DataFrame API version is sketched below).

    # size(collect_set(xx) over(partition by xxx))
    sel_sql = """
    select 
       a,b,size(collect_set(b) over(partition by a )) as B_
    from temp
    """
    spark.sql(sel_sql).show()
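    For reference, a minimal sketch of the same workaround written with the DataFrame API instead of SQL (the sample rows mirror the tables above):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 15), ("a", 12), ("a", 12), ("b", 12), ("b", 12)], ["A", "B"]
    )

    # count distinct B per group of A: collect_set over the window, then size
    w = Window.partitionBy("A")
    df.withColumn("B_", F.size(F.collect_set("B").over(w))).show()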
    