Problems

This post walks through common problems encountered in Spark big-data processing and how to work around them: limits when handling large datasets, tuning memory usage, improving parallelism, and configuring Spark parameters correctly for different scenarios.

Problems

  • RowMatrix 65535
    • This cannot be computed on matrices with more than 65535 columns.
    • RowMatrix can only handle data with at most 65535 dimensions (columns).
  • Spark cluster started inside VirtualBox: when the task result is too large, the driver has to fetch it through the BlockManager; since that port is not fixed by default, no port mapping can be set up in advance. The workaround is to pin the ports explicitly, as in the (truncated) docker run command and SparkSession config below.
docker run -ti --rm \
    --name sparkmaster \
    --hostname=host.a \
    --add-host host.a:xx.xx.xx.xx \
    --add-host host.b:xx.xx.xx.xx \
    -p 18080:18080 \

    .config("spark.fileserver.port", "7002") \
    .config("spark.broadcast.port", "7003") \
    .config("spark.replClassServer.port", "7004") \
    .config("spark.blockManager.port", "7005") \
    .config("spark.broadcast.blockSize", "4096") \
    .getOrCreate()

    // Two related limits are involved here:
    //   spark.driver.maxResultSize      - total result size the driver will accept
    //   spark.task.maxDirectResultSize  - threshold below which a task result is sent
    //                                     back inside the RPC reply instead of via the BlockManager
    // From the Spark source (Executor / RpcUtils):
    //   conf.getSizeAsBytes("spark.task.maxDirectResultSize", 1L << 20),
    //   RpcUtils.maxMessageSizeBytes(conf))
    //   private val MAX_MESSAGE_SIZE_IN_MB = Int.MaxValue / 1024 / 1024
    // ==> raise both limits so large results can be returned:
    .config("spark.task.maxDirectResultSize", "2147483648") \
    .config("spark.driver.maxResultSize", "3g") \

  • Python pandas DataFrame pitfall:
    • group_by_post_colums_sum = data_T.apply(lambda x:sum(x))
    • Judging from the traceback below, sum inside the lambda does not resolve to Python's builtin but to pyspark.sql.functions.sum (typically shadowed by a wildcard import), so pandas ends up calling a Spark SQL function; calling the builtin (or numpy.sum) explicitly avoids this.
  File "/Users/seki/git/dyd/data_foundation/python/com/dyd/data/test/TestRecommend.py", line 158, in <lambda>
    group_by_post_colums_sum = data_T.apply(lambda x:sum(x))
  File "/Users/seki/git/learn/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: ("'NoneType' object has no attribute '_jvm'", u'occurred at index 6293615542250297450')

Spark on YARN: Spark UI

17/05/03 17:58:02 ERROR client.TransportClient: Failed to send RPC 7785784597803174149 to /172.26.159.91:56630: java.nio.channels.ClosedChannelException

Spark on YARN, CDH 5.49: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.

  • In the start shell, --driver-class-path $CONF_DIR:$LIB_JARS was replaced with export SPARK_CLASSPATH="$SPARK_CLASSPATH:$CONF_DIR:$LIB_JARS", so that only one of the two classpath mechanisms is in effect.

spark broadcast Failed to get broadcast_8_piece0 of broadcast_8

  • Spark initialization (the SparkContext/SparkSession) must not be a member variable; create it inside the entry point instead (see the sketch below).
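A hedged sketch of the pattern (class and method names are illustrative): keep the session out of instance fields so the enclosing object is not dragged into serialized closures.

    import org.apache.spark.sql.SparkSession

    // Problematic pattern (sketch): the session lives as an instance field, so the whole
    // enclosing object gets captured and serialized along with any closure that touches it.
    // class BadJob extends Serializable {
    //   val spark = SparkSession.builder().appName("bad").getOrCreate()
    //   def run(): Long = spark.sparkContext.parallelize(1 to 100).map(_ * 2).count()
    // }

    // Preferred pattern: build the session locally in the entry point (or behind a lazily
    // created singleton) so closures only capture the values they actually need.
    object GoodJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("good")
          .master("local[*]") // local master only so the sketch runs standalone
          .getOrCreate()
        val n = spark.sparkContext.parallelize(1 to 100).map(_ * 2).count()
        println(n)
        spark.stop()
      }
    }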

breeze Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
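This is only a warning (MLlib falls back to the pure-JVM F2J implementation). If native BLAS is actually wanted, the commonly suggested route is to install a system BLAS (e.g. OpenBLAS/ATLAS plus libgfortran) on every node and pull in netlib-java's native wrappers; a hedged build.sbt sketch:

    // build.sbt sketch; the "all" artifact is netlib-java's aggregator of native wrappers.
    libraryDependencies += ("com.github.fommil.netlib" % "all" % "1.1.2").pomOnly()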

Spark SQL no longer supports ALTER TABLE since 2.0

Spark SQL cannot read Hive ACID delta files

  • Run a major compaction periodically from a shell script; the compacted base files can then be read.

Caused by: java.lang.NullPointerException: Cannot suppress a null exception.
    at java.lang.Throwable.addSuppressed(Throwable.java:1046)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:243)

  • The MySQL account had no write permission.

java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)

Using hbase-spark: Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging

  • Package the org.apache.spark.Logging class from Spark 1.6.0 into its own jar
  • Add that jar to the lib directory under SPARK_HOME on the machine where the driver runs
  • When the driver runs, also add the jar via add-jar

org.apache.spark.shuffle.FetchFailedException: /data/yarn/local/usercache/root/appcache/application_1514973560079_1031/blockmgr-b8eb9d9c-e6e2-4c4c-a5ac-5120a52b1e32/1d/shuffle_9_1_0.index (No such file or directory)

Spark ALS: recommendations are slow

  • blockify → tune the blockSize (see the custom recommend function and the doTest measurements below)

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 222.0 failed 4 times, most recent failure: Lost task 1.3 in stage 222.0 (TID 9505, uhadoop-adarbt-core1): java.lang.ArrayIndexOutOfBoundsException
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)

Spark 2.3.1 debugging problem: java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local class incompatible: stream classdesc serialVersionUID = -3720498261147521051, local class serialVersionUID = -6655865447853211720
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)

  • Take original-spark-core_2.11-2.3.1-SNAPSHOT.jar from the core module's build target directory and use it to replace the corresponding spark-core jar under scala-2.11/jars in the assembly project.

Saving a small (~11 MB) Spark DataFrame to HDFS uses only one partition; repartition does not help, the data is concentrated on one node, and the computation also runs on only that node

  • Problem 1: after loading, the small dataset is cached in a single executor on a single node
  • Problem 2: once it is cached in that one executor, the parallel tasks are all scheduled onto that executor; the other executors receive no tasks and the effective parallelism collapses

The parameters spark.locality.wait, spark.locality.wait.process, spark.locality.wait.node and spark.locality.wait.rack control the details of the locality-aware task assignment policy.

  • sparkConf.set("spark.locality.wait", "0"): do not wait for a data-local slot, assign tasks immediately (see the sketch below)
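A minimal sketch of that workaround (values illustrative): disable locality waiting when building the session so idle executors get tasks immediately instead of queuing behind the one executor that holds the cached blocks.

    import org.apache.spark.sql.SparkSession

    object LocalityDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("locality-demo")
          .master("local[*]")                          // local master only so the sketch runs standalone
          .config("spark.locality.wait", "0")          // master switch: never wait for a "better" location
          .config("spark.locality.wait.process", "0")  // the per-level variants mentioned above,
          .config("spark.locality.wait.node", "0")     // listed only for completeness
          .config("spark.locality.wait.rack", "0")
          .getOrCreate()
        spark.stop()
      }
    }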

Cartesian-product optimization strategy. Large table (item factors): 250,000 rows × 180 dimensions; small table (user factors): 1,000 users × 180 dimensions.

  • 2 executors with 2 CPU cores each, 4 cores in total
  • postFactorsDf (the item factor table) has four partitions, each shown as 88.0 MB / 23
  • postFactorsDf is cached after loading; the cache blocks are about 4 MB each, so the 23 shown is the number of cached blocks, not records
  • Each partition actually holds 63,250 records, so each cached block holds 63250 / 23 = 2750 records ≈ 4 MB
  • ≈ 0.0013913043 MB per record (88 MB / 63,250 records)
  • The user data is 18.3 KB / 1
 // Adapted from ALSModel.recommendForAll. Assumptions: this lives in the
 // org.apache.spark.ml.recommendation package (so Spark-internal helpers such as
 // TopByKeyAggregator and BoundedPriorityQueue are visible), broadcast comes from
 // org.apache.spark.sql.functions, f2jBLAS is an F2J BLAS instance
 // (e.g. new com.github.fommil.netlib.F2jBLAS), and srcSize / desSize are the block
 // sizes handed to blockify4Long (sketched after this function).
 def recommendPosts4LongIdSmallSrcFactors(rank: Int,
                                           srcFactors: DataFrame,
                                           dstFactors: DataFrame,
                                           srcOutputColumn: String,
                                           dstOutputColumn: String,
                                           num: Int
                                           ): DataFrame = {
    import srcFactors.sparkSession.implicits._

    // Group both factor tables into fixed-size blocks so the join below works
    // block-against-block instead of row-against-row.
    val srcFactorsBlocked = blockify4Long(srcFactors.as[(Long, Array[Float])], srcSize)
    val dstFactorsBlocked = blockify4Long(dstFactors.as[(Long, Array[Float])], desSize)

    // Broadcast the small (user) side, cross-join it with the item blocks, and score
    // every src/dst pair inside each block pair.
    val ratings = broadcast(srcFactorsBlocked).join(dstFactorsBlocked)
      .as[(Seq[(Long, Array[Float])], Seq[(Long, Array[Float])])]
      .flatMap { case (srcIter, dstIter) =>
        val m = srcIter.size
        val n = math.min(dstIter.size, num)
        val output = new Array[(Long, Long, Float)](m * n)
        var i = 0
        val pq = new BoundedPriorityQueue[(Long, Float)](num)(Ordering.by(_._2))
        srcIter.foreach { case (srcId, srcFactor) =>
          dstIter.foreach { case (dstId, dstFactor) =>
            val score = f2jBLAS.sdot(rank, srcFactor, 1, dstFactor, 1)
            pq += dstId -> score
          }
          // Flush this user's top-num candidates from the bounded queue into the output array.
          pq.foreach { case (dstId, score) =>
            output(i) = (srcId, dstId, score)
            i += 1
          }
          pq.clear()
        }
        output.filter(_ != null).toSeq
      }
    // Merge the per-block candidates into a global top list per src id.
    val topKAggregator = new TopByKeyAggregator[Long, Long, Float](num * 10, Ordering.by(_._2))
    val recs = ratings.as[(Long, Long, Float)].groupByKey(_._1).agg(topKAggregator.toColumn)
      .toDF("id", "recommendations")
    recs
  }
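blockify4Long is not shown in the original; a minimal sketch of what it presumably looks like, modeled on ALSModel.blockify in Spark's ALS source (the Long key type is taken from the call sites above, the rest is an assumption). It is meant to sit alongside the function above:

    import org.apache.spark.sql.Dataset

    // Group each partition's (id, factor) rows into blocks of blockSize records,
    // so the cross join compares block against block; this block size is what the
    // doTest runs below vary.
    private def blockify4Long(
        factors: Dataset[(Long, Array[Float])],
        blockSize: Int = 4096): Dataset[Seq[(Long, Array[Float])]] = {
      import factors.sparkSession.implicits._
      factors.mapPartitions(_.grouped(blockSize))
    }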

doTest(userDf,1000,3000)

  • Peak Execution Memory 160.0 MB
  • One dstIter in a partition = 3000 records ≈ 4 MB; 23 dstIter blocks
  • Shuffle Write Size / Records = 10.9 MB / 1000

doTest(userDf,1000,50000)

  • Peak Execution Memory 24.0 MB
  • Shuffle Write Size / Records = 2.2 MB / 1000
  • One dstIter in a partition = 50000 records ≈ 70 MB; 2 dstIter blocks

doTest(userDf,1000,30000)

  • Peak Execution Memory 24.0 MB
  • Shuffle Write Size / Records = 3.2 MB / 1000
  • 3 dstIter blocks

doTest(userDf,1000,10000)

  • Peak Execution Memory 64.0 MB
  • Shuffle Write Size / Records = 7.6 MB / 1000
  • 7 dstIter blocks

doTest(userDf,1000,100000)

  • Peak Execution Memory 20.0 MB
  • Shuffle Write Size / Records = 1.1 MB / 1000

Summary

  • The blockify group size determines how many dstIter blocks there are; for each dstIter block, the flatMap holds srcIter.size (the 1000 users) × top-100 records in its output array
    • However many dstIter blocks are produced, that many batches of 1000 × 100 top-100 records are generated
    • This drives both Peak Execution Memory and Shuffle Write Size
    • The Shuffle Write Size contributed by a single dstIter block is about 1.1 MB
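A back-of-the-envelope check of the per-block cost described above (numbers taken from the runs; raw payload only, ignoring object overhead, partial aggregation, and shuffle compression):

    object BlockMemorySketch {
      def main(args: Array[String]): Unit = {
        val m = 1000                  // srcIter.size: users per broadcast src block
        val n = 100                   // math.min(dstIter.size, num) with num = 100
        val bytesPerEntry = 8 + 8 + 4 // (Long srcId, Long dstId, Float score) raw payload
        val candidates = m * n        // tuples materialised per dstIter block in the flatMap
        println(f"$candidates%d candidates per dstIter block, ~${candidates * bytesPerEntry / 1024.0 / 1024.0}%.1f MB raw")
      }
    }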