Notes on resolving an OOM in Structured Streaming watermark deduplication with spark-streaming-kafka-0-10_2.11, version 2.3.2

This post describes how to tune Spark SQL Streaming configuration parameters to reduce the memory consumed by the HDFS-backed state store and so stabilize a Structured Streaming job. It focuses on the roles of `spark.sql.streaming.minBatchesToRetain` and `spark.sql.streaming.maxBatchesToRetainInMemory` in memory management, and on how tuning them for the workload can cut the 10x to 80x memory overhead observed with the default settings.

Main part of the code:

    val df = kafkaReadStream(spark, KAFKA_INIT_OFFSETS, KAFKA_TOPIC)
      .option("maxOffsetsPerTrigger", 1000) // rate limit: maximum total offsets processed per trigger interval, split proportionally across topicPartitions of different volumes
      .option("fetchOffset.numRetries", 3) // number of retries when fetching offsets
      .option("failOnDataLoss", false) // do not fail the query on data loss, only log a warning
      .load()
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema = getKafkaDNSLogSchema()).as("data"))
      //      .select("data.time","data.host","data.content")
      .select("data.content")
      .filter($"content".isNotNull)
      .map(row => {
        val content = JsonDNSDataHandler(row.getString(0))
        val date1 = CommonUtils.timeStamp2Date(content.split("\t")(0).toLong, "yyyy-MM-dd HH:mm:ss.SSSSSS")
        val timestamp = java.sql.Timestamp.valueOf(date1)

        (timestamp,content)
      }).as[(Timestamp,String)].toDF("timestamp","content")
      .withWatermark("timestamp", "10 minutes") // 10-minute watermark bounds how long deduplication state is retained
      .dropDuplicates("content")

    val query = df.writeStream
      .outputMode(OutputMode.Update()) // emit only rows updated since the last trigger
      .trigger(Trigger.ProcessingTime("2 minutes")) // default is ProcessingTime(0), which runs micro-batches as fast as possible
      //      .trigger(Trigger.ProcessingTime(0))
      //      .format("console") // write to the console, useful for debugging
      .format("cn.pcl.csrc.spark.streaming.HiveSinkProvider") // custom HiveSinkProvider
      .option("checkpointLocation", KAFKA_CHECK_POINTS)
      .start()
    query.awaitTermination()
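
Not part of the original job, but a small sketch that helps correlate the failures below with state growth: a StreamingQueryListener (a standard Structured Streaming API) that logs the size of the deduplication state after every micro-batch. Register it on the same `spark` session before calling `start()`; `memoryUsedBytes` growing without bound is the symptom behind the executor losses in the log that follows.

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Print state-store metrics for every completed micro-batch.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        event.progress.stateOperators.foreach { op =>
          println(s"state rows=${op.numRowsTotal} updated=${op.numRowsUpdated} memoryUsedBytes=${op.memoryUsedBytes}")
        }
      }
    })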

Error log:

21/09/09 09:24:35 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
21/09/09 11:02:17 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 120000 milliseconds, but spent 137183 milliseconds
21/09/09 11:06:04 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 120000 milliseconds, but spent 227294 milliseconds
21/09/09 11:12:35 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 120000 milliseconds, but spent 155791 milliseconds
21/09/09 11:18:41 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 120000 milliseconds, but spent 161555 milliseconds
21/09/09 11:19:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container marked as failed: container_e32_1631077447110_0028_01_000003 on host: hdp03.pcl-test.com. Exit status: 143. Diagnostics: [2021-09-09 11:17:45.087]Container killed on request. Exit code is 143
[2021-09-09 11:17:45.088]Container exited with a non-zero exit code 143.
[2021-09-09 11:17:45.090]Killed by external signal

21/09/09 11:19:29 ERROR YarnScheduler: Lost executor 2 on hdp03.pcl-test.com: Container marked as failed: container_e32_1631077447110_0028_01_000003 on host: hdp03.pcl-test.com. Exit status: 143. Diagnostics: [2021-09-09 11:17:45.087]Container killed on request. Exit code is 143
[2021-09-09 11:17:45.088]Container exited with a non-zero exit code 143.
[2021-09-09 11:17:45.090]Killed by external signal

21/09/09 11:19:29 WARN TaskSetManager: Lost task 0.0 in stage 112.0 (TID 11256, hdp03.pcl-test.com, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_e32_1631077447110_0028_01_000003 on host: hdp03.pcl-test.com. Exit status: 143. Diagnostics: [2021-09-09 11:17:45.087]Container killed on request. Exit code is 143
[2021-09-09 11:17:45.088]Container exited with a non-zero exit code 143.
[2021-09-09 11:17:45.090]Killed by external signal

21/09/09 11:19:29 ERROR TaskSetManager: Task 0 in stage 112.0 failed 1 times; aborting job
21/09/09 11:19:29 ERROR WriteToDataSourceV2Exec: Data source writer com.hortonworks.spark.sql.hive.llap.HiveStreamingDataSourceWriter@7050cc3f is aborting.
21/09/09 11:19:29 ERROR WriteToDataSourceV2Exec: Data source writer com.hortonworks.spark.sql.hive.llap.HiveStreamingDataSourceWriter@7050cc3f aborted.
21/09/09 11:19:29 ERROR MicroBatchExecution: Query [id = 60ca7eca-727c-494c-84a8-aa542340eb53, runId = 2987ae03-058c-4c86-bc68-421083b72fab] terminated with error
org.apache.spark.SparkException: Writing job aborted.
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:256)
        at cn.pcl.csrc.spark.streaming.HiveSink.addBatch(HiveSink.scala:39)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
        at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 112.0 failed 1 times, most recent failure: Lost task 0.0 in stage 112.0 (TID 11256, hdp03.pcl-test.com, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_e32_1631077447110_0028_01_000003 on host: hdp03.pcl-test.com. Exit status: 143. Diagnostics: [2021-09-09 11:17:45.087]Container killed on request. Exit code is 143
[2021-09-09 11:17:45.088]Container exited with a non-zero exit code 143.
[2021-09-09 11:17:45.090]Killed by external signal

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:82)
        ... 30 more
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted.
=== Streaming Query ===
Identifier: [id = 60ca7eca-727c-494c-84a8-aa542340eb53, runId = 2987ae03-058c-4c86-bc68-421083b72fab]
Current Committed Offsets: {KafkaSource[Subscribe[recursive-log]]: {"recursive-log":{"0":2504047}}}
Current Available Offsets: {KafkaSource[Subscribe[recursive-log]]: {"recursive-log":{"0":2504247}}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
Deduplicate [content#39]
+- EventTimeWatermark timestamp#38: timestamp, interval 10 minutes
   +- Project [_1#32 AS timestamp#38, _2#33 AS content#39]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#32, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS _2#33]
         +- MapElements <function1>, interface org.apache.spark.sql.Row, [StructField(content,StringType,true)], obj#31: scala.Tuple2
            +- DeserializeToObject createexternalrow(content#25.toString, StructField(content,StringType,true)), obj#30: org.apache.spark.sql.Row
               +- Filter isnotnull(content#25)
                  +- Project [data#23.content AS content#25]
                     +- Project [jsontostructs(StructField(time,StringType,true), StructField(host,StringType,true), StructField(content,StringType,true), json#21, Some(Asia/Shanghai), true) AS data#23]
                        +- Project [cast(value#8 as string) AS json#21]
                           +- StreamingExecutionRelation KafkaSource[Subscribe[recursive-log]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]

        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Writing job aborted.
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:256)
        at cn.pcl.csrc.spark.streaming.HiveSink.addBatch(HiveSink.scala:39)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
        at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
        ... 1 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 112.0 failed 1 times, most recent failure: Lost task 0.0 in stage 112.0 (TID 11256, hdp03.pcl-test.com, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_e32_1631077447110_0028_01_000003 on host: hdp03.pcl-test.com. Exit status: 143. Diagnostics: [2021-09-09 11:17:45.087]Container killed on request. Exit code is 143
[2021-09-09 11:17:45.088]Container exited with a non-zero exit code 143.
[2021-09-09 11:17:45.090]Killed by external signal

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:82)
        ... 30 more

Reference links that explain the problem:

Problem description

So far only this part of the references has been used, and it has already resolved the issue.

Spark 2.2 (38): Investigation of the high memory consumption of agg and dropDuplicates in Spark Structured Streaming versions before 2.4 (Memory issue with spark structured streaming)
The article "Memory usage of state in Spark Structured Streaming" explains how Spark allocates memory for streaming state and mentions the impact of HDFSBackedStateStoreProvider keeping multiple versions; on Stack Overflow, others have hit the same memory problem with Structured Streaming and analyzed it in "Memory issue with spark structured streaming"; in addition, the following can be found in Spark's official list of issue fixes:

1) Split out a separate setting for how many state versions HDFSBackedStateStoreProvider retains in memory (Split out min retain version of state for memory in HDFSBackedStateStoreProvider)
Problem description:

HDFSBackedStateStoreProvider has only one configuration for minimum versions to retain of state which applies to both memory cache and files. As default version of "spark.sql.streaming.minBatchesToRetain" is set to high (100), which doesn't require strictly 100x of memory, but I'm seeing 10x ~ 80x of memory consumption for various workloads. In addition, in some cases, requiring 2x of memory is even unacceptable, so we should split out configuration for memory and let users adjust to trade-off memory usage vs cache miss.

In normal case, default value '2' would cover both cases: success and restoring failure with less than or around 2x of memory usage, and '1' would only cover success case but no longer require more than 1x of memory. In extreme case, user can set the value to '0' to completely disable the map cache to maximize executor memory.

Fix status:

The upstream issue and PR are summarized in "[SPARK-24717][SS] Split out max retain version of state for memory in HDFSBackedStateStoreProvider #21700" and "Split out min retain version of state for memory in HDFSBackedStateStoreProvider".

Related background:

《Spark Structrued Streaming源码分析--(三)Aggreation聚合状态存储与更新》 (Spark Structured Streaming source-code analysis, part 3: aggregation state storage and update)

That article describes the directory structure HDFSBackedStateStoreProvider uses to store state, and it is part of a series that is worth reading in full. The original post reproduced a diagram of the state directory layout from that article; the image is not included here, but a rough sketch follows.
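
As a rough sketch (from memory of how HDFSBackedStateStoreProvider lays out its files, not taken from the linked article), the deduplication state for this query lives under the checkpoint directory roughly like this:

    <checkpointLocation>/state/<operatorId>/<partitionId>/
        1.delta
        2.delta
        ...
        <version>.snapshot

One `<version>.delta` file is written per micro-batch and periodic `<version>.snapshot` files compact them. `spark.sql.streaming.minBatchesToRetain` bounds how many of these versions are kept, and before SPARK-24717 the same setting also governed how many versions stayed in the executor-side in-memory map, which is where the memory pressure came from.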

Optimized configuration

Solution: when submitting the job, add the following two settings:

--conf spark.sql.streaming.minBatchesToRetain=3 --conf spark.sql.streaming.maxBatchesToRetainInMemory=0 
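
The same two settings can also be applied in code when the SparkSession is built (a minimal sketch, not from the original post; the app name is illustrative). Setting them at submit time with `--conf`, as above, is the safer option because the values are then in place before any streaming state is created.

    import org.apache.spark.sql.SparkSession

    // Mirror of the --conf flags above; tune the values for your workload.
    val spark = SparkSession.builder()
      .appName("dns-log-dedup") // hypothetical name, not from the original job
      .config("spark.sql.streaming.minBatchesToRetain", "3")
      .config("spark.sql.streaming.maxBatchesToRetainInMemory", "0")
      .getOrCreate()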

Current running status: with the two settings in place, the problem has been resolved and the job keeps running.
