An Analysis of Problems Caused by Misusing .transform(func) in Spark Streaming

This article examines problems caused by misusing the transform method in Spark Streaming, including unexpected caching behavior and incorrectly recorded Kafka offsets. It analyzes the root cause in depth, along with the internal logic Spark Streaming uses to generate jobs.

Spark / Spark Streaming's transform is a very powerful method, but there are a few things worth watching out for when you use it. While analyzing these problems we will also walk through the logic by which Spark Streaming generates jobs, so that the root cause becomes clear.

Problem Description

A friend shared a gist today; take a look first and see whether you can spot what is wrong with the code.

Under certain conditions you will find many new cached RDDs on the Storage tab of the UI. You might assume the cached RDDs are simply not being released, but an analysis of Spark Streaming's data cleanup mechanism rules that out.

Then, by giving each RDD a name that includes its batch time, we discovered that even delayed batches were producing cached RDDs. How can that be?
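The original gist is not reproduced here, but the following is a minimal, hypothetical reconstruction of the kind of code in question: an RDD is cached and counted inside transform(func). Everything beyond the Spark Streaming API itself (the socket source, port, and batch interval) is an illustrative assumption.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformMisuseSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transform-misuse").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)

    // The problematic pattern: cache() and count() inside transform(func).
    // count() is an Action, so it runs when the job for this batch is
    // *generated*, not when the job is executed -- even for batches that are
    // still queued. Each invocation also leaves a cached RDD behind, which is
    // why delayed batches show up on the Storage tab.
    val transformed = lines.transform { rdd =>
      val cached = rdd.cache()
      println(s"records in batch: ${cached.count()}")
      cached
    }

    transformed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}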

There is another problem with the same root cause. I used KafkaInputStream.transform to obtain Kafka offsets and save them to HDFS, and it turned out that offsets were recorded as soon as a job was generated (including jobs that had not yet been executed). So if the application crashes, the "latest" offset you see actually belongs to a delayed batch rather than to the batch that was being processed at the time of the failure, which makes recovery much harder.
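For concreteness, here is a minimal sketch of that pattern, assuming the direct Kafka stream API (spark-streaming-kafka-0-10), where each batch RDD implements HasOffsetRanges. The helper saveOffsetsToHdfs is hypothetical and stands in for an immediate HDFS write:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Problematic pattern: persisting offsets from inside transform(func).
// The body of transform runs when the job for a batch is generated, so
// offsets are written out even for batches that are still queued and
// have not actually been processed yet.
def recordOffsets(
    stream: InputDStream[ConsumerRecord[String, String]]
): DStream[ConsumerRecord[String, String]] =
  stream.transform { rdd =>
    val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    saveOffsetsToHdfs(ranges) // hypothetical helper: immediate HDFS write
    rdd
  }

// Hypothetical placeholder for code that writes the offsets to HDFS.
def saveOffsetsToHdfs(ranges: Array[OffsetRange]): Unit = ???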

Problem Analysis

The issue is this: you can do fairly complex work inside transform, but the function you pass to transform is special in that it is executed inside TransformedDStream.compute. You therefore have to make sure everything in it is a transformation (lazy), not an Action (such as the count in the first example), and that nothing in it executes immediately (such as, in my example, writing Kafka offsets to HDFS directly through the HDFS API).

override def compute(validTime: Time): Option[RDD[U]] = {
    val parentRDDs = parents.map { parent =>
    ....
    // Note this line: your function is invoked as soon as compute is called
    val transformedRDD = transformFunc(parentRDDs, validTime)
    if (transformedRDD == null) {
      throw new SparkException.....
    }
    Some(transformedRDD)
  }

This raises two questions:

  • Calls like .map and .transform are transformations; aren't they supposed to execute only after the job is actually submitted?
  • Why is DStream.compute already being called at generateJob time?

How Spark Streaming's generateJob Works

In JobGenerator, a GenerateJobs event is fired periodically:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

That event is handled by DStreamGraph.generateJobs, and the logic for producing jobs is quite simple:

def generateJobs(time: Time): Seq[Job] = {   
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        ........    
  }

It simply calls generateJob on each outputStream; a typical outputStream is ForEachDStream. Taking ForEachDStream as an example, a job is produced like this:

override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

As you can see, this triggers compute all the way down the DStream chain, which means the compute method of every DStream produced by a transformation gets called.

Normally that is harmless: in the MappedDStream produced by .map(func), for example, running compute only records func; it does not execute it. TransformedDStream, however, is special, and its func is actually executed. In its compute method you will see this line:

val transformedRDD = transformFunc(parentRDDs, validTime)
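For contrast, MappedDStream.compute (essentially as it appears in the Spark source) only attaches mapFunc to the batch RDD's lineage via rdd.map; nothing user-supplied runs at this point:

// MappedDStream.compute: mapFunc is merely recorded on the RDD here;
// it is not invoked until the job actually executes.
override def compute(validTime: Time): Option[RDD[U]] = {
  parent.getOrCompute(validTime).map(_.map[U](mapFunc))
}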

Here transformFunc is exactly the func you passed to transform(func). Because transform is so flexible and lets you run arbitrary RDD operations, Spark Streaming cannot stop you: the moment you use an Action such as count, it executes when the job is generated instead of waiting for the job to be submitted.
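A safer place for the offset bookkeeping is inside foreachRDD: as the generateJob code above shows, foreachFunc is wrapped into jobFunc and only runs when the job is actually executed. A minimal sketch, again assuming the direct Kafka API and the hypothetical saveOffsetsToHdfs helper from earlier:

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // Persist the offsets only after the batch has been processed, so the
  // latest recorded offset always corresponds to work that actually ran.
  saveOffsetsToHdfs(ranges) // hypothetical helper
}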
