transform and foreachRDD

CHSN

Published 2022-05-22 10:10:02

Licensed under CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/csncd/article/details/124907310

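This article looks at DStream operations in Spark Streaming, focusing on the transform operator. transform creates a new DStream by applying a function to each RDD of the source DStream; it is a pure transformation and does not itself trigger any computation. For comparison, the article also covers foreachRDD, an output operation that runs an arbitrary function on each RDD of a DStream and so makes a wider range of RDD operations available.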

Source code of the transform operator:

  /**
   * Return a new DStream in which each RDD is generated by applying a function
   * on each RDD of 'this' DStream.
   */
  def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = ssc.withScope {
    // because the DStream is reachable from the outer object here, and because
    // DStreams can't be serialized with closures, we can't proactively check
    // it for serializability and so we pass the optional false to SparkContext.clean
    val cleanedF = context.sparkContext.clean(transformFunc, false)
    transform((r: RDD[T], _: Time) => cleanedF(r))
  }

  /**
   * Return a new DStream in which each RDD is generated by applying a function
   * on each RDD of 'this' DStream.
   */
  def transform[U: ClassTag](transformFunc: (RDD[T], Time) => RDD[U]): DStream[U] = ssc.withScope {
    // because the DStream is reachable from the outer object here, and because
    // DStreams can't be serialized with closures, we can't proactively check
    // it for serializability and so we pass the optional false to SparkContext.clean
    val cleanedF = context.sparkContext.clean(transformFunc, false)
    val realTransformFunc = (rdds: Seq[RDD[_]], time: Time) => {
      assert(rdds.length == 1)
      cleanedF(rdds.head.asInstanceOf[RDD[T]], time)
    }
    new TransformedDStream[U](Seq(this), realTransformFunc)
  }

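transform returns a new DStream in which each RDD is generated by applying a function to the corresponding RDD of the source DStream. Because that function has the type RDD[T] => RDD[U], it must end by returning an RDD. transform is therefore a transformation rather than an output operation: it converts each batch's RDD into another RDD and does not by itself trigger any computation, and the result of an action such as count or collect cannot be what the function returns.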
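To make this concrete, here is a minimal sketch rather than a definitive implementation. It assumes a local Spark installation and feeds the stream from queueStream so that it needs no external source. The object name TransformSketch, the blocked RDD and the sample words are all hypothetical. The function passed to transform uses subtractByKey, an RDD operation that takes a second RDD, and returns an RDD; nothing runs until an output operation is attached.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical demo object; every name in it is illustrative.
object TransformSketch {

  /** Pushes one batch through transform and returns the words that survived. */
  def run(): Seq[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // A static RDD of blocked words, built once on the driver (assumed data).
    val blocked: RDD[(String, Boolean)] = sc.parallelize(Seq(("spam", true)))

    // queueStream turns a queue of RDDs into a DStream, one RDD per batch by default.
    val queue = mutable.Queue[RDD[String]]()
    queue += sc.parallelize(Seq("hello", "spam", "world"))
    val words = ssc.queueStream(queue)

    // The function receives the batch's RDD and must return an RDD.
    val cleaned = words.transform { rdd =>
      rdd.map(word => (word, 1)).subtractByKey(blocked).keys
    }

    // transform only describes the computation; an output operation triggers it.
    val collected = mutable.ArrayBuffer[String]()
    cleaned.foreachRDD { rdd =>
      val batch = rdd.collect()
      collected.synchronized { collected ++= batch }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // give the first batches time to run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    collected.synchronized { collected.toList.sorted }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

If the foreachRDD call were removed, Spark Streaming would refuse to start the context because no output operation is registered, which is another way of seeing that transform on its own executes nothing.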

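foreachRDD, by contrast, hands each RDD of the DStream to a function that returns nothing (RDD[T] => Unit). It is an output operation, so the full RDD API, actions included, can be used inside it, which makes a wider range of RDD operations available on the stream.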
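The following sketch illustrates foreachRDD under the same assumptions as before: a local Spark installation and a stream fed from queueStream. The object name ForeachRDDSketch and the sample numbers are hypothetical. It uses the (RDD[T], Time) overload and calls the actions isEmpty and count inside the function, which is allowed because foreachRDD does not have to return an RDD.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

// Hypothetical demo object; every name in it is illustrative.
object ForeachRDDSketch {

  /** Counts the even numbers in each non-empty batch, in batch order. */
  def run(): Seq[Long] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ForeachRDDSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Two queued RDDs become two consecutive batches (assumed sample data).
    val queue = mutable.Queue[RDD[Int]]()
    queue += sc.parallelize(1 to 10)
    queue += sc.parallelize(1 to 4)
    val numbers = ssc.queueStream(queue)

    val counts = mutable.ArrayBuffer[Long]()
    // This function body runs on the driver once per batch;
    // the actions it calls (isEmpty, count) launch Spark jobs on the batch's RDD.
    numbers.foreachRDD { (rdd: RDD[Int], time: Time) =>
      if (!rdd.isEmpty()) {
        val evens = rdd.filter(_ % 2 == 0).count()
        println(s"batch $time: $evens even numbers")
        counts.synchronized { counts += evens }
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(6000) // give both batches time to run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    counts.synchronized { counts.toList }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

The design point is the return type: the function given to foreachRDD yields Unit, so results leave the stream through side effects such as printing, collecting into a driver-side buffer as done here, or writing to an external system.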