Spark 2.3 RDD map Source Code Analysis

DPnice

Published 2018-04-26 13:57:53

License: CC 4.0 BY-SA

Category: spark. Tags: spark, map, scala

Article link: https://blog.youkuaiyun.com/DPnice/article/details/80092247

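This post walks through how map is implemented in Spark and in Scala. In Spark, map creates a new RDD by applying a function to every element of the source RDD. In Scala, Iterator.map creates a new iterator that applies a transformation function to each value the original iterator produces. The RDD returned by Spark's map has the same number of partitions as its parent.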

Spark map source

/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
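
To make the shape of that method easier to see, here is a minimal toy model in plain Scala. It is not Spark's real code: `MiniRDD`, `MiniSeqRDD`, `MiniMapPartitionsRDD` and the `collect` helper are hypothetical names invented for this sketch, and the task context, closure cleaning (`sc.clean`) and `ClassTag` are left out on purpose. It only illustrates the idea that map wraps the parent and maps each partition's iterator.

```scala
// Toy model (not Spark's real classes): an "RDD" is a fixed number of
// partitions, each of which can be computed into an iterator on demand.
abstract class MiniRDD[T] {
  def numPartitions: Int
  def compute(pid: Int): Iterator[T]

  // Same shape as the excerpt above: wrap the parent and map each
  // partition's iterator with the user function.
  def map[U](f: T => U): MiniRDD[U] =
    new MiniMapPartitionsRDD[U, T](this, (pid, iter) => iter.map(f))

  // Hypothetical helper: materialize every partition in order.
  def collect(): List[T] =
    (0 until numPartitions).flatMap(pid => compute(pid)).toList
}

// Leaf "RDD" backed by in-memory partitions.
class MiniSeqRDD[T](parts: Vector[Vector[T]]) extends MiniRDD[T] {
  def numPartitions: Int = parts.length
  def compute(pid: Int): Iterator[T] = parts(pid).iterator
}

// Child "RDD": takes its partition count from the parent and applies f
// to the parent's iterator for the requested partition.
class MiniMapPartitionsRDD[U, T](
    prev: MiniRDD[T],
    f: (Int, Iterator[T]) => Iterator[U]) extends MiniRDD[U] {
  def numPartitions: Int = prev.numPartitions
  def compute(pid: Int): Iterator[U] = f(pid, prev.compute(pid))
}

object MiniRDDDemo {
  def main(args: Array[String]): Unit = {
    val source = new MiniSeqRDD(Vector(Vector(1, 2), Vector(3), Vector(4, 5, 6)))
    val doubled = source.map(_ * 2)
    println(doubled.numPartitions) // 3
    println(doubled.collect())     // List(2, 4, 6, 8, 10, 12)
  }
}
```

In this model the child never touches the data when it is constructed. It only remembers the parent and the function, which is why the partition count carries over unchanged.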

Scala map source

/** Creates a new iterator that maps all produced values of this iterator
   *  to new values using a transformation function.
   *
   *  @param f  the transformation function
   *  @return a new iterator which transforms every value produced by this
   *          iterator by applying the function `f` to it.
   *  @note   Reuse: $consumesAndProducesIterator
   */
  def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
    def hasNext = self.hasNext
    def next() = f(self.next())
  }
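
The implementation above only defines `hasNext` and `next()`, so nothing is computed when `map` itself is called. The short sketch below checks this with the standard library's `Iterator`. The counter `calls` and the object name `LazyIteratorMapDemo` are made up for the illustration.

```scala
object LazyIteratorMapDemo {
  // Returns (calls to f before any next(), after one next(), after draining).
  def run(): (Int, Int, Int) = {
    var calls = 0
    val source = Iterator(1, 2, 3)
    val mapped = source.map { x => calls += 1; x * 10 }

    val beforeNext = calls    // map alone has not invoked f
    mapped.next()             // pulls exactly one element through f
    val afterOneNext = calls
    mapped.foreach(_ => ())   // drain the remaining two elements
    (beforeNext, afterOneNext, calls)
  }

  def main(args: Array[String]): Unit =
    println(run()) // (0,1,3)
}
```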

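map applies the supplied function f to every element of each partition's iterator in the parent RDD. Under the hood it delegates to Scala's Iterator.map, which is lazy: f runs on an element only when next() is called on the mapped iterator, that is, when an action triggers computation of that partition. The result is a new RDD with the same number of partitions as the parent.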
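
A quick way to check the partition behavior against a real Spark installation is sketched below. It assumes Spark is on the classpath and runs in local mode. The app name, the `local[2]` master and the object name `RddMapDemo` are arbitrary choices for this example. It also compares `map` with the equivalent `mapPartitions` call, which is essentially what the source excerpt does internally.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddMapDemo {
  def main(args: Array[String]): Unit = {
    // local[2] and the app name are arbitrary choices for this sketch.
    val conf = new SparkConf().setAppName("rdd-map-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 8, numSlices = 4)
      val doubled = numbers.map(_ * 2) // lazy: no job runs yet
      val viaPartitions = numbers.mapPartitions(iter => iter.map(_ * 2))

      println(numbers.getNumPartitions) // partition count of the parent
      println(doubled.getNumPartitions) // partition count after map
      // collect() is an action, so this is where the functions actually run.
      println(doubled.collect().sameElements(viaPartitions.collect()))
    } finally {
      sc.stop()
    }
  }
}
```

The two partition counts should match, and the last line should report that both RDDs hold the same elements.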