Spark Tuning (1): How the map and flatMap operators differ from mapValues and flatMapValues

Latest recommended article published 2025-04-15 12:16:42

xiaolin_xinji

Category: Spark. Tags: mapValue, map, flatmap

Article link: https://blog.youkuaiyun.com/weixin_44131414/article/details/108978841

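This article examines how the RDD operators map, flatMap, mapValues, and flatMapValues work in Spark, and uses the source code to show how they affect the number of shuffles. It focuses on how mapValues reduces shuffles by preserving the partitioner.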

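Most readers will already be familiar with the map, flatMap, mapValues, and flatMapValues operators. Under the hood, each of them constructs a MapPartitionsRDD, which is a subclass of RDD.

Consider a simple example: how many shuffles does each of the following two jobs trigger?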

// After reduceByKey, transform the values with map, then groupByKey
data.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).map(x=>(x._1,x._2*10)).groupByKey().foreach(println)
// After reduceByKey, transform the values with mapValues, then groupByKey
data.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).mapValues(_*10).groupByKey().foreach(println)

The DAG visualizations for the two jobs:

[Figure: DAG of the job that uses map]
[Figure: DAG of the job that uses mapValues]

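The two figures above show that the job using mapValues shuffles only once, not the two times one might expect. The explanation lies in the preservesPartitioning parameter of MapPartitionsRDD, together with combineByKeyWithClassTag, the function that groupByKey is built on.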
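The shuffle count can also be checked without the Spark UI. The following is a minimal sketch, assuming Spark is on the classpath and a local-mode SparkContext is acceptable; `ShuffleCount`, `countShuffles`, `wordCounts`, and the sample input are hypothetical names and data chosen for illustration. It walks an RDD's lineage and counts `ShuffleDependency` edges, each of which corresponds to one shuffle.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleCount {
  // Count ShuffleDependency edges by walking the lineage from child to parents.
  // Assumes no RDD is shared by several children (a shared ancestor would be
  // visited, and counted, more than once).
  def countShuffles(rdd: RDD[_]): Int =
    rdd.dependencies.map { dep =>
      val own = dep match {
        case _: ShuffleDependency[_, _, _] => 1
        case _                             => 0
      }
      own + countShuffles(dep.rdd)
    }.sum

  // Hypothetical stand-in for the article's `data`, reduced to word counts.
  def wordCounts(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("shuffle-count"))
    try {
      val counts = wordCounts(sc)
      val viaMap = counts.map(x => (x._1, x._2 * 10)).groupByKey()
      val viaMapValues = counts.mapValues(_ * 10).groupByKey()
      println(s"map:       ${countShuffles(viaMap)} shuffle(s)")
      println(s"mapValues: ${countShuffles(viaMapValues)} shuffle(s)")
    } finally sc.stop()
  }
}
```

No action is needed to build the lineage, so the check is cheap. If the reasoning in this article holds for your Spark version and configuration, the map variant should report two shuffles and the mapValues variant one.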

Analysis

MapPartitionsRDD source:
The source shows what preservesPartitioning means: whether the new RDD keeps the partitioner of its parent RDD.

/**
 * An RDD that applies the provided function to every partition of the parent RDD.
 *
 * @param prev the parent RDD.
 * @param f The function used to map a tuple of (TaskContext, partition index, input iterator) to
 *          an output iterator.
 * @param preservesPartitioning Whether the input function preserves the partitioner, which should
 *                              be `false` unless `prev` is a pair RDD and the input function
 *                              doesn't modify the keys.
 * @param isOrderSensitive whether or not the function is order-sensitive. If it's order
 *                         sensitive, it may return totally different result when the input order
 *                         is changed. Mostly stateful functions are order-sensitive.
 */
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false,
    isOrderSensitive: Boolean = false)
  extends RDD[U](prev) {
  
  // Keep firstParent's partitioner only when preservesPartitioning is true
  
  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
  ......
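
MapPartitionsRDD itself is private[spark], but the same flag is available to user code through the public `RDD.mapPartitions(f, preservesPartitioning)` method. The following is a minimal sketch, assuming a local-mode SparkContext; `PreservesPartitioningDemo` and its function names are hypothetical. It applies the same key-preserving function twice, once with the flag set and once with the default, and compares the resulting partitioners.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PreservesPartitioningDemo {
  // Build a pair RDD with a known partitioner.
  def partitioned(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .partitionBy(new HashPartitioner(4))

  // Values are scaled and keys are untouched, so keeping the partitioner is safe.
  def scaleKeeping(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.mapPartitions(_.map { case (k, v) => (k, v * 10) }, preservesPartitioning = true)

  // Same function with the default flag (false): the partitioner is dropped.
  def scaleDropping(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.mapPartitions(_.map { case (k, v) => (k, v * 10) })

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("preserves-partitioning"))
    try {
      val base = partitioned(sc)
      println(s"kept:    ${scaleKeeping(base).partitioner}")
      println(s"dropped: ${scaleDropping(base).partitioner}")
    } finally sc.stop()
  }
}
```

Spark does not verify the flag. Setting it to true for a function that does change keys leaves the RDD claiming a partitioning it no longer has, which is why the scaladoc above says it should stay false unless the keys are unmodified.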

map source:
map constructs a MapPartitionsRDD and leaves preservesPartitioning at its default of false, so the resulting RDD has no partitioner.

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

mapValues source:
mapValues constructs a MapPartitionsRDD with preservesPartitioning = true, so the resulting RDD keeps its parent's partitioner.

/**
   * Pass each value in the key-value pair RDD through a map function without changing the keys;
   * this also retains the original RDD's partitioning.
   */
  def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
      preservesPartitioning = true)
  }
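
The title also covers flatMap and flatMapValues, whose source is not quoted here. As far as the Spark source goes, flatMap leaves the flag at false like map, and flatMapValues sets it to true like mapValues. The following is a minimal sketch for checking this on your own Spark version, assuming a local-mode SparkContext; `OperatorPartitioners`, `report`, and the sample input are hypothetical. It applies each of the four operators after reduceByKey and records whether the result still has a partitioner.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OperatorPartitioners {
  // For each operator applied after reduceByKey, report whether the
  // resulting RDD still carries a partitioner.
  def report(sc: SparkContext): Map[String, Boolean] = {
    val reduced: RDD[(String, Int)] =
      sc.parallelize(Seq("a b a", "b c"))
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)

    Map(
      "map" ->
        reduced.map { case (k, v) => (k, v * 10) }.partitioner.isDefined,
      "flatMap" ->
        reduced.flatMap { case (k, v) => Seq((k, v), (k, v * 10)) }.partitioner.isDefined,
      "mapValues" ->
        reduced.mapValues(_ * 10).partitioner.isDefined,
      "flatMapValues" ->
        reduced.flatMapValues(v => Seq(v, v * 10)).partitioner.isDefined
    )
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("operator-partitioners"))
    try report(sc).foreach { case (op, kept) => println(s"$op keeps partitioner: $kept") }
    finally sc.stop()
  }
}
```

Note that map and flatMap drop the partitioner even when the supplied function happens to leave the keys alone, because Spark cannot tell from the function what it does to the keys.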

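The source above shows where the partitioner of the RDD that feeds groupByKey comes from. Now let's look at the groupByKey source.

groupByKey source:
groupByKey delegates to combineByKeyWithClassTag. The source shows that when self.partitioner == Some(partitioner), no ShuffledRDD is created, so no shuffle takes place. The reason is that both operators use the same partitioner: every key already sits in the partition that the reduceByKey shuffle assigned it to (a key that maps to partition 1 will always be in partition 1), so the values can be combined in place inside each partition without moving any data between partitions.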

 def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
      // Same partitioner: combine within each partition; otherwise shuffle through a ShuffledRDD
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
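
The two branches above can be observed from user code by inspecting the dependency between the grouped RDD and its parent. The following is a minimal sketch, assuming a local-mode SparkContext; `GroupByKeyBranch`, `parentDependency`, and the sample data are hypothetical. It pre-partitions a pair RDD with a HashPartitioner of 4 partitions, then calls groupByKey once with an equal partitioner and once with a different one.

```scala
import org.apache.spark.{HashPartitioner, OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupByKeyBranch {
  // Describe the dependency between an RDD and its direct parent.
  def parentDependency(rdd: RDD[_]): String = rdd.dependencies.head match {
    case _: ShuffleDependency[_, _, _] => "shuffle"
    case _: OneToOneDependency[_]      => "one-to-one"
    case other                         => other.getClass.getSimpleName
  }

  // A pair RDD that already has a HashPartitioner with 4 partitions.
  def prePartitioned(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))
      .partitionBy(new HashPartitioner(4))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("groupbykey-branch"))
    try {
      val pairs = prePartitioned(sc)
      // Equal partitioner: the mapPartitions branch is taken.
      println(parentDependency(pairs.groupByKey(new HashPartitioner(4))))
      // Different partitioner: the ShuffledRDD branch is taken.
      println(parentDependency(pairs.groupByKey(new HashPartitioner(8))))
    } finally sc.stop()
  }
}
```

The comparison in combineByKeyWithClassTag uses the partitioner's equals method, not reference identity. Two separate HashPartitioner instances with the same number of partitions compare equal, which is why the first call avoids the shuffle.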
