Transformation operators whose data is in Key-Value form fall roughly into three groups: operators with a one-to-one mapping between input and output partitions, aggregation operators, and join operators.
One-to-One Mapping Between Input and Output Partitions
mapValues
mapValues: applies a map function to the Value of each (Key, Value) pair while leaving the Key untouched.
The boxes in the figure represent RDD partitions. The function a => a + 2 adds 2 only to the value 1 of (V1, 1), producing (V1, 3).
Source code:
/**
* Pass each value in the key-value pair RDD through a map function without changing the keys;
* this also retains the original RDD's partitioning.
*/
def mapValues[U](f: V => U): RDD[(K, U)] = {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}
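As a usage illustration, here is a minimal sketch (the local[2] master, app name, and sample pairs are assumptions chosen to mirror the figure's (V1, 1) data):

import org.apache.spark.{SparkConf, SparkContext}

object MapValuesExample {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch
    val sc = new SparkContext(new SparkConf().setAppName("mapValuesDemo").setMaster("local[2]"))

    // Pair RDD mirroring the figure: each key carries the value 1
    val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 1), ("V3", 1)))

    // Only the values go through a => a + 2; keys are untouched
    val bumped = pairs.mapValues(a => a + 2)

    bumped.collect().foreach(println) // (V1,3), (V2,3), (V3,3)
    sc.stop()
  }
}

Because preservesPartitioning = true in the source above, a downstream operation that uses the same partitioner (e.g., a join or reduceByKey) can avoid an extra shuffle.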
Aggregation over a Single RDD or Two RDDs
(1) combineByKey
combineByKey aggregates within a single RDD. It can, for example, turn an RDD whose elements are of type (Int, Int) into an RDD whose elements are of type (Int, Seq[Int]).
The parameters of the combineByKey operator are as follows:
- createCombiner: V => C, invoked when no combiner C exists yet for a key, e.g., creates a seq C from a value V.
- mergeValue: (C, V) => C, invoked when a C already exists and a new V must be merged in, e.g., appends item V to seq C, or accumulates it.
- mergeCombiners: (C, C) => C, merges two C's into one.
- partitioner: Partitioner, the partitioning strategy used to partition the data during the shuffle.
- mapSideCombine: Boolean = true, to reduce the amount of data transferred, much of the combining can be done on the map side first. For example, the values of identical keys can be accumulated within a partition before the shuffle.
- serializer: Serializer = null, data in transit must be serialized; users may supply a custom serializer.
The boxes in the figure represent RDD partitions. Through combineByKey, (V1, 2) and (V1, 1) are merged into (V1, Seq(2, 1)).
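Before reading the source, here is a minimal usage sketch of that (Int, Int) to (Int, Seq[Int]) aggregation (the local master, sample data, and partition count are assumptions for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CombineByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combineByKeyDemo").setMaster("local[2]"))

    // (Int, Int) pairs; key 1 appears twice, mirroring (V1,2) and (V1,1) in the figure
    val pairs = sc.parallelize(Seq((1, 2), (1, 1), (2, 5)))

    val grouped = pairs.combineByKey(
      (v: Int) => Seq(v),                        // createCombiner: the first value for a key starts a Seq
      (c: Seq[Int], v: Int) => c :+ v,           // mergeValue: append a value to the partition-local Seq
      (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2,  // mergeCombiners: merge Seqs from different partitions
      new HashPartitioner(2))                    // partition count is an arbitrary choice for the sketch

    grouped.collect().foreach(println) // e.g. (1,List(2, 1)), (2,List(5))
    sc.stop()
  }
}

Note how the three functions divide the work: createCombiner fires once per key per partition, mergeValue handles subsequent values within the same partition, and mergeCombiners reconciles the partial results across partitions during the shuffle.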
Source code:
/**
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
* Note that V and C can be different -- for example, one might group an RDD of type
* (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, and whether to perform
* map-side aggregation (if a mapper can produce multiple items with the same key).
*/
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {