Transformation operators whose data is in Key-Value form fall roughly into three groups: operators with a one-to-one mapping between input and output partitions, aggregation operators, and join operators.
One-to-One Mapping Between Input and Output Partitions
mapValues
mapValues: applies a map function to the Value of each (Key, Value) pair while leaving the Key untouched.
The boxes in the figure represent RDD partitions. The function a => a + 2 adds 2 only to the value 1 of (V1, 1), producing (V1, 3).
Source code:
/**
* Pass each value in the key-value pair RDD through a map function without changing the keys;
* this also retains the original RDD's partitioning.
*/
def mapValues[U](f: V => U): RDD[(K, U)] = {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}
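As a usage illustration, here is a minimal sketch (the local[2] master, app name, and sample pairs are assumptions chosen to mirror the figure's (V1, 1) data):

import org.apache.spark.{SparkConf, SparkContext}

object MapValuesExample {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch
    val sc = new SparkContext(new SparkConf().setAppName("mapValuesDemo").setMaster("local[2]"))

    // Pair RDD mirroring the figure: each key carries the value 1
    val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 1), ("V3", 1)))

    // Only the values go through a => a + 2; keys are untouched
    val bumped = pairs.mapValues(a => a + 2)

    bumped.collect().foreach(println) // (V1,3), (V2,3), (V3,3)
    sc.stop()
  }
}

Because preservesPartitioning = true in the source above, a downstream operation that uses the same partitioner (e.g., a join or reduceByKey) can avoid an extra shuffle.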
Aggregation over a Single RDD or Two RDDs
(1) combineByKey
combineByKey aggregates within a single RDD. It can, for example, turn an RDD whose elements are of type (Int, Int) into an RDD whose elements are of type (Int, Seq[Int]).
The parameters of the combineByKey operator are as follows:
- createCombiner: V => C, invoked when no combiner C exists yet for a key, e.g., creates a seq C from a value V.
- mergeValue: (C, V) => C, invoked when a C already exists and a new V must be merged in, e.g., appends item V to seq C, or accumulates it.
- mergeCombiners: (C, C) => C, merges two C's into one.
- partitioner: Partitioner, the partitioning strategy used to partition the data during the shuffle.
- mapSideCombine: Boolean = true, to reduce the amount of data transferred, much of the combining can be done on the map side first. For example, the values of identical keys can be accumulated within a partition before the shuffle.
- serializer: Serializer = null, data in transit must be serialized; users may supply a custom serializer.
The boxes in the figure represent RDD partitions. Through combineByKey, (V1, 2) and (V1, 1) are merged into (V1, Seq(2, 1)).
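Before reading the source, here is a minimal usage sketch of that (Int, Int) to (Int, Seq[Int]) aggregation (the local master, sample data, and partition count are assumptions for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CombineByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combineByKeyDemo").setMaster("local[2]"))

    // (Int, Int) pairs; key 1 appears twice, mirroring (V1,2) and (V1,1) in the figure
    val pairs = sc.parallelize(Seq((1, 2), (1, 1), (2, 5)))

    val grouped = pairs.combineByKey(
      (v: Int) => Seq(v),                        // createCombiner: the first value for a key starts a Seq
      (c: Seq[Int], v: Int) => c :+ v,           // mergeValue: append a value to the partition-local Seq
      (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2,  // mergeCombiners: merge Seqs from different partitions
      new HashPartitioner(2))                    // partition count is an arbitrary choice for the sketch

    grouped.collect().foreach(println) // e.g. (1,List(2, 1)), (2,List(5))
    sc.stop()
  }
}

Note how the three functions divide the work: createCombiner fires once per key per partition, mergeValue handles subsequent values within the same partition, and mergeCombiners reconciles the partial results across partitions during the shuffle.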
Source code:
/**
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
* Note that V and C can be different -- for example, one might group an RDD of type
* (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, and whether to perform
* map-side aggregation (if a mapper can produce multiple items with the same key).
*/
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {