An In-Depth Look at Five Commonly Used and Easily Confused Key-Value Operators in Spark

Latest recommended article published on 2023-09-18 16:57:01

License: CC 4.0 BY-SA

Tags:

#spark #bigdata #scala #java #hadoop

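This article was originally published on my WeChat official account "大数据学习应用".
Reply "spark源码" in the account's chat to view the Spark source code analysis series.
This article is my own original work; please do not repost it without permission.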

This article is about 4,400 Chinese characters long.

Preface

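Spark ships with a large number of useful built-in operators, and combining them is enough to implement the functionality a business needs.

Spark programming ultimately comes down to using these operators, so it is well worth mastering them.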

This article focuses on the following Spark operators:

groupByKey
reduceByKey
aggregateByKey
foldByKey
combineByKey

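All of these operators work on RDDs of (k, v) pairs.

Although each of them merges values per key step by step, they differ in the parameters they accept and in the rules they apply within a partition and across partitions.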
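To make the comparison concrete, here is a minimal sketch that computes the same per-key sum with each of the five operators. It assumes Spark is on the classpath and runs in local mode; the object name `FiveOperatorsSketch`, the helper `perKeySums`, and the sample data are illustrative choices, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FiveOperatorsSketch {
  // Computes the per-key sum five ways and returns one Map per operator.
  def perKeySums(sc: SparkContext): Seq[Map[String, Int]] = {
    // Sample (k, v) data spread over 2 partitions (an arbitrary choice).
    val rdd = sc.makeRDD(
      List(("hello", 1), ("hello", 2), ("hadoop", 2), ("hadoop", 2), ("hadoop", 4)), 2)

    // groupByKey: collect all values per key first, then sum them.
    val viaGroup = rdd.groupByKey().mapValues(_.sum)
    // reduceByKey: one function, used both within and across partitions.
    val viaReduce = rdd.reduceByKey(_ + _)
    // aggregateByKey: zero value, intra-partition function, inter-partition function.
    val viaAggregate = rdd.aggregateByKey(0)(_ + _, _ + _)
    // foldByKey: zero value plus a single function for both levels.
    val viaFold = rdd.foldByKey(0)(_ + _)
    // combineByKey: build the initial combiner, merge a value, merge two combiners.
    val viaCombine = rdd.combineByKey(
      (v: Int) => v,
      (acc: Int, v: Int) => acc + v,
      (a: Int, b: Int) => a + b)

    Seq(viaGroup, viaReduce, viaAggregate, viaFold, viaCombine).map(_.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    // Local mode with 2 threads; the application name is arbitrary.
    val conf = new SparkConf().setMaster("local[2]").setAppName("five-operators-sketch")
    val sc = new SparkContext(conf)
    try perKeySums(sc).foreach(println)
    finally sc.stop()
  }
}
```

All five maps contain hello -> 3 and hadoop -> 8. The results are identical here because summation uses the same rule inside and across partitions; the operators start to differ once those two rules, or the result type, are different.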

groupByKey()

Function signatures

def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
}

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
}

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
        createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
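The way `createCombiner` and `mergeValue` build one buffer per key can be modelled in plain Scala. This is only a sketch under stated assumptions: `GroupByKeyModel` and `groupAfterShuffle` are hypothetical names, `ArrayBuffer` stands in for `CompactBuffer` (which is `private[spark]` and cannot be used from application code), and the input is taken to be the records that have already arrived at one reduce task. Because every value of a key is fed into a single buffer in this simplified model, `mergeCombiners` is not exercised.

```scala
import scala.collection.mutable

object GroupByKeyModel {
  // Stand-ins for two of the functions in the source above.
  def createCombiner[V](v: V): mutable.ArrayBuffer[V] = mutable.ArrayBuffer(v)
  def mergeValue[V](buf: mutable.ArrayBuffer[V], v: V): mutable.ArrayBuffer[V] = buf += v

  // Models mapSideCombine = false: records arrive un-combined, and the
  // per-key buffers are built only after the shuffle.
  def groupAfterShuffle[K, V](records: Seq[(K, V)]): Map[K, Seq[V]] = {
    val buffers = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    for ((k, v) <- records) {
      buffers.get(k) match {
        case Some(buf) => mergeValue(buf, v)             // key seen before: append
        case None      => buffers(k) = createCombiner(v) // first value for this key
      }
    }
    buffers.map { case (k, buf) => k -> buf.toSeq }.toMap
  }
}
```

Calling `GroupByKeyModel.groupAfterShuffle` on the pairs ("hello", 1), ("hello", 2), ("hadoop", 2), ("hadoop", 2), ("hadoop", 4) yields hello -> (1, 2) and hadoop -> (2, 2, 4): nothing is reduced, every value is kept.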

Function description

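groupByKey() groups values by key.
It gathers the records that share a key and returns one Iterable[V] per key.
That Iterable[V] holds, one by one, the values that were paired with the key.
If you print the result directly, each value shows up as a CompactBuffer, the concrete data structure behind the Iterable.
groupByKey() has to wait: the group for a key is complete only after every record with that key has arrived, so downstream processing cannot continue before then.
groupByKey() redistributes the data across partitions, which means it involves a shuffle. The records cannot simply be held in memory while waiting, so the shuffle data has to be written to disk.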
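The regrouping step can be pictured with a small plain-Scala model. It is a sketch, not Spark code: `ShuffleModel`, `reducePartition`, and `shuffleAndGroup` are hypothetical names, and the routing rule (a non-negative `hashCode` modulo the partition count) is assumed to mirror what `HashPartitioner` does for non-null keys.

```scala
object ShuffleModel {
  // Routes a key to a reduce-side partition: hashCode modulo, kept non-negative.
  def reducePartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // mapSide holds one inner Seq per map-side partition.
  // Result: reduce partition id -> (key -> all values of that key).
  def shuffleAndGroup[K, V](
      mapSide: Seq[Seq[(K, V)]],
      numPartitions: Int): Map[Int, Map[K, Seq[V]]] =
    mapSide.flatten
      // Every record is sent to the partition chosen by its key.
      .groupBy { case (k, _) => reducePartition(k, numPartitions) }
      .map { case (p, records) =>
        // Within a reduce partition, values are gathered per key.
        p -> records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
      }
}
```

Because the target partition depends only on the key, all values of a key end up in the same reduce partition, regardless of which map-side partition they started in. That is also why a group is complete only after every map-side partition has been read.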

About CompactBuffer

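CompactBuffer is a data structure internal to Spark (it is declared private[spark]). It extends Seq[T] and is therefore also an Iterable, which is why the value returned by groupByKey() is a collection that can be traversed in a loop.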

/**
* An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
* ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
* so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
* elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
* entries than that. This makes it more efficient for operations like groupBy where we expect
* some keys to have very few elements.
*/
private[spark] class CompactBuffer[T: ClassTag] extends Seq[T] with Serializable
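The idea described in the doc comment, keeping the first two elements in plain fields and allocating a backing collection only for a third element, can be sketched as follows. `TinyBuffer` is a hypothetical, heavily simplified class written for illustration; it is not Spark's implementation and uses an `ArrayBuffer` for the overflow where the comment mentions an `Array[AnyRef]`.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration of the "two inline fields" idea; not Spark's CompactBuffer.
final class TinyBuffer[T] {
  private var first: T = _                 // element 0, stored inline
  private var second: T = _                // element 1, stored inline
  private var rest: ArrayBuffer[T] = null  // allocated only for a third element
  private var count = 0

  // Append-only, like the buffer described in the comment above.
  def +=(value: T): this.type = {
    if (count == 0) first = value
    else if (count == 1) second = value
    else {
      if (rest == null) rest = ArrayBuffer.empty[T]
      rest += value
    }
    count += 1
    this
  }

  def size: Int = count

  // True once the overflow collection has been allocated.
  def usesOverflow: Boolean = rest != null

  def toSeq: Seq[T] = {
    val out = ArrayBuffer.empty[T]
    if (count > 0) out += first
    if (count > 1) out += second
    if (rest != null) out ++= rest
    out.toSeq
  }
}
```

A buffer holding one or two values never allocates the overflow collection, which is the saving that matters when many keys have very few values.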

Code example

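import org.apache.spark.rdd.RDD

// sc is an existing SparkContext (for example, the one provided by spark-shell)
val rdd = sc.makeRDD(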
    List(
        ("hello", 1),
        ("hello", 2),
        ("hadoop", 2),
        ("hadoop", 2),
        ("hadoop", 4)
    )
)

// Group the values by key
val rdd1: RDD[(String, Iterable[Int])] = rdd.groupByKey()
rdd1.collect().foreach(println)
// The result can be printed directly; the output is
//(h