Spark API

Latest recommended article published on 2022-03-30 21:38:56

FightForProgrammer

Category: spark technology. Tags: Spark, aggregate

Article link: https://blog.youkuaiyun.com/FightForProgrammer/article/details/47174977

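This article introduces several key operations in the Spark RDD API: aggregate, which aggregates data using the seqOp and combOp functions; aggregateByKey, which aggregates the values that share a key; cartesian, which computes the Cartesian product of two RDDs; and coalesce and repartition, which change the number of partitions of an RDD.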

Spark RDD API Usage Guide (Part 1)

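1. aggregate

1.1 Function declaration

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

1.2 Function description

aggregate applies two functions to an RDD. The first reduce function (seqOp) aggregates the elements within each partition, starting from the initial value (zeroValue). The second (combOp) then combines the initial value with the results of all partitions to produce the final result. Being able to use two different reduce functions is convenient: for example, the first can compute the maximum of each partition and the second can sum those per-partition maxima.

1.3 Example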

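scala> def seqOp(a:Int , b:Int) : Int = {
     | math.max(a, b)
     | }
seqOp: (a: Int, b: Int)Int

scala> def combOp(a:Int, b:Int) : Int = {
     | a + b
     | }
combOp: (a: Int, b: Int)Int

scala> val v = sc.parallelize(List(1,2,3,4,5,6), 3)
scala> v.aggregate(4)(seqOp, combOp)
Result: 18
How it is computed:
	The data is first split into three partitions: (1,2), (3,4), (5,6)
	seqOp is then applied in each partition: max(4,1,2), max(4,3,4), max(4,5,6). The three partition results are 4, 4, 6
	combOp then merges the three partition results with the initial value 4: 4+4=8; 8+4=12; 12+6=18
Note: zeroValue is used at two stages: once inside each partition, and once more when the partition results are merged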
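
The calculation above can be reproduced without a cluster. The following is a minimal sketch in plain Scala, not Spark's implementation: each inner Seq stands in for one partition, and the names AggregateSketch and simulateAggregate are hypothetical helpers introduced only for illustration. It assumes the partition results are merged in order, whereas Spark merges them as tasks finish, which is why combOp should be associative and commutative.

```scala
object AggregateSketch {
  // Plain-Scala model of RDD.aggregate (an illustration, not Spark code).
  // Each inner Seq plays the role of one partition.
  def simulateAggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // Step 1: fold every partition, starting from zeroValue each time.
    val perPartition = partitions.map(p => p.foldLeft(zeroValue)(seqOp))
    // Step 2: merge the partition results, starting from zeroValue once more.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2), Seq(3, 4), Seq(5, 6))
    val maxOp: (Int, Int) => Int = (a, b) => math.max(a, b)
    val sumOp: (Int, Int) => Int = (a, b) => a + b
    println(simulateAggregate(parts, 4)(maxOp, sumOp)) // 4 + (4 + 4 + 6) = 18
    println(simulateAggregate(parts, 0)(maxOp, sumOp)) // 0 + (2 + 4 + 6) = 12
  }
}
```

With zeroValue = 0 the result is 12, so the extra 6 in the first result comes entirely from the initial value: it raises two partition maxima to 4 and is added once more during the merge.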

2. aggregateByKey

2.1 Function declaration

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

2.2 Function description

Similar to aggregate, with two differences: (1) only values that share the same key are aggregated together; (2) the initial value is applied only in the first reduce function (seqOp), not in combOp.

2.3 Example

val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// Helper that labels every element with the index of its partition
def func(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val:" + x + "]").iterator
}
// Inspect how the data is partitioned
pairRDD.mapPartitionsWithIndex(func).collect
// Result: Array[String] = Array([partID:0, val:(cat,2)], [partID:0, val:(cat,5)], [partID:0, val:(mouse,4)], [partID:1, val:(cat,12)], [partID:1, val:(dog,12)], [partID:1, val:(mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
// Result: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
// Result: Array((dog,100), (cat,200), (mouse,200))
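
To see why zeroValue = 100 yields (cat,200) but (dog,100), the per-key logic can be modelled in plain Scala. This is a rough sketch under the partition layout shown above, not Spark's implementation; AggregateByKeySketch and simulateAggregateByKey are hypothetical names used only for illustration.

```scala
object AggregateByKeySketch {
  // Plain-Scala model of aggregateByKey (an illustration, not Spark code).
  // Each inner Seq plays the role of one partition of key-value pairs.
  def simulateAggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Step 1: inside each partition, fold the values of every key from zeroValue.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Step 2: merge the per-partition results key by key with combOp only;
    // zeroValue is not applied again here.
    perPartition.flatten.foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).fold(u)(prev => combOp(prev, u)))
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Seq(("cat", 2), ("cat", 5), ("mouse", 4)),   // partition 0
      Seq(("cat", 12), ("dog", 12), ("mouse", 2))) // partition 1
    val maxOp: (Int, Int) => Int = (a, b) => math.max(a, b)
    val sumOp: (Int, Int) => Int = (a, b) => a + b
    println(simulateAggregateByKey(parts, 0)(maxOp, sumOp))   // cat -> 17, mouse -> 6, dog -> 12
    println(simulateAggregateByKey(parts, 100)(maxOp, sumOp)) // cat -> 200, mouse -> 200, dog -> 100
  }
}
```

A key that appears in both partitions (cat, mouse) contributes one partition result of 100 from each, giving 200 after combOp, while dog appears in only one partition and stays at 100.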

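3. cartesian

3.1 Function declaration

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

3.2 Function description

Computes the Cartesian product of two RDDs and returns it as a new RDD. (Note: memory consumption grows quickly with this function, because the result contains one element for every pair of inputs.)

3.3 Example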

val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
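
The size of the result explains the memory warning: the product of two collections with m and n elements has m * n pairs. A small plain-Scala sketch of the same operation on local lists follows; CartesianSketch is a hypothetical name, and the ordering of a local for-comprehension is not the order in which Spark returns pairs, which depends on partitioning.

```scala
object CartesianSketch {
  // Local equivalent of a Cartesian product: every x paired with every y.
  def cartesian[T, U](xs: Seq[T], ys: Seq[U]): Seq[(T, U)] =
    for (x <- xs; y <- ys) yield (x, y)

  def main(args: Array[String]): Unit = {
    val pairs = cartesian(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10))
    println(pairs.size)    // 5 * 5 = 25
    println(pairs.take(3)) // List((1,6), (1,7), (1,8))
  }
}
```

Two inputs of 100,000 elements each would already produce 10 billion pairs, so it is usually worth filtering both sides before taking the product.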

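4. coalesce, repartition

4.1 Function declaration

def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]

def repartition(numPartitions: Int): RDD[T]

4.2 Function description

Merges the partitions of an RDD into the specified number of partitions. repartition(numPartitions) is equivalent to coalesce(numPartitions, shuffle = true).

4.3 Example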

val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
Result: Int = 2
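
Without a shuffle, coalesce can only combine whole existing partitions, so it can reduce the partition count but not increase it; asking for more partitions than currently exist leaves the count unchanged. The sketch below models that behaviour in plain Scala. It is a simplified assumption, not Spark's algorithm: Spark also takes data locality into account when grouping partitions, and CoalesceSketch and coalesceNoShuffle are hypothetical names.

```scala
object CoalesceSketch {
  // Simplified model of coalesce(numPartitions, shuffle = false):
  // whole neighbouring partitions are merged, single elements are never moved.
  def coalesceNoShuffle[T](partitions: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val size = partitions.size
    if (numPartitions >= size) partitions // cannot grow without a shuffle
    else
      partitions.zipWithIndex
        .groupBy { case (_, i) => i * numPartitions / size } // target partition index
        .toSeq
        .sortBy(_._1)
        .map { case (_, group) => group.flatMap(_._1) }
  }

  def main(args: Array[String]): Unit = {
    val y = (1 to 10).map(i => Seq(i)) // 10 partitions with one element each
    println(coalesceNoShuffle(y, 2).length)  // 2
    println(coalesceNoShuffle(y, 20).length) // 10, unchanged
  }
}
```

To increase the number of partitions, or to rebalance skewed ones, use repartition(numPartitions) or coalesce(numPartitions, shuffle = true), at the cost of moving data across the cluster.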