6. Differences between cogroup and groupByKey
6.1 cogroup()
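Within each RDD, the values that share a key are gathered into one collection, so every result element has the shape (k, (Iterable[V], Iterable[W])).
It operates on two RDDs.
Example: rdd.cogroup(rdd2)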
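// Assumes an existing SparkContext named sc (for example in spark-shell)
import org.apache.spark.rdd.RDD
val rdd9: RDD[(String, Any)] = sc.parallelize(List(("tom", "aa"), ("tom", 1), ("tom", "bb"), ("jerry", 3), ("kitty", 2)))
val rdd10 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))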
val cogroup: RDD[(String, (Iterable[Any], Iterable[Int]))] = rdd9.cogroup(rdd10)
val c9: Array[(String, (Iterable[Any], Iterable[Int]))] = cogroup.collect()
// Result
ArrayBuffer((tom,(CompactBuffer(aa, 1, bb),CompactBuffer(1))), (jerry,(CompactBuffer(3),CompactBuffer(2))), (shuke,(CompactBuffer(),CompactBuffer(2))), (kitty,(CompactBuffer(2),CompactBuffer())))
6.2 groupByKey()
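It does not depend on multiple RDDs; it is an operator that runs on a single RDD.
It collects the values that share a key within one dataset. This requires a shuffle, and each key's values come back as an Iterable: rdd.groupByKey() returns RDD[(K, Iterable[V])], which collect() turns into Array[(K, Iterable[V])].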
val rdd7: RDD[(String, Any)] = sc.parallelize(List(("tom", "aa"), ("tom", 1), ("tom", "bb"), ("jerry", 3), ("kitty", 2)))
val key: RDD[(String, Iterable[Any])] = rdd7.groupByKey()
val c77: Array[(String, Iterable[Any])] = key.collect()
// Result
ArrayBuffer((tom,CompactBuffer(aa, 1, bb)), (jerry,CompactBuffer(3)), (kitty,CompactBuffer(2)))
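7. groupBy(): Iterable[(k, v)] vs groupByKey(): Iterable[v]
groupBy() is more flexible: the grouping key is computed by a function you pass in, and each group keeps the whole elements. Grouping a pair RDD by its key therefore returns (k, Iterable[(k, v)]):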
ArrayBuffer((tom,CompactBuffer((tom,aa), (tom,1), (tom,bb))), (jerry,CompactBuffer((jerry,3))), (kitty,CompactBuffer((kitty,2))))
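The notes show the groupBy() output but not the call that produces it. A minimal sketch of how a result of that shape could be obtained, assuming the same data as rdd7 in 6.2 and a local SparkContext; the object name GroupBySketch and the helper groupPairs are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupBySketch {
  // Groups the pairs by their first element using groupBy.
  // Each group keeps the whole (k, v) tuples, unlike groupByKey, which keeps only v.
  def groupPairs(sc: SparkContext): Map[String, List[(String, Any)]] = {
    // Same data as rdd7 in section 6.2
    val rdd7: RDD[(String, Any)] = sc.parallelize(
      List(("tom", "aa"), ("tom", 1), ("tom", "bb"), ("jerry", 3), ("kitty", 2)))

    // The function passed to groupBy computes the grouping key for each element
    val grouped: RDD[(String, Iterable[(String, Any)])] = rdd7.groupBy(_._1)

    // Materialize each group as a List so it is easy to inspect on the driver
    grouped.mapValues(_.toList).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on one machine
    val sc = new SparkContext(
      new SparkConf().setAppName("GroupBySketch").setMaster("local[*]"))
    try {
      groupPairs(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the key function is arbitrary, the same call could group by anything derived from the element, for example rdd7.groupBy(_._1.length) to group by the length of the name. The order of groups, and of elements inside a group, is not guaranteed.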
reduceByKey() groups elements by key and aggregates the values of each key; you supply the aggregation function.
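A short sketch of reduceByKey() with two different user-supplied functions, to make the "you write the logic" point concrete. The input data, the object name ReduceByKeySketch and the helpers sumByKey and maxByKey are invented for this example and do not come from the notes above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceByKeySketch {
  // Hypothetical sample data with repeated keys
  private def scores(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2), ("jerry", 2)))

  // Aggregation logic 1: add up the values of each key
  def sumByKey(sc: SparkContext): Map[String, Int] =
    scores(sc).reduceByKey(_ + _).collect().toMap

  // Aggregation logic 2: keep the largest value of each key
  def maxByKey(sc: SparkContext): Map[String, Int] =
    scores(sc).reduceByKey((a, b) => math.max(a, b)).collect().toMap

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on one machine
    val sc = new SparkContext(
      new SparkConf().setAppName("ReduceByKeySketch").setMaster("local[*]"))
    try {
      println(sumByKey(sc))
      println(maxByKey(sc))
    } finally {
      sc.stop()
    }
  }
}
```

With this sample data the sums are tom 3, jerry 5, kitty 2, and the maxima are tom 2, jerry 3, kitty 2. Unlike groupByKey(), the result holds one aggregated value per key rather than an Iterable of all values.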