Spark 文档中对 aggregate的函数定义如下:
def aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U)
=> U)(implicit arg0: ClassTag[U]): U
注释:
Aggregate the elements of each partition, and then the results for
all the partitions, using given combine functions and a neutral
"zero value".
This function can return a different result type, U,
than the type of this RDD, T.
Thus, we need one operation for merging a T into an U
and one operation for merging two U's, as in
Scala.TraversableOnce. Both of these functions are allowed to
modify and return their first argument instead of creating a new U
to avoid memory allocation.
aggregate函数首先对每个分区里面的元素进行聚合,然后用combine函数将每个分区的结果和初始值(zeroValue)进行combine操作。这个操作返回的类型不需要和RDD中元素类型一致,所以在使用 aggregate()时,需要提供我们期待的返回类型的初始值,然后通过一