Spark operators: reduceByKey and groupByKey

reduceByKey:
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

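In short: the values for each key are merged with an associative and commutative reduce function. The merge also runs locally on each mapper before the results are sent to a reducer, much like a "combiner" in MapReduce. The output is hash-partitioned using the existing partitioner or parallelism level.
reduceByKey aggregates the values that share a key using the reduce function you supply. The number of reduce tasks can be set through an optional second parameter (numPartitions).
Example: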

val rdd = sc.parallelize(List(("female",1),("male",5),("female",5),("male",2)))
val reduce = rdd.reduceByKey((x,y) => x+y)
reduce.collect()
// Result (element order may vary):
// Array((female,6), (male,7))
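
The optional second parameter mentioned above can be sketched as follows. This is a minimal local example, not a production setup: the object name, the app name, the local[2] master and the choice of 3 partitions are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyPartitions {
  // Returns (number of result partitions, per-key sums).
  def run(): (Int, Map[String, Int]) = {
    // local[2] and the app name are assumptions for a quick local run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("reduceByKey-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(List(("female", 1), ("male", 5), ("female", 5), ("male", 2)))
      // Second argument: number of partitions of the result,
      // which is also the number of reduce tasks.
      val reduced = rdd.reduceByKey((x, y) => x + y, 3)
      // toMap hides the (unspecified) order in which collect() returns the pairs.
      (reduced.getNumPartitions, reduced.collect().toMap)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The per-key sums do not depend on the partition count; only the number of reduce tasks, and therefore the number of output partitions, changes.
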
groupByKey:
  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
   * within each group is not guaranteed, and may even differ each time the resulting RDD is
   * evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

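In short: the values for each key in the RDD are grouped into a single sequence. The resulting RDD is hash-partitioned using the existing partitioner or parallelism level. The order of elements within each group is not guaranteed and may even differ each time the resulting RDD is evaluated.
groupByKey works key by key and produces exactly one sequence of values per key.
Example: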

val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val group = wordPairsRDD.groupByKey()
group.collect()
// Result (order of keys, and of values within a group, is not guaranteed):
// Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))
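
The @note in the Scaladoc above recommends aggregateByKey or reduceByKey when the grouping only serves an aggregation such as a sum or an average. A per-key average with aggregateByKey might look like the sketch below. The sample data, the object name and the local[2] master are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AverageByKey {
  def run(): Map[String, Double] = {
    // local[2] and the app name are assumptions for a quick local run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("aggregateByKey-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data.
      val scores = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 3), ("b", 6), ("a", 5)))
      // The accumulator is a (sum, count) pair.
      val sumCount = scores.aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // fold one value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2)    // merge accumulators from different partitions
      )
      sumCount.mapValues { case (sum, n) => sum.toDouble / n }.collect().toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Only one small (sum, count) pair per key and partition is shuffled here, whereas groupByKey followed by an average would shuffle every individual value.
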
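Differences
  1. reduceByKey: aggregates by key and performs a combine (map-side pre-aggregation) before the shuffle; the result is an RDD[(K, V)].
  2. groupByKey: groups by key and shuffles every record directly, with no map-side combine; the result is an RDD[(K, Iterable[V])].
    When it does not change the business logic, prefer reduceByKey.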
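
The effect of the map-side combine can be illustrated without Spark at all. The sketch below is a simplified model in plain Scala collections, not Spark's actual shuffle code: each inner Seq stands in for one partition, and the object and function names are hypothetical.

```scala
object CombineSketch {
  type Pair = (String, Int)

  // Without a map-side combine (the groupByKey case in this model),
  // every (key, value) pair is sent to the shuffle as-is.
  def shuffledWithoutCombine(partitions: Seq[Seq[Pair]]): Int =
    partitions.map(_.size).sum

  // With a map-side combine (the reduceByKey case in this model), each partition
  // merges its own values per key first, so it sends one record per distinct key.
  def shuffledWithCombine(partitions: Seq[Seq[Pair]]): Int =
    partitions.map(p => p.groupBy(_._1).size).sum

  // The final per-key sums are the same either way.
  def sums(partitions: Seq[Seq[Pair]]): Map[String, Int] =
    partitions.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // The word-count pairs from the groupByKey example, split into two assumed partitions.
    val partitions = Seq(
      Seq(("one", 1), ("two", 1), ("two", 1)),
      Seq(("three", 1), ("three", 1), ("three", 1))
    )
    println(shuffledWithoutCombine(partitions)) // 6
    println(shuffledWithCombine(partitions))    // 3
    println(sums(partitions))
  }
}
```

In this toy input the combine halves the number of shuffled records; the saving grows with the number of repeated keys per partition.
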