Spark-Core: A Source-Level Walkthrough of Operators (Part 3)

Most recently recommended article published on 2024-10-19 14:23:25

晓晓很可爱

License: CC 4.0 BY-SA

Category: spark-core. Tags: spark

Article link: https://blog.youkuaiyun.com/Fresh_man888/article/details/110292177

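Common transformation operators on RDDs

0. intersection: set intersection

Function: returns the elements that two RDDs have in common (the same idea applies to two ordinary collections). An element is kept only if it appears in the first RDD and also in the second, and the result is deduplicated.

Under the hood: it calls cogroup. A map first turns every element into a pair with the element itself as the key and null as the value. After cogroup, a filter keeps only the keys for which both iterators are non-empty, and keys then extracts the keys.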

def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
    .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
    .keys
}
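The cogroup, filter, keys chain can be tried out without a cluster. The following is a minimal sketch on plain Scala collections, not Spark code: `IntersectionSketch` and its local `cogroup` helper are hypothetical stand-ins written for illustration, and they only imitate what the RDD methods do.

```scala
object IntersectionSketch {
  // Stand-in for RDD.cogroup (an assumption, not the Spark API):
  // for every key, collect the values seen on the left and on the right.
  def cogroup[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftVals = left.filter(_._1 == k).map(_._2)
      val rightVals = right.filter(_._1 == k).map(_._2)
      (k, (leftVals, rightVals))
    }.toMap
  }

  // Same three steps as the source: key by the element, cogroup,
  // keep keys whose two groups are both non-empty, then take the keys.
  def intersection[T](left: Seq[T], right: Seq[T]): Set[T] = {
    val grouped = cogroup(left.map(v => (v, null)), right.map(v => (v, null)))
    grouped.filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }.keySet
  }
}
```

Because each element becomes a key and every key appears once after grouping, duplicates vanish: calling `IntersectionSketch.intersection(Seq(1, 2, 2, 3), Seq(2, 3, 3, 4))` yields a set holding only 2 and 3.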

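1. subtract: set difference

Function: removes from the first RDD every element that also appears in the second RDD (the same idea applies to ordinary collections). The result is not deduplicated, so an element that occurs several times in the first RDD and never in the second is kept several times.

Requirement: both RDDs must have the same element type; the type itself can be anything.

Under the hood: a map turns each value into a (k, v) pair, where k is the element itself and v is null, and then subtractByKey is called. subtractByKey creates a SubtractedRDD, which works with a map: it first adds all records of rdd1 as (k, Seq(...)) entries, then iterates over the second RDD and removes each of its keys from the map. The relevant source is: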

// the first dep is rdd1; add all values to the map
integrate(0, t => getSeq(t._1) += t._2)
// the second dep is rdd2; remove all of its keys
integrate(1, t => map.remove(t._1))
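The two `integrate` steps are easy to replay on local data. The sketch below is a plain-Scala model, not the SubtractedRDD implementation: `SubtractSketch` is a hypothetical name, and a `LinkedHashMap` is used here (an assumption made only to get a stable output order) in place of the hash map Spark uses internally.

```scala
import scala.collection.mutable

object SubtractSketch {
  def subtractByKey[K, V](rdd1: Seq[(K, V)], rdd2: Seq[(K, V)]): List[(K, V)] = {
    val map = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    // first input: add all values to the map, grouped under their key
    rdd1.foreach { case (k, v) =>
      map.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
    }
    // second input: remove all of its keys
    rdd2.foreach { case (k, _) => map.remove(k) }
    // flatten what is left back into (key, value) records
    map.toList.flatMap { case (k, vs) => vs.toList.map(v => (k, v)) }
  }

  // subtract keys each element by itself with a null value, then drops the values
  def subtract[T](rdd1: Seq[T], rdd2: Seq[T]): List[T] =
    subtractByKey(rdd1.map(v => (v, null)), rdd2.map(v => (v, null))).map(_._1)
}
```

Since every value of rdd1 is stored under its key, repeated elements survive: `SubtractSketch.subtract(Seq(1, 1, 2, 3), Seq(2, 4))` keeps both copies of 1 along with 3.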

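2. partitionBy (a transformation)

Function: repartitions the data according to the given partitioner.

Requirement: only RDDs of key-value pairs (K, V) can call partitionBy.

Under the hood: if the partitioner passed in equals the RDD's current partitioner, the RDD itself is returned unchanged and no shuffle happens. Otherwise a new ShuffledRDD is created with the new partitioner.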

if (self.partitioner == Some(partitioner)) {
  self
} else {
  new ShuffledRDD[K, V, V](self, partitioner)
}
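The same-partitioner shortcut can be modelled on local data. In this sketch `HashPartitionerModel` and `PairData` are hypothetical types invented for illustration; the key-to-partition rule (non-negative `hashCode` modulo the partition count, with null keys going to partition 0) follows Spark's HashPartitioner, while everything else is a simplification.

```scala
object PartitionBySketch {
  // Models the HashPartitioner rule: non-negative hashCode mod numPartitions.
  final case class HashPartitionerModel(numPartitions: Int) {
    def getPartition(key: Any): Int = key match {
      case null => 0
      case _ =>
        val raw = key.hashCode % numPartitions
        if (raw < 0) raw + numPartitions else raw
    }
  }

  // Toy "pair RDD": its partitions plus the partitioner that produced them, if any.
  final case class PairData[K, V](
      partitions: Vector[Vector[(K, V)]],
      partitioner: Option[HashPartitionerModel])

  def partitionBy[K, V](data: PairData[K, V], p: HashPartitionerModel): PairData[K, V] =
    if (data.partitioner == Some(p)) {
      data // same partitioner: return the input as-is, no reshuffle
    } else {
      // different partitioner: route every record to the partition of its key
      val buckets = data.partitions.flatten.groupBy { case (k, _) => p.getPartition(k) }
      val reshuffled =
        Vector.tabulate(p.numPartitions)(i => buckets.getOrElse(i, Vector.empty[(K, V)]))
      PairData(reshuffled, Some(p))
    }
}
```

Calling `partitionBy` twice with an equal partitioner returns the very same object the second time, which is the behaviour the `self.partitioner == Some(partitioner)` check produces.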

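3. repartition: change the number of partitions

Function: scatters the data and redistributes it across a new set of partitions. The partition count can go up or down. The job has two stages, and a shuffle always takes place whether the count increases, decreases, or stays the same, because repartition passes shuffle = true.

Requirement: the data does not have to be of key-value type.

Under the hood: it calls coalesce with shuffle = true, which in turn builds a ShuffledRDD.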

if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
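To see why the shuffle branch spreads records evenly, the key assignment can be replayed locally. This is a sketch on plain collections, assuming a hypothetical `RepartitionSketch` object: `distribute` copies the logic of `distributePartition`, and the `key % numPartitions` grouping stands in for ShuffledRDD with a HashPartitioner (an Int key's hash code is the key itself).

```scala
import java.util.Random
import scala.util.hashing

object RepartitionSketch {
  // Start at a pseudo-random position derived from the partition index,
  // then hand out consecutive integer keys to the records.
  def distribute[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Local stand-in for the shuffle: a record with key k lands in partition k % numPartitions.
  def repartition[T](parts: Seq[Seq[T]], numPartitions: Int): Vector[Vector[T]] = {
    val keyed = parts.zipWithIndex.flatMap { case (p, i) =>
      distribute(i, p.iterator, numPartitions).toList
    }
    val buckets = keyed.groupBy { case (key, _) => key % numPartitions }
    Vector.tabulate(numPartitions) { i =>
      buckets.getOrElse(i, Seq.empty[(Int, T)]).map(_._2).toVector
    }
  }
}
```

Consecutive keys cycle through the output partitions, so the records of one input partition end up in output partitions whose sizes differ by at most one. Only the starting partition depends on the random seed.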

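4. coalesce: merge partitions

Function: it can run with or without a shuffle (the shuffle parameter defaults to false), so the data may or may not be scattered. Take going from 4 partitions to 2 as an example: from the RDD's point of view, four partitions are merged into two; from the task's point of view, one task reads the data of several parent partitions. If you ask for more partitions while shuffle = false, the partition count simply stays the same, so such a call is pointless.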
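The no-shuffle case can be illustrated with a small model. This is a simplified sketch, assuming parent partitions are grouped by contiguous position and ignoring data locality, which Spark's actual coalescer also takes into account; `CoalesceSketch` and `coalesceNoShuffle` are hypothetical names.

```scala
object CoalesceSketch {
  def coalesceNoShuffle[T](parts: Vector[Vector[T]], numPartitions: Int): Vector[Vector[T]] =
    if (numPartitions >= parts.length) {
      // Without a shuffle the count cannot grow: the partitions stay as they were.
      parts
    } else {
      Vector.tabulate(numPartitions) { i =>
        // Each output partition takes a contiguous range of parent partitions.
        val start = (i.toLong * parts.length / numPartitions).toInt
        val end = ((i.toLong + 1) * parts.length / numPartitions).toInt
        // One output task reads all parent partitions in its range.
        parts.slice(start, end).flatten
      }
    }
}
```

With four parent partitions and a target of 2, each output partition concatenates two parents and no record is moved between groups. With a target of 8, the four partitions come back untouched.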

Addendum: one sta