Spark Programming Model (6): Basic RDD Transformation Operations: coalesce, repartition...

coalesce()
  • def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

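  • Repartitions the RDD. Without a shuffle, existing partitions are merged through a narrow dependency; a HashPartitioner is used only when shuffle is true

  • The first argument is the target number of partitions; the second controls whether a shuffle is performed and defaults to false

  • Passing only the first argument reduces the number of partitions in the RDD to numPartitions; numPartitions must be smaller than the RDD's current partition count for the call to have an effect

  • If numPartitions is larger than the RDD's current partition count, the second argument must be set to true; otherwise the call has no effect and the partition count stays unchanged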

      scala> val data = sc.textFile("/data/spark_rdd.txt", 2)
      data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[19] at textFile at <console>:24
    
      scala> data.partitions.size
      res11: Int = 2
    
      scala> val new_data = data.coalesce(1)
      new_data: org.apache.spark.rdd.RDD[String] = CoalescedRDD[20] at coalesce at <console>:26
    
      scala> new_data.partitions.size
      res12: Int = 1
    
      scala> val new_data = data.coalesce(4)
      new_data: org.apache.spark.rdd.RDD[String] = CoalescedRDD[21] at coalesce at <console>:26
    
      scala> new_data.partitions.size
      res13: Int = 2
    
      scala> val new_data = data.coalesce(4, true)
      new_data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at coalesce at <console>:26
    
      scala> new_data.partitions.size
      res14: Int = 4
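
The three cases in the transcript above can also be sketched as a small standalone program. This is a minimal sketch, not part of the original post: it assumes Spark is on the classpath and runs in local mode, and the names CoalesceSketch and partitionCounts, as well as the sample data 1 to 100, are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object CoalesceSketch {
  // Returns the partition counts produced by three coalesce calls
  // on an RDD that starts with 4 partitions.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val data = sc.parallelize(1 to 100, 4)

    // Fewer partitions, no shuffle: existing partitions are merged.
    val fewer = data.coalesce(2)

    // More partitions without a shuffle: the request is ignored,
    // so the RDD keeps its original 4 partitions.
    val ignored = data.coalesce(8)

    // More partitions with shuffle = true: the data is redistributed.
    val more = data.coalesce(8, shuffle = true)

    (fewer.partitions.length, ignored.partitions.length, more.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) // prints (2,4,8)
    finally sc.stop()
  }
}
```

Only the third call pays for a shuffle, which is why the default shuffle = false is usually preferred when the goal is simply to reduce the partition count.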
repartition()
  • def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

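  • Reshuffles the data in the RDD to create either more or fewer partitions, keeping the data in each partition as evenly balanced as possible

  • This operation always shuffles all of the data over the network

  • It is simply coalesce() called with the second argument set to true, that is, coalesce(numPartitions, shuffle = true)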

      scala> val data = sc.textFile("/data/spark_rdd.txt", 2)
      data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[1] at textFile at <console>:24
    
      scala> data.partitions.size
      res0: Int = 2
    
      scala> val data_1 = data.repartition(1)
      data_1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at repartition at <console>:26
    
      scala> data_1.partitions.size
      res2: Int = 1
    
      scala> val data_2 = data.repartition(4)
      data_2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at repartition at <console>:26
    
      scala> data_2.partitions.size
      res3: Int = 4
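
To see the rebalancing effect, and that repartition(n) behaves like coalesce(n, shuffle = true), the per-partition element counts can be inspected with glom(). This is a minimal sketch under assumptions: Spark in local mode, with the hypothetical names RepartitionSketch, partitionSizes and skewedThenBalanced, and a deliberately skewed sample RDD that does not appear in the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, used only for this illustration.
object RepartitionSketch {
  // Number of elements held by each partition, in partition order.
  def partitionSizes[T](rdd: RDD[T]): Seq[Int] =
    rdd.glom().map(_.length).collect().toSeq

  // Builds a skewed RDD, then rebalances it in two equivalent ways.
  def skewedThenBalanced(sc: SparkContext): (Seq[Int], Seq[Int], Seq[Int]) = {
    // The filter keeps only the values 1..250, so most of the
    // 4 partitions end up empty while the partition count stays at 4.
    val skewed = sc.parallelize(1 to 1000, 4).filter(_ <= 250)

    // Both calls shuffle all of the data across 4 partitions.
    val viaRepartition = skewed.repartition(4)
    val viaCoalesce = skewed.coalesce(4, shuffle = true)

    (partitionSizes(skewed), partitionSizes(viaRepartition), partitionSizes(viaCoalesce))
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("repartition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, afterRepartition, afterCoalesce) = skewedThenBalanced(sc)
      println(s"before:            $before")
      println(s"repartition(4):    $afterRepartition")
      println(s"coalesce(4, true): $afterCoalesce")
    } finally sc.stop()
  }
}
```

The element total is the same before and after; only the distribution across partitions changes, at the cost of a full shuffle.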

Reposted from: https://www.cnblogs.com/oldsix666/articles/9458195.html
