A review of partitioning operations in Spark -- mapPartitions

This article reviews how to use repartitionAndSortWithinPartitions in Spark to repartition and sort within partitions, and how to implement the custom partitioner SortPartitoner. The examples show how to obtain a global sort after partitioning, as well as several ways to use mapPartitionsWithIndex, including buffering data in a collection created inside the function and writing a custom iterator.


1. Repartition + sort in Spark with repartitionAndSortWithinPartitions

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

def spark_rand(): Unit = {
  val data = Array(1, 2, 3, 4, 5, 6, 6, 7, 3, 4, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
  val s: RDD[(Int, Int)] = spark.sparkContext.parallelize(data, 1).map(str => (str, 1))
  // largest key, used to size the partition ranges
  val mx = s.sortBy(_._1, false).first()._1
  s.repartitionAndSortWithinPartitions(new SortPartitoner(4, mx))
    .mapPartitionsWithIndex { (partionId, iter) =>
      val part_name = "part_" + partionId
      val part_map = scala.collection.mutable.Map[String, List[Int]]()
      part_map(part_name) = List[Int]()
      while (iter.hasNext) {
        part_map(part_name) :+= iter.next()._1 // :+= appends an element to the end of the list
      }
      part_map.iterator
    }.collect().foreach(println(_))
}

// Custom partitioner: splits the key range [0, max] into num equally sized buckets
class SortPartitoner(num: Int, max: Int) extends Partitioner {

  override def numPartitions: Int = num

  val partitionerSize = max / num + 1

  override def getPartition(key: Any): Int = {
    val intKey = key.asInstanceOf[Int]
    intKey / partitionerSize
  }
}
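With the data above, max = 100 and num = 4, so partitionerSize = 100 / 4 + 1 = 26: keys 0-25 map to partition 0, 26-51 to partition 1, 52-77 to partition 2, and 78-103 to partition 3, which is exactly the split seen in the output below. A quick way to check the mapping (a small sketch using the SortPartitoner above):

// Sanity-check the bucket boundaries of SortPartitoner(4, 100)
val p = new SortPartitoner(4, 100)
Seq(22, 26, 50, 54, 76, 81, 100).foreach { k =>
  println(s"key $k -> partition ${p.getPartition(k)}")
}
// 22 -> 0, 26 -> 1, 50 -> 1, 54 -> 2, 76 -> 2, 81 -> 3, 100 -> 3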

## Output

(part_0,List(1, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22))


(part_1,List(26, 33, 43, 50))

(part_2,List(54, 76))

(part_3,List(81, 94, 100))

Note:

If you call sortBy again after this partitioning, Spark shuffles the data once more; it is effectively another pass through a RangePartitioner.

After calling sortBy:

(part_0,List(1, 2, 3, 3, 4, 4))

(part_1,List(5, 6, 6, 7, 8, 9))

(part_2,List(10, 11, 22, 26, 33, 43))

(part_3,List(50, 54, 76, 81, 94, 100))
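For reference, the distribution above can be produced with something like the following sketch, reusing s, mx and SortPartitoner from the code above (sortBy re-shuffles the data into 4 range partitions):

// Global sort via sortBy: triggers another shuffle on top of the custom partitioning
s.repartitionAndSortWithinPartitions(new SortPartitoner(4, mx))
  .sortBy(_._1, ascending = true, numPartitions = 4)
  .mapPartitionsWithIndex { (partionId, iter) =>
    Iterator(("part_" + partionId, iter.map(_._1).toList))
  }.collect().foreach(println(_))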

2. Custom partitioning: sort within each partition, then merge the sorted partitions into a global sort

002 -- output the data partition by partition

val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
spark.sparkContext.parallelize(data, 1)
  .mapPartitionsWithIndex { (partionId, iter) =>
    val part_name = "part_" + partionId
    val part_map = scala.collection.mutable.Map[String, List[Int]]()
    part_map(part_name) = List[Int]()
    while (iter.hasNext) {
      part_map(part_name) :+= iter.next() // :+= appends an element to the end of the list
    }
    part_map.iterator
  }.collect().foreach(println(_))

003 -- partition by key range; this easily leads to data skew

val data = Array(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
val s: RDD[(Int, Int)] = spark.sparkContext.parallelize(data, 1).map(str => (str, 1))
val mx = s.sortBy(_._1, false).first()._1
s.partitionBy(new SortPartitoner(4, mx))
  .mapPartitionsWithIndex { (partionId, iter) =>
    val part_name = "part_" + partionId
    val part_map = scala.collection.mutable.Map[String, List[Int]]()
    part_map(part_name) = List[Int]()
    while (iter.hasNext) {
      part_map(part_name) :+= iter.next()._1 // :+= appends an element to the end of the list
    }
    part_map.iterator
  }.collect().foreach(println(_))

// Custom partitioner: the same SortPartitoner defined in section 1

## Output

(part_0,List(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22))

(part_1,List(33, 43, 26, 50))

(part_2,List(54, 76))

(part_3,List(81, 94, 100))
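The output also shows the skew mentioned above: part_0 holds 13 values while part_2 holds only 2, because the keys are not spread evenly over the value range. As for the global sort, SortPartitoner is range-based, so partition i only contains keys smaller than those in partition i + 1; the "merge" step therefore reduces to sorting each partition locally and concatenating the partitions in index order. A minimal sketch, reusing s, mx and SortPartitoner from above:

// Sort each partition locally; collect() returns partitions in index order,
// so the concatenated result is already globally sorted.
val globallySorted: Array[Int] =
  s.partitionBy(new SortPartitoner(4, mx))
    .mapPartitions(iter => iter.map(_._1).toList.sorted.iterator, preservesPartitioning = true)
    .collect()
println(globallySorted.mkString(", "))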

3. Several ways to write a mapPartitions function

/**
 * Uses of mapPartitions.
 */
def test_mapPartition(): Unit = {
  val sc = spark.sparkContext
  val a: RDD[Int] = sc.parallelize(1 to 1000000, 2)

  // Baseline: element-wise map
  val startTime = System.nanoTime()
  println(a.repartition(4).map(str => str * 3).sum())
  val endTime = System.nanoTime()
  println((endTime - startTime) / 1000000000d)

  // 01 -- First approach: buffer the whole partition in a local list
  def terFunc(iter: Iterator[Int]): Iterator[Int] = {
    var res = List[Int]()
    while (iter.hasNext) {
      val cur = iter.next
      res ::= cur * 3 // ::= prepends the element to the list
    }
    res.iterator
  }

  val startTime2 = System.nanoTime()
  val result = a.mapPartitions(terFunc).sum()
  val endTime2 = System.nanoTime()
  println(result + "==" + (endTime2 - startTime2) / 1000000000d)

  // 02 -- Second approach: a custom iterator
  val startTime3 = System.nanoTime()
  val result2 = a.mapPartitions(v => new CustomIterator(v)).sum()
  val endTime3 = System.nanoTime()
  println(result2 + "==" + (endTime3 - startTime3) / 1000000000d)
}

// 03 -- mapPartitions can also traverse a partition with a for comprehension
// (the transformation below is illustrative; it multiplies by 3 like the other examples)
def mapPartitionsTest(listParam: Iterator[Int]): Iterator[Int] = {
  println("by partition:")
  val res = for (param <- listParam) yield param * 3
  res
}

// Custom iterator: transforms elements one at a time, without buffering the partition
class CustomIterator(iter: Iterator[Int]) extends Iterator[Int] {

  override def hasNext: Boolean = iter.hasNext

  override def next(): Int = {
    val cur = iter.next()
    cur * 3 // the transformed value
  }
}
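To compare the two approaches side by side, here is a minimal usage sketch (assuming terFunc and CustomIterator above are in scope): the list-buffer version materializes each partition in memory before returning its iterator, while the custom iterator transforms elements lazily one by one, keeping memory usage flat on large partitions.

// Both calls compute the same sum; the custom iterator streams elements
// instead of first buffering each partition into a List.
val rdd = spark.sparkContext.parallelize(1 to 1000000, 2)
println(rdd.mapPartitions(terFunc).sum())                      // buffers each partition
println(rdd.mapPartitions(it => new CustomIterator(it)).sum()) // streams lazily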
