Official description:
Return a new RDD by applying a function to each partition of this RDD,
while tracking the index of the original partition
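In plain terms, the operator works partition by partition: for each partition of the current RDD it calls the function you pass exactly once, supplying the partition index along with the data, and it returns a new RDD.
It is therefore a transformation operator, so nothing runs until an action is invoked on the result.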
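To make the callback contract concrete, here is a minimal sketch in plain Java that mimics it without Spark. It is an illustration only, not how Spark implements the operator: the class PartitionSim, its local mapPartitionsWithIndex helper, and the use of java.util.function.BiFunction in place of Spark's Function2 are all assumptions made for this example, and a nested list stands in for the RDD's partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical stand-in for an RDD: a list of partitions, each a list of elements.
// This is NOT Spark code; it only mimics the per-partition callback contract.
class PartitionSim {

    // Calls f once per partition, passing the partition index and an iterator
    // over that partition's elements, then collects the results per partition.
    static <T, R> List<List<R>> mapPartitionsWithIndex(
            List<List<T>> partitions,
            BiFunction<Integer, Iterator<T>, Iterator<R>> f) {
        List<List<R>> out = new ArrayList<>();
        for (int index = 0; index < partitions.size(); index++) {
            Iterator<R> results = f.apply(index, partitions.get(index).iterator());
            List<R> mapped = new ArrayList<>();
            results.forEachRemaining(mapped::add);
            out.add(mapped);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("spark1", "spark2"),
                Arrays.asList("spark3"));
        List<List<String>> tagged = mapPartitionsWithIndex(partitions, (index, it) -> {
            List<String> buf = new ArrayList<>();
            while (it.hasNext()) {
                buf.add(index + ":" + it.next());
            }
            return buf.iterator();
        });
        System.out.println(tagged); // [[0:spark1, 0:spark2], [1:spark3]]
    }
}
```

The function runs twice here, once per partition, and each call sees only its own partition's elements together with that partition's index.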
1. Java API definition
def mapPartitionsWithIndex[R]
(f: Function2[Integer, Iterator[T], Iterator[R]],
preservesPartitioning: Boolean = false): JavaRDD[R]
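The first argument of f is the index of the current partition, and the second is an iterator over all elements in that partition. The third type parameter is not an argument but the return type: an iterator over the results produced for the partition. The current RDD has element type T, and the returned RDD has element type R.
preservesPartitioning defaults to false in the signature above. It tells Spark whether f keeps the RDD's partitioner intact, and it should stay false unless this is a pair RDD and f does not change the keys. The examples here do not depend on it. Note that Java callers cannot rely on the Scala default value and must pass the argument explicitly.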
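Because f both receives and returns an iterator, it does not have to copy the partition into a list first. The sketch below is plain Java with BiFunction standing in for Spark's Function2 (an assumption; LazyTagging, TAG, and the "p<index>-" prefix are made-up names for illustration). It wraps the input iterator so each element is transformed only when it is requested.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.BiFunction;

// Hypothetical example class; not part of Spark.
class LazyTagging {

    // Same shape as f: (partition index, input iterator) -> output iterator.
    // BiFunction is a stand-in for Spark's Function2 in this sketch.
    static final BiFunction<Integer, Iterator<String>, Iterator<String>> TAG =
            (index, input) -> new Iterator<String>() {
                @Override
                public boolean hasNext() {
                    return input.hasNext();
                }

                @Override
                public String next() {
                    // Each element is transformed only when the consumer asks for it.
                    return "p" + index + "-" + input.next();
                }
            };

    public static void main(String[] args) {
        Iterator<String> out = TAG.apply(1, Arrays.asList("spark4", "spark5").iterator());
        while (out.hasNext()) {
            System.out.println(out.next()); // p1-spark4, then p1-spark5
        }
    }
}
```

Whether this matters depends on partition size; for large partitions, skipping the intermediate list keeps memory use lower.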
Code example:
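import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;

// The class name is arbitrary.
public class MapPartitionsWithIndexDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("test");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // Nine elements split across 3 partitions.
        JavaRDD<String> rdd1 = jsc.parallelize(Arrays.asList(
                "spark1", "spark2", "spark3",
                "spark4", "spark5", "spark6",
                "spark7", "spark8", "spark9"),
                3);
        JavaRDD<String> rdd2 = rdd1.mapPartitionsWithIndex(
                (Function2<Integer, Iterator<String>, Iterator<String>>) (index, iterator) -> {
                    // Pick the label once per partition: Beijing, Shanghai, or Guangzhou region.
                    String prefix;
                    if (index == 0) {
                        prefix = "【北京区】";
                    } else if (index == 1) {
                        prefix = "【上海区】";
                    } else {
                        prefix = "【广州区】";
                    }
                    // Build the result list inside the function so that no
                    // mutable state is shared between partitions.
                    List<String> result = new ArrayList<>();
                    while (iterator.hasNext()) {
                        result.add(prefix + iterator.next());
                    }
                    return result.iterator();
                },
                // Must be passed explicitly from Java; false is the safe value
                // for an RDD that is not a pair RDD.
                false);
        rdd2.foreach((VoidFunction<String>) s -> System.out.println(s));
        jsc.stop();
    }
}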
Output:
【北京区】spark1
【北京区】spark2
【北京区】spark3
【上海区】spark4
【上海区】spark5
【上海区】spark6
【广州区】spark7
【广州区】spark8
【广州区】spark9
2. Scala API definition
def mapPartitionsWithIndex[U : ClassTag]
(f: (Int, scala.Iterator[T]) => scala.Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
Code example:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// The object name is arbitrary.
object MapPartitionsWithIndexDemo {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("appName")
      .getOrCreate()
    val sc = sparkSession.sparkContext
    // Nine elements split across 3 partitions.
    val rdd1: RDD[String] = sc.parallelize(List(
      "spark1", "spark2", "spark3",
      "spark4", "spark5", "spark6",
      "spark7", "spark8", "spark9"),
      3)
    val rdd2: RDD[String] = rdd1.mapPartitionsWithIndex { (index, iter) =>
      // Pick the label once per partition: Beijing, Shanghai, or Guangzhou region.
      val prefix = index match {
        case 0 => "【北京区】"
        case 1 => "【上海区】"
        case _ => "【广州区】"
      }
      // Iterator.map is lazy and avoids the O(n) cost of appending
      // to an immutable List once per element.
      iter.map(prefix + _)
    }
    rdd2.foreach(println(_))
    sc.stop()
  }
}
Output:
【北京区】spark1
【北京区】spark2
【北京区】spark3
【上海区】spark4
【上海区】spark5
【上海区】spark6
【广州区】spark7
【广州区】spark8
【广州区】spark9