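1. reduceByKey
Takes a function and uses it to merge all values that share the same key, much like Scala's reduce applied separately to each key. The function should be associative and commutative, because Spark merges values inside each partition first and then merges the partial results across partitions.
Scala version
Example 1: reduceByKey on a list of pairs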
val rdd1=sc.makeRDD(List((1,2),(1,3),(4,6),(4,8),(5,1)))
val rdd2=rdd1.reduceByKey((x,y)=>{println(x+"+"+y);x+y})
rdd2.collect.foreach(println)
Example 2: word count
Prepare an input file with the following contents:
aa bb cc aa aa aa dd dd zz ee ff ee
ff aa bb zz
ee kk
cc zz zzz
val rdd=sc.textFile("D:/test/sample.txt")
rdd.flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_).collect.foreach(println)
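Java version
// Assumes sc is an existing JavaSparkContext (Spark 2.x or later, where call() returns an Iterator).
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;

JavaRDD<String> rdd = sc.textFile("in/sample.txt");
// A plain FlatMapFunction can also emit (word, 1) tuples, but rdd.flatMap(...) returns a
// JavaRDD of tuples, which has no reduceByKey. It is shown for comparison and is not used below.
FlatMapFunction<String, Tuple2<String, Integer>> flatMapFunction = new FlatMapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
        String[] split = s.split(" ");
        ArrayList<Tuple2<String, Integer>> list = new ArrayList<>();
        for (String str : split) {
            Tuple2<String, Integer> t2 = new Tuple2<>(str, 1);
            list.add(t2);
        }
        return list.iterator();
    }
};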
// PairFlatMapFunction splits each line into words and emits (word, 1);
// used with flatMapToPair it produces a JavaPairRDD, which does support reduceByKey.
PairFlatMapFunction<String, String, Integer> pairFlatMapFunction = new PairFlatMapFunction<String, String, Integer>() {
    @Override
    public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
        String[] split = s.split(" ");
        ArrayList<Tuple2<String, Integer>> list = new ArrayList<>();
        for (String str : split) {
            Tuple2<String, Integer> t2 = new Tuple2<>(str, 1);
            list.add(t2);
        }
        return list.iterator();
    }
};
// Reduce function: adds two counts that belong to the same word.
Function2<Integer, Integer, Integer> function2 = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
};
JavaPairRDD<String, Integer> stringIntegerJavaPairRDD
        = rdd.flatMapToPair(pairFlatMapFunction).reduceByKey(function2);
List<Tuple2<String, Integer>> collect = stringIntegerJavaPairRDD.collect();
for (Tuple2<String, Integer> t : collect) {
    System.out.println(t);
}
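To make the per-key merge concrete, here is a minimal plain-Java sketch that models what reduceByKey computes on the pairs from Example 1. It does not use Spark and ignores partitioning; the class ReduceByKeySketch and its methods are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of reduceByKey (no Spark involved; all names are hypothetical).
class ReduceByKeySketch {

    // Small helper that builds one (key, value) pair.
    static Map.Entry<Integer, Integer> pair(Integer k, Integer v) {
        return new SimpleEntry<>(k, v);
    }

    // The same pairs as in Example 1.
    static List<Map.Entry<Integer, Integer>> samplePairs() {
        return Arrays.asList(pair(1, 2), pair(1, 3), pair(4, 6), pair(4, 8), pair(5, 1));
    }

    // Merges the values of equal keys with func, keeping keys in first-seen order.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> func) {
        Map<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            // The first value of a key is stored as-is; later values are merged with func.
            result.merge(p.getKey(), p.getValue(), func);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reduceByKey(samplePairs(), Integer::sum)); // {1=5, 4=14, 5=1}
    }
}
```

A key that occurs only once, such as 5, never reaches the function at all; its value is passed through unchanged.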
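2. foldByKey
foldByKey folds the values V of an RDD[K,V] together per key K. The zeroValue parameter is the starting value: for each key, the function is first applied to zeroValue and the first V, which initializes the accumulator, and is then applied to the accumulator and each remaining V. Spark does this once per key in every partition that contains the key, so zeroValue should normally be the identity of the function (0 for addition); with any other value the result depends on how the data is partitioned.
Scala version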
val rdd=sc.makeRDD(List(("A",1),("A",2),("B",2),("C",5)))
rdd.foldByKey(1)(_+_).collect.foreach(println)
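Output (the two A records sit in different partitions here, so zeroValue 1 is added to each of them):
(A,5)   // (1+1) + (1+2) = 2 + 3 = 5
(B,3)   // 1 + 2 = 3
(C,6)   // 1 + 5 = 6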
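The (A,5) result only makes sense once partitions are taken into account. The following plain-Java sketch models that behaviour under a simplifying assumption: each partition is a list, zeroValue starts the accumulator of every key once per partition, and the partial results are then merged with the same function. It does not use Spark, and FoldByKeySketch and its methods are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java model of foldByKey over explicit partitions (no Spark; names are hypothetical).
class FoldByKeySketch {

    static Map.Entry<String, Integer> pair(String k, Integer v) {
        return new SimpleEntry<>(k, v);
    }

    // Each record in its own partition, as the output (A,5) implies.
    static List<List<Map.Entry<String, Integer>>> spreadOut() {
        return Arrays.asList(
                Arrays.asList(pair("A", 1)), Arrays.asList(pair("A", 2)),
                Arrays.asList(pair("B", 2)), Arrays.asList(pair("C", 5)));
    }

    // All records in a single partition.
    static List<List<Map.Entry<String, Integer>>> singlePartition() {
        return Arrays.asList(
                Arrays.asList(pair("A", 1), pair("A", 2), pair("B", 2), pair("C", 5)));
    }

    static Map<String, Integer> foldByKey(List<List<Map.Entry<String, Integer>>> partitions,
                                          int zeroValue, BinaryOperator<Integer> func) {
        Map<String, Integer> merged = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            // Step 1: fold inside the partition, starting every key from zeroValue.
            Map<String, Integer> local = new TreeMap<>();
            for (Map.Entry<String, Integer> p : partition) {
                Integer acc = local.getOrDefault(p.getKey(), zeroValue);
                local.put(p.getKey(), func.apply(acc, p.getValue()));
            }
            // Step 2: merge this partition's partial results into the global result.
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), func);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(foldByKey(spreadOut(), 1, Integer::sum));       // {A=5, B=3, C=6}
        System.out.println(foldByKey(singlePartition(), 1, Integer::sum)); // {A=4, B=3, C=6}
        System.out.println(foldByKey(spreadOut(), 0, Integer::sum));       // {A=3, B=2, C=5}
    }
}
```

With zeroValue 1 the value for A changes from 5 to 4 when both records share a partition, whereas with zeroValue 0 the sums are the same for any layout.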
3. sortByKey
sortByKey sorts a pair RDD by key. Its first parameter, ascending, can be set to true or false and defaults to true (ascending order).
Scala version
val rdd = sc.makeRDD(List((5,"hello"),(2,"world"),(1,"scala"),(4,"java")))
rdd.sortByKey.collect.foreach(println)
Java version
// Assumes sc is an existing JavaSparkContext.
List<Tuple2<Integer, String>> list = new ArrayList<>();
list.add(new Tuple2<>(5, "hello"));
list.add(new Tuple2<>(2, "world"));
list.add(new Tuple2<>(1, "scala"));
list.add(new Tuple2<>(4, "java"));
JavaRDD<Tuple2<Integer, String>> rdd = sc.parallelize(list);
// Identity PairFunction: converts the JavaRDD of tuples into a JavaPairRDD so that sortByKey is available.
PairFunction<Tuple2<Integer, String>, Integer, String> pairFunction = new PairFunction<Tuple2<Integer, String>, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(Tuple2<Integer, String> t2) throws Exception {
        return t2;
    }
};
JavaPairRDD<Integer, String> pairRdd = rdd.mapToPair(pairFunction);
JavaPairRDD<Integer, String> sortPairRdd = pairRdd.sortByKey(); // pass false to sortByKey for descending order
List<Tuple2<Integer, String>> collect = sortPairRdd.collect();
for (Tuple2<Integer, String> t : collect) {
    System.out.println(t);
}
Output:
(1,scala)
(2,world)
(4,java)
(5,hello)
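The effect of the ascending flag can be illustrated without a cluster. The sketch below is a plain-Java model, not Spark code: it sorts the same four pairs by key in either direction. SortByKeySketch and its methods are hypothetical names, and Map.Entry stands in for Tuple2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Plain-Java model of sortByKey(ascending) (no Spark; names are hypothetical).
class SortByKeySketch {

    static Map.Entry<Integer, String> pair(Integer k, String v) {
        return new SimpleEntry<>(k, v);
    }

    // The same pairs as in the examples above.
    static List<Map.Entry<Integer, String>> sample() {
        return Arrays.asList(pair(5, "hello"), pair(2, "world"), pair(1, "scala"), pair(4, "java"));
    }

    // Returns a sorted copy; the input list is left untouched, as an RDD would be.
    static List<Map.Entry<Integer, String>> sortByKey(List<Map.Entry<Integer, String>> pairs,
                                                      boolean ascending) {
        List<Map.Entry<Integer, String>> sorted = new ArrayList<>(pairs);
        Comparator<Map.Entry<Integer, String>> byKey = Map.Entry.comparingByKey();
        sorted.sort(ascending ? byKey : byKey.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortByKey(sample(), true));  // [1=scala, 2=world, 4=java, 5=hello]
        System.out.println(sortByKey(sample(), false)); // [5=hello, 4=java, 2=world, 1=scala]
    }
}
```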