Pair RDD
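Concept:
A Pair RDD is an RDD that stores key-value pairs; every element in it is a key-value pair.
Creating a Pair RDD in Java
Start from an ordinary RDD and convert it into a Pair RDD with the mapToPair method.
Example:
Take the first word of each string in an ordinary RDD as the key, keep the whole string as the value, and build a Pair RDD from the result.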
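import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
// sc is assumed to be an existing JavaSparkContext.
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("xiaobai is a big boss", "is it right", "sure you are right"));
// Key each line by its first word; the value is the whole line.
JavaPairRDD<String, String> sPair = rdd.mapToPair(new PairFunction<String, String, String>() {
    @Override
    public Tuple2<String, String> call(String s) throws Exception {
        return new Tuple2<String, String>(s.split(" ")[0], s);
    }
});
System.out.println(sPair.collect());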
The output is as follows:
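The pairs produced by this keying step can be reproduced without a Spark context. The following is a minimal plain-Java sketch, not Spark code: FirstWordKeySketch, keyByFirstWord and toPairs are hypothetical names introduced here, and Map.Entry stands in for Tuple2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FirstWordKeySketch {

    // Hypothetical helper: mirrors the body of call() in the PairFunction above,
    // with Map.Entry standing in for scala.Tuple2.
    public static Map.Entry<String, String> keyByFirstWord(String s) {
        return new SimpleEntry<>(s.split(" ")[0], s);
    }

    // Applies the keying step to every line, as mapToPair does per element.
    public static List<Map.Entry<String, String>> toPairs(List<String> lines) {
        return lines.stream()
                .map(FirstWordKeySketch::keyByFirstWord)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "xiaobai is a big boss", "is it right", "sure you are right");
        // Print each pair as (key,value).
        for (Map.Entry<String, String> pair : toPairs(lines)) {
            System.out.println("(" + pair.getKey() + "," + pair.getValue() + ")");
        }
    }
}
```

Running main prints (xiaobai,xiaobai is a big boss), (is,is it right) and (sure,sure you are right), one per line. In Spark the same expression, s.split(" ")[0], sits inside the call method passed to mapToPair.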
Summary of Pair RDD transformations:
Transformations on two Pair RDDs:
A usage example:
Filter out the pairs whose value is 20 characters or longer.
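import org.apache.spark.api.java.function.Function;
// Keep only the pairs whose value is shorter than 20 characters.
Function<Tuple2<String, String>, Boolean> longWordFilter =
    new Function<Tuple2<String, String>, Boolean>() {
        @Override
        public Boolean call(Tuple2<String, String> keyValue) {
            return keyValue._2().length() < 20;
        }
    };
// sPair is the Pair RDD built in the previous example.
JavaPairRDD<String, String> result = sPair.filter(longWordFilter);
System.out.println(result.collect());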
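The filter can also be written inline with an anonymous class. Note that this version uses a threshold of 15 characters rather than 20: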
JavaPairRDD<String, String> result = sPair.filter(new Function<Tuple2<String, String>, Boolean>() {
    @Override
    public Boolean call(Tuple2<String, String> stringStringTuple2) throws Exception {
        return stringStringTuple2._2().length() < 15;
    }
});
System.out.println(result.collect());
Execution result:
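The two filters above use different length limits (20 and 15 characters), so they do not keep the same records. The plain-Java sketch below checks this on the three sample strings. It only models the length test applied to the value and does not use Spark; ValueLengthFilterSketch, shorterThan and keep are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ValueLengthFilterSketch {

    // Hypothetical helper: true when the value is shorter than maxLength,
    // the same test the filter functions above apply to the pair's value (_2()).
    public static Predicate<String> shorterThan(int maxLength) {
        return value -> value.length() < maxLength;
    }

    // Keeps only the values that pass the length test.
    public static List<String> keep(List<String> values, int maxLength) {
        return values.stream()
                .filter(shorterThan(maxLength))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "xiaobai is a big boss", "is it right", "sure you are right");
        System.out.println(keep(values, 20)); // [is it right, sure you are right]
        System.out.println(keep(values, 15)); // [is it right]
    }
}
```

The sample values are 21, 11 and 18 characters long, so a limit of 20 drops only the first string, while a limit of 15 also drops the third.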