Creating an RDD
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class TestAPI {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("TestAPI");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            List<String> list2 = new ArrayList<String>();
            list2.add("a,b,c,d,e");
            list2.add("1,2,3,4,5");
            JavaRDD<String> list1 = jsc.parallelize(list2);
            System.out.println("list1: " + list1.collect());
        }
    }
}
Output
list1: [a,b,c,d,e, 1,2,3,4,5]
map
Signature: JavaRDD<R> map(Function<T,R> f)
What it does: map applies the given function f to every element of the dataset and returns a new RDD of the results.
JavaRDD<String[]> mapRDD = list1.map(
    new Function<String, String[]>() {
        @Override
        public String[] call(String s) throws Exception {
            return s.split(",");
        }
    }
);
List<String[]> result = mapRDD.collect();
System.out.println("mapRDD:");
for (int i = 0; i < result.size(); i++) {
    for (int j = 0; j < result.get(i).length; j++)
        System.out.print(result.get(i)[j] + " ");
    System.out.println();
}
Output
mapRDD:
a b c d e
1 2 3 4 5
flatMap
Signature: JavaRDD<U> flatMap(FlatMapFunction<T,U> f)
What it does: similar to map, except that f may map each element to zero or more elements, so f returns a sequence (an iterator) rather than a single element.
JavaRDD<String> flatmap = list1.flatMap(
    new FlatMapFunction<String, String>() {
        @Override
        public Iterator<String> call(String s) throws Exception {
            return Arrays.asList(s.split(",")).iterator();
        }
    }
);
System.out.println("flatmapRDD: " + flatmap.collect());
Output
flatmapRDD: [a, b, c, d, e, 1, 2, 3, 4, 5]
filter
Signature: JavaRDD<T> filter(Function<T,Boolean> f)
What it does: filters the dataset, keeping only the elements for which f returns true, and returns them as a new RDD.
JavaRDD<String> filterRDD = list1.filter(
    new Function<String, Boolean>() {
        @Override
        public Boolean call(String s) throws Exception {
            return s.contains("a");
        }
    }
);
System.out.println("filterRDD: " + filterRDD.collect());
Output
filterRDD: [a,b,c,d,e]
union
Signature: JavaRDD<T> union(JavaRDD<T> other)
What it does: merges two RDDs into a new RDD.
JavaRDD<String> unionRDD = list1.union(list1);
System.out.println("unionRDD: " + unionRDD.collect());
Output
unionRDD: [a,b,c,d,e, 1,2,3,4,5, a,b,c,d,e, 1,2,3,4,5]
Java has no built-in pair type, so Spark's Java API uses the scala.Tuple2 class for pairs. Create a pair with new Tuple2<>(elem1, elem2) and access its elements with the ._1() and ._2() methods.
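Because the Spark snippets in this article need the Spark and Scala jars on the classpath, here is a self-contained sketch of the two-field access pattern that scala.Tuple2 provides. The Pair class below is a made-up stand-in for illustration only; real Spark code uses scala.Tuple2 itself.

```java
// Toy stand-in for scala.Tuple2, for illustration only; real Spark code
// should use scala.Tuple2 directly.
public class PairDemo {
    static class Pair<A, B> {
        private final A first;
        private final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
        A _1() { return first; }   // mirrors Tuple2._1()
        B _2() { return second; }  // mirrors Tuple2._2()
    }

    public static void main(String[] args) {
        Pair<String, Integer> tp = new Pair<>("a", 1);
        System.out.println(tp._1() + " -> " + tp._2()); // prints "a -> 1"
    }
}
```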
groupByKey
Signature: JavaPairRDD<K,Iterable<V>> groupByKey(int numPartitions)
What it does: groups the elements that share the same key, returning a PairRDD of (K, Iterable<V>) pairs.
List<Tuple2<String, Integer>> pair1 = new ArrayList<>();
Tuple2<String, Integer> tp1 = new Tuple2<>("a", 1);
Tuple2<String, Integer> tp2 = new Tuple2<>("b", 2);
Tuple2<String, Integer> tp3 = new Tuple2<>("a", 3);
Tuple2<String, Integer> tp4 = new Tuple2<>("b", 4);
pair1.add(tp1);
pair1.add(tp2);
pair1.add(tp3);
pair1.add(tp4);
JavaPairRDD<String, Integer> pairRDD = jsc.parallelizePairs(pair1);
System.out.println("pairRDD: " + pairRDD.collect());
JavaPairRDD<String, Iterable<Integer>> groupByKeyRDD = pairRDD.groupByKey();
System.out.println("groupByKeyRDD: " + groupByKeyRDD.collect());
Output
pairRDD: [(a,1), (b,2), (a,3), (b,4)]
groupByKeyRDD: [(a,[1, 3]), (b,[2, 4])]
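The grouping groupByKey performs can be sketched without Spark using plain Java streams. This is an illustrative stand-alone analogue of the semantics (the class name GroupByKeySketch is made up), not Spark's distributed implementation:

```java
import java.util.*;
import java.util.stream.*;

public class GroupByKeySketch {
    public static void main(String[] args) {
        // The same (key, value) pairs as in the pairRDD example above
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("a", 1), Map.entry("b", 2),
                Map.entry("a", 3), Map.entry("b", 4));
        // Collect the values for each key, mirroring the (K, Iterable<V>) result
        Map<String, List<Integer>> grouped = pairs.stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        System.out.println(grouped); // prints {a=[1, 3], b=[2, 4]}
    }
}
```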
reduceByKey
函数原型: JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func,int numPartitions)
What it does: aggregates the values of elements that share the same key with the given function func, returning a PairRDD of (K, V) pairs.
JavaPairRDD<String, Integer> reduceByKeyRDD = pairRDD.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer integer, Integer integer2) throws Exception {
            return integer + integer2;
        }
    }
);
System.out.println("reduceByKeyRDD: " + reduceByKeyRDD.collect());
Output
reduceByKeyRDD: [(a,4), (b,6)]
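The per-key aggregation reduceByKey performs is equivalent to folding each key's values with func. A stand-alone plain-Java sketch of the semantics (illustrative only, not Spark's distributed implementation):

```java
import java.util.*;

public class ReduceByKeySketch {
    public static void main(String[] args) {
        // The same (key, value) pairs as in the pairRDD example above
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("a", 1), Map.entry("b", 2),
                Map.entry("a", 3), Map.entry("b", 4));
        // Fold the values per key with the same (v1, v2) -> v1 + v2 function
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            reduced.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        System.out.println(reduced); // prints {a=4, b=6}
    }
}
```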
mapValues
Signature: JavaPairRDD<K,U> mapValues(Function<V,U> f)
What it does: similar to map, but transforms only the value of each key-value pair with f, leaving the key unchanged, and returns a new PairRDD.
JavaPairRDD<String, String> mapValueRDD = pairRDD.mapValues(
    new Function<Integer, String>() {
        @Override
        public String call(Integer integer) throws Exception {
            return "NO." + integer;
        }
    }
);
System.out.println("mapValueRDD: " + mapValueRDD.collect());
Output
mapValueRDD: [(a,NO.1), (b,NO.2), (a,NO.3), (b,NO.4)]
join
Signature: JavaPairRDD<K,scala.Tuple2<V,W>> join(JavaPairRDD<K,W> other, int numPartitions)
What it does: joins elements that share the same key, returning a PairRDD of (K, (V, W)) pairs, where V and W are the values with that key from the two original RDDs.
JavaPairRDD<String, Iterable<String>> joinRDD2= mapValueRDD.groupByKey();
System.out.println("joinRDD2: " + joinRDD2.collect());
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<String>>> joinRDD = groupByKeyRDD.join(joinRDD2);
System.out.println("joinRDD: " + joinRDD.collect());
Output
joinRDD2: [(a,[NO.1, NO.3]), (b,[NO.2, NO.4])]
joinRDD: [(a,([1, 3],[NO.1, NO.3])), (b,([2, 4],[NO.2, NO.4]))]
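Conceptually, join keeps only the keys present in both PairRDDs and pairs up their values. A stand-alone plain-Java sketch of that inner-join semantics, using the grouped data from the example above (all names are illustrative):

```java
import java.util.*;

public class JoinSketch {
    public static void main(String[] args) {
        // Left and right sides, mirroring groupByKeyRDD and joinRDD2 above
        Map<String, List<Integer>> left = Map.of(
                "a", List.of(1, 3), "b", List.of(2, 4));
        Map<String, List<String>> right = Map.of(
                "a", List.of("NO.1", "NO.3"), "b", List.of("NO.2", "NO.4"));
        // Inner join: keep only keys present in both maps, pairing their values
        Map<String, Map.Entry<List<Integer>, List<String>>> joined = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : left.entrySet()) {
            if (right.containsKey(e.getKey())) {
                joined.put(e.getKey(), Map.entry(e.getValue(), right.get(e.getKey())));
            }
        }
        System.out.println(joined);
    }
}
```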
cogroup
Signature: JavaPairRDD<K,scala.Tuple2<Iterable<V>,Iterable<W>>> cogroup(JavaPairRDD<K,W> other, int numPartitions)
What it does: groups the elements of two PairRDDs that share the same key.
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<String>>> cogroupRDD = pairRDD.cogroup(mapValueRDD);
System.out.println("cogroupRDD: " + cogroupRDD.collect());
Output
cogroupRDD: [(a,([1, 3],[NO.1, NO.3])), (b,([2, 4],[NO.2, NO.4]))]
WordCount example
Goal: count how many times each word appears in a text file.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("WordCount");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Read the text file; each element is one line
            JavaRDD<String> music = jsc.textFile("TheSoundOfSilence.txt");
            // Split every line into words
            JavaRDD<String> words = music.flatMap(
                new FlatMapFunction<String, String>() {
                    @Override
                    public Iterator<String> call(String s) throws Exception {
                        return Arrays.asList(s.split(" ")).iterator();
                    }
                }
            );
            // Map each word to a (word, 1) pair
            JavaPairRDD<String, Integer> word_pair = words.mapToPair(
                new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String s) throws Exception {
                        return new Tuple2<>(s, 1);
                    }
                }
            );
            // Sum the counts for each word
            JavaPairRDD<String, Integer> word_count = word_pair.reduceByKey(
                new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(Integer integer, Integer integer2) throws Exception {
                        return integer + integer2;
                    }
                }
            );
            // Swap to (count, word) so the pairs can be sorted by count
            JavaPairRDD<Integer, String> word_pair_swap = word_count.mapToPair(
                new PairFunction<Tuple2<String, Integer>, Integer, String>() {
                    @Override
                    public Tuple2<Integer, String> call(Tuple2<String, Integer> tuple) throws Exception {
                        return tuple.swap();
                    }
                }
            );
            // Sort by count in descending order
            JavaPairRDD<Integer, String> word_pair_swap_sort = word_pair_swap.sortByKey(false);
            // Swap back to (word, count)
            JavaPairRDD<String, Integer> word_pair_reswap = word_pair_swap_sort.mapToPair(
                new PairFunction<Tuple2<Integer, String>, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple) throws Exception {
                        return tuple.swap();
                    }
                }
            );
            System.out.println(word_pair_reswap.collect());
            word_pair_reswap.saveAsTextFile("Result");
        }
    }
}
Output
[(the,18), (of,10), (And,9), (my,7), (silence,6), (I,6), (that,5), (words,4), (a,4), (,4), (People,3), (was,3), (light,3), (in,3), (In,3), (sound,3), (its,2), (neon,2), (without,2), (people,2), (walls,2), (said,2), (like,2), (might,2), (vision,2), (you,2), (to,2), (and,2), (touched,1), (voices,1), (reach,1), (Because,1), (it,1), (The,1), ("Fool",1), (writing,1), (Whispering,1), (listening,1), (old,1), (naked,1), (Hear,1), (Take,1), (still,1), (arms,1), (dare,1), (remains,1), (cobble,1), (To,1), (along,1), (stone,1), (echoed,1), (out,1), (halo,1), (I,"You,1), (hearing,1), (Within,1), (silent,1), (cancer,1), (are,1), (not,1), (god,1), (split,1), (do,1), (softly,1), (no,1), (When,1), (written,1), (Narrow,1), (restless,1), (halls",1), (prayed,1), (Ten,1), (creeping,1), (you",1), (bowed,1), (streets,1), (That,1), (Beneath,1), (darkness,1), (By,1), (I've,1), (collar,1), (sign,1), (Left,1), (made,1), (But,1), (subway,1), (Hello,1), (on,1), (brain,1), (never,1), (one,1), (with,1), (while,1), (flashed,1), (come,1), (tenement,1), (talk,1), (warning,1), (lamp,1), (Silence,1), (they,1), (eyes,1), (again,1), (sleeping,1), (prophets,1), (share,1), (sounds,1), (damp,1), (planted,1), ("The,1), (teach,1), (seeds,1), (forming,1), (street,1), (turned,1), (know,1), (flash,1), (more,1), (thousand,1), (fell,1), (maybe,1), (signs,1), (Disturb,1), (cold,1), (Sound,1), (stabbed,1), (raindrops,1), (friend,1), (saw,1), (dreams,1), (songs,1), (speaking,1), (talking,1), (walked,1), (were,1), (grows,1)]
The content of the input file used in this article is:
The Sound of silence
Hello darkness my old friend
I’ve come to talk with you again
Because a vision softly creeping
Left its seeds while I was sleeping
And the vision that was planted
In my brain still remains
Within the sound of silence
In restless dreams I walked along
Narrow streets of cobble stone
Beneath the halo of a street lamp
I turned my collar to the cold and damp
When my eyes were stabbed
By the flash of a neon light
That split the light
And touched the sound of silence
And in the naked light I saw
Ten thousand people maybe more
People talking without speaking
People hearing without listening
People writing songs that voices never share
And no one dare
Disturb the sound of silence
“Fool” said I,“You do not know
Silence like a cancer grows
Hear my words that I might teach you
Take my arms that I might reach you”
But my words like silent raindrops fell
And echoed in the walls of silence
And the people bowed and prayed
To the neon god they made
And the sign flashed out its warning
In the words that it was forming
And the signs said “The words of the prophets
are written on the subway walls
And tenement halls”
Whispering in the sounds of silence
The output is written to ./Result/part-00000; its content is:
(the,18)
(of,10)
(And,9)
(my,7)
(silence,6)
(I,6)
(that,5)
(words,4)
(a,4)
(,4)
(People,3)
(was,3)
(light,3)
(in,3)
(In,3)
(sound,3)
(its,2)
(neon,2)
(without,2)
(people,2)
(walls,2)
(said,2)
(like,2)
(might,2)
(vision,2)
(you,2)
(to,2)
(and,2)
(touched,1)
(voices,1)
(reach,1)
(Because,1)
(it,1)
(The,1)
(“Fool”,1)
(writing,1)
(Whispering,1)
(listening,1)
(old,1)
(naked,1)
(Hear,1)
(Take,1)
(still,1)
(arms,1)
(dare,1)
(remains,1)
(cobble,1)
(To,1)
(along,1)
(stone,1)
(echoed,1)
(out,1)
(halo,1)
(I,“You,1)
(hearing,1)
(Within,1)
(silent,1)
(cancer,1)
(are,1)
(not,1)
(god,1)
(split,1)
(do,1)
(softly,1)
(no,1)
(When,1)
(written,1)
(Narrow,1)
(restless,1)
(halls”,1)
(prayed,1)
(Ten,1)
(creeping,1)
(you",1)
(bowed,1)
(streets,1)
(That,1)
(Beneath,1)
(darkness,1)
(By,1)
(I’ve,1)
(collar,1)
(sign,1)
(Left,1)
(made,1)
(But,1)
(subway,1)
(Hello,1)
(on,1)
(brain,1)
(never,1)
(one,1)
(with,1)
(while,1)
(flashed,1)
(come,1)
(tenement,1)
(talk,1)
(warning,1)
(lamp,1)
(Silence,1)
(they,1)
(eyes,1)
(again,1)
(sleeping,1)
(prophets,1)
(share,1)
(sounds,1)
(damp,1)
(planted,1)
("The,1)
(teach,1)
(seeds,1)
(forming,1)
(street,1)
(turned,1)
(know,1)
(flash,1)
(more,1)
(thousand,1)
(fell,1)
(maybe,1)
(signs,1)
(Disturb,1)
(cold,1)
(Sound,1)
(stabbed,1)
(raindrops,1)
(friend,1)
(saw,1)
(dreams,1)
(songs,1)
(speaking,1)
(talking,1)
(walked,1)
(were,1)
(grows,1)
This article walked through the basic RDD operations in Apache Spark, including creation, map, flatMap, filter, union, groupByKey, and reduceByKey, and showed a complete WordCount example.