Creating an RDD
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class TestAPI {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("TestAPI");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            List<String> list2 = new ArrayList<String>();
            list2.add("a,b,c,d,e");
            list2.add("1,2,3,4,5");
            JavaRDD<String> list1 = jsc.parallelize(list2);
            System.out.println("list1: " + list1.collect());
        }
    }
}
Output
list1: [a,b,c,d,e, 1,2,3,4,5]
map
Signature: JavaRDD<R> map(Function<T,R> f)
What it does: map applies the given function f to every element of the dataset and returns a new RDD of the results.
JavaRDD<String[]> mapRDD = list1.map(
    new Function<String, String[]>() {
        @Override
        public String[] call(String s) throws Exception {
            return s.split(",");
        }
    }
);
List<String[]> result = mapRDD.collect();
System.out.println("mapRDD:");
for (int i = 0; i < result.size(); i++) {
    for (int j = 0; j < result.get(i).length; j++)
        System.out.print(result.get(i)[j] + " ");
    System.out.println();
}
Output
mapRDD:
a b c d e
1 2 3 4 5
flatMap
Signature: JavaRDD<U> flatMap(FlatMapFunction<T,U> f)
What it does: similar to map, except that f may map each element to zero or more elements, so f returns a sequence (an iterator) rather than a single element.
JavaRDD<String> flatmap = list1.flatMap(
    new FlatMapFunction<String, String>() {
        @Override
        public Iterator<String> call(String s) throws Exception {
            return Arrays.asList(s.split(",")).iterator();
        }
    }
);
System.out.println("flatmapRDD: " + flatmap.collect());
Output
flatmapRDD: [a, b, c, d, e, 1, 2, 3, 4, 5]
filter
Signature: JavaRDD<T> filter(Function<T,Boolean> f)
What it does: filters the dataset, keeping only the elements for which f returns true, and returns them as a new RDD.
JavaRDD<String> filterRDD = list1.filter(
    new Function<String, Boolean>() {
        @Override
        public Boolean call(String s) throws Exception {
            return s.contains("a");
        }
    }
);
System.out.println("filterRDD: " + filterRDD.collect());
Output
filterRDD: [a,b,c,d,e]
union
Signature: JavaRDD<T> union(JavaRDD<T> other)
What it does: merges two RDDs into a new RDD.
JavaRDD<String> unionRDD = list1.union(list1);
System.out.println("unionRDD: " + unionRDD.collect());
Output
unionRDD: [a,b,c,d,e, 1,2,3,4,5, a,b,c,d,e, 1,2,3,4,5]
Java has no built-in pair type, so Spark's Java API uses the scala.Tuple2 class for pairs. Create a pair with new Tuple2<>(elem1, elem2) and access its elements with the ._1() and ._2() methods.
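Because the Spark snippets in this article need the Spark and Scala jars on the classpath, here is a self-contained sketch of the two-field access pattern that scala.Tuple2 provides. The Pair class below is a made-up stand-in for illustration only; real Spark code uses scala.Tuple2 itself.

```java
// Toy stand-in for scala.Tuple2, for illustration only; real Spark code
// should use scala.Tuple2 directly.
public class PairDemo {
    static class Pair<A, B> {
        private final A first;
        private final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
        A _1() { return first; }   // mirrors Tuple2._1()
        B _2() { return second; }  // mirrors Tuple2._2()
    }

    public static void main(String[] args) {
        Pair<String, Integer> tp = new Pair<>("a", 1);
        System.out.println(tp._1() + " -> " + tp._2()); // prints "a -> 1"
    }
}
```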
groupByKey
Signature: JavaPairRDD<K,Iterable<V>> groupByKey(int numPartitions)
What it does: groups the elements that share the same key, returning a PairRDD of (K, Iterable<V>) pairs.
List<Tuple2<String, Integer>> pair1 = new ArrayList<>();
Tuple2<String, Integer> tp1 = new Tuple2<>("a", 1);
Tuple2<String, Integer> tp2 = new Tuple2<>("b", 2);
Tuple2<String, Integer> tp3 = new Tuple2<>("a", 3);
Tuple2<String, Integer> tp4 = new Tuple2<>("b", 4);
pair1.add(tp1);
pair1.add(tp2);
pair1.add(tp3);
pair1.add(tp4);
JavaPairRDD<String, Integer> pairRDD = jsc.parallelizePairs(pair1);
System.out.println("pairRDD: " + pairRDD.collect());
JavaPairRDD<String, Iterable<Integer>> groupByKeyRDD = pairRDD.groupByKey();
System.out.println("groupByKeyRDD: " + groupByKeyRDD.collect());
Output
pairRDD: [(a,1), (b,2), (a,3), (b,4)]
groupByKeyRDD: [(a,[1, 3]), (b,[2, 4])]
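The grouping groupByKey performs can be sketched without Spark using plain Java streams. This is an illustrative stand-alone analogue of the semantics (the class name GroupByKeySketch is made up), not Spark's distributed implementation:

```java
import java.util.*;
import java.util.stream.*;

public class GroupByKeySketch {
    public static void main(String[] args) {
        // The same (key, value) pairs as in the pairRDD example above
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("a", 1), Map.entry("b", 2),
                Map.entry("a", 3), Map.entry("b", 4));
        // Collect the values for each key, mirroring the (K, Iterable<V>) result
        Map<String, List<Integer>> grouped = pairs.stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        System.out.println(grouped); // prints {a=[1, 3], b=[2, 4]}
    }
}
```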
reduceByKey
函数原型: JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func,int numPartitions)
What it does: aggregates the values of elements that share the same key with the given function func, returning a PairRDD of (K, V) pairs.
JavaPairRDD<String, Integer> reduceByKeyRDD = pairRDD.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer integer, Integer integer2) throws Exception {
            return integer + integer2;
        }
    }
);
System.out.println("reduceByKeyRDD: " + reduceByKeyRDD.collect());
Output
reduceByKeyRDD: [(a,4), (b,6)]
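The per-key aggregation reduceByKey performs is equivalent to folding each key's values with func. A stand-alone plain-Java sketch of the semantics (illustrative only, not Spark's distributed implementation):

```java
import java.util.*;

public class ReduceByKeySketch {
    public static void main(String[] args) {
        // The same (key, value) pairs as in the pairRDD example above
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("a", 1), Map.entry("b", 2),
                Map.entry("a", 3), Map.entry("b", 4));
        // Fold the values per key with the same (v1, v2) -> v1 + v2 function
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            reduced.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        System.out.println(reduced); // prints {a=4, b=6}
    }
}
```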
mapValues
Signature: JavaPairRDD<K,U> mapValues(Function<V,U> f)
What it does: similar to map, but transforms only the value of each key-value pair with f, leaving the key unchanged, and returns a new PairRDD.
JavaPairRDD<String, String> mapValueRDD = pairRDD.mapValues(
    new Function<Integer, String>() {
        @Override
        public String call(Integer integer) throws Exception {
            return "NO." + integer;
        }
    }
);
System.out.println("mapValueRDD: " + mapValueRDD.collect());
Output
mapValueRDD: [(a,NO.1), (b,NO.2), (a,NO.3), (b,NO.4)]
join
Signature: JavaPairRDD<K,scala.Tuple2<V,W>> join(JavaPairRDD<K,W> other, int numPartitions)
What it does: joins elements that share the same key, returning a PairRDD of (K, (V, W)) pairs, where V and W are the values with that key from the two original RDDs.
JavaPairRDD<String, Iterable<String>> joinRDD2= mapValueRDD.groupByKey();
System.out.println("joinRDD2: " + joinRDD2.collect());
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<String>>> joinRDD = groupByKeyRDD.join(joinRDD2);
System.out.println("joinRDD: " + joinRDD.collect());
Output
joinRDD2: [(a,[NO.1, NO.3]), (b,[NO.2, NO.4])]
joinRDD: [(a,([1, 3],[NO.1, NO.3])), (b,([2, 4],[NO.2, NO.4]))]
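Conceptually, join keeps only the keys present in both PairRDDs and pairs up their values. A stand-alone plain-Java sketch of that inner-join semantics, using the grouped data from the example above (all names are illustrative):

```java
import java.util.*;

public class JoinSketch {
    public static void main(String[] args) {
        // Left and right sides, mirroring groupByKeyRDD and joinRDD2 above
        Map<String, List<Integer>> left = Map.of(
                "a", List.of(1, 3), "b", List.of(2, 4));
        Map<String, List<String>> right = Map.of(
                "a", List.of("NO.1", "NO.3"), "b", List.of("NO.2", "NO.4"));
        // Inner join: keep only keys present in both maps, pairing their values
        Map<String, Map.Entry<List<Integer>, List<String>>> joined = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : left.entrySet()) {
            if (right.containsKey(e.getKey())) {
                joined.put(e.getKey(), Map.entry(e.getValue(), right.get(e.getKey())));
            }
        }
        System.out.println(joined);
    }
}
```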
cogroup
Signature: JavaPairRDD<K,scala.Tuple2<Iterable<V>,Iterable<W>>> cogroup(JavaPairRDD<K,W> other, int numPartitions)
What it does: groups the elements of two PairRDDs that share the same key.
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<String>>> cogroupRDD = pairRDD.cogroup(mapValueRDD);
System.out.println("cogroupRDD: " + cogroupRDD.collect());
Output
cogroupRDD: [(a,([1, 3],[NO.1, NO.3])), (b,([2, 4],[NO.2, NO.4]))]
WordCount example
Goal: count how many times each word appears in a text file.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("WordCount");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Read the text file; each element is one line
            JavaRDD<String> music = jsc.textFile("TheSoundOfSilence.txt");
            // Split every line into words
            JavaRDD<String> words = music.flatMap(
                new FlatMapFunction<String, String>() {
                    @Override
                    public Iterator<String> call(String s) throws Exception {
                        return Arrays.asList(s.split(" ")).iterator();
                    }
                }
            );
            // Map each word to a (word, 1) pair
            JavaPairRDD<String, Integer> word_pair = words.mapToPair(
                new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String s) throws Exception {
                        return new Tuple2<>(s, 1);
                    }
                }
            );
            // Sum the counts for each word
            JavaPairRDD<String, Integer> word_count = word_pair.reduceByKey(
                new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(Integer integer, Integer integer2) throws Exception {
                        return integer + integer2;
                    }
                }
            );
            // Swap to (count, word) so the pairs can be sorted by count
            JavaPairRDD<Integer, String> word_pair_swap = word_count.mapToPair(
                new PairFunction<Tuple2<String, Integer>, Integer, String>() {
                    @Override
                    public Tuple2<Integer, String> call(Tuple2<String, Integer> tuple) throws Exception {
                        return tuple.swap();
                    }
                }
            );
            // Sort by count in descending order
            JavaPairRDD<Integer, String> word_pair_swap_sort = word_pair_swap.sortByKey(false);
            // Swap back to (word, count)
            JavaPairRDD<String, Integer> word_pair_reswap = word_pair_swap_sort.mapToPair(
                new PairFunction<Tuple2<Integer, String>, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple) throws Exception {
                        return tuple.swap();
                    }
                }
            );
            System.out.println(word_pair_reswap.collect());
            word_pair_reswap.saveAsTextFile("Result");
        }
    }
}
Output
[(the,18), (of,10), (And,9), (my,7), (silence,6), (I,6), (that,5), (words,4), (a,4), (,4), (People,3), (was,3), (light,3), (in,3), (In,3), (sound,3), (its,2), (neon,2), (without,2), (people,2), (walls,2), (said,2), (like,2), (might,2), (vision,2), (you,2), (to,2), (and,2), (touched,1), (voices,1), (reach,1), (Because,1), (it,1), (The,1), ("Fool",1), (writing,1), (Whispering,1), (listening,1), (old,1), (naked,1), (Hear,1), (Take,1), (still,1), (arms,1), (dare,1), (remains,1), (cobble,1), (To,1), (along,1), (stone,1), (echoed,1), (out,1), (halo,1), (I,"You,1), (hearing,1), (Within,1), (silent,1), (cancer,1), (are,1), (not,1), (god,1), (split,1), (do,1), (softly,1), (no,1), (When,1), (written,1), (Narrow,1), (restless,1), (halls",1), (prayed,1), (Ten,1), (creeping,1), (you",1), (bowed,1), (streets,1), (That,1), (Beneath,1), (darkness,1), (By,1), (I've,1), (collar,1), (sign,1), (Left,1), (made,1), (But,1), (subway,1), (Hello,1), (on,1), (brain,1), (never,1), (one,1), (with,1), (while,1), (flashed,1), (come,1), (tenement,1), (talk,1), (warning,1), (lamp,1), (Silence,1), (they,1), (eyes,1), (again,1), (sleeping,1), (prophets,1), (share,1), (sounds,1), (damp,1), (planted,1), ("The,1), (teach,1), (seeds,1), (forming,1), (street,1), (turned,1), (know,1), (flash,1), (more,1), (thousand,1), (fell,1), (maybe,1), (signs,1), (Disturb,1), (cold,1), (Sound,1), (stabbed,1), (raindrops,1), (friend,1), (saw,1), (dreams,1), (songs,1), (speaking,1), (talking,1), (walked,1), (were,1), (grows,1)]
The content of the input file used in this article is:
The Sound of silence
Hello darkness my old friend
I’ve come to talk with you again
Because a vision softly creeping
Left its seeds while I was sleeping
And the vision that was planted
In my brain still remains
Within the sound of silence
In restless dreams I walked along
Narrow streets of cobble stone
Beneath the halo of a street lamp
I turned my collar to the cold and damp
When my eyes were stabbed
By the flash of a neon light
That split the light
And touched the sound of silence
And in the naked light I saw
Ten thousand people maybe more
People talking without speaking
People hearing without listening
People writing songs that voices never share
And no one dare
Disturb the sound of silence
“Fool” said I,“You do not know
Silence like a cancer grows
Hear my words that I might teach you
Take my arms that I might reach you”
But my words like silent raindrops fell
And echoed in the walls of silence
And the people bowed and prayed
To the neon god they made
And the sign flashed out its warning
In the words that it was forming
And the signs said “The words of the prophets
are written on the subway walls
And tenement halls”
Whispering in the sounds of silence
The output is written to ./Result/part-00000; its content is:
(the,18)
(of,10)
(And,9)
(my,7)
(silence,6)
(I,6)
(that,5)
(words,4)
(a,4)
(,4)
(People,3)
(was,3)
(light,3)
(in,3)
(In,3)
(sound,3)
(its,2)
(neon,2)
(without,2)
(people,2)
(walls,2)
(said,2)
(like,2)
(might,2)
(vision,2)
(you,2)
(to,2)
(and,2)
(touched,1)
(voices,1)
(reach,1)
(Because,1)
(it,1)
(The,1)
(“Fool”,1)
(writing,1)
(Whispering,1)
(listening,1)
(old,1)
(naked,1)
(Hear,1)
(Take,1)
(still,1)
(arms,1)
(dare,1)
(remains,1)
(cobble,1)
(To,1)
(along,1)
(stone,1)
(echoed,1)
(out,1)
(halo,1)
(I,“You,1)
(hearing,1)
(Within,1)
(silent,1)
(cancer,1)
(are,1)
(not,1)
(god,1)
(split,1)
(do,1)
(softly,1)
(no,1)
(When,1)
(written,1)
(Narrow,1)
(restless,1)
(halls”,1)
(prayed,1)
(Ten,1)
(creeping,1)
(you",1)
(bowed,1)
(streets,1)
(That,1)
(Beneath,1)
(darkness,1)
(By,1)
(I’ve,1)
(collar,1)
(sign,1)
(Left,1)
(made,1)
(But,1)
(subway,1)
(Hello,1)
(on,1)
(brain,1)
(never,1)
(one,1)
(with,1)
(while,1)
(flashed,1)
(come,1)
(tenement,1)
(talk,1)
(warning,1)
(lamp,1)
(Silence,1)
(they,1)
(eyes,1)
(again,1)
(sleeping,1)
(prophets,1)
(share,1)
(sounds,1)
(damp,1)
(planted,1)
("The,1)
(teach,1)
(seeds,1)
(forming,1)
(street,1)
(turned,1)
(know,1)
(flash,1)
(more,1)
(thousand,1)
(fell,1)
(maybe,1)
(signs,1)
(Disturb,1)
(cold,1)
(Sound,1)
(stabbed,1)
(raindrops,1)
(friend,1)
(saw,1)
(dreams,1)
(songs,1)
(speaking,1)
(talking,1)
(walked,1)
(were,1)
(grows,1)
This article walked through the basic RDD operations in Apache Spark, including creation, map, flatMap, filter, union, groupByKey, and reduceByKey, and showed a complete WordCount example.