The first step is simple, just two parts: add the dependency, then write the code.
Add the dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>2.1.1</version>
</dependency>
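If your project builds against Scala 2.11 instead (Spark 2.1.x publishes artifacts for both Scala versions, and 2.10 support was later dropped), the only change is the suffix on the artifact id:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
```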
Writing Spark in Java:
import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // "local" runs a single-threaded local master; to connect to a cluster, pass the master URL instead
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wc");
        JavaSparkContext context = new JavaSparkContext(conf);
        // place a text.txt file in the working directory, at the same level as src
        JavaRDD<String> text = context.textFile("text.txt");
        // split each line into words
        JavaRDD<String> words = text.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });
        // map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> map = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        // sum the counts for each word
        JavaPairRDD<String, Integer> reduce = map.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) throws Exception {
                return a + b;
            }
        });
        // swap to (count, word) so the pairs can be sorted by count
        JavaPairRDD<Integer, String> sort = reduce.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            public Tuple2<Integer, String> call(Tuple2<String, Integer> tuple2) {
                return new Tuple2<Integer, String>(tuple2._2, tuple2._1);
            }
        });
        JavaPairRDD<Integer, String> sorted = sort.sortByKey(false); // false = descending order
        // swap back to (word, count)
        JavaPairRDD<String, Integer> results = sorted.mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
            public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple2) {
                return new Tuple2<String, Integer>(tuple2._2, tuple2._1);
            }
        });
        results.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            public void call(Tuple2<String, Integer> result) throws Exception {
                System.out.println(result._1 + " " + result._2);
            }
        });
        context.close();
    }
}
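The flatMap / map-to-pair / reduce-by-key / sort-descending pipeline above can be sanity-checked without a Spark cluster using plain Java 8 streams on a small in-memory list. This is only a sketch of the same logic (the class name, helper method, and sample lines here are made up for illustration, not part of Spark's API):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Stream equivalent of flatMap + mapToPair + reduceByKey + sortByKey(false):
    // split lines into words, count occurrences, then order by count descending.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))               // flatMap: line -> words
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())) // reduceByKey
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()) // sortByKey(false)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));                     // keep sorted order
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(Arrays.asList("a b a", "b a"));
        counts.forEach((w, c) -> System.out.println(w + " " + c)); // prints "a 3" then "b 2"
    }
}
```

The LinkedHashMap collector matters: a plain `toMap` would discard the descending order established by `sorted`.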
This article showed how to do word-frequency counting on a text file with Java and Apache Spark: starting from the Maven dependency, it walked through creating the SparkConf, reading the text file, and using flatMap, mapToPair, and reduceByKey to split, count, and sort the words.