1. Scenario: Counting the Words in a File
2. Scala Implementation:
package cn.com.git.scala.spark.test

import org.apache.spark._
import org.apache.spark.SparkContext._

object SparkWordCount {
  val FILE_NAME: String = "word_count_results_"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "WordCount", System.getenv("SPARK_HOME"),
      Seq(System.getenv("SPARK_TEST_JAR")))
    // Local input file
    val filePath = "E:/temp/demo.txt"
    val textFile = sc.textFile(filePath)
    // Split each line on spaces and count how many times each word occurs
    val wordCounts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
    // Merge everything into one partition and save the counts to a file
    wordCounts.repartition(1).saveAsTextFile(FILE_NAME + System.currentTimeMillis())
    println("Word count results saved successfully.")
    sc.stop()
  }
}
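The four-argument SparkContext constructor used above comes from early Spark releases; more recent versions normally build the context from a SparkConf instead. Below is a minimal sketch of the same job with that setup (the object name SparkWordCountConf and the local[*] master are illustrative choices, not part of the original):

package cn.com.git.scala.spark.test

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCountConf {
  def main(args: Array[String]): Unit = {
    // Configure the application through SparkConf instead of positional arguments
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local[*]") // use all local cores; illustrative, adjust for a cluster
    val sc = new SparkContext(conf)

    val wordCounts = sc.textFile("E:/temp/demo.txt") // same sample input as above
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    wordCounts.repartition(1).saveAsTextFile("word_count_results_" + System.currentTimeMillis())
    sc.stop()
  }
}

setMaster is hard-coded here only for local testing; in production the master is usually supplied externally via spark-submit.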
3. Execution Results
A result directory named word_count_results_1500537461876 (the suffix is the current time in milliseconds) is created under the project directory;
it contains the files _SUCCESS and part-00000.
The word counts are in the part-00000 file:
(F48_20170531_090622_75403,9)
(INFO,9)
(mon.router.r,9)
(296,9)
(2017-06-16-15:33:20.240,9)
(pquery.CifInfoQueryByAcct,9)
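Each line in part-00000 is the default toString rendering of a (key, value) tuple, as shown above. To consume the counts programmatically, the file can be read back and parsed; a minimal sketch, assuming an existing SparkContext sc and the output directory produced by the run above:

// Read the saved output back and parse the "(word,count)" lines.
// Assumes an existing SparkContext `sc` and the directory name from the run above.
val parsed = sc.textFile("word_count_results_1500537461876/part-00000")
  .map { line =>
    val trimmed = line.stripPrefix("(").stripSuffix(")")
    val idx = trimmed.lastIndexOf(',') // split on the last comma,
    (trimmed.substring(0, idx),        // since the word itself may contain commas
     trimmed.substring(idx + 1).toInt)
  }
parsed.take(5).foreach(println)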