This demo is based on the code from the 尚硅谷 video course, combined with what I had previously learned on GitHub.
Code
package org.developer.bigdata.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf object and set the execution environment
    // for the Spark framework (local mode, all available cores)
    val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")

    // Create the Spark context
    val sc = new SparkContext(config)

    // Read the file line by line
    val lines: RDD[String] = sc.textFile("in/word.txt")

    // Split each line into individual words
    val words: RDD[String] = lines.flatMap(_.split(" "))

    // Restructure the data: turn each word into a (word, 1) pair
    val wordToOne: RDD[(String, Int)] = words.map((_, 1))

    // Group by key and aggregate the counts
    val wordToSum: RDD[(String, Int)] = wordToOne.reduceByKey(_ + _)

    // Collect the result to the driver and print it to the console
    val result: Array[(String, Int)] = wordToSum.collect()
    result.foreach(println)
  }
}
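The transformation chain (flatMap → map → reduceByKey) has the same shape as ordinary Scala collection operations, so the logic can be tried out without a Spark cluster. The sketch below is my own illustration, not from the course: it uses `groupBy` plus a sum as a stand-in for `reduceByKey`, and a made-up in-memory sample instead of `word.txt`.

```scala
object WordCountLocal {
  // Count words across lines, mirroring the Spark pipeline above:
  // flatMap -> map -> (groupBy + sum, standing in for reduceByKey)
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                              // split each line into words
      .map((_, 1))                                        // pair each word with 1
      .groupBy(_._1)                                      // group pairs by word
      .map { case (w, ones) => (w, ones.map(_._2).sum) }  // sum the 1s per word

  def main(args: Array[String]): Unit = {
    // hypothetical sample input, not the actual word.txt
    val sample = Seq("hello scala", "hello spark", "hello hadoop", "hello bigdata")
    wordCount(sample).foreach(println)
  }
}
```

The key difference from Spark is that these collection operations run eagerly on one machine, while RDD transformations are lazy and only execute when an action such as `collect()` is called.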
Result
20/05/27 21:28:28 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
20/05/27 21:28:28 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
20/05/27 21:28:28 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
20/05/27 21:28:28 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
20/05/27 21:28:28 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1304 bytes result sent to driver
20/05/27 21:28:28 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1329 bytes result sent to driver
20/05/27 21:28:28 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 22 ms on localhost (executor driver) (1/2)
20/05/27 21:28:28 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 24 ms on localhost (executor driver) (2/2)
20/05/27 21:28:28 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/05/27 21:28:28 INFO DAGScheduler: ResultStage 1 (collect at WordCount.scala:29) finished in 0.025 s
20/05/27 21:28:28 INFO DAGScheduler: Job 0 finished: collect at WordCount.scala:29, took 0.269880 s
20/05/27 21:28:28 INFO SparkContext: Invoking stop() from shutdown hook
20/05/27 21:28:28 INFO SparkUI: Stopped Spark web UI at http://10.106.20.63:4040
20/05/27 21:28:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/05/27 21:28:28 INFO MemoryStore: MemoryStore cleared
20/05/27 21:28:28 INFO BlockManager: BlockManager stopped
20/05/27 21:28:28 INFO BlockManagerMaster: BlockManagerMaster stopped
20/05/27 21:28:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
(scala,1) # result
(hello,4) # result
(bigdata,1)
(spark,1)
(hadoop,1)
20/05/27 21:28:28 INFO SparkContext: Successfully stopped SparkContext
20/05/27 21:28:28 INFO ShutdownHookManager: Shutdown hook called
20/05/27 21:28:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-719285ca-6713-44c8-8181-87279ed1b88f
This is word.txt, the contents of the input text file.
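The original file contents are not reproduced here, but they can be inferred from the printed counts (hello ×4; scala, spark, hadoop, bigdata ×1 each). One `word.txt` consistent with that output would be:

```
hello scala
hello spark
hello hadoop
hello bigdata
```

This is only a reconstruction; the actual file may have arranged the words differently.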
Explanation
The underlying principle is illustrated by a figure referenced from an earlier post.