This post walks through the first shell example program from the official Spark documentation.
Reading Spark code requires some basic Scala knowledge; for Scala fundamentals, see Reference 1.
The complete code is as follows:
scala> val textFile = sc.textFile("file:///usr/local/spark/README.md")
textFile: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/README.md MapPartitionsRDD[1] at textFile at <console>:27
scala> textFile.count()
res1: Long = 95
scala> textFile.first()
res3: String = # Apache Spark
scala> val lineWithSpark = textFile.filter(line=>line.contains("Spark"))
lineWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:29
scala> lineWithSpark.count()
16/05/10 22:35:51 INFO scheduler.DAGScheduler: Job 2 finished: count at <console>:32, took 0.043697 s
res2: Long = 17
The breakdown is as follows (for the API, see Reference 2):
1. textFile() reads a file and returns an RDD[String]. Here the file is read from the local filesystem via the file:// scheme; HDFS paths are also supported. Signature: def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String].
2. count() returns the number of elements (lines) in the RDD.
3. first() returns the first element of the RDD.
4. filter() keeps the elements that satisfy a predicate and produces a new RDD. Signature: def filter(f: (T) => Boolean): RDD[T].
Here line is a String (one line of the file), not a Scala List; a line is kept if it contains "Spark".
5. count() returns the number of lines in the new, filtered RDD.
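To see the filtering logic outside of Spark, here is a minimal plain-Scala sketch of the same predicate applied to a local List of strings (the sample lines are made up; in the shell session the data comes from README.md):

```scala
// Hypothetical sample lines standing in for the contents of README.md.
val lines = List("# Apache Spark", "Spark is a fast engine", "Hello world")

// Same predicate as in the shell session: keep lines containing "Spark".
val linesWithSpark = lines.filter(line => line.contains("Spark"))

println(linesWithSpark.size) // 2
```

Note one difference: on an RDD, filter is lazy and only runs when an action such as count() is invoked; on a plain List it executes immediately.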
Extension 1:
You can also split each line on spaces with split(" ") and then compute the maximum word count per line, like so:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 14
Because the arguments to map and reduce are closures, you can also write the following, mixing Scala with calls into Java libraries:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 14
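The same map/reduce expression can also be tried on a plain Scala collection, independent of Spark. A minimal sketch with made-up sample lines:

```scala
import java.lang.Math

// Hypothetical sample lines; the shell session uses README.md instead.
val lines = List("# Apache Spark", "Spark is a fast and general engine", "ok")

// Count the words per line, then reduce to the maximum.
val maxWords = lines.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))

println(maxWords) // 7
```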
Extension 2: word frequency counting, as a comparison with MapReduce. Source code and output below.
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:30
scala> wordCounts.collect()
16/05/10 23:27:04 INFO scheduler.DAGScheduler: Job 5 finished: collect at <console>:33, took 0.466633 s
res7: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (cluster.,1), (its,1), ([run,1), (general,2), (have,1), (pre-built,1), (YARN,,1), (locally,2), (changed,1), (locally.,1), (sc.parallelize(1,1), (only,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (["Specifying,1), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), ([params]`.,1), ([project,2), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (are,1), (systems.,1), (params,1), (scala>,1), (DataFrames,,1), (provides,1), (refer,2)...
Note here that flatMap(), map(), and reduceByKey() are RDD transformations, while collect() is an RDD action.
For a detailed explanation of RDD transformation operators, see Reference 3.
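As an illustration only (not Spark code), the flatMap / map / reduceByKey pipeline has a local analogue on plain Scala collections, with groupBy plus a sum standing in for reduceByKey:

```scala
// Hypothetical input; the shell session reads README.md instead.
val lines = List("to be or", "not to be")

// flatMap splits every line into words, producing one flat sequence.
val words = lines.flatMap(line => line.split(" "))

// (word, 1) pairs grouped by key and summed: a local stand-in for reduceByKey.
val wordCounts = words
  .map(word => (word, 1))
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(wordCounts.toList.sorted) // List((be,2), (not,1), (or,1), (to,2))
```

Unlike reduceByKey, which combines values per key across partitions with a shuffle, this local version materializes all pairs in memory first; it only mirrors the logic, not the execution model.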
References:
1. http://www.yiibai.com/scala/scala_file_io.html#
2. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.sources.StringContains
3. http://www.myexception.cn/other/1961405.html