Reading the Source of Spark's First Simple Example

This post walks through the first shell example program from the official Spark documentation.

Reading Spark programs requires basic Scala knowledge; for Scala fundamentals, see reference 1.


The complete code is as follows:

scala> val textFile = sc.textFile("file:///usr/local/spark/README.md")
textFile: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/README.md MapPartitionsRDD[1] at textFile at <console>:27
scala> textFile.count()
res1: Long = 95
scala> textFile.first()
res3: String = # Apache Spark
scala> val lineWithSpark = textFile.filter(line=>line.contains("Spark"))
lineWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:29
scala> lineWithSpark.count()
16/05/10 22:35:51 INFO scheduler.DAGScheduler: Job 2 finished: count at <console>:32, took 0.043697 s
res2: Long = 17


The analysis is as follows (for the API, see reference 2):

1. textFile() reads a file into an RDD of lines; the signature is def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]. It can read from HDFS, but here the file:// scheme points at the local filesystem. The return type is RDD[String].

2. count() returns the number of elements in the RDD, i.e. the number of lines.

3. first() returns the first element of the RDD.

4. filter() keeps the elements that satisfy a predicate and forms a new RDD; the signature is def filter(f: (T) ⇒ Boolean): RDD[T].

Here each line is a String; a line is kept if it contains "Spark".

5. count() returns the number of lines in the new, filtered RDD (see the sketch after this list).
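
The same pipeline can also be chained without naming the intermediate RDD. A minimal sketch, reusing the textFile RDD from above: filter is a lazy transformation, so no computation happens until the count() action runs.

scala> // filter is lazy; count() is the action that triggers the job
scala> textFile.filter(line => line.contains("Spark")).count()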


Extended analysis 1:

A line can also be split on spaces, as in split(" "); finding the maximum number of words in any line looks like this:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 14
Because map and reduce take closures as arguments, the same computation can also be written as follows, mixing Scala with Java:

scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 14
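
The RDD API also provides a max() action (which takes an implicit Ordering), so the reduce above can be written more compactly; a minimal sketch, equivalent to the two versions above:

scala> // max() is an action; it uses the implicit Ordering[Int]
scala> textFile.map(line => line.split(" ").size).max()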

Extended analysis 2: word count, for comparison with MapReduce. The source code and output are as follows.

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:30
scala> wordCounts.collect()
16/05/10 23:27:04 INFO scheduler.DAGScheduler: Job 5 finished: collect at <console>:33, took 0.466633 s
res7: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (cluster.,1), (its,1), ([run,1), (general,2), (have,1), (pre-built,1), (YARN,,1), (locally,2), (changed,1), (locally.,1), (sc.parallelize(1,1), (only,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (["Specifying,1), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), ([params]`.,1), ([project,2), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (are,1), (systems.,1), (params,1), (scala>,1), (DataFrames,,1), (provides,1), (refer,2)...

Note that flatMap(), map(), and reduceByKey() are transformations here, while collect() is an RDD action.
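
To inspect just the most frequent words instead of the full array, one sketch (reusing the wordCounts RDD above) sorts the (word, count) pairs by count before collecting a handful of results; sortBy and take are standard RDD methods:

scala> // sortBy is a transformation; take(5) is an action returning the top 5 pairs
scala> wordCounts.sortBy(pair => pair._2, ascending = false).take(5)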


For a detailed explanation of RDD transformation operators, see reference 3.


References:

1. http://www.yiibai.com/scala/scala_file_io.html#

2. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.sources.StringContains

3. http://www.myexception.cn/other/1961405.html
