[Spark] Spark 3: Spark RDD API Basics, Part 1

This post covers basic RDD operations in Spark: creating, filtering, and mapping RDDs in a functional style. It also shows how to read from and write to HDFS and how caching can speed up repeated computations. Finally, it walks through a word-count example and sorts the results by frequency.


Basic RDD Operations

The RDD is the basic abstraction that Spark exposes to the programmer; most map/reduce-style operations are performed on RDDs.

 

1. Turn a List into an RDD

 

scala> val rdd = sc.parallelize(List(1,2,3,4,5));
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

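parallelize also accepts an explicit partition count as a second argument. A minimal sketch, not from the original session, using 2 partitions as an arbitrary example:

// spread the list over 2 partitions instead of the default
val rdd2 = sc.parallelize(List(1, 2, 3, 4, 5), 2)
rdd2.partitions.length    // returns 2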
 

2. Filter the RDD

 

scala> val filteredRDD = rdd.filter(_ >= 4);
filteredRDD: org.apache.spark.rdd.RDD[Int] = FilteredRDD[1] at filter at <console>:14

 

3. Run collect on filteredRDD

 

scala> filteredRDD.collect();

res0: Array[Int] = Array(4, 5) // only 4 and 5 satisfy the filter predicate

 

4. Map over the RDD

 

scala> var mappedRDD = rdd.map(_ * 2)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[2] at map at <console>:14

 

5. Run collect on mappedRDD

 

scala> mappedRDD.collect();
res1: Array[Int] = Array(2, 4, 6, 8, 10)

 

6. Chain transformations on the RDD in a functional style

 

scala> sc.parallelize(List(1,2,3,4,5)).map(_ * 2).filter(_ >=4 ).collect();
res2: Array[Int] = Array(4, 6, 8, 10)

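All of the calls above except collect() are lazy transformations; Spark does not compute anything until an action runs. A small sketch, not from the original session, that inspects the lineage without triggering a job:

// build a pipeline of transformations; no job is launched yet
val pipeline = sc.parallelize(List(1, 2, 3, 4, 5)).map(_ * 2).filter(_ >= 4)

// print the RDD lineage; still no computation
println(pipeline.toDebugString)

// only an action such as collect() or count() actually runs a job
pipeline.collect()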
 

Reading/Writing HDFS and Caching

1. Upload README.md from the Spark installation directory to HDFS

 

hdfs dfs -put /home/hadoop/software/spark-1.2.0-bin-hadoop2.4/README.md /user/hadoop

 

View the first 10 lines of README.md:

 

hdfs dfs -cat /user/hadoop/README.md | head -10

 

The output:

 

# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and structured
data processing, MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.

<http://spark.apache.org/>

 

2. Read README.md from HDFS with Spark

 

scala> val rdd = sc.textFile("README.md");
rdd: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[9] at textFile at <console>:12

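The bare path README.md is resolved against the current user's HDFS home directory, /user/hadoop in this setup. A fully qualified URI works as well; a sketch in which the namenode host and port are placeholders for your own cluster:

// equivalent read with an absolute HDFS URI (host and port are hypothetical)
val rddAbs = sc.textFile("hdfs://namenode:8020/user/hadoop/README.md")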
 

3. Count the lines

 

scala> rdd.count();
15/01/02 03:52:57 INFO scheduler.DAGScheduler: Job 4 finished: count at <console>:15, took 0.396747 s
res4: Long = 98

 

So README.md has 98 lines, and counting them took 0.396747 seconds. The following command confirms that README.md indeed has 98 lines:

 

[hadoop@hadoop spark-1.2.0-bin-hadoop2.4]$ cat /home/hadoop/software/spark-1.2.0-bin-hadoop2.4/README.md | wc -l
98

 

4. RDD caching

 

Building on step 3, run rdd.count() again:

 

scala> rdd.count();
15/01/02 03:52:57 INFO scheduler.DAGScheduler: Job 4 finished: count at <console>:15, took 0.051862 s
res4: Long = 98

This run took about 0.05 seconds, far less than the 0.39 seconds of the first count. Note that this is not because Spark cached the RDD automatically: transformations are lazy, and an RDD that has not been cached is recomputed from HDFS on every action, so the speedup here is more likely due to JVM warm-up and the operating system's file cache on the second read.

To keep the RDD in memory explicitly across actions, mark it with cache():

 

scala> rdd.cache();
res9: rdd.type = README.md MappedRDD[9] at textFile at <console>:12

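cache() only marks the RDD for caching; the data is materialized in memory the next time an action runs. A minimal sketch, not from the original session, of how an explicit cache pays off across repeated actions (timings will vary):

val cached = sc.textFile("README.md").cache()  // mark for caching; nothing is read yet
cached.count()   // first action: reads from HDFS and populates the cache
cached.count()   // later actions are served from the in-memory cache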
 

 

5. Write the data in rdd (the contents of README.md) back to HDFS

 

scala> rdd.saveAsTextFile("RDD");

 

Note that RDD here is a directory under /user/hadoop on HDFS; the actual content is stored in the part-00000 file:

 

[hadoop@hadoop spark-1.2.0-bin-hadoop2.4]$ hdfs dfs -cat /user/hadoop/RDD/part-00000 | wc -l
98

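saveAsTextFile writes one part-NNNNN file per partition of the RDD, so an RDD with several partitions produces several part files. If a single output file is preferred, the RDD can be coalesced to one partition first; a sketch, with SingleFileRDD as an example output path:

// reduce to a single partition so that only part-00000 is written
rdd.coalesce(1).saveAsTextFile("SingleFileRDD")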
 

Word Count

1. Count the word frequencies of README.md on HDFS (word count)

 

scala> val wordCountRDD = rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_);
wordCountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[13] at reduceByKey at <console>:14

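The one-liner chains three steps; the same pipeline written out step by step, with intermediate names added only for readability:

val words  = rdd.flatMap(_.split(" "))     // split every line into words
val pairs  = words.map(word => (word, 1))  // pair each word with an initial count of 1
val counts = pairs.reduceByKey(_ + _)      // sum the counts for each word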
 

2. Collect the result

 

scala> wordCountRDD.collect();

// result (truncated):
res11: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), (page](http://spark.apache.org/documentation.html),1), (Once,1), (application,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (HDFS,1), (Versions,1), (Data.,1),....

 

3. Write the word counts to HDFS

 

 

scala> wordCountRDD.saveAsTextFile("WordCountRDD");

 

Check the data on HDFS:

 

 

[hadoop@hadoop spark-1.2.0-bin-hadoop2.4]$ hdfs dfs -cat /user/hadoop/WordCountRDD/part-00000 | head -10
(package,1)
(For,2)
(processing.,1)
(Programs,1)
(Because,1)
(The,1)
(cluster.,1)
(its,1)
([run,1)
(APIs,1)

 

 

 

Word Count with Sorting by Frequency

1. A single statement performs both the word count and the sort by frequency (an alternative using sortBy is sketched right after it)

 

 

scala> rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2,x._1)).saveAsTextFile("SortedWordCountRDD")

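The two map calls around sortByKey swap (word, count) to (count, word) and back so that the count becomes the sort key. The RDD API also provides sortBy (available in the Spark 1.2.0 used here), which expresses the same thing without the swap; a sketch, with SortedWordCountRDD2 as an example output path:

rdd.flatMap(_.split(" "))
   .map((_, 1))
   .reduceByKey(_ + _)
   .sortBy(_._2, ascending = false)   // sort directly on the count, descending
   .saveAsTextFile("SortedWordCountRDD2")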
 

 

2. Check the data on HDFS:

 

 

[hadoop@hadoop spark-1.2.0-bin-hadoop2.4]$ hdfs dfs -cat /user/hadoop/SortedWordCountRDD/part-00000 | head -10
(,66)
(the,21)
(Spark,15)
(to,14)
(for,11)
(and,10)
(a,9)
(##,8)
(run,7)
(is,6)


[hadoop@hadoop spark-1.2.0-bin-hadoop2.4]$ hdfs dfs -cat /user/hadoop/SortedWordCountRDD/part-00000 | tail -10
(core,1)
(web,1)
("local[N]",1)
(package.),1)
(MLlib,1)
(["Building,1)
(Scala,,1)
(mvn,1)
(./dev/run-tests,1)
(sample,1)

 

As expected, the output is sorted in descending order of word frequency. The ordering comes from the explicit sortByKey(false) call in the pipeline, not from any automatic sorting by Spark.
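The (,66) entry at the top is the empty string: String.split(" ") yields empty tokens for blank lines and runs of consecutive spaces. If those should be excluded, a filter can be inserted before counting; a sketch, with FilteredWordCountRDD as an example output path:

rdd.flatMap(_.split(" "))
   .filter(_.nonEmpty)                // drop empty tokens from blank lines and repeated spaces
   .map((_, 1))
   .reduceByKey(_ + _)
   .sortBy(_._2, ascending = false)
   .saveAsTextFile("FilteredWordCountRDD")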