Spark RDD Actions Explained --- Spark Study Notes 8

This article introduces Spark's RDD action functions, walking through reduce, collect, count, take, first, takeSample, saveAsTextFile, saveAsSequenceFile, countByKey and foreach with concrete examples of how they are used and what they return.


Continuing from the previous post, Spark RDD Transformation.

This time we look at the action functions of Spark RDDs.

The examples below use kmeans_data.txt from the Spark root directory as the data source.

reduce

Similar to reduceByKey among the transformations, except there is no grouping by key here: the function simply takes two arguments and aggregates the whole dataset into a single value.

val kemeans_data = sc.textFile("kmeans_data.txt")
kemeans_data.flatMap(_.split(" ")).map(_.toDouble).reduce(_+_)
res67: Double = 82.80000000000001
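The same reduce can also be written with an explicit two-argument function instead of the underscore shorthand; a small sketch, assuming the kemeans_data RDD defined above:

// reduce folds the whole dataset with a two-argument function.
// The function should be commutative and associative, since partial results
// from different partitions may be combined in any order.
val points = kemeans_data.flatMap(_.split(" ")).map(_.toDouble)
val total = points.reduce((a, b) => a + b)   // same result as above, 82.8...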

collect

Brings the entire dataset back to the driver as an array:
scala> kemeans_data.collect()
14/04/04 17:17:38 INFO spark.SparkContext: Starting job: collect at <console>:15
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Got job 35 (collect at <console>:15) with 1 output partitions (allowLocal=false)
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Final stage: Stage 46 (collect at <console>:15)
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Submitting Stage 46 (MappedRDD[93] at textFile at <console>:12), which has no missing parents
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 46 (MappedRDD[93] at textFile at <console>:12)
14/04/04 17:17:38 INFO local.LocalTaskSetManager: Size of task 40 is 1656 bytes
14/04/04 17:17:38 INFO executor.Executor: Running task ID 40
14/04/04 17:17:38 INFO storage.BlockManager: Found block broadcast_2 locally
14/04/04 17:17:38 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/kmeans_data.txt:0+72
14/04/04 17:17:38 INFO executor.Executor: Serialized size of result for 40 is 519
14/04/04 17:17:38 INFO executor.Executor: Sending result for 40 directly to driver
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Completed ResultTask(46, 0)
14/04/04 17:17:38 INFO local.LocalScheduler: Remove TaskSet 46.0 from pool 
14/04/04 17:17:38 INFO executor.Executor: Finished task ID 40
14/04/04 17:17:38 INFO scheduler.DAGScheduler: Stage 46 (collect at <console>:15) finished in 0.005 s
14/04/04 17:17:38 INFO spark.SparkContext: Job finished: collect at <console>:15, took 0.008528283 s
res69: Array[String] = Array(0.0 0.0 0.0, 0.1 0.1 0.1, 0.2 0.2 0.2, 9.0 9.0 9.0, 9.1 9.1 9.1, 9.2 9.2 9.2)
As you can see clearly here, each line of the file becomes one element of the array.

Don't do this on a large dataset; collect is normally used on fairly small datasets, where "large" and "small" are relative to your machine.

count

Everyone on Earth knows this one... it returns the number of elements in the dataset.
kemeans_data.count()
res70: Long = 6

take

As the name suggests, it takes the first n elements of the dataset.
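The splitedLine RDD used below carries over from the previous Transformation post, where each line of CHANGES.txt was split into words. A rough reconstruction of how it was built (the exact code in the earlier post may differ, so treat this as an assumption):

// Assumed definition of splitedLine: each line of CHANGES.txt split on spaces,
// giving an RDD[Array[String]], so take(1) returns Array[Array[String]].
val changes = sc.textFile("CHANGES.txt")
val splitedLine = changes.map(_.split(" "))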
scala> splitedLine.take(1)
14/04/04 11:14:05 INFO spark.SparkContext: Starting job: take at <console>:19
14/04/04 11:14:05 INFO scheduler.DAGScheduler: Got job 6 (take at <console>:19) with 1 output partitions (allowLocal=true)
14/04/04 11:14:05 INFO scheduler.DAGScheduler: Final stage: Stage 6 (take at <console>:19)
14/04/04 11:14:05 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 11:14:05 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 11:14:05 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/04/04 11:14:05 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/CHANGES.txt:0+65951
14/04/04 11:14:05 INFO spark.SparkContext: Job finished: take at <console>:19, took 0.004364898 s
res12: Array[Array[java.lang.String]] = Array(Array("", "", Merge, pull, request, #248, from, colorant/branch-0.8))

You can then index into the returned array to get individual elements:
scala> splitedLine.take(1)(0)
14/04/04 11:14:10 INFO spark.SparkContext: Starting job: take at <console>:19
14/04/04 11:14:10 INFO scheduler.DAGScheduler: Got job 7 (take at <console>:19) with 1 output partitions (allowLocal=true)
14/04/04 11:14:10 INFO scheduler.DAGScheduler: Final stage: Stage 7 (take at <console>:19)
14/04/04 11:14:10 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 11:14:10 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 11:14:10 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/04/04 11:14:10 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/CHANGES.txt:0+65951
14/04/04 11:14:10 INFO spark.SparkContext: Job finished: take at <console>:19, took 0.004808966 s
res13: Array[java.lang.String] = Array("", "", Merge, pull, request, #248, from, colorant/branch-0.8)

scala> splitedLine.take(1)(0)(0)
14/04/04 11:14:15 INFO spark.SparkContext: Starting job: take at <console>:19
14/04/04 11:14:15 INFO scheduler.DAGScheduler: Got job 8 (take at <console>:19) with 1 output partitions (allowLocal=true)
14/04/04 11:14:15 INFO scheduler.DAGScheduler: Final stage: Stage 8 (take at <console>:19)
14/04/04 11:14:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 11:14:15 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 11:14:15 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/04/04 11:14:15 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/CHANGES.txt:0+65951
14/04/04 11:14:15 INFO spark.SparkContext: Job finished: take at <console>:19, took 0.005684334 s
res14: java.lang.String = ""

first

Returns the first element of the dataset.

 kemeans_data.first()

res0: String = 0.0 0.0 0.0

takeSample

def takeSample(withReplacement: Boolean, num: Int, seed: Int): Array[T]
Return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed.

Simply put: give it a random seed, take num elements, and say whether to sample with replacement.
scala> kemeans_data.takeSample(false,2,3)
14/04/04 17:28:49 INFO spark.SparkContext: Starting job: takeSample at <console>:15
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Got job 14 (takeSample at <console>:15) with 1 output partitions (allowLocal=false)
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Final stage: Stage 14 (takeSample at <console>:15)
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Submitting Stage 14 (MappedRDD[1] at textFile at <console>:12), which has no missing parents
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 14 (MappedRDD[1] at textFile at <console>:12)
14/04/04 17:28:49 INFO local.LocalTaskSetManager: Size of task 13 is 1645 bytes
14/04/04 17:28:49 INFO executor.Executor: Running task ID 13
14/04/04 17:28:49 INFO storage.BlockManager: Found block broadcast_0 locally
14/04/04 17:28:49 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/kmeans_data.txt:0+72
14/04/04 17:28:49 INFO executor.Executor: Serialized size of result for 13 is 473
14/04/04 17:28:49 INFO executor.Executor: Sending result for 13 directly to driver
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Completed ResultTask(14, 0)
14/04/04 17:28:49 INFO local.LocalScheduler: Remove TaskSet 14.0 from pool 
14/04/04 17:28:49 INFO executor.Executor: Finished task ID 13
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Stage 14 (takeSample at <console>:15) finished in 0.008 s
14/04/04 17:28:49 INFO spark.SparkContext: Job finished: takeSample at <console>:15, took 0.012453567 s
14/04/04 17:28:49 INFO spark.SparkContext: Starting job: takeSample at <console>:15
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Got job 15 (takeSample at <console>:15) with 1 output partitions (allowLocal=false)
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Final stage: Stage 15 (takeSample at <console>:15)
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Submitting Stage 15 (SampledRDD[7] at takeSample at <console>:15), which has no missing parents
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 15 (SampledRDD[7] at takeSample at <console>:15)
14/04/04 17:28:49 INFO local.LocalTaskSetManager: Size of task 14 is 1850 bytes
14/04/04 17:28:49 INFO executor.Executor: Running task ID 14
14/04/04 17:28:49 INFO storage.BlockManager: Found block broadcast_0 locally
14/04/04 17:28:49 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/kmeans_data.txt:0+72
14/04/04 17:28:49 INFO executor.Executor: Serialized size of result for 14 is 519
14/04/04 17:28:49 INFO executor.Executor: Sending result for 14 directly to driver
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Completed ResultTask(15, 0)
14/04/04 17:28:49 INFO local.LocalScheduler: Remove TaskSet 15.0 from pool 
14/04/04 17:28:49 INFO executor.Executor: Finished task ID 14
14/04/04 17:28:49 INFO scheduler.DAGScheduler: Stage 15 (takeSample at <console>:15) finished in 0.006 s
14/04/04 17:28:49 INFO spark.SparkContext: Job finished: takeSample at <console>:15, took 0.009247992 s
res9: Array[String] = Array(9.0 9.0 9.0, 0.2 0.2 0.2)
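The seed is what makes the sample reproducible: calling takeSample again with the same arguments should return the same elements, while a different seed will usually pick different ones (a quick sketch):

// Same withReplacement / num / seed -> same sample as above.
kemeans_data.takeSample(false, 2, 3)
// A different seed normally yields a different sample.
kemeans_data.takeSample(false, 2, 7)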

saveAsTextFile

Saves the result to files. Here we keep only the values greater than 0.2 and save them into the shengli.result directory:
scala> kemeans_data.flatMap(_.split(" ")).map(_.toDouble).filter(_>0.2).saveAsTextFile("shengli.result")
14/04/04 17:31:24 INFO rdd.PairRDDFunctions: Saving as hadoop file of type (NullWritable, Text)
14/04/04 17:31:24 INFO spark.SparkContext: Starting job: saveAsTextFile at <console>:15
14/04/04 17:31:24 INFO scheduler.DAGScheduler: Got job 18 (saveAsTextFile at <console>:15) with 1 output partitions (allowLocal=false)
14/04/04 17:31:24 INFO scheduler.DAGScheduler: Final stage: Stage 18 (saveAsTextFile at <console>:15)
14/04/04 17:31:24 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 17:31:24 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 17:31:24 INFO scheduler.DAGScheduler: Submitting Stage 18 (MappedRDD[17] at saveAsTextFile at <console>:15), which has no missing parents
14/04/04 17:31:25 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 18 (MappedRDD[17] at saveAsTextFile at <console>:15)
14/04/04 17:31:25 INFO local.LocalTaskSetManager: Size of task 17 is 4805 bytes
14/04/04 17:31:25 INFO executor.Executor: Running task ID 17
14/04/04 17:31:25 INFO storage.BlockManager: Found block broadcast_0 locally
14/04/04 17:31:25 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/kmeans_data.txt:0+72
14/04/04 17:31:25 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_201404041731_0000_m_000000_17' to file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/shengli.result
14/04/04 17:31:25 INFO mapred.SparkHadoopWriter: attempt_201404041731_0000_m_000000_17: Committed
14/04/04 17:31:25 INFO executor.Executor: Serialized size of result for 17 is 435
14/04/04 17:31:25 INFO executor.Executor: Sending result for 17 directly to driver
14/04/04 17:31:25 INFO scheduler.DAGScheduler: Completed ResultTask(18, 0)
14/04/04 17:31:25 INFO local.LocalScheduler: Remove TaskSet 18.0 from pool 
14/04/04 17:31:25 INFO executor.Executor: Finished task ID 17
14/04/04 17:31:25 INFO scheduler.DAGScheduler: Stage 18 (saveAsTextFile at <console>:15) finished in 0.080 s
14/04/04 17:31:25 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:15, took 0.106020906 s

A directory named shengli.result has been created under the root directory, containing the result of this job:
cd shengli.result/
ll
-rw-r--r--  1 root root    8 04-04 17:31 ._SUCCESS.crc
-rwxrwxrwx  1 root root    0 04-04 17:31 _SUCCESS
-rw-r--r--  1 root root   12 04-04 17:31 .part-00000.crc
-rwxrwxrwx  1 root root   36 04-04 17:31 part-00000
drwxr-xr-x  2 root root 4096 04-04 17:31 .
drwxr-xr-x 25 1000 1000 4096 04-04 17:33 ..

more part-00000 
9.0
9.0
9.0
9.1
9.1
9.1
9.2
9.2
9.2
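The saved directory can be read back later with sc.textFile, which picks up every part-XXXXX file inside it; a quick sketch:

// Each line of the part files becomes one String element again.
val saved = sc.textFile("shengli.result")
saved.map(_.toDouble).count()   // should be 9, one per value written above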

saveAsSequenceFile

Only available on RDDs of key-value pairs: the dataset is written out as a Hadoop SequenceFile. Here each word is first mapped to (word, 1) and then saved to the shengli.seq directory:
scala> kemeans_data.flatMap(_.split(" ")).map(word=>(word,1)).saveAsSequenceFile("shengli.seq")
14/04/04 17:39:56 INFO rdd.SequenceFileRDDFunctions: Saving as sequence file of type (Text,IntWritable)
14/04/04 17:39:56 INFO rdd.PairRDDFunctions: Saving as hadoop file of type (Text, IntWritable)
14/04/04 17:39:56 INFO spark.SparkContext: Starting job: saveAsSequenceFile at <console>:15
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Got job 19 (saveAsSequenceFile at <console>:15) with 1 output partitions (allowLocal=false)
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Final stage: Stage 19 (saveAsSequenceFile at <console>:15)
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Submitting Stage 19 (MappedRDD[20] at saveAsSequenceFile at <console>:15), which has no missing parents
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 19 (MappedRDD[20] at saveAsSequenceFile at <console>:15)
14/04/04 17:39:56 INFO local.LocalTaskSetManager: Size of task 18 is 4837 bytes
14/04/04 17:39:56 INFO executor.Executor: Running task ID 18
14/04/04 17:39:56 INFO storage.BlockManager: Found block broadcast_0 locally
14/04/04 17:39:56 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/kmeans_data.txt:0+72
14/04/04 17:39:56 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_201404041739_0000_m_000000_18' to file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/shengli.seq
14/04/04 17:39:56 INFO mapred.SparkHadoopWriter: attempt_201404041739_0000_m_000000_18: Committed
14/04/04 17:39:56 INFO executor.Executor: Serialized size of result for 18 is 435
14/04/04 17:39:56 INFO executor.Executor: Sending result for 18 directly to driver
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Completed ResultTask(19, 0)
14/04/04 17:39:56 INFO local.LocalScheduler: Remove TaskSet 19.0 from pool 
14/04/04 17:39:56 INFO executor.Executor: Finished task ID 18
14/04/04 17:39:56 INFO scheduler.DAGScheduler: Stage 19 (saveAsSequenceFile at <console>:15) finished in 0.112 s
14/04/04 17:39:56 INFO spark.SparkContext: Job finished: saveAsSequenceFile at <console>:15, took 0.199192684 s

Checking the result file confirms that it is indeed a SequenceFile:
# cd shengli.seq/
# ll
-rw-r--r--  1 root root    8 04-04 17:39 ._SUCCESS.crc
-rwxrwxrwx  1 root root    0 04-04 17:39 _SUCCESS
-rw-r--r--  1 root root   12 04-04 17:39 .part-00000.crc
-rwxrwxrwx  1 root root  373 04-04 17:39 part-00000
drwxr-xr-x 26 1000 1000 4096 04-04 17:39 ..
drwxr-xr-x  2 root root 4096 04-04 17:39 .

# cat part-00000 
SEQorg.apache.hadoop.io.Text org.apache.hadoop.io.IntWritable异妈
"0.00.00.00.10.10.10.20.20.29.09.09.09.19.19.19.29.29.2

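Because the file is binary, the cleaner way to inspect it is to load it back through Spark rather than cat; a sketch, assuming the built-in Writable conversions for String and Int:

// Read the SequenceFile back; Text/IntWritable are converted to String/Int.
val seq = sc.sequenceFile[String, Int]("shengli.seq")
seq.collect()   // should give Array((0.0,1), (0.0,1), ..., (9.2,1))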
countByKey

Easy to understand: it counts how many times each key appears.
Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key.
scala> kemeans_data.flatMap(_.split(" ")).map(word=>(word,1)).countByKey()
14/04/04 17:44:02 INFO spark.SparkContext: Starting job: countByKey at <console>:15
14/04/04 17:44:02 INFO scheduler.DAGScheduler: Got job 20 (countByKey at <console>:15) with 1 output partitions (allowLocal=false)
14/04/04 17:44:02 INFO scheduler.DAGScheduler: Final stage: Stage 20 (countByKey at <console>:15)
14/04/04 17:44:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/04 17:44:02 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/04 17:44:02 INFO scheduler.DAGScheduler: Submitting Stage 20 (MapPartitionsRDD[24] at countByKey at <console>:15), which has no missing parents
14/04/04 17:44:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 20 (MapPartitionsRDD[24] at countByKey at <console>:15)
14/04/04 17:44:02 INFO local.LocalTaskSetManager: Size of task 19 is 1995 bytes
14/04/04 17:44:02 INFO executor.Executor: Running task ID 19
14/04/04 17:44:02 INFO storage.BlockManager: Found block broadcast_0 locally
14/04/04 17:44:03 INFO rdd.HadoopRDD: Input split: file:/app/home/hadoop/shengli/spark-0.8.1-incubating-bin-hadoop1/kmeans_data.txt:0+72
14/04/04 17:44:03 INFO executor.Executor: Serialized size of result for 19 is 823
14/04/04 17:44:03 INFO executor.Executor: Sending result for 19 directly to driver
14/04/04 17:44:03 INFO scheduler.DAGScheduler: Completed ResultTask(20, 0)
14/04/04 17:44:03 INFO local.LocalScheduler: Remove TaskSet 20.0 from pool 
14/04/04 17:44:03 INFO executor.Executor: Finished task ID 19
14/04/04 17:44:03 INFO scheduler.DAGScheduler: Stage 20 (countByKey at <console>:15) finished in 0.024 s
14/04/04 17:44:03 INFO spark.SparkContext: Job finished: countByKey at <console>:15, took 0.073272783 s
res15: scala.collection.Map[java.lang.String,Long] = Map(9.0 -> 3, 9.1 -> 3, 0.0 -> 3, 0.2 -> 3, 0.1 -> 3, 9.2 -> 3)

foreach

This is one of the most commonly used actions: it iterates over the dataset and applies func to each element.
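A minimal sketch of foreach on the same dataset; note that println runs on the executors, so in local mode the output shows up in the console, while on a cluster it lands in the executor logs rather than on the driver:

// Apply a side-effecting function to every element; foreach returns Unit.
kemeans_data.foreach(line => println(line))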

Original content; please credit the source when reposting: http://blog.youkuaiyun.com/oopsoom/article/details/22948207. Thanks.
