RDD Programming
1. Overview
RDD is short for resilient distributed dataset. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
1.1 Parallelized Collections
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
Scala:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Java:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
Python:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Once created, the distributed dataset can be operated on in parallel. For example, we can call distData.reduce((a, b) => a + b) to add up the elements of the array. One important parameter for parallelized collections is the number of partitions to cut the dataset into. Spark runs one task for each partition of the cluster. Typically you want 2 to 4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster, but you can also set it manually by passing it as a second argument to parallelize, e.g. sc.parallelize(data, 10). Note: some places in the code use the term slice (a synonym for partition) to maintain backward compatibility.
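As a minimal sketch (the variable name distData10 is illustrative), requesting an explicit partition count and checking it:
val data = Array(1, 2, 3, 4, 5)
val distData10 = sc.parallelize(data, 10) // ask for 10 partitions explicitly
println(distData10.getNumPartitions)      // reports the partition count (10 here)
distData10.reduce((a, b) => a + b)        // sums the elements in parallel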
1.2 External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, Amazon S3, and so on. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text-file RDDs can be created using SparkContext's textFile method. This method takes the path of the file as its argument and reads the file as a collection of lines.
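For example, a minimal sketch (the file name is illustrative):
val distFile = sc.textFile("data.txt") // RDD of the file's lines, one element per line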
Some notes on reading files with Spark:
(1) If using a path on the local filesystem, the file must be accessible at the same path on all worker nodes. Either copy the file to every node or use a network-mounted shared file system.
(2) All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example:
textFile("/my/directory"), // 一般格式的文件的路径
textFile("/my/directory/*.txt"), // 带有通配符
textFile("/my/directory/*.gz"). // 压缩格式的文件
(3) The textFile method also takes an optional second argument that controls the number of partitions of the file. By default, Spark creates one partition for each block of the file, but you can ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
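As a sketch (the file name and partition count are illustrative):
val distFile = sc.textFile("data.txt", 10) // request at least 10 partitions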
Apart from text files, Spark’s Scala API also supports several other data formats:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, wholeTextFiles provides an optional second argument for controlling the minimal number of partitions.
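A sketch of the difference (the directory is illustrative):
val perFile = sc.wholeTextFiles("/my/directory") // RDD of (filename, entire file content) pairs
val perLine = sc.textFile("/my/directory")       // RDD of Strings, one record per line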
For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
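For instance, a sketch assuming a SequenceFile with IntWritable keys and Text values (the path is illustrative):
val pairs = sc.sequenceFile[Int, String]("/my/sequence-file") // Int/String are read as IntWritable/Text automatically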
For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).
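A sketch of the new-API variant, assuming Hadoop's TextInputFormat and an illustrative input path:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///my/input") // same key a Hadoop job would use
val records = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val lines = records.map { case (_, text) => text.toString } // drop the byte-offset key, convert Text to String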
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
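A sketch of the round trip (the output path is illustrative):
val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("/tmp/nums-object")             // write serialized Java objects
val restored = sc.objectFile[Int]("/tmp/nums-object") // read them back as an RDD of Int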
2. RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
All transformations in Spark are lazy: the engine does not compute their results right away. Instead, Spark just remembers the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program. This design is an important reason why Spark runs so efficiently.
For example, a dataset created through map will later be used in a reduce, and Spark only returns the final result of the reduce to the driver program rather than the much larger mapped dataset.
By default, each transformed RDD may be recomputed every time you run an action on it. To avoid this, you can use the persist (or cache) method to keep the RDD in memory, in which case Spark keeps the data around on the cluster so that the next query over it is much faster. Spark also supports persisting RDDs to disk, or replicating them across multiple nodes.
2.1 Basic Operations
val lines = sc.textFile("data.txt") // load an external file
val lineLengths = lines.map(s => s.length) // defines lineLengths as the result of a map transformation
val totalLength = lineLengths.reduce((a, b) => a + b)
Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines; each machine runs both its part of the map and a local reduction, and only returns its answer to the driver program.
If we also wanted to use lineLengths again later, we could add the following before the reduce to persist it, which would cause lineLengths to be saved in memory after the first time it is computed:
lineLengths.persist() // persist lineLengths in memory
2.2 Passing Functions to Spark
2.3 Understanding closures
(1) Example
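The discussion below refers to a naive attempt to sum the elements of an RDD through a driver-side variable. A sketch of that pattern (reconstructed here, since the original snippet is not shown) looks like this:
var counter = 0
var rdd = sc.parallelize(data)

// Wrong: don't do this!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)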
(2) Local vs. cluster modes
The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it’s no longer the counter on the driver node. There is still a counter in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero since all operations on counter were referencing the value within the serialized closure.
In local mode, in some circumstances, the foreach function will actually execute within the same JVM as the driver and will reference the same original counter, and may actually update it.
To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator. Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
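As a sketch of the accumulator-based alternative (assuming the Spark 2.x longAccumulator API):
val accum = sc.longAccumulator("counter") // driver-visible accumulator
rdd.foreach(x => accum.add(x))            // safe to update from tasks running on executors
println("Counter value: " + accum.value)  // read the aggregated value on the driver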
(3) Printing elements of an RDD
2.4 Working with Key-Value Pairs
2.5 Transformations
2.6 Actions
2.7 Shuffle Operations
(1) Background
(2) Performance Impact
2.8 RDD Persistence
(1) Choosing a Storage Level
(2) Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
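A sketch of manual removal (the file name is illustrative):
val cached = sc.textFile("data.txt").cache() // mark the RDD for in-memory caching
cached.count()                               // first action materializes the cache
cached.unpersist()                           // drop it without waiting for LRU eviction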