RDD Operation Types
| Name | Description |
| --- | --- |
| transformation | Creates a new dataset from an existing one. Lazily evaluated. |
| action | Returns a value to the driver program after running a computation on the dataset. Eagerly evaluated. |
| persist (cache) | Persists or caches an RDD. Lazily evaluated. |
Understanding closures
One of the harder things about Spark is understanding the scope and life cycle of variables and methods when code is executed across a cluster. RDD operations that modify variables outside of their scope are a frequent source of confusion. In the example below, foreach() is used to increment a counter, but similar issues can occur with other operations as well.
For example:
Consider the naive RDD element sum below, which may behave differently depending on whether execution happens within the same JVM. A common case is running Spark in local mode (--master = local[n]) versus deploying the Spark application to a cluster.
var counter = 0
var rdd = sc.parallelize(data)
// Wrong: Don't do this!!
rdd.foreach(x => counter += x)
println("Counter value: " + counter)
Local vs. cluster modes
The primary issue with the code above is that its behavior is not well defined. In local mode there is a single JVM, so the code sums the values within the RDD and stores the result in counter. This works because the RDD and the counter variable live in the same memory on the driver node.
In cluster mode, however, the situation is more complicated and the code above will not behave as expected. To execute jobs, Spark breaks the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task's closure. The closure is the set of variables and methods that must be visible to the executor to perform its computations on the RDD (in this case, foreach()). This closure is serialized and sent to each executor. In local mode there is only a single executor, so everything shares the same closure. In other modes this is not the case: the executors each run their own copy of the closure.
What happens here is that the variables within the closure sent to each machine are copies. When counter is referenced inside foreach, it is no longer the counter on the driver node. There is still a counter in the driver node's memory, but it is not visible to the executors; they only see the copy from the serialized closure. As a result, the final value of counter remains 0, since every operation on counter was applied to the value inside the serialized closure.
To get well-defined behavior in situations like this, an Accumulator should be used. Accumulators in Spark exist specifically to provide a mechanism for safely updating a variable when execution is split across worker nodes in a cluster.
In general, closures - constructs like loops or locally defined methods - should not be used to mutate global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Code that does this may happen to work in local mode, but that is only by accident, and it will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
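To make this concrete, here is a minimal sketch of the accumulator-based version of the counter example above, assuming Spark 2.x's longAccumulator API and that data is a collection of integers already defined on the driver (as in the earlier snippet):
// Minimal sketch: summing RDD elements with an accumulator instead of a
// captured local variable. Assumes Spark 2.x (sc.longAccumulator) and that
// `data` is a Seq[Int] defined on the driver.
val accum = sc.longAccumulator("counter")
val rdd = sc.parallelize(data)
rdd.foreach(x => accum.add(x))   // executor-side updates are merged back on the driver
println("Counter value: " + accum.value)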
Printing elements of an RDD
Another common idiom is attempting to print the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine this produces the expected output and prints all the RDD's elements. In cluster mode, however, the stdout being written to is the executors' stdout, not the driver's, so nothing shows up on the driver. To print all elements on the driver, you can first bring the RDD to the driver node with collect(): rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements, a safer approach is to use take():
rdd.take(100).foreach(println)
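For reference, the two driver-side options described above side by side (rdd stands for any existing RDD; collect() is only safe when the data fits in driver memory):
// collect() pulls the entire RDD back to the driver -- fine for small results,
// but it can exhaust driver memory on large datasets.
rdd.collect().foreach(println)
// take(n) only fetches the first n elements, the safer choice when you just
// want to inspect a few records.
rdd.take(100).foreach(println)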
Working with Key-Value Pairs
While most Spark operations work on RDDs containing objects of any type, a few special operations are only available on RDDs of key-value pairs. The most common are distributed "shuffle" operations, such as grouping or aggregating the elements by a key.
In Scala, these operations are automatically available on RDDs containing Tuple2 objects. The key-value pair operations are provided by the PairRDDFunctions class, which automatically wraps an RDD of tuples.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line occurs in a text file:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
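Continuing the word-count snippet above, a short sketch of what that looks like (assuming the counts RDD from the previous example):
// Sort the (line, count) pairs alphabetically by key, then bring the result
// back to the driver as an Array[(String, Int)].
val sortedCounts: Array[(String, Int)] = counts.sortByKey().collect()
sortedCounts.foreach { case (line, n) => println(s"$line -> $n") }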
Note: when using custom objects as keys in pair operations, you must make sure that equals() and hashCode() are properly implemented.
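As an illustration of that note, here is a hypothetical sketch using a custom key type. In Scala, a case class provides structural equals() and hashCode() automatically, so it is safe to use as a key; the WordKey class and the sample data below are made up for this example:
// Case classes compare by value, which is what key-based shuffle operations
// (reduceByKey, groupByKey, ...) rely on.
case class WordKey(word: String, lang: String)
val customPairs = sc.parallelize(Seq(
  (WordKey("spark", "en"), 1),
  (WordKey("spark", "en"), 1),
  (WordKey("etincelle", "fr"), 1)
))
val keyCounts = customPairs.reduceByKey(_ + _)   // the two identical keys collapse into one entry
keyCounts.collect().foreach(println)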
Transformations and Actions
| Type | Function | Description | Notes |
| --- | --- | --- | --- |
| transformation | map(func) | Returns a new distributed dataset formed by passing each element of the source through the function func. | |
| | flatMap(func) | Similar to map, but the results are flattened. For example, with line1 = "aa bb cc dd" and line2 = "bb aa ee ff", splitting each line with map gives ((aa,bb,cc,dd),(bb,aa,ee,ff)), while flatMap gives (aa,bb,cc,dd,bb,aa,ee,ff). See the sketch after this table. | |
| | filter(func) | Returns a new dataset formed by selecting the elements of the source on which func returns true. | |
| | mapPartitions(func) | Similar to map, but runs separately on each partition of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. | |
| | mapPartitionsWithIndex(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T. | Can produce an RDD of a different element type. |
| | sample(withReplacement, fraction, seed) | Samples the source dataset: withReplacement (Boolean) - false for sampling without replacement, true for sampling with replacement; fraction - the sampling fraction; seed - the random seed. | |
| | union(otherDataset) | Returns a new dataset containing the union of the elements in the source dataset and otherDataset. | |
| | intersection(otherDataset) | Returns a new dataset containing the intersection of the source dataset and otherDataset. | |
| | distinct([numTasks]) | Returns a new dataset containing the distinct elements of the source dataset. | |
| | groupByKey([numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: by default, the level of parallelism depends on the number of partitions of the parent RDD; an optional numTasks argument can be passed to set a different number of tasks. | |
| | reduceByKey(func, [numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. | |
| | aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. | Unlike reduceByKey, the returned value type can differ from the input value type (see the sketch after this table). |
| | sortByKey([ascending], [numTasks]) | Sorts an RDD of (K, V) pairs by key, in ascending or descending order as specified by the boolean ascending argument. | |
| | join(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin. | |
| | cogroup(otherDataset, [numTasks]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith. | |
| | cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). | |
| | pipe(command, [envVars]) | Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings. | |
| | coalesce(numPartitions) | Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. | Repartitions without a shuffle. |
| | repartition(numPartitions) | Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network. | Repartitions with a shuffle. |
| | repartitionAndSortWithinPartitions(partitioner) | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery. | |
| action | reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. | |
| | collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. | |
| | count() | Return the number of elements in the dataset. | |
| | first() | Return the first element of the dataset (similar to take(1)). | |
| | take(n) | Return an array with the first n elements of the dataset. | |
| | takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed. | |
| | takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator. | |
| | saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. | |
| | saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). | |
| | saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile(). | |
| | countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. | |
| | foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures above for more details. | |
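To illustrate a couple of the entries above (the flatMap example and the aggregateByKey note), here is a small sketch on in-memory data; the sample values and variable names are made up for this illustration:
// map emits one output element per input element (an Array[String] per line),
// while flatMap flattens those arrays into a single RDD of words.
val sampleLines = sc.parallelize(Seq("aa bb cc dd", "bb aa ee ff"))
val perLine = sampleLines.map(_.split(" "))      // RDD[Array[String]]
val words   = sampleLines.flatMap(_.split(" "))  // RDD[String]: aa, bb, cc, dd, bb, aa, ee, ff
// aggregateByKey lets the aggregated value type (here Set[String]) differ from
// the input value type (String): words are keyed by their first letter and
// collected into a distinct set per key.
val byFirstLetter = words.map(w => (w.take(1), w))
val distinctWords = byFirstLetter.aggregateByKey(Set.empty[String])(
  (set, w) => set + w,    // seqOp: fold one value into the per-partition set
  (s1, s2) => s1 ++ s2    // combOp: merge sets built on different partitions
)
distinctWords.collect().foreach(println)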