This article draws on the work of Zhen He.
69. take
Prototype
def take(num: Int): Array[T]
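Description
take returns the first num elements of the RDD as an array. The elements come back in their existing order; no sorting is performed. The underlying implementation is nonetheless fairly involved, because the elements are spread across partitions: Spark scans one partition first and only scans further partitions if it still needs more elements.
Example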
val a = sc.parallelize(1 to 10,2)
a.take(2)
res1: Array[Int] = Array(1, 2)
val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.take(2)
res2: Array[String] = Array(dog, cat)
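The partition-by-partition scan can be modelled with plain Scala collections. This is a minimal sketch, not Spark's actual implementation: it assumes an RDD can be represented as a sequence of partitions, and takeSketch is a hypothetical helper name. Spark itself increases the number of partitions it scans per round, which this sketch simplifies to one partition at a time.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical model: an RDD is a sequence of partitions, each an ordered sequence.
def takeSketch[T](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
  val buf = ArrayBuffer.empty[T]
  val it = partitions.iterator
  // Stop as soon as enough elements are collected; later partitions are never touched.
  while (buf.size < num && it.hasNext) {
    buf ++= it.next().take(num - buf.size)
  }
  buf.toList
}

// Same data as sc.parallelize(1 to 10, 2): two partitions of five elements each.
val parts = Seq(Seq(1, 2, 3, 4, 5), Seq(6, 7, 8, 9, 10))
val firstTwo = takeSketch(parts, 2)   // served entirely from the first partition
val firstSeven = takeSketch(parts, 7) // needs both partitions
```

Because the scan stops early, a small num usually touches only the first partition, which is why take is much cheaper than collect on a large RDD.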
70. takeOrdered
Prototype
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
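Description
takeOrdered returns the num smallest elements of the RDD, sorted according to the implicit Ordering. Supplying a custom Ordering changes the sort order.
Example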
val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.takeOrdered(2)
res19: Array[String] = Array(ape, cat)
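To illustrate the custom ordering, here is a small sketch on a plain Scala list rather than an RDD. takeOrderedSketch is a hypothetical stand-in that sorts everything and keeps the first num elements; it mirrors the result, not Spark's distributed execution. On a real RDD the ordering goes in the second parameter list, for example b.takeOrdered(1)(Ordering.by[String, Int](_.length).reverse).

```scala
// Hypothetical helper: sort with the given ordering, then keep the first num elements.
def takeOrderedSketch[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
  xs.sorted(ord).take(num)

val words = List("dog", "cat", "ape", "salmon", "gnu")

// Default ordering for String is alphabetical.
val alphabetical = takeOrderedSketch(words, 2)

// Custom ordering: longest string first.
val longestFirst = takeOrderedSketch(words, 1)(Ordering.by[String, Int](_.length).reverse)
```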
71. takeSample
Prototype
def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
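Description
takeSample returns a user-specified number of randomly sampled elements. It differs from sample in two ways: (1) takeSample returns a fixed number of elements, whereas the number of elements sample returns is not fixed; (2) takeSample returns an Array, whereas sample returns an RDD.
Example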
val x = sc.parallelize(1 to 100, 3)
x.takeSample(true, 10, 1)
res1: Array[Int] = Array(75, 17, 80, 48, 71, 12, 34, 3, 20, 17)
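The difference in result size can be sketched with plain Scala collections. Both helpers below are hypothetical models (takeSampleSketch and sampleSketch are not Spark functions), and the random numbers they draw will not match Spark's output for the same seed.

```scala
import scala.util.Random

// Hypothetical stand-in for takeSample: returns exactly num elements
// (or all of them when sampling without replacement from a smaller collection).
def takeSampleSketch[T](xs: IndexedSeq[T], withReplacement: Boolean, num: Int, seed: Long): Seq[T] = {
  val rng = new Random(seed)
  if (xs.isEmpty || num <= 0) Seq.empty
  else if (withReplacement) Seq.fill(num)(xs(rng.nextInt(xs.length)))
  else rng.shuffle(xs).take(num)
}

// Hypothetical stand-in for sample: keeps each element with probability `fraction`,
// so the size of the result varies from seed to seed.
def sampleSketch[T](xs: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  val rng = new Random(seed)
  xs.filter(_ => rng.nextDouble() < fraction)
}

val data = (1 to 100).toVector
val fixedSize = takeSampleSketch(data, withReplacement = true, num = 10, seed = 1L)
val noRepeats = takeSampleSketch(data, withReplacement = false, num = 10, seed = 1L)
val variableSize = sampleSketch(data, fraction = 0.1, seed = 1L)
```

With replacement, the same element may appear more than once, as 17 does in the output above; without replacement every returned element is a distinct position in the source.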
72. toDebugString
Prototype
def toDebugString: String
Description
toDebugString returns a description of the RDD and its recursive dependencies, which is useful for debugging.
Example
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.toDebugString
res1: String =
(3) MapPartitionsRDD[289] at subtract at <console>:34 []
 |  SubtractedRDD[288] at subtract at <console>:34 []
 +-(3) MapPartitionsRDD[286] at subtract at <console>:34 []
 |  |  ParallelCollectionRDD[284] at parallelize at <console>:30 []
 +-(3) MapPartitionsRDD[287] at subtract at <console>:34 []
    |  ParallelCollectionRDD[285] at parallelize at <console>:30 []
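73. top
Prototype
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
Description
top returns the num largest elements of the RDD, in descending order according to the implicit Ordering.
Example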
val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)
c.top(2)
res1: Array[Int] = Array(9, 8)
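Conceptually, top is takeOrdered with the ordering reversed. A sketch on a plain Scala collection, using a hypothetical helper topSketch (it reproduces the result, not Spark's distributed execution):

```scala
// Hypothetical helper: sort by the reversed ordering, then keep the first num elements.
def topSketch[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
  xs.sorted(ord.reverse).take(num)

val nums = Seq(6, 9, 4, 7, 5, 8)

// Natural ordering: the two largest values, largest first.
val largestTwo = topSketch(nums, 2)

// Passing an already reversed ordering flips the result to the two smallest values.
val smallestTwo = topSketch(nums, 2)(Ordering[Int].reverse)
```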
74. toString
Prototype
override def toString: String
Description
toString returns a human-readable description of the RDD.
Example
val randRDD = sc.parallelize(List( (7,"cat"), (6, "mouse"),(7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater")))
val sortedRDD = randRDD.sortByKey()
sortedRDD.toString
res1: String = ShuffledRDD[88] at sortByKey at <console>:23
75. treeAggregate
Prototype
def treeAggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U, depth: Int = 2)(implicit arg0: ClassTag[U]): U
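Description
treeAggregate is very similar to aggregate, except that the per-partition results are combined in a multi-level tree pattern instead of all at once on the driver. depth is the suggested depth of that tree (default 2), not the number of elements combined per step. For RDDs with many partitions this reduces the amount of merging the driver has to do. In the results shown here, zeroValue is applied once within each partition but not again when the per-partition results are combined.
Example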
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.treeAggregate(0)(math.max(_, _), _ + _)
res1: Int = 9
z.treeAggregate(5)(math.max(_, _), _ + _,4)
res2: Int = 11
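The two results above can be reproduced by hand with plain Scala collections. The sketch below is a model of the behaviour seen in this example, not Spark's implementation; treeAggregateSketch and aggregateSketch are hypothetical names, and how zeroValue is handled during the final merge may differ between Spark versions.

```scala
// The two partitions of sc.parallelize(List(1,2,3,4,5,6), 2), modelled as plain lists.
val partitions = Seq(List(1, 2, 3), List(4, 5, 6))

val seqOp: (Int, Int) => Int = (acc, x) => math.max(acc, x)
val combOp: (Int, Int) => Int = _ + _

// zeroValue seeds each partition, but the partial results are merged without it.
def treeAggregateSketch(zero: Int): Int =
  partitions.map(_.foldLeft(zero)(seqOp)).reduce(combOp)

// For contrast: a model of aggregate, where zeroValue also seeds the final merge.
def aggregateSketch(zero: Int): Int =
  partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

val withZero0 = treeAggregateSketch(0)   // max(0,1,2,3) + max(0,4,5,6)
val withZero5 = treeAggregateSketch(5)   // max(5,1,2,3) + max(5,4,5,6)
val plainAggregate5 = aggregateSketch(5) // 5 + max(5,1,2,3) + max(5,4,5,6)
```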
76. treeReduce
Prototype
def treeReduce(f: (T, T) ⇒ T, depth: Int = 2): T
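Description
treeReduce is very similar to reduce, except that the per-partition results are combined in a multi-level tree pattern instead of all at once on the driver. depth is the suggested depth of that tree (default 2), not the number of elements combined per step. For RDDs with many partitions this reduces the amount of merging the driver has to do.
Example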
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.treeReduce(_+_,3)
res1: Int = 21
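The tree-shaped merge can be illustrated with a small stand-alone sketch. treeReduceSketch and its fanIn parameter are assumptions made for illustration: Spark does not expose a fan-in directly but derives how many partitions to merge per level from depth and the number of partitions.

```scala
// Hypothetical helper: merge partial results in groups of `fanIn` until one remains.
def treeReduceSketch[T](partials: Seq[T], fanIn: Int)(f: (T, T) => T): T = {
  require(partials.nonEmpty, "nothing to reduce")
  require(fanIn >= 2, "fanIn must be at least 2")
  var level = partials.toVector
  while (level.size > 1) {
    // One tree level: every group of fanIn neighbours collapses into a single value.
    level = level.grouped(fanIn).map(_.reduce(f)).toVector
  }
  level.head
}

// Per-partition sums of List(1,2,3) and List(4,5,6), merged in one level.
val twoPartitions = treeReduceSketch(Seq(6, 15), 2)(_ + _)

// Eight partial results merged pairwise over three levels.
val eightPartitions = treeReduceSketch(Seq(1, 2, 3, 4, 5, 6, 7, 8), 2)(_ + _)
```

For an associative function the result is the same as a flat reduce; only the shape of the merge changes.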
77. union, ++
Prototype
def ++(other: RDD[T]): RDD[T]
def union(other: RDD[T]): RDD[T]
Description
union works much like set union, except that duplicates are kept. If duplicates are not wanted, call distinct on the result.
Example
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(2 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 2, 3, 4, 5, 6, 7)
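Plain Scala collections behave the same way with respect to duplicates, which makes for a quick illustration without a cluster. On an RDD the equivalent would be (a ++ b).distinct; note that distinct shuffles data there, so the element order of the result is not guaranteed, unlike in this local sketch.

```scala
val a = Vector(1, 2, 3)
val b = Vector(2, 3, 4, 5, 6, 7)

// Concatenation keeps every element, including the repeated 2 and 3.
val unioned = a ++ b

// distinct drops the repeats and keeps the first occurrence of each value.
val deduplicated = unioned.distinct
```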
78. unpersist
Prototype
def unpersist(blocking: Boolean = true): RDD[T]
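Description
unpersist dematerializes the RDD: its stored data is removed from memory and disk. The RDD object itself (and its id) remains, however; if it is used again, Spark regenerates the data from the RDD's dependency graph.
Example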
val y = sc.parallelize(1 to 10, 10)
val z = (y++y)
z.collect
z.unpersist(true)
79. values
Prototype
def values: RDD[V]
Description
values returns an RDD containing the value of each key-value pair in the RDD. It is the counterpart of keys.
Example
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res3: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)
80. variance [Double], sampleVariance [Double]
Prototype
def variance(): Double
def sampleVariance(): Double
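Description
variance computes the population variance of the elements in the RDD, dividing by N.
sampleVariance computes the sample variance, which corrects for bias by dividing by N-1 instead of N. Use it when the RDD holds a sample drawn from a larger population.
Example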
val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)
x.variance
res14: Double = 66.04584444444443
x.sampleVariance
res13: Double = 74.30157499999999
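The two divisors can be checked by hand on a plain Scala list. This is a direct transcription of the textbook formulas, not Spark's implementation, so the last few digits may differ slightly from the output above because of floating-point rounding.

```scala
val xs = List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0)

val n = xs.size
val mean = xs.sum / n

// Sum of squared deviations from the mean, shared by both formulas.
val sumSquaredDev = xs.map(x => (x - mean) * (x - mean)).sum

val populationVariance = sumSquaredDev / n        // what variance computes: divide by N
val sampleVarianceValue = sumSquaredDev / (n - 1) // what sampleVariance computes: divide by N-1
```

The two values differ by a factor of N/(N-1), here 9/8, so the gap shrinks as the number of elements grows.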
81. zip
Prototype
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
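Description
zip combines two RDDs into an RDD of key-value pairs, pairing the i-th element of the first RDD with the i-th element of the second. Both RDDs must have the same number of partitions and the same number of elements in each partition.
Example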
val x = sc.parallelize(1 to 10)
val y = sc.parallelize(21 to 30)
x.zip(y).collect
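The pairing and the two size requirements can be modelled with plain Scala collections. zipSketch is a hypothetical helper, and the split into two partitions of five elements is an assumption of this sketch; the example above leaves the partition count to Spark's default.

```scala
// Hypothetical model of zip on partitioned data: partitions are paired up one by one,
// and each pair of partitions must hold the same number of elements.
def zipSketch[A, B](left: Seq[Seq[A]], right: Seq[Seq[B]]): Seq[(A, B)] = {
  require(left.size == right.size, "different number of partitions")
  left.zip(right).flatMap { case (l, r) =>
    require(l.size == r.size, "partitions differ in size")
    l.zip(r)
  }
}

// 1 to 10 and 21 to 30, each split into two partitions of five elements.
val x = Seq((1 to 5).toList, (6 to 10).toList)
val y = Seq((21 to 25).toList, (26 to 30).toList)
val zipped = zipSketch(x, y)
```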
82. zipPartitions
Prototype
def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
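Description
zipPartitions is similar to zip, but it works on whole partitions and can combine more than two RDDs: the user must supply a function that receives one iterator per RDD and returns an iterator of results. Overloads exist for combining the RDD with one, two, or three other RDDs. All RDDs must have the same number of partitions, but unlike zip they need not have the same number of elements in each partition.
Example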
val a = sc.parallelize(0 to 9, 3)
val b = sc.parallelize(10 to 19, 3)
val c = sc.parallelize(100 to 109, 3)
def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
{
  var res = List[String]()
  while (aiter.hasNext && biter.hasNext && citer.hasNext)
  {
    val x = aiter.next + " " + biter.next + " " + citer.next
    res ::= x
  }
  res.iterator
}
a.zipPartitions(b, c)(myfunc).collect
res0: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9 19 109, 8 18 108, 7 17 107, 6 16 106)
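The output above is reversed within each partition because myfunc prepends every new string to a list. As an alternative, here is a sketch of a function that zips the three iterators lazily and keeps the original order; zipThree is a hypothetical name, and it is exercised on plain iterators instead of RDD partitions. It has the same shape as myfunc, so it could be passed to zipPartitions in the same way.

```scala
// Hypothetical alternative to myfunc: zips three iterators lazily and keeps element order.
// Like myfunc, it stops as soon as any of the three iterators is exhausted.
def zipThree(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
  aiter.zip(biter).zip(citer).map { case ((x, y), z) => s"$x $y $z" }

// The contents of the first of the three partitions of a, b and c above.
val firstPartition =
  zipThree(Iterator(0, 1, 2), Iterator(10, 11, 12), Iterator(100, 101, 102)).toList
```

Because nothing is collected into an intermediate list, this version also avoids holding a whole partition's results in memory at once.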
83. zipWithIndex
Prototype
def zipWithIndex(): RDD[(T, Long)]
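Description
zipWithIndex is similar to zip but simpler: it pairs each element with its position (index) in the RDD as the value, so the user does not need to define a second RDD. Indices start at 0 and follow partition order. When the RDD has more than one partition, computing the indices triggers a Spark job, because the size of each preceding partition must be known.
Example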
val z = sc.parallelize(Array("A", "B", "C", "D"))
val r = z.zipWithIndex
r.collect
res110: Array[(String, Long)] = Array((A,0), (B,1), (C,2), (D,3))
val z = sc.parallelize(100 to 120, 5)
val r = z.zipWithIndex
r.collect
res11: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (111,11), (112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18), (119,19), (120,20))
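How the indices follow from the partition sizes can be sketched with plain Scala collections. Both helpers are hypothetical: slice models how a local collection might be cut into contiguous partitions (boundaries at i * length / numSlices, an assumption of this sketch), and zipWithIndexSketch models the index assignment.

```scala
// Split a collection into numSlices contiguous partitions
// (slice boundaries at i * length / numSlices, an assumption of this sketch).
def slice[T](xs: IndexedSeq[T], numSlices: Int): Seq[IndexedSeq[T]] =
  (0 until numSlices).map { i =>
    xs.slice(i * xs.length / numSlices, (i + 1) * xs.length / numSlices)
  }

// Hypothetical model of zipWithIndex: each partition starts counting at the
// total size of all partitions before it.
def zipWithIndexSketch[T](partitions: Seq[IndexedSeq[T]]): Seq[(T, Long)] = {
  val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
  partitions.zip(offsets).flatMap { case (part, offset) =>
    part.zipWithIndex.map { case (elem, i) => (elem, offset + i) }
  }
}

val indexed = zipWithIndexSketch(slice(100 to 120, 5))
```

The offsets depend on the sizes of all earlier partitions, which is why those sizes have to be counted before any index can be assigned.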
84. zipWithUniqueId
Prototype
def zipWithUniqueId(): RDD[(T, Long)]
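Description
zipWithUniqueId is similar to zip but simpler: it pairs each element with a unique Long id assigned by the system as the value. The ids do not correspond to element positions, but they are guaranteed not to collide across partitions: with n partitions, the elements of partition k receive the ids k, n+k, 2n+k, and so on. Unlike zipWithIndex, this does not require a Spark job to count partition sizes first.
Example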
val z = sc.parallelize(100 to 120, 5)
val r = z.zipWithUniqueId
r.collect
res1: Array[(Int, Long)] = Array((100,0), (101,5), (102,10), (103,15), (104,1), (105,6), (106,11), (107,16), (108,2), (109,7), (110,12), (111,17), (112,3), (113,8), (114,13), (115,18), (116,4), (117,9), (118,14), (119,19), (120,24))
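The ids in the output above follow a simple pattern that can be reproduced with plain Scala collections. Both helpers are hypothetical: slice models how a local collection might be cut into contiguous partitions (boundaries at i * length / numSlices, an assumption of this sketch), and zipWithUniqueIdSketch gives the k-th element of partition p, out of n partitions, the id k * n + p.

```scala
// Split a collection into numSlices contiguous partitions
// (slice boundaries at i * length / numSlices, an assumption of this sketch).
def slice[T](xs: IndexedSeq[T], numSlices: Int): Seq[IndexedSeq[T]] =
  (0 until numSlices).map { i =>
    xs.slice(i * xs.length / numSlices, (i + 1) * xs.length / numSlices)
  }

// Hypothetical model of zipWithUniqueId: the k-th element of partition p gets id k * n + p,
// where n is the number of partitions. No partition needs to know the size of any other.
def zipWithUniqueIdSketch[T](partitions: Seq[IndexedSeq[T]]): Seq[(T, Long)] = {
  val n = partitions.size.toLong
  partitions.zipWithIndex.flatMap { case (part, p) =>
    part.zipWithIndex.map { case (elem, k) => (elem, k * n + p) }
  }
}

val withIds = zipWithUniqueIdSketch(slice(100 to 120, 5))
```

The last partition holds five elements while the others hold four, which is why 120 receives id 24 and the ids 20 to 23 are never used.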