Spark RDD

License: CC 4.0 BY-SA

def compute(split: Partition, context: TaskContext): Iterator[T]

protected def getPartitions: Array[Partition]

protected def getDependencies: Seq[Dependency[_]] = deps

protected def getPreferredLocations(split: Partition): Seq[String] = Nil

@transient val partitioner: Option[Partitioner] = None

(partitioner: only for key-value RDDs)
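
To make these five members concrete, here is a minimal sketch of a custom RDD that overrides only the two mandatory ones, getPartitions and compute, and leaves the rest at their defaults. The names RangePartition, IntRangeRDD and IntRangeDemo are hypothetical, and the sketch assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: covers the half-open integer range [start, end).
final class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical source RDD with no parents (hence Nil as its dependency list).
class IntRangeRDD(sc: SparkContext, from: Int, until: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {

  // Split [from, until) into numSlices contiguous ranges.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      val size = (until - from).toLong
      val start = from + (i * size / numSlices).toInt
      val end = from + ((i + 1) * size / numSlices).toInt
      new RangePartition(i, start, end)
    }

  // Produce the elements of one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

object IntRangeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("custom-rdd").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val rdd = new IntRangeRDD(sc, 1, 10, 3)
      println(rdd.partitions.length)       // number of partitions
      println(rdd.collect().mkString(",")) // the elements, partition by partition
    } finally sc.stop()
  }
}
```

getDependencies, getPreferredLocations and partitioner are not overridden, so this RDD reports no parents, no preferred locations and no partitioner.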

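narrow-dependency RDD (e.g. the result of map):

Partitions: same as the parent RDD

Dependencies: one-to-one with the parent RDD

Compute function: applies the transformation to the parent's data

Preferred locations: none

Partitioner: none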

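joined-RDD:

Partitions: one per reduce task
Dependencies: wide (shuffle) dependencies on several or all parent RDD partitions
Compute function: reads the shuffled data
Preferred locations: none
Partitioner: HashPartitioner(partitions: Int)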
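
The two profiles can be inspected directly on an RDD. The following is a rough sketch, assuming local mode; it uses reduceByKey rather than join for the wide case, because the RDD returned by reduceByKey is itself the shuffled RDD, while join wraps it in further narrow transformations. DependencyDemo is a hypothetical name.

```scala
import org.apache.spark.{HashPartitioner, OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dependency-demo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 3)

      // Narrow case: map keeps the parent's partitions and has no partitioner.
      val mapped = pairs.map { case (k, v) => (k, v * 2) }
      println(mapped.partitions.length == pairs.partitions.length)
      println(mapped.dependencies.head.isInstanceOf[OneToOneDependency[_]])
      println(mapped.partitioner.isEmpty)

      // Wide case: the shuffle produces one partition per reduce task.
      val reduced = mapped.reduceByKey(new HashPartitioner(2), _ + _)
      println(reduced.partitions.length)
      println(reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])
      println(reduced.partitioner)
    } finally sc.stop()
  }
}
```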

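Wide vs. narrow dependencies

Wide dependency: each partition of the parent RDD may be used by several partitions of the child RDD, so a child partition can depend on all parent partitions.

         Impact: if a partition is lost, all parent partitions it draws on may have to be recovered before the child can be recomputed; the child cannot be pipelined with its parents, because it has to wait for the shuffle.

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD (in the simplest case, a child partition depends on exactly one parent partition).

         Impact: if a partition is lost, only the corresponding parent partition has to be recovered; partitions can be computed fully in parallel.

-> Key point: stage boundaries fall at shuffles, where the output of the parents of a wide dependency has to be materialized (persisted).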
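
One way to see where a stage boundary falls is to print an RDD's lineage with toDebugString. A small sketch, assuming local mode (LineageDemo is a hypothetical name); the exact text of the printed lineage varies between Spark versions, so it is not reproduced here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage-demo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)
      // map is narrow and stays in the same stage as parallelize;
      // reduceByKey introduces a shuffle and therefore a new stage.
      val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
      println(counts.toDebugString)
      println(counts.collect().sorted.mkString(","))
    } finally sc.stop()
  }
}
```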

transformation

--map:

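* Using a custom function

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
// Java API: pass an anonymous Function that maps each line to its length.
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});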
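
For comparison, the same map is much shorter in the Scala API, where the function is passed as a lambda. A minimal sketch, assuming local mode; MapDemo and the sample lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("map-demo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val lines = sc.parallelize(Seq("spark", "rdd", "map"))
      // map applies the function to every element, one element at a time.
      val lineLengths = lines.map(_.length)
      println(lineLengths.collect().mkString(","))
    } finally sc.stop()
  }
}
```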

--mapPartitions:

// iter is the iterator over a whole partition, not a single element
val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
    var res = List[(T, T)]()
    if (iter.hasNext) {              // guard against empty partitions
        var pre = iter.next()
        while (iter.hasNext) {
            val cur = iter.next()
            res = (pre, cur) :: res  // prepend, so pairs come out in reverse order
            pre = cur
        }
    }
    res.iterator
}
a.mapPartitions(myfunc).collect
// Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
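
The version above builds a full list per partition before returning it. As an alternative sketch (not the original author's code), the same pairing can be done lazily with Iterator.sliding, which also keeps the pairs in their natural order. AdjacentPairs and adjacent are hypothetical names; local mode is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AdjacentPairs {
  // Pair each element with its successor inside one partition.
  // Partitions with fewer than two elements yield no pairs.
  def adjacent[T](iter: Iterator[T]): Iterator[(T, T)] =
    iter.sliding(2).collect { case Seq(prev, cur) => (prev, cur) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("adjacent-pairs").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val a = sc.parallelize(1 to 9, 3)
      // The function is called once per partition, so no pair crosses a partition boundary.
      println(a.mapPartitions(it => adjacent(it)).collect().mkString(", "))
    } finally sc.stop()
  }
}
```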

--mapPartitionsWithIndex
val rdd1 = sc.makeRDD(1 to 9, 3)
// rdd1 has three partitions
val rdd2 = rdd1.mapPartitionsWithIndex {
        (x, iter) => {
            var i = 0
            while (iter.hasNext) {
              i += iter.next()
            }
            Iterator(x + "|" + i)
        }
      }
// rdd2 sums the numbers in each partition of rdd1 and prefixes each sum with the partition index
scala> rdd2.collect
res13: Array[String] = Array(0|6, 1|15, 2|24)

--mapValues

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", " eagle"), 2)
val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
b.mapValues("x" + _ + "x").reduceByKey(_+_).collect  // mapValues only transforms the value; reduceByKey still groups by the tuple's key, i.e. tuple._1
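
Because mapValues cannot touch the key, Spark can keep an existing partitioner on the result, which a plain map cannot do. A short sketch of that difference, assuming local mode; MapValuesDemo is a hypothetical name and the word list is shortened for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MapValuesDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapvalues-demo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val byLength = sc.parallelize(Seq("dog", "tiger", "lion", "cat"))
        .map(x => (x.length, x))
        .partitionBy(new HashPartitioner(2))

      // mapValues leaves keys alone, so the partitioner carries over.
      val viaMapValues = byLength.mapValues("x" + _ + "x")
      println(viaMapValues.partitioner == byLength.partitioner)

      // map could change the key, so the partitioner is dropped.
      val viaMap = byLength.map { case (k, v) => (k, "x" + v + "x") }
      println(viaMap.partitioner.isEmpty)
    } finally sc.stop()
  }
}
```

Keeping the partitioner means a following reduceByKey with the same partitioning does not need another shuffle.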

--mapWith

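val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
x.mapWith(a => a * 10)((a, b) => (b + 2)).collect
=> the first function multiplies the partition index (0, 1, 2) by 10
def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]
The first function, constructA, takes the RDD's partition index (starting from 0) as input and returns a value of the new type A.
The second function, f, takes the pair (T, A) as input (T is an element of the original RDD, A is the output of the first function) and returns a value of type U.

Note: sc.parallelize(***, N) splits the dataset into N partitions, and the function is executed on each partition.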
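
mapWith was deprecated in Spark 1.x and is no longer available from Spark 2.0 on, so on newer versions the same effect has to be expressed with mapPartitionsWithIndex. A sketch of the equivalent call, assuming local mode; MapWithDemo and withPartitionValue are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapWithDemo {
  // Mirrors x.mapWith(a => a * 10)((a, b) => b + 2) from the example above.
  def withPartitionValue(sc: SparkContext): Array[Int] = {
    val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
    x.mapPartitionsWithIndex { (index, iter) =>
      val a = index * 10       // plays the role of constructA(index)
      iter.map(elem => a + 2)  // plays the role of f(elem, a); elem is ignored, as in the original
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapwith-demo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try println(withPartitionValue(sc).mkString(", "))
    finally sc.stop()
  }
}
```

The value derived from the partition index is computed once per partition and then shared by every element of that partition, which is the same contract mapWith had.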

References: http://blog.youkuaiyun.com/lfz_carlos/article/details/50753695

       http://lxw1234.com/archives/2015/07/348.htm

Action