Main methods and attributes:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
RDD is an abstract class; concrete subclasses provide the various implementations.
The first constructor parameter is the SparkContext; it is marked @transient, so it is not serialized with the RDD.
The second parameter, deps, represents this RDD's dependencies on its parent RDDs.
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  // Called only once. Implemented by subclasses; returns all the partitions of this RDD.
  protected def getPartitions: Array[Partition]

  // Called only once. Returns the dependencies of this RDD on its parent RDDs.
  protected def getDependencies: Seq[Dependency[_]] = deps

  // Computes the given partition and returns an iterator over its elements.
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Optional: takes a split (partition) and returns the preferred node locations for computing it.
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // Optional: the partitioner (property 4 above), similar to the Partitioner interface in MapReduce,
  // which controls which reducer a given key is sent to.
  val partitioner: Option[Partitioner] = None
}
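To make these abstract members concrete, here is a minimal sketch of a custom RDD subclass. RangeRDD and RangePartition are hypothetical names invented for illustration (they are not part of Spark); the RDD simply generates the integers [0, n) split across numSlices partitions.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition implementation: each partition covers the range [start, end).
class RangePartition(override val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD producing the integers [0, n), split into numSlices partitions.
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {  // Nil: this RDD has no parent RDDs, hence no dependencies

  // Property 1: the list of partitions, built once by the framework.
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices): Partition
    }.toArray

  // Property 2: how to compute a single partition, returning an iterator over its elements.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Property 5 (optional): no locality preference for this in-memory example.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

With a SparkContext sc in scope, new RangeRDD(sc, 100, 4).collect() would return the integers 0 to 99, computed across 4 partitions.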
2. Taking the WordCount program as an example
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount {