RDD is an abstract class: operating on an RDD feels like operating on a local collection, which lowers the programming barrier. An RDD does not store the data it will compute; it only records how it was derived from other RDDs (which method was called and which function was passed in). RDD operators fall into two categories: Transformations (lazy) and Actions (which trigger job execution).
When an RDD's map method runs on an Executor, it pulls the data through and processes it one record at a time.
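A minimal sketch of the two operator categories (the application name, master URL, and HDFS path are assumptions for illustration): textFile and map below only record lineage, and nothing runs until the reduce action submits a job.

import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[2]"))

    val lines   = sc.textFile("hdfs://node1:9000/data/words.txt")  // Transformation: nothing is read yet
    val lengths = lines.map(line => line.length)                   // Transformation: only lineage is recorded
    val total   = lengths.reduce(_ + _)                            // Action: a job is submitted and executed

    println(total)
    sc.stop()
  }
}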
Ways to create an RDD (a short sketch follows this list):
1. From data in external storage: sc.textFile("path")
2. By parallelizing a Scala collection on the Driver: sc.parallelize(List(x, x, x, x, x))
3. By transforming an existing RDD into a new one. (A transformation: 1. produces a new RDD; 2. is lazy.)
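A minimal sketch of all three paths (the HDFS path is an assumption for illustration):

val rdd1 = sc.textFile("hdfs://node1:9000/data/words.txt")   // 1. from external storage
val rdd2 = sc.parallelize(List(1, 2, 3, 4, 5))               // 2. parallelize a Driver-side Scala collection
val rdd3 = rdd2.map(_ * 10)                                  // 3. transform an existing RDD: new RDD, lazy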
What determines the number of partitions? (Checked in the sketch below.)
1. If the RDD comes from parallelizing a Driver-side Scala collection and no partition count is given, the number of partitions follows the number of cores allocated to the application (Spark's default parallelism).
2. If the RDD is created by reading from HDFS and the minimum number of partitions is set to 1, the number of partitions equals the number of input splits. If it is not set, textFile passes a default minimum of (at most) 2 to the InputFormat, and the partition count equals the number of splits it produces, which is at least that minimum.
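These rules can be observed in spark-shell; the core count (local[4]) and the files used below are assumptions:

// Started as: spark-shell --master local[4]
sc.parallelize(1 to 100).partitions.length        // 4: follows the cores given to the app
sc.parallelize(1 to 100, 6).partitions.length     // 6: an explicit numSlices wins

sc.textFile("hdfs://node1:9000/data/big.log").partitions.length      // number of splits, at least 2
sc.textFile("hdfs://node1:9000/data/big.log", 1).partitions.length   // number of splits (roughly the HDFS blocks)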
The five key properties of an RDD:
* - A list of partitions (a set of partitions, each with an index, in order)
* - A function for computing each split (a function is applied to each split to process its data)
* - A list of dependencies on other RDDs (RDDs depend on one another, forming a lineage)
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  (Optional: only key-value RDDs, i.e. RDD[(K, V)], have one. A K-V RDD carries a partitioner, hash partitioning by default.)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
*   an HDFS file)
  (Optional: when reading from HDFS, the preferred locations of each split are obtained from the block metadata requested from the NameNode.)
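A quick way to see the five properties on a concrete RDD (the path is illustrative; partitioner is only defined for key-value RDDs, here after reduceByKey):

val pairs = sc.textFile("hdfs://node1:9000/data/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

pairs.partitions.length                          // 1. the list of partitions
// 2. compute: the function run against each split (called internally by tasks)
pairs.dependencies                               // 3. dependencies on parent RDDs (a ShuffleDependency here)
pairs.partitioner                                // 4. Some(HashPartitioner) for this key-value RDD
pairs.preferredLocations(pairs.partitions(0))    // 5. preferred locations for a given split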
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  // Called only once. Implemented by subclasses to return all partitions of this RDD.
  protected def getPartitions: Array[Partition]

  // Called only once. Returns how this RDD depends on its parent RDDs.
  protected def getDependencies: Seq[Dependency[_]] = deps

  // Compute the given partition and return an iterator over its results.
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Optional: placement preferences. Input is a split (partition), output is a list of preferred node locations.
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // Optional: the partitioner (property 4 above), analogous to the Partitioner in MapReduce,
  // which controls which partition a given key goes to.
  @transient val partitioner: Option[Partitioner] = None
}
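To make these hooks concrete, here is a toy subclass (a sketch, not Spark code) that serves the numbers 0 until count from numParts partitions. It only needs getPartitions and compute, and leaves the optional members at their defaults:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition class: all a Partition needs is an index
case class ToyPartition(index: Int) extends Partition

// Toy RDD: generates 0 until count, split across numParts partitions
class RangeLikeRDD(sc: SparkContext, count: Int, numParts: Int)
  extends RDD[Int](sc, Nil) {                        // Nil: no parent RDDs, so no dependencies

  // Property 1: one Partition object per slice, with ordered indices
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(i => ToyPartition(i))

  // Property 2: the function applied to each split, returning an iterator over its results
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * count / numParts
    val end   = (split.index + 1) * count / numParts
    (start until end).iterator
  }
  // Properties 3-5 keep the defaults: no dependencies, partitioner = None, no preferred locations
}

// usage: new RangeLikeRDD(sc, 100, 4).collect()   // Array(0, 1, ..., 99)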
When textFile is called, it in turn calls hadoopFile:
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString)
}
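In other words, textFile is just hadoopFile with TextInputFormat, keeping the value (the line text) and dropping the LongWritable byte-offset key. The call below should build an equivalent RDD (the path is illustrative):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val lines = sc.hadoopFile("hdfs://node1:9000/data/words.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
  .map { case (_, text) => text.toString }   // keep the line, drop the byte offset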
In hadoopFile we can see that a new HadoopRDD is constructed and returned:
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
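Note the design choice in confBroadcast: instead of serializing the roughly 10 KB Hadoop configuration into every task closure, it is broadcast once per executor and each task reads it back from the broadcast variable. The same pattern is available to user code (the lookup map below is illustrative):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))     // shipped to each executor once
val codes  = sc.parallelize(Seq("a", "b", "a", "c"))
codes.map(k => lookup.value.getOrElse(k, 0)).collect() // tasks read the broadcast value, not a per-task copy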
In the HadoopRDD source, the two key methods are getPartitions and compute:
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  val inputFormat = getInputFormat(jobConf)
  if (inputFormat.isInstanceOf[Configurable]) {
    inputFormat.asInstanceOf[Configurable].setConf(jobConf)
  }
  // Ask the InputFormat to compute the input splits (at least minPartitions of them)
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
  val array = new Array[Partition](inputSplits.size)
  // Wrap each split in a HadoopPartition and return the array of partitions
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}
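The array returned by getPartitions determines how many tasks the stage will have; each task then calls compute on its own split. One way to observe the split-to-task mapping from user code (the path is illustrative):

val rdd = sc.textFile("hdfs://node1:9000/data/big.log", 1)
println(rdd.partitions.length)                         // one partition per input split
rdd.mapPartitionsWithIndex { (idx, it) =>
  Iterator(s"partition $idx -> ${it.size} records")    // the per-split function runs once per partition
}.collect().foreach(println)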
override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
  val iter = new NextIterator[(K, V)] {
    // Cast the generic Partition back to a HadoopPartition
    val split = theSplit.asInstanceOf[HadoopPartition]
    logInfo("Input split: " + split.inputSplit)
    val jobConf = getJobConf()
    val inputMetrics = context.taskMetrics.getInputMetricsForReadMethod(DataReadMethod.Hadoop)

    // Find a function that will return the FileSystem bytes read by this thread.
    // Do this before creating the RecordReader, because its constructor might read some bytes.
    val bytesReadCallback = inputMetrics.bytesReadCallback.orElse {
      split.inputSplit.value match {
        case _: FileSplit | _: CombineFileSplit =>
          SparkHadoopUtil.get.getFSBytesReadOnThreadCallback()
        case _ => None
      }
    }
    inputMetrics.setBytesReadCallback(bytesReadCallback)

    var reader: RecordReader[K, V] = null
    val inputFormat = getInputFormat(jobConf)
    HadoopRDD.addLocalConfiguration(new SimpleDateFormat("yyyyMMddHHmm").format(createTime),
      context.stageId, theSplit.index, context.attemptNumber, jobConf)
    reader = inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)

    // Register an on-task-completion callback to close the input stream.
    context.addTaskCompletionListener{ context => closeIfNeeded() }
    val key: K = reader.createKey()
    val value: V = reader.createValue()

    override def getNext(): (K, V) = {
      try {
        // Read the next record into key/value; finished becomes true at the end of the split
        finished = !reader.next(key, value)
      } catch {
        case eof: EOFException =>
          finished = true
      }
      if (!finished) {
        inputMetrics.incRecordsRead(1)
      }
      (key, value)
    }
    // close() override (which closes the RecordReader and finalizes the read metrics) omitted here
  }
  // Wrap the iterator so the task can be stopped cleanly if it is interrupted or killed
  new InterruptibleIterator[(K, V)](context, iter)
}
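compute only returns an iterator; downstream narrow transformations wrap that iterator lazily, which is why map (as noted at the top) processes records one at a time instead of materializing a whole partition. A rough sketch of the idea, not Spark's actual MapPartitionsRDD:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical "mapped" RDD whose compute simply wraps the parent's iterator
class MappedLikeRDD[T: ClassTag, U: ClassTag](parent: RDD[T], f: T => U)
  extends RDD[U](parent) {                              // one-to-one (narrow) dependency on the parent

  override protected def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)      // pulls records from the parent's compute lazily
}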