RDD详解
Spark计算中一个重要的概念就是可以跨越多个节点的可伸缩分布式数据集RDD(resilient distributed dataset)Spark的内存计算核心就是RDD的并行计算。RDD就是一个分布式、不可变、带有分区的弹性的数据集合。所谓的Spark批处理,实际上就是对RDD的集合操作,RDD有以下特点:
- 任意一个RDD都包含分区数(决定程序某个阶段计算并行度)
- RDD的分布式计算是在分区内部计算的
- 因为RDD是只读的,RDD之间存在依赖关系(宽、窄依赖)
- 针对k-v类型的RDD,一般可以指定分区策略(一般系统提供)
- 针对存储在HDFS上的文件,系统可以计算最优位置,计算每个切片。
Spark的整个任务的计算无外乎围绕RDD的三种类型操作RDD创建、RDD转换、RDD Action.通常习惯性的将flatMap/map/reduceByKey称为RDD的转换算子,collect触发任务执行,因此被人们称为动作算子。在Spark中所有的Transform算子都是lazy执行的,只有在Action算子的时候,Spark才会真正的运行任务,也就是说只有遇到Action算子的时候,SparkContext才会对任务做DAG状态拆分,系统才会计算每个阶段下任务的TaskSet,继而TaskSchedule才会将任务提交给Executor执行。以字符统计为例:
textFile(“路径”,分区数) -> flatMap -> map -> reduceByKey -> sortBy在这些转换中其中flatMap/map
、reduceByKey、sortBy都是转换算子,所有转换算子都是lazy执行的。程序在遇到collect(Action算子)系统才会触发job执行。Spark底层会按照RDD的依赖关系将整个计算拆分成若干个阶段(stage),我们通常将RDD的依赖关系称为RDD的血统-lineage。血统的依赖通常包含:宽依赖、窄依赖。
RDD容错及stage划分
容错
Spark的计算本质就是对RDD做各种转换,因为RDD是一个不可变只读的集合,因此每次的转换都需要上一次的RDD作为本次转换的输入,因此RDD的lineage描述的是RDD间的相互依赖关系。为了保证RDD中数据的健壮性,RDD数据集通过所谓血统关系(lineage)记住了他是如何其他RDD中演变过来的。Spark将RDD之间的关系规类为宽依赖和窄依赖。Spark会根据Lineage存储的RDD的依赖关系对RDD计算做故障容错,目前Spark的容错策略主要是根据RDD依赖关系重新计算、对RDD做cache、对RDD做checkpoint手段完成RDD计算的故障容错。
宽依赖和窄依赖
RDD在lineage依赖方面分为两种NarrowDependencies和Wide Dependencies用来解决数据容错的高效性。Narrow Dependencies是指父RDD的每个分区最多被一个子RDD的分区使用,表现为一个父RDD的分区对应一个子RDD的分区或多个父RDD的分区对应子RDD的一个分区,Wide Dependencies父RDD的一个分区对应一个子RDD的多个分区。
对于Wide Dependencies的计算的输入和输出在不同的节点上,一般需要跨节点做shuffle,因此RDD在做宽依赖恢复的时候需要多个节点重新计算,成本较高。Narrow Dependencies RDD之间的计算是在同一个Task当中实现的,因此在RDD分区数据丢失的时候,也非常容易恢复。
Stage的划分
Spark任务阶段的划分是按照RDD的lineage关系逆向生成的,Spark任务提交流程大致入下图:
DAGSchedule中对State拆分的逻辑代码片段:
- DAGScheduler.scala719行
def runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): Unit = {
val start = System.nanoTime
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
//...
}
- DAGScheduler675行
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
//eventProcessLoop 实现的是一个队列,系统底层会调用 doOnReceive -> case JobSubmitted -> dagScheduler.handleJobSubmitted(951行)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
- DAGScheduler 951行
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
//...
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
//...
}
submitStage(finalStage)
}
- DAGScheduler 1060行
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
//计算当前State的父Stage
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
//如果当前的State没有父Stage,就提交当前Stage中的Task
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
//递归查找当前父Stage的父Stage
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
- DAGScheduler 549行(获取当前stage的父stage)
private def getMissingParentStages(stage: Stage): List[Stage] = {
val missing = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
val waitingForVisit = new ArrayStack[RDD[_]]//栈
def visit(rdd: RDD[_]) {
if (!visited(rdd)) {
visited += rdd
val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
if (rddHasUncachedPartitions) {
for (dep <- rdd.dependencies) {
dep match {
//如果是宽依赖ShuffleDependency,就添加一个Stage
case shufDep: ShuffleDependency,[_, _, _] =>
val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
if (!mapStage.isAvailable) {
missing += mapStage
}
//如果是窄依赖NarrowDependency,将当前的父RDD添加到栈中
case narrowDep: NarrowDependency[_] =>
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
waitingForVisit.push(stage.rdd)
while (waitingForVisit.nonEmpty) {//循环遍历栈,计算 stage
visit(waitingForVisit.pop())
}
missing.toList
}
- DAGScheduler 1083行(提交当前Stage的TaskSet)
private def submitMissingTasks(stage: Stage, jobId: Int) {
logDebug("submitMissingTasks(" + stage + ")")
// First figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
// Use the scheduling pool, job group, description, etc. from an ActiveJob associated
// with this Stage
val properties = jobIdToActiveJob(jobId).properties
runningStages += stage
// SparkListenerStageSubmitted should be posted before testing whether tasks are
// serializable. If tasks are not serializable, a SparkListenerStageCompleted event
// will be posted, which should always come after a corresponding SparkListenerStageSubmitted
// event.
stage match {
case s: ShuffleMapStage =>
outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
case s: ResultStage =>
outputCommitCoordinator.stageStart(
stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
} catch {
case NonFatal(e) =>
stage.makeNewStageAttempt(partitionsToCompute.size)
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
// If there are tasks to execute, record the submission time of the stage. Otherwise,
// post the even without the submission time, which indicates that this stage was
// skipped.
if (partitionsToCompute.nonEmpty) {
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
}
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
// TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
// Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
// the serialized copy of the RDD and for each task we will deserialize it, which means each
// task gets a different copy of the RDD. This provides stronger isolation between tasks that
// might modify state of objects referenced in their closures. This is necessary in Hadoop
// where the JobConf/Configuration object is not thread-safe.
var taskBinary: Broadcast[Array[Byte]] = null
var partitions: Array[Partition] = null
try {
// For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
// For ResultTask, serialize and broadcast (rdd, func).
var taskBinaryBytes: Array[Byte] = null
// taskBinaryBytes and partitions are both effected by the checkpoint status. We need
// this synchronization in case another concurrent job is checkpointing this RDD, so we get a
// consistent view of both variables.
RDDCheckpointData.synchronized {
taskBinaryBytes = stage match {
case stage: ShuffleMapStage =>
JavaUtils.bufferToArray(
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
case stage: ResultStage =>
JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
partitions = stage.rdd.partitions
}
taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
// In the case of a failure during serialization, abort the stage.
case e: NotSerializableException =>
abortStage(stage, "Task not serializable: " + e.toString, Some(e))
runningStages -= stage
// Abort execution
return
case e: Throwable =>
abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
// Abort execution
return
}
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
stage.rdd.isBarrier())
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
if (tasks.size > 0) {
logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
} else {
// Because we posted SparkListenerStageSubmitted earlier, we should mark
// the stage as completed here in case there are no tasks to run
markStageAsFinished(stage, None)
stage match {
case stage: ShuffleMapStage =>
logDebug(s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})")
markMapStageJobsAsFinished(stage)
case stage : ResultStage =>
logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
}
submitWaitingChildStages(stage)
}
}
小结
通过以上源码分析,可以得出Spark所谓宽窄依赖实际上指的是ShuffleDependency或者是Narrow Dependency,如果是Shuffle Dependency系统会生成一个ShuffleMapStage,如果是Narrow Dependency则忽略,归为当前Stage。当系统回推到起始RDD的时候因为发现当前RDD或者ShuffleMapStage没有父Stage的时候,系统会在当前stage下的task封装成Shuffle Map Task(如果是ResultStage就是Result Task),当前Task的数目等于当前stage分区的分区数。然后将task封装成TaskSet通过调用taskScheduler.submitTasks将任务提交给集群。
RDD缓存
缓存是一种RDD计算容错的手段,程序在RDD丢失数据的时候,可以通过缓存快速计算当前RDD的是,而不需要反推出所有的RDD重新计算,因此Spark在需要对某个RDD多次使用的时候,为了提高程序执行效率,可以考虑使用RDD的cache。
要持久化一个RDD,只要调用其cache()或persist()方法即可,在该RDD第一次被计算出来时,就会直接缓存在每个节点中。而且Spark的持久化机制时自动容错的,如果持久化的RDD的任何partition丢失了,那么Spark会自动通过其源RDD,使用transformation操作重新计算该partition
cache()和persist()的区别在于cache()时persist()的一种简化方式,cache()的底层就是调用的persist()的无参版本,就是调用persist(MEMORY_ONLY),将数据持久化到内存中。如果需要从内存中清除缓存,可以使用unpersist()方法。
Spark自己也会在shuffle操作时,进行数据的持久化,比如写入磁盘,主要时为了在节点失败时,避免重新计算整个过程。
val conf = new SparkConf()
.setAppName("word-count")
.setMaster("local[2]")
val sc = new SparkContext(conf)
val value: RDD[String] = sc.textFile("file:///D:/demo/words/")
.cache()
value.count()
var begin=System.currentTimeMillis()
value.count()
var end=System.currentTimeMillis()
println("耗时:"+ (end-begin))//耗时:253
//失效缓存
value.unpersist()
begin=System.currentTimeMillis()
value.count()
end=System.currentTimeMillis()
println("不使用缓存耗时:"+ (end-begin))//2029
sc.stop()
除了调用cache之外,spark提供了更细粒度的RDD缓存方案,用户可以根据集群的内存状态选择合适的缓存策略。可以使用persist方法指定缓存级别
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
xxRDD.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
其中:
MEMORY_ONLY:表示数据完全不经过序列化,直接存储在内存中,效率高,但有可能导致内存溢出
MEMORY_ONLY_SER和MEMORY_ONLY一样,只不过需要对RDD的数据做序列化,牺牲CPU节省内存,同样有内存溢出的可能。
其中_2表示缓存结果有备份,如果大家不确定该使用哪种级别,一般推MEMORY_ONLY_SER_2;能不使用DISK相关的策略,就不使用,有的时候,从磁盘读取数据,还不如重新计算一次
CheckPoint机制
使用缓存机制可以有效的保证RDD的故障恢复,但如果缓存失效还是会导致系统重新计算RDD的结果,所以对于一些lineage较长的场景,计算比较耗时,用户可以尝试使用checkpoint机制存储RDD的计算结果,checkpoint和缓存最大的不同在于,使用checkpoint之后被checkpoint的RDD数据直接持久化在文件系统中,一般推荐将结果写在hdfs中,这种checkpoint不会自动清空。注意:checkpoint在计算过程中先是对RDD做mark,在任务执行结束后,再对mark的RDD实行checkpoint,也就是要重新计算被mark的RDD的依赖和结果,为了避免这种重复计算,推荐使用策略:
val conf = new SparkConf().setMaster("yarn").setAppName("wordcount")
val sc = new SparkContext(conf)
sc.setCheckpointDir("hdfs:///checkpoints")//首先设置一个容错的系统文件目录
val lineRDD: RDD[String] = sc.textFile("hdfs:///words/t_word.txt")
val cacheRdd = lineRDD.flatMap(line => line.split(" "))
.map(word => (word, 1))
.groupByKey()
.map(tuple => (tuple._1, tuple._2.sum))
.sortBy(tuple => tuple._2, false, 1)
.cache()
cacheRdd.checkpoint()//对RDD调用checkpoint方法,再RDD所处的job结束后,会启动一个单独的job,来将checkpoint过的rdd的数据写入之前设置的文件系统。
cacheRdd.collect().foreach(tuple=>println(tuple._1+"->"+tuple._2))
cacheRdd.unpersist()
//3.关闭sc
sc.stop()