An RDD records the chain of dependencies that produced it, so that if some partitions are lost while a stage is executing, the lost data can be recomputed from that record instead of being unrecoverable. This is another facet of RDD resilience: when data is lost, it can be rebuilt.
Every RDD records its lineage; calling rdd.toDebugString prints the full lineage.
val fileRDD: RDD[String] = sc.makeRDD(List("scala","Spark","Spark","scala","hello"))
println(fileRDD.toDebugString)
println("----------------------")
val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
println(wordRDD.toDebugString)
println("----------------------")
val mapRDD: RDD[(String, Int)] = wordRDD.map((_,1))
println(mapRDD.toDebugString)
println("----------------------")
val resultRDD: RDD[(String, Int)] = mapRDD.reduceByKey(_+_)
println(resultRDD.toDebugString)
resultRDD.collect()
The output is as follows:
(4) ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []
----------------------
(4) MapPartitionsRDD[1] at flatMap at DependesesTest.scala:15 []
| ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []
----------------------
(4) MapPartitionsRDD[2] at map at DependesesTest.scala:18 []
| MapPartitionsRDD[1] at flatMap at DependesesTest.scala:15 []
| ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []
----------------------
(4) ShuffledRDD[3] at reduceByKey at DependesesTest.scala:21 []
+-(4) MapPartitionsRDD[2] at map at DependesesTest.scala:18 []
| MapPartitionsRDD[1] at flatMap at DependesesTest.scala:15 []
| ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []
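A note on reading this output: the leading "(4)" is the RDD's partition count, the "|" indentation chains parent RDDs within the same stage, and "+-" marks a shuffle boundary where a new stage begins. A minimal sketch (assuming a local SparkContext named `sc` is available) confirming the partition count shown in the parentheses:

```scala
// Assumes a SparkContext `sc` (e.g. from spark-shell or a local SparkSession).
// reduceByKey keeps the parent's 4 partitions, matching the "(4)" prefix
// printed by toDebugString above.
val rdd = sc.makeRDD(List("a", "b", "a"), 4)
  .map((_, 1))
  .reduceByKey(_ + _)
println(rdd.getNumPartitions) // 4
```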
So every RDD records the full lineage that precedes it.
An RDD's dependencies, by contrast, describe only its relationship to its immediate parent RDD:
val fileRDD: RDD[String] = sc.makeRDD(List("scala","Spark","Spark","scala","hello"))
println(fileRDD.dependencies)
println("----------------------")
val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
println(wordRDD.dependencies)
println("----------------------")
val mapRDD: RDD[(String, Int)] = wordRDD.map((_,1))
println(mapRDD.dependencies)
println("----------------------")
val resultRDD: RDD[(String, Int)] = mapRDD.reduceByKey(_+_)
println(resultRDD.dependencies)
resultRDD.collect()
The result is as follows:
List()
----------------------
List(org.apache.spark.OneToOneDependency@1be59f28)
----------------------
List(org.apache.spark.OneToOneDependency@253b380a)
----------------------
List(org.apache.spark.ShuffleDependency@23f3dbf0)
Two kinds of dependencies appear here:
org.apache.spark.OneToOneDependency
org.apache.spark.ShuffleDependency
OneToOneDependency is what we usually call a narrow dependency. Its source in Spark:
/**
* :: DeveloperApi ::
* Represents a one-to-one dependency between partitions of the parent and child RDDs.
*/
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
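The dependency type can also be inspected programmatically. A sketch (assuming a local SparkContext `sc`) that pattern-matches on the Dependency subclass to tell narrow from wide dependencies:

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency}

// Assumes a SparkContext `sc`. map() produces a OneToOneDependency,
// so the first branch matches here.
val mapped = sc.makeRDD(List("a", "b")).map((_, 1))
mapped.dependencies.head match {
  case _: OneToOneDependency[_]      => println("narrow dependency")
  case _: ShuffleDependency[_, _, _] => println("wide (shuffle) dependency")
  case other                         => println(other)
}
```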
ShuffleDependency is what we usually call a wide dependency. Its constructor signature:
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false,
    val shuffleWriterProcessor: ShuffleWriteProcessor = new ShuffleWriteProcessor)
  extends Dependency[Product2[K, V]]
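The difference shows up in getParents: as the OneToOneDependency source above shows, a narrow dependency maps child partition i to exactly parent partition i, whereas a ShuffleDependency has no such one-to-one mapping because each child partition may read from every parent partition. A sketch (assuming a local SparkContext `sc`) querying a narrow dependency:

```scala
import org.apache.spark.NarrowDependency

// Assumes a SparkContext `sc`. map() yields a OneToOneDependency,
// a subclass of NarrowDependency, so getParents(i) returns List(i).
val mapped = sc.makeRDD(1 to 8, 4).map(_ * 2)
val dep = mapped.dependencies.head.asInstanceOf[NarrowDependency[_]]
println(dep.getParents(2)) // List(2): child partition 2 reads only parent partition 2
```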