Lineage and Dependencies in Spark

This article walks through how RDDs in Spark record their dependencies. Following the transformations from `fileRDD` to `resultRDD`, it shows the difference between narrow dependencies (one-to-one dependencies) and wide dependencies (ShuffleDependency), and uses the `dependencies` property to reveal each RDD's ancestry, which helps in understanding the resilience and efficiency of the data-processing flow.


An RDD records the dependencies that precede it so that, if something goes wrong during one stage of execution and the data of some partitions is lost, the lost data can still be recovered. This is another aspect of an RDD's resilience: when data goes missing, it can be recomputed from its lineage.
Every RDD records the lineage that came before it; calling rdd.toDebugString returns its full lineage.

import org.apache.spark.rdd.RDD

// sc is an already-created SparkContext
val fileRDD: RDD[String] = sc.makeRDD(List("scala", "Spark", "Spark", "scala", "hello"))
println(fileRDD.toDebugString)
println("----------------------")
val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
println(wordRDD.toDebugString)
println("----------------------")
val mapRDD: RDD[(String, Int)] = wordRDD.map((_, 1))
println(mapRDD.toDebugString)
println("----------------------")
val resultRDD: RDD[(String, Int)] = mapRDD.reduceByKey(_ + _)
println(resultRDD.toDebugString)
resultRDD.collect()

The output is as follows:

(4) ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []
----------------------
(4) MapPartitionsRDD[1] at flatMap at DependesesTest.scala:15 []
 |  ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []
----------------------
(4) MapPartitionsRDD[2] at map at DependesesTest.scala:18 []
 |  MapPartitionsRDD[1] at flatMap at DependesesTest.scala:15 []
 |  ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []
----------------------
(4) ShuffledRDD[3] at reduceByKey at DependesesTest.scala:21 []
 +-(4) MapPartitionsRDD[2] at map at DependesesTest.scala:18 []
    |  MapPartitionsRDD[1] at flatMap at DependesesTest.scala:15 []
    |  ParallelCollectionRDD[0] at makeRDD at DependesesTest.scala:12 []

As the output shows, every RDD keeps a record of the lineage that precedes it.
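The tree printed by toDebugString can also be reconstructed by hand by following each RDD's dependencies property, which is introduced next. A minimal sketch (the helper name printLineage is purely illustrative):

import org.apache.spark.rdd.RDD

// Walk the lineage by following the parent RDD of every dependency,
// printing one line per ancestor (similar in spirit to toDebugString).
def printLineage(rdd: RDD[_], indent: String = ""): Unit = {
  println(s"$indent${rdd.getClass.getSimpleName}[${rdd.id}]")
  rdd.dependencies.foreach(dep => printLineage(dep.rdd, indent + "  "))
}

printLineage(resultRDD) // ShuffledRDD[3] -> MapPartitionsRDD[2] -> MapPartitionsRDD[1] -> ParallelCollectionRDD[0]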

RDD dependencies: each RDD records its relationship to the RDD immediately before it.

val fileRDD: RDD[String] = sc.makeRDD(List("scala", "Spark", "Spark", "scala", "hello"))
println(fileRDD.dependencies)
println("----------------------")
val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
println(wordRDD.dependencies)
println("----------------------")
val mapRDD: RDD[(String, Int)] = wordRDD.map((_, 1))
println(mapRDD.dependencies)
println("----------------------")
val resultRDD: RDD[(String, Int)] = mapRDD.reduceByKey(_ + _)
println(resultRDD.dependencies)
resultRDD.collect()

The output is as follows:

List()
----------------------
List(org.apache.spark.OneToOneDependency@1be59f28)
----------------------
List(org.apache.spark.OneToOneDependency@253b380a)
----------------------
List(org.apache.spark.ShuffleDependency@23f3dbf0)

The dependencies here come in two kinds:
org.apache.spark.OneToOneDependency
org.apache.spark.ShuffleDependency

OneToOneDependency is what we call a narrow dependency.

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.
 */
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
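Since both kinds ultimately extend Dependency, an RDD's dependencies can be classified by pattern matching. A minimal sketch (the helper name describeDeps is purely illustrative):

import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD

// Report whether each dependency of an RDD is narrow or wide (shuffle).
def describeDeps(rdd: RDD[_]): Unit =
  rdd.dependencies.foreach {
    case _: ShuffleDependency[_, _, _] => println("wide dependency (shuffle)")
    case _: NarrowDependency[_]        => println("narrow dependency")
    case other                         => println(s"other: ${other.getClass.getName}")
  }

describeDeps(mapRDD)    // narrow dependency
describeDeps(resultRDD) // wide dependency (shuffle)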

ShuffleDependency is what is usually called a wide dependency.

@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false,
    val shuffleWriterProcessor: ShuffleWriteProcessor = new ShuffleWriteProcessor)
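The extra constructor parameters are what the shuffle needs at runtime (the partitioner, serializer, optional map-side aggregator, and so on). A minimal sketch, assuming the resultRDD from the example above, that matches its only dependency as a ShuffleDependency and prints a couple of these fields:

import org.apache.spark.ShuffleDependency

resultRDD.dependencies.head match {
  case dep: ShuffleDependency[_, _, _] =>
    println(dep.partitioner.numPartitions) // 4, taken over from the parent RDD's partition count
    println(dep.mapSideCombine)            // true: reduceByKey combines values on the map side
  case _ =>
    println("not a shuffle dependency")
}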