dependencies
The `dependencies` method returns the list of dependencies this RDD has on its parent RDD(s), which tells you the dependency type (narrow vs. shuffle).
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
println("rdd1 dependencies: " + rdd1.dependencies)
val rdd2 = rdd1.map(x => (x, 1))
println("rdd2 dependencies: " + rdd2.dependencies)
val rdd3 = rdd2.reduceByKey(_ + _)
println("rdd3 dependencies: " + rdd3.dependencies)
val rdd4 = rdd3.groupByKey()
println("rdd4 dependencies: " + rdd4.dependencies)
Output:
rdd1 dependencies: List()
rdd2 dependencies: List(org.apache.spark.OneToOneDependency@60e949e1)
rdd3 dependencies: List(org.apache.spark.ShuffleDependency@57ce634f)
rdd4 dependencies: List(org.apache.spark.OneToOneDependency@6f3f0fae)

Note that rdd4 shows a OneToOneDependency even though groupByKey normally shuffles: rdd3 is already hash-partitioned by reduceByKey, so groupByKey reuses that partitioner and no new shuffle is needed.
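Beyond printing the dependency objects, you can pattern-match on their types to classify each dependency as narrow or wide. A minimal, self-contained sketch (the object name `DepInspect` and the `local[*]` master are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.{NarrowDependency, ShuffleDependency}

object DepInspect {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("DepInspect"))

    val rdd = sc.parallelize(List(1, 2, 3))
      .map(x => (x, 1))
      .reduceByKey(_ + _)

    // Walk the dependency list and classify each entry.
    rdd.dependencies.foreach {
      case _: ShuffleDependency[_, _, _] => println("wide (shuffle) dependency")
      case _: NarrowDependency[_]        => println("narrow dependency")
      case other                         => println(s"other: $other")
    }

    sc.stop()
  }
}
```

Here the reduceByKey step should be reported as a wide (shuffle) dependency, since its parent is not partitioned yet.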
toDebugString
The `toDebugString` method prints this RDD's lineage: the full chain of parent RDDs it was derived from.
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
println("rdd1 lineage: " + rdd1.toDebugString)
println("--------------------------------------------------------------------------------")
val rdd2 = rdd1.map(x => (x, 1))
println("rdd2 lineage: " + rdd2.toDebugString)
println("--------------------------------------------------------------------------------")
val rdd3 = rdd2.reduceByKey(_ + _)
println("rdd3 lineage: " + rdd3.toDebugString)
println("--------------------------------------------------------------------------------")
val rdd4 = rdd3.groupByKey()
println("rdd4 lineage: " + rdd4.toDebugString)
Output:
rdd1 lineage: (1) ParallelCollectionRDD[0] at parallelize at CheckPoint.scala:14 []
--------------------------------------------------------------------------------
rdd2 lineage: (1) MapPartitionsRDD[1] at map at CheckPoint.scala:19 []
 |  ParallelCollectionRDD[0] at parallelize at CheckPoint.scala:14 []
--------------------------------------------------------------------------------
rdd3 lineage: (1) ShuffledRDD[2] at reduceByKey at CheckPoint.scala:24 []
 +-(1) MapPartitionsRDD[1] at map at CheckPoint.scala:19 []
    |  ParallelCollectionRDD[0] at parallelize at CheckPoint.scala:14 []
--------------------------------------------------------------------------------
rdd4 lineage: (1) MapPartitionsRDD[3] at groupByKey at CheckPoint.scala:29 []
 |  ShuffledRDD[2] at reduceByKey at CheckPoint.scala:24 []
 +-(1) MapPartitionsRDD[1] at map at CheckPoint.scala:19 []
    |  ParallelCollectionRDD[0] at parallelize at CheckPoint.scala:14 []

The `ShuffledRDD` entry marks the RDD at which a shuffle occurred; `toDebugString` sets the shuffle boundary off with the `+-(1)` prefix and an extra indentation level, which corresponds to a new stage. The `(1)` shows the number of partitions at that point.
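As the dependencies output above showed, calling groupByKey on the result of reduceByKey yields a OneToOneDependency rather than a ShuffleDependency, because the data is already hash-partitioned by key. The sketch below makes this partitioner reuse visible (the object name `PartitionerReuse` and the `local[*]` master are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionerReuse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("PartitionerReuse"))

    val pairs   = sc.parallelize(List(1, 2, 3, 4)).map(x => (x % 2, x))
    val reduced = pairs.reduceByKey(_ + _) // shuffles; result is hash-partitioned
    val grouped = reduced.groupByKey()     // same partitioner -> no new shuffle

    println(reduced.partitioner)   // expected: Some(HashPartitioner)
    println(grouped.dependencies)  // expected: a OneToOneDependency, not a ShuffleDependency

    sc.stop()
  }
}
```

This is one reason lineage inspection matters for performance: chaining key-based operations that share a partitioner avoids repeated shuffles.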
Summary
This post examined the dependencies and lineage of Spark RDDs (Resilient Distributed Datasets). The `dependencies` method reveals the type of dependency between an RDD and its parent, such as one-to-one or shuffle dependencies, while `toDebugString` prints the lineage graph and exposes where shuffles occur in a computation. Understanding both is essential for reasoning about data flow and optimizing Spark job performance.