一 RDD Dependencies
1 RDD Lineage
The relationship between two adjacent RDDs is called a dependency; a chain of multiple consecutive dependencies is called lineage.
RDDs support only coarse-grained transformations, i.e., single operations applied to large batches of records. Spark records the series of transformations used to create an RDD as its Lineage so that lost partitions can be recovered: an RDD's Lineage stores the RDD's metadata and the transformations that produced it, and when part of the RDD's partition data is lost, Spark can use this information to recompute and restore the lost partitions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Spark01_WordCount_Dep {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val sc = new SparkContext(conf)

    val lines: RDD[String] = sc.textFile("data/word.txt")
    println(lines.toDebugString)
    println("******************")
    /**
     * (2) data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
     * | data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
     */

    val words: RDD[String] = lines.flatMap(_.split(" "))
    println(words.toDebugString)
    println("******************")
    /**
     * (2) MapPartitionsRDD[2] at flatMap at Spark01_WordCount_Dep.scala:17 []
     * | data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
     * | data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
     */

    val wordToOne: RDD[(String, Int)] = words.map((_, 1))
    println(wordToOne.toDebugString)
    println("******************")
    /**
     * (2) MapPartitionsRDD[3] at map at Spark01_WordCount_Dep.scala:21 []
     * | MapPartitionsRDD[2] at flatMap at Spark01_WordCount_Dep.scala:17 []
     * | data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
     * | data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
     */

    val wordCount: RDD[(String, Int)] = wordToOne.reduceByKey(_ + _)
    println(wordCount.toDebugString) // "+-" marks a shuffle (data is written to disk); "(2)" is the partition count
    println("******************")
    /**
     * (2) ShuffledRDD[4] at reduceByKey at Spark01_WordCount_Dep.scala:25 []
     * +-(2) MapPartitionsRDD[3] at map at Spark01_WordCount_Dep.scala:21 []
     * | MapPartitionsRDD[2] at flatMap at Spark01_WordCount_Dep.scala:17 []
     * | data/word.txt MapPartitionsRDD[1] at textFile at Spark01_WordCount_Dep.scala:13 []
     * | data/word.txt HadoopRDD[0] at textFile at Spark01_WordCount_Dep.scala:13 []
     */

    wordCount.collect().foreach(println)
    sc.stop()
  }
}
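Reading the output: (2) is the number of partitions; the number in square brackets after each RDD name (e.g. MapPartitionsRDD[1]) is that RDD's id; each line prefixed with | is a parent in the lineage; and +- marks the shuffle boundary introduced by reduceByKey.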
2 RDD Dependencies
A dependency is simply the relationship between two adjacent RDDs.
RDD dependencies fall into two main categories:
- Narrow dependency (OneToOneDependency): one partition of the upstream (parent) RDD is used exclusively by one partition of the downstream (child) RDD. The case where the data of several upstream partitions is used exclusively by a single downstream partition also counts as a narrow dependency. Analogy: an only child.
- Wide dependency (ShuffleDependency): one partition of the upstream (parent) RDD is shared by multiple partitions of the downstream (child) RDD. Because a shuffle scatters partition data and regroups it, any shuffle implies a wide dependency (see the sketch below). Analogy: a family with two or three children.
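To make the distinction concrete, here is a minimal sketch (the object name DepKind and the classify helper are assumptions for illustration, not part of Spark's API) that pattern-matches on an RDD's direct dependencies: a ShuffleDependency signals a wide dependency, while NarrowDependency subclasses such as OneToOneDependency signal narrow ones.

import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DepKind {
  // Hypothetical helper: label each direct dependency of an RDD
  // as wide (ShuffleDependency) or narrow (a NarrowDependency subclass).
  def classify(rdd: RDD[_]): Seq[String] = rdd.dependencies.map {
    case _: ShuffleDependency[_, _, _] => "wide (ShuffleDependency)"
    case d: NarrowDependency[_]        => s"narrow (${d.getClass.getSimpleName})"
    case d                             => d.getClass.getSimpleName
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("DepKind"))
    val wordToOne = sc.textFile("data/word.txt").flatMap(_.split(" ")).map((_, 1))
    println(classify(wordToOne).mkString(", "))                    // narrow (OneToOneDependency)
    println(classify(wordToOne.reduceByKey(_ + _)).mkString(", ")) // wide (ShuffleDependency)
    sc.stop()
  }
}

The example below prints the same information by calling the dependencies method directly at each step.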
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
  val sc = new SparkContext(conf)

  val lines: RDD[String] = sc.textFile("data/word.txt")
  println(lines.dependencies)
  println("******************")
  // List(org.apache.spark.OneToOneDependency@18d910b3)

  val words: RDD[String] = lines.flatMap(_.split(" "))