Spark PruneDependency: Dependency Relationships (Filter)

License: CC 4.0 BY-SA

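This article looks at PruneDependency in Spark, in particular the dependency between a PartitionPruningRDD and its parent RDD. It uses a filterByRange example to show how a child RDD can contain only a subset of its parent's partitions, which avoids processing partitions that cannot hold matching data.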

PruneDependency represents a dependency between a PartitionPruningRDD and its parent. In this
case, the child RDD contains a subset of the parent's partitions.
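
The definition above can be made concrete with a small sketch. It assumes Spark is on the classpath and a local master is acceptable; the object name `PruneDependencySketch` and the helper `prune` are hypothetical names introduced here for illustration. It builds a `PartitionPruningRDD` directly with `PartitionPruningRDD.create` and inspects the resulting dependency.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

// Hypothetical helper object, not part of the original article.
object PruneDependencySketch {

  // Keep only the parent partitions whose index satisfies `keep`.
  def prune[T](parent: RDD[T], keep: Int => Boolean): RDD[T] =
    PartitionPruningRDD.create(parent, keep)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("prune-dependency-sketch")
    val sc = new SparkContext(conf)
    try {
      val parent = sc.parallelize(1 to 8, 4)        // 4 partitions
      val pruned = prune(parent, idx => idx < 2)    // keep partitions 0 and 1

      println(parent.getNumPartitions)              // 4
      println(pruned.getNumPartitions)              // 2
      // The child's single dependency on its parent is a PruneDependency.
      println(pruned.dependencies.head.getClass.getSimpleName)
      println(pruned.collect().mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```

The pruned partitions are never computed, which is the point of this dependency type: the child simply does not reference them.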

Video demo

https://youtu.be/5ZCNiEhO_Qg (YouTube)
https://www.bilibili.com/video/av37442139/?p=3 (bilibili)

Input data

List(("a",2),("d",1),("b",8),("d",3))

Scala program

package com.opensource.bigdata.spark.local.rdd.operation.dependency.narrow.n_03_pruneDependency.n_03_filterByRange_filter

import com.opensource.bigdata.spark.local.rdd.operation.base.BaseScalaSparkContext

object Run extends BaseScalaSparkContext{

  def main(args: Array[String]): Unit = {

    val sc = pre()
    val rdd1 = sc.parallelize(List(("a",2),("d",1),("b",8),("d",3)),2)  //ParallelCollectionRDD
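    // Keeps pairs whose key lies in the inclusive range ["a", "b"].
    // rdd1 has no RangePartitioner, so no partitions are pruned here and the
    // result is a plain MapPartitionsRDD produced by filter.
    val rdd2 = rdd1.filterByRange("a", "b")  // MapPartitionsRDD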

    println("rdd \n" + rdd2.collect().mkString("\n"))

    sc.stop()
  }

}
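
As a complement to the program above, the following sketch compares the lineage of `filterByRange` on an unpartitioned RDD with the lineage on an RDD that has first been sorted with `sortByKey`, which assigns a `RangePartitioner`. It assumes Spark is on the classpath and creates its own local `SparkContext` rather than using the article's `BaseScalaSparkContext`; the object name `FilterByRangeLineage` and the helper `parentOf` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of the original article.
object FilterByRangeLineage {

  // Simple class name of the RDD that `rdd` directly depends on.
  def parentOf(rdd: RDD[_]): String =
    rdd.dependencies.head.rdd.getClass.getSimpleName

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("filter-by-range-lineage")
    val sc = new SparkContext(conf)
    try {
      val data = List(("a", 2), ("d", 1), ("b", 8), ("d", 3))

      // No partitioner: filterByRange filters the RDD itself.
      val plain = sc.parallelize(data, 2)
      println(parentOf(plain.filterByRange("a", "b")))   // ParallelCollectionRDD

      // sortByKey range-partitions the data, so filterByRange can first drop
      // partitions that cannot contain keys in ["a", "b"].
      val sorted = plain.sortByKey()
      println(parentOf(sorted.filterByRange("a", "b")))  // PartitionPruningRDD
    } finally {
      sc.stop()
    }
  }
}
```

In both cases the outermost RDD is the `MapPartitionsRDD` created by `filter`; the difference is whether a `PartitionPruningRDD`, and therefore a `PruneDependency`, sits between it and the source data.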