How does a SparkPlan actually execute, that is, how is it turned into an RDD[Row]? Start with a piece of user code:
SQLContext sqlContext = new SQLContext(jsc); // jsc is an existing JavaSparkContext
DataFrame dataFrame = sqlContext.parquetFile(parquetPath);
dataFrame.registerTempTable(source); // source holds the temp table name, "test" in this example
String sql = " select SUM(id) from test group by dev_chnid ";
DataFrame result = sqlContext.sql(sql);
log.info("Result:" + result.collect()); // collect triggers the action
DataFrame.collect delegates to the executedPlan:
override def collect(): Array[Row] = {
  val ret = queryExecution.executedPlan.executeCollect() // run the executedPlan's executeCollect
  ret
}
executeCollect calls execute() to produce the RDD, converts each internal Catalyst row back to a Scala Row, and then collects:
def executeCollect(): Array[Row] = {
  execute().mapPartitions { iter =>
    val converter = CatalystTypeConverters.createToScalaConverter(schema)
    iter.map(converter(_).asInstanceOf[Row])
  }.collect() // what ultimately runs is the executedPlan's execute, i.e. SparkPlan's execute
}
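Note the mapPartitions pattern used here: the converter is created once per partition and then applied row by row, rather than being rebuilt for every row. A generic, self-contained sketch of the same pattern (the names are illustrative, not from Spark's source):
import org.apache.spark.rdd.RDD

// Hypothetical example: build a (possibly expensive) converter once per
// partition, then apply it to every element of that partition.
def convertPerPartition(rdd: RDD[Int]): RDD[String] =
  rdd.mapPartitions { iter =>
    val converter: Int => String = i => "value-" + i // created once per partition
    iter.map(converter)
  }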
RDD.collect then submits the job, gathering one array per partition and concatenating them:
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
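For reference, sc.runJob applies the given function to every partition and returns one result per partition; Array.concat then flattens them into the final array. A minimal standalone illustration (local mode, hypothetical app name):
import org.apache.spark.{SparkConf, SparkContext}

object CollectDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("collect-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, 2)
    // One Array[Int] per partition, the same shape RDD.collect builds internally.
    val perPartition: Array[Array[Int]] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.toArray)
    println(Array.concat(perPartition: _*).mkString(",")) // 1,2,...,10
    sc.stop()
  }
}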
Now look at SparkPlan's execute function:
abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable {
  ……
  final def execute(): RDD[Row] = {
    RDDOperationScope.withScope(sparkContext, nodeName, false, true) {
      doExecute() // delegates to the concrete SparkPlan's doExecute
    }
  }
  ……
}
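What a concrete doExecute does varies by operator. The simplest case is the PhysicalRDD leaf node that appears at the bottom of the plan below; in Spark 1.x it looks roughly like this, simply handing back the RDD it wraps:
// Sketch of Spark 1.x's PhysicalRDD: the simplest possible doExecute,
// a leaf node that returns the RDD it already holds.
private[sql] case class PhysicalRDD(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
  protected override def doExecute(): RDD[Row] = rdd
}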
As you can see, every concrete SparkPlan implements a doExecute function whose output is an RDD[Row]. Take the statement select SUM(id) from test group by dev_chnid as an example; its executedPlan is:
Aggregate false, [dev_chnid#0], [CombineSum(PartialSum#45L) AS c0#43L]
 Exchange (HashPartitioning 200)
  Aggregate true, [dev_chnid#0], [dev_chnid#0,SUM(id#17L) AS PartialSum#45L]
   PhysicalRDD [dev_chnid#0,id#17L], MapPartitionsRDD
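This tree can be printed straight from the DataFrame; both calls below are standard Spark 1.x API:
val result = sqlContext.sql("select SUM(id) from test group by dev_chnid")
result.explain()                            // prints the physical plan
println(result.queryExecution.executedPlan) // the executedPlan tree shown above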
First look at the doExecute function of Aggregate false, [dev_chnid#0], [CombineSum(PartialSum#45L) AS c0#43L]. The leading Boolean is Aggregate's partial flag: Aggregate true computes per-partition partial sums before the Exchange, and Aggregate false merges those partial results after the shuffle:
protected override def doExecute(): RDD[Row] = attachTree(this, "execute") {
  if (groupingExpressions.isEmpty) { // no GROUP BY: compute a single global aggregate
    child.execute().mapPartitions { iter => // first run the child plan's execute
      val buffer = newAggregateBuffer()
      var currentRow: Row = null
      while (iter.hasNext) {
        currentRow = iter.next()
        var i = 0
        while (i < buffer.length) { // feed every row into each aggregate function
          buffer(i).update(currentRow)
          i += 1
        }
      }
      val resultProjection = new InterpretedProjection(resultExpressions, computedSchema)
      val aggregateResults = new GenericMutableRow(computedAggregates.length)
      var i = 0
      while (i < buffer.length) {
        aggregateResults(i) = buffer(i).eval(EmptyRow)
        i += 1
      }
      Iterator(resultProjection(aggregateResults))
    }
  } else {
    child.execute().mapPartitions { iter => // first run the child plan's execute
      val hashTable = new HashMap[Row, Array[AggregateFunction]]
      // groupingExpressions = [dev_chnid#0]
      // child.output = [dev_chnid#0,id#17L]
      val groupingProjection = new InterpretedMutableProjection(groupingExpressions, child.output)