Spark source code analysis: Catalyst (draft)

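This article looks at how the data structures built during SQLContext.createSchemaRDD affect the simplification and optimization of the logical plan, and traces where patterns such as Literal(true, BooleanType) come from and what they are used for.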

object Optimizer extends RuleExecutor[LogicalPlan] {
  val batches =
    Batch("ConstantFolding", Once,
      ConstantFolding,
      BooleanSimplification,
      SimplifyFilters,
      SimplifyCasts) ::
    Batch("Filter Pushdown", Once,
      CombineFilters,
      PushPredicateThroughProject,
      PushPredicateThroughInnerJoin,
      ColumnPruning) :: Nil
}

SimplifyFilters

object SimplifyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, BooleanType), child) =>
      child
    case Filter(Literal(null, _), child) =>
      LocalRelation(child.output)
    case Filter(Literal(false, BooleanType), child) =>
      LocalRelation(child.output)
  }
}
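This rule removes filters whose condition is already a constant. When the condition is the literal true, the Filter is dropped and its child is returned directly; when the condition is false or null, no row can pass, so the whole subtree is replaced by an empty LocalRelation that keeps only child.output. So where do patterns such as Literal(true, BooleanType) come from? The batches in Optimizer give the answer: they are produced by BooleanSimplification, the rule that runs just before SimplifyFilters in the same "ConstantFolding" batch (it is a rule inside that batch, not a separate batch).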
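To make the hand-off between the two rules concrete, here is a minimal sketch that does not depend on Spark. All the types in it (Lit, Attr, And, Or, Relation, EmptyRelation, Filter) are toy stand-ins invented for this illustration, not the Catalyst classes, and the rewrite cases only approximate what BooleanSimplification and SimplifyFilters do.

```scala
// Toy stand-ins for Catalyst's expression and plan trees (not the real classes).
object ToyCatalyst {
  sealed trait Expr
  case class Lit(value: Any) extends Expr          // plays the role of Literal(value, type)
  case class Attr(name: String) extends Expr       // a column reference
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  sealed trait Plan
  case class Relation(name: String) extends Plan
  case object EmptyRelation extends Plan            // plays the role of LocalRelation(child.output)
  case class Filter(condition: Expr, child: Plan) extends Plan

  // Bottom-up boolean simplification: a constant operand either
  // disappears or decides the whole expression.
  def simplifyBoolean(e: Expr): Expr = e match {
    case And(l, r) =>
      (simplifyBoolean(l), simplifyBoolean(r)) match {
        case (Lit(true), other) => other
        case (other, Lit(true)) => other
        case (Lit(false), _)    => Lit(false)
        case (_, Lit(false))    => Lit(false)
        case (a, b)             => And(a, b)
      }
    case Or(l, r) =>
      (simplifyBoolean(l), simplifyBoolean(r)) match {
        case (Lit(true), _)      => Lit(true)
        case (_, Lit(true))      => Lit(true)
        case (Lit(false), other) => other
        case (other, Lit(false)) => other
        case (a, b)              => Or(a, b)
      }
    case other => other
  }

  // Same three cases as SimplifyFilters: true keeps the child,
  // false and null leave nothing to return.
  def simplifyFilters(p: Plan): Plan = p match {
    case Filter(Lit(true), child)                     => simplifyFilters(child)
    case Filter(Lit(false), _) | Filter(Lit(null), _) => EmptyRelation
    case Filter(c, child)                             => Filter(c, simplifyFilters(child))
    case other                                        => other
  }

  // Run the two steps in the same order as the batch.
  def optimize(p: Plan): Plan = {
    def rewriteConditions(q: Plan): Plan = q match {
      case Filter(c, child) => Filter(simplifyBoolean(c), rewriteConditions(child))
      case other            => other
    }
    simplifyFilters(rewriteConditions(p))
  }
}
```

With these definitions, a filter on `a OR true` collapses to the bare relation, while a filter on `a AND false` collapses to the empty relation; a filter on a plain column is left untouched.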

SQLContext.createSchemaRDD(RDD<A>, TypeTag<A>) line: 90
BaiJoin$.main(String[]) line: 26
BaiJoin.main(String[]) line: not available

Look at this frame: SQLContext.createSchemaRDD(RDD<A>, TypeTag<A>)
The breakpoint was stopped on the new SchemaRDD line:
implicit def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A]) =
  new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
At that point the debugger's Variables view showed this variable: evidence$1 TypeTags$TypeTagImpl<T> (id=107)
Its value is TypeTag[com.ailk.test.sql.tb], so A can be taken to be com.ailk.test.sql.tb (a case class).
rdd, in turn, is:
MappedRDD[2] at map at BaiJoin.scala:16
MappedRDD[1] at textFile at BaiJoin.scala:16
HadoopRDD[0] at textFile at BaiJoin.scala:16
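The evidence$1 variable seen in the debugger is the implicit parameter that the Scala 2 compiler generates for the `: TypeTag` context bound. The sketch below shows the two equivalent forms side by side. It assumes Scala 2 with scala-reflect on the classpath; the Tb case class and both method names are hypothetical stand-ins, not part of Spark.

```scala
import scala.reflect.runtime.universe._

object ContextBoundDemo {
  // Hypothetical stand-in for com.ailk.test.sql.tb.
  case class Tb(id: Int, name: String)

  // Context-bound form, written the same way as createSchemaRDD.
  def describe[A <: Product : TypeTag](rows: Seq[A]): String =
    typeOf[A].typeSymbol.name.toString

  // Roughly what the compiler turns it into: an extra implicit
  // parameter list carrying the TypeTag (named evidence$1 by the compiler).
  def describeExplicit[A <: Product](rows: Seq[A])(implicit ev: TypeTag[A]): String =
    ev.tpe.typeSymbol.name.toString
}
```

Both methods recover the static type A at runtime even though the elements of `rows` are never inspected, which is why the debugger can show TypeTag[com.ailk.test.sql.tb] before any row has been read.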

def fromProductRdd[A <: Product : TypeTag](productRdd: RDD[A]) = {
  ExistingRdd(ScalaReflection.attributesFor[A], productToRowRdd(productRdd))
}
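This takes every field of A and collects them into a list, which is exactly the set of columns defined by com.ailk.test.sql.tb.
So ScalaReflection.attributesFor[A] yields a Seq[Attribute], which becomes the output of ExistingRdd, and ExistingRdd's execute simply returns the RDD[Row] it was given: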
case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
  override def execute() = rdd
}
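As a rough illustration of how a column list can be derived from a case class, the sketch below reads the primary constructor's parameter names through Scala runtime reflection. It is not the ScalaReflection.attributesFor implementation; it assumes Scala 2.11 or later with scala-reflect available, and Tb and fieldNames are hypothetical names.

```scala
import scala.reflect.runtime.universe._

object FieldListDemo {
  // Hypothetical stand-in for com.ailk.test.sql.tb.
  case class Tb(id: Int, name: String)

  // Names of the primary-constructor parameters, in declaration order.
  // For a case class these are its fields, i.e. the would-be columns.
  def fieldNames[A <: Product : TypeTag]: Seq[String] =
    typeOf[A].typeSymbol.asClass.primaryConstructor.asMethod
      .paramLists.head.map(_.name.toString)
}
```

Calling `FieldListDemo.fieldNames[FieldListDemo.Tb]` gives the two field names `id` and `name`; the real code additionally maps each field's Scala type to a Catalyst data type.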
productToRowRdd takes an RDD[A] as input and returns an RDD[Row]:
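def productToRowRdd[A <: Product](data: RDD[A]): RDD[Row] = {
  data.mapPartitions { iterator =>
    if (iterator.isEmpty) {
      Iterator.empty
    } else {
      // Peek at the first element to learn the number of fields.
      val bufferedIterator = iterator.buffered
      // One row object is allocated per partition and reused for every record.
      val mutableRow = new GenericMutableRow(bufferedIterator.head.productArity)

      bufferedIterator.map { r =>
        // Copy the fields of the current product into the shared row.
        var i = 0
        while (i < mutableRow.length) {
          mutableRow(i) = r.productElement(i)
          i += 1
        }

        mutableRow
      }
    }
  }
}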
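One consequence of reusing a single mutable row per partition is that every element the iterator returns is the same object. The sketch below reproduces the pattern with plain Scala collections, using Array[Any] in place of GenericMutableRow; Tb and toRows are hypothetical names and nothing here touches Spark.

```scala
object MutableRowDemo {
  // Hypothetical stand-in for com.ailk.test.sql.tb.
  case class Tb(id: Int, name: String)

  // Same shape as productToRowRdd's per-partition function,
  // with Array[Any] standing in for GenericMutableRow.
  def toRows(data: Iterator[Product]): Iterator[Array[Any]] = {
    if (data.isEmpty) {
      Iterator.empty
    } else {
      val buffered = data.buffered
      val row = new Array[Any](buffered.head.productArity)
      buffered.map { r =>
        var i = 0
        while (i < row.length) {
          row(i) = r.productElement(i)
          i += 1
        }
        row // the same array instance every time
      }
    }
  }
}
```

Consuming each row before asking for the next one (for example `toRows(it).map(_.toList).toList`) gives the expected values, but materializing the iterator first (`toRows(it).toList`) leaves a list whose entries all point at the last record. Downstream code therefore has to copy a row if it wants to hold on to it.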

/////////////////////////////////////////////////////////////////////
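heap, JIT compiler, GC
dfs3
Memory allocation must be an atomic operation. Per-thread mode: TLAB (one buffer for each thread); free list; bump-the-pointer
Copying algorithm
What gets copied into S0 and S1 are the objects that survived in Eden
Mark-sweep algorithm: memory fragmentation
Mark-compact algorithm: heavy memory copying

Choice of roots: class, thread, stack local, JNI local, monitor, "held by JVM"
dfs3 marking method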
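The note about allocation being atomic and bump-the-pointer can be sketched as follows. This is a toy model over integer offsets, not how any JVM implements its heap: BumpAllocator is a hypothetical class, and the compare-and-set loop stands in for the synchronization that a shared allocation pointer needs (a TLAB avoids that cost by giving each thread its own region to bump).

```scala
import java.util.concurrent.atomic.AtomicInteger

object BumpPointerDemo {
  // Toy shared region of `capacity` units; `top` is the bump pointer.
  final class BumpAllocator(capacity: Int) {
    private val top = new AtomicInteger(0)

    // Returns the start offset of the new block, or -1 when the region is full.
    // The compare-and-set retries if another thread moved `top` in between.
    @annotation.tailrec
    final def allocate(size: Int): Int = {
      val old = top.get()
      val next = old + size
      if (next > capacity) -1
      else if (top.compareAndSet(old, next)) old
      else allocate(size)
    }
  }
}
```

Allocation is just an addition plus one atomic update, which is why bump-the-pointer is cheap compared with searching a free list, and why it pairs naturally with copying or compacting collectors that leave free space contiguous.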