spark-sql-catalyst
@(spark)[sql][catalyst]
In short, this module does the optimizer's work. There is a paper about it that is written very clearly and can be read as the high-level design. There is also a blog post covering much the same content.
Overall, what Catalyst does is essentially what a traditional relational database does:
1. parse (turn the SQL statement into a legal syntax tree)
2. resolve (verify that the columns, tables, and so on actually exist, and bind each table and column name to its schema)
3. generate the concrete logical plan (typical operators include Filter, Project, Sort, and Union); for details see catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
4. optimize the plan with a rule-based optimizer; the code lives in catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (see the sketch after this list)

In principle, Catalyst has no inherent tie to Spark and can be viewed as a standalone SQL optimizer.
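To make "rule-based" concrete, here is a minimal sketch of a rule, modeled on the CombineFilters rule in Optimizer.scala (the real file contains many more rules organized into batches; treat this as illustrative):

```scala
import org.apache.spark.sql.catalyst.expressions.And
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Merge two adjacent Filter operators into one whose condition is the
// conjunction of both predicates, saving one pass over the data.
object CombineFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(fc, Filter(nc, grandChild)) =>
      Filter(And(nc, fc), grandChild)
  }
}
```

The optimizer simply runs batches of such rules over the plan until it stops changing (a fixed point).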
types
Worth mentioning is the support for user-defined types:
```scala
/**
 * ::DeveloperApi::
 * The data type for User Defined Types (UDTs).
 *
 * This interface allows a user to make their own classes more interoperable with SparkSQL;
 * e.g., by creating a [[UserDefinedType]] for a class X, it becomes possible to create
 * a `DataFrame` which has class X in the schema.
 *
 * For SparkSQL to recognize UDTs, the UDT must be annotated with
 * [[SQLUserDefinedType]].
 *
 * The conversion via `serialize` occurs when instantiating a `DataFrame` from another RDD.
 * The conversion via `deserialize` occurs when reading from a `DataFrame`.
 */
@DeveloperApi
abstract class UserDefinedType[UserType] extends DataType with Serializable {
```
Let's look at an example:
```scala
// Adapted from the Catalyst paper; signatures adjusted to match the
// actual UserDefinedType API, which also requires sqlType and userClass.
class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType = StructType(Seq( // our native structure
    StructField("x", DoubleType),
    StructField("y", DoubleType)))
  override def serialize(obj: Any): Row = obj match {
    case p: Point => Row(p.x, p.y)
  }
  override def deserialize(datum: Any): Point = datum match {
    case r: Row => Point(r.getDouble(0), r.getDouble(1))
  }
  override def userClass: Class[Point] = classOf[Point]
}
```
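For Spark SQL to actually recognize the UDT, the user class itself must carry the [[SQLUserDefinedType]] annotation mentioned in the scaladoc above; a minimal sketch (the Point class here is hypothetical):

```scala
// Wires the hypothetical Point class to PointUDT via the annotation.
@SQLUserDefinedType(udt = classOf[PointUDT])
case class Point(x: Double, y: Double)
```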
Decimal
As for the dreaded decimal type, there is a dedicated class that optimizes it:
```scala
/**
 * A mutable implementation of BigDecimal that can hold a Long if values are small enough.
 *
 * The semantics of the fields are as follows:
 * - _precision and _scale represent the SQL precision and scale we are looking for
 * - If decimalVal is set, it represents the whole decimal value
 * - Otherwise, the decimal value is longVal / (10 ** _scale)
 */
final class Decimal extends Ordered[Decimal] with Serializable {
```
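A tiny self-contained sketch of the trick the scaladoc describes: keep the unscaled value in a Long and only materialize a BigDecimal on demand. This mirrors the idea only, not Spark's actual fields or methods:

```scala
// Illustrative only: a long-backed decimal that avoids allocating a
// BigDecimal object for values that fit in 64 bits.
final class SimpleDecimal(private val longVal: Long, private val scale: Int) {
  // The logical value is longVal / 10^scale, computed lazily.
  def toBigDecimal: BigDecimal = BigDecimal(longVal) / BigDecimal(10).pow(scale)
  override def toString: String = toBigDecimal.toString
}

// new SimpleDecimal(12345, 2) represents 123.45 without creating a
// BigDecimal until someone actually asks for one.
```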
Metadata
```scala
/**
 * :: DeveloperApi ::
 *
 * Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,
 * Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and
 * Array[Metadata]. JSON is used for serialization.
 *
 * The default constructor is private. User should use either [[MetadataBuilder]] or
 * [[Metadata.fromJson()]] to create Metadata instances.
 *
 * @param map an immutable map that stores the data
 */
@DeveloperApi
sealed class Metadata private[types] (private[types] val map: Map[String, Any])
  extends Serializable {
```
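As the scaladoc says, instances come from MetadataBuilder or Metadata.fromJson; a quick usage sketch (the builder methods follow the put<Type> pattern, to the best of my knowledge):

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

val md: Metadata = new MetadataBuilder()
  .putString("comment", "user id column")
  .putLong("maxLength", 32L)
  .build()

// Round-trips through JSON, per the scaladoc.
val copy = Metadata.fromJson(md.json)
```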
Points to note:
- Read the parser's documentation carefully, especially the operators.
- In regular expressions, `(?i)` turns case-insensitive mode on and `(?-i)` turns it off; see the snippet after this list.
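A quick demonstration of the inline flag in Scala (this is standard java.util.regex behavior, nothing Catalyst-specific):

```scala
// "(?i)" makes matching case-insensitive from that point onward.
val keyword = "(?i)select".r
keyword.findFirstIn("SELECT * FROM t") // Some("SELECT")
keyword.findFirstIn("select * from t") // Some("select")
```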
Tree
The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class. These objects are immutable and can be manipulated using functional transformations, as discussed in the next subsection.
```scala
abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
  self: BaseType with Product =>
```
TreeNode defines a large number of traversal, map, copy, and transform methods. For example:
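The canonical illustration from the Catalyst paper is constant folding expressed as a functional transformation; note the paper simplifies Literal, which in real Spark also carries a DataType:

```scala
// Rewrite every Add of two literals into one pre-computed Literal.
// transform applies the partial function wherever it matches and
// returns a new, rebuilt tree (TreeNodes are immutable).
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
```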
Expression
Expression is a huge branch of the code base; anyone who has worked on databases knows how complex this beast is. Allow me to paste the Expression base class here; its concrete subclasses will not be discussed further in this article.
```scala
abstract class Expression extends TreeNode[Expression] {
  self: Product =>

  /** The narrowest possible type that is produced when this expression is evaluated. */
  type EvaluatedType <: Any

  /**
   * Returns true when an expression is a candidate for static evaluation before the query is
   * executed.
   *
   * The following conditions are used to determine suitability for constant folding:
   * - A [[Coalesce]] is foldable if all of its children are foldable
   * - A [[BinaryExpression]] is foldable if its both left and right child are foldable
   * - A [[Not]], [[IsNull]], or [[IsNotNull]] is foldable if its child is foldable
   * - A [[Literal]] is foldable
   * - A [[Cast]] or [[UnaryMinus]] is foldable if its child is foldable
   */
  def foldable: Boolean = false

  def nullable: Boolean

  def references: AttributeSet = AttributeSet(children.flatMap(_.references.iterator))

  /** Returns the result of evaluating this expression on a given input Row */
  def eval(input: Row = null): EvaluatedType

  /**
   * Returns `true` if this expression and all its children have been resolved to a specific schema
   * and `false` if it still contains any unresolved placeholders. Implementations of expressions
   * should override this if the resolution of this type of expression involves more than just
   * the resolution of its children.
   */
  lazy val resolved: Boolean = childrenResolved

  /**
   * Returns the [[DataType]] of the result of evaluating this expression. It is
   * invalid to query the dataType of an unresolved expression (i.e., when `resolved` == false).
   */
  def dataType: DataType

  /**
   * Returns true if all the children of this expression have been resolved to a specific schema
   * and false if any still contains any unresolved placeholders.
   */
  def childrenResolved: Boolean = !children.exists(!_.resolved)

  /**
   * Returns a string representation of this expression that does not have developer centric
   * debugging information like the expression id.
   */
  def prettyString: String = {
    transform {
      case a: AttributeReference => PrettyAttribute(a.name)
      case u: UnresolvedAttribute => PrettyAttribute(u.name)
    }.toString
  }
}
```
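To make the contract above concrete without wading into Spark's real subclasses, here is a tiny self-contained model of foldable and eval; the names echo Catalyst, but this is not Spark code:

```scala
// A drastically simplified expression tree over Int-valued rows.
sealed trait MiniExpr {
  def foldable: Boolean
  def eval(input: Map[String, Int]): Int
}
case class MiniLiteral(v: Int) extends MiniExpr {
  val foldable = true // a constant is always statically evaluable
  def eval(input: Map[String, Int]): Int = v
}
case class MiniAttr(name: String) extends MiniExpr {
  val foldable = false // depends on the input row, never constant
  def eval(input: Map[String, Int]): Int = input(name)
}
case class MiniAdd(left: MiniExpr, right: MiniExpr) extends MiniExpr {
  def foldable: Boolean = left.foldable && right.foldable
  def eval(input: Map[String, Int]): Int = left.eval(input) + right.eval(input)
}
```

An optimizer can evaluate any foldable subtree once at planning time and replace it with a literal, which is exactly what constant folding in Optimizer.scala does to real Expressions.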
DSL
catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala defines a large number of implicit conversions that support the DSL.
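A taste of what those implicits buy you, in the style used throughout Catalyst's test suite (exact imports can vary between Spark versions, so treat this as a sketch):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

// 'a.int expands to an AttributeReference of IntegerType, and the
// operators build Expression/LogicalPlan trees rather than evaluating.
val relation = LocalRelation('a.int, 'b.string)
val plan = relation.where('a === 1).select('b)
```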
SqlLexical
```scala
class SqlLexical extends StdLexical {
```
SqlParser
```scala
/**
 * A very simple SQL parser. Based loosely on:
 * https://github.com/stephentu/scala-sql-parser/blob/master/src/main/scala/parser.scala
 *
 * Limitations:
 * - Only supports a very limited subset of SQL.
 *
 * This is currently included mostly for illustrative purposes. Users wanting more complete support
 * for a SQL like language should checkout the HiveQL support in the sql/hive sub-project.
 */
class SqlParser extends AbstractSparkSQLParser with DataTypeParser {
```
Including comments the file is only 386 lines. Of course it does not cover full SQL, but that is acceptable; it is quite concise as parsers go.
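Presumably it is driven along these lines; the parse entry point is inherited from AbstractSparkSQLParser, and the result is an unresolved LogicalPlan that the analyzer binds later (a sketch; visibility and method names may differ by version):

```scala
import org.apache.spark.sql.catalyst.SqlParser

val parser = new SqlParser
// Yields an unresolved LogicalPlan; column/table binding happens
// later during resolution, as described at the top of this article.
val plan = parser.parse("SELECT a, b FROM t WHERE a > 1")
```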
plans
QueryPlan
```scala
abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanType] {
```
The base class of all plans.
JoinType
```scala
sealed abstract class JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType
case object LeftSemi extends JoinType
```
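Because the hierarchy is sealed, code that dispatches on JoinType can be checked for exhaustiveness by the compiler; a hypothetical helper to illustrate (the descriptions are standard SQL join semantics, not Spark code):

```scala
// The compiler warns if a sealed case is missing from this match.
def describe(joinType: JoinType): String = joinType match {
  case Inner      => "keep only rows that match on both sides"
  case LeftOuter  => "keep all left rows, null-filling the right side"
  case RightOuter => "keep all right rows, null-filling the left side"
  case FullOuter  => "keep all rows from both sides"
  case LeftSemi   => "keep left rows that have a match, emitting only left columns"
}
```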