spark-sql-catalyst

Licensed under CC 4.0 BY-SA.

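The Spark SQL Catalyst module is mainly responsible for SQL parsing and optimization, including parse (syntax analysis), resolve (validating and binding references), and generating and optimizing the logical plan. This article touches on the optimized Decimal type, some caveats around Metadata, and the use of TreeNode in expression trees and plan trees. It also introduces Expression, the DSL, SqlParser, and the JoinType values used in query plans.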

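In short, this part does the work of the optimizer. There is a paper about it that is written very clearly and can be read as the high-level design.

There is also a blog post with much the same content.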

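Overall, what Catalyst does is largely what a traditional relational database does:
1. parse (turn the SQL statement into a valid syntax tree)
2. resolve (verify that the columns, tables, and so on actually exist, and bind the schemas of the tables and columns to their concrete names)
3. Generate the concrete logical plan; for details see catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala. Typical operators include filter, project, sort, and union.
4. Optimize. This is a rule-based optimizer; the code is in catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
5. In principle, Catalyst has no inherent tie to Spark and can be viewed as a standalone SQL optimizer.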
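To make the resolve step more concrete, here is a minimal sketch in plain Scala. It is a toy model, not Catalyst's analyzer: the `catalog` map, the `Relation` and `Project` case classes, and the `resolve` function are all hypothetical names invented for illustration.

```scala
object PipelineSketch {
  // Hypothetical catalog: table name -> column names (not Spark's Catalog API).
  val catalog: Map[String, Seq[String]] = Map("people" -> Seq("name", "age"))

  // A toy logical plan with just two operators.
  sealed trait Plan
  case class Relation(table: String, columns: Seq[String]) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan

  // "Resolve": check that the table and the requested columns exist, and
  // bind the table name to its schema. Returns Left with a message on failure.
  def resolve(columns: Seq[String], table: String): Either[String, Plan] =
    catalog.get(table) match {
      case None => Left(s"unknown table: $table")
      case Some(schema) =>
        columns.find(c => !schema.contains(c)) match {
          case Some(missing) => Left(s"unknown column: $missing")
          case None          => Right(Project(columns, Relation(table, schema)))
        }
    }
}
```

A query such as `SELECT name FROM people` would resolve to a `Project` over a `Relation`, while a reference to a missing column or table is rejected before any plan is built.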

types

Native types

One thing worth mentioning is UserDefinedType:

/**                                                                                                                                                                     
 * ::DeveloperApi::                                                                                                                                                     
 * The data type for User Defined Types (UDTs).                                                                                                                         
 *                                                                                                                                                                      
 * This interface allows a user to make their own classes more interoperable with SparkSQL;                                                                             
 * e.g., by creating a [[UserDefinedType]] for a class X, it becomes possible to create                                                                                 
 * a `DataFrame` which has class X in the schema.                                                                                                                       
 *                                                                                                                                                                      
 * For SparkSQL to recognize UDTs, the UDT must be annotated with                                                                                                       
 * [[SQLUserDefinedType]].                                                                                                                                              
 *                                                                                                                                                                      
 * The conversion via `serialize` occurs when instantiating a `DataFrame` from another RDD.                                                                             
 * The conversion via `deserialize` occurs when reading from a `DataFrame`.                                                                                             
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
abstract class UserDefinedType[UserType] extends DataType with Serializable {

Let's look at an example:

class PointUDT extends UserDefinedType[Point] {
    def dataType = StructType(Seq( // Our native structure
        StructField("x", DoubleType),
        StructField("y", DoubleType)
    ))
    def serialize(p: Point) = Row(p.x, p.y)
    def deserialize(r: Row) =
        Point(r.getDouble(0), r.getDouble(1))
}

Decimal

For the dreaded decimal, there is a dedicated class that optimizes it:

/**                                                                                                                                                                     
 * A mutable implementation of BigDecimal that can hold a Long if values are small enough.                                                                              
 *                                                                                                                                                                      
 * The semantics of the fields are as follows:                                                                                                                          
 * - _precision and _scale represent the SQL precision and scale we are looking for                                                                                     
 * - If decimalVal is set, it represents the whole decimal value                                                                                                        
 * - Otherwise, the decimal value is longVal / (10 ** _scale)                                                                                                           
 */                                                                                                                                                                     
final class Decimal extends Ordered[Decimal] with Serializable {
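The idea in the comment above (store the value as a `Long` when it is small enough, otherwise fall back to a full decimal) can be sketched as follows. This is an immutable toy model under my own assumptions, not Spark's `Decimal`, which is mutable; `CompactDecimal` and `fromBigDecimal` are hypothetical names, and the sketch assumes a non-negative scale.

```scala
object DecimalSketch {
  // If decimalVal is set it holds the whole value; otherwise the value is
  // longVal / (10 ** scale), mirroring the field semantics quoted above.
  final case class CompactDecimal(longVal: Long, decimalVal: Option[BigDecimal], scale: Int) {
    def toBigDecimal: BigDecimal = decimalVal match {
      case Some(d) => d
      case None    => BigDecimal(BigInt(longVal), scale) // unscaled value with the given scale
    }
  }

  // Hypothetical constructor: use the compact Long form when the unscaled
  // value fits in a Long, and keep the BigDecimal otherwise.
  def fromBigDecimal(d: BigDecimal, scale: Int): CompactDecimal = {
    val rounded  = d.setScale(scale, BigDecimal.RoundingMode.HALF_UP)
    val unscaled = (rounded * BigDecimal(10).pow(scale)).toBigInt
    if (unscaled.isValidLong) CompactDecimal(unscaled.toLong, None, scale)
    else CompactDecimal(0L, Some(rounded), scale)
  }
}
```

With this layout, `123.45` at scale 2 is held as the `Long` 12345, so arithmetic on typical values avoids allocating arbitrary-precision objects.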

Metadata

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 *                                                                                                                                                                      
 * Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,                                                                      
 * Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and                                                                       
 * Array[Metadata]. JSON is used for serialization.                                                                                                                     
 *                                                                                                                                                                      
 * The default constructor is private. User should use either [[MetadataBuilder]] or                                                                                    
 * [[Metadata.fromJson()]] to create Metadata instances.                                                                                                                
 *                                                                                                                                                                      
 * @param map an immutable map that stores the data                                                                                                                     
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
sealed class Metadata private[types] (private[types] val map: Map[String, Any])                                                                                         
  extends Serializable {

Points to note

Read the parser documentation carefully, especially the parts about the operators.
In regular expressions, (?i) turns on case-insensitive mode and (?-i) turns it off.
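A small sketch of the two inline flags, using Scala's standard `Regex` (which is backed by `java.util.regex`). The patterns and helper names here are made up for illustration and are not taken from the Spark parser.

```scala
object RegexFlagSketch {
  // (?i) switches case-insensitive matching on for everything after it.
  val keyword = "(?i)select".r

  // (?-i) switches it off again, so only "select" ignores case here,
  // while " FROM" must match exactly.
  val mixed = "(?i)select(?-i) FROM".r

  def isSelect(s: String): Boolean     = keyword.pattern.matcher(s).matches()
  def isSelectFrom(s: String): Boolean = mixed.pattern.matcher(s).matches()
}
```

This is how a lexer can accept SQL keywords in any capitalization while leaving the rest of the pattern case-sensitive.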

Tree

The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class. These objects are immutable and can be manipulated using functional transformations, as discussed in the next subsection.

abstract class TreeNode[BaseType <: TreeNode[BaseType]] {                                                                                                               
  self: BaseType with Product =>

TreeNode defines a large number of traversal, map, copy, and transform methods.
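To illustrate what a functional transformation over an immutable tree looks like, here is a self-contained sketch. It is a simplified stand-in, not the real `TreeNode`: the `Expr` trait, `withChildren`, and this `transformUp` are written for the example and only mimic the general shape of the idea.

```scala
object TreeSketch {
  sealed trait Expr {
    def children: Seq[Expr]
    // Rebuild this node with new children; nodes are never mutated in place.
    def withChildren(newChildren: Seq[Expr]): Expr

    // Post-order rewrite: transform the children first, then try the rule
    // on the rebuilt node. Nodes the rule does not match are kept as they are.
    def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
      val rewritten = withChildren(children.map(_.transformUp(rule)))
      rule.applyOrElse(rewritten, (e: Expr) => e)
    }
  }

  case class Literal(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
    def withChildren(newChildren: Seq[Expr]): Expr = this
  }
  case class Attribute(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    def withChildren(newChildren: Seq[Expr]): Expr = this
  }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    def withChildren(newChildren: Seq[Expr]): Expr = Add(newChildren(0), newChildren(1))
  }
}
```

A rule is just a partial function, for example `tree.transformUp { case Add(Literal(a), Literal(b)) => Literal(a + b) }`, which rewrites `x + (1 + 2)` into `x + 3` and leaves every other node untouched.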

Expression

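Expression is a huge branch of the code; anyone who has worked on databases knows how complex this part gets. Allow me to paste the code of Expression here.
The concrete subclasses are not discussed further in this article.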

abstract class Expression extends TreeNode[Expression] {                                                                                                                
  self: Product =>                                                                                                                                                      

  /** The narrowest possible type that is produced when this expression is evaluated. */                                                                                
  type EvaluatedType <: Any                                                                                                                                             

  /**                                                                                                                                                                   
   * Returns true when an expression is a candidate for static evaluation before the query is                                                                           
   * executed.                                                                                                                                                          
   *                                                                                                                                                                    
   * The following conditions are used to determine suitability for constant folding:                                                                                   
   *  - A [[Coalesce]] is foldable if all of its children are foldable                                                                                                  
   *  - A [[BinaryExpression]] is foldable if its both left and right child are foldable                                                                                
   *  - A [[Not]], [[IsNull]], or [[IsNotNull]] is foldable if its child is foldable                                                                                    
   *  - A [[Literal]] is foldable                                                                                                                                       
   *  - A [[Cast]] or [[UnaryMinus]] is foldable if its child is foldable                                                                                               
   */                                                                                                                                                                   
  def foldable: Boolean = false                                                                                                                                         
  def nullable: Boolean                                                                                                                                                 
  def references: AttributeSet = AttributeSet(children.flatMap(_.references.iterator))                                                                                  

  /** Returns the result of evaluating this expression on a given input Row */                                                                                          
  def eval(input: Row = null): EvaluatedType                                                                                                                            

  /**
   * Returns `true` if this expression and all its children have been resolved to a specific schema
   * and `false` if it still contains any unresolved placeholders. Implementations of expressions                                                                       
   * should override this if the resolution of this type of expression involves more than just                                                                          
   * the resolution of its children.                                                                                                                                    
   */                                                                                                                                                                   
  lazy val resolved: Boolean = childrenResolved                                                                                                                         

  /**                                                                                                                                                                   
   * Returns the [[DataType]] of the result of evaluating this expression.  It is                                                                                       
   * invalid to query the dataType of an unresolved expression (i.e., when `resolved` == false).                                                                        
   */                                                                                                                                                                   
  def dataType: DataType                                                                                                                                                

  /**                                                                                                                                                                   
   * Returns true if  all the children of this expression have been resolved to a specific schema                                                                       
   * and false if any still contains any unresolved placeholders.                                                                                                       
   */                                                                                                                                                                   
  def childrenResolved: Boolean = !children.exists(!_.resolved)                                                                                                         

  /**
   * Returns a string representation of this expression that does not have developer centric
   * debugging information like the expression id.                                                                                                                      
   */                                                                                                                                                                   
  def prettyString: String = {                                                                                                                                          
    transform {                                                                                                                                                         
      case a: AttributeReference => PrettyAttribute(a.name)                                                                                                             
      case u: UnresolvedAttribute => PrettyAttribute(u.name)                                                                                                            
    }.toString                                                                                                                                                          
  }                                                                                                                                                                     
}
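The `foldable` rules listed in the comment above can be sketched with a few toy node types. This is an illustration under simplifying assumptions, not the Catalyst classes: values are plain `Int`s, `Row` is a hypothetical alias for a map from column name to value, and `constantFold` is a helper written for this example.

```scala
object FoldableSketch {
  type Row = Map[String, Int] // hypothetical stand-in for Spark's Row

  sealed trait Expr {
    def foldable: Boolean = false // not foldable unless a subclass says so
    def eval(input: Row): Int
  }
  case class Literal(value: Int) extends Expr {
    override def foldable: Boolean = true // a literal is always foldable
    def eval(input: Row): Int = value
  }
  case class Attribute(name: String) extends Expr {
    def eval(input: Row): Int = input(name) // needs a row, so never foldable
  }
  case class Add(left: Expr, right: Expr) extends Expr {
    // A binary expression is foldable if both children are foldable.
    override def foldable: Boolean = left.foldable && right.foldable
    def eval(input: Row): Int = left.eval(input) + right.eval(input)
  }

  // Replace every foldable subtree with the literal it evaluates to.
  def constantFold(e: Expr): Expr = e match {
    case lit: Literal             => lit
    case other if other.foldable  => Literal(other.eval(Map.empty))
    case Add(l, r)                => Add(constantFold(l), constantFold(r))
    case other                    => other
  }
}
```

Because a foldable expression never reads its input, it can be evaluated once with an empty row before the query runs, which is the static evaluation the comment refers to.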

DSL

catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala defines a large number of implicit conversions to support the DSL.
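As a rough sketch of how implicit conversions make such a DSL possible, the toy below lifts plain Scala values into expression nodes. All names here (`col`, `intToLiteral`, and the node classes) are hypothetical and much simpler than what the real package defines.

```scala
import scala.language.implicitConversions

object DslSketch {
  sealed trait Expr {
    // Operators build tree nodes instead of computing values.
    def +(other: Expr): Expr = Add(this, other)
    def >(other: Expr): Expr = GreaterThan(this, other)
  }
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  // Explicit entry point for column references.
  def col(name: String): Expr = Attribute(name)

  // The implicit conversion lets an Int appear wherever an Expr is expected.
  implicit def intToLiteral(i: Int): Expr = Literal(i)

  // Reads like ordinary Scala, but produces an expression tree.
  val predicate: Expr = col("age") + 1 > 18
}
```

After `import DslSketch._`, the expression `col("age") > 18` builds `GreaterThan(Attribute("age"), Literal(18))`, which is convenient for writing test plans by hand.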

SqlLexical

class SqlLexical extends StdLexical {

SqlParser

/**                                                                                                                                                                     
 * A very simple SQL parser.  Based loosely on:                                                                                                                         
 * https://github.com/stephentu/scala-sql-parser/blob/master/src/main/scala/parser.scala                                                                                
 *                                                                                                                                                                      
 * Limitations:                                                                                                                                                         
 *  - Only supports a very limited subset of SQL.                                                                                                                       
 *                                                                                                                                                                      
 * This is currently included mostly for illustrative purposes.  Users wanting more complete support                                                                    
 * for a SQL like language should checkout the HiveQL support in the sql/hive sub-project.                                                                              
 */                                                                                                                                                                     
class SqlParser extends AbstractSparkSQLParser with DataTypeParser {

Including comments, the file is 386 lines in total. It is of course not complete, but it is adequate and fairly concise.

plans

QueryPlan

abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanType] {

The base class of all plans.

JoinType

sealed abstract class JoinType                                                                                                                                          

case object Inner extends JoinType                                                                                                                                      

case object LeftOuter extends JoinType                                                                                                                                  

case object RightOuter extends JoinType                                                                                                                                 

case object FullOuter extends JoinType                                                                                                                                  

case object LeftSemi extends JoinType
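Because `JoinType` is sealed, code that matches on it can be checked for exhaustiveness by the compiler. The sketch below repeats the hierarchy so that it is self-contained; the helpers `fromSql` and `keepsUnmatchedLeft` are hypothetical and are not part of Catalyst.

```scala
object JoinTypeSketch {
  sealed abstract class JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType
  case object LeftSemi extends JoinType

  // Hypothetical helper: map a SQL join keyword phrase to a join type.
  def fromSql(keyword: String): Option[JoinType] =
    keyword.trim.toUpperCase(java.util.Locale.ROOT).split("\\s+").mkString(" ") match {
      case "INNER"                => Some(Inner)
      case "LEFT" | "LEFT OUTER"  => Some(LeftOuter)
      case "RIGHT" | "RIGHT OUTER" => Some(RightOuter)
      case "FULL" | "FULL OUTER"  => Some(FullOuter)
      case "LEFT SEMI"            => Some(LeftSemi)
      case _                      => None
    }

  // Does the join keep left-side rows that have no match on the right?
  // The match covers every case object of the sealed hierarchy.
  def keepsUnmatchedLeft(joinType: JoinType): Boolean = joinType match {
    case LeftOuter | FullOuter         => true
    case Inner | RightOuter | LeftSemi => false
  }
}
```

If a new join type were added to the sealed hierarchy, the compiler would warn about the second match until the new case is handled.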