Project 3:Query Execution

本文详细介绍了BusTub数据库项目的查询计划实现过程,包括关键数据结构如Catalog、TableInfo和IndexInfo的设计,以及顺序扫描、哈希连接等9种执行器的难点解析。

Project 3:Query Execution

前言:本项目是实现BusTub数据库里面的一些基本的查询计划,一共9个,在这里循序渐进进行分析。难度主要集中在对源码的理解与活学活用,不多说,现在开始。

主要的类有上述三个,需要细读源码,理解其构造与实现。

Catalog

这里主要介绍一下Catalog,其余类可类比学习。

using table_oid_t = uint32_t;// 物理表标号
using column_oid_t = uint32_t;// 列标号
using index_oid_t = uint32_t;// 索引标号

主要定义了几个在表中常用的数据类型

主要数据成员

table_info

struct TableInfo {
  TableInfo(Schema schema, std::string name, std::unique_ptr<TableHeap> &&table, table_oid_t oid)
      : schema_{std::move(schema)}, name_{std::move(name)}, table_{std::move(table)}, oid_{oid} {}
  /** The table schema */
  Schema schema_;
  /** The table name */
  const std::string name_;
  /** An owning pointer to the table heap */
  std::unique_ptr<TableHeap> table_;
  /** The table OID */
  const table_oid_t oid_;
};

这里定义了一个结构体存储物理表的主要信息,用于从Catalog中获取。
在这里插入图片描述

index_info

struct IndexInfo {
  IndexInfo(Schema key_schema, std::string name, std::unique_ptr<Index> &&index, index_oid_t index_oid,
            std::string table_name, size_t key_size)
      : key_schema_{std::move(key_schema)},
        name_{std::move(name)},
        index_{std::move(index)},
        index_oid_{index_oid},
        table_name_{std::move(table_name)},
        key_size_{key_size} {}
  /** The schema for the index key */
  Schema key_schema_;
  /** The name of the index */
  std::string name_;
  /** An owning pointer to the index */
  std::unique_ptr<Index> index_;
  /** The unique OID for the index */
  index_oid_t index_oid_;
  /** The name of the table on which the index is created */
  std::string table_name_;
  /** The size of the index key, in bytes */
  const size_t key_size_;
};

和table_info一样,这里声明了一个结构体用于存储索引的信息,需要注意的是这里的索引是基于一个存在的表生成的。
在这里插入图片描述

主要成员函数

在这里插入图片描述

  /** Indicates that an operation returning a `TableInfo*` failed */
  static constexpr TableInfo *NULL_TABLE_INFO{nullptr};

  /** Indicates that an operation returning a `IndexInfo*` failed */
  static constexpr IndexInfo *NULL_INDEX_INFO{nullptr};

声明两个静态常量类型,这里用的是constexpr,必须在类里面初始化


其余的类可类比总结,这里不再赘述,接下来是执行器的实现。

执行器实现

一共九个执行器,主要介绍几个比较难以实现的,以及些许坑点。

先来总体概括一下,查询计划的具体实现过程。

在这里插入图片描述

主要是实现上图中的Init和Next

顺序扫描

给定一个表,执行顺序扫描,要注意以下几点:

  • 需要填充result_set,也就是说每一次Next需要返回一个tuple和rid

  • 要记录下一次开始的位置,也就是要记录表迭代器的位置

  • 要控制输出格式,也就是表本身的schema与执行计划输出的schema不一样,要进行判断与修改

    auto opt = plan_->OutputSchema();
    std::vector<Value> values;
    values.reserve(opt->GetColumnCount());
    //  当输出格式与原schema格式不同时,需要获取输出格式的所有Value,来构造tuple  
    for (const auto &column : opt->GetColumns()) {   
        auto value = column.GetExpr()->Evaluate(tp, &schema); 
        values.push_back(value);
            
    }        
    *tuple = Tuple(values, opt);// 构造tuple       
    *rid = tp->GetRid();
    
  • 只要predicate不为nullptr,就需要进行Evaluate

插入、删除、更新

这几个操作都需要涉及更新表的数据以及索引,为此需要修改物理表数据(TableHeap)与索引信息(Index),同时不能修改result_set,为此需要全部处理后,返回false,有点像pipebreaker。

聚合函数

和顺序扫描操作类似,在Having不为nullptr时,需要进行谓词的判断,当符合条件就返回tuple,这里的重点在Init,因为需要构造哈希表,以供Next函数查询使用。

嵌套连接

编程难度从这里开始上升,这里要求实现下面关系

在这里插入图片描述

同时,要考虑IOcost,为此要记录外键的位置,同时遍历内键,这里利用了比较符的性质,巧妙的区分各种情况。

while (flag_ || left_executor_->Next(&left_tuple_, &left_rid_)) {
    while (right_executor_->Next(&right_tuple, &right_rid)) {
      res = true;
      if (predicate != nullptr) {
        res = predicate->EvaluateJoin(&left_tuple_, left_schema, &right_tuple, right_schema).GetAs<bool>();
      }
      if (res) {
        std::vector<Value> values;
        values.reserve(output_schema->GetColumnCount());
        for (const auto &column : output_schema->GetColumns()) {
          values.push_back(column.GetExpr()->EvaluateJoin(&left_tuple_, left_schema, &right_tuple, right_schema));
        }
        *tuple = Tuple(values, output_schema);
        flag_ = true;
        return true;
      }
    }
    flag_ = false;
    right_executor_->Init();
  }

flag_初始化为false,后面根据需要进行调整。

哈希连接

在这里插入图片描述

这个操作符难度较大,我根据官方提示进行操作,设计相应的哈希结构进行处理,然后将哈希函数进行特化,以此来适应tuple(多个值)的情况。具体形式可以参考:

// ------------------------------------------------------------------
namespace bustub {
struct HashJoinKey {
  Value keys_;
  auto operator==(const HashJoinKey &other) const -> bool {
    return (keys_.CompareEquals(other.keys_) == CmpBool::CmpTrue);
  }
};
} 
// ------------------------------------------------------------------
namespace std {
/** Implements std::hash on Key */
template <>
struct hash<bustub::HashJoinKey> {
  auto operator()(const bustub::HashJoinKey &key) const -> std::size_t {
    size_t curr_hash = 0;
    if (!key.keys_.IsNull()) {
      curr_hash = bustub::HashUtil::CombineHashes(curr_hash, bustub::HashUtil::HashValue(&key.keys_));
    }
    return curr_hash;
  }
};
}
// --------------------------------------------------------------------

通过上述的数据结构,在Init中可以将外键进行处理,将其存储在unorder_map,以供查询

接着在Next中,遍历内键,但要需要注意多个键共享一个key的情况,这里需要申请一些成员变量

 private:
  /** The NestedLoopJoin plan node to be executed. */
  const HashJoinPlanNode *plan_;
  std::unique_ptr<AbstractExecutor> left_child_;
  std::unique_ptr<AbstractExecutor> right_child_;
//-------------------------------------------------
  const Schema *l_schema_;
  std::unordered_map<HashJoinKey, std::vector<Tuple>> h_map_;
  std::vector<Tuple> left_tuples_{};
  RID right_rid_;
  Tuple right_tuple_;
  bool flag_;

distinct、limit

distinct需要一个set即可完成去重操作,limit就不用说啦。

坑点

  • 当构造hash特化函数与数据结构时,需要放在前面,否则会出现Incomplete class
  • 在头文件中使用别的类,需要include,否则会报错
  • RID的含义,需要特别注意,这里rid只在表元中才有意义,最好使用table_iterator调用
  • 在进行grading的时候,一定需要在CreateExecutor中加入MockScan的计划类型,否则会报错unsupported plan type

结果


在这里插入图片描述

17:19:18 INTROSPECTION INIT: Product: IntelliJ IDEA 2024.1 Project: huacai-blog DBMS: MySQL Accessibility matcher: none Connection bounds: null 17:19:18 Given model is accepted 17:19:18 Introspection scope is specified explicitly, but it is the same as the data source&#39;s one: Pattern: kind: schema name: `@`, `blog` 17:19:18 Introspector version: 0 17:19:18 Executing query: -- 8< -- select database() -- >8 -- 17:19:18 execution time: 2 ms fetching time: 3 ms 17:19:18 Attached to database DBMS: MySQL Server version: 5.7.21 Driver version: 8.2 17:19:18 Memory: 859 M available from allocated 1640 M (max 4048 M) 17:19:18 Executing query: -- 8< -- select @@event_scheduler -- >8 -- 17:19:18 execution time: 2 ms fetching time: 3 ms 17:19:18 Connection-related state is initialized. The parameters map: empty map 17:19:18 Detaching from DB 17:19:18 Memory: 856 M available from allocated 1640 M (max 4048 M) 17:19:18 Query Performance Statistics: -------------------------------------------------------------------------------------------------------------------------------------------- ## count err T.min T.avg T.max T.sum % rows packs Query -------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------- Total duration of all queries: 0.000 s. 17:19:18 Dropping the info 17:19:18 Attached to database DBMS: MySQL Server version: 5.7.21 Driver version: 8.2 17:19:18 Memory: 856 M available from allocated 1640 M (max 4048 M) 17:19:18 Executing query: -- 8< -- select @@event_scheduler -- >8 -- 17:19:18 execution time: 1 ms fetching time: 2 ms 17:19:18 Connection-related state is initialized. The parameters map: empty map 17:19:18 ==================================================== REQUEST WITH 1 INTROSPECTION TASKS ==================================================== 17:19:18 Requested tasks: Introspect schema blog in the database 17:19:18 =============================================== DOING A TASK [1/1]: SchemasIntrospectionTask =============================================== 17:19:18 Task description: Introspect schema blog in the database 17:19:18 Retrieving server info 17:19:18 Executing query: -- 8< -- select @@default_storage_engine, @@default_tmp_storage_engine -- >8 -- 17:19:18 execution time: 2 ms fetching time: 1 ms 17:19:18 Executing query: -- 8< -- select table_name from information_schema.tables where lower(table_schema) = &#39;information_schema&#39; and lower(table_name) = &#39;parameters&#39; -- >8 -- 17:19:18 execution time: 6 ms fetching time: 1 ms 17:19:18 Retrieving session info 17:19:18 ---------------------------------------------------------- SCHEMA RETRIEVER: blog ---------------------------------------------------------- 17:19:18 Schema has children: true Incremental is possible: false Asked mode: null Decided mode: FULL 17:19:18 Configuration info: - schema id: no - schema has children: yes - mode: FULL - with sources: yes 17:19:18 Families before introspection: - table: 48 17:19:18 Retrieving tables and views 17:19:18 Executing query: -- 8< -- select table_name, table_type, table_comment, engine, table_collation, create_options from information_schema.tables where table_schema = ? -- >8 -- 17:19:18 execution time: 3 ms fetching time: 3 ms 17:19:18 Retrieving columns 17:19:18 Executing query: -- 8< -- select ordinal_position, column_name, column_type, column_default, generation_expression, table_name, column_comment, is_nullable, extra, collation_name from information_schema.columns where table_schema = ? order by table_name, ordinal_position -- >8 -- 17:19:18 execution time: 8 ms fetching time: 5 ms 17:19:18 Executing query: -- 8< -- select table_name, auto_increment from information_schema.tables where table_schema = ? and auto_increment is not null -- >8 -- 17:19:18 execution time: 106 ms fetching time: 4 ms 17:19:18 Retrieving indices 17:19:18 Executing query: -- 8< -- select table_name, index_name, index_comment, index_type, non_unique, column_name, sub_part, collation, null expression from information_schema.statistics where table_schema = ? and index_schema = ? order by index_schema, table_name, index_name, index_type, seq_in_index -- >8 -- 17:19:18 execution time: 2 ms fetching time: 4 ms 17:19:18 Retrieving constraints 17:19:18 Executing query: -- 8< -- select c.constraint_name, c.constraint_schema, c.table_name, c.constraint_type, false enforced from information_schema.table_constraints c where c.table_schema = ? -- >8 -- 17:19:18 execution time: 2 ms fetching time: 3 ms 17:19:18 Executing query: -- 8< -- select r.constraint_name, null table_name, r.match_option, r.update_rule, r.delete_rule from information_schema.referential_constraints r where r.constraint_schema = ? -- >8 -- 17:19:18 execution time: 2 ms fetching time: 1 ms 17:19:18 Retrieving constraint columns 17:19:18 Executing query: -- 8< -- select constraint_name, table_name, column_name, referenced_table_schema, referenced_table_name, referenced_column_name from information_schema.key_column_usage where table_schema = ? and referenced_column_name is not null order by table_name , constraint_name , ordinal_position -- >8 -- 17:19:18 execution time: 3 ms fetching time: 3 ms 17:19:18 Retrieving partitions 17:19:18 Executing query: -- 8< -- select table_name, partition_name, subpartition_name, partition_ordinal_position, subpartition_ordinal_position, partition_method, subpartition_method, partition_expression, subpartition_expression, partition_description, partition_comment/*, tablespace_name*/ from information_schema.partitions where partition_name is not null and table_schema = ? -- >8 -- 17:19:18 execution time: 98 ms fetching time: 1 ms 17:19:18 Retrieving triggers 17:19:18 Executing query: -- 8< -- select trigger_name, event_object_table, event_manipulation, action_timing, definer from information_schema.triggers where trigger_schema = ? -- >8 -- 17:19:18 execution time: 9 ms fetching time: 2 ms 17:19:18 Retrieving scheduled events 17:19:18 Retrieving routines 17:19:18 Executing query: -- 8< -- select routine_name, routine_type, routine_definition, routine_comment, dtd_identifier, definer, is_deterministic = &#39;YES&#39; is_deterministic, cast(sql_data_access as char(1)) sql_data_access, cast(security_type as char(1)) security_type from information_schema.routines where routine_schema = ? -- >8 -- 17:19:18 execution time: 4 ms fetching time: 1 ms 17:19:18 Retrieving routine permissions 17:19:18 Executing query: -- 8< -- select Host, User, Routine_name, Proc_priv, Routine_type = &#39;PROCEDURE&#39; as is_proc from mysql.procs_priv where Db = ?; -- >8 -- 17:19:18 execution time: 1 ms fetching time: 2 ms 17:19:18 Retrieving column permissions 17:19:18 Executing query: -- 8< -- select grantee, table_name, column_name, privilege_type, is_grantable from information_schema.column_privileges where table_schema = ? union all select grantee, table_name, &#39;&#39; column_name, privilege_type, is_grantable from information_schema.table_privileges where table_schema = ? order by table_name, grantee, privilege_type -- >8 -- 17:19:18 execution time: 1 ms fetching time: 1 ms 17:19:18 Retrieving arguments 17:19:18 Executing query: -- 8< -- select specific_name, ordinal_position, parameter_name, parameter_mode, dtd_identifier from information_schema.parameters where specific_schema = ? and ordinal_position > 0 order by specific_name, ordinal_position -- >8 -- 17:19:18 execution time: 2 ms fetching time: 1 ms 17:19:18 Retrieving view sources 17:19:18 Executing query: -- 8< -- select table_name, view_definition, definer from information_schema.views where table_schema = ? -- >8 -- 17:19:18 execution time: 3 ms fetching time: 2 ms 17:19:18 Retrieving routine sources 17:19:18 Retrieving routine sources (circumventing MySQL bugs) 17:19:18 Retrieving trigger sources 17:19:18 Retrieving trigger sources (circumventing MySQL bugs) 17:19:18 Retrieving event sources 17:19:18 Retrieving event sources (circumventing MySQL bugs) 17:19:18 Status details: Finishing retrieving the schema blog 17:19:18 No deferred tasks to process. 17:19:18 Detaching from DB 17:19:18 Memory: 844 M available from allocated 1640 M (max 4048 M) 17:19:18 Query Performance Statistics: -------------------------------------------------------------------------------------------------------------------------------------------- ## count err T.min T.avg T.max T.sum % rows packs Query -------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------- Total duration of all queries: 0.000 s. 17:19:18 Dropping the info 17:19:18 Metrics: ================================== Operation Cnt Min Avg Max Sum ---------------------------------- ---------------------------------- Total: 0.000
09-29
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值