Partition Pruning

This article describes how the Oracle database server recognizes partitions and subpartitions to optimize SQL query performance. Partition pruning eliminates unnecessary access to index and data partitions, which can dramatically improve query efficiency. The article explains how equality, range, LIKE, and IN-list predicates enable partition pruning under the different partitioning schemes.

The Oracle database server explicitly recognizes partitions and subpartitions. It then
optimizes SQL statements to mark the partitions or subpartitions that need to be
accessed and eliminates (prunes) unnecessary partitions or subpartitions from access
by those SQL statements. In other words, partition pruning is the skipping of
unnecessary index and data partitions or subpartitions in a query.

For each SQL statement, depending on the selection criteria specified, unneeded
partitions or subpartitions can be eliminated. For example, if a query only involves
March sales data, then there is no need to retrieve data for the remaining eleven
months. Such intelligent pruning can dramatically reduce the data volume, resulting
in substantial improvements in query performance.
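As a concrete sketch of the March example (the sales table, its columns, and its partition names are hypothetical, not from the original text), only one partition of a range-partitioned table would be accessed:

```sql
-- Hypothetical table, range-partitioned with one partition per month
CREATE TABLE sales (
  sale_date  DATE,
  amount     NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_jan VALUES LESS THAN (TO_DATE('01-FEB-1998','DD-MON-YYYY')),
  PARTITION p_feb VALUES LESS THAN (TO_DATE('01-MAR-1998','DD-MON-YYYY')),
  PARTITION p_mar VALUES LESS THAN (TO_DATE('01-APR-1998','DD-MON-YYYY'))
  -- ... remaining months
);

-- Only partition p_mar needs to be accessed; the rest are pruned
SELECT SUM(amount)
FROM sales
WHERE sale_date >= TO_DATE('01-MAR-1998','DD-MON-YYYY')
  AND sale_date <  TO_DATE('01-APR-1998','DD-MON-YYYY');
```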

If the optimizer determines that the selection criteria used for pruning are satisfied by
all the rows in the accessed partition or subpartition, it removes those criteria from the
predicate list (WHERE clause) during evaluation in order to improve performance.
However, the optimizer cannot prune partitions if the SQL statement applies a
function to the partitioning column (with the exception of the TO_DATE function).
Similarly, the optimizer cannot use an index if the SQL statement applies a function to
the indexed column, unless it is a function-based index.
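A sketch of this restriction (the table and predicates here are illustrative): a bare comparison on the partitioning column allows pruning, while wrapping that column in a function defeats it.

```sql
-- Pruning possible: the partitioning column is compared directly,
-- and TO_DATE is applied only to the constant
SELECT * FROM cust_orders
WHERE order_date = TO_DATE('15-MAR-1998','DD-MON-YYYY');

-- Pruning NOT possible: a function is applied to the
-- partitioning column itself
SELECT * FROM cust_orders
WHERE TRUNC(order_date, 'MM') = TO_DATE('01-MAR-1998','DD-MON-YYYY');
```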


Pruning can eliminate index partitions even when the underlying table's partitions
cannot be eliminated, but only when the index and table are partitioned on different
columns. You can often improve the performance of operations on large tables by
creating partitioned indexes that reduce the amount of data that your SQL statements
need to access or modify.
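For example (the index name is illustrative), a LOCAL index is partitioned the same way as its table, so after pruning the optimizer scans only the index partitions that correspond to the surviving table partitions:

```sql
-- A LOCAL index inherits the table's partitioning scheme;
-- each index partition covers exactly one table partition
CREATE INDEX cust_orders_date_ix
  ON cust_orders (order_date)
  LOCAL;
```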

Equality, range, LIKE, and IN-list predicates are considered for partition pruning with
range or list partitioning, and equality and IN-list predicates are considered for
partition pruning with hash partitioning.
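As an IN-list sketch (a hypothetical list-partitioned table, not from the original text), a predicate on the partitioning key lets the optimizer visit only the named partitions:

```sql
-- Hypothetical list-partitioned table
CREATE TABLE accounts (
  id     NUMBER,
  region VARCHAR2(10)
)
PARTITION BY LIST (region) (
  PARTITION p_east VALUES ('EAST'),
  PARTITION p_west VALUES ('WEST'),
  PARTITION p_apac VALUES ('APAC')
);

-- Only p_east and p_west are accessed; p_apac is pruned
SELECT COUNT(*) FROM accounts WHERE region IN ('EAST', 'WEST');
```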


Partition Pruning Example
We have a partitioned table called cust_orders. The partition key for cust_orders
is order_date. Let us assume that cust_orders has six months of data, January to
June, with a partition for each month of data. If the following query is run:

    SELECT SUM(value)
    FROM cust_orders
    WHERE order_date BETWEEN '28-MAR-98' AND '23-APR-98';

Partition pruning is achieved by:
■ First, elimination of the January, February, May, and June data partitions.
Then either:
■ An index scan of the March and April data partitions, due to high index selectivity, or
■ A full scan of the March and April data partitions, due to low index selectivity.
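One way to confirm that pruning took place (the exact plan output varies by Oracle version) is to inspect the Pstart and Pstop columns of the execution plan:

```sql
EXPLAIN PLAN FOR
SELECT SUM(value)
FROM cust_orders
WHERE order_date BETWEEN '28-MAR-98' AND '23-APR-98';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- A PARTITION RANGE ITERATOR row whose Pstart/Pstop columns name
-- only the March and April partitions indicates pruning occurred.
```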


From the "ITPUB blog": http://blog.itpub.net/10599713/viewspace-1002503/. Please credit the source when reposting.

