Hive及Spark中Join中过滤下推优化分析

最新推荐文章于 2024-09-17 09:15:00 发布

原创最新推荐文章于 2024-09-17 09:15:00 发布 · 997 阅读

CC 4.0 BY-SA版权

文章标签：

8 篇文章

订阅专栏

本文介绍SparkSQL中的PushPredicateThroughJoin优化规则，该规则基于Hive的Join行为进行设计。主要包含两条法则：一是Join条件过滤不能下推到保留行表；二是Where条件过滤不能下推到NULL补充表。

前言：在Spark SQL中有一条PushPredicateThroughJoin优化规则，其原理是参考Hive中的Join规则完成的，具体参考本文的规则1/2。

Preserved Row table: The table in an Outer Join that must return all rows. For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.
Null Supplying table: This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.
During Join predicate: A predicate that is in the JOIN ON clause. For example, in ‘R1 join R2 on R1.x = 5’ the predicate ‘R1.x = 5’ is a During Join predicate.
After Join predicate: A predicate that is in the WHERE clause.

The logic can be summarized by these two rules:

第一条法则： join条件过滤不能下推到保留行表中。

比如以下选择，left join中左表s1为保留行表，所以on条件（join过滤条件）不能下推到s1中
select s1.key, s2.key from src s1 left join src s2 on s1.key > '2';

而s2表不是保留行，所以s2.key>2条件可以下推到s2表中：
select s1.key, s2.key from src s1 left join src s2 on s2.key > '2';

第二条法则：where条件过滤不能下推到NULL补充表。

比如以下选择left join的右表s2为NULL补充表所以，s1.key>2 where条件可以下推到s1：

select s1.key, s2.key from src s1 left join src s2 where s1.key > '2';

而以下选择由于s2未NULL补充表所以s2.key>2过滤条件不能下推

select s1.key, s2.key from src s1 left join src s2 where s2.key > '2';

引用网页：https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior#OuterJoinBehavior-Examples