Background
I've recently been processing data with Spark SQL, and while joining two tables I wanted to find out whether where the filter condition is written affects how the query executes.
Investigation
Preparing the data
create table stu(
  id int,         -- unique id
  name string,    -- student name
  subject string, -- subject
  score int       -- score
)
Insert some rows; the stu table then contains:
+---+----+-------+-----+
| id|name|subject|score|
+---+----+-------+-----+
| 1|张三| math| 50|
| 2|张三|English| 70|
| 3|张三|Chinese| 80|
| 4|李四| math| 80|
| 5|李四|English| 40|
| 6|李四|Chinese| 60|
+---+----+-------+-----+
Next, create the teacher table:
create table teacher(
  id int,        -- unique id
  name string,   -- teacher name
  subject string -- subject taught
)
After inserting data:
+---+--------+-------+
| id| name|subject|
+---+--------+-------+
| 1| jack| math|
| 2| tony|English|
| 3|ZhaoGang|Chinese|
+---+--------+-------+
The question
Suppose we want to find, for each student, the teacher of each of their subjects. A straightforward query is:
select s.id,
s.name,
s.subject,
s.score,
t.name
from stu s
left join teacher t
on s.subject=t.subject
The result:
+---+----+-------+-----+--------+
| id|name|subject|score| name|
+---+----+-------+-----+--------+
| 3|张三|Chinese| 80|ZhaoGang|
| 5|李四|English| 40| tony|
| 2|张三|English| 70| tony|
| 6|李四|Chinese| 60|ZhaoGang|
| 4|李四| math| 80| jack|
| 1|张三| math| 50| jack|
+---+----+-------+-----+--------+
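As an aside, these join semantics are plain SQL and can be checked without a cluster. Here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Spark SQL (an assumption for local reproduction only; the table names and data are copied from above, but the execution plans in this post come from Spark, not SQLite):

```python
import sqlite3

# Rebuild the article's two sample tables in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stu(id INT, name TEXT, subject TEXT, score INT);
CREATE TABLE teacher(id INT, name TEXT, subject TEXT);
INSERT INTO stu VALUES
  (1,'张三','math',50),(2,'张三','English',70),(3,'张三','Chinese',80),
  (4,'李四','math',80),(5,'李四','English',40),(6,'李四','Chinese',60);
INSERT INTO teacher VALUES
  (1,'jack','math'),(2,'tony','English'),(3,'ZhaoGang','Chinese');
""")

# The basic left join: every student row should find exactly one teacher.
rows = conn.execute("""
    SELECT s.id, s.name, s.subject, s.score, t.name
    FROM stu s
    LEFT JOIN teacher t ON s.subject = t.subject
    ORDER BY s.id
""").fetchall()
for r in rows:
    print(r)
```

Every subject has a matching teacher, so all six student rows come back with a non-null teacher name, matching the Spark result above.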
Let's look at this query's execution plan using explain (the same applies throughout):
== Physical Plan ==
*(2) Project [id#1173, name#1174, subject#1175, score#1176, name#1310]
+- *(2) BroadcastHashJoin [subject#1175], [subject#1311], LeftOuter, BuildRight
:- *(2) ColumnarToRow
: +- FileScan parquet default.stu[id#1173,name#1174,subject#1175,score#1176] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/stu], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,subject:string,score:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true])), [id=#742]
+- *(1) Project [name#1310, subject#1311]
+- *(1) Filter isnotnull(subject#1311)
+- *(1) ColumnarToRow
+- FileScan parquet default.teacher[name#1310,subject#1311] Batched: true, DataFilters: [isnotnull(subject#1311)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/teacher], PartitionFilters: [], PushedFilters: [IsNotNull(subject)], ReadSchema: struct<name:string,subject:string>
Explanation: Spark scans both tables from Parquet, and on the broadcast (teacher) side it adds an isnotnull(subject) filter before building the hash relation, since rows with a null join key can never match.
Next question: find the teacher for each subject of every student except 张三. In other words, we mask out 张三's rows in the stu table. How does that SQL execute?
select s.id,
s.name,
s.subject,
s.score,
t.name
from stu s
left join teacher t
on s.subject=t.subject
where s.name != '张三'
Result:
+---+----+-------+-----+--------+
| id|name|subject|score| name|
+---+----+-------+-----+--------+
| 4|李四| math| 80| jack|
| 5|李四|English| 40| tony|
| 6|李四|Chinese| 60|ZhaoGang|
+---+----+-------+-----+--------+
In this example, does the SQL engine join the two tables first and then filter the joined result, or does it filter the data first and then join? No guessing; look at the execution plan:
== Physical Plan ==
*(2) Project [id#1173, name#1174, subject#1175, score#1176, name#1310]
+- *(2) BroadcastHashJoin [subject#1175], [subject#1311], LeftOuter, BuildRight
:- *(2) Project [id#1173, name#1174, subject#1175, score#1176]
: +- *(2) Filter (isnotnull(name#1174) AND NOT (name#1174 = 张三))
: +- *(2) ColumnarToRow
: +- FileScan parquet default.stu[id#1173,name#1174,subject#1175,score#1176] Batched: true, DataFilters: [isnotnull(name#1174), NOT (name#1174 = 张三)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/stu], PartitionFilters: [], PushedFilters: [IsNotNull(name), Not(EqualTo(name,张三))], ReadSchema: struct<id:int,name:string,subject:string,score:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true])), [id=#791]
+- *(1) Project [name#1310, subject#1311]
+- *(1) Filter isnotnull(subject#1311)
+- *(1) ColumnarToRow
+- FileScan parquet default.teacher[name#1310,subject#1311] Batched: true, DataFilters: [isnotnull(subject#1311)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/teacher], PartitionFilters: [], PushedFilters: [IsNotNull(subject)], ReadSchema: struct<name:string,subject:string>
In this plan you can see that once the stu data is loaded, the filter name != '张三' (plus an implied isnotnull(name)) is applied first, and only the filtered rows take part in the join with teacher.
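Behaviorally, a WHERE filter gives the same rows whether the engine pushes it below the join or applies it afterwards; the pushdown only changes where the work happens. A quick check with stdlib sqlite3 (the same stand-in tables assumed earlier, rebuilt here so the snippet is self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stu(id INT, name TEXT, subject TEXT, score INT);
CREATE TABLE teacher(id INT, name TEXT, subject TEXT);
INSERT INTO stu VALUES
  (1,'张三','math',50),(2,'张三','English',70),(3,'张三','Chinese',80),
  (4,'李四','math',80),(5,'李四','English',40),(6,'李四','Chinese',60);
INSERT INTO teacher VALUES
  (1,'jack','math'),(2,'tony','English'),(3,'ZhaoGang','Chinese');
""")

# Filter in the WHERE clause: 张三's rows are excluded from the output,
# so only 李四's three rows survive.
where_rows = conn.execute("""
    SELECT s.id, s.name, s.subject, s.score, t.name
    FROM stu s
    LEFT JOIN teacher t ON s.subject = t.subject
    WHERE s.name != '张三'
    ORDER BY s.id
""").fetchall()
for r in where_rows:
    print(r)
```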
Now let's move the filter into the join condition itself:
select s.id,
s.name,
s.subject,
s.score,
t.name
from stu s
left join teacher t
on s.subject = t.subject and s.name != '张三'
The result:
+---+----+-------+-----+--------+
| id|name|subject|score| name|
+---+----+-------+-----+--------+
| 1|张三| math| 50| null|
| 2|张三|English| 70| null|
| 3|张三|Chinese| 80| null|
| 4|李四| math| 80| jack|
| 5|李四|English| 40| tony|
| 6|李四|Chinese| 60|ZhaoGang|
+---+----+-------+-----+--------+
The execution plan is now:
== Physical Plan ==
*(1) Project [id#90, name#91, subject#92, score#93, name#30]
+- *(1) BroadcastHashJoin [subject#92], [subject#31], LeftOuter, BuildRight, NOT (name#91 = 张三)
:- *(1) ColumnarToRow
: +- FileScan parquet [id#90,name#91,subject#92,score#93] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/wmh/PycharmProjects/pythonProject/paruqet/stu], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,subject:string,score:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false])), [id=#446]
+- LocalTableScan [name#30, subject#31]
Here the filter NOT (name = 张三) is attached to the BroadcastHashJoin itself and evaluated during the join. This shows the difference for a left join: with a WHERE filter, the predicate is pushed down and applied while stu is being scanned (predicate pushdown), so the left side only contributes the filtered rows. With the filter in the ON clause, the left side keeps all of its rows; 张三's rows simply fail to match and come back with NULL for the teacher. The two forms therefore return different results.
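This WHERE-vs-ON difference is standard LEFT JOIN semantics, not something Spark-specific. A sketch with stdlib sqlite3 (again a stand-in, with the tables rebuilt inline) shows the two result sets diverge:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stu(id INT, name TEXT, subject TEXT, score INT);
CREATE TABLE teacher(id INT, name TEXT, subject TEXT);
INSERT INTO stu VALUES
  (1,'张三','math',50),(2,'张三','English',70),(3,'张三','Chinese',80),
  (4,'李四','math',80),(5,'李四','English',40),(6,'李四','Chinese',60);
INSERT INTO teacher VALUES
  (1,'jack','math'),(2,'tony','English'),(3,'ZhaoGang','Chinese');
""")

# Filter inside the ON clause: 张三's rows stay, but match no teacher,
# so their teacher column is NULL (None in Python).
on_rows = conn.execute("""
    SELECT s.id, s.name, s.subject, s.score, t.name
    FROM stu s
    LEFT JOIN teacher t
      ON s.subject = t.subject AND s.name != '张三'
    ORDER BY s.id
""").fetchall()

# Same filter in the WHERE clause: 张三's rows are dropped entirely.
where_rows = conn.execute("""
    SELECT s.id, s.name, s.subject, s.score, t.name
    FROM stu s
    LEFT JOIN teacher t ON s.subject = t.subject
    WHERE s.name != '张三'
    ORDER BY s.id
""").fetchall()

print(len(on_rows), len(where_rows))
```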
Digging deeper:
What if we still want the teachers for every student except 张三, but write the SQL as an inner join?
select s.id,
s.name,
s.subject,
s.score,
t.name
from stu s
join teacher t
on s.subject = t.subject and s.name != '张三'
Result:
+---+----+-------+-----+--------+
| id|name|subject|score| name|
+---+----+-------+-----+--------+
| 4|李四| math| 80| jack|
| 5|李四|English| 40| tony|
| 6|李四|Chinese| 60|ZhaoGang|
+---+----+-------+-----+--------+
The execution plan:
== Physical Plan ==
*(1) Project [id#90, name#91, subject#92, score#93, name#30]
+- *(1) BroadcastHashJoin [subject#92], [subject#31], Inner, BuildRight
:- *(1) Project [id#90, name#91, subject#92, score#93]
: +- *(1) Filter ((isnotnull(name#91) AND NOT (name#91 = 张三)) AND isnotnull(subject#92))
: +- *(1) ColumnarToRow
: +- FileScan parquet [id#90,name#91,subject#92,score#93] Batched: true, DataFilters: [isnotnull(name#91), NOT (name#91 = 张三), isnotnull(subject#92)], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/wmh/PycharmProjects/pythonProject/paruqet/stu], PartitionFilters: [], PushedFilters: [IsNotNull(name), Not(EqualTo(name,张三)), IsNotNull(subject)], ReadSchema: struct<id:int,name:string,subject:string,score:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false])), [id=#676]
+- LocalTableScan [name#30, subject#31]
And the alternative form, with the filter in WHERE:
select s.id,
s.name,
s.subject,
s.score,
t.name
from stu s
join teacher t
on s.subject=t.subject
where s.name != '张三'
Result:
+---+----+-------+-----+--------+
| id|name|subject|score| name|
+---+----+-------+-----+--------+
| 4|李四| math| 80| jack|
| 5|李四|English| 40| tony|
| 6|李四|Chinese| 60|ZhaoGang|
+---+----+-------+-----+--------+
The execution plan:
== Physical Plan ==
*(1) Project [id#90, name#91, subject#92, score#93, name#30]
+- *(1) BroadcastHashJoin [subject#92], [subject#31], Inner, BuildRight
:- *(1) Project [id#90, name#91, subject#92, score#93]
: +- *(1) Filter ((isnotnull(name#91) AND NOT (name#91 = 张三)) AND isnotnull(subject#92))
: +- *(1) ColumnarToRow
: +- FileScan parquet [id#90,name#91,subject#92,score#93] Batched: true, DataFilters: [isnotnull(name#91), NOT (name#91 = 张三), isnotnull(subject#92)], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/wmh/PycharmProjects/pythonProject/paruqet/stu], PartitionFilters: [], PushedFilters: [IsNotNull(name), Not(EqualTo(name,张三)), IsNotNull(subject)], ReadSchema: struct<id:int,name:string,subject:string,score:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false])), [id=#752]
+- LocalTableScan [name#30, subject#31]
Both forms return the same rows, and even their execution plans are identical.
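The inner-join equivalence can likewise be checked in any engine; with stdlib sqlite3 as the stand-in (tables rebuilt inline as before):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stu(id INT, name TEXT, subject TEXT, score INT);
CREATE TABLE teacher(id INT, name TEXT, subject TEXT);
INSERT INTO stu VALUES
  (1,'张三','math',50),(2,'张三','English',70),(3,'张三','Chinese',80),
  (4,'李四','math',80),(5,'李四','English',40),(6,'李四','Chinese',60);
INSERT INTO teacher VALUES
  (1,'jack','math'),(2,'tony','English'),(3,'ZhaoGang','Chinese');
""")

# Inner join with the filter in the ON clause.
on_rows = conn.execute("""
    SELECT s.id, s.name, s.subject, s.score, t.name
    FROM stu s
    JOIN teacher t ON s.subject = t.subject AND s.name != '张三'
    ORDER BY s.id
""").fetchall()

# Inner join with the same filter in the WHERE clause.
where_rows = conn.execute("""
    SELECT s.id, s.name, s.subject, s.score, t.name
    FROM stu s
    JOIN teacher t ON s.subject = t.subject
    WHERE s.name != '张三'
    ORDER BY s.id
""").fetchall()

same = on_rows == where_rows
print(same)
```

For an inner join, a non-matching row is dropped either way, so the two placements are interchangeable.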
Summary:
The point of this little post was to compare, for a two-table join, a filter written in the table's WHERE clause against the same filter written in the join's ON clause. In a LEFT JOIN, the two forms produce different results and different execution plans. In an INNER JOIN, both the results and the execution plans are identical (which basic SQL join semantics would also predict).
This test was a good refresher for me. If anyone reading has other insights, please share them in the comments below. Thanks!
