Investigating WHERE-clause filtering vs. ON-condition filtering in Spark joins

This post explores how, when joining two tables in Spark, placing a filter in the WHERE clause versus in the join's ON condition affects execution. Through examples it shows the execution plans and results under each approach, revealing when the filter is applied in different join types and how that changes the output. It concludes with how the position of a filter affects query efficiency and result consistency across join types.

Background


While recently processing data with Spark SQL, I was joining two tables and wanted to find out whether the way a filter is expressed affects execution speed.

Investigation

Data preparation

create table stu(
    id int,         -- unique id
    name string,    -- student name
    subject string, -- subject
    score int       -- score
)

Insert some data, shown below.

Table stu:
+---+----+-------+-----+
| id|name|subject|score|
+---+----+-------+-----+
|  1|张三|   math|   50|
|  2|张三|English|   70|
|  3|张三|Chinese|   80|
|  4|李四|   math|   80|
|  5|李四|English|   40|
|  6|李四|Chinese|   60|
+---+----+-------+-----+

Prepare the teacher table:

create table teacher(
 id int,        -- unique id
 name string,   -- teacher name
 subject string -- subject taught
)

Insert data:

+---+--------+-------+
| id|    name|subject|
+---+--------+-------+
|  1|    jack|   math|
|  2|    tony|English|
|  3|ZhaoGang|Chinese|
+---+--------+-------+

The question

Suppose we want the following: for each student, retrieve the teacher of each of their subjects.
The SQL that answers this is:

    select s.id,
          s.name,
          s.subject,
          s.score,
          t.name
     from stu s
left join teacher t
       on s.subject=t.subject 

The result:

+---+----+-------+-----+--------+
| id|name|subject|score|    name|
+---+----+-------+-----+--------+
|  3|张三|Chinese|   80|ZhaoGang|
|  5|李四|English|   40|    tony|
|  2|张三|English|   70|    tony|
|  6|李四|Chinese|   60|ZhaoGang|
|  4|李四|   math|   80|    jack|
|  1|张三|   math|   50|    jack|
+---+----+-------+-----+--------+

We inspect this query's execution plan with the explain function (the same applies below):

== Physical Plan ==
*(2) Project [id#1173, name#1174, subject#1175, score#1176, name#1310]
+- *(2) BroadcastHashJoin [subject#1175], [subject#1311], LeftOuter, BuildRight
  :- *(2) ColumnarToRow
  :  +- FileScan parquet default.stu[id#1173,name#1174,subject#1175,score#1176] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/stu], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,subject:string,score:int>
  +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true])), [id=#742]
     +- *(1) Project [name#1310, subject#1311]
        +- *(1) Filter isnotnull(subject#1311)
           +- *(1) ColumnarToRow
              +- FileScan parquet default.teacher[name#1310,subject#1311] Batched: true, DataFilters: [isnotnull(subject#1311)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/teacher], PartitionFilters: [], PushedFilters: [IsNotNull(subject)], ReadSchema: struct<name:string,subject:string>

Explanation: Spark reads both tables here; on the joined (teacher) table, filtering (the isnotnull(subject) filter) is applied during the scan, before the broadcast.

Every student except 张三, with the teacher of each subject
Now, if we mask out 张三's rows in the stu table, how will the SQL execute?

    select s.id,
           s.name,
           s.subject,
           s.score,
           t.name
      from stu s
 left join teacher t
        on s.subject=t.subject 
  where s.name != '张三'

Result:

+---+----+-------+-----+--------+
| id|name|subject|score|    name|
+---+----+-------+-----+--------+
|  4|李四|   math|   80|    jack|
|  5|李四|English|   40|    tony|
|  6|李四|Chinese|   60|ZhaoGang|
+---+----+-------+-----+--------+

In this example, does the SQL engine first join the two tables and then filter the result, or does it filter the data first and then join? No more talk; look at the execution plan:

== Physical Plan ==
*(2) Project [id#1173, name#1174, subject#1175, score#1176, name#1310]
+- *(2) BroadcastHashJoin [subject#1175], [subject#1311], LeftOuter, BuildRight
   :- *(2) Project [id#1173, name#1174, subject#1175, score#1176]
   :  +- *(2) Filter (isnotnull(name#1174) AND NOT (name#1174 = 张三))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.stu[id#1173,name#1174,subject#1175,score#1176] Batched: true, DataFilters: [isnotnull(name#1174), NOT (name#1174 = 张三)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/stu], PartitionFilters: [], PushedFilters: [IsNotNull(name), Not(EqualTo(name,张三))], ReadSchema: struct<id:int,name:string,subject:string,score:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true])), [id=#791]
      +- *(1) Project [name#1310, subject#1311]
         +- *(1) Filter isnotnull(subject#1311)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.teacher[name#1310,subject#1311] Batched: true, DataFilters: [isnotnull(subject#1311)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-warehouse/teacher], PartitionFilters: [], PushedFilters: [IsNotNull(subject)], ReadSchema: struct<name:string,subject:string>

From this execution plan we can see that after the two tables are loaded, the stu table is filtered first, keeping only rows with name != 张三, and the filtered data is then joined with the other table.
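The equivalence the optimizer is exploiting here can be simulated without Spark. The following is a minimal plain-Python sketch (the `left_join` helper is a hypothetical stand-in, not a Spark API): for a LEFT JOIN, filtering the left table by a WHERE clause after the join yields the same rows as filtering it before the join, which is what makes predicate pushdown safe.

```python
# Plain-Python simulation of the LEFT JOIN above (not Spark itself).
stu = [
    (1, "张三", "math", 50), (2, "张三", "English", 70), (3, "张三", "Chinese", 80),
    (4, "李四", "math", 80), (5, "李四", "English", 40), (6, "李四", "Chinese", 60),
]
teacher = [(1, "jack", "math"), (2, "tony", "English"), (3, "ZhaoGang", "Chinese")]

def left_join(left, right):
    """LEFT JOIN on subject; unmatched left rows are kept with None."""
    out = []
    for sid, sname, subj, score in left:
        matches = [t for t in right if t[2] == subj]
        if matches:
            out.extend((sid, sname, subj, score, t[1]) for t in matches)
        else:
            out.append((sid, sname, subj, score, None))
    return out

# WHERE applied after the join vs. the filter pushed below the join:
where_after = [r for r in left_join(stu, teacher) if r[1] != "张三"]
pushed_down = left_join([s for s in stu if s[1] != "张三"], teacher)
assert sorted(where_after) == sorted(pushed_down)  # same 3 rows either way
```

Because the two orderings always agree for a WHERE filter on the left table, Spark is free to push the filter into the scan, which is exactly what the plan shows.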

Filtering inside the join condition

   select s.id,
          s.name,
          s.subject,
          s.score,
          t.name
     from stu s
left join teacher t
       on s.subject = t.subject and s.name != '张三'

The result:

+---+----+-------+-----+--------+
| id|name|subject|score|    name|
+---+----+-------+-----+--------+
|  1|张三|   math|   50|    null|
|  2|张三|English|   70|    null|
|  3|张三|Chinese|   80|    null|
|  4|李四|   math|   80|    jack|
|  5|李四|English|   40|    tony|
|  6|李四|Chinese|   60|ZhaoGang|
+---+----+-------+-----+--------+

The execution plan is now:

== Physical Plan ==
*(1) Project [id#90, name#91, subject#92, score#93, name#30]
+- *(1) BroadcastHashJoin [subject#92], [subject#31], LeftOuter, BuildRight, NOT (name#91 = 张三)
   :- *(1) ColumnarToRow
   :  +- FileScan parquet [id#90,name#91,subject#92,score#93] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/wmh/PycharmProjects/pythonProject/paruqet/stu], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,subject:string,score:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false])), [id=#446]
      +- LocalTableScan [name#30, subject#31]

Here we can see that the filter is evaluated during the join itself. In a left join, when the filter is written in the WHERE clause, the data is filtered while the table is being loaded (predicate pushdown), so the left side contributes only the filtered rows. When the same filter is written in the ON condition instead, the left side contributes all of its rows, so the data differs and the two forms produce different final results.
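The difference can be made concrete with a small plain-Python sketch (again, `left_join` is a hypothetical helper, not Spark): an extra predicate in the ON clause only decides whether a right row matches, so left rows that fail it still survive with NULL, whereas the same predicate in WHERE drops them entirely.

```python
# Plain-Python simulation of ON-filter vs. WHERE-filter in a LEFT JOIN.
stu = [
    (1, "张三", "math", 50), (2, "张三", "English", 70), (3, "张三", "Chinese", 80),
    (4, "李四", "math", 80), (5, "李四", "English", 40), (6, "李四", "Chinese", 60),
]
teacher = [(1, "jack", "math"), (2, "tony", "English"), (3, "ZhaoGang", "Chinese")]

def left_join(left, right, on):
    """LEFT JOIN where on(l, r) is the full join condition."""
    out = []
    for l in left:
        matched = [r for r in right if on(l, r)]
        if matched:
            out.extend(l + (r[1],) for r in matched)
        else:
            out.append(l + (None,))  # unmatched left row kept with NULL
    return out

# ON s.subject = t.subject AND s.name != '张三'
on_filtered = left_join(stu, teacher, lambda l, r: l[2] == r[2] and l[1] != "张三")
# ON s.subject = t.subject ... WHERE s.name != '张三'
where_filtered = [row for row in left_join(stu, teacher, lambda l, r: l[2] == r[2])
                  if row[1] != "张三"]

assert len(on_filtered) == 6     # 张三 rows kept, teacher name is None
assert len(where_filtered) == 3  # 张三 rows dropped
```

This mirrors the tables above: the ON-condition version returns six rows (张三's rows with null teachers), while the WHERE version returns only 李四's three rows.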


Further investigation:

Still computing, for every student except 张三, the teacher of each subject, but now written as an inner join:

   select s.id,
          s.name,
          s.subject,
          s.score,
          t.name
     from stu s
 join teacher t
       on s.subject = t.subject and s.name != '张三'

Result:


+---+----+-------+-----+--------+
| id|name|subject|score|    name|
+---+----+-------+-----+--------+
|  4|李四|   math|   80|    jack|
|  5|李四|English|   40|    tony|
|  6|李四|Chinese|   60|ZhaoGang|
+---+----+-------+-----+--------+

The execution plan:

== Physical Plan ==
*(1) Project [id#90, name#91, subject#92, score#93, name#30]
+- *(1) BroadcastHashJoin [subject#92], [subject#31], Inner, BuildRight
   :- *(1) Project [id#90, name#91, subject#92, score#93]
   :  +- *(1) Filter ((isnotnull(name#91) AND NOT (name#91 = 张三)) AND isnotnull(subject#92))
   :     +- *(1) ColumnarToRow
   :        +- FileScan parquet [id#90,name#91,subject#92,score#93] Batched: true, DataFilters: [isnotnull(name#91), NOT (name#91 = 张三), isnotnull(subject#92)], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/wmh/PycharmProjects/pythonProject/paruqet/stu], PartitionFilters: [], PushedFilters: [IsNotNull(name), Not(EqualTo(name,张三)), IsNotNull(subject)], ReadSchema: struct<id:int,name:string,subject:string,score:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false])), [id=#676]
      +- LocalTableScan [name#30, subject#31]

The alternative way of writing the SQL:

    select s.id,
           s.name,
           s.subject,
           s.score,
           t.name
      from stu s
  join teacher t
        on s.subject=t.subject
  where s.name != '张三'

Result:

+---+----+-------+-----+--------+
| id|name|subject|score|    name|
+---+----+-------+-----+--------+
|  4|李四|   math|   80|    jack|
|  5|李四|English|   40|    tony|
|  6|李四|Chinese|   60|ZhaoGang|
+---+----+-------+-----+--------+

The execution plan:

== Physical Plan ==
*(1) Project [id#90, name#91, subject#92, score#93, name#30]
+- *(1) BroadcastHashJoin [subject#92], [subject#31], Inner, BuildRight
   :- *(1) Project [id#90, name#91, subject#92, score#93]
   :  +- *(1) Filter ((isnotnull(name#91) AND NOT (name#91 = 张三)) AND isnotnull(subject#92))
   :     +- *(1) ColumnarToRow
   :        +- FileScan parquet [id#90,name#91,subject#92,score#93] Batched: true, DataFilters: [isnotnull(name#91), NOT (name#91 = 张三), isnotnull(subject#92)], Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/wmh/PycharmProjects/pythonProject/paruqet/stu], PartitionFilters: [], PushedFilters: [IsNotNull(name), Not(EqualTo(name,张三)), IsNotNull(subject)], ReadSchema: struct<id:int,name:string,subject:string,score:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false])), [id=#752]
      +- LocalTableScan [name#30, subject#31]

Both ways of writing the query produce the same result, and even their execution plans are identical.
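The inner-join equivalence can also be checked with a plain-Python sketch (the `inner_join` helper is hypothetical, not Spark): since an inner join drops non-matching rows anyway, a predicate in ON and the same predicate in WHERE keep exactly the same rows.

```python
# Plain-Python simulation of ON-filter vs. WHERE-filter in an INNER JOIN.
stu = [
    (1, "张三", "math", 50), (2, "张三", "English", 70), (3, "张三", "Chinese", 80),
    (4, "李四", "math", 80), (5, "李四", "English", 40), (6, "李四", "Chinese", 60),
]
teacher = [(1, "jack", "math"), (2, "tony", "English"), (3, "ZhaoGang", "Chinese")]

def inner_join(left, right, on):
    """INNER JOIN: only pairs satisfying on(l, r) are emitted."""
    return [l + (r[1],) for l in left for r in right if on(l, r)]

# ON s.subject = t.subject AND s.name != '张三'
on_version = inner_join(stu, teacher, lambda l, r: l[2] == r[2] and l[1] != "张三")
# ON s.subject = t.subject ... WHERE s.name != '张三'
where_version = [row for row in inner_join(stu, teacher, lambda l, r: l[2] == r[2])
                 if row[1] != "张三"]

assert sorted(on_version) == sorted(where_version)  # identical rows
```

This is why the optimizer can normalize both SQL forms into the same physical plan: for an inner join the placement of the predicate does not change the result.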

Summary:

The goal of this little post was to explore what happens during a two-table join when one filter is written in the WHERE clause and the same filter is written in the ON condition of the join. In a left join, the two forms yield different results, and their execution plans differ as well. In an inner join, both the execution plans and the results are identical (which can also be inferred from basic SQL join semantics). This test was a good refresher. If any readers have other insights, please share them in the comments below. Thanks!
