关于in/not in ,exists/not exists 以及left join and null 的执行效率对比
先抛出执行效率结论:left join > exists/not exists > in/not in
一、建表以及插入测试数据
create table test (
`id` string comment '用户id',
`taskno` string comment 'task',
`part` string comment '访问时间')
comment '用户访问记录表'
row format delimited fields terminated by '\t'
lines terminated by '\n' stored as orc;
insert overwrite table test
select '1001' as id, '10011' as taskno, '轮胎' as part union all
select '1001' as id, '10011' as taskno, '前保险杠皮' as part union all
select '1002' as id, '10021' as taskno, '钢圈' as part union all
select '1001' as id, '10011' as taskno, '前下摆臂(右)' as part union all
select '1002' as id, '10021' as taskno, '后下摆臂(左)' as part union all
select '1001' as id, '10011' as taskno, '后桥总成' as part union all
select '1001' as id, '10011' as taskno, '倒车镜(右)' as part union all
select '1002' as id, '10021' as taskno, '前大灯(右)' as part union all
select '1001' as id, '10011' as taskno, '前轮挡泥板(右)' as part union all
select '1001' as id, '10011' as taskno, '前减震器(右)' as part ;
二、如下示例:此表中name只要等于c,就去除该name对应的id,也就是id=2**
id | name |
---|---|
1 | a |
2 | b |
2 | c |
3 | c |
最后得出如下表:
id | name |
---|---|
1 | a |
说回test表
2、找出表中part字段包含有“钢圈”的对应的ID,并去除表中所有此ID
in/not in >>>>> 用时------1m21s
select a.* from test a where a.id='1001' and a.taskno='10011'
and a.id not in (select b.id from test b where b.part like '%钢圈%')
exists/not exists >>>>> 用时------37.187s
select * from test a where a.id='1001' and a.taskno='10011' and not EXISTS (
select b.id from test b where b.part like '%钢圈%' and a.id=b.id)
join >>>>> 用时--------18.221s
select a.* from test2 a
left join
(
select b.id from test2 b where b.part like '%钢圈%') b
on a.id =b.id where a.id='1001' and a.taskno='10011' and b.id is NULL
三、执行顺序
简单来说就是非相关子查询(in/not in)和相关子查询(exists/not exists)的执行效率的比较
解释一下概念以及执行顺序:
1、非相关子查询(in/not in)
即子查询能够独立执行
select a.* from test a where a.id='1001' and a.taskno='10011'
and a.id not in (select b.id from test b where b.part like '%钢圈%')
第一步:先执行子查询
select b.id from test b where b.part like '%钢圈%'
第二步:将子查询的结果替换成子查询,如第一步的结果是 “1002”
select a.* from test a where a.id='1001' and a.taskno='10011'
and a.id not in ('1002')
第三步:返回最后结果
2、相关子查询(exists/not exists)
即子查询不能够独立执行
select * from test a where a.id='1001' and a.taskno='10011' and not EXISTS (
select b.id from test b where b.part like '%钢圈%' and a.id=b.id)
第一步:先执行外查询
select * from test a where a.id='1001' and a.taskno='10011'
第二步:将外查询的元组(好多人说的元组,其实就是查询结果临时表)依次传入子查询中,如果结果为true,则返回结果
select a.* from test a where a.id='1001' and a.taskno='10011'
and a.id not in ('1002')
第三步:子查询反复外查询传入的结果,得到最终结果
总结
如果可以的话尽量按照join >exists > in 不仅准确性高而且效率也是会有明显提高。
但我看有些文章说数据如果足够小用in效率会比较高,我不是很认同,因为我这数据已经足够小了.
当然这是我一家之言,如果大家有什么不同意见可以探讨探讨