Hive: NOT IN + IN row counts do not add up to the total row count

韩家小志

Article link: https://blog.youkuaiyun.com/qq_46893497/article/details/127451005

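This post records a Hive query pitfall: when NOT IN is used without explicitly accounting for NULL values, the counts come out wrong. A NOT IN predicate in Hive behaves as if it carried an implicit IS NOT NULL condition, so rows whose value is NULL are silently dropped. The fix is to combine the NOT IN query with an IS NULL condition so that the counts add up.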

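While helping a colleague look into a bug today I ran into this problem again, so I am writing it down (I forgot to record it the last time I hit it myself).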

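Problem:

A column contained NULL values. With many joined tables and unfamiliar data, this led to a small bug.
When querying the data, the NOT IN row count plus the IN row count did not equal the total row count.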

Reproduction:

Step 1: prepare the data

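-- Recreate the demo table: three rows with an id and two rows whose id is NULL
drop table if exists bi_temp.temp_hzy_20221021;
create table bi_temp.temp_hzy_20221021 as
select 1 as id, '张三' as name union all
select 2 as id, '李四' as name union all
select 3 as id, '王五' as name union all
select null as id, '赵六' as name union all
select null as id, '孙七' as name
;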

Step 2: inspect the data
- 5 rows in total

select * from bi_temp.temp_hzy_20221021;

Step 3: check the rows returned by NOT IN
- 1 row, id=3; the rows with a NULL id are not returned

select * 
from bi_temp.temp_hzy_20221021
where id not in (1,2)
;

Step 4: check the rows returned by IN
- 2 rows, id=1 and id=2; the rows with a NULL id are not returned

select * 
from bi_temp.temp_hzy_20221021
where id in (1,2)
;

Step 5: spot the problem
- NOT IN (1 row) + IN (2 rows) = 3 rows, which does not equal the total of 5 rows
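
The mismatch can also be seen in a single query. The following is a minimal sketch that assumes the demo table from step 1; the column aliases (`total_rows`, `in_rows`, `not_in_rows`, `null_rows`) are names made up for illustration.

```sql
-- Count each bucket in one pass over the demo table.
-- A CASE whose condition evaluates to NULL falls through to the ELSE branch,
-- so rows with a NULL id are counted in neither in_rows nor not_in_rows.
select
    count(*)                                             as total_rows,
    sum(case when id in (1, 2)     then 1 else 0 end)    as in_rows,
    sum(case when id not in (1, 2) then 1 else 0 end)    as not_in_rows,
    sum(case when id is null       then 1 else 0 end)    as null_rows
from bi_temp.temp_hzy_20221021
;
```

On the five demo rows this should give 5, 2, 1 and 2: the two missing rows are exactly the ones counted in `null_rows`.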

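Cause:

When IN or NOT IN is used in a Hive WHERE clause, it behaves as if an IS NOT NULL condition were hidden inside it. Comparing NULL with any value yields NULL rather than true or false, and WHERE keeps only the rows for which the condition is true, so rows with a NULL id satisfy neither IN nor NOT IN.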
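
To make this visible, the predicates can be selected as ordinary columns instead of being used as filters. This is a small sketch against the demo table from step 1; `in_result` and `not_in_result` are illustrative aliases, not names from the original post.

```sql
-- Show what each predicate evaluates to for every row.
select
    id,
    id in (1, 2)     as in_result,
    id not in (1, 2) as not_in_result
from bi_temp.temp_hzy_20221021
;
-- For the two rows where id is NULL, both in_result and not_in_result are NULL,
-- which is why neither "where id in (1,2)" nor "where id not in (1,2)" keeps them.
```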

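Solution:

Account for the NULL rows explicitly: the NOT IN count plus the IN count plus the IS NULL count equals the total.
In other words, as far as the returned rows are concerned,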


select * 
from bi_temp.temp_hzy_20221021
where id not in (1,2)
;

is equivalent to


select * 
from bi_temp.temp_hzy_20221021
where id not in (1,2) and id is not null
;
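
If the rows with a NULL id should count as "not in (1,2)", one way to write this is to add the IS NULL condition explicitly. This is a sketch against the demo table from step 1, and it assumes that treating NULL as "not in the list" is the behaviour you actually want.

```sql
-- Keep the rows whose id is outside the list, plus the rows whose id is NULL.
select *
from bi_temp.temp_hzy_20221021
where id not in (1, 2)
   or id is null
;
```

On the demo data this returns 3 rows (id=3 and the two NULL-id rows), which together with the 2 rows from the IN query adds up to the total of 5.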