Five Ways to Deduplicate in Hive

Author: 不想起的昵称

License: CC 4.0 BY-SA

Category: hive. Tags: big data, hadoop, hive

Article link: https://blog.youkuaiyun.com/weixin_40267121/article/details/119173532

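This article uses five SQL examples to show how to deduplicate data in Hive for practical tasks such as keeping one row per user and computing daily new users. It covers distinct, group by, row_number(), left join, and a tag-based "bit operation" (union all + group by), and compares where each one applies. In the sample runs below, the tag-based approach found the new users in about half the time of the left join (10.4 s versus 19.8 s).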

1.distinct

Problem:
Keep only one row per user under each app.
Example:

spark-sql> with test1 as
         > (select 122 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid)
         > select 
         >   distinct userid,apptypeid
         > from test1;
122	100024
123	100024
Time taken: 4.781 seconds, Fetched 2 row(s)

2.group by

Problem:
Keep only one row per user under each app.
Example:

spark-sql> with test1 as
         > (select 122 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid)
         > select 
         >   userid,
         >   apptypeid
         > from 
         > (select 
         >   userid,
         >   apptypeid
         > from test1) t1
         > group by userid,apptypeid;
122	100024
123	100024
Time taken: 10.5 seconds, Fetched 2 row(s)

3.row_number()

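Problem:
For each user under each app, take the most recent channel, version, and operating system data.
Analysis:
distinct only removes exact duplicate rows, so it cannot solve this; group by likewise only deduplicates on the grouping columns; order by only sorts. This is where row_number() comes in: it partitions the rows by app and user, sorts each partition by dateline in descending order, and numbers the rows, so the row numbered 1 is the most recent one.
Example: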

spark-sql> with test1 as
         > (select 122 as userid,100024 as apptypeid,'appstore' as qid,'ios' as os,'1.0.2' as ver,1627440618 as dateline
         > union all
         > select 123 as userid,100024 as apptypeid,'huawei' as qid,'android' as os,'1.0.3' as ver,1627440620 as dateline
         > union all
         > select 123 as userid,100024 as apptypeid,'huawei' as qid,'android' as os,'1.0.4' as ver,1627440621 as dateline)
         > select 
         >   userid,
         >   apptypeid,
         >   qid,
         >   os,
         >   ver
         > from 
         > (select 
         >   userid,
         >   apptypeid,
         >   qid,
         >   os,
         >   ver,
         >   row_number() over(distribute by apptypeid,userid sort by dateline desc) as rank
         > from test1) t1
         > where t1.rank=1;
122	100024	appstore	ios	1.0.2
123	100024	huawei	android	1.0.4
Time taken: 5.286 seconds, Fetched 2 row(s)
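
To make the "partition, sort, keep row 1" logic concrete, here is a minimal Python sketch of the same idea outside of SQL. It is only an illustration: the `rows` list and the `latest_per_user` helper are hypothetical names, and the data simply mirrors test1 from the query above.

```python
# Hypothetical in-memory copy of test1 from the row_number() example.
rows = [
    {"userid": 122, "apptypeid": 100024, "qid": "appstore", "os": "ios",
     "ver": "1.0.2", "dateline": 1627440618},
    {"userid": 123, "apptypeid": 100024, "qid": "huawei", "os": "android",
     "ver": "1.0.3", "dateline": 1627440620},
    {"userid": 123, "apptypeid": 100024, "qid": "huawei", "os": "android",
     "ver": "1.0.4", "dateline": 1627440621},
]


def latest_per_user(rows):
    """Keep, for each (apptypeid, userid), the row with the largest dateline.

    This plays the role of row_number() ... sort by dateline desc
    followed by the filter rank = 1.
    """
    latest = {}
    for row in rows:
        key = (row["apptypeid"], row["userid"])
        # Replace the stored row only when this one is more recent.
        if key not in latest or row["dateline"] > latest[key]["dateline"]:
            latest[key] = row
    # Sort by key so the output order is deterministic.
    return [latest[key] for key in sorted(latest)]


result = latest_per_user(rows)
for r in result:
    print(r["userid"], r["apptypeid"], r["qid"], r["os"], r["ver"])
```

If two rows for the same user share the same dateline, this sketch keeps the first one it sees; row_number() makes a similarly arbitrary choice unless the sort includes a tie-breaker column.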

4.left join

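Problem:
Find each day's new users. There is a daily user table test1 and a historical new-user table test2 (new users: one row per user under each app).
Analysis:
1. Deduplicate the daily user table test1 with group by to get each day's distinct users.
2. Left join those users to the historical new-user table; users with no match in the history table are that day's new users.
Example: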

spark-sql> with test1 as
         > (select 122 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid),
         > 
         > test2 as
         > (select 122 as userid,100024 as apptypeid
         > union all
         > select 124 as userid,100024 as apptypeid
         > union all
         > select 125 as userid,100024 as apptypeid)
         > select 
         >   t1.userid,
         >   t1.apptypeid
         > from 
         > (select 
         >   userid,
         >   apptypeid
         > from test1
         > group by userid,apptypeid) t1
         > 
         > left join
         > (select 
         >   userid,
         >   apptypeid
         > from test2) t2
         > on t1.apptypeid=t2.apptypeid and t1.userid=t2.userid
         > where t2.userid is null;
123	100024
Time taken: 19.816 seconds, Fetched 1 row(s)
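
The left join with `t2.userid is null` is an anti-join: keep the daily users that have no partner in the history table. A minimal Python sketch of that logic, assuming the two tables fit in memory as lists of (userid, apptypeid) tuples (the `new_users_by_anti_join` helper is a hypothetical name, not part of Hive or Spark):

```python
# Hypothetical in-memory copies of test1 (daily) and test2 (history).
test1 = [(122, 100024), (123, 100024), (123, 100024)]
test2 = [(122, 100024), (124, 100024), (125, 100024)]


def new_users_by_anti_join(daily, history):
    """Return daily (userid, apptypeid) pairs that are absent from history."""
    daily_unique = set(daily)   # same effect as group by userid, apptypeid
    known = set(history)        # the historical new-user table
    # Set difference corresponds to left join ... where t2.userid is null.
    return sorted(daily_unique - known)


new_users = new_users_by_anti_join(test1, test2)
print(new_users)
```

Only user 123 survives, matching the single row returned by the query above.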

5.Bit operation: union all+group by

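Problem:
Find each day's new users. There is a daily user table test1 and a historical new-user table test2 (new users: one row per user under each app).
Analysis:
1. Deduplicate the daily user table test1 with group by to get each day's distinct users.
2. Tag each daily user row with 10 and each historical new-user row with 1 (the tags used for the bit operation).
3. Concatenate the two sets with union all and sum the tags per user and app. A sum of 11 means the user appears in both tables, 1 means history only, and 10 means today only, so the rows whose sum is 10 are the day's new users.
Example: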

spark-sql> with test1 as
         > (select 122 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid
         > union all
         > select 123 as userid,100024 as apptypeid),
         > 
         > test2 as
         > (select 122 as userid,100024 as apptypeid
         > union all
         > select 124 as userid,100024 as apptypeid
         > union all
         > select 125 as userid,100024 as apptypeid)
         > 
         > select 
         >   userid,
         >   apptypeid
         > from 
         > (select 
         >   sum(tag) as tag,
         >   userid,
         >   apptypeid
         > from 
         > (select 
         >   10 as tag,
         >   t1.userid,
         >   t1.apptypeid
         > from 
         > (select 
         >   userid,
         >   apptypeid
         > from test1
         > group by userid,apptypeid) t1
         > 
         > union all
         > select 
         >   1 as tag,
         >   userid,
         >   apptypeid
         > from test2) t2
         > group by userid,apptypeid) t3
         > where t3.tag=10;
123	100024
Time taken: 10.428 seconds, Fetched 1 row(s)
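
The tag arithmetic can be checked by hand with a small Python sketch. This is an illustration under stated assumptions, not the author's implementation: the `tag_sums` helper and the constant names are hypothetical, and the data mirrors test1 and test2 from the query above.

```python
from collections import defaultdict

DAILY_TAG = 10    # tag given to deduplicated daily users (test1)
HISTORY_TAG = 1   # tag given to historical new users (test2)

# Hypothetical in-memory copies of test1 (daily) and test2 (history).
test1 = [(122, 100024), (123, 100024), (123, 100024)]
test2 = [(122, 100024), (124, 100024), (125, 100024)]


def tag_sums(daily, history):
    """Sum the tags per (userid, apptypeid), like union all + group by."""
    sums = defaultdict(int)
    for key in set(daily):      # deduplicate first, as the inner group by does
        sums[key] += DAILY_TAG
    for key in history:         # assumed to hold one row per user per app
        sums[key] += HISTORY_TAG
    return dict(sums)


sums = tag_sums(test1, test2)
# Keep only the keys seen today and never in the history table.
new_users = sorted(key for key, total in sums.items() if total == DAILY_TAG)
print(sums)
print(new_users)
```

Here user 122 sums to 11, users 124 and 125 sum to 1, and user 123 sums to 10, so only 123 is new. The check `total == DAILY_TAG` relies on the history table holding at most one row per user per app, as the problem statement says; with duplicate history rows the sums would no longer map cleanly to the three cases.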