Stop Using count(distinct)

Original article, last modified 2024-10-16 15:08:45

License: CC 4.0 BY-SA

This is an original article by the blogger. Please credit the source when reposting: http://blog.youkuaiyun.com/wsdc0521

Tags: #data-warehouse #optimization #distinct #SQL #big-data

First published 2021-06-13 21:27:49

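This article shares a way to speed up distinct counts over large data volumes in data warehouse development: deduplicate first, then aggregate, which avoids the performance bottleneck of count(distinct). An example shows how to replace the distinct operation with a subquery and case when to make the query faster.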

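In data warehouse development we often need to deduplicate data before counting it. On large data volumes, count(distinct ...) consumes a lot of resources and runs slowly.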

Below is the optimization I use most often, offered for your reference.

Original SQL:

select 
  group_id,
  app_id,
  count(distinct case when dt>='${7d_before}' then user_id else null end) as 7d_uv, -- UV in the last 7 days
  count(distinct case when dt>='${14d_before}' then user_id else null end) as 14d_uv -- UV in the last 14 days
from tbl
where dt>='${14d_before}'
group by 
  group_id,
  app_id
;

Optimized:

Deduplicate first, then aggregate.

select  group_id
        ,app_id
        ,sum(case when 7d_cnt>0 then 1 else 0 end) AS 7d_uv -- UV in the last 7 days
        ,sum(case when 14d_cnt>0 then 1 else 0 end) AS 14d_uv -- UV in the last 14 days
from    (
        select   
            group_id,
            app_id,
            user_id, -- deduplicate by user_id (one row per user after grouping)
            count(case when dt>='${7d_before}' then user_id else null end) as 7d_cnt, -- clicks per user in the last 7 days
            count(case when dt>='${14d_before}' then user_id else null end) as 14d_cnt -- clicks per user in the last 14 days
        from tbl
        where dt>='${14d_before}'
        group by 
            group_id,
            app_id,
            user_id
        ) a
group by group_id,
         app_id
;
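
The same "deduplicate first, then aggregate" idea can also be written with a 0/1 flag per user instead of a per-user click count. The following is only a sketch under assumptions: it reuses the table tbl and the variables from the query above, and the column names in_7d and in_14d are made up for illustration. Because a flag no longer depends on user_id being non-NULL, the sketch filters NULL user_id rows explicitly so that it matches count(distinct), which ignores NULLs.

```sql
-- Sketch only: same table (tbl) and variables as above; in_7d / in_14d are illustrative names.
select
  group_id,
  app_id,
  sum(in_7d)  as 7d_uv,  -- UV in the last 7 days
  sum(in_14d) as 14d_uv  -- UV in the last 14 days
from (
  select
    group_id,
    app_id,
    user_id,  -- one row per (group_id, app_id, user_id) after grouping
    max(case when dt >= '${7d_before}' then 1 else 0 end)  as in_7d,   -- 1 if the user has any row in the last 7 days
    max(case when dt >= '${14d_before}' then 1 else 0 end) as in_14d   -- 1 if the user has any row in the last 14 days
  from tbl
  where dt >= '${14d_before}'
    and user_id is not null  -- count(distinct) skips NULLs; without this the NULL group would count as one user
  group by
    group_id,
    app_id,
    user_id
) a
group by
  group_id,
  app_id
;
```

One difference to be aware of: a (group_id, app_id) pair whose rows all have a NULL user_id disappears from this result, whereas the count-based version above still returns it with a UV of 0. Before switching a production job, it is worth running the old and new queries on the same partition range and comparing the results.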

I hope this article helps. If it did, please give it a like to encourage the author. Thanks!