with as 和group by 代替 count distinct，提高性能

最新推荐文章于 2025-02-02 22:47:31 发布

kkkkkk-00a00

最新推荐文章于 2025-02-02 22:47:31 发布

阅读量1k

点赞数 1

CC 4.0 BY-SA版权

本文链接：https://blog.youkuaiyun.com/kevin77ya/article/details/46392087

本文深入探讨了在PostgreSQL中使用distinct进行去重操作时性能瓶颈的问题，通过对比原始全表扫描方法与优化后利用HashAggregate的方法，展示了如何显著提升查询效率。详细分析了两者执行计划的区别，揭示了HashAggregate的去重机制及性能优势，为数据库优化提供实用指南。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

数据库：postgresql

背景：使用distinct在对某张表某个字段做去重统计的时候，发现有统计特别慢（30-40s），当前表大小是400w，distinct 后大概60w

sql 很简单（出于安全考虑，字段和表名称都做了替换）:

原sql：

select count(distinct(name)) d from test.a

优化后sql：

with c as

(

select count(1) from test.a group by name

)

select count(1) from c;

查看两者的执行计划

执行计划1：

test=> EXPLAIN ANALYZE select count(distinct(name)) d from test.a;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------

Aggregate (cost=141971.71..141971.72 rows=1 width=14) (actual time=29829.643..29829.644 rows=1 loops=1)

-> Seq Scan on a (cost=0.00..129323.37 rows=5059337 width=14) (actual time=0.007..772.693 rows=5371860 loops=1)

Total runtime: 29829.683 ms

(3 rows)

执行计划2：

explain ANALYZE with c as

(

select count(1) from test.a group by name

)

select count(1) from c;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------

Aggregate (cost=161422.68..161422.70 rows=1 width=0) (actual time=4296.913..4296.913 rows=1 loops=1)

CTE c

-> HashAggregate (cost=154618.10..156711.82 rows=209372 width=14) (actual time=3636.989..3981.577 rows=777972 loops=1)

-> Seq Scan on a (cost=0.00..129321.73 rows=5059273 width=14) (actual time=0.031..1097.439 rows=5371718 loops=1)

-> CTE Scan on c (cost=0.00..4187.44 rows=209372 width=0) (actual time=3636.994..4241.660 rows=777972 loops=1)

Total runtime: 4319.893 ms

(6 rows)

可以看到，对于两个sql而言都需要进行全表扫描，不同点在于，扫描后的聚合方式。

对于group by 而言使用的是 HashAggregate，通过一个hash表来进行去重（但是不能出现hash冲突，内存消耗大），但是速度快。可以理解成 o(n)

对于 distinct 来说，聚合数据的时候，是通过对数据进行排序来去重的，性能最快也就是 nlogn

参考：

性能比较： http://blog.pgaddict.com/posts/count-distinct-improvements

pg explain： http://blog.youkuaiyun.com/ls3648098/article/details/7602136