在实际应用中,我们经常需要计算数据库中某张表的某个字段的的最小值/最大值,并用它来作为查询条件。
使用SQL求数据库中表字段的最小值/最大值,通常由两种方法:
一种是 使用聚集函数 min()/max(), 另一种是使用order BY + “字段” 子句,对该表排序之后,取第一条记录中该字段的值。
本文在PostgreSQL 11.4上,对min() 和 order by … limit 1 以及它们的变体做一个全方位的对比。
现在有一张表tb_person,它的定义如下:
create table tb_person (
person_id integer not null,
person_name varchar(64),
sex int,
birthday date,
phone varchar(64),
job_no varchar(48),
department_id int,
create_time timestamp with time zone,
update_time timestamp with time zone,
constraint pk_tb_person primary key (person_id)
);
comment on table tb_person is '人员表';
comment on column tb_person.person_id is '主键id';
comment on column tb_person.person_name is '人员姓名';
comment on column tb_person.sex is '性别。0:男性;1:女性';
comment on column tb_person.birthday is '出生日期';
comment on column tb_person.phone is '联系电话';
comment on column tb_person.job_no is '员工工号';
comment on column tb_person.department_id is '所属部门id';
comment on column tb_person.create_time is '创建时间';
comment on column tb_person.update_time is '更新时间';
create index idx_tb_person_job_no on tb_person (job_no);
create index idx_tb_person_department_id on tb_person (department_id);
create index idx_tb_person_birthday on tb_person (birthday);
我们通过下面的sql 向表中插入300000 行数据:
insert into tb_person (person_id, person_name, sex, birthday, phone, job_no, department_id, create_time, update_time)
select num, substr('abcdefghijklmnopqrstuvwxyz', mod(num,26), 10), ceil(random()*2), '1970-01-01'::date + (interval '1 hour') * num, null,
to_char(num, 'FM000000'), ceil(random()*3000), now(), now() from
(select generate_series(1,300000) as num) a;
接下来我们通过一些查询案例来对比。
1. 查找人员表中最小的工号
获得人员表中最小的工号,有两种方法:
select min(job_no) from tb_person;
和
select job_no from tb_person order by job_no limit 1;
现在我们用 explain analyze,查看两种方法的查询计划,比较它们的性能。我们主要关注cost 和 execution time 两个指标。
查看第一种方法的查询计划:
explain analyze
select min(job_no) from tb_person;
结果如下:
Result (cost=0.45..0.46 rows=1 width=32) (actual time=0.029..0.030 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.42..0.45 rows=1 width=32) (actual time=0.023..0.024 rows=1 loops=1)
-> Index Only Scan using idx_tb_person_job_no on tb_person (cost=0.42..8550.42 rows=300000 width=32) (actual time=0.020..0.020 rows=1 loops=1)
Index Cond: (job_no IS NOT NULL)
Heap Fetches: 0
Planning time: 0.161 ms
Execution time: 0.060 ms
查看第二种方法的查询计划:
explain analyze
select job_no from tb_person order by job_no limit 1;
结果如下:
Limit (cost=0.42..0.46 rows=1 width=61) (actual time=0.032..0.033 rows=1 loops=1)
-> Index Scan using idx_tb_person_job_no on tb_person (cost=0.42..11213.42 rows=300000 width=61) (actual time=0.028..0.028 rows=1 loops=1)
Planning time: 0.182 ms
Execution time: 0.069 ms
从上面的执行时间和代价估计可以看出,两种方法没有明显性能差别。
2. 查找工号最小的人
查找工号最小的人也有两种方法。
第一种方法是:
select * from tb_person where job_no = ( select min(job_no) from tb_person);
第二种方法是:
select * from tb_person order by job_no limit 1;
下面我们比较两种方法的查询计划:
查看第一种方法的查询计划:
explain analyze select * from tb_person where job_no = ( select min(job_no) from tb_person );
结果如下:
Index Scan using idx_tb_person_job_no on tb_person (cost=0.88..8.90 rows=1 width=61) (actual time=0.129..0.131 rows=1 loops=1)
Index Cond: ((job_no)::text = $1)
InitPlan 2 (returns $1)
-> Result (cost=0.45..0.46 rows=1 width=32) (actual time=0.084..0.085 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.42..0.45 rows=1 width=32) (actual time=0.077..0.078 rows=1 loops=1)
-> Index Only Scan using idx_tb_person_job_no on tb_person tb_person_1 (cost=0.42..8550.42 rows=300000 width=32) (actual time=0.072..0.072 rows=1 loops=1)
Index Cond: (job_no IS NOT NULL)
Heap Fetches: 0
Planning time: 0.255 ms
Execution time: 0.179 ms
查看第二种方法的查询计划:
explain analyze
select * from tb_person order by job_no limit 1;
结果如下:
Limit (cost=0.42..0.46 rows=1 width=61) (actual time=0.021..0.022 rows=1 loops=1)
-> Index Scan using idx_tb_person_job_no on tb_person (cost=0.42..11213.42 rows=300000 width=61) (actual time=0.018..0.018 rows=1 loops=1)
Planning time: 0.164 ms
Execution time: 0.054 ms
可以看出,第二种方法的性能优于第一种。第一种方法做了两次扫描,而第二种方法只做了一次。
3. 求每个部门工号最小的人的id
这里需要根据department_id分组。实现它也有两种方法。
第一种方法:
select person_id from tb_person where job_no in (
select min(job_no) from tb_person group by department_id
);
这种方法是根据 department_id 将人员数据分组,获取每个部门最小的job_no,再根据job_no 获取对应的person_id
第二种方法:
select (array_agg(person_id order by job_no))[1] from tb_person group by department_id
这种方法是根据department_id分组,将每个组的person_id 聚合成数组,聚合是按照job_no递增的顺序的,并取每个数组中第一个元素,即每个部门最小job_no对应的person_id
下面我们比较两种方法的性能。
查看第一种方法的查询计划:
explain analyze
select person_id from tb_person where job_no in (
select min(job_no) from tb_person group by department_id
);
结果如下:
Nested Loop (cost=6291.75..6468.99 rows=9 width=4) (actual time=394.590..394.990 rows=21 loops=1)
-> HashAggregate (cost=6291.33..6291.54 rows=21 width=32) (actual time=394.544..394.564 rows=21 loops=1)
Group Key: min((tb_person_1.job_no)::text)
-> Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=36) (actual time=394.320..394.512 rows=21 loops=1)
Group Key: tb_person_1.department_id
-> Sort (cost=6290.54..6290.65 rows=42 width=36) (actual time=394.303..394.361 rows=63 loops=1)
Sort Key: tb_person_1.department_id
Sort Method: quicksort Memory: 27kB
-> Gather (cost=6285.00..6289.41 rows=42 width=36) (actual time=393.681..394.231 rows=63 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=36) (actual time=305.755..305.774 rows=21 loops=3)
Group Key: tb_person_1.department_id
-> Parallel Seq Scan on tb_person tb_person_1 (cost=0.00..4660.00 rows=125000 width=11) (actual time=0.015..117.754 rows=100000 loops=3)
-> Index Scan using idx_tb_person_job_no on tb_person (cost=0.42..8.44 rows=1 width=11) (actual time=0.013..0.015 rows=1 loops=21)
Index Cond: ((job_no)::text = (min((tb_person_1.job_no)::text)))
Planning time: 0.351 ms
Execution time: 403.014 ms
查看第二种方法的查询计划:
explain analyze
select (array_agg(person_id order by job_no))[1] from tb_person group by department_id
结果如下:
GroupAggregate (cost=0.42..23822.21 rows=21 width=8) (actual time=67.788..2884.911 rows=21 loops=1)
Group Key: department_id
-> Index Scan using idx_tb_person_department_id on tb_person (cost=0.42..22321.95 rows=300000 width=15) (actual time=0.017..382.140 rows=300000 loops=1)
Planning time: 0.119 ms
Execution time: 2885.164 ms
可以看出,当分组字段的每个值对应的行数较多时(这里每个department_id对应100条数据),第二种方法的性能远远不如第一种。
4. 每个部门中抽选一人
这个问题也有两种实现。
第一种方法:
select min(person_id) , department_id from tb_person group by department_id
每个部门中抽选person_id最小的人。
第二种方法:
select distinct on (department_id) person_id,person_name,department_id from tb_person order by department_id
每个部门抽选一人,不关注person_id的字典序。
比较下列两种方法的性能
查看第一种方法的查询计划:
explain analyze
select min(person_id) , department_id from tb_person group by department_id
结果如下:
Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=8) (actual time=289.036..289.159 rows=21 loops=1)
Group Key: department_id
-> Sort (cost=6290.54..6290.65 rows=42 width=8) (actual time=289.025..289.079 rows=63 loops=1)
Sort Key: department_id
Sort Method: quicksort Memory: 27kB
-> Gather (cost=6285.00..6289.41 rows=42 width=8) (actual time=288.456..288.945 rows=63 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=8) (actual time=210.916..210.938 rows=21 loops=3)
Group Key: department_id
-> Parallel Seq Scan on tb_person (cost=0.00..4660.00 rows=125000 width=8) (actual time=0.013..107.747 rows=100000 loops=3)
Planning time: 0.149 ms
Execution time: 296.498 ms
查看第二种方法的查询计划:
explain analyze
select distinct on (department_id) person_id, department_id from tb_person order by department_id
结果如下:
Unique (cost=0.42..23071.95 rows=21 width=8) (actual time=0.021..676.080 rows=21 loops=1)
-> Index Scan using idx_tb_person_department_id on tb_person (cost=0.42..22321.95 rows=300000 width=8) (actual time=0.017..426.196 rows=300000 loops=1)
Planning time: 0.105 ms
Execution time: 676.146 ms
从上面的结果可以,看出第一种方法的性能好一些。
5. 查找每个部门中最年长的人
接下来查找每个部门中最年长的人。若两人出生日期,部门均相同,则任取一个,这个需求也有三种sql 实现。
第一种方法:
select min(person_id) from tb_person inner join (
select min(birthday) as birthday, department_id from tb_person group by department_id
) a using (birthday, department_id) GROUP BY department_id;
第二种方法:
select (array_agg(person_id order by birthday))[1] from tb_person group by department_id
第三种方法:
SELECT distinct on (a.department_id) person_id from tb_person inner join
(
select min(birthday) as birthday, department_id from tb_person group by department_id
) a using(birthday,department_id) order by a.department_id;
比较他们的性能:
查看第一种方法的查询计划:
explain analyze
select min(person_id) from tb_person inner join (
select min(birthday) as birthday, department_id from tb_person group by department_id
) a using (birthday, department_id) GROUP BY department_id;
结果如下:
GroupAggregate (cost=6295.27..9269.40 rows=17 width=8) (actual time=295.161..296.710 rows=21 loops=1)
Group Key: tb_person.department_id
-> Nested Loop (cost=6295.27..9269.14 rows=17 width=8) (actual time=295.067..296.616 rows=53 loops=1)
-> Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=8) (actual time=294.975..295.122 rows=21 loops=1)
Group Key: tb_person_1.department_id
-> Sort (cost=6290.54..6290.65 rows=42 width=8) (actual time=294.964..295.022 rows=63 loops=1)
Sort Key: tb_person_1.department_id
Sort Method: quicksort Memory: 27kB
-> Gather (cost=6285.00..6289.41 rows=42 width=8) (actual time=294.303..294.874 rows=63 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=8) (actual time=209.180..209.203 rows=21 loops=3)
Group Key: tb_person_1.department_id
-> Parallel Seq Scan on tb_person tb_person_1 (cost=0.00..4660.00 rows=125000 width=8) (actual time=0.015..106.866 rows=100000 loops=3)
-> Bitmap Heap Scan on tb_person (cost=4.73..141.78 rows=2 width=12) (actual time=0.031..0.056 rows=3 loops=21)
Recheck Cond: (birthday = (min(tb_person_1.birthday)))
Filter: (tb_person_1.department_id = department_id)
Rows Removed by Filter: 39
Heap Blocks: exact=859
-> Bitmap Index Scan on idx_tb_person_birthday (cost=0.00..4.73 rows=41 width=0) (actual time=0.011..0.011 rows=42 loops=21)
Index Cond: (birthday = (min(tb_person_1.birthday)))
Planning time: 0.303 ms
Execution time: 303.298 ms
查看第二种方法的查询计划:
explain analyze
select (array_agg(person_id order by birthday))[1] from tb_person group by department_id
结果如下:
GroupAggregate (cost=0.42..23822.21 rows=21 width=8) (actual time=27.898..1042.577 rows=21 loops=1)
Group Key: department_id
-> Index Scan using idx_tb_person_department_id on tb_person (cost=0.42..22321.95 rows=300000 width=12) (actual time=0.022..465.663 rows=300000 loops=1)
Planning time: 0.175 ms
Execution time: 1050.754 ms
查看第三种方法的查询计划:
explain analyze
SELECT distinct on (a.department_id) person_id from tb_person inner join
(
select min(birthday) as birthday, department_id from tb_person group by department_id
) a using(birthday,department_id) order by a.department_id;
结果如下:
Unique (cost=6295.27..9269.19 rows=17 width=8) (actual time=294.638..296.429 rows=21 loops=1)
-> Nested Loop (cost=6295.27..9269.14 rows=17 width=8) (actual time=294.634..296.344 rows=53 loops=1)
-> Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=8) (actual time=294.538..294.702 rows=21 loops=1)
Group Key: tb_person_1.department_id
-> Sort (cost=6290.54..6290.65 rows=42 width=8) (actual time=294.506..294.586 rows=63 loops=1)
Sort Key: tb_person_1.department_id
Sort Method: quicksort Memory: 27kB
-> Gather (cost=6285.00..6289.41 rows=42 width=8) (actual time=293.917..294.438 rows=63 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=8) (actual time=211.328..211.351 rows=21 loops=3)
Group Key: tb_person_1.department_id
-> Parallel Seq Scan on tb_person tb_person_1 (cost=0.00..4660.00 rows=125000 width=8) (actual time=0.015..106.619 rows=100000 loops=3)
-> Bitmap Heap Scan on tb_person (cost=4.73..141.78 rows=2 width=12) (actual time=0.036..0.062 rows=3 loops=21)
Recheck Cond: (birthday = (min(tb_person_1.birthday)))
Filter: (tb_person_1.department_id = department_id)
Rows Removed by Filter: 39
Heap Blocks: exact=859
-> Bitmap Index Scan on idx_tb_person_birthday (cost=0.00..4.73 rows=41 width=0) (actual time=0.013..0.013 rows=42 loops=21)
Index Cond: (birthday = (min(tb_person_1.birthday)))
Planning time: 0.298 ms
Execution time: 304.337 ms
不出所料的是第二种方法的性能还是最差的,因为生日在同一天的人很多。但令人惊讶的但第一种,第三种方法性能相差无几,他们的执行计划也几乎相同。尽管第三种方法虽然使用了 distinct on … order by。因此distinct on … order by语句的性能并不总是低于 group by语句。