PostgreSQL 中 min() 和 order by limit 1 的对比

最新推荐文章于 2024-07-15 06:52:45 发布

江天水一泓

最新推荐文章于 2024-07-15 06:52:45 发布

阅读量3.3k

点赞数

分类专栏： postgresql 文章标签：数据库 postgresql

本文链接：https://blog.youkuaiyun.com/international24/article/details/107042720

版权

postgresql 专栏收录该内容

25 篇文章

订阅专栏

在实际应用中，我们经常需要计算数据库中某张表的某个字段的的最小值/最大值，并用它来作为查询条件。

使用SQL求数据库中表字段的最小值/最大值，通常由两种方法：

一种是使用聚集函数 min()/max(), 另一种是使用order BY + “字段” 子句，对该表排序之后，取第一条记录中该字段的值。

本文在PostgreSQL 11.4上，对min() 和 order by … limit 1 以及它们的变体做一个全方位的对比。

现在有一张表tb_person，它的定义如下：

create table tb_person (

   person_id integer not null,

    person_name varchar(64),

    sex int,

    birthday date,

    phone varchar(64),

    job_no varchar(48),

    department_id int,

    create_time timestamp with time zone,

    update_time timestamp with time zone,

    constraint pk_tb_person primary key (person_id)

);

comment on table tb_person is '人员表';

comment on column tb_person.person_id is '主键id';

comment on column tb_person.person_name is '人员姓名';

comment on column tb_person.sex is '性别。0:男性；1：女性';

comment on column tb_person.birthday is '出生日期';

comment on column tb_person.phone is '联系电话';

comment on column tb_person.job_no is '员工工号';

comment on column tb_person.department_id is '所属部门id';

comment on column tb_person.create_time is '创建时间';

comment on column tb_person.update_time is '更新时间';

create index idx_tb_person_job_no on tb_person (job_no);

create index idx_tb_person_department_id on tb_person (department_id);

create index idx_tb_person_birthday on tb_person (birthday);

我们通过下面的sql 向表中插入300000 行数据：

insert into tb_person (person_id, person_name, sex, birthday, phone, job_no, department_id, create_time, update_time)

select num, substr('abcdefghijklmnopqrstuvwxyz', mod(num,26), 10), ceil(random()*2), '1970-01-01'::date + (interval '1 hour') * num, null,

to_char(num, 'FM000000'), ceil(random()*3000), now(), now() from

(select generate_series(1,300000) as num) a;

接下来我们通过一些查询案例来对比。

1. 查找人员表中最小的工号

获得人员表中最小的工号，有两种方法：

select min(job_no) from tb_person;

和

select job_no from tb_person order by job_no limit 1;

现在我们用 explain analyze，查看两种方法的查询计划，比较它们的性能。我们主要关注cost 和 execution time 两个指标。

查看第一种方法的查询计划：

explain analyze

select min(job_no) from tb_person;

结果如下：

Result (cost=0.45..0.46 rows=1 width=32) (actual time=0.029..0.030 rows=1 loops=1)

InitPlan 1 (returns $0)

    -> Limit (cost=0.42..0.45 rows=1 width=32) (actual time=0.023..0.024 rows=1 loops=1)

          -> Index Only Scan using idx_tb_person_job_no on tb_person (cost=0.42..8550.42 rows=300000 width=32) (actual time=0.020..0.020 rows=1 loops=1)

                Index Cond: (job_no IS NOT NULL)

                Heap Fetches: 0

Planning time: 0.161 ms

Execution time: 0.060 ms

查看第二种方法的查询计划：

explain analyze

select job_no from tb_person order by job_no limit 1;

结果如下：

Limit (cost=0.42..0.46 rows=1 width=61) (actual time=0.032..0.033 rows=1 loops=1)

-> Index Scan using idx_tb_person_job_no on tb_person (cost=0.42..11213.42 rows=300000 width=61) (actual time=0.028..0.028 rows=1 loops=1)

Planning time: 0.182 ms

Execution time: 0.069 ms

从上面的执行时间和代价估计可以看出，两种方法没有明显性能差别。

2. 查找工号最小的人

查找工号最小的人也有两种方法。

第一种方法是：

select * from tb_person where job_no = ( select min(job_no) from tb_person);

第二种方法是：

select * from tb_person order by job_no limit 1;

下面我们比较两种方法的查询计划：

查看第一种方法的查询计划：

explain analyze select * from tb_person where job_no = ( select min(job_no) from tb_person );

结果如下：

Index Scan using idx_tb_person_job_no on tb_person (cost=0.88..8.90 rows=1 width=61) (actual time=0.129..0.131 rows=1 loops=1)

Index Cond: ((job_no)::text = $1)

InitPlan 2 (returns $1)

    -> Result (cost=0.45..0.46 rows=1 width=32) (actual time=0.084..0.085 rows=1 loops=1)

          InitPlan 1 (returns $0)

            -> Limit (cost=0.42..0.45 rows=1 width=32) (actual time=0.077..0.078 rows=1 loops=1)

                  -> Index Only Scan using idx_tb_person_job_no on tb_person tb_person_1 (cost=0.42..8550.42 rows=300000 width=32) (actual time=0.072..0.072 rows=1 loops=1)

                        Index Cond: (job_no IS NOT NULL)

                        Heap Fetches: 0

Planning time: 0.255 ms

Execution time: 0.179 ms

查看第二种方法的查询计划：

explain analyze

select * from tb_person order by job_no limit 1;

结果如下：

Limit (cost=0.42..0.46 rows=1 width=61) (actual time=0.021..0.022 rows=1 loops=1)

-> Index Scan using idx_tb_person_job_no on tb_person (cost=0.42..11213.42 rows=300000 width=61) (actual time=0.018..0.018 rows=1 loops=1)

Planning time: 0.164 ms

Execution time: 0.054 ms

可以看出，第二种方法的性能优于第一种。第一种方法做了两次扫描，而第二种方法只做了一次。

3. 求每个部门工号最小的人的id

这里需要根据department_id分组。实现它也有两种方法。

第一种方法：

select person_id from tb_person where job_no in (

select min(job_no) from tb_person group by department_id

);

这种方法是根据 department_id 将人员数据分组，获取每个部门最小的job_no，再根据job_no 获取对应的person_id

第二种方法：

select (array_agg(person_id order by job_no))[1] from tb_person group by department_id

这种方法是根据department_id分组，将每个组的person_id 聚合成数组，聚合是按照job_no递增的顺序的，并取每个数组中第一个元素，即每个部门最小job_no对应的person_id

下面我们比较两种方法的性能。

查看第一种方法的查询计划：

explain analyze

select person_id from tb_person where job_no in (

select min(job_no) from tb_person group by department_id

);

结果如下：

Nested Loop (cost=6291.75..6468.99 rows=9 width=4) (actual time=394.590..394.990 rows=21 loops=1)

-> HashAggregate (cost=6291.33..6291.54 rows=21 width=32) (actual time=394.544..394.564 rows=21 loops=1)

        Group Key: min((tb_person_1.job_no)::text)

        -> Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=36) (actual time=394.320..394.512 rows=21 loops=1)

              Group Key: tb_person_1.department_id

              -> Sort (cost=6290.54..6290.65 rows=42 width=36) (actual time=394.303..394.361 rows=63 loops=1)

                    Sort Key: tb_person_1.department_id

                    Sort Method: quicksort Memory: 27kB

                    -> Gather (cost=6285.00..6289.41 rows=42 width=36) (actual time=393.681..394.231 rows=63 loops=1)

                          Workers Planned: 2

                          Workers Launched: 2

                          -> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=36) (actual time=305.755..305.774 rows=21 loops=3)

                                Group Key: tb_person_1.department_id

                                -> Parallel Seq Scan on tb_person tb_person_1 (cost=0.00..4660.00 rows=125000 width=11) (actual time=0.015..117.754 rows=100000 loops=3)

-> Index Scan using idx_tb_person_job_no on tb_person (cost=0.42..8.44 rows=1 width=11) (actual time=0.013..0.015 rows=1 loops=21)

        Index Cond: ((job_no)::text = (min((tb_person_1.job_no)::text)))

Planning time: 0.351 ms

Execution time: 403.014 ms

查看第二种方法的查询计划：

explain analyze

select (array_agg(person_id order by job_no))[1] from tb_person group by department_id

结果如下：

GroupAggregate (cost=0.42..23822.21 rows=21 width=8) (actual time=67.788..2884.911 rows=21 loops=1)

Group Key: department_id

-> Index Scan using idx_tb_person_department_id on tb_person (cost=0.42..22321.95 rows=300000 width=15) (actual time=0.017..382.140 rows=300000 loops=1)

Planning time: 0.119 ms

Execution time: 2885.164 ms

可以看出，当分组字段的每个值对应的行数较多时（这里每个department_id对应100条数据），第二种方法的性能远远不如第一种。

4. 每个部门中抽选一人

这个问题也有两种实现。

第一种方法：

select min(person_id) , department_id from tb_person group by department_id

每个部门中抽选person_id最小的人。

第二种方法：

select distinct on (department_id) person_id,person_name,department_id from tb_person order by department_id

每个部门抽选一人，不关注person_id的字典序。

比较下列两种方法的性能

查看第一种方法的查询计划：

explain analyze

select min(person_id) , department_id from tb_person group by department_id

结果如下：

Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=8) (actual time=289.036..289.159 rows=21 loops=1)

Group Key: department_id

-> Sort (cost=6290.54..6290.65 rows=42 width=8) (actual time=289.025..289.079 rows=63 loops=1)

        Sort Key: department_id

        Sort Method: quicksort Memory: 27kB

        -> Gather (cost=6285.00..6289.41 rows=42 width=8) (actual time=288.456..288.945 rows=63 loops=1)

              Workers Planned: 2

              Workers Launched: 2

              -> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=8) (actual time=210.916..210.938 rows=21 loops=3)

                    Group Key: department_id

                    -> Parallel Seq Scan on tb_person (cost=0.00..4660.00 rows=125000 width=8) (actual time=0.013..107.747 rows=100000 loops=3)

Planning time: 0.149 ms

Execution time: 296.498 ms

查看第二种方法的查询计划：

explain analyze

select distinct on (department_id) person_id, department_id from tb_person order by department_id

结果如下：

Unique (cost=0.42..23071.95 rows=21 width=8) (actual time=0.021..676.080 rows=21 loops=1)

-> Index Scan using idx_tb_person_department_id on tb_person (cost=0.42..22321.95 rows=300000 width=8) (actual time=0.017..426.196 rows=300000 loops=1)

Planning time: 0.105 ms

Execution time: 676.146 ms

从上面的结果可以，看出第一种方法的性能好一些。

5. 查找每个部门中最年长的人

接下来查找每个部门中最年长的人。若两人出生日期，部门均相同，则任取一个，这个需求也有三种sql 实现。

第一种方法：

select min(person_id) from tb_person inner join (

select min(birthday) as birthday, department_id from tb_person group by department_id

) a using (birthday, department_id) GROUP BY department_id;

第二种方法：

select (array_agg(person_id order by birthday))[1] from tb_person group by department_id

第三种方法：

SELECT distinct on (a.department_id) person_id from tb_person inner join

(

select min(birthday) as birthday, department_id from tb_person group by department_id

) a using(birthday,department_id) order by a.department_id;

比较他们的性能：

查看第一种方法的查询计划：

explain analyze

select min(person_id) from tb_person inner join (

select min(birthday) as birthday, department_id from tb_person group by department_id

) a using (birthday, department_id) GROUP BY department_id;

结果如下：

GroupAggregate (cost=6295.27..9269.40 rows=17 width=8) (actual time=295.161..296.710 rows=21 loops=1)

Group Key: tb_person.department_id

-> Nested Loop (cost=6295.27..9269.14 rows=17 width=8) (actual time=295.067..296.616 rows=53 loops=1)

        -> Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=8) (actual time=294.975..295.122 rows=21 loops=1)

              Group Key: tb_person_1.department_id

              -> Sort (cost=6290.54..6290.65 rows=42 width=8) (actual time=294.964..295.022 rows=63 loops=1)

                    Sort Key: tb_person_1.department_id

                    Sort Method: quicksort Memory: 27kB

                    -> Gather (cost=6285.00..6289.41 rows=42 width=8) (actual time=294.303..294.874 rows=63 loops=1)

                          Workers Planned: 2

                          Workers Launched: 2

                          -> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=8) (actual time=209.180..209.203 rows=21 loops=3)

                                Group Key: tb_person_1.department_id

                                -> Parallel Seq Scan on tb_person tb_person_1 (cost=0.00..4660.00 rows=125000 width=8) (actual time=0.015..106.866 rows=100000 loops=3)

        -> Bitmap Heap Scan on tb_person (cost=4.73..141.78 rows=2 width=12) (actual time=0.031..0.056 rows=3 loops=21)

              Recheck Cond: (birthday = (min(tb_person_1.birthday)))

              Filter: (tb_person_1.department_id = department_id)

              Rows Removed by Filter: 39

              Heap Blocks: exact=859

              -> Bitmap Index Scan on idx_tb_person_birthday (cost=0.00..4.73 rows=41 width=0) (actual time=0.011..0.011 rows=42 loops=21)

                    Index Cond: (birthday = (min(tb_person_1.birthday)))

Planning time: 0.303 ms

Execution time: 303.298 ms

查看第二种方法的查询计划：

explain analyze

select (array_agg(person_id order by birthday))[1] from tb_person group by department_id

结果如下：

GroupAggregate (cost=0.42..23822.21 rows=21 width=8) (actual time=27.898..1042.577 rows=21 loops=1)

Group Key: department_id

-> Index Scan using idx_tb_person_department_id on tb_person (cost=0.42..22321.95 rows=300000 width=12) (actual time=0.022..465.663 rows=300000 loops=1)

Planning time: 0.175 ms

Execution time: 1050.754 ms

查看第三种方法的查询计划：

explain analyze

SELECT distinct on (a.department_id) person_id from tb_person inner join

(

select min(birthday) as birthday, department_id from tb_person group by department_id

) a using(birthday,department_id) order by a.department_id;

结果如下：

Unique (cost=6295.27..9269.19 rows=17 width=8) (actual time=294.638..296.429 rows=21 loops=1)

-> Nested Loop (cost=6295.27..9269.14 rows=17 width=8) (actual time=294.634..296.344 rows=53 loops=1)

        -> Finalize GroupAggregate (cost=6290.54..6291.07 rows=21 width=8) (actual time=294.538..294.702 rows=21 loops=1)

              Group Key: tb_person_1.department_id

              -> Sort (cost=6290.54..6290.65 rows=42 width=8) (actual time=294.506..294.586 rows=63 loops=1)

                   Sort Key: tb_person_1.department_id

                    Sort Method: quicksort Memory: 27kB

                    -> Gather (cost=6285.00..6289.41 rows=42 width=8) (actual time=293.917..294.438 rows=63 loops=1)

                          Workers Planned: 2

                          Workers Launched: 2

                          -> Partial HashAggregate (cost=5285.00..5285.21 rows=21 width=8) (actual time=211.328..211.351 rows=21 loops=3)

                                Group Key: tb_person_1.department_id

                                -> Parallel Seq Scan on tb_person tb_person_1 (cost=0.00..4660.00 rows=125000 width=8) (actual time=0.015..106.619 rows=100000 loops=3)

        -> Bitmap Heap Scan on tb_person (cost=4.73..141.78 rows=2 width=12) (actual time=0.036..0.062 rows=3 loops=21)

              Recheck Cond: (birthday = (min(tb_person_1.birthday)))

              Filter: (tb_person_1.department_id = department_id)

              Rows Removed by Filter: 39

              Heap Blocks: exact=859

              -> Bitmap Index Scan on idx_tb_person_birthday (cost=0.00..4.73 rows=41 width=0) (actual time=0.013..0.013 rows=42 loops=21)

                    Index Cond: (birthday = (min(tb_person_1.birthday)))

Planning time: 0.298 ms

Execution time: 304.337 ms

不出所料的是第二种方法的性能还是最差的，因为生日在同一天的人很多。但令人惊讶的但第一种，第三种方法性能相差无几，他们的执行计划也几乎相同。尽管第三种方法虽然使用了 distinct on … order by。因此distinct on … order by语句的性能并不总是低于 group by语句。