Deep Pagination: Analysis and Solutions
Deep pagination means that when working with large data sets, traditional paged queries built on LIMIT and OFFSET can degrade sharply in performance as the page number grows.
1. Why Does Deep Pagination Get Slow?
With LIMIT/OFFSET pagination, as the OFFSET value grows the database must first scan and then discard all the unneeded rows in front of the requested page, dragging performance down. If the deep-pagination SQL also suffers from a failed index, sorting, a full table scan, or back-to-table lookups through a secondary index, the query becomes a performance abyss.
Scenario simulation:

- Create the task table

```sql
CREATE TABLE `task` (
  `id` bigint NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `batch_id` bigint DEFAULT NULL COMMENT 'Batch ID',
  `json_object` text COMMENT 'Content',
  PRIMARY KEY (`id`),
  KEY `idx_batch_id` (`batch_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Task record table';
```
- Stored procedure to simulate inserting 1,000,000 rows

```sql
DELIMITER $$
CREATE PROCEDURE insert_task_records()
BEGIN
    DECLARE i INT DEFAULT 1;
    DECLARE batch_size INT DEFAULT 1000; -- insert 1000 rows per batch
    DECLARE json_data TEXT;
    WHILE i <= 1000000 DO
        -- open a transaction
        START TRANSACTION;
        -- run the batched inserts
        SET @end = i + batch_size - 1;
        WHILE i <= @end DO
            SET json_data = CONCAT('{"key": "value_', i, '"}');
            INSERT INTO `task` (`batch_id`, `json_object`)
            VALUES (1830889785603571980, json_data);
            SET i = i + 1;
        END WHILE;
        -- commit the transaction
        COMMIT;
    END WHILE;
END$$
DELIMITER ;
```
- Execute the stored procedure

```sql
CALL insert_task_records();
```
- Query the first 10 rows; per the returned result, the SQL takes 0.01s

```sql
mysql> select * from task where batch_id = '1830889785603571980' limit 0, 10;
+----+---------------------+---------------------+
| id | batch_id            | json_object         |
+----+---------------------+---------------------+
|  1 | 1830889785603571980 | {"key": "value_1"}  |
|  2 | 1830889785603571980 | {"key": "value_2"}  |
|  3 | 1830889785603571980 | {"key": "value_3"}  |
|  4 | 1830889785603571980 | {"key": "value_4"}  |
|  5 | 1830889785603571980 | {"key": "value_5"}  |
|  6 | 1830889785603571980 | {"key": "value_6"}  |
|  7 | 1830889785603571980 | {"key": "value_7"}  |
|  8 | 1830889785603571980 | {"key": "value_8"}  |
|  9 | 1830889785603571980 | {"key": "value_9"}  |
| 10 | 1830889785603571980 | {"key": "value_10"} |
+----+---------------------+---------------------+
10 rows in set (0.01 sec)
```
- Query the last 10 rows; per the returned result, the SQL takes 1.99s

```sql
mysql> select * from task where batch_id = '1830889785603571980' limit 999990, 10;
+---------+---------------------+--------------------------+
| id      | batch_id            | json_object              |
+---------+---------------------+--------------------------+
|  999991 | 1830889785603571980 | {"key": "value_999991"}  |
|  999992 | 1830889785603571980 | {"key": "value_999992"}  |
|  999993 | 1830889785603571980 | {"key": "value_999993"}  |
|  999994 | 1830889785603571980 | {"key": "value_999994"}  |
|  999995 | 1830889785603571980 | {"key": "value_999995"}  |
|  999996 | 1830889785603571980 | {"key": "value_999996"}  |
|  999997 | 1830889785603571980 | {"key": "value_999997"}  |
|  999998 | 1830889785603571980 | {"key": "value_999998"}  |
|  999999 | 1830889785603571980 | {"key": "value_999999"}  |
| 1000000 | 1830889785603571980 | {"key": "value_1000000"} |
+---------+---------------------+--------------------------+
10 rows in set (1.99 sec)
```
That is nearly a 200x difference, and as the data volume grows, the slowdown caused by deep pagination only gets worse.
Analyzing the execution flow of the SQL above:

- Through the idx_batch_id secondary index, scan the non-clustered index tree; with LIMIT 999990, 10 the server must walk past the first 999,990 entries before collecting the 10 primary key IDs it actually returns
- Back-to-table: use the primary key IDs to fetch the selected columns from the clustered index
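The cost is visible in the execution plan. A quick sketch for comparing a shallow and a deep page side by side (the actual row estimates depend on your optimizer statistics, so no exact numbers are claimed here):

```sql
-- Compare the optimizer's view of a shallow vs. a deep page.
-- For the deep page, the `rows` estimate is expected to approach the full
-- set of matching index entries, because OFFSET rows are scanned and
-- discarded rather than skipped.
EXPLAIN SELECT * FROM task WHERE batch_id = '1830889785603571980' LIMIT 0, 10;
EXPLAIN SELECT * FROM task WHERE batch_id = '1830889785603571980' LIMIT 999990, 10;
```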
In this case the filter column has a secondary index; in real development, a query without one would only be slower. And the more rows scanned, the more back-to-table lookups may be triggered.
2. How to Optimize Deep Pagination?
Deep pagination is slow precisely because of the cost of a large OFFSET in traditional paging. General slow-SQL optimization techniques certainly apply here too (secondary indexes, covering indexes, and so on) and are all recommended; the focus below is on optimizations specific to deep pagination.
2.1 Subqueries and Deferred Joins
The subquery and the deferred join optimize deep pagination with the same core idea: fetch the primary keys first to cut the back-to-table cost. They are essentially the same technique.
- Optimizing deep pagination with a subquery

```sql
mysql> select * from task
    -> where id in (
    -> select id from task where
    -> batch_id = '1830889785603571980' limit 999990, 10);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
```
The intent is obvious, but MySQL does not support LIMIT in an IN subquery, so we can replace it with a JOIN instead, which is exactly the deferred-join version. (This case's SQL cannot use the plain subquery form, but the underlying idea is the same: let a subquery fetch the primary key IDs to reduce back-to-table lookups.)
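Before moving on to the JOIN form: the IN version can also be salvaged by wrapping the limited subquery in one more derived-table layer, which MySQL accepts. A sketch of that commonly used workaround:

```sql
-- MySQL rejects LIMIT directly inside IN (...), but an extra derived-table
-- layer is accepted; the innermost query still reads only primary keys from
-- the secondary index before the outer query goes back to the table.
SELECT * FROM task
WHERE id IN (
    SELECT id FROM (
        SELECT id FROM task
        WHERE batch_id = '1830889785603571980'
        LIMIT 999990, 10
    ) AS t
);
```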
- Optimizing deep pagination with a deferred join: query time drops to 0.05s

```sql
mysql> SELECT
    ->     *
    -> FROM
    ->     task
    ->     INNER JOIN (
    ->         SELECT id FROM task
    ->         WHERE batch_id = '1830889785603571980'
    ->         LIMIT 999990, 10
    ->     ) AS sub_task ON task.id = sub_task.id;
+---------+---------------------+--------------------------+---------+
| id      | batch_id            | json_object              | id      |
+---------+---------------------+--------------------------+---------+
|  999991 | 1830889785603571980 | {"key": "value_999991"}  |  999991 |
|  999992 | 1830889785603571980 | {"key": "value_999992"}  |  999992 |
|  999993 | 1830889785603571980 | {"key": "value_999993"}  |  999993 |
|  999994 | 1830889785603571980 | {"key": "value_999994"}  |  999994 |
|  999995 | 1830889785603571980 | {"key": "value_999995"}  |  999995 |
|  999996 | 1830889785603571980 | {"key": "value_999996"}  |  999996 |
|  999997 | 1830889785603571980 | {"key": "value_999997"}  |  999997 |
|  999998 | 1830889785603571980 | {"key": "value_999998"}  |  999998 |
|  999999 | 1830889785603571980 | {"key": "value_999999"}  |  999999 |
| 1000000 | 1830889785603571980 | {"key": "value_1000000"} | 1000000 |
+---------+---------------------+--------------------------+---------+
10 rows in set (0.05 sec)
```
Subqueries and deferred joins cut the back-to-table cost by fetching primary keys first, but the speedup is not always stable. The keyset (cursor) method below is the better recommendation.
2.2 Keyset Pagination (Tag Recording Method)
Instead of using OFFSET, keyset pagination records a tag from the previous page (usually a column value such as a timestamp, an ID, or another unique identifier). Each subsequent page queries only the rows beyond the current tag value, avoiding the need to skip over large numbers of records.

Core idea: use an ordered column (such as the primary key ID or a timestamp) to mark where the next page starts, so the bulk of the unneeded data is skipped outright and the OFFSET performance bottleneck disappears.
With the keyset-optimized SQL, the query time comes back down to 0.01s, crazy!
```sql
mysql> select * from task where batch_id = '1830889785603571980' and id > 999990 limit 10;
+---------+---------------------+--------------------------+
| id      | batch_id            | json_object              |
+---------+---------------------+--------------------------+
|  999991 | 1830889785603571980 | {"key": "value_999991"}  |
|  999992 | 1830889785603571980 | {"key": "value_999992"}  |
|  999993 | 1830889785603571980 | {"key": "value_999993"}  |
|  999994 | 1830889785603571980 | {"key": "value_999994"}  |
|  999995 | 1830889785603571980 | {"key": "value_999995"}  |
|  999996 | 1830889785603571980 | {"key": "value_999996"}  |
|  999997 | 1830889785603571980 | {"key": "value_999997"}  |
|  999998 | 1830889785603571980 | {"key": "value_999998"}  |
|  999999 | 1830889785603571980 | {"key": "value_999999"}  |
| 1000000 | 1830889785603571980 | {"key": "value_1000000"} |
+---------+---------------------+--------------------------+
10 rows in set (0.01 sec)
```
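In practice the tag lives in the application: the last id of each response is handed back with the next request. A sketch of two consecutive pages (an explicit ORDER BY id is added so the resume point is guaranteed by the query rather than implied by index order):

```sql
-- Page 1: no tag yet, take the first 10 rows in id order.
SELECT id, batch_id, json_object
FROM task
WHERE batch_id = '1830889785603571980'
ORDER BY id
LIMIT 10;

-- The application saves the largest id of that page (here it would be 10)
-- and sends it back as the tag for the next request.
SELECT id, batch_id, json_object
FROM task
WHERE batch_id = '1830889785603571980' AND id > 10
ORDER BY id
LIMIT 10;
```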
Keyset pagination delivers a stable speedup for deep pages and is the first choice. It does have a drawback, though: it cannot jump to an arbitrary page, only move to the previous or next one, and every "next page" click must carry the tag of the last row on the current page as the query's starting point in place of an OFFSET.
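One more practical wrinkle: when pages are ordered by a non-unique column such as a timestamp, the tag needs a tiebreaker, or rows sharing the same value can be skipped or repeated. A sketch using a row-value comparison, assuming a hypothetical create_time column with a composite index on (create_time, id); note that depending on the MySQL version the optimizer may handle the expanded AND/OR form better than the row constructor:

```sql
-- create_time is a hypothetical column used for illustration.
-- Resume strictly after the (create_time, id) pair of the previous page's
-- last row, so rows sharing the same create_time are handled correctly.
SELECT id, create_time, json_object
FROM task
WHERE (create_time, id) > ('2024-09-01 12:00:00', 999990)
ORDER BY create_time, id
LIMIT 10;
```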