Learning along with the official documentation:
https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html
Window functions
Usage details
| Name | Description |
|---|---|
| CUME_DIST() | Cumulative distribution value |
| DENSE_RANK() | Rank of current row within its partition, without gaps |
| FIRST_VALUE() | Value of argument from first row of window frame |
| LAG() | Value of argument from row lagging current row within partition |
| LAST_VALUE() | Value of argument from last row of window frame |
| LEAD() | Value of argument from row leading current row within partition |
| NTH_VALUE() | Value of argument from N-th row of window frame |
| NTILE() | Bucket number of current row within its partition |
| PERCENT_RANK() | Percentage rank value |
| RANK() | Rank of current row within its partition, with gaps |
| ROW_NUMBER() | Number of current row within its partition |
ROW_NUMBER / RANK / DENSE_RANK
Generate sample test data
drop table if exists numbers;
create table numbers(
`val` int
);
insert into numbers values (1),(1),(2),(3),(3),(3),(4),(4),(5);
Execution result:
mysql> SELECT
val,
ROW_NUMBER() OVER w AS 'row_number',
RANK() OVER w AS 'rank',
DENSE_RANK() OVER w AS 'dense_rank'
FROM numbers
WINDOW w AS (ORDER BY val);
+-----+------------+------+------------+
| val | row_number | rank | dense_rank |
+-----+------------+------+------------+
| 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 1 |
| 2 | 3 | 3 | 2 |
| 3 | 4 | 4 | 3 |
| 3 | 5 | 4 | 3 |
| 3 | 6 | 4 | 3 |
| 4 | 7 | 7 | 4 |
| 4 | 8 | 7 | 4 |
| 5 | 9 | 9 | 5 |
+-----+------------+------+------------+
9 rows in set
############################
ROW_NUMBER() over_clause
RANK() over_clause
DENSE_RANK() over_clause
ROW_NUMBER(): the row number of the current row within its partition.
RANK(): the rank of the current row within its partition, with gaps. Peers are considered ties and receive the same rank; when a peer group contains more than one row, the resulting rank numbers are noncontiguous.
DENSE_RANK(): the rank of the current row within its partition, without gaps. Peers are considered ties and receive the same rank; peer groups are assigned consecutive ranks, so the numbering never skips.
############################
CUME_DIST / PERCENT_RANK
Generate sample test data
drop table if exists numbers;
create table numbers(
val int
);
insert into numbers values (1),(1),(2),(3),(3),(3),(4),(4),(5);
Execution result:
mysql> SELECT
val,
ROW_NUMBER() OVER w AS 'row_number',
CUME_DIST() OVER w AS 'cume_dist',
PERCENT_RANK() OVER w AS 'percent_rank'
FROM numbers
WINDOW w AS (ORDER BY val);
+-----+------------+--------------------+--------------+
| val | row_number | cume_dist | percent_rank |
+-----+------------+--------------------+--------------+
| 1 | 1 | 0.2222222222222222 | 0 |
| 1 | 2 | 0.2222222222222222 | 0 |
| 2 | 3 | 0.3333333333333333 | 0.25 |
| 3 | 4 | 0.6666666666666666 | 0.375 |
| 3 | 5 | 0.6666666666666666 | 0.375 |
| 3 | 6 | 0.6666666666666666 | 0.375 |
| 4 | 7 | 0.8888888888888888 | 0.75 |
| 4 | 8 | 0.8888888888888888 | 0.75 |
| 5 | 9 | 1 | 1 |
+-----+------------+--------------------+--------------+
9 rows in set
############################
CUME_DIST() is described as: Cumulative distribution value.
Computation: (highest row number within the current row's peer group) / (total rows).
What is a cumulative distribution value? It is the cumulative share of the current value within the whole set.
Looking at the execution result:
val=1 occurs twice; the highest row number in its peer group is 2 and the total is 9, so CUME_DIST = 2/9.
val=2 occurs once; the highest row number in its peer group is 3 and the total is 9, so CUME_DIST = 3/9.
val=3 occurs three times; the highest row number in its peer group is 6 and the total is 9, so CUME_DIST = 6/9.
......
So the result is the cumulative distribution of a value within a group of values, i.e. the percentage of partition values less than or equal to the value in the current row.
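The formula can be cross-checked by hand. A sketch against the same numbers table, comparing CUME_DIST() with an explicit "share of rows whose val is less than or equal to the current val" (the n1/n2 aliases are just illustrative):

```sql
SELECT
  val,
  CUME_DIST() OVER (ORDER BY val) AS cume_dist,
  -- manual computation: rows with val <= current val, divided by total rows
  (SELECT COUNT(*) FROM numbers n2 WHERE n2.val <= n1.val)
    / (SELECT COUNT(*) FROM numbers) AS manual_dist
FROM numbers n1
ORDER BY val;
```

The two computed columns should agree row for row.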
PERCENT_RANK() is described as: Percentage rank value.
Computation:
(rank - 1) / (rows - 1)
i.e. (lowest row number within the peer group - 1) / (total rows - 1).
Looking at the execution result:
val=1 occurs twice; the lowest row number in its peer group is 1 and the total is 9, so PERCENT_RANK = 0/8 = 0.
val=2 occurs once; the lowest row number in its peer group is 3, so PERCENT_RANK = 2/8 = 0.25.
val=3 occurs three times; the lowest row number in its peer group is 4, so PERCENT_RANK = 3/8 = 0.375.
......
So the result is the percentage of partition values less than the value in the current row, excluding the highest value.
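Likewise, PERCENT_RANK() can be reproduced from RANK() and the row count; a sketch using a window aggregate COUNT(*) OVER () for the total:

```sql
SELECT
  val,
  PERCENT_RANK() OVER w AS percent_rank,
  -- manual computation: (rank - 1) / (total rows - 1)
  (RANK() OVER w - 1) / (COUNT(*) OVER () - 1) AS manual_rank
FROM numbers
WINDOW w AS (ORDER BY val);
```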
############################
FIRST_VALUE / LAST_VALUE / NTH_VALUE
FIRST_VALUE(expr) [null_treatment] over_clause
Generate sample test data
drop table if exists observations;
create table observations(
`time` time, `subject` varchar(10), `val` int
);
insert into observations values
('07:00:00','st113',10)
,('07:15:00','st113', 9)
,('07:30:00','st113',25)
,('07:45:00','st113',20)
,('07:00:00','xh458', 0)
,('07:15:00','xh458',10)
,('07:30:00','xh458', 5)
,('07:45:00','xh458',30)
,('08:00:00','xh458',25)
;
Execution result:
mysql> select * from observations;
+----------+---------+-----+
| time | subject | val |
+----------+---------+-----+
| 07:00:00 | st113 | 10 |
| 07:15:00 | st113 | 9 |
| 07:30:00 | st113 | 25 |
| 07:45:00 | st113 | 20 |
| 07:00:00 | xh458 | 0 |
| 07:15:00 | xh458 | 10 |
| 07:30:00 | xh458 | 5 |
| 07:45:00 | xh458 | 30 |
| 08:00:00 | xh458 | 25 |
+----------+---------+-----+
9 rows in set
mysql> SELECT
time, subject, val,
FIRST_VALUE(val) OVER w AS 'first',
LAST_VALUE(val) OVER w AS 'last',
NTH_VALUE(val, 2) OVER w AS 'second',
NTH_VALUE(val, 4) OVER w AS 'fourth'
FROM observations
WINDOW w AS (PARTITION BY subject ORDER BY time
ROWS UNBOUNDED PRECEDING);
+----------+---------+-----+-------+------+--------+--------+
| time | subject | val | first | last | second | fourth |
+----------+---------+-----+-------+------+--------+--------+
| 07:00:00 | st113 | 10 | 10 | 10 | NULL | NULL |
| 07:15:00 | st113 | 9 | 10 | 9 | 9 | NULL |
| 07:30:00 | st113 | 25 | 10 | 25 | 9 | NULL |
| 07:45:00 | st113 | 20 | 10 | 20 | 9 | 20 |
| 07:00:00 | xh458 | 0 | 0 | 0 | NULL | NULL |
| 07:15:00 | xh458 | 10 | 0 | 10 | 10 | NULL |
| 07:30:00 | xh458 | 5 | 0 | 5 | 10 | NULL |
| 07:45:00 | xh458 | 30 | 0 | 30 | 10 | 30 |
| 08:00:00 | xh458 | 25 | 0 | 25 | 10 | 30 |
+----------+---------+-----+-------+------+--------+--------+
9 rows in set
############################
FIRST_VALUE() is described as: Value of argument from first row of window frame.
LAST_VALUE() is described as: Value of argument from last row of window frame.
NTH_VALUE() is described as: Value of argument from N-th row of window frame.
Look at the window: PARTITION BY subject ORDER BY time ROWS UNBOUNDED PRECEDING. Rows are partitioned by subject and ordered by time, and the frame runs from the start of the partition up to the current row.
That is why, in the result, 'last' always equals the current row's val: the frame never extends beyond the current row.
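To see the frame's effect, compare with an explicit frame spanning the whole partition; with it, LAST_VALUE() returns the partition's final value rather than the current row's (a sketch on the same observations table):

```sql
SELECT
  time, subject, val,
  -- the frame covers the entire partition, so 'last' is the partition's last val
  LAST_VALUE(val) OVER w AS last_in_partition
FROM observations
WINDOW w AS (PARTITION BY subject ORDER BY time
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
```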
############################
LAG / LEAD
LAG(expr [, N[, default]]) [null_treatment] over_clause
Generate sample test data
drop table if exists series;
create table series(
`t` time, `val` int
);
insert into series values
('12:00:00',100)
,('13:00:00',125)
,('14:00:00',132)
,('15:00:00',145)
,('16:00:00',140)
,('17:00:00',150)
,('18:00:00',200)
;
Execution result:
mysql> select * from series;
+----------+-----+
| t | val |
+----------+-----+
| 12:00:00 | 100 |
| 13:00:00 | 125 |
| 14:00:00 | 132 |
| 15:00:00 | 145 |
| 16:00:00 | 140 |
| 17:00:00 | 150 |
| 18:00:00 | 200 |
+----------+-----+
7 rows in set
mysql> SELECT
t, val,
LAG(val) OVER w AS 'lag',
LEAD(val) OVER w AS 'lead',
val - LAG(val) OVER w AS 'lag diff',
val - LEAD(val) OVER w AS 'lead diff'
FROM series
WINDOW w AS (ORDER BY t);
+----------+-----+------+------+----------+-----------+
| t | val | lag | lead | lag diff | lead diff |
+----------+-----+------+------+----------+-----------+
| 12:00:00 | 100 | NULL | 125 | NULL | -25 |
| 13:00:00 | 125 | 100 | 132 | 25 | -7 |
| 14:00:00 | 132 | 125 | 145 | 7 | -13 |
| 15:00:00 | 145 | 132 | 140 | 13 | 5 |
| 16:00:00 | 140 | 145 | 150 | -5 | -10 |
| 17:00:00 | 150 | 140 | 200 | 10 | -50 |
| 18:00:00 | 200 | 150 | NULL | 50 | NULL |
+----------+-----+------+------+----------+-----------+
7 rows in set
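The next example queries a fib table that is not created above; a minimal setup inferred from the result shown below would be:

```sql
drop table if exists fib;
create table fib(
`n` int
);
insert into fib values (1),(1),(2),(3),(5),(8);
```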
mysql> SELECT n FROM fib ORDER BY n;
+------+
| n |
+------+
| 1 |
| 1 |
| 2 |
| 3 |
| 5 |
| 8 |
+------+
mysql> SELECT
n,
LAG(n, 1, 0) OVER w AS 'lag',
LEAD(n, 1, 0) OVER w AS 'lead',
n + LAG(n, 1, 0) OVER w AS 'next_n',
n + LEAD(n, 1, 0) OVER w AS 'next_next_n'
FROM fib
WINDOW w AS (ORDER BY n);
+------+------+------+--------+-------------+
| n | lag | lead | next_n | next_next_n |
+------+------+------+--------+-------------+
| 1 | 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 | 3 |
| 2 | 1 | 3 | 3 | 5 |
| 3 | 2 | 5 | 5 | 8 |
| 5 | 3 | 8 | 8 | 13 |
| 8 | 5 | 0 | 13 | 8 |
+------+------+------+--------+-------------+
############################
LAG() is described as: Value of argument from row lagging current row within partition; symmetrically, LEAD() takes the value from the row leading the current row within the partition. When the offset falls outside the partition, the supplied default (or NULL) is returned, as the NULL and 0 entries above show.
############################
NTILE
NTILE(N) over_clause
Generate sample test data
drop table if exists numbers;
create table numbers(
`val` int
);
insert into numbers values (1),(1),(2),(3),(3),(3),(4),(4),(5);
Execution result:
mysql> SELECT
val,
ROW_NUMBER() OVER w AS 'row_number',
NTILE(2) OVER w AS 'ntile2',
NTILE(4) OVER w AS 'ntile4'
FROM numbers
WINDOW w AS (ORDER BY val);
+------+------------+--------+--------+
| val | row_number | ntile2 | ntile4 |
+------+------------+--------+--------+
| 1 | 1 | 1 | 1 |
| 1 | 2 | 1 | 1 |
| 2 | 3 | 1 | 1 |
| 3 | 4 | 1 | 2 |
| 3 | 5 | 1 | 2 |
| 3 | 6 | 2 | 3 |
| 4 | 7 | 2 | 3 |
| 4 | 8 | 2 | 4 |
| 5 | 9 | 2 | 4 |
+------+------------+--------+--------+
############################
Some new features in MySQL 8.0
Test case
Simulate a simple order table whose columns are order id, user number, amount, and creation date:
drop table if exists order_info;
create table order_info(
order_id int primary key,
user_no varchar(10),
amount int,
create_date date
);
insert into order_info values (1,'u0001',100,'2018-1-1');
insert into order_info values (2,'u0001',300,'2018-1-2');
insert into order_info values (3,'u0001',300,'2018-1-2');
insert into order_info values (4,'u0001',800,'2018-1-10');
insert into order_info values (5,'u0001',900,'2018-1-20');
insert into order_info values (6,'u0002',500,'2018-1-5');
insert into order_info values (7,'u0002',600,'2018-1-6');
insert into order_info values (8,'u0002',300,'2018-1-10');
insert into order_info values (9,'u0002',800,'2018-1-16');
insert into order_info values (10,'u0002',800,'2018-1-22');
Initial data:
+----------+---------+--------+-------------+
| order_id | user_no | amount | create_date |
+----------+---------+--------+-------------+
| 1 | u0001 | 100 | 2018-01-01 |
| 2 | u0001 | 300 | 2018-01-02 |
| 3 | u0001 | 300 | 2018-01-02 |
| 4 | u0001 | 800 | 2018-01-10 |
| 5 | u0001 | 900 | 2018-01-20 |
| 6 | u0002 | 500 | 2018-01-05 |
| 7 | u0002 | 600 | 2018-01-06 |
| 8 | u0002 | 300 | 2018-01-10 |
| 9 | u0002 | 800 | 2018-01-16 |
| 10 | u0002 | 800 | 2018-01-22 |
+----------+---------+--------+-------------+
10 rows in set
Requirement: write a query that returns each user's most recent order.
The traditional approach:
SELECT * FROM (
SELECT
IF(@y=a.user_no, @x:=@x+1, @x:=1) X,
IF(@y=a.user_no, @y, @y:=a.user_no) Y,
a.*
FROM order_info a, (SELECT @x:=0, @y:=NULL) b
ORDER BY a.user_no, a.create_date desc
) a
WHERE X <= 1;
Here is the execution result, which does satisfy the requirement (note, though, that assigning user variables inside expressions, as done above, is deprecated in MySQL 8.0):
+---+-------+----------+---------+--------+-------------+
| X | Y | order_id | user_no | amount | create_date |
+---+-------+----------+---------+--------+-------------+
| 1 | u0001 | 5 | u0001 | 900 | 2018-01-20 |
| 1 | u0002 | 10 | u0002 | 800 | 2018-01-22 |
+---+-------+----------+---------+--------+-------------+
2 rows in set
With the new window-function approach, row_number() over WINDOW as row_num numbers the original rows (where the WINDOW is partition by user_no order by create_date desc); taking the rows numbered 1 then naturally yields "each user's most recent order". The logic is much clearer and the code far more concise and readable.
select * from (
select
row_number() OVER w as row_num,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (partition by user_no order by create_date desc)
) t
where row_num=1;
Result:
+---------+----------+---------+--------+-------------+
| row_num | order_id | user_no | amount | create_date |
+---------+----------+---------+--------+-------------+
| 1 | 5 | u0001 | 900 | 2018-01-20 |
| 1 | 10 | u0002 | 800 | 2018-01-22 |
+---------+----------+---------+--------+-------------+
2 rows in set
Now the detailed introduction of each window function begins.
Ranking functions
row_number()
Numbering under an (optional) grouping: as in the example above, partition by user_no order by create_date desc groups rows by user, orders them by create_date, and assigns each existing row a sequence number.
You can also number the whole result set without grouping.
(Every window function can be computed per group or over the whole set; the partition by ... part is optional, depending on the requirement.)
select
row_number() OVER w as row_num,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (order by create_date desc)
;
Result:
+---------+----------+---------+--------+-------------+
| row_num | order_id | user_no | amount | create_date |
+---------+----------+---------+--------+-------------+
| 1 | 10 | u0002 | 800 | 2018-01-22 |
| 2 | 5 | u0001 | 900 | 2018-01-20 |
| 3 | 9 | u0002 | 800 | 2018-01-16 |
| 4 | 4 | u0001 | 800 | 2018-01-10 |
| 5 | 8 | u0002 | 300 | 2018-01-10 |
| 6 | 7 | u0002 | 600 | 2018-01-06 |
| 7 | 6 | u0002 | 500 | 2018-01-05 |
| 8 | 2 | u0001 | 300 | 2018-01-02 |
| 9 | 3 | u0001 | 300 | 2018-01-02 |
| 10 | 1 | u0001 | 100 | 2018-01-01 |
+---------+----------+---------+--------+-------------+
10 rows in set
rank()
Similar to row_number(), rank() also produces an ordering, but what does it do differently? New features usually appear to solve a real problem.
Insert one more row into the test table:
insert into order_info values (11,'u0002',800,'2018-1-22');
User u0002 now has two rows with exactly the same create_date (assuming such data exists). row_number() would still give these two rows different numbers, even though logically they are tied for most recent. rank() exists to solve exactly this: rows that compare equal under the ordering receive the same number.
select * from (
select
rank() OVER w as row_num,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (partition by user_no order by create_date desc)
)t ;
Result (the records with order_id 10 and 11 tie with row_num = 1):
+---------+----------+---------+--------+-------------+
| row_num | order_id | user_no | amount | create_date |
+---------+----------+---------+--------+-------------+
| 1 | 5 | u0001 | 900 | 2018-01-20 |
| 2 | 4 | u0001 | 800 | 2018-01-10 |
| 3 | 2 | u0001 | 300 | 2018-01-02 |
| 3 | 3 | u0001 | 300 | 2018-01-02 |
| 5 | 1 | u0001 | 100 | 2018-01-01 |
| 1 | 10 | u0002 | 800 | 2018-01-22 |
| 1 | 11 | u0002 | 800 | 2018-01-22 |
| 3 | 9 | u0002 | 800 | 2018-01-16 |
| 4 | 8 | u0002 | 300 | 2018-01-10 |
| 5 | 7 | u0002 | 600 | 2018-01-06 |
| 6 | 6 | u0002 | 500 | 2018-01-05 |
+---------+----------+---------+--------+-------------+
11 rows in set
dense_rank()
dense_rank() addresses a side effect of rank()'s numbering: rank() skips numbers after ties. If two rows tie for 1st, the next rank is 3, and no row is ranked 2.
If you do not want such gaps, use dense_rank() instead.
select
dense_rank() OVER w as row_num,
order_id,user_no,amount,create_date
from order_info
where user_no = 'u0002'
WINDOW w AS (partition by user_no order by create_date desc)
;
Result:
+---------+----------+---------+--------+-------------+
| row_num | order_id | user_no | amount | create_date |
+---------+----------+---------+--------+-------------+
| 1 | 10 | u0002 | 800 | 2018-01-22 |
| 1 | 11 | u0002 | 800 | 2018-01-22 |
| 2 | 9 | u0002 | 800 | 2018-01-16 |
| 3 | 8 | u0002 | 300 | 2018-01-10 |
| 4 | 7 | u0002 | 600 | 2018-01-06 |
| 5 | 6 | u0002 | 500 | 2018-01-05 |
+---------+----------+---------+--------+-------------+
6 rows in set
Aggregate functions such as avg and sum, enhanced as window functions
Aggregates can also be computed over a window, e.g. sum(amount) OVER w, which here yields a running (cumulative) total within each partition.
select
order_id,user_no,amount,create_date,
sum(amount) OVER w as sum_amount,
avg(amount) OVER w as avg_amount
from order_info
WINDOW w AS (partition by user_no order by create_date desc)
;
Result:
+----------+---------+--------+-------------+------------+------------+
| order_id | user_no | amount | create_date | sum_amount | avg_amount |
+----------+---------+--------+-------------+------------+------------+
| 5 | u0001 | 900 | 2018-01-20 | 900 | 900.0000 |
| 4 | u0001 | 800 | 2018-01-10 | 1700 | 850.0000 |
| 2 | u0001 | 300 | 2018-01-02 | 2300 | 575.0000 |
| 3 | u0001 | 300 | 2018-01-02 | 2300 | 575.0000 |
| 1 | u0001 | 100 | 2018-01-01 | 2400 | 480.0000 |
| 10 | u0002 | 800 | 2018-01-22 | 1600 | 800.0000 |
| 11 | u0002 | 800 | 2018-01-22 | 1600 | 800.0000 |
| 9 | u0002 | 800 | 2018-01-16 | 2400 | 800.0000 |
| 8 | u0002 | 300 | 2018-01-10 | 2700 | 675.0000 |
| 7 | u0002 | 600 | 2018-01-06 | 3300 | 660.0000 |
| 6 | u0002 | 500 | 2018-01-05 | 3800 | 633.3333 |
+----------+---------+--------+-------------+------------+------------+
11 rows in set
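The ties in the running totals above (order_id 2 and 3 both show 2300) come from the default frame: with ORDER BY and no explicit frame, the frame is RANGE-based, so peer rows sharing a create_date are summed together. A sketch contrasting the default with a ROWS frame, which advances one physical row at a time:

```sql
SELECT
  order_id, user_no, amount, create_date,
  -- default RANGE frame: peers (equal create_date) share one running total
  SUM(amount) OVER (PARTITION BY user_no ORDER BY create_date DESC) AS range_sum,
  -- ROWS frame: strictly row-by-row running total
  SUM(amount) OVER (PARTITION BY user_no ORDER BY create_date DESC
                    ROWS UNBOUNDED PRECEDING) AS rows_sum
FROM order_info;
```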
NTILE(N): split ordered rows into N groups
A simple example: sort student scores in descending order and split them into top, middle, and bottom thirds; NTILE(3) tells you which third each row falls into. This kind of requirement is not especially common.
Below, still using the table above, each user's orders are split into 3 buckets along the time dimension; ntile_num shows which bucket each row belongs to.
select
NTILE(3) OVER w as ntile_num,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (partition by user_no order by create_date desc)
;
Result:
+-----------+----------+---------+--------+-------------+
| ntile_num | order_id | user_no | amount | create_date |
+-----------+----------+---------+--------+-------------+
| 1 | 5 | u0001 | 900 | 2018-01-20 |
| 1 | 4 | u0001 | 800 | 2018-01-10 |
| 2 | 2 | u0001 | 300 | 2018-01-02 |
| 2 | 3 | u0001 | 300 | 2018-01-02 |
| 3 | 1 | u0001 | 100 | 2018-01-01 |
| 1 | 10 | u0002 | 800 | 2018-01-22 |
| 1 | 11 | u0002 | 800 | 2018-01-22 |
| 2 | 9 | u0002 | 800 | 2018-01-16 |
| 2 | 8 | u0002 | 300 | 2018-01-10 |
| 3 | 7 | u0002 | 600 | 2018-01-06 |
| 3 | 6 | u0002 | 500 | 2018-01-05 |
+-----------+----------+---------+--------+-------------+
11 rows in set
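To match the idea of bucketing a single user's orders, the query can also be restricted to u0002; with 6 rows, NTILE(3) yields three buckets of two rows each (a sketch on the same table):

```sql
SELECT
  NTILE(3) OVER w AS ntile_num,
  order_id, user_no, amount, create_date
FROM order_info
WHERE user_no = 'u0002'
WINDOW w AS (ORDER BY create_date DESC);
```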
first_value(column_name) and last_value(column_name)
first_value and last_value are largely self-explanatory: within a group of rows under some ordering, they return a column's value from the first and the last row of the window frame.
Get a feel for it from the result. (With only ORDER BY in the window and no explicit frame, the frame ends at the current row, which is why last_value_date below always equals the current row's create_date.)
select
first_value(create_date) OVER w as first_value_date,
last_value(create_date) OVER w as last_value_date,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (partition by user_no order by create_date)
;
Result:
+------------------+-----------------+----------+---------+--------+-------------+
| first_value_date | last_value_date | order_id | user_no | amount | create_date |
+------------------+-----------------+----------+---------+--------+-------------+
| 2018-01-01 | 2018-01-01 | 1 | u0001 | 100 | 2018-01-01 |
| 2018-01-01 | 2018-01-02 | 2 | u0001 | 300 | 2018-01-02 |
| 2018-01-01 | 2018-01-02 | 3 | u0001 | 300 | 2018-01-02 |
| 2018-01-01 | 2018-01-10 | 4 | u0001 | 800 | 2018-01-10 |
| 2018-01-01 | 2018-01-20 | 5 | u0001 | 900 | 2018-01-20 |
| 2018-01-05 | 2018-01-05 | 6 | u0002 | 500 | 2018-01-05 |
| 2018-01-05 | 2018-01-06 | 7 | u0002 | 600 | 2018-01-06 |
| 2018-01-05 | 2018-01-10 | 8 | u0002 | 300 | 2018-01-10 |
| 2018-01-05 | 2018-01-16 | 9 | u0002 | 800 | 2018-01-16 |
| 2018-01-05 | 2018-01-22 | 10 | u0002 | 800 | 2018-01-22 |
| 2018-01-05 | 2018-01-22 | 11 | u0002 | 800 | 2018-01-22 |
+------------------+-----------------+----------+---------+--------+-------------+
11 rows in set
nth_value(column_name,n)
Returns the column's value from the n-th row of the window frame (NULL while the frame does not yet contain n rows). This function is not used much, and the logic is frankly hard to put into words; the examples below convey it best.
n = 3
select
nth_value(order_id,3) OVER w as nth_value_1,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (partition by user_no order by create_date asc)
;
Result:
+-------------+----------+---------+--------+-------------+
| nth_value_1 | order_id | user_no | amount | create_date |
+-------------+----------+---------+--------+-------------+
| NULL | 1 | u0001 | 100 | 2018-01-01 |
| 3 | 2 | u0001 | 300 | 2018-01-02 |
| 3 | 3 | u0001 | 300 | 2018-01-02 |
| 3 | 4 | u0001 | 800 | 2018-01-10 |
| 3 | 5 | u0001 | 900 | 2018-01-20 |
| NULL | 6 | u0002 | 500 | 2018-01-05 |
| NULL | 7 | u0002 | 600 | 2018-01-06 |
| 8 | 8 | u0002 | 300 | 2018-01-10 |
| 8 | 9 | u0002 | 800 | 2018-01-16 |
| 8 | 10 | u0002 | 800 | 2018-01-22 |
| 8 | 11 | u0002 | 800 | 2018-01-22 |
+-------------+----------+---------+--------+-------------+
11 rows in set
n = 4
select
nth_value(order_id,4) OVER w as nth_value_1,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (partition by user_no order by create_date asc)
;
Result:
+-------------+----------+---------+--------+-------------+
| nth_value_1 | order_id | user_no | amount | create_date |
+-------------+----------+---------+--------+-------------+
| NULL | 1 | u0001 | 100 | 2018-01-01 |
| NULL | 2 | u0001 | 300 | 2018-01-02 |
| NULL | 3 | u0001 | 300 | 2018-01-02 |
| 4 | 4 | u0001 | 800 | 2018-01-10 |
| 4 | 5 | u0001 | 900 | 2018-01-20 |
| NULL | 6 | u0002 | 500 | 2018-01-05 |
| NULL | 7 | u0002 | 600 | 2018-01-06 |
| NULL | 8 | u0002 | 300 | 2018-01-10 |
| 9 | 9 | u0002 | 800 | 2018-01-16 |
| 9 | 10 | u0002 | 800 | 2018-01-22 |
| 9 | 11 | u0002 | 800 | 2018-01-22 |
+-------------+----------+---------+--------+-------------+
11 rows in set
cume_dist
Under some ordering, (number of rows less than or equal to the current row's value) / (total rows): the cumulative distribution of the data along some dimension.
For example, within the u0002 partition below (6 rows):
Row 1's create_date is 2018-01-05; 1 row is <= 2018-01-05, so 1/6 = 0.1666....
Row 2's create_date is 2018-01-06; 2 rows are <= 2018-01-06, so 2/6 = 0.3333....
And so on: row 4's create_date is 2018-01-16; 4 rows are <= 2018-01-16, so 4/6 = 0.6666....
Row 4's 0.6666... means that 66.66...% of the qualifying rows have a create_date no later than row 4's.
select
order_id,user_no,amount,create_date,
cume_dist() OVER w as cume_dist_value
from order_info
WINDOW w AS (partition by user_no order by create_date asc)
;
Result:
+----------+---------+--------+-------------+---------------------+
| order_id | user_no | amount | create_date | cume_dist_value |
+----------+---------+--------+-------------+---------------------+
| 1 | u0001 | 100 | 2018-01-01 | 0.2 |
| 2 | u0001 | 300 | 2018-01-02 | 0.6 |
| 3 | u0001 | 300 | 2018-01-02 | 0.6 |
| 4 | u0001 | 800 | 2018-01-10 | 0.8 |
| 5 | u0001 | 900 | 2018-01-20 | 1 |
| 6 | u0002 | 500 | 2018-01-05 | 0.16666666666666666 |
| 7 | u0002 | 600 | 2018-01-06 | 0.3333333333333333 |
| 8 | u0002 | 300 | 2018-01-10 | 0.5 |
| 9 | u0002 | 800 | 2018-01-16 | 0.6666666666666666 |
| 10 | u0002 | 800 | 2018-01-22 | 1 |
| 11 | u0002 | 800 | 2018-01-22 | 1 |
+----------+---------+--------+-------------+---------------------+
11 rows in set
percent_rank()
Another way of measuring data distribution, this time with the formula (rank - 1) / (total rows - 1).
We won't dwell on the algorithm; it is also rarely used in practice.
select
order_id,user_no,amount,create_date,
rank() OVER w as rank_num,
percent_rank() OVER w as percent_rank_value
from order_info
WINDOW w AS (partition by user_no order by create_date asc)
;
Result:
+----------+---------+--------+-------------+----------+--------------------+
| order_id | user_no | amount | create_date | rank_num | percent_rank_value |
+----------+---------+--------+-------------+----------+--------------------+
| 1 | u0001 | 100 | 2018-01-01 | 1 | 0 |
| 2 | u0001 | 300 | 2018-01-02 | 2 | 0.25 |
| 3 | u0001 | 300 | 2018-01-02 | 2 | 0.25 |
| 4 | u0001 | 800 | 2018-01-10 | 4 | 0.75 |
| 5 | u0001 | 900 | 2018-01-20 | 5 | 1 |
| 6 | u0002 | 500 | 2018-01-05 | 1 | 0 |
| 7 | u0002 | 600 | 2018-01-06 | 2 | 0.2 |
| 8 | u0002 | 300 | 2018-01-10 | 3 | 0.4 |
| 9 | u0002 | 800 | 2018-01-16 | 4 | 0.6 |
| 10 | u0002 | 800 | 2018-01-22 | 5 | 0.8 |
| 11 | u0002 | 800 | 2018-01-22 | 5 | 0.8 |
+----------+---------+--------+-------------+----------+--------------------+
11 rows in set
lag and lead
lag(column, n) fetches a column's value from the row n positions before the current row under some ordering; lead(column, n) fetches it from the row n positions after.
A practical example: ordered by time, get the previous and the next order's timestamp for each order (which lets you measure the interval between orders, i.e. how frequently a user buys).
select
order_id,
user_no,
amount,
create_date,
lag(create_date,1) OVER w as 'last_transaction_time',
lead(create_date,1) OVER w as 'next_transaction_time'
from order_info
WINDOW w AS (partition by user_no order by create_date asc)
;
Result:
+----------+---------+--------+-------------+-----------------------+-----------------------+
| order_id | user_no | amount | create_date | last_transaction_time | next_transaction_time |
+----------+---------+--------+-------------+-----------------------+-----------------------+
| 1 | u0001 | 100 | 2018-01-01 | NULL | 2018-01-02 |
| 2 | u0001 | 300 | 2018-01-02 | 2018-01-01 | 2018-01-02 |
| 3 | u0001 | 300 | 2018-01-02 | 2018-01-02 | 2018-01-10 |
| 4 | u0001 | 800 | 2018-01-10 | 2018-01-02 | 2018-01-20 |
| 5 | u0001 | 900 | 2018-01-20 | 2018-01-10 | NULL |
| 6 | u0002 | 500 | 2018-01-05 | NULL | 2018-01-06 |
| 7 | u0002 | 600 | 2018-01-06 | 2018-01-05 | 2018-01-10 |
| 8 | u0002 | 300 | 2018-01-10 | 2018-01-06 | 2018-01-16 |
| 9 | u0002 | 800 | 2018-01-16 | 2018-01-10 | 2018-01-22 |
| 10 | u0002 | 800 | 2018-01-22 | 2018-01-16 | 2018-01-22 |
| 11 | u0002 | 800 | 2018-01-22 | 2018-01-22 | NULL |
+----------+---------+--------+-------------+-----------------------+-----------------------+
11 rows in set
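To turn the neighboring timestamps into an actual interval, the window results can be fed to DATEDIFF; a sketch (NULL appears at the partition edges, where there is no previous or next order):

```sql
SELECT
  order_id, user_no, create_date,
  -- days since the previous order by the same user
  DATEDIFF(create_date, LAG(create_date) OVER w)  AS days_since_prev,
  -- days until the next order by the same user
  DATEDIFF(LEAD(create_date) OVER w, create_date) AS days_until_next
FROM order_info
WINDOW w AS (partition by user_no order by create_date asc);
```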
CTEs (common table expressions)
CTEs come in two flavors: non-recursive and recursive.
A non-recursive CTE improves readability and gives the logic a structured expression.
We all dread a single SQL statement that runs to dozens or even hundreds of lines and whose intent is impossible to follow. Such SQL can be broken up with CTEs:
put logical block A into one CTE and logical block B into another, then build the final query on top of both. Compared with cramming the whole query into one statement, the logic becomes much clearer and more direct.
A simple example (admittedly too small to really make the case), again using the first requirement: each user's most recent order.
Step 1: number each user's orders by time in a CTE. Step 2: query that CTE and keep the rows whose row number is 1.
with cte as (
select row_number() OVER w as row_num,
order_id,user_no,amount,create_date
from order_info
WINDOW w AS (partition by user_no order by create_date desc)
)
select * from cte where row_num = 1
;
Result:
+---------+----------+---------+--------+-------------+
| row_num | order_id | user_no | amount | create_date |
+---------+----------+---------+--------+-------------+
| 1 | 5 | u0001 | 900 | 2018-01-20 |
| 1 | 10 | u0002 | 800 | 2018-01-22 |
+---------+----------+---------+--------+-------------+
2 rows in set
The other flavor is the recursive CTE. Recursion has plenty of use cases, such as finding the sub-departments of a department, their sub-departments, and so on.
We won't demonstrate the details here; just one recursive usage: generating a run of consecutive dates.
with recursive date_step(datetime_step) as (
select cast(SYSDATE() as date)
UNION ALL
select DATE_ADD(datetime_step,INTERVAL 1 day)
from date_step
where datetime_step < ADDDATE(CURDATE(),13)
)
select datetime_step from date_step
;
Of course the recursion does not run forever; different databases impose different limits. In MySQL 8.0 the default maximum recursion depth is 1000.
Exceeding the maximum depth raises an error: Recursive query aborted after 1001 iterations. Try increasing @@cte_max_recursion_depth to a larger value.
The limit is controlled by the variable @@cte_max_recursion_depth.
mysql> select @@cte_max_recursion_depth;
+---------------------------+
| @@cte_max_recursion_depth |
+---------------------------+
| 1000 |
+---------------------------+
1 row in set
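The limit can be raised for the current session when deeper recursion is genuinely needed; 10000 here is just an illustrative value:

```sql
-- session-scoped: affects only subsequent recursive CTEs in this connection
SET SESSION cte_max_recursion_depth = 10000;
```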
The restrictions on CTEs are much the same as in other databases: for example, every column in a CTE's query result must have a name, and a recursive CTE may not reference itself more than once in its recursive part; seasoned CTE users will know these well.
Window functions and CTEs simplify writing SQL and expressing its logic. It is not that these results were unachievable without the new features; they just let the SQL be written in a more elegant and readable way.
Note that all of this is new in MySQL 8.0; before 8.0, we still have to fall back on the more convoluted approaches.