At Doit Education I learned a great deal, and not only valuable knowledge: I also gained valuable friendships. Besides our college roommates, these people may well be a driving force in our growth.
If you happen to be reading what I wrote, don't worry about whether it is well written or messy; as long as it helps you, it is good.
There are plenty of study materials online. I am especially fond of Bilibili, which has learning resources for everything. Follow Doit Education, 尚硅谷, Heima, Lao Han, and other teachers and training institutions. Don't cling to just one, because everyone's style is different, and people differ in how they absorb and adapt, so pick what suits you best. You may understand what a teacher says in class yet be unable to write the code yourself, and then forget it again after the exercises. When studying, don't just watch: keep up with the pace and type along on the keyboard.
5. Doris query syntax
5.0 Overall structure of a query
SQL
SELECT
[ALL | DISTINCT | DISTINCTROW ] -- whether to deduplicate the selected result or keep all rows
select_expr [, select_expr ...] -- the columns/expressions to select
[FROM table_references
[PARTITION partition_list] -- which table in which database, and optionally which partition(s), to read from
[WHERE where_condition] -- WHERE filtering
[GROUP BY {col_name | expr | position} -- GROUP BY aggregation
[ASC | DESC], ... [WITH ROLLUP]]
[HAVING where_condition] -- HAVING: a second filter applied after aggregation
[ORDER BY {col_name | expr | position} -- sort the result set by columns
[ASC | DESC], ...] -- sort direction
[LIMIT {[offset,] row_count | row_count OFFSET offset}] -- limit how many rows are returned
[INTO OUTFILE 'file_name'] -- export the query result to a file
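Putting the clauses above together, a minimal end-to-end example might look like this (the `orders` table and its columns here are hypothetical, just to show the clause order):

```sql
-- count orders per user since 2022, keep users with more than 10 orders,
-- sort by the count descending and return the top 5
SELECT user_id, count(*) AS order_cnt
FROM orders
WHERE order_date >= '2022-01-01'
GROUP BY user_id
HAVING order_cnt > 10
ORDER BY order_cnt DESC
LIMIT 5;
```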
5.1 Doris built-in functions
5.1.1 Conditional functions
5.1.1.1 The if function
Syntax:
SQL
if(boolean condition, type valueTrue, type valueFalseOrNull)
-- if the expression condition evaluates to true, return valueTrue; otherwise return valueFalseOrNull
-- return type: the type of the valueTrue expression
Example:
SQL
mysql> select user_id, if(user_id = 1, "true", "false") as test_if from test;
+---------+---------+
| user_id | test_if |
+---------+---------+
| 1       | true    |
| 2       | false   |
+---------+---------+
5.1.1.2 The ifnull, nvl, coalesce, and nullif functions
Syntax:
SQL
ifnull(expr1, expr2)
-- if expr1 is not NULL, return expr1; otherwise return expr2
nvl(expr1, expr2)
-- if expr1 is not NULL, return expr1; otherwise return expr2
coalesce(expr1, expr2, ..., expr_n)
-- return the first non-NULL expression among the arguments (scanning left to right)
nullif(expr1, expr2)
-- if the two arguments are equal, return NULL; otherwise return the first argument
Example:
SQL
mysql> select ifnull(1,0);
+--------------+
| ifnull(1, 0) |
+--------------+
| 1            |
+--------------+
mysql> select nvl(null,10);
+--------------+
| nvl(null,10) |
+--------------+
| 10           |
+--------------+
mysql> select coalesce(NULL, '1111', '0000');
+--------------------------------+
| coalesce(NULL, '1111', '0000') |
+--------------------------------+
| 1111                           |
+--------------------------------+
mysql> select coalesce(NULL, NULL, NULL, '0000', NULL);
+------------------------------------------+
| coalesce(NULL, NULL, NULL, '0000', NULL) |
+------------------------------------------+
| 0000                                     |
+------------------------------------------+
mysql> select nullif(1,1);
+--------------+
| nullif(1, 1) |
+--------------+
| NULL         |
+--------------+
mysql> select nullif(1,0);
+--------------+
| nullif(1, 0) |
+--------------+
| 1            |
+--------------+
5.1.1.3 case
Syntax:
SQL
-- Form 1
CASE expression
WHEN condition1 THEN result1
[WHEN condition2 THEN result2]
...
[WHEN conditionN THEN resultN]
[ELSE result]
END
-- Form 2
CASE WHEN condition1 THEN result1
[WHEN condition2 THEN result2]
...
[WHEN conditionN THEN resultN]
[ELSE result]
END
-- compares the expression against a list of possible values and returns the matching result
Example:
SQL
mysql> select user_id,
case user_id
when 1 then 'user_id = 1'
when 2 then 'user_id = 2'
else 'user_id not exist'
end as test_case
from test;
+---------+-------------------+
| user_id | test_case         |
+---------+-------------------+
| 1       | user_id = 1       |
| 2       | user_id = 2       |
| 3       | user_id not exist |
+---------+-------------------+
mysql> select user_id,
case
when user_id = 1 then 'user_id = 1'
when user_id = 2 then 'user_id = 2'
else 'user_id not exist'
end as test_case
from test;
+---------+-------------+
| user_id | test_case   |
+---------+-------------+
| 1       | user_id = 1 |
| 2       | user_id = 2 |
+---------+-------------+
5.1.2 Aggregate functions
5.1.2.1 min, max, sum, avg, count
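These five behave like their counterparts in MySQL and Hive. A quick sketch against the score table that is created later in this chapter (assuming it is already loaded):

```sql
-- per-subject statistics: lowest, highest, total, average score and row count
SELECT subject,
       min(score) AS min_s,
       max(score) AS max_s,
       sum(score) AS sum_s,
       avg(score) AS avg_s,
       count(*)   AS cnt
FROM score
GROUP BY subject;
```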
5.1.2.2 min_by and max_by
SQL
MAX_BY(expr1, expr2)
-- returns the value of expr1 from the row where expr2 is at its maximum
-- (a handy shortcut for "top 1 per group" queries)
Syntax example:
SQL
MySQL > select * from tbl;
+------+------+------+------+
| k1   | k2   | k3   | k4   |
+------+------+------+------+
| 0    | 3    | 2    | 100  |
| 1    | 2    | 3    | 4    |
| 4    | 3    | 2    | 2    |
| 3    | 4    | 2    | 1    |
+------+------+------+------+
MySQL > select max_by(k1, k4) from tbl;
-- take the value of column k1 on the row where column k4 is largest
+----------------+
| max_by(k1, k4) |
+----------------+
| 0              |
+----------------+
Exercise:
SQL
-- data (columns: name, subject, score)
zss,chinese,99
zss,math,89
zss,English,79
lss,chinese,88
lss,math,88
lss,English,22
www,chinese,99
www,math,45
zll,chinese,23
zll,math,88
zll,English,80
www,English,94
-- table DDL
create table score
(
name varchar(50),
subject varchar(50),
score double
)
DUPLICATE KEY(name)
DISTRIBUTED BY HASH(name) BUCKETS 1;
-- load the data from a local file via stream load
curl \
-u root: \
-H "label:score" \
-H "column_separator:," \
-T /root/data/score.txt \
http://doitedu01:8040/api/test/score/_stream_load
-- find, for each subject, the person with the highest score
select
subject, max_by(name, score) as name
from score
group by subject;
+---------+------+
| subject | name |
+---------+------+
| English | www  |
| math    | lss  |
| chinese | www  |
+---------+------+
5.1.2.3 group_concat
Task: for each person, list all the subjects they have a score for
select
name,
group_concat(subject, ',') as all_subject
from score
group by name;
Syntax:
SQL
VARCHAR GROUP_CONCAT([DISTINCT] VARCHAR col[, VARCHAR sep])
-- an aggregate function similar to sum(); group_concat joins the values from
-- multiple rows of the result set into a single string
-- the collected column must be of a string type (string, varchar, char)
-- when no separator is given, ',' is used by default
VARCHAR: the return type of GROUP_CONCAT
[DISTINCT]: optional; deduplicates the values of the column being concatenated
[, VARCHAR sep]: the separator used when joining, ',' by default
Example:
SQL
-- create the table
create table example(
id int,
name varchar(50),
age int,
gender string,
is_marry boolean,
marry_date date,
marry_datetime datetime
)engine = olap
distributed by hash(id) buckets 3;
-- insert data
insert into example values
(1,'zss',18,'male',0,null,null),
(2,'lss',28,'female',1,'2022-01-01','2022-01-01 11:11:11'),
(3,'ww',38,'male',1,'2022-02-01','2022-02-01 11:11:11'),
(4,'zl',48,'female',0,null,null),
(5,'tq',58,'male',1,'2022-03-01','2022-03-01 11:11:11'),
(6,'mly',18,'male',1,'2022-04-01','2022-04-01 11:11:11'),
(7,null,18,'male',1,'2022-05-01','2022-05-01 11:11:11');
-- when the collected column contains NULL values, they are filtered out automatically
select
gender,
group_concat(name, ',') as gc_name
from example
group by gender;
+--------+---------------+
| gender | gc_name       |
+--------+---------------+
| female | zl,lss        |
| male   | zss,ww,tq,mly |
+--------+---------------+
select
gender,
group_concat(DISTINCT cast(age as string)) as gc_age
from example
group by gender;
+--------+------------+
| gender | gc_age     |
+--------+------------+
| female | 48, 28     |
| male   | 58, 38, 18 |
+--------+------------+
5.1.2.4 collect_list and collect_set (added in version 1.2)
Syntax:
SQL
ARRAY collect_list(expr)
-- returns an array containing all elements of expr (excluding NULLs);
-- the order of elements in the array is not deterministic
ARRAY collect_set(expr)
-- returns an array containing all distinct elements of expr (excluding NULLs);
-- the order of elements in the array is not deterministic
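A minimal sketch, assuming the example table created in the group_concat section is available:

```sql
-- collect every age per gender; collect_list keeps duplicates, collect_set drops them
SELECT gender,
       collect_list(age) AS age_list,
       collect_set(age)  AS age_set
FROM example
GROUP BY gender;
```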
5.1.3 Date functions
5.1.3.1 Getting the current time
curdate, current_date, now, curtime, current_time, current_timestamp
Example:
SQL
select current_date();
+----------------+
| current_date() |
+----------------+
| 2022-11-25     |
+----------------+
select curdate();
+------------+
| curdate()  |
+------------+
| 2022-11-25 |
+------------+
select now();
+---------------------+
| now()               |
+---------------------+
| 2022-11-25 00:55:15 |
+---------------------+
select curtime();
+-----------+
| curtime() |
+-----------+
| 00:42:13  |
+-----------+
select current_timestamp();
+---------------------+
| current_timestamp() |
+---------------------+
| 2022-11-25 00:42:30 |
+---------------------+
5.1.3.2 last_day (added in version 1.2)
Syntax:
SQL
DATE last_day(DATETIME date)
-- returns the last day of the month of the input date:
-- '28' (February in a non-leap year),
-- '29' (February in a leap year),
-- '30' (April, June, September, November),
-- '31' (January, March, May, July, August, October, December)
select last_day('2000-03-03 01:00:00'); -- return the date (year-month-day) of the last day of this month
ERROR 1105 (HY000): errCode = 2, detailMessage = No matching function with signature: last_day(varchar(-1)).
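The error above occurs because the string literal is typed as varchar, which does not match the DATETIME signature. A possible workaround (a sketch, assuming an explicit cast resolves the signature):

```sql
select last_day(cast('2000-03-03 01:00:00' as datetime));
```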
5.1.3.3 from_unixtime
Syntax:
SQL
DATETIME FROM_UNIXTIME(INT unix_timestamp[, VARCHAR string_format])
-- converts a unix timestamp into a time value in the format given by string_format
-- supports the format specifiers of date_format; the default is %Y-%m-%d %H:%i:%s
-- three commonly used formats:
yyyyMMdd
yyyy-MM-dd
yyyy-MM-dd HH:mm:ss
Example:
SQL
mysql> select from_unixtime(1196440219); -- result depends on the session time zone
+---------------------------+
| from_unixtime(1196440219) |
+---------------------------+
| 2007-12-01 00:30:19       |
+---------------------------+
mysql> select from_unixtime(1196440219, 'yyyy-MM-dd HH:mm:ss');
+--------------------------------------------------+
| from_unixtime(1196440219, 'yyyy-MM-dd HH:mm:ss') |
+--------------------------------------------------+
| 2007-12-01 00:30:19                              |
+--------------------------------------------------+
mysql> select from_unixtime(1196440219, '%Y-%m-%d');
+---------------------------------------+
| from_unixtime(1196440219, '%Y-%m-%d') |
+---------------------------------------+
| 2007-12-01                            |
+---------------------------------------+
5.1.3.4 unix_timestamp
Syntax:
SQL
UNIX_TIMESTAMP(),
UNIX_TIMESTAMP(DATETIME date),
UNIX_TIMESTAMP(DATETIME date, STRING fmt) -- pass a date and specify the format that date is in
-- converts a date into a unix timestamp; the return value is an int
Example:
SQL
-- timestamp of the current date/time
select unix_timestamp();
+------------------+
| unix_timestamp() |
+------------------+
| 1669309722       |
+------------------+
-- timestamp of a given date
select unix_timestamp('2022-11-26 01:09:01');
+---------------------------------------+
| unix_timestamp('2022-11-26 01:09:01') |
+---------------------------------------+
| 1669396141                            |
+---------------------------------------+
-- a date in an unusual format, with the format specified explicitly
select unix_timestamp('2022-11-26 01:09-01', '%Y-%m-%d %H:%i-%s');
+------------------------------------------------------------+
| unix_timestamp('2022-11-26 01:09-01', '%Y-%m-%d %H:%i-%s') |
+------------------------------------------------------------+
| 1669396141                                                 |
+------------------------------------------------------------+
5.1.3.5 to_date
Syntax:
SQL
DATE TO_DATE(DATETIME)
-- returns the date part of a DATETIME value.
Example:
SQL
select to_date("2022-11-20 00:00:00");
+--------------------------------+
| to_date('2022-11-20 00:00:00') |
+--------------------------------+
| 2022-11-20                     |
+--------------------------------+
5.1.3.6 extract
Syntax:
SQL
extract(unit FROM DATETIME) -- extract
-- extracts the value of the given unit from a DATETIME.
-- unit can be year, month, day, hour, minute or second
Example:
SQL
select
extract(year from '2022-09-22 17:01:30') as year,
extract(month from '2022-09-22 17:01:30') as month,
extract(day from '2022-09-22 17:01:30') as day,
extract(hour from '2022-09-22 17:01:30') as hour,
extract(minute from '2022-09-22 17:01:30') as minute,
extract(second from '2022-09-22 17:01:30') as second;
+------+-------+------+------+--------+--------+
| year | month | day  | hour | minute | second |
+------+-------+------+------+--------+--------+
| 2022 | 9     | 22   | 17   | 1      | 30     |
+------+-------+------+------+--------+--------+
5.1.3.7 date_add, date_sub, datediff
Syntax:
SQL
DATE_ADD(DATETIME date, INTERVAL expr type)
DATE_SUB(DATETIME date, INTERVAL expr type)
-- add (or subtract) the given time interval to a date.
-- date is a valid date expression.
-- expr is the interval you wish to add.
-- type can be one of: YEAR, MONTH, DAY, HOUR, MINUTE, SECOND
DATEDIFF(DATETIME expr1, DATETIME expr2)
-- computes how many days lie between two dates; the result is precise to the day.
Example:
SQL
select date_add('2010-11-30 23:59:59', INTERVAL 2 DAY);
+-------------------------------------------------+
| date_add('2010-11-30 23:59:59', INTERVAL 2 DAY) |
+-------------------------------------------------+
| 2010-12-02 23:59:59                             |
+-------------------------------------------------+
-- passing a negative interval is equivalent to date_sub
select date_add('2010-11-30 23:59:59', INTERVAL -2 DAY);
+--------------------------------------------------+
| date_add('2010-11-30 23:59:59', INTERVAL -2 DAY) |
+--------------------------------------------------+
| 2010-11-28 23:59:59                              |
+--------------------------------------------------+
mysql> select datediff('2022-11-27 22:51:56','2022-11-24 22:50:56');
+--------------------------------------------------------+
| datediff('2022-11-27 22:51:56', '2022-11-24 22:50:56') |
+--------------------------------------------------------+
| 3                                                      |
+--------------------------------------------------------+
5.1.3.8 date_format
Syntax:
SQL
VARCHAR DATE_FORMAT(DATETIME date, VARCHAR format)
-- converts a date value to a string according to the given format
Example:
SQL
select date_format('2007-10-04 22:23:00', '%H:%i:%s');
+------------------------------------------------+
| date_format('2007-10-04 22:23:00', '%H:%i:%s') |
+------------------------------------------------+
| 22:23:00                                       |
+------------------------------------------------+
select date_format('2007-10-04 22:23:00', '%Y-%m-%d');
+------------------------------------------------+
| date_format('2007-10-04 22:23:00', '%Y-%m-%d') |
+------------------------------------------------+
| 2007-10-04                                     |
+------------------------------------------------+
5.1.4 String functions
5.1.4.1 length, lower, upper, reverse
Example:
SQL
These return the length of a string, convert a string to upper or lower case, and reverse a string.
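A minimal sketch of all four on a literal:

```sql
select length('Doris')  as len,   -- length of the string: 5
       lower('Doris')   as lo,    -- 'doris'
       upper('Doris')   as up,    -- 'DORIS'
       reverse('Doris') as rev;   -- 'siroD'
```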
5.1.4.2 lpad, rpad
Syntax:
SQL
VARCHAR rpad(VARCHAR str, INT len, VARCHAR pad)
VARCHAR lpad(VARCHAR str, INT len, VARCHAR pad)
-- returns a string of length len built from str (counting from the first character).
-- if len is greater than the length of str, pad characters are repeatedly
-- appended (for lpad, prepended) until the string reaches length len.
-- if len is smaller than the length of str, the function truncates str and
-- returns only the first len characters.
-- len means character length, not byte length.
Example:
SQL
-- pad on the left
SELECT lpad("1", 5, "0");
+-------------------+
| lpad("1", 5, "0") |
+-------------------+
| 00001             |
+-------------------+
-- pad on the right
SELECT rpad('11', 5, '0');
+--------------------+
| rpad('11', 5, '0') |
+--------------------+
| 11000              |
+--------------------+
5.1.4.3 concat, concat_ws
Syntax:
SQL
select concat("a", "b");
+------------------+
| concat('a', 'b') |
+------------------+
| ab               |
+------------------+
select concat("a", "b", "c");
+-----------------------+
| concat('a', 'b', 'c') |
+-----------------------+
| abc                   |
+-----------------------+
-- in concat, if any argument is null, the result is null
mysql> select concat("a", null, "c");
+------------------------+
| concat('a', NULL, 'c') |
+------------------------+
| NULL                   |
+------------------------+
-- concat_ws uses its first argument sep as the separator and joins the second
-- and all following arguments (or all strings in an ARRAY) into one string.
-- if the separator is NULL, it returns NULL. concat_ws does not skip empty
-- strings, but it does skip NULL values.
mysql> select concat_ws("_", "a", "b");
+--------------------------+
| concat_ws('_', 'a', 'b') |
+--------------------------+
| a_b                      |
+--------------------------+
mysql> select concat_ws(NULL, "d", "is");
+----------------------------+
| concat_ws(NULL, 'd', 'is') |
+----------------------------+
| NULL                       |
+----------------------------+
5.1.4.4 substr
Syntax:
SQL
-- substring: returns the part of the first argument starting at position start
-- with length len. The first character has index 1.
mysql> select substr("Hello doris", 3, 5);
+-----------------------------+
| substr('Hello doris', 3, 5) |
+-----------------------------+
| llo d                       |
+-----------------------------+
mysql> select substr("Hello doris", 1, 2);
+-----------------------------+
| substr('Hello doris', 1, 2) |
+-----------------------------+
| He                          |
+-----------------------------+
5.1.4.5 ends_with, starts_with
Syntax:
SQL
BOOLEAN ENDS_WITH(VARCHAR str, VARCHAR suffix)
-- returns true if the string ends with the given suffix, otherwise false.
-- returns NULL if any argument is NULL.
BOOLEAN STARTS_WITH(VARCHAR str, VARCHAR prefix)
-- returns true if the string starts with the given prefix, otherwise false.
-- returns NULL if any argument is NULL.
Example:
SQL
mysql> select ends_with("Hello doris", "doris");
+-----------------------------------+
| ends_with('Hello doris', 'doris') |
+-----------------------------------+
| 1                                 |
+-----------------------------------+
mysql> select ends_with("Hello doris", "Hello");
+-----------------------------------+
| ends_with('Hello doris', 'Hello') |
+-----------------------------------+
| 0                                 |
+-----------------------------------+
MySQL [(none)]> select starts_with("hello world","hello");
+-------------------------------------+
| starts_with('hello world', 'hello') |
+-------------------------------------+
| 1                                   |
+-------------------------------------+
MySQL [(none)]> select starts_with("hello world","world");
+-------------------------------------+
| starts_with('hello world', 'world') |
+-------------------------------------+
| 0                                   |
+-------------------------------------+
5.1.4.6 trim, ltrim, rtrim
Syntax:
SQL
VARCHAR trim(VARCHAR str)
-- removes the consecutive spaces at both the left and right ends of str
mysql> SELECT trim('   ab d   ') str;
+------+
| str  |
+------+
| ab d |
+------+
VARCHAR ltrim(VARCHAR str)
-- removes the consecutive spaces at the left end of str
mysql> SELECT ltrim('   ab d') str;
+------+
| str  |
+------+
| ab d |
+------+
VARCHAR rtrim(VARCHAR str)
-- removes the consecutive spaces at the right end of str
mysql> SELECT rtrim('ab d   ') str;
+------+
| str  |
+------+
| ab d |
+------+
5.1.4.7 null_or_empty, not_null_or_empty
SQL
BOOLEAN NULL_OR_EMPTY(VARCHAR str)
-- returns true if the string is the empty string or NULL, otherwise false.
MySQL [(none)]> select null_or_empty(null);
+---------------------+
| null_or_empty(NULL) |
+---------------------+
| 1                   |
+---------------------+
MySQL [(none)]> select null_or_empty("");
+-------------------+
| null_or_empty('') |
+-------------------+
| 1                 |
+-------------------+
MySQL [(none)]> select null_or_empty("a");
+--------------------+
| null_or_empty('a') |
+--------------------+
| 0                  |
+--------------------+
BOOLEAN NOT_NULL_OR_EMPTY(VARCHAR str)
-- returns false if the string is the empty string or NULL, otherwise true.
MySQL [(none)]> select not_null_or_empty(null);
+-------------------------+
| not_null_or_empty(NULL) |
+-------------------------+
| 0                       |
+-------------------------+
MySQL [(none)]> select not_null_or_empty("");
+-----------------------+
| not_null_or_empty('') |
+-----------------------+
| 0                     |
+-----------------------+
MySQL [(none)]> select not_null_or_empty("a");
+------------------------+
| not_null_or_empty('a') |
+------------------------+
| 1                      |
+------------------------+
5.1.4.8 replace
SQL
VARCHAR REPLACE(VARCHAR str, VARCHAR old, VARCHAR new)
-- replaces every occurrence of the substring old in str with new
mysql> select replace("http://www.baidu.com:9090", "9090", "");
+--------------------------------------------------+
| replace('http://www.baidu.com:9090', '9090', '') |
+--------------------------------------------------+
| http://www.baidu.com:                            |
+--------------------------------------------------+
5.1.4.9 split_part
SQL
VARCHAR split_part(VARCHAR content, VARCHAR delimiter, INT field)
-- splits the string by the delimiter and returns the requested part (counting from 1).
mysql> select split_part("hello world", " ", 1);
+-----------------------------------+
| split_part('hello world', ' ', 1) |
+-----------------------------------+
| hello                             |
+-----------------------------------+
mysql> select split_part("hello world", " ", 2);
+-----------------------------------+
| split_part('hello world', ' ', 2) |
+-----------------------------------+
| world                             |
+-----------------------------------+
mysql> select split_part("2019年7月8号", "月", 1);
+-------------------------------------+
| split_part('2019年7月8号', '月', 1) |
+-------------------------------------+
| 2019年7                             |
+-------------------------------------+
mysql> select split_part("abca", "a", 1);
+----------------------------+
| split_part('abca', 'a', 1) |
+----------------------------+
|                            |
+----------------------------+
5.1.4.10 money_format
SQL
VARCHAR money_format(Number)
-- formats a number as currency: the integer part is grouped with a comma
-- every 3 digits, and the fractional part is rounded to 2 decimal places
mysql> select money_format(17014116);
+------------------------+
| money_format(17014116) |
+------------------------+
| 17,014,116.00          |
+------------------------+
mysql> select money_format(1123.456);
+------------------------+
| money_format(1123.456) |
+------------------------+
| 1,123.46               |
+------------------------+
mysql> select money_format(1123.4);
+----------------------+
| money_format(1123.4) |
+----------------------+
| 1,123.40             |
+----------------------+
5.1.5 Math functions
5.1.5.1 ceil and floor
SQL
BIGINT ceil(DOUBLE x)
-- returns the smallest integer value greater than or equal to x.
mysql> select ceil(1);
+-----------+
| ceil(1.0) |
+-----------+
| 1         |
+-----------+
mysql> select ceil(2.4);
+-----------+
| ceil(2.4) |
+-----------+
| 3         |
+-----------+
mysql> select ceil(-10.3);
+-------------+
| ceil(-10.3) |
+-------------+
| -10         |
+-------------+
BIGINT floor(DOUBLE x)
-- returns the largest integer value less than or equal to x.
mysql> select floor(1);
+------------+
| floor(1.0) |
+------------+
| 1          |
+------------+
mysql> select floor(2.4);
+------------+
| floor(2.4) |
+------------+
| 2          |
+------------+
mysql> select floor(-10.3);
+--------------+
| floor(-10.3) |
+--------------+
| -11          |
+--------------+
5.1.5.2 round
SQL
round(x), round(x, d)
-- rounds x to d decimal places (half away from zero); d defaults to 0.
-- if d is negative, the d digits left of the decimal point become 0.
-- if x or d is null, returns null.
mysql> select round(2.4);
+------------+
| round(2.4) |
+------------+
| 2          |
+------------+
mysql> select round(2.5);
+------------+
| round(2.5) |
+------------+
| 3          |
+------------+
mysql> select round(-3.4);
+-------------+
| round(-3.4) |
+-------------+
| -3          |
+-------------+
mysql> select round(-3.5);
+-------------+
| round(-3.5) |
+-------------+
| -4          |
+-------------+
mysql> select round(1667.2725, 2);
+---------------------+
| round(1667.2725, 2) |
+---------------------+
| 1667.27             |
+---------------------+
mysql> select round(1667.2725, -2);
+----------------------+
| round(1667.2725, -2) |
+----------------------+
| 1700                 |
+----------------------+
5.1.5.3 truncate
SQL
DOUBLE truncate(DOUBLE x, INT d)
-- truncates x to d decimal places without rounding.
-- rules:
-- d > 0: keep d decimal places of x
-- d = 0: drop the fractional part and keep only the integer part
-- d < 0: drop the fractional part, and replace the last |d| digits of the
--        integer part with zeros
mysql> select truncate(124.3867, 2);
+-----------------------+
| truncate(124.3867, 2) |
+-----------------------+
| 124.38                |
+-----------------------+
mysql> select truncate(124.3867, 0);
+-----------------------+
| truncate(124.3867, 0) |
+-----------------------+
| 124                   |
+-----------------------+
mysql> select truncate(-124.3867, -2);
+-------------------------+
| truncate(-124.3867, -2) |
+-------------------------+
| -100                    |
+-------------------------+
5.1.5.4 abs
SQL
numeric_type abs(numeric_type x)
-- returns the absolute value of x.
mysql> select abs(-2);
+---------+
| abs(-2) |
+---------+
| 2       |
+---------+
mysql> select abs(3.254655654);
+------------------+
| abs(3.254655654) |
+------------------+
| 3.254655654      |
+------------------+
mysql> select abs(-3254654236547654354654767);
+---------------------------------+
| abs(-3254654236547654354654767) |
+---------------------------------+
| 3254654236547654354654767       |
+---------------------------------+
5.1.5.5 pow
SQL
DOUBLE pow(DOUBLE a, DOUBLE b)
-- exponentiation: returns a raised to the power b.
mysql> select pow(2,0);
+---------------+
| pow(2.0, 0.0) |
+---------------+
| 1             |
+---------------+
mysql> select pow(2,3);
+---------------+
| pow(2.0, 3.0) |
+---------------+
| 8             |
+---------------+
mysql> select pow(3, 2.4);
+--------------------+
| pow(3.0, 2.4)      |
+--------------------+
| 13.966610165238235 |
+--------------------+
5.1.5.6 greatest and least
SQL
greatest(col_a, col_b, ..., col_n)
-- returns the largest of the n column values in a row; if any column is NULL, returns NULL.
least(col_a, col_b, ..., col_n)
-- returns the smallest of the n column values in a row; if any column is NULL, returns NULL.
mysql> select greatest(-1, 0, 5, 8);
+-----------------------+
| greatest(-1, 0, 5, 8) |
+-----------------------+
| 8                     |
+-----------------------+
mysql> select greatest(-1, 0, 5, NULL);
+--------------------------+
| greatest(-1, 0, 5, NULL) |
+--------------------------+
| NULL                     |
+--------------------------+
mysql> select greatest(6.3, 4.29, 7.6876);
+-----------------------------+
| greatest(6.3, 4.29, 7.6876) |
+-----------------------------+
| 7.6876                      |
+-----------------------------+
mysql> select greatest("2022-02-26 20:02:11","2020-01-23 20:02:11","2020-06-22 20:02:11");
+-------------------------------------------------------------------------------+
| greatest('2022-02-26 20:02:11', '2020-01-23 20:02:11', '2020-06-22 20:02:11') |
+-------------------------------------------------------------------------------+
| 2022-02-26 20:02:11                                                           |
+-------------------------------------------------------------------------------+
Mini exercise:
SQL
Task: for each person, find the salary component that makes up the largest share of their pay
-- prepare the data (columns: name, user_id, jbgz, jjgz, tcgz)
zss,1,2000,3000,5000
lss,2,1000,4000,1000
www,3,5000,1000,5000
tqq,4,4000,300,7000
-- table DDL
create table salary
(
name varchar(50),
user_id int,
jbgz double,
jjgz double,
tcgz double
)
DUPLICATE KEY(name)
DISTRIBUTED BY HASH(name) BUCKETS 1;
-- load the data from a local file via stream load
curl \
-u root: \
-H "label:salary" \
-H "column_separator:," \
-T /root/data/salary.txt \
http://doitedu01:8040/api/test/salary/_stream_load
select * from salary;
+------+---------+------+------+------+
| name | user_id | jbgz | jjgz | tcgz |
+------+---------+------+------+------+
| lss  | 2       | 1000 | 4000 | 1000 |
| tqq  | 4       | 4000 | 300  | 7000 |
| www  | 3       | 5000 | 1000 | 5000 |
| zss  | 1       | 2000 | 3000 | 5000 |
+------+---------+------+------+------+
Analysis: the highest salary component itself is easy, we can get it directly with greatest.
greatest usage: greatest(col1, col2, col3, ...) returns the maximum among several columns.
But how do we get the component's name? We can use equality matching: self-join the
table (or reuse the row) and match the maximum against the three salary columns.
+------+---------+------+------+------+--------+---------+
| name | user_id | jbgz | jjgz | tcgz | max_gz | gz_type |
+------+---------+------+------+------+--------+---------+
| lss  | 2       | 1000 | 4000 | 1000 | 4000   | jjgz    |
| tqq  | 4       | 4000 | 300  | 7000 | 7000   | tcgz    |
| www  | 3       | 5000 | 1000 | 5000 | 5000   | jbgz    |
| zss  | 1       | 2000 | 3000 | 5000 | 5000   | tcgz    |
+------+---------+------+------+------+--------+---------+
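One way to produce the table above (a sketch using greatest plus a case expression rather than a full self-join; column names as in the salary table):

```sql
SELECT name, user_id, jbgz, jjgz, tcgz,
       greatest(jbgz, jjgz, tcgz) AS max_gz,
       CASE greatest(jbgz, jjgz, tcgz)
           WHEN jbgz THEN 'jbgz'
           WHEN jjgz THEN 'jjgz'
           WHEN tcgz THEN 'tcgz'
       END AS gz_type
FROM salary;
```

Note that when two components tie (as for www), the CASE picks the first matching column in the order written.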
5.1.6 Array functions (officially added in version 1.2)
Only supported in the vectorized engine.
5.1.6.1 array()
SQL
ARRAY array(T, ...)
-- builds an array out of multiple values
mysql> set enable_vectorized_engine=true;
mysql> select array("1", 2, 1.1);
+----------------------+
| array('1', 2, '1.1') |
+----------------------+
| ['1', '2', '1.1']    |
+----------------------+
1 row in set (0.00 sec)
mysql> select array(null, 1);
+----------------+
| array(NULL, 1) |
+----------------+
| [NULL, 1]      |
+----------------+
1 row in set (0.00 sec)
mysql> select array(1, 2, 3);
+----------------+
| array(1, 2, 3) |
+----------------+
| [1, 2, 3]      |
+----------------+
1 row in set (0.00 sec)
5.1.6.2 array_min, array_max, array_avg, array_sum, array_size
SQL
These return the minimum, maximum and average of the array elements, the sum of all elements, and the length of the array.
-- NULL elements are skipped. For an empty array, or an array whose elements are all NULL, the result is NULL.
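A minimal sketch on an array literal (assuming the vectorized engine is enabled as in the other array examples):

```sql
select array_min([3, 1, 2])  as mn,   -- 1
       array_max([3, 1, 2])  as mx,   -- 3
       array_avg([3, 1, 2])  as av,   -- 2
       array_sum([3, 1, 2])  as sm,   -- 6
       array_size([3, 1, 2]) as sz;   -- 3
```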
5.1.6.3 array_remove
SQL
ARRAY array_remove(ARRAY arr, T val)
-- returns the array with every occurrence of the given element removed;
-- if an input argument is NULL, returns NULL
mysql> set enable_vectorized_engine=true;
mysql> select array_remove(['test', NULL, 'value'], 'value');
+-----------------------------------------------------+
| array_remove(ARRAY('test', NULL, 'value'), 'value') |
+-----------------------------------------------------+
| [test, NULL]                                        |
+-----------------------------------------------------+
mysql> select k1, k2, array_remove(k2, 1) from array_type_table_1;
+------+--------------------+---------------------+
| k1   | k2                 | array_remove(k2, 1) |
+------+--------------------+---------------------+
| 1    | [1, 2, 3]          | [2, 3]              |
| 2    | [1, 3]             | [3]                 |
| 3    | NULL               | NULL                |
| 4    | [1, 3]             | [3]                 |
| 5    | [NULL, 1, NULL, 2] | [NULL, NULL, 2]     |
+------+--------------------+---------------------+
mysql> select k1, k2, array_remove(k2, k1) from array_type_table_1;
+------+--------------------+----------------------+
| k1   | k2                 | array_remove(k2, k1) |
+------+--------------------+----------------------+
| 1    | [1, 2, 3]          | [2, 3]               |
| 2    | [1, 3]             | [1, 3]               |
| 3    | NULL               | NULL                 |
| 4    | [1, 3]             | [1, 3]               |
| 5    | [NULL, 1, NULL, 2] | [NULL, 1, NULL, 2]   |
+------+--------------------+----------------------+
5.1.6.4 array_sort
SQL
ARRAY array_sort(ARRAY arr)
-- returns the array sorted in ascending order; if the input array is NULL, returns NULL.
-- if the array contains NULL elements, they are placed at the front of the sorted array.
mysql> set enable_vectorized_engine=true;
mysql> select k1, k2, array_sort(k2) from array_test;
+------+-----------------------------+-----------------------------+
| k1   | k2                          | array_sort(k2)              |
+------+-----------------------------+-----------------------------+
| 1    | [1, 2, 3, 4, 5]             | [1, 2, 3, 4, 5]             |
| 2    | [6, 7, 8]                   | [6, 7, 8]                   |
| 3    | []                          | []                          |
| 4    | NULL                        | NULL                        |
| 5    | [1, 2, 3, 4, 5, 4, 3, 2, 1] | [1, 1, 2, 2, 3, 3, 4, 4, 5] |
| 6    | [1, 2, 3, NULL]             | [NULL, 1, 2, 3]             |
| 7    | [1, 2, 3, NULL, NULL]       | [NULL, NULL, 1, 2, 3]       |
| 8    | [1, 1, 2, NULL, NULL]       | [NULL, NULL, 1, 1, 2]       |
| 9    | [1, NULL, 1, 2, NULL, NULL] | [NULL, NULL, NULL, 1, 1, 2] |
+------+-----------------------------+-----------------------------+
5.1.6.5 array_contains
SQL
BOOLEAN array_contains(ARRAY arr, T value)
-- checks whether the array contains value. The result is:
-- 1 - value is present in arr;
-- 0 - value is not present in arr;
-- NULL - arr is NULL.
mysql> set enable_vectorized_engine=true;
mysql> SELECT id, c_array, array_contains(c_array, 5) FROM array_test;
+------+-----------------+----------------------------+
| id   | c_array         | array_contains(c_array, 5) |
+------+-----------------+----------------------------+
| 1    | [1, 2, 3, 4, 5] | 1                          |
| 2    | [6, 7, 8]       | 0                          |
| 3    | []              | 0                          |
| 4    | NULL            | NULL                       |
+------+-----------------+----------------------------+
5.1.6.6 array_except
SQL
ARRAY array_except(ARRAY array1, ARRAY array2)
-- returns an array of all elements that are in array1 but not in array2,
-- with duplicates removed from the result
-- similar to a set difference, deduplicated
mysql> set enable_vectorized_engine=true;
mysql> select k1,k2,k3,array_except(k2,k3) from array_type_table;
+------+-----------+-----------+----------------------+
| k1   | k2        | k3        | array_except(k2, k3) |
+------+-----------+-----------+----------------------+
| 1    | [1, 2, 3] | [2, 4, 5] | [1, 3]               |
| 2    | [2, 3]    | [1, 5]    | [2, 3]               |
| 3    | [1, 1, 1] | [2, 2, 2] | [1]                  |
+------+-----------+-----------+----------------------+
5.1.6.7 array_intersect
SQL
ARRAY array_intersect(ARRAY array1, ARRAY array2)
-- returns an array of all elements in the intersection of array1 and array2,
-- without duplicates
mysql> set enable_vectorized_engine=true;
mysql> select k1,k2,k3,array_intersect(k2,k3) from array_type_table;
+------+-----------+-----------+-------------------------+
| k1   | k2        | k3        | array_intersect(k2, k3) |
+------+-----------+-----------+-------------------------+
| 1    | [1, 2, 3] | [2, 4, 5] | [2]                     |
| 2    | [2, 3]    | [1, 5]    | []                      |
| 3    | [1, 1, 1] | [2, 2, 2] | []                      |
+------+-----------+-----------+-------------------------+
mysql> select k1,k2,k3,array_intersect(k2,k3) from array_type_table_nullable;
+------+-----------------+--------------+-------------------------+
| k1   | k2              | k3           | array_intersect(k2, k3) |
+------+-----------------+--------------+-------------------------+
| 1    | [1, NULL, 3]    | [1, 3, 5]    | [1, 3]                  |
| 2    | [NULL, NULL, 2] | [2, NULL, 4] | [NULL, 2]               |
| 3    | NULL            | [1, 2, 3]    | NULL                    |
+------+-----------------+--------------+-------------------------+
5.1.6.8 array_union
SQL
ARRAY array_union(ARRAY array1, ARRAY array2)
-- returns an array of all elements in the union of array1 and array2,
-- without duplicates
mysql> set enable_vectorized_engine=true;
mysql> select k1,k2,k3,array_union(k2,k3) from array_type_table;
+------+-----------+-----------+---------------------+
| k1   | k2        | k3        | array_union(k2, k3) |
+------+-----------+-----------+---------------------+
| 1    | [1, 2, 3] | [2, 4, 5] | [1, 2, 3, 4, 5]     |
| 2    | [2, 3]    | [1, 5]    | [2, 3, 1, 5]        |
| 3    | [1, 1, 1] | [2, 2, 2] | [1, 2]              |
+------+-----------+-----------+---------------------+
5.1.6.9 array_distinct
SQL
ARRAY array_distinct(ARRAY arr)
-- returns the array with duplicate elements removed; if the input array is NULL, returns NULL.
mysql> set enable_vectorized_engine=true;
mysql> select k1, k2, array_distinct(k2) from array_test;
+------+-----------------------------+--------------------+
| k1   | k2                          | array_distinct(k2) |
+------+-----------------------------+--------------------+
| 1    | [1, 2, 3, 4, 5]             | [1, 2, 3, 4, 5]    |
| 2    | [6, 7, 8]                   | [6, 7, 8]          |
| 3    | []                          | []                 |
| 4    | NULL                        | NULL               |
| 5    | [1, 2, 3, 4, 5, 4, 3, 2, 1] | [1, 2, 3, 4, 5]    |
| 6    | [1, 2, 3, NULL]             | [1, 2, 3, NULL]    |
| 7    | [1, 2, 3, NULL, NULL]       | [1, 2, 3, NULL]    |
+------+-----------------------------+--------------------+
5.1.7 JSON functions
Create a table and load some test data:
SQL
CREATE TABLE test_json (
id INT,
json_string String
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 3
PROPERTIES("replication_num" = "1");
-- test data
{"k1":"v31", "k2": 300, "a1": [{"k1":"v41", "k2": 400}, 1, "a", 3.14]}
{"k1":"v32", "k2": 400, "a1": [{"k1":"v41", "k2": 400}, 2, "a", 4.14],"a2":{"k3":"v33", "k4": 200,"a2": [{"k1":"v41", "k2": 400}, 2, "a", 4.14]}}
{"k1":"v33", "k2": 500, "a1": [{"k1":"v41", "k2": 400}, 3, "a", 5.14],"a2":{"k3":"v33", "k4": 200,"a2": [{"k5":"v42", "k6": 600}]}}
{"k1":"v31"}
{"k1":"v31", "k2": 300}
{"k1":"v31", "k2": 200, "a1": []}
-- JSON is a structure that stores key/value pairs
-- a value can take different forms:
1. a scalar value: "k1":"v31"
2. an array: [{"k1":"v41", "k2": 400}, 1, "a", 3.14]
3. an object: "a2":{"k3":"v33", "k4": 200,"a2": [{"k5":"v42", "k6": 600}]}
When reading a value, the expression you pass in, such as '$.k1', is called a json path.
-- load the data from a local file via stream load
curl \
-u root: \
-H "label:load_local_file1" \
-H "column_separator:_" \
-T /root/data/json.txt \
http://doitedu01:8040/api/test/test_json/_stream_load
-- insert one row with insert into
INSERT INTO test_json VALUES(7, '{"k1":"v1", "k2": 200}');
5.1.7.1 get_json_double, get_json_int, get_json_string
SQL
Syntax:
DOUBLE get_json_double(VARCHAR json_str, VARCHAR json_path)
INT get_json_int(VARCHAR json_str, VARCHAR json_path)
VARCHAR get_json_string(VARCHAR json_str, VARCHAR json_path)
-- parse the json string and extract the double/int/string content at the given path.
-- json_path must start with the $ symbol and uses . as the path separator.
-- if a path segment contains '.', it can be wrapped in double quotes.
-- [ ] denotes an array index, starting from 0.
-- the path content must not contain ", [ or ].
-- if json_string or json_path is malformed, or no match is found, NULL is returned.
–1.获取到k1对应的value的值
mysql> select id, get_json_string(json_string,‘$.k1’) as k1 from test_json;
±-----±-----+
| id | k1 |
±-----±-----+
| 2 | v32 |
| 4 | v31 |
| 5 | v31 |
| 6 | v31 |
| 1 | v31 |
| 3 | v33 |
±-----±-----+
-- 2. Get the array under key a1
mysql> select id, get_json_string(json_string,'$.a1') as arr from test_json;
+------+------------------------------------+
| id   | arr                                |
+------+------------------------------------+
| 1    | [{"k1":"v41","k2":400},1,"a",3.14] |
| 3    | [{"k1":"v41","k2":400},3,"a",5.14] |
| 2    | [{"k1":"v41","k2":400},2,"a",4.14] |
| 4    | NULL                               |
| 5    | NULL                               |
| 6    | []                                 |
+------+------------------------------------+
-- 3. Get the first element of the array under key a1
mysql> select id, get_json_string(json_string,'$.a1[0]') as arr from test_json;
+------+-----------------------+
| id   | arr                   |
+------+-----------------------+
| 2    | {"k1":"v41","k2":400} |
| 1    | {"k1":"v41","k2":400} |
| 3    | {"k1":"v41","k2":400} |
| 4    | NULL                  |
| 5    | NULL                  |
| 6    | NULL                  |
+------+-----------------------+
-- 4. Get the first array element (itself a JSON string), then extract from that string again
mysql> select id, get_json_string(get_json_string(json_string,'$.a1[0]'),'$.k1') as arr from test_json;
+------+------+
| id   | arr  |
+------+------+
| 2    | v41  |
| 1    | v41  |
| 3    | v41  |
| 4    | NULL |
| 5    | NULL |
| 6    | NULL |
+------+------+
6 rows in set (0.02 sec)
5.1.7.2 json_object
SQL
VARCHAR json_object(VARCHAR, ...)
-- Build a JSON object from the given key/value pairs.
-- Arguments come in key, value pairs, and keys must not be null.
MySQL> select json_object('time',curtime());
+--------------------------------+
| json_object('time', curtime()) |
+--------------------------------+
| {"time": "10:49:18"}           |
+--------------------------------+
MySQL> SELECT json_object('id', 87, 'name', 'carrot');
+-----------------------------------------+
| json_object('id', 87, 'name', 'carrot') |
+-----------------------------------------+
| {"id": 87, "name": "carrot"}            |
+-----------------------------------------+
MySQL> select json_object('username',null);
+-------------------------------+
| json_object('username', NULL) |
+-------------------------------+
| {"username": NULL}            |
+-------------------------------+
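The pairing behavior of json_object can be sketched as building a dict and serializing it (a hedged analogy, not Doris's code; `json_object` below is an illustrative Python function):

```python
import json

def json_object(*args):
    """Pair up arguments as key, value, key, value, ... and emit a JSON object."""
    if len(args) % 2 != 0:
        raise ValueError("json_object needs key/value pairs")
    keys, vals = args[::2], args[1::2]
    if any(k is None for k in keys):
        raise ValueError("keys must not be NULL")  # mirrors the rule above
    return json.dumps(dict(zip(keys, vals)))

print(json_object('id', 87, 'name', 'carrot'))  # {"id": 87, "name": "carrot"}
```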
5.1.8 Window functions
Window functions in Doris behave the same way as in Hive.
5.1.8.1 ROW_NUMBER(), DENSE_RANK(), RANK()
SQL
-- rank() assigns ranks with gaps: two rows tied for first both get 1, and the next row gets 3
select x, y, rank() over(partition by x order by y) as rank from int_t;
x | y | rank |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 2 | 2 |
2 | 1 | 1 |
2 | 2 | 2 |
2 | 3 | 3 |
3 | 1 | 1 |
3 | 1 | 1 |
3 | 2 | 3 |
-- dense_rank() assigns ranks without gaps: two rows tied for first both get 1, and the next row gets 2
select x, y, dense_rank() over(partition by x order by y) as rank from int_t;
x | y | rank |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 2 | 2 |
2 | 1 | 1 |
2 | 2 | 2 |
2 | 3 | 3 |
3 | 1 | 1 |
3 | 1 | 1 |
3 | 2 | 2 |
-- row_number() numbers rows 1, 2, 3, 4, 5 within each partition in order:
-- no duplicates and no gaps, just consecutive integers starting from 1
select x, y, row_number() over(partition by x order by y) as rank from int_t;
x | y | rank |
---|---|---|
1 | 1 | 1 |
1 | 2 | 2 |
1 | 2 | 3 |
2 | 1 | 1 |
2 | 2 | 2 |
2 | 3 | 3 |
3 | 1 | 1 |
3 | 1 | 2 |
3 | 2 | 3 |
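The three numbering functions above can be reproduced in a few lines of Python (an illustrative sketch over the same (x, y) rows, not Doris internals):

```python
from itertools import groupby

rows = [(1, 1), (1, 2), (1, 2), (2, 1), (2, 2), (2, 3), (3, 1), (3, 1), (3, 2)]

def number(partition, kind):
    """partition: y values already sorted; kind: 'rank' | 'dense_rank' | 'row_number'."""
    out, prev, rank_v, dense_v = [], None, 0, 0
    for i, y in enumerate(partition, 1):
        if y != prev:
            rank_v, dense_v, prev = i, dense_v + 1, y  # new value: rank jumps, dense steps
        out.append({'row_number': i, 'rank': rank_v, 'dense_rank': dense_v}[kind])
    return out

for x, grp in groupby(rows, key=lambda r: r[0]):
    ys = [y for _, y in grp]
    print(x, ys, number(ys, 'rank'), number(ys, 'dense_rank'), number(ys, 'row_number'))
```

For x = 3 (ys = [1, 1, 2]) this prints rank [1, 1, 3], dense_rank [1, 1, 2], row_number [1, 2, 3], matching the tables above.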
Exercise:
SQL
-- sample data
孙悟空,语文,87
孙悟空,数学,95
娜娜,英语,84
宋宋,语文,64
孙悟空,英语,68
宋宋,英语,84
婷婷,语文,65
娜娜,语文,94
宋宋,数学,86
婷婷,数学,85
娜娜,数学,56
婷婷,英语,78
-- create the table
create table stu
(
name varchar(50),
subject varchar(50),
score double
)
DUPLICATE KEY(name)
DISTRIBUTED BY HASH(name) BUCKETS 1;
-- load the data from a local file via stream load
curl \
-u root: \
-H "label:num_test" \
-H "column_separator:," \
-T /root/data/stu.txt \
http://doitedu01:8040/api/test/stu/_stream_load
Requirements:
[Ties share a rank (if two students tie for first, both rank 1 and the next rank is 2)]
1. Rank each student within each subject by score descending
2. Rank everyone by total score, highest total first
[Ties share a rank (if two students tie for first, both rank 1 and the next rank is 3)]
3. Rank each student within each subject by score descending
4. Rank everyone by total score, highest total first
[Ties broken explicitly
(if two students tie on total score,
break the tie by the Chinese score, then the math score, then the English score;
if all scores are equal, order by name, the earlier name ranking first)]
5. Rank everyone by total score, highest total first
SQL:
SQL
-- 1. Rank each student within each subject by score descending (dense_rank: ties share a rank, no gaps)
select
name,subject,score,
dense_rank() over(partition by subject order by score desc) as rank
from stu
+-----------+---------+-------+------+
| name      | subject | score | rank |
+-----------+---------+-------+------+
| 孙悟空    | 数学    | 95    | 1    |
| 宋宋      | 数学    | 86    | 2    |
| 婷婷      | 数学    | 85    | 3    |
| 娜娜      | 数学    | 56    | 4    |
| 娜娜      | 英语    | 84    | 1    |
| 宋宋      | 英语    | 84    | 1    |
| 婷婷      | 英语    | 78    | 2    |
| 孙悟空    | 英语    | 68    | 3    |
| 娜娜      | 语文    | 94    | 1    |
| 孙悟空    | 语文    | 87    | 2    |
| 婷婷      | 语文    | 65    | 3    |
| 宋宋      | 语文    | 64    | 4    |
+-----------+---------+-------+------+
-- 2. Rank everyone by total score, highest total first
select
name,sum_score,
-- ranking across all students' totals, so no partitioning is needed
dense_rank() over(order by sum_score desc) as rank
from
(
select
name,sum(score) as sum_score
from stu
group by name
) as t ;
+-----------+-----------+------+
| name      | sum_score | rank |
+-----------+-----------+------+
| 孙悟空    | 250       | 1    |
| 宋宋      | 234       | 2    |
| 娜娜      | 234       | 2    |
| 婷婷      | 228       | 3    |
+-----------+-----------+------+
[Ties share a rank (if two students tie for first, both rank 1 and the next rank is 3)]
-- 3. Rank each student within each subject by score descending (rank: ties share a rank, the next rank skips)
select
name,subject,score,
rank() over(partition by subject order by score desc) as rank
from stu
+-----------+---------+-------+------+
| name      | subject | score | rank |
+-----------+---------+-------+------+
| 孙悟空    | 数学    | 95    | 1    |
| 宋宋      | 数学    | 86    | 2    |
| 婷婷      | 数学    | 85    | 3    |
| 娜娜      | 数学    | 56    | 4    |
| 娜娜      | 英语    | 84    | 1    |
| 宋宋      | 英语    | 84    | 1    |
| 婷婷      | 英语    | 78    | 3    |
| 孙悟空    | 英语    | 68    | 4    |
| 娜娜      | 语文    | 94    | 1    |
| 孙悟空    | 语文    | 87    | 2    |
| 婷婷      | 语文    | 65    | 3    |
| 宋宋      | 语文    | 64    | 4    |
+-----------+---------+-------+------+
-- 4. Rank everyone by total score, highest total first
select
name,sum_score,
-- ranking across all students' totals, so no partitioning is needed
rank() over(order by sum_score desc) as rank
from
(
select
name,sum(score) as sum_score
from stu
group by name
) as t ;
+-----------+-----------+------+
| name      | sum_score | rank |
+-----------+-----------+------+
| 孙悟空    | 250       | 1    |
| 宋宋      | 234       | 2    |
| 娜娜      | 234       | 2    |
| 婷婷      | 228       | 4    |
+-----------+-----------+------+
[Ties broken explicitly
(if two students tie on total score,
break the tie by the Chinese score, then the math score, then the English score;
if all scores are equal, order by name, the earlier name ranking first)]
-- 5. Rank everyone by total score, highest total first
-- Plan 1: pivot rows to columns with window functions
select
name,subject,score as math_score,english_score,chinese_score,sum_score,
row_number()over(order by sum_score desc ,chinese_score desc ,score desc ,english_score desc,name asc) as num
from
(
select
name,subject,score,
lead(score,1,0)over(partition by name order by subject) as english_score,
lead(score,2,0)over(partition by name order by subject) as chinese_score,
sum(score)over(partition by name) as sum_score,
row_number()over(partition by name order by subject) as num
from stu
) as tmp
where num = 1
-- Plan 2: pivot rows to columns with if()
select
name,chinese_score,math_score,english_score,sum_score,
row_number()over(order by sum_score desc ,chinese_score desc ,math_score desc ,english_score desc,name asc) as num
from
(
select
name,
sum(chinese_score) as chinese_score,
sum(math_score) as math_score,
sum(english_score) as english_score,
sum(chinese_score) + sum(math_score) + sum(english_score) as sum_score
from
(
select name,subject,
if(subject = '语文',score,0) as chinese_score,
if(subject = '数学',score,0) as math_score,
if(subject = '英语',score,0) as english_score
from stu
)as t
group by name
) as t1
+-----------+---------+------------+---------------+---------------+-----------+------+
| name      | subject | math_score | english_score | chinese_score | sum_score | num  |
+-----------+---------+------------+---------------+---------------+-----------+------+
| 孙悟空    | 数学    | 95         | 68            | 87            | 250       | 1    |
| 娜娜      | 数学    | 56         | 84            | 94            | 234       | 2    |
| 宋宋      | 数学    | 86         | 84            | 64            | 234       | 3    |
| 婷婷      | 数学    | 85         | 78            | 65            | 228       | 4    |
+-----------+---------+------------+---------------+---------------+-----------+------+
5.1.8.2 min, max, sum, avg, count
SQL
min(x)over()   -- minimum of column x within the window
max(x)over()   -- maximum of column x within the window
sum(x)over()   -- sum of column x within the window
avg(x)over()   -- average of column x within the window
count(x)over() -- number of rows of column x within the window
-- frame boundary keywords:
unbounded preceding
current row
1 following
1 preceding
rows between unbounded preceding and current row -- from the first row of the window to the current row
rows between unbounded preceding and 1 following -- from the first row of the window to the row after the current row
rows between unbounded preceding and 1 preceding -- from the first row of the window to the row before the current row
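The three ROWS frames above can be sketched with a running sum in Python (illustrative only; note an empty frame yields 0 here, whereas SQL would return NULL):

```python
xs = [3, 1, 4, 1, 5]  # one partition's values, already ordered

def frame_sum(values, start_offset, end_offset):
    """ROWS BETWEEN start AND end; start_offset=None means UNBOUNDED PRECEDING."""
    out = []
    for i in range(len(values)):
        lo = 0 if start_offset is None else max(0, i + start_offset)
        hi = min(len(values) - 1, i + end_offset)
        out.append(sum(values[lo:hi + 1]) if lo <= hi else 0)
    return out

print(frame_sum(xs, None, 0))   # unbounded preceding .. current row -> [3, 4, 8, 9, 14]
print(frame_sum(xs, None, 1))   # unbounded preceding .. 1 following -> [4, 8, 9, 14, 14]
print(frame_sum(xs, None, -1))  # unbounded preceding .. 1 preceding -> [0, 3, 4, 8, 9]
```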
5.1.8.3 LEAD(), LAG()
SQL
-- LAG() returns the value a given number of rows before the current row.
LAG(expr, offset, default) OVER (partition_by_clause order_by_clause)
-- LEAD() returns the value a given number of rows after the current row.
LEAD(expr, offset, default) OVER (partition_by_clause order_by_clause)
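LAG and LEAD amount to shifting an ordered column, filling with the default where the shift runs off the partition (a hedged Python sketch, not Doris internals):

```python
def lag(values, offset, default):
    """Value `offset` rows before each row, or `default` off the front."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset, default):
    """Value `offset` rows after each row, or `default` off the end."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

scores = [10, 20, 30]
print(lag(scores, 1, 0))   # [0, 10, 20]
print(lead(scores, 1, 0))  # [20, 30, 0]
```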
5.1.8.4 Window function exercises
1. Whack-a-mole case
SQL
Requirement: find users who hit the mole 4 or more times in a row
-- seq: the ordinal of the attempt
-- m: hit flag, 1 = hit, 0 = miss
uid,seq,m
u01,1,1
u01,2,0
u01,3,1
u01,6,1
u02,5,1
u02,6,0
u02,7,0
u02,1,1
u02,2,1
u03,4,1
u03,5,1
u03,6,0
u02,3,0
u02,4,1
u02,8,1
u01,4,1
u01,5,0
u02,9,1
u03,1,1
u03,2,1
u03,3,1
--create the table
create table hit_mouse
(
user_id varchar(50),
seq int,
m int
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 1;
-- load the data from a local file via stream load
curl \
-u root: \
-H "label:hit_mouse" \
-H "column_separator:," \
-T /root/data/hit_mouse.txt \
http://doitedu01:8040/api/test/hit_mouse/_stream_load
Logic:
SQL
1. First filter out the misses
2. Within each user, sort by seq ascending and assign row numbers
3. Compute seq minus the row number: for consecutive hits the result is a constant
4. Count rows per constant value; groups with a count >= 4 are the answer
-- filter out the misses and assign row numbers
select
user_id,seq,m,
row_number()over(partition by user_id order by seq asc) as num
from hit_mouse
where m != 0 ;
+---------+------+------+------+
| user_id | seq | m | num |
+---------+------+------+------+
| u01 | 1 | 1 | 1 |
| u01 | 3 | 1 | 2 |
| u01 | 4 | 1 | 3 |
| u01 | 6 | 1 | 4 |
| u02 | 1 | 1 | 1 |
| u02 | 2 | 1 | 2 |
| u02 | 4 | 1 | 3 |
| u02 | 5 | 1 | 4 |
| u02 | 8 | 1 | 5 |
| u02 | 9 | 1 | 6 |
| u03 | 1 | 1 | 1 |
| u03 | 2 | 1 | 2 |
| u03 | 3 | 1 | 3 |
| u03 | 4 | 1 | 4 |
| u03 | 5 | 1 | 5 |
+---------+------+------+------+
-- Modify the query: we need seq minus the row number; can we subtract them directly?
select
user_id,seq,m,
seq -row_number()over(partition by user_id order by seq asc) as num
from hit_mouse
where m != 0 ;
+---------+------+------+------+
| user_id | seq | m | num |
+---------+------+------+------+
| u01 | 1 | 1 | 0 |
| u01 | 3 | 1 | 1 |
| u01 | 4 | 1 | 1 |
| u01 | 6 | 1 | 2 |
| u02 | 1 | 1 | 0 |
| u02 | 2 | 1 | 0 |
| u02 | 4 | 1 | 1 |
| u02 | 5 | 1 | 1 |
| u02 | 8 | 1 | 3 |
| u02 | 9 | 1 | 3 |
| u03 | 1 | 1 | 0 |
| u03 | 2 | 1 | 0 |
| u03 | 3 | 1 | 0 |
| u03 | 4 | 1 | 0 |
| u03 | 5 | 1 | 0 |
+---------+------+------+------+
-- count how many times each num repeats (within the same user_id)
select
user_id,
count(1) as cnt
from
(
select
user_id,seq,m,
seq -row_number()over(partition by user_id order by seq asc) as num
from hit_mouse
where m != 0
) as t
group by user_id,num
having cnt>=4
-- final result
+---------+------+
| user_id | cnt |
+---------+------+
| u03 | 5 |
+---------+------+
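As a sanity check, the seq - row_number() trick can be reproduced in plain Python (illustrative only; the hit lists below are hand-copied from the sample data after removing misses):

```python
from collections import defaultdict

hits = {  # user -> seq values where m = 1
    'u01': [1, 3, 4, 6],
    'u02': [1, 2, 4, 5, 8, 9],
    'u03': [1, 2, 3, 4, 5],
}
answer = set()
for uid, seqs in hits.items():
    groups = defaultdict(int)
    for rn, seq in enumerate(sorted(seqs), 1):
        groups[seq - rn] += 1          # equal keys <=> consecutive seq values
    if max(groups.values()) >= 4:
        answer.add(uid)
print(answer)  # {'u03'}
```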
Plan 2: if we only need the user_id, not the exact streak length, we can also do this:
1. After removing the misses,
2. open a window and take the seq 3 rows ahead. If the user's hits are consecutive,
that seq must equal the current seq plus 3. Where a third-following hit exists but the
difference is not 3, the hits cannot be consecutive, and the difference can only be
greater than 3, since the misses in between were filtered out.
3. Finally return the users where the difference equals 3.
select
user_id,seq,m,
(lead(seq,3,-4)over(partition by user_id order by seq asc) - seq) as diff
from hit_mouse
where m != 0
+---------+------+------+------+
| user_id | seq | m | diff |
+---------+------+------+------+
| u01 | 1 | 1 | 5 |
| u01 | 3 | 1 | -7 |
| u01 | 4 | 1 | -8 |
| u01 | 6 | 1 | -10 |
| u02 | 1 | 1 | 4 |
| u02 | 2 | 1 | 6 |
| u02 | 4 | 1 | 5 |
| u02 | 5 | 1 | -9 |
| u02 | 8 | 1 | -12 |
| u02 | 9 | 1 | -13 |
| u03 | 1 | 1 | 3 |
| u03 | 2 | 1 | 3 |
| u03 | 3 | 1 | -7 |
| u03 | 4 | 1 | -8 |
| u03 | 5 | 1 | -9 |
+---------+------+------+------+
-- finally check whether diff = 3 and return the matching user_id (there may be duplicates, so deduplicate)
select
user_id
from
(
select
user_id,seq,m,
(lead(seq,3,-4)over(partition by user_id order by seq asc) - seq) as res
from hit_mouse
where m != 0
) as t
where res = 3
-- group by deduplicates
group by user_id
+---------+
| user_id |
+---------+
| u03 |
+---------+
2. Consecutive-sales case
SQL
Requirement: shops with sales records on 3 or more consecutive days
-- sample data
a,2017-02-05,100
a,2017-02-06,300
a,2017-02-07,800
a,2017-02-08,500
a,2017-02-10,700
b,2017-02-05,200
b,2017-02-06,400
b,2017-02-08,100
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,600
c,2017-02-02,600
c,2017-02-03,600
c,2017-02-10,700
a,2017-03-01,400
a,2017-03-02,300
a,2017-03-03,700
a,2017-03-04,400
-- create the table
create table shop_sale
(
shop_id varchar(50),
dt date,
amount double
)
DUPLICATE KEY(shop_id)
DISTRIBUTED BY HASH(shop_id) BUCKETS 1;
-- load the data from a local file via stream load
curl \
-u root: \
-H "label:shop_sale" \
-H "column_separator:," \
-T /root/data/shop_sale.txt \
http://doitedu01:8040/api/test/shop_sale/_stream_load
Logic:
SQL
This consecutive-sales problem (a "consecutive login" style problem) works like the whack-a-mole case:
1. Group by shop, sort by date, and assign row numbers
2. Subtract the row number (as days) from the date; equal resulting dates mean consecutive days
3. Count rows per resulting date to get the length of each consecutive run
select
shop_id,new_date,
count(1) as cnt
from
(
select shop_id,dt,
date_sub(dt,row_number()over(partition by shop_id order by dt))as new_date
from shop_sale
) as t
group by shop_id,new_date
having cnt >=3;
+---------+------+
| shop_id | cnt |
+---------+------+
| a | 4 |
| a | 4 |
| b | 3 |
| c | 4 |
+---------+------+
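The same islands trick works on dates by subtracting row_number as days (a hedged Python re-check over the sample data; runs of length >= 3 are kept):

```python
from datetime import date, timedelta
from collections import defaultdict

sales = {  # shop -> sale dates from the sample data
    'a': ['2017-02-05', '2017-02-06', '2017-02-07', '2017-02-08', '2017-02-10',
          '2017-03-01', '2017-03-02', '2017-03-03', '2017-03-04'],
    'b': ['2017-02-05', '2017-02-06', '2017-02-08', '2017-02-09', '2017-02-10'],
    'c': ['2017-01-31', '2017-02-01', '2017-02-02', '2017-02-03', '2017-02-10'],
}
runs = []
for shop, dts in sales.items():
    groups = defaultdict(int)
    for rn, d in enumerate(sorted(date.fromisoformat(s) for s in dts), 1):
        groups[d - timedelta(days=rn)] += 1  # constant anchor date <=> consecutive days
    runs += [(shop, n) for n in groups.values() if n >= 3]
print(sorted(runs))  # [('a', 4), ('a', 4), ('b', 3), ('c', 4)]
```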
Plan 2:
For 3 consecutive days, take the date 2 rows ahead and diff it against the current date.
If the difference is exactly 2 days, the three days are consecutive.
select
shop_id
from
(
select shop_id,dt,
datediff(lead(dt,2,'9999-12-31')over(partition by shop_id order by dt),dt)as day_diff_num
from shop_sale
) as t
where day_diff_num = 2
-- deduplicate shops
group by shop_id
+---------+
| shop_id |
+---------+
| c |
| a |
| b |
+---------+
3. Per-group top-N case
Requirement:
Based on the table above, find the top 3 orders by amount for each shop (row_number over)
Logic (each shop's orders sorted by amount descending; keep the first three rows):
a,2017-02-07,800
a,2017-02-10,700
a,2017-02-08,500
a,2017-02-06,300
a,2017-02-05,100
b,2017-02-05,200
b,2017-02-06,400
b,2017-02-08,100
b,2017-02-09,400
b,2017-02-10,600
c,2017-02-10,700
c,2017-02-01,600
c,2017-02-02,600
c,2017-02-03,600
c,2017-01-31,200
d,2017-03-01,400
d,2017-03-02,300
d,2017-03-03,700
d,2017-03-04,400
select
shop_id,dt,amount
from
(
select
shop_id,dt,amount,
row_number()over(partition by shop_id order by amount desc) as num
from shop_sale
) as tmp
where num <=3
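The row_number() <= 3 pattern is just a per-group sort-and-truncate (an illustrative Python sketch over the amounts from the listing above):

```python
from collections import defaultdict

orders = [('a', 800), ('a', 700), ('a', 500), ('a', 300), ('a', 100),
          ('b', 600), ('b', 400), ('b', 400), ('b', 200), ('b', 100),
          ('c', 700), ('c', 600), ('c', 600), ('c', 600), ('c', 200)]
by_shop = defaultdict(list)
for shop, amount in orders:
    by_shop[shop].append(amount)
# Sort each shop's amounts descending and keep the top 3 (mirrors where num <= 3).
top3 = {shop: sorted(v, reverse=True)[:3] for shop, v in by_shop.items()}
print(top3['a'])  # [800, 700, 500]
```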
4. Classic case: split groups at a marker
SQL
Requirement: transform the table below. First group by user;
then, within each user, every occurrence of e* in the name column closes a group.
user_id,name
u1,e1
u1,e1
u1,e*
u1,e2
u1,e3
u1,e*
u2,e1
u2,e2
u2,e*
u2,e1
u2,e3
u2,e*
u2,e*
For the behavior log above, every e* closes a group, producing:
u1, [e1,e1,e*]
u1, [e2,e3,e*]
u2, [e1,e2,e*]
u2, [e1,e3,e*]
u2, [e*]
--create the table
drop table if exists window_test;
create table window_test
(
user_id varchar(10),
name string
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 1;
-- insert one row at a time to preserve the original order
insert into window_test values ('u1','e1');
insert into window_test values ('u1','e1');
insert into window_test values ('u1','e*');
insert into window_test values ('u1','e2');
insert into window_test values ('u1','e3');
insert into window_test values ('u1','e*');
insert into window_test values ('u2','e1');
insert into window_test values ('u2','e2');
insert into window_test values ('u2','e*');
insert into window_test values ('u2','e1');
insert into window_test values ('u2','e3');
insert into window_test values ('u2','e*');
insert into window_test values ('u2','e*');
Logic:
1. We only care whether name is e*, so convert it to a 1/0 flag
2. Assign row numbers per user (note: the original insertion order must be kept)
3. Open a window and sum the flag from the first row to the current row
4. Subtract the current flag from that running sum to get a new group label
5. Group by user and the new label, then collect
-- add the flag
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test
+---------+------+------+
| user_id | name | flag |
+---------+------+------+
| u1 | e1 | 0 |
| u1 | e1 | 0 |
| u1 | e* | 1 |
| u1 | e2 | 0 |
| u1 | e3 | 0 |
| u1 | e* | 1 |
| u2 | e1 | 0 |
| u2 | e2 | 0 |
| u2 | e* | 1 |
| u2 | e1 | 0 |
| u2 | e3 | 0 |
| u2 | e* | 1 |
| u2 | e* | 1 |
+---------+------+------+
select
user_id,
name,
flag,
sum(flag)over(partition by user_id order by user_id rows between unbounded preceding and current row) as sum_flag
from
(
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test
) as t;
+---------+------+------+----------+
| user_id | name | flag | sum_flag |
+---------+------+------+----------+
| u1 | e1 | 0 | 0 |
| u1 | e1 | 0 | 0 |
| u1 | e* | 1 | 1 |
| u1 | e2 | 0 | 1 |
| u1 | e3 | 0 | 1 |
| u1 | e* | 1 | 2 |
| u2 | e1 | 0 | 0 |
| u2 | e2 | 0 | 0 |
| u2 | e* | 1 | 1 |
| u2 | e1 | 0 | 1 |
| u2 | e3 | 0 | 1 |
| u2 | e* | 1 | 2 |
| u2 | e* | 1 | 3 |
+---------+------+------+----------+
Looking at the result: to group by user_id plus the "split at e*" logic, we need a new label that is equal within each group.
sum_flag - flag gives exactly that.
(Plan 2: taking the previous row's sum_flag with a default of 0 would also work.)
select
user_id,
name,
flag,
sum(flag)over(partition by user_id order by user_id rows between unbounded preceding and current row) -flag as diff_flag
from
(
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test
) as t;
+---------+------+------+-----------+
| user_id | name | flag | diff_flag |
+---------+------+------+-----------+
| u1 | e1 | 0 | 0 |
| u1 | e1 | 0 | 0 |
| u1 | e* | 1 | 0 |
| u1 | e2 | 0 | 1 |
| u1 | e3 | 0 | 1 |
| u1 | e* | 1 | 1 |
| u2 | e1 | 0 | 0 |
| u2 | e2 | 0 | 0 |
| u2 | e* | 1 | 0 |
| u2 | e1 | 0 | 1 |
| u2 | e3 | 0 | 1 |
| u2 | e* | 1 | 1 |
| u2 | e* | 1 | 2 |
+---------+------+------+-----------+
Finally, group by and collect. Note that Doris has no collect_set or collect_list; use
group_concat() instead.
select
user_id,
group_concat(name,',') as res
from
(
select
user_id,
name,
flag,
sum(flag)over(partition by user_id order by user_id rows between unbounded preceding and current row) -flag as diff_flag
from
(
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test
) as t
) as t1
group by user_id,diff_flag
+---------+----------+
| user_id | res |
+---------+----------+
| u1 | e1,e1,e* |
| u2 | e* |
| u1 | e2,e3,e* |
| u2 | e1,e2,e* |
| u2 | e1,e3,e* |
+---------+----------+
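The running-sum-minus-flag labeling can be replayed in Python (an illustrative sketch; only u1's rows are shown, in insertion order):

```python
from itertools import groupby

events = [('u1', 'e1'), ('u1', 'e1'), ('u1', 'e*'),
          ('u1', 'e2'), ('u1', 'e3'), ('u1', 'e*')]
labeled, running = [], {}
for uid, name in events:
    flag = 1 if name == 'e*' else 0
    running[uid] = running.get(uid, 0) + flag      # sum(flag) from first row to current
    labeled.append((uid, running[uid] - flag, name))  # subtract flag: label is equal within a group
groups = [[n for _, _, n in grp]
          for _, grp in groupby(labeled, key=lambda r: (r[0], r[1]))]
print(groups)  # [['e1', 'e1', 'e*'], ['e2', 'e3', 'e*']]
```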
5.2 Comprehensive case: funnel conversion analysis
Business goal, path, path steps, users per step, and the relative and absolute conversion rates between steps.
Every business has its core task and flow, and users may drop off at every step of that flow.
Stacking each step and its metric (such as UV) produces a shape that is wide at the top and narrow at the bottom: the funnel model.
Funnel examples:
Different business scenarios have different paths: the steps are ordered, and an event may occur multiple times.
Registration funnel: launch APP --> registration page --> registration result --> submit order --> payment success
Search-purchase funnel: search goods --> click goods --> add to cart --> submit order --> payment success
Flash-sale funnel: click flash-sale activity --> join activity --> take part in the sale --> win the sale --> payment success
Purchase conversion funnel for e-commerce:
Steps:
Funnel name: purchase conversion funnel
Start event: viewing the product detail page
Target event: payment
Event chain: detail page -> cart -> order page -> payment
[Decide whether there is a time-gap requirement between events, and whether other events may occur between two adjacent events in the chain]
SQL
Requirement: compute conversion rates for the purchase funnel (no time-gap requirement between events; other events may occur between two adjacent events)
1. UV of each step
2. Relative conversion rate (UV of a step / UV of the previous step) and absolute conversion rate (UV of a step / UV of the first step)
Events of interest: e1, e2, e4, e5 ==> the order must not change
-- sample data
user_id,event_id,event_action,event_time
u001,e1,view_detail_page,2022-11-01 01:10:21
u001,e2,add_bag_page,2022-11-01 01:11:13
u001,e3,collect_goods_page,2022-11-01 02:07:11
u002,e3,collect_goods_page,2022-11-01 01:10:21
u002,e4,order_detail_page,2022-11-01 01:11:13
u002,e5,pay_detail_page,2022-11-01 02:07:11
u002,e6,click_adver_page,2022-11-01 13:07:23
u002,e7,home_page,2022-11-01 08:18:12
u002,e8,list_detail_page,2022-11-01 23:34:29
u002,e1,view_detail_page,2022-11-01 11:25:32
u002,e2,add_bag_page,2022-11-01 12:41:21
u002,e3,collect_goods_page,2022-11-01 16:21:15
u002,e4,order_detail_page,2022-11-01 21:41:12
u003,e5,pay_detail_page,2022-11-01 01:10:21
u003,e6,click_adver_page,2022-11-01 01:11:13
u003,e7,home_page,2022-11-01 02:07:11
u001,e4,order_detail_page,2022-11-01 13:07:23
u001,e5,pay_detail_page,2022-11-01 08:18:12
u001,e6,click_adver_page,2022-11-01 23:34:29
u001,e7,home_page,2022-11-01 11:25:32
u001,e8,list_detail_page,2022-11-01 12:41:21
u001,e1,view_detail_page,2022-11-01 16:21:15
u001,e2,add_bag_page,2022-11-01 21:41:12
u003,e8,list_detail_page,2022-11-01 13:07:23
u003,e1,view_detail_page,2022-11-01 08:18:12
u003,e2,add_bag_page,2022-11-01 23:34:29
u003,e3,collect_goods_page,2022-11-01 11:25:32
u003,e4,order_detail_page,2022-11-01 12:41:21
u003,e5,pay_detail_page,2022-11-01 16:21:15
u003,e6,click_adver_page,2022-11-01 21:41:12
u004,e7,home_page,2022-11-01 01:10:21
u004,e8,list_detail_page,2022-11-01 01:11:13
u004,e1,view_detail_page,2022-11-01 02:07:11
u004,e2,add_bag_page,2022-11-01 13:07:23
u004,e3,collect_goods_page,2022-11-01 08:18:12
u004,e4,order_detail_page,2022-11-01 23:34:29
u004,e5,pay_detail_page,2022-11-01 11:25:32
u004,e6,click_adver_page,2022-11-01 12:41:21
u004,e7,home_page,2022-11-01 16:21:15
u004,e8,list_detail_page,2022-11-01 21:41:12
u005,e1,view_detail_page,2022-11-01 01:10:21
u005,e2,add_bag_page,2022-11-01 01:11:13
u005,e3,collect_goods_page,2022-11-01 02:07:11
u005,e4,order_detail_page,2022-11-01 13:07:23
u005,e5,pay_detail_page,2022-11-01 08:18:12
u005,e6,click_adver_page,2022-11-01 23:34:29
u005,e7,home_page,2022-11-01 11:25:32
u005,e8,list_detail_page,2022-11-01 12:41:21
u005,e1,view_detail_page,2022-11-01 16:21:15
u005,e2,add_bag_page,2022-11-01 21:41:12
u005,e3,collect_goods_page,2022-11-01 01:10:21
u006,e4,order_detail_page,2022-11-01 01:11:13
u006,e5,pay_detail_page,2022-11-01 02:07:11
u006,e6,click_adver_page,2022-11-01 13:07:23
u006,e7,home_page,2022-11-01 08:18:12
u006,e8,list_detail_page,2022-11-01 23:34:29
u006,e1,view_detail_page,2022-11-01 11:25:32
u006,e2,add_bag_page,2022-11-01 12:41:21
u006,e3,collect_goods_page,2022-11-01 16:21:15
u006,e4,order_detail_page,2022-11-01 21:41:12
u006,e5,pay_detail_page,2022-11-01 23:10:21
u006,e6,click_adver_page,2022-11-01 01:11:13
u007,e7,home_page,2022-11-01 02:07:11
u007,e8,list_detail_page,2022-11-01 13:07:23
u007,e1,view_detail_page,2022-11-01 08:18:12
u007,e2,add_bag_page,2022-11-01 23:34:29
u007,e3,collect_goods_page,2022-11-01 11:25:32
u007,e4,order_detail_page,2022-11-01 12:41:21
u007,e5,pay_detail_page,2022-11-01 16:21:15
u007,e6,click_adver_page,2022-11-01 21:41:12
u007,e7,home_page,2022-11-01 01:10:21
u008,e8,list_detail_page,2022-11-01 01:11:13
u008,e1,view_detail_page,2022-11-01 02:07:11
u008,e2,add_bag_page,2022-11-01 13:07:23
u008,e3,collect_goods_page,2022-11-01 08:18:12
u008,e4,order_detail_page,2022-11-01 23:34:29
u008,e5,pay_detail_page,2022-11-01 11:25:32
u008,e6,click_adver_page,2022-11-01 12:41:21
u008,e7,home_page,2022-11-01 16:21:15
u008,e8,list_detail_page,2022-11-01 21:41:12
u008,e1,view_detail_page,2022-11-01 01:10:21
u009,e2,add_bag_page,2022-11-01 01:11:13
u009,e3,collect_goods_page,2022-11-01 02:07:11
u009,e4,order_detail_page,2022-11-01 13:07:23
u009,e5,pay_detail_page,2022-11-01 08:18:12
u009,e6,click_adver_page,2022-11-01 23:34:29
u009,e7,home_page,2022-11-01 11:25:32
u009,e8,list_detail_page,2022-11-01 12:41:21
u009,e1,view_detail_page,2022-11-01 16:21:15
u009,e2,add_bag_page,2022-11-01 21:41:12
u009,e3,collect_goods_page,2022-11-01 01:10:21
u010,e4,order_detail_page,2022-11-01 01:11:13
u010,e5,pay_detail_page,2022-11-01 02:07:11
u010,e6,click_adver_page,2022-11-01 13:07:23
u010,e7,home_page,2022-11-01 08:18:12
u010,e8,list_detail_page,2022-11-01 23:34:29
u010,e5,pay_detail_page,2022-11-01 11:25:32
u010,e6,click_adver_page,2022-11-01 12:41:21
u010,e7,home_page,2022-11-01 16:21:15
u010,e8,list_detail_page,2022-11-01 21:41:12
SQL
-- create the table
drop table if exists event_info_log;
create table event_info_log
(
user_id varchar(20),
event_id varchar(20),
event_action varchar(20),
event_time datetime
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 1;
-- 通过本地文件的方式导入数据
curl \
-u root: \
-H "label:event_info_log" \
-H "column_separator:," \
-T /root/data/event_log.txt \
http://linux01:8040/api/test/event_info_log/_stream_load
Logic:
-- 1. Filter each user's event sequence down to the events the funnel model cares about
-- 2. Collect each user's matching event IDs, ordered by time, into one concatenated string
-- 3. Match the concatenated string against the regular expression derived from the funnel model
1. Filter by time and build each user's event sequence
select
user_id,
max(event_ll) as event_seq
from
(
select
user_id,
group_concat(event_id)over(partition by user_id order by event_time) as event_ll
from
(
select
user_id,event_id,event_time
from event_info_log
where event_id in ('e1','e2','e4','e5')
and to_date(event_time) = '2022-11-01'
order by user_id,event_time
) as temp
) as temp2
group by user_id;
+---------+------------------------+
| user_id | event_seq              |
+---------+------------------------+
| u006 | e4, e5, e1, e2, e4, e5 |
| u007 | e1, e4, e5, e2 |
| u005 | e1, e2, e5, e4, e1, e2 |
| u004 | e1, e5, e2, e4 |
| u010 | e4, e5, e5 |
| u001 | e1, e2, e5, e4, e1, e2 |
| u003 | e5, e1, e4, e5, e2 |
| u002 | e4, e5, e1, e2, e4 |
| u008 | e1, e1, e5, e2, e4 |
| u009 | e2, e5, e4, e1, e2 |
+---------+------------------------+
2. Apply the matching model
select
user_id,
'购买转化漏斗' as funnel_name ,
case
-- regex: e1 occurred, then e2, then e4, then e5
when event_seq rlike('e1.*e2.*e4.*e5') then 4
-- regex: e1 occurred, then e2, then e4
when event_seq rlike('e1.*e2.*e4') then 3
-- regex: e1 occurred, then e2
when event_seq rlike('e1.*e2') then 2
-- regex: only e1 occurred
when event_seq rlike('e1') then 1
else 0 end step
from
(
select
user_id,
max(event_ll) as event_seq
from
(
select
user_id,
group_concat(event_id)over(partition by user_id order by event_time) as event_ll
from
(
select
user_id,event_id,event_time
from event_info_log
where event_id in ('e1','e2','e4','e5')
and to_date(event_time) = '2022-11-01'
order by user_id,event_time
) as temp
) as temp2
group by user_id
) as tmp3;
+---------+--------------------+------+
| user_id | funnel_name | step |
+---------+--------------------+------+
| u006 | 购买转化漏斗 | 4 |
| u007 | 购买转化漏斗 | 2 |
| u005 | 购买转化漏斗 | 3 |
| u004 | 购买转化漏斗 | 3 |
| u010 | 购买转化漏斗 | 0 |
| u001 | 购买转化漏斗 | 3 |
| u003 | 购买转化漏斗 | 2 |
| u002 | 购买转化漏斗 | 3 |
| u008 | 购买转化漏斗 | 3 |
| u009 | 购买转化漏斗 | 2 |
+---------+--------------------+------+
-- finally compute the conversion rates
select
funnel_name,
sum(if(step >= 1 ,1,0)) as step1,
sum(if(step >= 2 ,1,0)) as step2,
sum(if(step >= 3 ,1,0)) as step3,
sum(if(step >= 4 ,1,0)) as step4,
round(sum(if(step >= 2 ,1,0))/sum(if(step >= 1 ,1,0)),2) as `step1->step2_radio`,
round(sum(if(step >= 3 ,1,0))/sum(if(step >= 2 ,1,0)),2) as `step2->step3_radio`,
round(sum(if(step >= 4 ,1,0))/sum(if(step >= 3 ,1,0)),2) as `step3->step4_radio`
from
(
select
'购买转化漏斗' as funnel_name ,
case
-- regex: e1 occurred, then e2, then e4, then e5
when event_seq regexp('e1.*e2.*e4.*e5') then 4
-- regex: e1 occurred, then e2, then e4
when event_seq regexp('e1.*e2.*e4') then 3
-- regex: e1 occurred, then e2
when event_seq regexp('e1.*e2') then 2
-- regex: only e1 occurred
when event_seq regexp('e1') then 1
else 0 end step
from
(
select
user_id,
max(event_seq) as event_seq
from
-- Doris 1.1 does not support arrays yet, so we cannot collect-and-sort while concatenating
(
select
user_id,
-- sort via the window function, so events are concatenated in ascending time order
group_concat(concat(event_time,'_',event_id),'|')over(partition by user_id order by event_time) as event_seq
from event_info_log
where to_date(event_time) = '2022-11-01'
and event_id in('e1','e4','e5','e2')
) as tmp
group by user_id
) as t1
) as t2
group by funnel_name;
+--------------------+-------+-------+-------+-------+--------------------+--------------------+--------------------+
| funnel_name | step1 | step2 | step3 | step4 | step1->step2_radio | step2->step3_radio | step3->step4_radio |
+--------------------+-------+-------+-------+-------+--------------------+--------------------+--------------------+
| 购买转化漏斗 | 9 | 9 | 6 | 1 | 1 | 0.67 | 0.17 |
+--------------------+-------+-------+-------+-------+--------------------+--------------------+--------------------+

5.3 The funnel analysis function window_funnel
Doris also packages this logic as a built-in function; its elements are the time window, the timestamp column used for ordering, and the funnel's event chain.
Syntax:
window_funnel(window, mode, timestamp_column, event1, event2, ... , eventN)
The funnel function searches a sliding time window for the longest matched prefix of the event chain and returns its length.
-- window: size of the sliding window, in seconds.
-- mode: reserved, currently only 'default' is supported -- no time-gap requirement between adjacent events, and other events may occur between them
-- timestamp_column: the time column (DATETIME) the sliding window moves along.
-- eventN: a boolean expression defining the Nth event.
select
user_id,
window_funnel(3600*24, 'default', event_time, event_id='e1', event_id='e2' , event_id='e4', event_id='e5') as step
from event_info_log
group by user_id
+---------+------+
| user_id | step |
+---------+------+
| u006 | 4 |
| u007 | 2 |
| u005 | 3 |
| u004 | 3 |
| u010 | 0 |
| u001 | 3 |
| u003 | 2 |
| u002 | 3 |
| u008 | 3 |
| u009 | 2 |
+---------+------+
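The semantics can be sketched in Python (a simplified, hedged model that may differ from Doris's exact implementation: for each occurrence of the first event, greedily match the remaining chain in time order within the window, and return the longest prefix matched):

```python
from datetime import datetime, timedelta

def window_funnel(window_s, events, chain):
    """events: [(ts, event_id)] sorted by ts; chain: ordered list of event ids."""
    best = 0
    for i, (t0, e) in enumerate(events):
        if e != chain[0]:
            continue                      # windows start at occurrences of the first event
        step, t_limit, j = 1, t0 + timedelta(seconds=window_s), i
        for target in chain[1:]:
            j = next((k for k in range(j + 1, len(events))
                      if events[k][1] == target and events[k][0] <= t_limit), None)
            if j is None:
                break                     # chain broken inside this window
            step += 1
        best = max(best, step)
    return best

ts = lambda s: datetime.fromisoformat('2022-11-01 ' + s)
u007 = [(ts('08:18:12'), 'e1'), (ts('12:41:21'), 'e4'),
        (ts('16:21:15'), 'e5'), (ts('23:34:29'), 'e2')]
print(window_funnel(24 * 3600, u007, ['e1', 'e2', 'e4', 'e5']))  # 2
```

u007 reaches step 2 (e1 then e2, but no e4 after the e2), matching the result table above.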
-- compute the conversion rate of each level
select
'购买转化漏斗' as funnel_name,
sum(if(step >= 1 ,1,0)) as step1,
sum(if(step >= 2 ,1,0)) as step2,
sum(if(step >= 3 ,1,0)) as step3,
sum(if(step >= 4 ,1,0)) as step4,
round(sum(if(step >= 2 ,1,0))/sum(if(step >= 1 ,1,0)),2) as `step1->step2_radio`,
round(sum(if(step >= 3 ,1,0))/sum(if(step >= 2 ,1,0)),2) as `step2->step3_radio`,
round(sum(if(step >= 4 ,1,0))/sum(if(step >= 3 ,1,0)),2) as `step3->step4_radio`
from
(
select
user_id,
window_funnel(3600*24, 'default', event_time, event_id='e1', event_id='e2' , event_id='e4', event_id='e5') as step
from event_info_log
where to_date(event_time) = '2022-11-01'
and event_id in('e1','e4','e5','e2')
group by user_id
) as t1
-- result
+--------------------+-------+-------+-------+-------+--------------------+--------------------+--------------------+
| funnel_name | step1 | step2 | step3 | step4 | step1->step2_radio | step2->step3_radio | step3->step4_radio |
+--------------------+-------+-------+-------+-------+--------------------+--------------------+--------------------+
| 购买转化漏斗 | 9 | 9 | 6 | 1 | 1 | 0.67 | 0.17 |
+--------------------+-------+-------+-------+-------+--------------------+--------------------+--------------------+