1. Built-in Functions
I. Relational Operators
1. Equality comparison: =
2. NULL-safe equality comparison: <=>
3. Inequality comparison: <> and !=
4. Less-than comparison: <
5. Less-than-or-equal comparison: <=
6. Greater-than comparison: >
7. Greater-than-or-equal comparison: >=
8. Range comparison: between
9. NULL test: IS NULL
10. NOT NULL test: IS NOT NULL
11. LIKE comparison: LIKE
12. Java-regex LIKE operation: RLIKE
13. REGEXP operation: REGEXP
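A quick sketch of a few of these operators (the literal values are illustrative); note that = returns NULL when either side is NULL, while the NULL-safe <=> always returns true or false:
select 1 = null, 1 <=> null, null <=> null;   -- NULL, false, true
select 'footbar' rlike '^foot.*$';            -- true: RLIKE/REGEXP use Java regular expressions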
II. Arithmetic Operators
1. Addition: +
2. Subtraction: -
3. Multiplication: *
4. Division: /
5. Modulo: %
6. Bitwise AND: &
7. Bitwise OR: |
8. Bitwise XOR: ^
9. Bitwise NOT: ~
III. Logical Operators
1. Logical AND: AND, &&
2. Logical OR: OR, ||
3. Logical NOT: NOT, !
IV. Complex Type Constructors
1. map
2. struct
3. named_struct
4. array
5. create_union
V. Complex Type Operators
1. Accessing array elements
2. Accessing map elements
3. Accessing struct fields
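A sketch of the constructors and accessors above in one statement (all literals are illustrative):
select array(1,2,3)[0],              -- array element by zero-based index: 1
       map('a',1,'b',2)['b'],        -- map element by key: 2
       named_struct('x',1,'y',2).y;  -- struct field by name: 2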
VI. Numeric Functions
1. Rounding function: round
2. Rounding to a given precision: round
3. Floor function: floor
4. Ceiling function: ceil
5. Ceiling function: ceiling
6. Random number function: rand
7. Natural exponential function: exp
8. Base-10 logarithm function: log10
9. Base-2 logarithm function: log2
10. Logarithm function: log
11. Power function: pow
12. Power function: power
13. Square root function: sqrt
14. Binary function: bin
15. Hexadecimal function: hex
16. Hex-decoding function: unhex
17. Base conversion function: conv
18. Absolute value function: abs
19. Positive modulo function: pmod
20. Sine function: sin
21. Arcsine function: asin
22. Cosine function: cos
23. Arccosine function: acos
24. positive function: positive
25. negative function: negative
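A sketch exercising a few of the numeric functions above:
select round(3.14159),       -- 3
       round(3.14159, 2),    -- 3.14
       floor(2.7),           -- 2
       ceil(2.1),            -- 3
       pmod(-7, 3),          -- 2 (result is always non-negative, unlike %)
       conv('16', 10, 2);    -- '10000' (base 10 to base 2)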
VII. Collection Functions
1. Size of a map: size
2. Size of an array: size
3. Test whether an array contains an element: array_contains
4. Get all values of a map: map_values
5. Get all keys of a map: map_keys
6. Sort an array: sort_array
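A sketch of the collection functions (map_keys and map_values are the standard names behind items 4 and 5):
select size(map('a',1,'b',2)),           -- 2
       array_contains(array(1,2,3), 2),  -- true
       map_keys(map('a',1,'b',2)),       -- ["a","b"] (order not guaranteed)
       map_values(map('a',1,'b',2)),     -- [1,2] (order not guaranteed)
       sort_array(array(3,1,2));         -- [1,2,3]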
VIII. Type Conversion Functions
1. Binary conversion: binary
2. Casting between primitive types: cast
Syntax: cast(expr as <type>)
For example: cast(1 as bigint)
select cast(money as bigint);
0: jdbc:hive2://192.168.67.110:10000> select cast(1 as bigint);
+------+
| _c0 |
+------+
| 1 |
+------+
1 row selected (0.305 seconds)
0: jdbc:hive2://192.168.67.110:10000> select cast('hello' as bigint);
+-------+
| _c0 |
+-------+
| NULL |
+-------+
1 row selected (0.714 seconds)
IX. Date Functions
1. UNIX timestamp to date: from_unixtime
2. Get the current UNIX timestamp: unix_timestamp
3. Date to UNIX timestamp: unix_timestamp
4. Date in a given format to UNIX timestamp: unix_timestamp
5. Datetime to date: to_date
6. Date to year: year
7. Date to month: month
8. Date to day: day
9. Date to hour: hour
10. Date to minute: minute
11. Date to second: second
12. Date to week of year: weekofyear
13. Date difference: datediff
14. Add days to a date: date_add
15. Subtract days from a date: date_sub
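A sketch of the timestamp conversions listed above; unix_timestamp and from_unixtime depend on the server time zone, so the epoch value here is illustrative:
select unix_timestamp('2020-03-05 10:00:00'),             -- seconds since epoch (time-zone dependent)
       from_unixtime(1583373600, 'yyyy-MM-dd HH:mm:ss'),  -- formats an epoch value back to a string
       to_date('2020-03-05 10:00:00'),                    -- 2020-03-05
       weekofyear('2020-03-05'),                          -- week number within the year
       datediff('2020-03-10', '2020-03-05');              -- 5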
1. The date_format function (reformat a date according to a pattern)
The pattern uses Java date-format symbols, e.g. yyyy-MM-dd HH:mm:ss
0: jdbc:hive2://192.168.67.110:10000> select date_format('2020-03-05','yyyy-MM');
+----------+
| _c0 |
+----------+
| 2020-03 |
+----------+
1 row selected (0.076 seconds)
2. The date_add and date_sub functions (add or subtract days)
0: jdbc:hive2://192.168.67.110:10000> select date_add('2020-03-05',-1);
+-------------+
| _c0 |
+-------------+
| 2020-03-04 |
+-------------+
1 row selected (0.085 seconds)
0: jdbc:hive2://192.168.67.110:10000> select date_add('2020-03-05',1);
+-------------+
| _c0 |
+-------------+
| 2020-03-06 |
+-------------+
1 row selected (0.075 seconds)
0: jdbc:hive2://192.168.67.110:10000> select date_sub('2020-03-05',1);
+-------------+
| _c0 |
+-------------+
| 2020-03-04 |
+-------------+
1 row selected (0.083 seconds)
3. The next_day function
(1) Get the first Monday after a given date
0: jdbc:hive2://192.168.67.110:10000> select next_day('2020-03-05','MO');
+-------------+
| _c0 |
+-------------+
| 2020-03-09 |
+-------------+
1 row selected (0.078 seconds)
Note: the second argument is an English day-of-week name or abbreviation (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday); 'MO' abbreviates Monday.
(2) Get the Monday of the current week
0: jdbc:hive2://192.168.67.110:10000> select date_add(next_day('2020-03-05','MO'),-7);
+-------------+
| _c0 |
+-------------+
| 2020-03-02 |
+-------------+
1 row selected (0.085 seconds)
4. The last_day function (get the last day of the month)
0: jdbc:hive2://192.168.67.110:10000> select last_day('2020-03-05');
+-------------+
| _c0 |
+-------------+
| 2020-03-31 |
+-------------+
1 row selected (0.088 seconds)
X. Conditional Functions
1. If function: if
2. Non-null coalescing function: COALESCE
3. Conditional expression: CASE
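A sketch of all three conditional forms (CASE is demonstrated at length later with the student_level table):
select if(1 < 2, 'yes', 'no'),                            -- 'yes'
       coalesce(null, null, 'fallback'),                  -- 'fallback': first non-NULL argument
       case when 2 > 1 then 'bigger' else 'smaller' end;  -- 'bigger'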
XI. String Functions
1. ASCII code function: ascii
2. Base64 encoding function: base64
3. String concatenation function: concat
4. Delimited string concatenation function: concat_ws
5. Array-to-string function: concat_ws
6. Decimal formatting function: format_number
7. Substring function (string and start position): substr, substring
8. Substring function (string, start position, and length): substr, substring
9. String find function: instr
10. String length function: length
11. String locate function: locate
12. String formatting function: printf
13. String-to-map function: str_to_map
14. Base64 decoding function: unbase64(string str)
15. Uppercase functions: upper, ucase
16. Lowercase functions: lower, lcase
17. Trim function: trim
18. Left-trim function: ltrim
19. Right-trim function: rtrim
20. Regular-expression replace function: regexp_replace
21. Regular-expression extract function: regexp_extract
22. URL parsing function: parse_url
23. JSON parsing function: get_json_object
24. Space string function: space
25. Repeat string function: repeat
26. Left-pad function: lpad
27. Right-pad function: rpad
28. String split function: split
29. Set lookup function: find_in_set
30. Tokenizing function: sentences
31. Top-K most frequent n-grams after tokenizing: ngrams
32. Top-K n-grams that co-occur with a given word after tokenizing: context_ngrams
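Several of these (split, regexp_extract, trim, concat, substr, explode) are demonstrated next; here is a sketch of a few that are not:
select concat_ws('-', array('a','b','c')),  -- 'a-b-c': joins an array with a delimiter
       format_number(1234567.891, 2),       -- '1,234,567.89'
       instr('foobar', 'bar'),              -- 4 (1-based position)
       lpad('7', 3, '0'),                   -- '007'
       find_in_set('b', 'a,b,c');           -- 2 (position within a comma-separated set)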
Splitting strings.
split(string str, string pat)
select split('nihao|hello|nice','\\|')
0: jdbc:hive2://192.168.67.110:10000> select split('nihao|hello|nice','\\|');
+---------------------------+
| _c0 |
+---------------------------+
| ["nihao","hello","nice"] |
+---------------------------+
1 row selected (0.123 seconds)
Extracting substrings with a regular expression.
regexp_extract(string subject, string pattern, int index)
select regexp_extract('hello<B>nice</B>haha','<B>(.*)</B>',1)
select regexp_extract(column_name, pattern, group_index)
With group index 1 and with group index 0:
0: jdbc:hive2://192.168.67.110:10000> select regexp_extract('hello<B>nice</B>haha','<B>(.*)</B>',1);
+-------+
| _c0 |
+-------+
| nice |
+-------+
1 row selected (0.123 seconds)
0: jdbc:hive2://192.168.67.110:10000> select regexp_extract('hello<B>nice</B>haha','<B>(.*)</B>',0);
+--------------+
| _c0 |
+--------------+
| <B>nice</B> |
+--------------+
1 row selected (0.1 seconds)
Trimming whitespace.
trim(string A) removes leading and trailing spaces.
0: jdbc:hive2://192.168.67.110:10000> select trim(' hello ');
+--------+
| _c0 |
+--------+
| hello |
+--------+
1 row selected (0.103 seconds)
ltrim(string A) removes leading spaces.
0: jdbc:hive2://192.168.67.110:10000> select ltrim(' hello ');
+-----------+
| _c0 |
+-----------+
| hello |
+-----------+
1 row selected (0.096 seconds)
rtrim(string A) removes trailing spaces.
0: jdbc:hive2://192.168.67.110:10000> select rtrim(' hello ');
+------------+
| _c0 |
+------------+
| hello |
+------------+
1 row selected (0.094 seconds)
Concatenating strings.
concat(string A, string B...)
0: jdbc:hive2://192.168.67.110:10000> select concat('hao','hello');
+-----------+
| _c0 |
+-----------+
| haohello |
+-----------+
1 row selected (0.104 seconds)
Taking a substring.
select substr('abcde',3,2)
0: jdbc:hive2://192.168.67.110:10000> select substr('abcde',3,2)
. . . . . . . . . . . . . . . . . . > ;
+------+
| _c0 |
+------+
| cd |
+------+
1 row selected (0.097 seconds)
The explode function.
select explode(split("nice|good|well","\\|"));
0: jdbc:hive2://192.168.67.110:10000> select explode(split("nice|good|well","\\|"));
+-------+
| col |
+-------+
| nice |
| good |
| well |
+-------+
3 rows selected (0.103 seconds)
XII. Miscellaneous Functions
1. Call a Java method: java_method
2. Call a Java method: reflect
3. Hash of a string: hash
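A sketch of the reflection-based functions; reflect (and its alias java_method) calls a static Java method:
select reflect('java.lang.Math', 'max', 3, 5),  -- 5
       hash('hello');                           -- an int hash of the argument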
XIII. XPath XML Parsing Functions
1. xpath
2. xpath_string
3. xpath_boolean
4. xpath_short, xpath_int, xpath_long
5. xpath_float, xpath_double, xpath_number
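A sketch of the xpath family; xpath returns an array of matches, while the typed variants return a single value:
select xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()'),  -- ["b1","b2"]
       xpath_string('<a><b>bb</b></a>', 'a/b'),           -- 'bb'
       xpath_int('<a><b>3</b></a>', 'a/b');               -- 3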
XIV. Aggregate Functions (UDAF)
1. Count function: count
2. Sum function: sum
3. Average function: avg
4. Minimum function: min
5. Maximum function: max
6. Population variance function: var_pop
7. Sample variance function: var_samp
8. Population standard deviation function: stddev_pop
9. Sample standard deviation function: stddev_samp
10. Percentile function: percentile
11. Percentile function (array of percentiles): percentile
12. Approximate percentile function: percentile_approx
13. Approximate percentile function (array of percentiles): percentile_approx
14. Histogram function: histogram_numeric
15. Collect distinct values into a set: collect_set
16. Collect values (keeping duplicates) into a list: collect_list
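A sketch of the common aggregates, run here against the cookie table created in section 3.1 below (any table with a numeric column works):
select count(*), sum(pv), avg(pv), min(pv), max(pv),
       percentile(pv, 0.5),   -- median of pv
       collect_set(cookieid)  -- distinct cookieids as an array
from cookie;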
XV. Table-Generating Functions (UDTF)
1. Explode an array into multiple rows: explode(array)
2. Explode a map into multiple rows: explode(map)
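explode on an array is demonstrated below; a sketch of the map form, which yields two columns per entry:
select explode(map('a', 1, 'b', 2)) as (k, v);  -- rows: (a,1), (b,2)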
case when
Typically used for scoring and grading.
Prepare the data. (In the query output below, the interleaved empty rows with NULL scores most likely come from blank lines or stray carriage returns in the source file.)
tom,95
hua,90
hong,100
lele,85
kaka,70
kebi,60
ming,55
kang,78
lolo,93
0: jdbc:hive2://192.168.67.110:10000> create table student_level(name string,score int)
. . . . . . . . . . . . . . . . . . > row format delimited fields terminated by ",";
No rows affected (0.414 seconds)
0: jdbc:hive2://192.168.67.110:10000> load data local inpath '/opt/testData/hive/score.txt' into table student_level;
No rows affected (0.411 seconds)
0: jdbc:hive2://192.168.67.110:10000> select * from student_level;
+---------------------+----------------------+
| student_level.name | student_level.score |
+---------------------+----------------------+
| tom | 95 |
| | NULL |
| hua | 90 |
| | NULL |
| hong | 100 |
| | NULL |
| lele | 85 |
| | NULL |
| kaka | 70 |
| | NULL |
| kebi | 60 |
| | NULL |
| ming | 55 |
| | NULL |
| kang | 78 |
| | NULL |
| lolo | 93 |
+---------------------+----------------------+
17 rows selected (0.944 seconds)
0: jdbc:hive2://192.168.67.110:10000> select name,score,
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > case when score >= 90 then 'very good'
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > when score >= 80 and score <90 then 'double good'
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > when score >= 70 and score <80 then 'good'
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > when score >= 60 and score <70 then 'go on'
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > else 'zhencai'
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > end level
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > from student_level;
+-------+--------+--------------+
| name | score | level |
+-------+--------+--------------+
| tom | 95 | very good |
| | NULL | zhencai |
| hua | 90 | very good |
| | NULL | zhencai |
| hong | 100 | very good |
| | NULL | zhencai |
| lele | 85 | double good |
| | NULL | zhencai |
| kaka | 70 | good |
| | NULL | zhencai |
| kebi | 60 | go on |
| | NULL | zhencai |
| ming | 55 | zhencai |
| | NULL | zhencai |
| kang | 78 | good |
| | NULL | zhencai |
| lolo | 93 | very good |
+-------+--------+--------------+
17 rows selected (0.373 seconds)
A practical application of explode
tom a,b,c
jim b,c,d
tony a,c,d
0: jdbc:hive2://192.168.67.110:10000> create table test1(name string,subject string)
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > fields terminated by " ";
No rows affected (0.101 seconds)
0: jdbc:hive2://192.168.67.110:10000> load data local inpath '/oopt/testData/hive/info.txt' into table test1;
Error: Error while compiling statement: FAILED: SemanticException Line 1:23 Invalid path ''/oopt/testData/hive/info.txt'': No files matching path file:/oopt/testData/hive/info.txt (state=42000,code=40000)
0: jdbc:hive2://192.168.67.110:10000> load data local inpath '/opt/testData/hive/info.txt' into table test1;
No rows affected (0.299 seconds)
0: jdbc:hive2://192.168.67.110:10000> select name,sub
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > from test1
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > LATERAL VIEW explode(split(subject,','))temp as sub;
+-------+------+
| name | sub |
+-------+------+
| tom | a |
| tom | b |
| tom | c |
| jim | b |
| jim | c |
| jim | d |
| tony | a |
| tony | c |
| tony | d |
+-------+------+
9 rows selected (0.094 seconds)
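A related sketch: if you also need each subject's position within the original list, posexplode returns a zero-based index alongside the value:
select name, idx, sub
from test1
lateral view posexplode(split(subject, ',')) t as idx, sub;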
2. Working with JSON Data
Given the following JSON data:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
Typical big-data sources: crawler collection (Python + Scrapy), front-end/back-end event tracking, and business data.
Create a table and load the data
0: jdbc:hive2://192.168.67.110:10000> create table json(data string);
No rows affected (0.111 seconds)
0: jdbc:hive2://192.168.67.110:10000> load data local inpath '/opt/testData/hive/json.txt' into table json;
No rows affected (0.3 seconds)
0: jdbc:hive2://192.168.67.110:10000> select * from json;
+----------------------------------------------------+
| json.data |
+----------------------------------------------------+
| {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"} |
| |
| {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"} |
| |
| {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"} |
| |
| {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"} |
| |
| {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"} |
| |
| {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"} |
| |
| {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"} |
| |
| {"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"} |
| |
| {"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"} |
+----------------------------------------------------+
17 rows selected (0.209 seconds)
Query the JSON data (the NULL rows again correspond to the blank lines that were loaded into the table):
0: jdbc:hive2://192.168.67.110:10000> select get_json_object(data,'$.movie') as movie,
. . . . . . . . . . . . . . . . . . > get_json_object(data,'$.rate') as rate
. . . . . . . . . . . . . . . . . . > from json;
+--------+-------+
| movie | rate |
+--------+-------+
| 1193 | 5 |
| NULL | NULL |
| 661 | 3 |
| NULL | NULL |
| 914 | 3 |
| NULL | NULL |
| 3408 | 4 |
| NULL | NULL |
| 2355 | 5 |
| NULL | NULL |
| 1197 | 3 |
| NULL | NULL |
| 1287 | 5 |
| NULL | NULL |
| 2804 | 5 |
| NULL | NULL |
| 594 | 4 |
+--------+-------+
17 rows selected (0.146 seconds)
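Each get_json_object call re-parses the JSON string; as a hedged alternative sketch, json_tuple extracts several fields in a single parse:
select t.movie, t.rate
from json
lateral view json_tuple(data, 'movie', 'rate') t as movie, rate;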
3. Window Functions
3.1 Window Aggregation
Prepare the data
cookie1,2015-04-10,1
cookie1,2015-04-13,3
cookie1,2015-04-11,5
cookie1,2015-04-12,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie1,2015-04-14,4
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
Create the table
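The original transcript omits the CREATE TABLE statement for cookie and jumps straight to querying the empty table; judging from the columns in the output and the cookie2 definition in section 3.2, it was presumably:
create table cookie(cookieid string, createtime string, pv int)
row format delimited fields terminated by ',';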
0: jdbc:hive2://192.168.67.110:10000> select * from cookie;
+------------------+--------------------+------------+
| cookie.cookieid | cookie.createtime | cookie.pv |
+------------------+--------------------+------------+
+------------------+--------------------+------------+
No rows selected (0.162 seconds)
Load the data
0: jdbc:hive2://192.168.67.110:10000> load data local inpath "/opt/testData/hive/cookie.txt" overwrite into table cookie;
No rows affected (0.434 seconds)
0: jdbc:hive2://192.168.67.110:10000> select * from cookie;
+------------------+--------------------+------------+
| cookie.cookieid | cookie.createtime | cookie.pv |
+------------------+--------------------+------------+
| cookie1 | 2015-04-10 | 1 |
| cookie1 | 2015-04-13 | 3 |
| cookie1 | 2015-04-11 | 5 |
| cookie1 | 2015-04-12 | 2 |
| cookie1 | 2015-04-15 | 4 |
| cookie1 | 2015-04-16 | 4 |
| cookie1 | 2015-04-14 | 4 |
| cookie2 | 2015-04-14 | 3 |
| cookie2 | 2015-04-15 | 9 |
| cookie2 | 2015-04-16 | 7 |
| cookie2 | 2015-04-10 | 2 |
| cookie2 | 2015-04-11 | 3 |
| cookie2 | 2015-04-12 | 5 |
| cookie2 | 2015-04-13 | 6 |
+------------------+--------------------+------------+
14 rows selected (0.169 seconds)
sum(pv) over()
We partition by cookieid, order by createtime, and compute sums over pv.
Sum of pv from the first row of the partition up to the current row. This is also the default when no frame is given (strictly, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which matches ROWS here because each cookie has one row per date).
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1,
sum(pv) over (partition by cookieid order by createtime) as pv2,
If we only partition without ordering, the sum covers all rows in the partition.
sum(pv) over (partition by cookieid) as pv3,
Sum of pv over the preceding 3 rows plus the current row.
sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,
From 3 rows before the current row through 2 rows after it.
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 2 following) as pv5,
From the current row through the last row of the partition.
sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6
Code
0: jdbc:hive2://192.168.67.110:10000> create table cookie_sum
. . . . . . . . . . . . . . . . . . > as
. . . . . . . . . . . . . . . . . . > select cookieid,createtime,
. . . . . . . . . . . . . . . . . . > pv,
. . . . . . . . . . . . . . . . . . > sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1,
. . . . . . . . . . . . . . . . . . > sum(pv) over (partition by cookieid order by createtime) as pv2,
. . . . . . . . . . . . . . . . . . > sum(pv) over (partition by cookieid) as pv3,
. . . . . . . . . . . . . . . . . . > sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,
. . . . . . . . . . . . . . . . . . > sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 2 following) as pv5,
. . . . . . . . . . . . . . . . . . > sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6
. . . . . . . . . . . . . . . . . . > from cookie;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (60.957 seconds)
The same window frames work with the other aggregates:
avg(pv) over()
min(pv) over()
max(pv) over()
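A sketch applying them over the same cookie table (the column aliases are illustrative):
select cookieid, createtime, pv,
       avg(pv) over (partition by cookieid order by createtime) as avg_pv,
       min(pv) over (partition by cookieid order by createtime) as min_pv,
       max(pv) over (partition by cookieid order by createtime) as max_pv
from cookie;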
3.2 Window Slicing (ntile)
Prepare the data
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
Create the table
0: jdbc:hive2://192.168.67.110:10000> create table cookie2(cookieid string, createtime string, pv int)
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > fields terminated by ',';
No rows affected (0.176 seconds)
Load the data
0: jdbc:hive2://192.168.67.110:10000> load data local inpath "/opt/testData/hive/cookie2.txt" into table cookie2;
No rows affected (0.531 seconds)
View the data (note the file is re-loaded with overwrite, which replaces the earlier load instead of appending a duplicate copy)
0: jdbc:hive2://192.168.67.110:10000> load data local inpath "/opt/testData/hive/cookie2.txt" overwrite into table cookie2;
No rows affected (0.42 seconds)
0: jdbc:hive2://192.168.67.110:10000> select * from cookie2;
+-------------------+---------------------+-------------+
| cookie2.cookieid | cookie2.createtime | cookie2.pv |
+-------------------+---------------------+-------------+
| cookie1 | 2015-04-10 | 1 |
| cookie1 | 2015-04-11 | 5 |
| cookie1 | 2015-04-12 | 7 |
| cookie1 | 2015-04-13 | 3 |
| cookie1 | 2015-04-14 | 2 |
| cookie1 | 2015-04-15 | 4 |
| cookie1 | 2015-04-16 | 4 |
| cookie2 | 2015-04-10 | 2 |
| cookie2 | 2015-04-11 | 3 |
| cookie2 | 2015-04-12 | 5 |
| cookie2 | 2015-04-13 | 6 |
| cookie2 | 2015-04-14 | 3 |
| cookie2 | 2015-04-15 | 9 |
| cookie2 | 2015-04-16 | 7 |
+-------------------+---------------------+-------------+
14 rows selected (0.172 seconds)
The following function does not support rows between:
ntile(n) over()
Splits the rows of each partition, in order, into n roughly equal buckets; typically used to take the top 1/n of the rows.
ntile(2) over (partition by cookieid order by createtime) as rn1,
ntile(3) over (partition by cookieid order by createtime) as rn2,
Without a partition clause, all rows are bucketed together.
Code
0: jdbc:hive2://192.168.67.110:10000> create table cookie_ntile
. . . . . . . . . . . . . . . . . . > as
. . . . . . . . . . . . . . . . . . > select cookieid,createtime,pv,
. . . . . . . . . . . . . . . . . . > ntile(2) over (partition by cookieid order by createtime) as rn1,
. . . . . . . . . . . . . . . . . . > ntile(3) over (partition by cookieid order by createtime) as rn2,
. . . . . . . . . . . . . . . . . . > ntile(4) over (order by createtime) as rn3
. . . . . . . . . . . . . . . . . . > from cookie2 order by cookieid,createtime;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (71.657 seconds)
Result
0: jdbc:hive2://192.168.67.110:10000> select * from cookie_ntile;
+------------------------+--------------------------+------------------+-------------------+-------------------+-------------------+
| cookie_ntile.cookieid | cookie_ntile.createtime | cookie_ntile.pv | cookie_ntile.rn1 | cookie_ntile.rn2 | cookie_ntile.rn3 |
+------------------------+--------------------------+------------------+-------------------+-------------------+-------------------+
| cookie1 | 2015-04-10 | 1 | 1 | 1 | 1 |
| cookie1 | 2015-04-11 | 5 | 1 | 1 | 1 |
| cookie1 | 2015-04-12 | 7 | 1 | 1 | 2 |
| cookie1 | 2015-04-13 | 3 | 1 | 2 | 2 |
| cookie1 | 2015-04-14 | 2 | 2 | 2 | 3 |
| cookie1 | 2015-04-15 | 4 | 2 | 3 | 4 |
| cookie1 | 2015-04-16 | 4 | 2 | 3 | 4 |
| cookie2 | 2015-04-10 | 2 | 1 | 1 | 1 |
| cookie2 | 2015-04-11 | 3 | 1 | 1 | 1 |
| cookie2 | 2015-04-12 | 5 | 1 | 1 | 2 |
| cookie2 | 2015-04-13 | 6 | 1 | 2 | 2 |
| cookie2 | 2015-04-14 | 3 | 2 | 2 | 3 |
| cookie2 | 2015-04-15 | 9 | 2 | 3 | 3 |
| cookie2 | 2015-04-16 | 7 | 2 | 3 | 4 |
+------------------------+--------------------------+------------------+-------------------+-------------------+-------------------+
14 rows selected (0.315 seconds)
For example: for each cookie, find the top third of its days by pv.
0: jdbc:hive2://192.168.67.110:10000> create table cookie_temp
. . . . . . . . . . . . . . . . . . > as
. . . . . . . . . . . . . . . . . . > select
. . . . . . . . . . . . . . . . . . > cookieid,
. . . . . . . . . . . . . . . . . . > createtime,
. . . . . . . . . . . . . . . . . . > pv,
. . . . . . . . . . . . . . . . . . > ntile(3) over (partition by cookieid order by pv desc) as rn
. . . . . . . . . . . . . . . . . . > from cookie2;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (24.227 seconds)
Result.
0: jdbc:hive2://192.168.67.110:10000> select * from cookie_temp;
+-----------------------+-------------------------+-----------------+-----------------+
| cookie_temp.cookieid | cookie_temp.createtime | cookie_temp.pv | cookie_temp.rn |
+-----------------------+-------------------------+-----------------+-----------------+
| cookie1 | 2015-04-12 | 7 | 1 |
| cookie1 | 2015-04-11 | 5 | 1 |
| cookie1 | 2015-04-16 | 4 | 1 |
| cookie1 | 2015-04-15 | 4 | 2 |
| cookie1 | 2015-04-13 | 3 | 2 |
| cookie1 | 2015-04-14 | 2 | 3 |
| cookie1 | 2015-04-10 | 1 | 3 |
| cookie2 | 2015-04-15 | 9 | 1 |
| cookie2 | 2015-04-16 | 7 | 1 |
| cookie2 | 2015-04-13 | 6 | 1 |
| cookie2 | 2015-04-12 | 5 | 2 |
| cookie2 | 2015-04-11 | 3 | 2 |
| cookie2 | 2015-04-14 | 3 | 3 |
| cookie2 | 2015-04-10 | 2 | 3 |
+-----------------------+-------------------------+-----------------+-----------------+
14 rows selected (0.315 seconds)
Rows with rn = 1 are the top third of days by pv.
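A minimal follow-up query that keeps just those rows:
select cookieid, createtime, pv from cookie_temp where rn = 1;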
3.3 Window Ranking
row_number() over()
Numbers the rows within each partition consecutively; typically used to take the top n.
row_number() over (partition by cookieid order by pv desc) as rn1
Values 100 99 98 98 97 96 are numbered 1, 2, 3, 4, 5, 6.
rank() over()
rank() over(partition by cookieid order by pv desc) as rn2
Values 100 99 98 98 97 96 rank as 1, 2, 3, 3, 5, 6 (ties share a rank and the following rank is skipped).
dense_rank() over()
dense_rank() over(partition by cookieid order by pv desc) as rn3
Values 100 99 98 98 97 96 rank as 1, 2, 3, 3, 4, 5 (ties share a rank with no gap).
Code
0: jdbc:hive2://192.168.67.110:10000> create table cookie_order
. . . . . . . . . . . . . . . . . . > as
. . . . . . . . . . . . . . . . . . > select
. . . . . . . . . . . . . . . . . . > cookieid,
. . . . . . . . . . . . . . . . . . > createtime,
. . . . . . . . . . . . . . . . . . > pv,
. . . . . . . . . . . . . . . . . . > rank() over (partition by cookieid order by pv desc) as rn1,
. . . . . . . . . . . . . . . . . . > dense_rank() over (partition by cookieid order by pv desc) as rn2,
. . . . . . . . . . . . . . . . . . > row_number() over (partition by cookieid order by pv desc) as rn3
. . . . . . . . . . . . . . . . . . > from cookie2
. . . . . . . . . . . . . . . . . . > where cookieid='cookie1';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (39.18 seconds)
Result
0: jdbc:hive2://192.168.67.110:10000> select * from cookie_order;
+------------------------+--------------------------+------------------+-------------------+-------------------+-------------------+
| cookie_order.cookieid | cookie_order.createtime | cookie_order.pv | cookie_order.rn1 | cookie_order.rn2 | cookie_order.rn3 |
+------------------------+--------------------------+------------------+-------------------+-------------------+-------------------+
| cookie1 | 2015-04-12 | 7 | 1 | 1 | 1 |
| cookie1 | 2015-04-11 | 5 | 2 | 2 | 2 |
| cookie1 | 2015-04-16 | 4 | 3 | 3 | 3 |
| cookie1 | 2015-04-15 | 4 | 3 | 3 | 4 |
| cookie1 | 2015-04-13 | 3 | 5 | 4 | 5 |
| cookie1 | 2015-04-14 | 2 | 6 | 5 | 6 |
| cookie1 | 2015-04-10 | 1 | 7 | 6 | 7 |
+------------------------+--------------------------+------------------+-------------------+-------------------+-------------------+
7 rows selected (0.336 seconds)
3.4 Looking Backward and Forward (lag and lead)
Prepare the data
cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55
Create the table
0: jdbc:hive2://192.168.67.110:10000> create table cookie3(cookieid string, createtime string, url string)
. . . . . . . . . . . . . . . . . . > row format delimited fields terminated by ',';
No rows affected (0.186 seconds)
Load the data
0: jdbc:hive2://192.168.67.110:10000> load data local inpath "/opt/testData/hive/cookie_2.txt" into table cookie3;
No rows affected (0.54 seconds)
View the data
0: jdbc:hive2://192.168.67.110:10000> select * from cookie3;
+-------------------+----------------------+--------------+
| cookie3.cookieid | cookie3.createtime | cookie3.url |
+-------------------+----------------------+--------------+
| cookie1 | 2015-04-10 10:00:02 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 |
| cookie1 | 2015-04-10 10:03:04 | url3 |
| cookie1 | 2015-04-10 10:50:05 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 |
| cookie2 | 2015-04-10 10:00:02 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 |
| cookie2 | 2015-04-10 10:03:04 | url33 |
| cookie2 | 2015-04-10 10:50:05 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 |
+-------------------+----------------------+--------------+
14 rows selected (0.241 seconds)
LAG(col,n,DEFAULT)
Returns the value of col from the nth row before the current row within the partition (i.e., it lets each row look back).
The first argument is the column name.
The second argument is the offset n (optional; defaults to 1).
The third argument is the default value, returned when there is no row n positions back; if omitted, NULL is returned.
LAG(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as last_1_time
LAG(createtime,2) over (partition by cookieid order by createtime) as last_2_time
Code
0: jdbc:hive2://192.168.67.110:10000> create table cookie3_log
. . . . . . . . . . . . . . . . . . > as
. . . . . . . . . . . . . . . . . . > select cookieid,createtime,url,
. . . . . . . . . . . . . . . . . . > row_number() over (partition by cookieid order by createtime) as rn,
. . . . . . . . . . . . . . . . . . > LAG(createtime,1,'2022-10-20 00:00:00') over (partition by cookieid order by createtime) as log_1,
. . . . . . . . . . . . . . . . . . > LAG(createtime,2) over (partition by cookieid order by createtime) as log_2
. . . . . . . . . . . . . . . . . . > from cookie3;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Result
0: jdbc:hive2://192.168.67.110:10000> select * from cookie3_log;
+-----------------------+-------------------------+------------------+-----------------+----------------------+----------------------+
| cookie3_log.cookieid | cookie3_log.createtime | cookie3_log.url | cookie3_log.rn | cookie3_log.log_1 | cookie3_log.log_2 |
+-----------------------+-------------------------+------------------+-----------------+----------------------+----------------------+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 2022-10-20 00:00:00 | NULL |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie1 | 2015-04-10 10:03:04 | url3 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 2022-10-20 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie2 | 2015-04-10 10:03:04 | url33 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
+-----------------------+-------------------------+------------------+-----------------+----------------------+----------------------+
14 rows selected (0.268 seconds)
LEAD(col,n,DEFAULT)
Analogous to LAG but in the other direction: returns the value of col from the nth row after the current row. Note that the transcript below still calls LAG (only the column aliases say lead), so its output simply repeats the previous example; substitute LEAD for LAG to actually look forward.
Code
0: jdbc:hive2://192.168.67.110:10000> create table cookie3_lead
. . . . . . . . . . . . . . . . . . > as
. . . . . . . . . . . . . . . . . . > select cookieid,createtime,url,
. . . . . . . . . . . . . . . . . . > LAG(createtime,1,'2022-10-20') over (partition by cookieid order by createtime) as lead_1,
. . . . . . . . . . . . . . . . . . > LAG(createtime,2) over (partition by cookieid order by createtime) as lead_2
. . . . . . . . . . . . . . . . . . > from cookie3;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Result
0: jdbc:hive2://192.168.67.110:10000> select * from cookie3_lead;
+------------------------+--------------------------+-------------------+----------------------+----------------------+
| cookie3_lead.cookieid | cookie3_lead.createtime | cookie3_lead.url | cookie3_lead.lead_1 | cookie3_lead.lead_2 |
+------------------------+--------------------------+-------------------+----------------------+----------------------+
| cookie1 | 2015-04-10 10:00:00 | url1 | 2022-10-20 | NULL |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2015-04-10 10:00:00 | NULL |
| cookie1 | 2015-04-10 10:03:04 | url3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 2022-10-20 | NULL |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2015-04-10 10:00:00 | NULL |
| cookie2 | 2015-04-10 10:03:04 | url33 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
+------------------------+--------------------------+-------------------+----------------------+----------------------+
14 rows selected (0.265 seconds)
3.5 First and Last Values
FIRST_VALUE(url) over ()
The first value in the frame running from the start of the partition to the current row.
FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last1
LAST_VALUE(url) over ()
The last value in the frame up to the current row. With the default frame this is just the current row's own value, which is why last2 in the result always equals the row's url.
LAST_VALUE(url) over (partition by cookieid order by createtime desc) as last2
Code
0: jdbc:hive2://192.168.67.110:10000> create table cookie3_1
. . . . . . . . . . . . . . . . . . > as
. . . . . . . . . . . . . . . . . . > select cookieid,createtime,url,
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > row_number() over (partition by cookieid order by createtime) as rn,
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last1,
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > LAST_VALUE(url) over (partition by cookieid order by createtime desc) as last2
. . . . . . . . . . . . . . . . . . >
. . . . . . . . . . . . . . . . . . > from cookie3;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Result
0: jdbc:hive2://192.168.67.110:10000> select * from cookie3_1;
+---------------------+-----------------------+----------------+---------------+------------------+------------------+
| cookie3_1.cookieid | cookie3_1.createtime | cookie3_1.url | cookie3_1.rn | cookie3_1.last1 | cookie3_1.last2 |
+---------------------+-----------------------+----------------+---------------+------------------+------------------+
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url7 | url6 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url7 | url5 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url7 | url4 |
| cookie1 | 2015-04-10 10:03:04 | url3 | 3 | url7 | url3 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url7 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url7 | url1 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url77 | url66 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url77 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url77 | url44 |
| cookie2 | 2015-04-10 10:03:04 | url33 | 3 | url77 | url33 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url77 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url77 | url11 |
+---------------------+-----------------------+----------------+---------------+------------------+------------------+
14 rows selected (0.256 seconds)
4. User-Defined Functions
When Hive's built-in functions cannot meet a business requirement, you can write your own.
A UDF (user-defined function) operates on a single row and produces a single row as output.
A UDAF (user-defined aggregation function) takes multiple input rows and produces one output row, like max and min.
A UDTF (user-defined table-generating function) takes one input row and produces multiple output rows, like explode.
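A minimal sketch of how a compiled UDF is typically registered and called from HiveQL; the jar path and class name here are hypothetical placeholders, not part of the original article:
add jar /path/to/my_udfs.jar;                                          -- hypothetical jar containing the UDF class
create temporary function my_upper as 'com.example.hive.MyUpperUDF';   -- hypothetical implementing class
select my_upper('hello');                                              -- usable like any built-in afterwards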