Hive-窗口函数入门

博客围绕Hive SQL展开,给出订单数据,介绍建表、加载数据操作。通过SQL语句实现求每个人订单明细和总量、指定窗口查询、按百分比取数据、特定时间购买顾客查询、顾客购买明细及月购买总额查询等功能,还提及month函数使用。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

给出数据:姓名,日期,花费
jack,2020-01-01,10
tony,2020-01-02,15
jack,2020-02-03,23
tony,2020-01-04,29
jack,2020-01-05,46
jack,2020-04-06,42
tony,2020-01-07,50
jack,2020-01-08,55
mart,2020-04-08,62
mart,2020-04-09,68
neil,2020-05-10,12
mart,2020-04-11,75
neil,2020-06-12,80
mart,2020-04-13,94
数据存于/data/orders中
建表:
create table tb_orders(
name string ,
ctime string ,
money double
)
row format delimited fields terminated by ‘,’ ;
load data local inpath “/data/orders” into table tb_orders ;

select * from tb_orders ;
±----------------±-----------------±-----------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money |
±----------------±-----------------±-----------------+
| jack | 2020-01-01 | 10.0 |
| tony | 2020-01-02 | 15.0 |
| jack | 2020-02-03 | 23.0 |
| tony | 2020-01-04 | 29.0 |
| jack | 2020-01-05 | 46.0 |
| jack | 2020-04-06 | 42.0 |
| tony | 2020-01-07 | 50.0 |
| jack | 2020-01-08 | 55.0 |
| mart | 2020-04-08 | 62.0 |
| mart | 2020-04-09 | 68.0 |
| neil | 2020-05-10 | 12.0 |
| mart | 2020-04-11 | 75.0 |
| neil | 2020-06-12 | 80.0 |
| mart | 2020-04-13 | 94.0 |
±----------------±-----------------±-----------------+

– 求每个人的订单明细和总量
思路:
①select * from tb_orders order by name;
②select
name ,
count(1)
from
tb_orders
group by name ;

±------±-----+
| name | _c1 |
±------±-----+
| jack | 5 |
| mart | 4 |
| neil | 2 |
| tony | 3 |
±------±-----+

整合:
with x as (
select * from tb_orders order by name
),
y as (
select
name ,
count(1) cnt ,
sum(money) total_money
from
tb_orders
group by name
)
select
x.*,
y.cnt,
y.total_money
from
x
join
y
on
x.name = y.name
order by x.name , x.ctime
;

结果:
±--------±------------±---------±-------±---------------+
| x.name | x.ctime | x.money | y.cnt | y.total_money |
±--------±------------±---------±-------±---------------+
| jack | 2020-01-01 | 10.0 | 5 | 176.0 |
| jack | 2020-01-05 | 46.0 | 5 | 176.0 |
| jack | 2020-01-08 | 55.0 | 5 | 176.0 |
| jack | 2020-02-03 | 23.0 | 5 | 176.0 |
| jack | 2020-04-06 | 42.0 | 5 | 176.0 |
| mart | 2020-04-08 | 62.0 | 4 | 299.0 |
| mart | 2020-04-09 | 68.0 | 4 | 299.0 |
| mart | 2020-04-11 | 75.0 | 4 | 299.0 |
| mart | 2020-04-13 | 94.0 | 4 | 299.0 |
| neil | 2020-05-10 | 12.0 | 2 | 92.0 |
| neil | 2020-06-12 | 80.0 | 2 | 92.0 |
| tony | 2020-01-02 | 15.0 | 3 | 94.0 |
| tony | 2020-01-04 | 29.0 | 3 | 94.0 |
| tony | 2020-01-07 | 50.0 | 3 | 94.0 |
±--------±------------±---------±-------±---------------+
指定窗口为整个表

select

  • ,
    count(1) over() – 整个表
    from
    tb_orders ;

±----------------±-----------------±-----------------±----------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | count_window_0 |
±----------------±-----------------±-----------------±----------------+
| mart | 2020-04-13 | 94.0 | 14 |
| neil | 2020-06-12 | 80.0 | 14 |
| mart | 2020-04-11 | 75.0 | 14 |
| neil | 2020-05-10 | 12.0 | 14 |
| mart | 2020-04-09 | 68.0 | 14 |
| mart | 2020-04-08 | 62.0 | 14 |
| jack | 2020-01-08 | 55.0 | 14 |
| tony | 2020-01-07 | 50.0 | 14 |
| jack | 2020-04-06 | 42.0 | 14 |
| jack | 2020-01-05 | 46.0 | 14 |
| tony | 2020-01-04 | 29.0 | 14 |
| jack | 2020-02-03 | 23.0 | 14 |
| tony | 2020-01-02 | 15.0 | 14 |
| jack | 2020-01-01 | 10.0 | 14 |
±----------------±-----------------±-----------------±----------------+

– 指定窗口大小是一个人
select

  • ,
    count(1) over(partition by name) ,
    sum(money) over(partition by name)
    from
    tb_orders ;

±----------------±-----------------±-----------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money |
±----------------±-----------------±-----------------+
| jack | 2020-01-01 | 10.0 |
| tony | 2020-01-02 | 15.0 |
| jack | 2020-02-03 | 23.0 |
| tony | 2020-01-04 | 29.0 |
| jack | 2020-01-05 | 46.0 |
| jack | 2020-04-06 | 42.0 |
| tony | 2020-01-07 | 50.0 |
| jack | 2020-01-08 | 55.0 |
| mart | 2020-04-08 | 62.0 |
| mart | 2020-04-09 | 68.0 |
| neil | 2020-05-10 | 12.0 |
| mart | 2020-04-11 | 75.0 |
| neil | 2020-06-12 | 80.0 |
| mart | 2020-04-13 | 94.0 |
±----------------±-----------------±-----------------+
限制窗口大小
这里 order by ctime 和rows between unbounded preceding and current row 起始效果是一样的,写一个就可以

select

  • ,
    sum(money) over(partition by name order by ctime rows between unbounded preceding and current row )
    from
    tb_orders ;

分开写:

select
*,
sum(money) over (partition by name order by ctime)
from
tb_orders;


select
*,
sum(money) over (partition by name rows between unbounded preceding and current row)
from
tb_orders;

这三个结果都一样:
±----------------±-----------------±-----------------±--------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | sum_window_0 |
±----------------±-----------------±-----------------±--------------+
| jack | 2020-01-01 | 10.0 | 10.0 |
| jack | 2020-01-05 | 46.0 | 56.0 |
| jack | 2020-01-08 | 55.0 | 111.0 |
| jack | 2020-02-03 | 23.0 | 134.0 |
| jack | 2020-04-06 | 42.0 | 176.0 |
| mart | 2020-04-08 | 62.0 | 62.0 |
| mart | 2020-04-09 | 68.0 | 130.0 |
| mart | 2020-04-11 | 75.0 | 205.0 |
| mart | 2020-04-13 | 94.0 | 299.0 |
| neil | 2020-05-10 | 12.0 | 12.0 |
| neil | 2020-06-12 | 80.0 | 92.0 |
| tony | 2020-01-02 | 15.0 | 15.0 |
| tony | 2020-01-04 | 29.0 | 44.0 |
| tony | 2020-01-07 | 50.0 | 94.0 |
±----------------±-----------------±-----------------±--------------+

** 起始行 unbounded preceding
当前行 current row
前n行 n preceding
后n行 n following
结束行 unbounded following**

– 上一行和当前行 – order by ctime 起始行和当前行
select

  • ,
    sum(money) over(partition by name order by ctime rows between 1 preceding and current row )
    from
    tb_orders ;

– 上一行 当前行 和下一行
select

  • ,
    sum(money) over(partition by name order by ctime rows between 1 preceding and 1 following )
    from
    tb_orders ;

lag 向前n行的数据 :lag(字段,偏移量,默认值(可省略)) 没值的补NULL

select

  • ,
    lag(ctime ,1) over(partition by name order by ctime)
    from
    tb_orders ;

±----------------±-----------------±-----------------±--------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | lag_window_0 |
±----------------±-----------------±-----------------±--------------+
| jack | 2020-01-01 | 10.0 | NULL |
| jack | 2020-01-05 | 46.0 | 2020-01-01 |
| jack | 2020-01-08 | 55.0 | 2020-01-05 |
| jack | 2020-02-03 | 23.0 | 2020-01-08 |
| jack | 2020-04-06 | 42.0 | 2020-02-03 |
| mart | 2020-04-08 | 62.0 | NULL |
| mart | 2020-04-09 | 68.0 | 2020-04-08 |
| mart | 2020-04-11 | 75.0 | 2020-04-09 |
| mart | 2020-04-13 | 94.0 | 2020-04-11 |
| neil | 2020-05-10 | 12.0 | NULL |
| neil | 2020-06-12 | 80.0 | 2020-05-10 |
| tony | 2020-01-02 | 15.0 | NULL |
| tony | 2020-01-04 | 29.0 | 2020-01-02 |
| tony | 2020-01-07 | 50.0 | 2020-01-04 |
±----------------±-----------------±-----------------±--------------+

select

  • ,
    lag(ctime ,1 , ‘自定义默认值’) over(partition by name order by ctime)
    from
    tb_orders ;

±----------------±-----------------±-----------------±--------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | lag_window_0 |
±----------------±-----------------±-----------------±--------------+
| jack | 2020-01-01 | 10.0 | 自定义默认值 |
| jack | 2020-01-05 | 46.0 | 2020-01-01 |
| jack | 2020-01-08 | 55.0 | 2020-01-05 |
| jack | 2020-02-03 | 23.0 | 2020-01-08 |
| jack | 2020-04-06 | 42.0 | 2020-02-03 |
| mart | 2020-04-08 | 62.0 | 自定义默认值 |
| mart | 2020-04-09 | 68.0 | 2020-04-08 |
| mart | 2020-04-11 | 75.0 | 2020-04-09 |
| mart | 2020-04-13 | 94.0 | 2020-04-11 |
| neil | 2020-05-10 | 12.0 | 自定义默认值 |
| neil | 2020-06-12 | 80.0 | 2020-05-10 |
| tony | 2020-01-02 | 15.0 | 自定义默认值 |
| tony | 2020-01-04 | 29.0 | 2020-01-02 |
| tony | 2020-01-07 | 50.0 | 2020-01-04 |
±----------------±-----------------±-----------------±--------------+

lead 向后n行的数据 lead(字段,偏移量,默认值(可省略))没值的补NULL
select
*,
lead(ctime,1) over(partition by name order by ctime)
from
tb_orders;

±----------------±-----------------±-----------------±---------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | lead_window_0 |
±----------------±-----------------±-----------------±---------------+
| jack | 2020-01-01 | 10.0 | 2020-01-05 |
| jack | 2020-01-05 | 46.0 | 2020-01-08 |
| jack | 2020-01-08 | 55.0 | 2020-02-03 |
| jack | 2020-02-03 | 23.0 | 2020-04-06 |
| jack | 2020-04-06 | 42.0 | NULL |
| mart | 2020-04-08 | 62.0 | 2020-04-09 |
| mart | 2020-04-09 | 68.0 | 2020-04-11 |
| mart | 2020-04-11 | 75.0 | 2020-04-13 |
| mart | 2020-04-13 | 94.0 | NULL |
| neil | 2020-05-10 | 12.0 | 2020-06-12 |
| neil | 2020-06-12 | 80.0 | NULL |
| tony | 2020-01-02 | 15.0 | 2020-01-04 |
| tony | 2020-01-04 | 29.0 | 2020-01-07 |
| tony | 2020-01-07 | 50.0 | NULL |
±----------------±-----------------±-----------------±---------------+

select
*,
lead(ctime,1,‘向后自定义默认值’) over(partition by name order by ctime)
from
tb_orders;

±----------------±-----------------±-----------------±---------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | lead_window_0 |
±----------------±-----------------±-----------------±---------------+
| jack | 2020-01-01 | 10.0 | 2020-01-05 |
| jack | 2020-01-05 | 46.0 | 2020-01-08 |
| jack | 2020-01-08 | 55.0 | 2020-02-03 |
| jack | 2020-02-03 | 23.0 | 2020-04-06 |
| jack | 2020-04-06 | 42.0 | 向后自定义默认值 |
| mart | 2020-04-08 | 62.0 | 2020-04-09 |
| mart | 2020-04-09 | 68.0 | 2020-04-11 |
| mart | 2020-04-11 | 75.0 | 2020-04-13 |
| mart | 2020-04-13 | 94.0 | 向后自定义默认值 |
| neil | 2020-05-10 | 12.0 | 2020-06-12 |
| neil | 2020-06-12 | 80.0 | 向后自定义默认值 |
| tony | 2020-01-02 | 15.0 | 2020-01-04 |
| tony | 2020-01-04 | 29.0 | 2020-01-07 |
| tony | 2020-01-07 | 50.0 | 向后自定义默认值 |
±----------------±-----------------±-----------------±---------------+
(1) 查询前20%时间的订单信息

NTILE(n) 数据分成n分,可实现按百分比取出数据

例:取出20%,也就是总数的五分之一
select
name ,
ctime ,
money
from
(select

  • ,
    ntile(5) over(order by ctime) as num
    from
    tb_orders)t
    where num =1 ;
    ±------±------------±-------+
    | name | ctime | money |
    ±------±------------±-------+
    | jack | 2020-01-01 | 10.0 |
    | tony | 2020-01-02 | 15.0 |
    | tony | 2020-01-04 | 29.0 |
    ±------±------------±-------+

(2)查询在2020年4月份购买过的顾客及总人数
select
*
from
tb_orders where substring(ctime , 0 , 7) = ‘2020-04’ ;
±----------------±-----------------±-----------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money |
±----------------±-----------------±-----------------+
| jack | 2020-04-06 | 42.0 |
| mart | 2020-04-08 | 62.0 |
| mart | 2020-04-09 | 68.0 |
| mart | 2020-04-11 | 75.0 |
| mart | 2020-04-13 | 94.0 |
±----------------±-----------------±-----------------+

select
name,
count(1) over()
from
(
select
distinct name
from
tb_orders where substring(ctime , 0 , 7) = ‘2020-04’
)t ;

(3)查询顾客的购买明细及月购买总额
select

  • ,
    sum(money) over(partition by name , substring(ctime ,1 ,7))
    from
    tb_orders ;

±----------------±-----------------±-----------------±--------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | sum_window_0 |
±----------------±-----------------±-----------------±--------------+
| jack | 2020-01-05 | 46.0 | 111.0 |
| jack | 2020-01-08 | 55.0 | 111.0 |
| jack | 2020-01-01 | 10.0 | 111.0 |
| jack | 2020-02-03 | 23.0 | 23.0 |
| jack | 2020-04-06 | 42.0 | 42.0 |
| mart | 2020-04-13 | 94.0 | 299.0 |
| mart | 2020-04-11 | 75.0 | 299.0 |
| mart | 2020-04-09 | 68.0 | 299.0 |
| mart | 2020-04-08 | 62.0 | 299.0 |
| neil | 2020-05-10 | 12.0 | 12.0 |
| neil | 2020-06-12 | 80.0 | 80.0 |
| tony | 2020-01-04 | 29.0 | 94.0 |
| tony | 2020-01-02 | 15.0 | 94.0 |
| tony | 2020-01-07 | 50.0 | 94.0 |
±----------------±-----------------±-----------------±--------------+

month(ctime)month函数(对日期格式有要求):
±---------------------------------------------------+
| tab_name |
±---------------------------------------------------+
| month(param) - Returns the month component of the date/timestamp/interval |
要求格式为:日期,时间戳,定时器类型

select
month(ctime)
from
tb_orders;

±-----+
| _c0 |
±-----+
| 1 |
| 1 |
| 2 |
| 1 |
| 1 |
| 4 |
| 1 |
| 1 |
| 4 |
| 4 |
| 5 |
| 4 |
| 6 |
| 4 |
±-----+

select

  • ,
    sum(money) over(partition by name , month(ctime))
    from
    tb_orders ;

±----------------±-----------------±-----------------±--------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | sum_window_0 |
±----------------±-----------------±-----------------±--------------+
| jack | 2020-01-05 | 46.0 | 111.0 |
| jack | 2020-01-08 | 55.0 | 111.0 |
| jack | 2020-01-01 | 10.0 | 111.0 |
| jack | 2020-02-03 | 23.0 | 23.0 |
| jack | 2020-04-06 | 42.0 | 42.0 |
| mart | 2020-04-13 | 94.0 | 299.0 |
| mart | 2020-04-11 | 75.0 | 299.0 |
| mart | 2020-04-09 | 68.0 | 299.0 |
| mart | 2020-04-08 | 62.0 | 299.0 |
| neil | 2020-05-10 | 12.0 | 12.0 |
| neil | 2020-06-12 | 80.0 | 80.0 |
| tony | 2020-01-04 | 29.0 | 94.0 |
| tony | 2020-01-02 | 15.0 | 94.0 |
| tony | 2020-01-07 | 50.0 | 94.0 |
±----------------±-----------------±-----------------±--------------+

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值