Using the last_value Function in HiveSQL

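Using last_value in SQL to get the latest non-NULL value in each partition
This article shows how to use the last_value function in a SQL query to get, per partition, the latest non-NULL value of each column ordered by update time. The examples combine PARTITION BY, ORDER BY, and ROWS BETWEEN to merge several rows into one.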

1. Background

Given the data below, how can we take, for every column, the latest non-NULL value ordered by update time (up_time)?

 id  name   age    address  ct_time  up_time
 1   a      a      null     202301   202301
 1   b      b      null     null     202302
 1   null   c      null     null     202303
 1   d      null   null     null     202304
 2   a      a      null     202301   202301
-- Expected result
 id  name   age    address  ct_time  up_time
 1   d      c      null     202301   202304
 2   a      a      null     202301   202301

2. Using the last_value function

-- General form (a, b, c are placeholder columns; a FROM clause is needed to run it):
select last_value(age) over(partition by a order by b,c desc)

-- Version 1: default window frame (from the start of the partition up to the current row).
SELECT *
FROM
(SELECT id
       -- TRUE as the second argument makes last_value skip NULLs
       ,last_value(name,TRUE)    OVER (PARTITION BY id ORDER BY up_time) name
       ,last_value(age,TRUE)     OVER (PARTITION BY id ORDER BY up_time) age
       ,last_value(address,TRUE) OVER (PARTITION BY id ORDER BY up_time) address
       ,last_value(ct_time,TRUE) OVER (PARTITION BY id ORDER BY up_time) ct_time
       ,up_time
       -- rn = 1 marks the row with the latest up_time within each id
       -- (named rn rather than rank to avoid clashing with the rank() function)
       ,row_number() over (partition by id order by up_time desc ) as rn
FROM
    (select *
     from
         (select 1 as id,'a'  as name ,'a'  as age,null as address,202301 as ct_time,202301   as up_time
          union all
          select 1 as id,'b'  as name ,'b'  as age,null as address,null   as ct_time,  202302 as up_time
          union all
          select 1 as id,null as name,'c'   as age,null as address,null   as ct_time,  202303 as up_time
          union all
          select 1 as id,'d'  as name ,null as age,null as address,null   as ct_time,  202304 as up_time
          union all
          select 2 as id,'a'  as name ,'a'  as age,null as address,202301 as ct_time,  202301 as up_time
         ) t
    ) src      -- subquery aliases added; many Hive versions require them
) merged
WHERE rn=1
;

-- Version 2: explicit window frame covering the whole partition.
SELECT *
FROM
    (SELECT id
          -- TRUE as the second argument makes last_value skip NULLs
          ,last_value(name,TRUE)    OVER (PARTITION BY id ORDER BY up_time ROWS BETWEEN unbounded preceding and unbounded following) name
          ,last_value(age,TRUE)     OVER (PARTITION BY id ORDER BY up_time ROWS BETWEEN unbounded preceding and unbounded following) age
          ,last_value(address,TRUE) OVER (PARTITION BY id ORDER BY up_time ROWS BETWEEN unbounded preceding and unbounded following) address
          ,last_value(ct_time,TRUE) OVER (PARTITION BY id ORDER BY up_time ROWS BETWEEN unbounded preceding and unbounded following) ct_time
          ,up_time
          -- rn = 1 marks the row with the latest up_time within each id
          -- (named rn rather than rank to avoid clashing with the rank() function)
          ,row_number() over (partition by id order by up_time desc ) as rn
     FROM
         (select *
          from
              (select 1 as id,'a'  as name ,'a'  as age,null as address,202301 as ct_time,202301   as up_time
               union all
               select 1 as id,'b'  as name ,'b'  as age,null as address,null   as ct_time,  202302 as up_time
               union all
               select 1 as id,null as name,'c'   as age,null as address,null   as ct_time,  202303 as up_time
               union all
               select 1 as id,'d'  as name ,null as age,null as address,null   as ct_time,  202304 as up_time
               union all
               select 2 as id,'a'  as name ,'a'  as age,null as address,202301 as ct_time,  202301 as up_time
              ) t
         ) src      -- subquery aliases added; many Hive versions require them
    ) merged
WHERE rn=1
;

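In the SQL above, last_value is applied to each column within each id partition (id being the primary key), ordered by up_time, to pick the latest value; passing TRUE as the second argument skips NULL values. The window function row_number then ranks the rows of each id by up_time in descending order, and keeping only the first row yields one merged record per id.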
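
To see what the window frame changes, the following sketch computes both variants side by side for the `age` column, without the final row filter. It is only an illustration: the inline rows are the id = 1 rows from the example, and the column aliases `age_running` and `age_final` are names introduced here, not part of the original queries.

```sql
-- Illustrative comparison of the default frame and the whole-partition frame.
-- age_running / age_final are hypothetical aliases used only in this sketch.
SELECT id
      ,up_time
      ,age
      -- Default frame: from the start of the partition up to the current row.
      ,last_value(age, TRUE) OVER (PARTITION BY id ORDER BY up_time) AS age_running
      -- Explicit frame: every row of the partition.
      ,last_value(age, TRUE) OVER (PARTITION BY id ORDER BY up_time
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS age_final
FROM
    (select 1 as id, 'a' as age, 202301 as up_time
     union all
     select 1 as id, 'b' as age, 202302 as up_time
     union all
     select 1 as id, 'c' as age, 202303 as up_time
     union all
     select 1 as id, cast(null as string) as age, 202304 as up_time
    ) t
ORDER BY up_time
;
```

With NULLs skipped, `age_running` should read a, b, c, c from the oldest row to the newest, while `age_final` should be c on every row. The two columns agree on the newest row, which is the only row kept after filtering on row_number. That is why both versions of the query return the same merged record; the explicit frame matters only when the merged values are needed on every row.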


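### `LAST_VALUE` and `IGNORE NULLS` in Hive

Window functions give the Hive query language strong analytical capabilities. `LAST_VALUE` is a commonly used window function that returns the last value within the specified window.

#### Basic `LAST_VALUE` syntax

```sql
LAST_VALUE(expression) [IGNORE NULLS] OVER (window_specification)
```

- `expression`: a column name or any other computed expression.
- `[IGNORE NULLS]`: optional; when present, NULL values are skipped and the most recent non-NULL value is returned. Whether this keyword form is accepted depends on the Hive version; the `last_value(expression, TRUE)` form used in the queries above has the same effect.
- `OVER (window_specification)`: defines how the data is partitioned and ordered, and which rows the window frame covers.

#### Example

Suppose there is a sales table `sales_records` with the fields `id`, `sale_date`, and `amount`:

```sql
SELECT
    id,
    sale_date,
    amount,
    LAST_VALUE(amount) IGNORE NULLS OVER (
        PARTITION BY id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS last_non_null_amount
FROM sales_records;
```

Within each id partition, this statement orders the rows by date ascending and returns the latest non-NULL sales amount up to and including the current row[^1].

To find the last valid record across the whole partition instead, adjust the window definition:

```sql
SELECT
    id,
    sale_date,
    amount,
    LAST_VALUE(amount) IGNORE NULLS OVER (
        PARTITION BY id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS final_valid_amount_in_partition
FROM sales_records;
```

The key point is that the window range (`ROWS BETWEEN ...`) controls which rows are considered.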
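
For Hive versions that do not accept the `IGNORE NULLS` keyword, the same whole-partition lookup can be sketched with the boolean second argument used in the earlier queries. This assumes the same hypothetical `sales_records` table with `id`, `sale_date`, and `amount`; treat it as a sketch rather than a tested statement for any particular Hive release.

```sql
-- Sketch: boolean-argument form of the whole-partition lookup.
-- sales_records and its columns are the hypothetical table from the example above.
SELECT
    id,
    sale_date,
    amount,
    -- TRUE asks last_value to skip NULLs
    last_value(amount, TRUE) OVER (
        PARTITION BY id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS final_valid_amount_in_partition
FROM sales_records;
```

Every row of an `id` should then carry that id's latest non-NULL `amount`, or NULL when the partition has no non-NULL amount at all.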