hive系统函数collect_list和collect_set的应用

本文分析了Spark SQL中collect_list和collect_set函数在处理数据时的区别,collect_list返回的是不重复的集合,而collect_set则自动去除重复项。通过案例展示,介绍了如何利用它们分别获取用户观看视频列表,如张三丰和张无忌的观看记录。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

collect_list:收集数据,返回是一个集合,集合中元素不去重
collect_set:收集数据,返回是一个集合,集合中元素去重
结合concat_ws函数返回行转列的结果

案例1 collect_list:

spark-sql> with test1 as (
         > select '张三丰' as user_name,'向往的生活' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'极限挑战' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'今夜百乐门' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'向往的生活' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'极限挑战' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'王牌对王牌' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'觉醒年代' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'欢乐喜剧人' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'向往的生活' as video_name,'2021-07-28' as dateline)
         > 
         > select 
         >   user_name,
         >   dateline,
         >   concat_ws(',',collect_list(video_name)) as video_list
         > from test1
         > group by user_name,dateline;
张无忌  		2021-07-28      王牌对王牌,向往的生活,觉醒年代,欢乐喜剧人               
张三丰		2021-07-28		今夜百乐门,极限挑战,向往的生活,极限挑战,向往的生活
Time taken: 17.38 seconds, Fetched 2 row(s)

案例2 collect_set:

spark-sql> with test1 as (
         > select '张三丰' as user_name,'向往的生活' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'极限挑战' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'今夜百乐门' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'向往的生活' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张三丰' as user_name,'极限挑战' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'王牌对王牌' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'觉醒年代' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'欢乐喜剧人' as video_name,'2021-07-28' as dateline
         > union all 
         > select '张无忌' as user_name,'向往的生活' as video_name,'2021-07-28' as dateline)
         > 
         > select 
         >   user_name,
         >   dateline,
         >   concat_ws(',',collect_set(video_name)) as video_list
         > from test1
         > group by user_name,dateline;
张三丰		2021-07-28		今夜百乐门,极限挑战,向往的生活
张无忌		2021-07-28		欢乐喜剧人,向往的生活,王牌对王牌,觉醒年代
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值