collect_set
- 作用:把一个字段中的值组合到一个集合中,搭配where(筛选)、group by使用;
- 用途:可将非group by字段进行转换,使其可以查询;
- 示例:
-- id:用户;time:登陆日期
select id,collect_set(time) as t from t_action_login where time<='20150906' group by id
该查询结果如下:
123@163.com | ["20150620","20150619"] |
| abc@163.com | ["20150816"] |
| cde@qq.com | ["20150606","20150608","20150607","20150609","20150613","20150610","20150616","20150615"] |
| 789@sohu.com | ["20150827","20150623","20150627","20150820","20150823","20150612","20150717"] |
| 987@163.com | ["20150701","20150829","20150626","20150625","20150726","20150722","20150629","20150824","20150716","20150 |
| ddsf@163.com | ["20150804","20150803","20150801","20150809","20150807","20150806","20150905","20150904","20150730","20150 |
| 182@163.com |["20150803","20150801","20150809","20150808","20150805","20150806","20150906","20150904","20150730","20150 |
| 22225@163.com | ["20150604","20150609","20150622","20150827","20150625","20150620","20150613","20150610","20150614","20150 |
| 18697@qq.com | ["20150902"] |
| 1905@qq.com | ["20150709"]
该结果显示每个用户在2015年9月6日之前的全部登陆时间
- 场景应用:在hive中求出一个数据表中在2015年9月6日首次登陆的人数
对应SQL语句:
select count(a.id) from (select id,collect_set(time) as t from t_action_login where time<='20150906' group by id) as a where size(a.t)=1 and a.t[0]='20150906'
注意:size(a.t)=1 and a.t[0]='20150906'
参考:https://blog.youkuaiyun.com/liyantianmin/article/details/48262109
group by
- hive不允许直接查询访问非group by的字段,错误示例:
select name,sex from people group by sex;
该语句会报错,因为name是非group by字段,解决方法如下:
- 只查询group by字段:
select sex from people group by sex;
- 对于非group by字段,只通过聚合函数查询:
select sex,count(name) from people group by sex;
- 对于非group by字段,通过collect_set等函数转换后查询:
select sex,collect_set(name)[0] as name from people group by sex;
注:可以用原字段名称;
grouping sets
- 用途:可以按照多个标准进行分组,增强group by的功能;
- 示例:查询每个部门每类工作平均工资、每个部门的平均工资及整个企业中每个雇员的平均工资
select department_id,job_id,avg(salary)
from hr.employees
group by grouping sets ((department_id,job_id),department_id,());
-- ()表示对整体进行一次聚类
-- 也可以写成:
select department_id,job_id,avg(salary)
from hr.employees
group by department_id,grouping sets ((department_id,job_id),());
limit
- 用途:限制查询/返回的条数;
- 用法:
- 指定初始位置:
limit 初始位置,条数
- 不指定初始位置:
limit 条数
- 由offset指定初始位置:
limit 条数 offset 初始位置
- 指定初始位置:
- 说明:
- 初始位置:下标(从0开始),表示从该位置开始返回;
- 条数:表示返回的记录数
参考:https://blog.youkuaiyun.com/qq_33801641/article/details/107196036