Hive开发问题汇总

最新推荐文章于 2024-12-29 07:11:49 发布

原创最新推荐文章于 2024-12-29 07:11:49 发布 · 2.4k 阅读

9 ·

CC 4.0 BY-SA版权

Hive 专栏收录该内容

1 篇文章

订阅专栏

本文详细解析了Hive与Presto在数据处理上的高级技巧，包括groupby操作、时间戳转换、随机取数等。并对比了二者在json数据处理、时间戳转换等功能上的差异。

文章目录

一、group by操作后将其他字段串接
二、求collect_set()数组内的大小
三、group by 操作时ParseException line 7:22 missing ) at ',' near ''
四、hive和presto的一些对比
五、hive和presto的转换时间戳对比
六、hive随机取数
七、hive中的like语句无法匹配下划线
八、海量数据取差集
九、truncate partition、drop partition 区别
十、hive中去重，并保留唯一值
十一、identifiers must not start with a digit； surround the identifier with double quotes
十二、sql做left join 操作时，on两边都是null值时，所有数据被过滤
十三、hive中的行转列和列转行

一、group by操作后将其他字段串接

select uid,concat_ws('|', collect_set(device)) from tmp_test group by uid;

collect_set 是 Hive 内置的一个聚合函数, 它返回一个消除了重复元素的对象集合, 其返回值类型是 array
Hive group by操作后将其他字段串接

mysql中使用的是group_concat(字段，‘分隔符’)

//第一种是正确用法
select id,group_concat(字段 separator ';') from aa group by id;  

//该种方法放回的分隔符会带上,分隔符，最终分隔符是;,
select id,group_concat(字段,';') from aa group by id; 

--- 默认的分隔符是","
select id,group_concat(name separator ';') from tablexxx group by id;
--- 如果要对name去重
select id,group_concat(distinct name) from tablexxx group by id;
--- 如果还想对name排序
select id,group_concat(name order by name desc) from tablexxx group by id;

https://www.cnblogs.com/fpcing/p/10452245.html
https://www.iteye.com/blog/amosjiayou-2274542
https://blog.youkuaiyun.com/bitcarmanlee/article/details/101783409

presto中使用的是array_distinct或者array_join(array_agg(字段), ‘,’)

https://blog.youkuaiyun.com/qq_35531549/article/details/90379760
https://blog.youkuaiyun.com/butterfly1009/article/details/54016746?utm_source=blogxgwz5

array_join(array_agg(url), ',', 'null') as urls,

二、求collect_set()数组内的大小

用size( collect_set(***) ) 求内部数组大小

三、group by 操作时ParseException line 7:22 missing ) at ‘,’ near ‘’

HIve自身的bug，需要在每个子查询后面加上别名并且 group by 后面多个字段时，第一个字段不可以在表名，建议去掉括号

select a,b,c
  from (select a,b,c
          from (select a,b,c
                  from table) a
         where rno = 1) tb
 group by a;
 
SELECT tab1.a,
       b, d
       SUM(tab1.c)
FROM tab1 join tab2 on tab1.a = tab2.b
GROUP BY tab1.a,b, d

GROUPING SETS ((b, tab1.a, d))

https://blog.youkuaiyun.com/syfly007/article/details/18225327
https://blog.youkuaiyun.com/fengzheku/article/details/80599320

四、hive和presto的一些对比

处理json数据时，presto用的是json_extract_scalar，hive用的是get_json_object
二者的取交集都是INTERSECT，可以参考https://blog.youkuaiyun.com/youzi_yun/article/details/97621355

https://blog.youkuaiyun.com/u012535605/article/details/83857079
https://blog.youkuaiyun.com/whathellll/article/details/90671182
https://blog.youkuaiyun.com/circle2015/article/details/101372194

五、hive和presto的转换时间戳对比

标准时间格式——时间戳

hive:

select unix_timestamp(cast (‘2017-08-30 10:36:15’ as timestamp))

presto:

select to_unixtime(cast (‘2017-08-30 10:36:15’ as timestamp))

时间戳——标准时间格式

presto:

select format_datetime(from_unixtime(1510284058),‘yyyy-MM-dd HH:mm:ss’)

hive:

select from_unixtime(1323308943123,‘yyyy-MM-dd HH:mm:ss’)

其中会遇到将字符串的时间戳转换为数值类型，hive和presto中都提供了cast转换

cast(value AS type) type显式转换一个值的类型。可以将varchar类型的值转为数字类型，反过来转换也可以。 try_cast(value AS type) type与cast类似，不过，如果转换失败会返回null，这个只有presto有

另外需要注意的是
hive中的int类型是就是int，而presto中是包装类型Integer，如果cast的type写错也会报错

六、hive随机取数

distribute by rand()sort by rand()limit1000

https://www.jianshu.com/p/818e45384094

七、hive中的like语句无法匹配下划线

当无法匹配下划线的字段时
应该使用“`”对字段进行引用
https://www.jianshu.com/p/3c982de0908d
无法匹配like子句内容时
此时需要对like后面的字符串添加转义字符，比如like ‘%_abc_%’
于是sql判断的是包含_abc_内容的数据
https://blog.youkuaiyun.com/huangli1466384630/article/details/80320402
https://www.cnblogs.com/cat-fish/p/4445025.html

八、海量数据取差集

较优化的方法是类似如下，取join之后，取右表中的值为空的结果数据，速度提升N倍

select count(*) from a left join b on a.id=b.id where a.is_del=1 and b.id is null

https://mengkang.net/361.html

九、truncate partition、drop partition 区别

功能：
两者都用于删除数据，即将对应的partition的数据文件删除。

不同点：
truncate 只删除数据文件，保存在mysql中的metadata不会被删除。
drop partition 只删除数据文件且删除在mysql中的metadata。

TRUNCATE TABLE food PARTITION (dt='20151219');
ALTER TABLE food DROP IF EXISTS PARTITION (dt='20151219');

https://blog.youkuaiyun.com/weixin_34112181/article/details/92048840

十、hive中去重，并保留唯一值

使用row_number()函数
row_number()over (partition by tran_idorder by timestamp desc) num 取num=1 的
意思是先根据tran_id进行分组，并在分组内部按timestamp 降序排序，row_number()函数计算的值就表示某个tran_id组内部排序后的顺序编号（该编号在一个组内是连续并且唯一的) 。
所以最后直接去每个分组内的第一个（num=1）即可。

https://blog.youkuaiyun.com/lsxy117/article/details/50387395/

同时，当group by 分组后，取前几的时候，也可以用上述方法
https://blog.youkuaiyun.com/u014156013/article/details/84233148

十一、identifiers must not start with a digit； surround the identifier with double quotes

或许你尝试了很多遍也没有解决这个问题，甚至没有看懂这句话，
没关系，现在你懂了

因为你在命名的时候用的是“7days“这样数字开头英文结尾的，sql里面可以别名里含有下划线和数字，但一定不能用数字开头取别名，改个名字就好
https://blog.youkuaiyun.com/weixin_43792309/article/details/109901049

十二、sql做left join 操作时，on两边都是null值时，所有数据被过滤

当做join操作时，如果出现null值，可能会导致没有数据，此时需要将null字段设置为默认值
比如 presto中 COALESCE(字段,‘null’)
https://blog.youkuaiyun.com/kaaosidao/article/details/78563904

十三、hive中的行转列和列转行

列转行用列转行用 lateral view explode

select SNO, name, add_DEPART
from students_info si 
lateral  view explode(split(si.DEPART,','))  b AS add_DEPART

https://zhuanlan.zhihu.com/p/59351085
https://www.cnblogs.com/hdc520/p/11133157.html