笔者使用hive 语句的过程中发现一些问题,故做简单分析
1.问题起因
hive 查询语句:
主要逻辑:A表LEFT JOIN B表 然后按照A表中某个字段排序后取指定limit的数据;
错误的执行语句如下:
select A.id, A.tag,A.is_jingpin,A.is_time_sensitive,A.first_publish_time from(
select data['id'] as id,
data['intf_qiwen_vertical_level1_category_tag_name'] as tag,
data['is_video_jingpin_mark'] as is_jingpin,
data['is_small_video_time_sensitive'] as is_time_sensitive,
data['first_publish_time']as first_publish_time,
data['uploader_vertical_iqiyihao_internal_level']as internal_level
from recindexing_sv_corpus_l1_fields
where cast(data['business_type'] as int)& 2 > 0
and(data['is_video_jingpin_mark']='false' or data['is_video_jingpin_mark']='')
and cast(data['create_time']as bigint)>=(unix_timestamp(current_timestamp,'yyyy-MM-dd HH:mm:ss')-90*24*60*60)
and (not data['intf_qiwen_vertical_category_tag_name_v2'] like '%无聊日常%')
and (not data['intf_qiwen_vertical_category_tag_name_v2']like '%儿童%')
and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)>=0
and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)<=0.1
and cast(nvl(data['intf_qiwen_vertical_cover_quality_score_v2'],0)as double)>=0.85
and cast(nvl(data['intf_qiwen_vertical_content_quality_score_v2'],0)as double)>=0.95
and cast(nvl(data['intf_qiwen_vertical_cover_quality_resolution_score'],0)as double)>=0.9
)A
left outer join (
select id from sv_boutique_candidate_produced)B
on A.id = B.id
where B.id is null
ORDER by A.internal_level desc limit 2500;
错误信息:
FAILED: SemanticException [Error 10004]: Line 25:28 Invalid table alias or column reference 'internal_level': (possible column names are: id, tag, is_jingpin, is_time_sensitive, first_publish_time)
修正后的hive 语句如下:
select A.id, A.tag,A.is_jingpin,A.is_time_sensitive,A.first_publish_time,A.internal_level from(
select data['id'] as id,
data['intf_qiwen_vertical_level1_category_tag_name'] as tag,
data['is_video_jingpin_mark'] as is_jingpin,
data['is_small_video_time_sensitive'] as is_time_sensitive,
data['first_publish_time']as first_publish_time,
data['uploader_vertical_iqiyihao_internal_level']as internal_level
from recindexing_sv_corpus_l1_fields
where cast(data['business_type'] as int)& 2 > 0
and(data['is_video_jingpin_mark']='false' or data['is_video_jingpin_mark']='')
and cast(data['create_time']as bigint)>=(unix_timestamp(current_timestamp,'yyyy-MM-dd HH:mm:ss')-90*24*60*60)
and (not data['intf_qiwen_vertical_category_tag_name_v2'] like '%无聊日常%')
and (not data['intf_qiwen_vertical_category_tag_name_v2']like '%儿童%')
and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)>=0
and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)<=0.1
and cast(nvl(data['intf_qiwen_vertical_cover_quality_score_v2'],0)as double)>=0.85
and cast(nvl(data['intf_qiwen_vertical_content_quality_score_v2'],0)as double)>=0.95
and cast(nvl(data['intf_qiwen_vertical_cover_quality_resolution_score'],0)as double)>=0.9
)A
left outer join (
select id from sv_boutique_candidate_produced)B
on A.id = B.id
where B.id is null
ORDER by A.internal_level desc limit 2500;
主要修改是在select 语句中加入了A.internal_level
2 问题分析
使用mysql 进行类似的查询
select A.feed_id,A.name,A.impression_goal from (
select feed_id,name,duration,impression_goal, create_time as ct
from guaranteed_delivery_plan_conf)A
left join (
select feed_id,conf_id from guaranteed_delivery_plan_item)B
on A.feed_id = B.feed_id
where B.feed_id >3809986964540500
order by A.ct desc
limit 10;
mysql 类似语句可以正常执行,但是hive 语句缺无法正常运行;
3 分析缘由
hive ORDER BY 会做产出一个reducer 做全局排序故需要先查询出相应的字段
继续分析hive的order by,sort by,distribute by,cluster by
3.1order by :
select col,col2...
from tableName
where condition
order by col1,col2 [asc|desc
(1):order by后面可以有多列进行排序,默认按字典排序。
(2):order by为全局排序。
(3):order by需要reduce操作,且只有一个reduce,无法配置(因为多个reduce无法完成全局排序)。
图示:
3.2 sort by
sort by不是全局排序,其在数据进入reducer前完成排序,因此,如果用sort by进行排序,并且设置mapred.reduce.tasks>1,则sort by只会保证每个reducer的输出有序,并不保证全局有序。sort by不同于order by,它不受hive.mapred.mode属性的影响,sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序
图示:
3.3 distribute by
distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列,对应reduce的个数进行分发,默认是采用hash算法
图片描述
3.4 cluster by
cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒序,不能指定排序规则为ASC或者DESC。
图示:
参考文章:
1.https://zhuanlan.zhihu.com/p/93747613
2.https://blog.youkuaiyun.com/lzm1340458776/article/details/43306115