通常,在hive中对于模糊匹配关联方面的查询效率是非常低的,如or 关联,基于like的模糊匹配关联,对于此类问题往往需要找到好的优化方案。
对于join关联时涉及多个or连接,本次优化方案转化为union 或 union all的实现形式。
1、需求
有一天,旁边的做数据分析的同事,发我一个sql语句,说跑了15min多了,查询进度条一直没有进度,叫我帮忙优化一下,语句如下:
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,reg.order_no as `订单号`
,reg.hos_name as `医院名称`
,reg.first_dept_name as `一级科室`
,reg.second_dept_name as `二级科室`
,reg.doctor_name as `医生名称`
,split_part(split_part(reg.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,reg.order_create_time as `挂号时间`
,reg.treatment_dt as `就诊时间`
,reg.order_status as `订单状态`
,reg.product_price as `订单金额`
,reg.patient_name as `就诊卡姓名`
,reg.patient_cred_no as `身份证号`
,reg.patient_phone as `预留手机号`
,reg.order_ip as `ip`
from
(
select
order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
)reg --15837934条数据
left join
(select
u_type,val,op_log
from dw.fact_black_list
where id is not null
) list --402422条数据
on case when list.u_type='PHONE' then val else '非' end=reg.patient_phone or
case when list.u_type='IP' then val else '非' end=reg.order_ip or
case when list.u_type ='CARD_RISK' then val else '非' end=reg.patient_card_no or
case when list.u_type ='UID' then val else '非' end=reg.user_id or
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;
2、问题分析
看到以上逻辑后,第一感觉就是在join关联时,严重影响了查询效率,并且其中涉及多个join on ... or ... or 关联的情形。
于是进行了尝试,验证上边的假设:
(1) 首先去掉or后边的所有条件,只保留on匹配,结果很快就跑完了;
(2) 当保留on 匹配,外加1个or匹配时,出结果的速度明显比上边慢下来了;
(3) 当保留on 匹配,外加2个or匹配时,sql查询根本跑不动,查询进度条就一直停滞不前进了。
于是验证了上边的猜想,是由于join on ... or ... or 模糊匹配关联时,or条件导致的查询速度太慢,接下来进行hive查询语句的优化。
3、进行优化
思路:需要避免上边join on ... or ... or 模糊匹配的情况,需要把其拆分开。并且把case...when尽量不要放在on后边。
拆分开后,使用union的方式,查询出数据用了 57秒 完成。
with base_entity as( --15837934条数据
select
order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
)
,
base_list as --402422条数据
(select
u_type,val,op_log
,case when u_type='PHONE' then val else '非' end aa
,case when u_type='IP' then val else '非' end bb
,case when u_type ='CARD_RISK' then val else '非' end cc
,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list
where id is not null
)
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.aa=entity.patient_phone
union
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.bb=entity.order_ip
union
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.cc=entity.patient_card_no
union
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.dd=entity.user_id
;
拆分开后,使用union all的方式,查询出数据只用了 11秒 就完成了。
with base_entity as( --15837934条数据
select
order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status
,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
)
,
base_list as --402422条数据
(select
u_type,val,op_log
,case when u_type='PHONE' then val else '非' end aa
,case when u_type='IP' then val else '非' end bb
,case when u_type ='CARD_RISK' then val else '非' end cc
,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list
where id is not null
)
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.aa=entity.patient_phone
union all
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.bb=entity.order_ip
union all
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.cc=entity.patient_card_no
union all
select
list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.dd=entity.user_id
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;
但是上述的方式显示显得很啰嗦,如果后面需要匹配的or比较多,比如有n个的时候,那么同样的逻辑就要union all n-1次代码看起来相当繁琐,且性能较低。
针对以上问题,也可以采用一种优雅的实现方式:我们知道采用or连接的时候,无非就是base_entity表中的字段在base_list表中匹配到了就成功,对于这种需要匹配就成功的连接方式,我们自然想到hive中高效的实现方式locate()函数,对于该函数的理解,可以具体参考如下文章:
https://blog.youkuaiyun.com/godlovedaniel/article/details/125126193
hive中字符串查找函数 instr 和 locate_奔跑者-辉的博客-优快云博客
4、小结
要避免hive中 join 基于or 形式模糊匹配关联,可以借助于union all的实现方式,或借助于locate()模糊匹配的方法代码简洁优雅,在hive中用途较广,读者也需要务必掌握。