Hive使用lateral view和explode处理嵌套json

本文介绍如何在Hive中解析嵌套的多层JSON数据,以提取所有brandName。探讨了使用get_json_object函数的局限性,提出了通过自定义UDF函数和使用Hive SQL的lateral view与explode函数的解决方案。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

最近有个需求,需要解析数仓中某张表的某个字段,该字段为Json,且为嵌套的多层Json,Json数据格式化之后如下:
在这里插入图片描述
由于是生产环境上的数据,因此对于某些value以xxx进行替代,并不影响sql的编写
样例:

[{"categoryId":"9","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]},{"categoryId":"5","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]},{"categoryId":"10","categoryName":"xxx","brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}]}]
[{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"516","brandName":"xxx"},{"brandId":"650","brandName":"xxx"},{"brandId":"586","brandName":"xxx"},{"brandId":"630","brandName":"xxx"}],"categoryId":"542","categoryName":"xxx"},{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"650","brandName":"xxx"}],"categoryId":"7","categoryName":"xxx"},{"brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}],"categoryId":"10","categoryName":"xxx"}]

需求是需要提取出每个这种json中所有的brandName

思路1:
使用hive自带的get_json_object函数进行处理:

select get_json_object(brand_control,"$[0].brandList"),
       get_json_object(get_json_object(brand_control,"$[0].brandList"), "$[0].brandName")
from 库名.表名
where dayid='20190729'

尝试之后,发现最终的效果只能是取出其中一个brandname,并不能取出全部

思路2:
既然使用自带的json处理函数不能满足,那么就自己去开发一个udf函数,思路比较简单,只要将读入的每个json进行解析,一个for循环,将里面的brandName依次拿出来就行

思路3:
其实Hive SQL也可以一行sql直接搞定,没必要写什么udf,思路如下:

  1. 将json 以brand_name进行切分成多行 使用lateral view和explode进行配合处理
  2. 对每一行数据进行处理 提取出品牌名
  3. 剔除为数字的情况

SQL如下:

select 
  seller_id,
  collect_set(split(split(brand_name,'":"')[1],'"')[0]) as brand_name
from 
  (select 
    seller_id,
    brand_name,
    brand_control
  from 库名.表名
  lateral view 
    explode(
      split(brand_control,'brandName')
      ) adTable as brand_name
  where dayid = '20190729'
  ) a
where (split(split(brand_name,'":"')[1],'"')[0] REGEXP '[^0-9.]')!=0  --剔除brand_name为数字的情况
group by seller_id

这样,对应的一个seller_id就将对应的所有branName给全部取了出来,以一个list的形式拼接在了一起

with tmp_table as (select update_time, e1 as integratedindex, e2 as pm25, e3 as pm10, e4 as co, e5 as no2, e6 as so2, e7 as o3, e8 as aqi, e9 as tsp from dc_dm_air_city_air_data_national_ranking_d t lateral view explode_json_array_json(integrated_index_list) tmp1 as e1 lateral view explode_json_array_json(pm25_list) tmp2 as e2 lateral view explode_json_array_json(pm10_list) tmp3 as e3 lateral view explode_json_array_json(co_list) tmp4 as e4 lateral view explode_json_array_json(no2_list) tmp5 as e5 lateral view explode_json_array_json(so2_list) tmp6 as e6 lateral view explode_json_array_json(o3_list) tmp7 as e7 lateral view explode_json_array_json(aqi_list) tmp8 as e8 lateral view explode_json_array_json(tsp_list) tmp9 as e9 where air_region_key = #{ranking} and update_time BETWEEN date_format('${sTime}', '%Y-%m-%d') AND date_format('${eTime}', '%Y-%m-%d')), yoy_tmp_table as (select update_time, e1 as integratedindex, e2 as pm25, e3 as pm10, e4 as co, e5 as no2, e6 as so2, e7 as o3, e8 as aqi, e9 as tsp from dc_dm_air_city_air_data_national_ranking_d t lateral view explode_json_array_json(integrated_index_list) tmp1 as e1 lateral view explode_json_array_json(pm25_list) tmp2 as e2 lateral view explode_json_array_json(pm10_list) tmp3 as e3 lateral view explode_json_array_json(co_list) tmp4 as e4 lateral view explode_json_array_json(no2_list) tmp5 as e5 lateral view explode_json_array_json(so2_list) tmp6 as e6 lateral view explode_json_array_json(o3_list) tmp7 as e7 lateral view explode_json_array_json(aqi_list) tmp8 as e8 lateral view explode_json_array_json(tsp_list) tmp9 as e9 where air_region_key = #{ranking} and update_time between date_sub(date_format('${sTime}', '%Y-%m-%d'), interval 1 YEAR) and date_sub(date_format('${eTime}','%Y-%m-%d'), interval 1 YEAR) )优化一下该sql查询,现在太慢了
03-11
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值