1. 目的
从string类型的字段中解析并汇总每种category类型的总amount
2. 素材
表名:test_table
order_no hotel_seq discount_detail
D8662EF4E 10212527 NULL
45C024849 01801434 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":10.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}
268A58R09 01224011 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}
3. UDF编写
注意:hive环境中的NULL值传递给python的形态是\N
#!/usr/bin/env python
# -- coding: utf-8 --
import sys
import json
reload(sys);
sys.setdefaultencoding('utf-8');
for line in sys.stdin:
line_sp = line.strip().split('\t')
order_no = line_sp[0]
hotel_seq = line_sp[1]
#判断解析字段是否为空
if line_sp[2] != '\N':
str_json = line_sp[2].split('[')[1].split(']')[0]
str_json_sp=str_json.replace('},{','}&{').split('&')
marketing = 0.0
pricing = 0.0
other = 0.0
#累加各类别的amount
for single_json in str_json_sp:
djson = json.loads(single_json)
if djson['category'] == '营销':
marketing = marketing+djson['amount']
elif djson['category'] == '定价':
pricing = pricing + djson['amount']
elif djson['category'] == '其他':
other = other + djson['amount']
else:
marketing = 0.0
pricing = 0.0
other = 0.0
print order_no,'\t',hotel_seq,'\t',marketing,'\t',pricing,'\t',other
4. UDF使用
测试:
#可以将待处理字段复制在文本中,在shell环境中预先测试
cat test20181109a.txt|python test20181109.py
结果示例:
0B0CFED94 91219624 0.0 11.0 0.0
D83EC96F4 91253482 4.0 0.0 0.0
C11D6127E 90932407 0.0 0.0 0.0
A54AE774E 10212527 0.0 0.0 0.0
sql调用:
#调用编写的UDF(本地路径非HDFS)
add FILE /tmp/parseUDF.py
#使用UDF解析
SELECT
TRANSFORM(order_no, hotel_seq,discount_detail) USING 'python test20181109.py' AS (order_no, hotel_seq, marketing, pricing, other)
FROM test_table
LIMIT 10;
结果示例:
45C024849 01801434 10.0 0.0 0.0
268A8A6B9 01224011 0.0 0.0 0.0
E071C63CA 01224011 20.0 42.0 0.0
7C42DC06E 01105013 0.0 10.0 0.0
6B04415D6 00701122 0.0 0.0 0.0