hive python udf_Python编写Hive UDF-优快云博客

1. 目的

从string类型的字段中解析并汇总每种category类型的总amount

2. 素材

表名：test_table

order_no hotel_seq discount_detail

D8662EF4E 10212527 NULL

45C024849 01801434 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":10.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}

268A58R09 01224011 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}

3. UDF编写

注意：hive环境中的NULL值传递给python的形态是\N

#!/usr/bin/env python

# -- coding: utf-8 --

import sys

import json

reload(sys);

sys.setdefaultencoding('utf-8');

for line in sys.stdin:

line_sp = line.strip().split('\t')

order_no = line_sp[0]

hotel_seq = line_sp[1]

#判断解析字段是否为空

if line_sp[2] != '\N':

str_json = line_sp[2].split('[')[1].split(']')[0]

str_json_sp=str_json.replace('},{','}&{').split('&')

marketing = 0.0

pricing = 0.0

other = 0.0

#累加各类别的amount

for single_json in str_json_sp:

djson = json.loads(single_json)

if djson['category'] == '营销':

marketing = marketing+djson['amount']

elif djson['category'] == '定价':

pricing = pricing + djson['amount']

elif djson['category'] == '其他':

other = other + djson['amount']

else:

marketing = 0.0

pricing = 0.0

other = 0.0

print order_no,'\t',hotel_seq,'\t',marketing,'\t',pricing,'\t',other

4. UDF使用

测试：

#可以将待处理字段复制在文本中，在shell环境中预先测试

cat test20181109a.txt|python test20181109.py

结果示例：

0B0CFED94 91219624 0.0 11.0 0.0

D83EC96F4 91253482 4.0 0.0 0.0

C11D6127E 90932407 0.0 0.0 0.0

A54AE774E 10212527 0.0 0.0 0.0

sql调用：

#调用编写的UDF(本地路径非HDFS)

add FILE /tmp/parseUDF.py

#使用UDF解析

SELECT

TRANSFORM(order_no, hotel_seq,discount_detail) USING 'python test20181109.py' AS (order_no, hotel_seq, marketing, pricing, other)

FROM test_table

LIMIT 10;

结果示例：

45C024849 01801434 10.0 0.0 0.0

268A8A6B9 01224011 0.0 0.0 0.0

E071C63CA 01224011 20.0 42.0 0.0

7C42DC06E 01105013 0.0 10.0 0.0

6B04415D6 00701122 0.0 0.0 0.0