hive python udf_Python编写Hive UDF

1. 目的

从string类型的字段中解析并汇总每种category类型的总amount

2. 素材

表名:test_table

order_no hotel_seq discount_detail

D8662EF4E 10212527 NULL

45C024849 01801434 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":10.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}

268A58R09 01224011 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}

3. UDF编写

注意:hive环境中的NULL值传递给python的形态是\N

#!/usr/bin/env python

# -- coding: utf-8 --

import sys

import json

reload(sys);

sys.setdefaultencoding('utf-8');

for line in sys.stdin:

line_sp = line.strip().split('\t')

order_no = line_sp[0]

hotel_seq = line_sp[1]

#判断解析字段是否为空

if line_sp[2] != '\N':

str_json = line_sp[2].split('[')[1].split(']')[0]

str_json_sp=str_json.replace('},{','}&{').split('&')

marketing = 0.0

pricing = 0.0

other = 0.0

#累加各类别的amount

for single_json in str_json_sp:

djson = json.loads(single_json)

if djson['category'] == '营销':

marketing = marketing+djson['amount']

elif djson['category'] == '定价':

pricing = pricing + djson['amount']

elif djson['category'] == '其他':

other = other + djson['amount']

else:

marketing = 0.0

pricing = 0.0

other = 0.0

print order_no,'\t',hotel_seq,'\t',marketing,'\t',pricing,'\t',other

4. UDF使用

测试:

#可以将待处理字段复制在文本中,在shell环境中预先测试

cat test20181109a.txt|python test20181109.py

结果示例:

0B0CFED94 91219624 0.0 11.0 0.0

D83EC96F4 91253482 4.0 0.0 0.0

C11D6127E 90932407 0.0 0.0 0.0

A54AE774E 10212527 0.0 0.0 0.0

sql调用:

#调用编写的UDF(本地路径非HDFS)

add FILE /tmp/parseUDF.py

#使用UDF解析

SELECT

TRANSFORM(order_no, hotel_seq,discount_detail) USING 'python test20181109.py' AS (order_no, hotel_seq, marketing, pricing, other)

FROM test_table

LIMIT 10;

结果示例:

45C024849 01801434 10.0 0.0 0.0

268A8A6B9 01224011 0.0 0.0 0.0

E071C63CA 01224011 20.0 42.0 0.0

7C42DC06E 01105013 0.0 10.0 0.0

6B04415D6 00701122 0.0 0.0 0.0

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值