Hive TRANSFORM with Python

A simple example I wrote myself, for de-duplicating topic descriptions: the `desc` field in the table holds values like "a-b-a-b-b-c" that need de-duplicating.
The Python code is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

reload(sys)
sys.setdefaultencoding('utf8')

def quchong(desc):
    # split on '-' and drop duplicates (note: set() does not preserve order)
    a = desc.split('-')
    return '-'.join(set(a))

for line in sys.stdin:
    line = line.rstrip('\n')
    # columns arrive tab-separated; the third column is the desc field
    parts = line.split('\t')
    parts[2] = quchong(parts[2])
    print '\t'.join(parts)
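The same logic can be sketched in Python 3 for local testing before deploying to Hive; `dict.fromkeys` keeps the first occurrence of each token in order, which `set()` does not guarantee (the function names here are illustrative, not part of the original script):

```python
def dedupe_desc(desc):
    """Remove duplicate '-'-separated tokens, keeping first-seen order."""
    return '-'.join(dict.fromkeys(desc.split('-')))

def process_line(line):
    """Apply the dedupe to the third tab-separated column of a row."""
    fields = line.rstrip('\n').split('\t')
    fields[2] = dedupe_desc(fields[2])
    return '\t'.join(fields)

# feed a sample row through the same path the streaming script uses
print(process_line('1\tx\ta-b-a-b-b-c'))
```

Because duplicates are dropped in first-seen order, "a-b-a-b-b-c" always comes back as "a-b-c", whereas the `set()` version may emit the tokens in any order.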
  
What follows is reposted from elsewhere and goes into more detail.
II. Incrementing a field inside a Hive map (repost)
  
1. Create the table
  
hive> CREATE TABLE t3 (foo STRING, bar MAP<STRING,INT>)  
    > ROW FORMAT DELIMITED  
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','  
    > MAP KEYS TERMINATED BY ':'  
    > STORED AS TEXTFILE;  
OK  
  
   
  
2. The resulting schema
  
hive> describe t3;  
OK  
foo     string  
bar     map<string,int>  
  
   
  
3. Create test.txt
  
jeffgeng        click:13,uid:15  
  
   
  
4. Load test.txt
  
hive> LOAD DATA LOCAL INPATH 'test.txt' OVERWRITE INTO TABLE t3;  
Copying data from file:/root/src/hadoop/hadoop-0.20.2/contrib/hive-0.5.0-bin/bin/test.txt  
Loading data to table t3  
OK  
  
   
  
After loading, the data looks like this:
  
hive> select * from t3;  
OK  
jeffgeng        {"click":13,"uid":15}  
  
   
  
5. Querying a value out of the map
  
hive> select bar['click'] from t3;  
  
... a series of MapReduce jobs ...
  
OK  
13  
  
   
  
6. Write add_mapper.py
  
#!/usr/bin/python
import sys

for line in sys.stdin:
    line = line.strip()
    # columns arrive tab-separated; bar is the map serialized as a string
    foo, bar = line.split('\t')
    d = eval(bar)  # e.g. '{"click":13,"uid":15}' -> dict
    d['click'] += 1
    print '\t'.join([foo, str(d)])
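Since the map reaches the script as the JSON-like string that `select *` displays, `json.loads` is a safer way to parse it than `eval`, which will execute arbitrary expressions. A Python 3 sketch of the same step (the exact wire format Hive uses for complex types can vary by version, so treat the sample input as an assumption from the example above):

```python
import json

def increment_click(line):
    """Parse a tab-separated (foo, bar) row where bar is a JSON-style map,
    increment its 'click' value, and re-serialize the row."""
    foo, bar = line.rstrip('\n').split('\t')
    d = json.loads(bar)  # safer than eval() on untrusted input
    d['click'] += 1
    return '\t'.join([foo, json.dumps(d, separators=(',', ':'))])

print(increment_click('jeffgeng\t{"click":13,"uid":15}'))
```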
  
   
  
7. Run it in Hive
  
hive> CREATE TABLE t4 (foo STRING, bar MAP<STRING,INT>)  
    > ROW FORMAT DELIMITED  
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','  
    > MAP KEYS TERMINATED BY ':'  
    > STORED AS TEXTFILE;  
  
   
  
hive> add FILE add_mapper.py  
  
   
  
hive> INSERT OVERWRITE TABLE t4
    > SELECT  
    >   TRANSFORM (foo, bar)  
    >   USING 'python add_mapper.py'  
    >   AS (foo, bar)  
    > FROM t3;  
FAILED: Error in semantic analysis: line 1:23 Cannot insert into target table because column number/types are different t4: Cannot convert column 1 from string to map<string,int>.  
  
   
  
8. Why does this error occur? It seems add_mapper.py's output is a plain string, and Hive cannot recognize a map in that form. It turns out that the AS clause can force a type on each output column:
  
INSERT OVERWRITE TABLE t4  
SELECT  
  TRANSFORM (foo, bar)  
  USING 'python add_mapper.py'  
  AS (foo string, bar map<string,int>)  
FROM t3;  
  
   
  
9. At the same time, the Python script has to strip out the spaces, quotes, and curly braces left over from converting the dict back to a string
  
#!/usr/bin/python
import sys

for line in sys.stdin:
    line = line.rstrip('\n')
    foo, bar = line.split('\t')
    d = eval(bar)
    d['click'] += 1
    d['uid'] += 1
    # rebuild the map as text, dropping the spaces and quotes that
    # str(d) leaves in (note: the curly braces are NOT removed here)
    strmap = ''
    for x in str(d):
        if x in (' ', "'"):
            continue
        strmap += x
    print '\t'.join([foo, strmap])
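A cleaner approach than filtering characters out of `str(d)` is to serialize the dict directly into the `key:value,key:value` text form that the table's `COLLECTION ITEMS TERMINATED BY ','` and `MAP KEYS TERMINATED BY ':'` delimiters describe, with no braces at all. A Python 3 sketch under the same input-format assumption as above (helper names are illustrative):

```python
import json

def to_hive_map(d):
    """Serialize a dict into Hive's delimited-map text form: k1:v1,k2:v2."""
    return ','.join('{}:{}'.format(k, v) for k, v in d.items())

def bump_row(line):
    """Increment both counters and emit the row in delimited-map form."""
    foo, bar = line.rstrip('\n').split('\t')
    d = json.loads(bar)
    d['click'] += 1
    d['uid'] += 1
    return '\t'.join([foo, to_hive_map(d)])

print(bump_row('jeffgeng\t{"click":13,"uid":15}'))
```

Because the output contains no braces, quotes, or spaces, Hive's delimited-map parser can read every value back as an int.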
  
   
  
10. The result after running it

hive> select * from t4;
OK
jeffgeng        {"click":14,"uid":null}
Time taken: 0.146 seconds

Note that uid comes out null: the script removes spaces and quotes but not the curly braces, so Hive receives `uid:16}` and fails to parse `16}` as an int.
