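Here is a simple example I wrote to de-duplicate topic descriptions: the desc column of the table holds values such as "a-b-a-b-b-c", and the repeated items need to be removed. The Python (2.x) code is as follows:

#!/usr/bin/python
import sys
reload(sys)
sys.setdefaultencoding('utf8')

def quchong(desc):
    # split on '-', drop duplicates, join again
    # (note: set() does not preserve the original order of the items)
    a = desc.split('-')
    return '-'.join(set(a))

while True:
    line = sys.stdin.readline()
    if line == "":
        break
    line = line.rstrip('\n')
    # your process code here
    parts = line.split('\t')
    parts[2] = quchong(parts[2])   # desc is the third tab-separated column
    print "\t".join(parts)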
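Because `set()` returns the items in arbitrary order, the output of `quchong` can come back shuffled (for example "c-a-b"). If the first-seen order matters, something along the following lines may help. This is only a sketch in Python 3, not the script above; the names `dedup_desc` and `dedup_column` are made up for illustration, and the column index 2 is taken from the original script.

```python
def dedup_desc(desc, sep="-"):
    """Drop repeated items from a separator-joined string, keeping first-seen order."""
    # dict keys are unique and (since Python 3.7) keep insertion order
    return sep.join(dict.fromkeys(desc.split(sep)))


def dedup_column(line, index=2, sep="\t"):
    """Apply dedup_desc to one column of a tab-separated row."""
    parts = line.rstrip("\n").split(sep)
    parts[index] = dedup_desc(parts[index])
    return sep.join(parts)


print(dedup_desc("a-b-a-b-b-c"))  # a-b-c
```

In a streaming script, `dedup_column` would be called once per line read from `sys.stdin`.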
1. Create the table

hive> CREATE TABLE t3 (foo STRING, bar MAP<STRING,INT>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','
    > MAP KEYS TERMINATED BY ':'
    > STORED AS TEXTFILE;
OK
2. The resulting table

hive> describe t3;
OK
foo     string
bar     map<string,int>
3. Create test.txt (the two fields are separated by a tab character)

jeffgeng    click:13,uid:15
4. Load test.txt into the table

hive> LOAD DATA LOCAL INPATH 'test.txt' OVERWRITE INTO TABLE t3;
Copying data from file:/root/src/hadoop/hadoop-0.20.2/contrib/hive-0.5.0-bin/bin/test.txt
Loading data to table t3
OK
After loading, the table looks like this:

hive> select * from t3;
OK
jeffgeng    {"click":13,"uid":15}
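To make the role of the three delimiters concrete, here is a rough Python 3 sketch that splits a line of test.txt the way the table definition describes: tab between fields, ',' between map entries, ':' between a key and its value. The function `parse_row` is hypothetical and only mimics the layout; it is not how Hive itself is implemented.

```python
def parse_row(line, field_sep="\t", item_sep=",", key_sep=":"):
    """Split one text row into (foo, bar) following the t3 delimiters."""
    foo, raw_map = line.rstrip("\n").split(field_sep, 1)
    bar = {}
    for item in raw_map.split(item_sep):
        key, value = item.split(key_sep, 1)
        bar[key] = int(value)  # bar is declared as MAP<STRING,INT>
    return foo, bar


foo, bar = parse_row("jeffgeng\tclick:13,uid:15")
print(foo, bar)  # jeffgeng {'click': 13, 'uid': 15}
```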
5. A value in the map can be queried like this

hive> select bar['click'] from t3;

...a series of MapReduce jobs...

OK
13
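6. Write add_mapper.py

#!/usr/bin/python
import sys

for line in sys.stdin:
    line = line.strip()
    # Hive passes the columns to the script separated by tabs
    foo, bar = line.split('\t')
    # the map column arrives as a dict-like string; eval turns it into a dict
    d = eval(bar)
    d['click'] += 1
    print '\t'.join([foo, str(d)])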
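Calling `eval` on input data executes whatever the text contains, so a parser is usually a safer choice. The following Python 3 sketch assumes the map column reaches the script as JSON-like text such as `{"click":13,"uid":15}` (the form shown by `select *` above); whether that holds depends on the Hive version, so treat it as an assumption. The helper name `bump_click` is invented for this example.

```python
import json


def bump_click(line):
    """Parse one 'foo<TAB>map' line and increment the 'click' counter."""
    foo, bar = line.rstrip("\n").split("\t", 1)
    d = json.loads(bar)  # assumes JSON-like map text, see the note above
    d["click"] += 1
    return foo, d


print(bump_click('jeffgeng\t{"click":13,"uid":15}'))
```

Unlike `eval`, `json.loads` raises an error on anything that is not valid JSON instead of running it.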
7. Run it in Hive

hive> CREATE TABLE t4 (foo STRING, bar MAP<STRING,INT>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','
    > MAP KEYS TERMINATED BY ':'
    > STORED AS TEXTFILE;

hive> add FILE add_mapper.py;

hive> INSERT OVERWRITE TABLE t4
    > SELECT
    > TRANSFORM (foo, bar)
    > USING 'python add_mapper.py'
    > AS (foo, bar)
    > FROM t3;
FAILED: Error in semantic analysis: line 1:23 Cannot insert into target table because column number/types are different t4: Cannot convert column 1 from string to map<string,int>.

The insert fails because the columns listed in AS (foo, bar) carry no types, so bar is treated as a string and cannot be converted to map<string,int>.
8. Declare the column types in the AS clause

hive> INSERT OVERWRITE TABLE t4
    > SELECT
    > TRANSFORM (foo, bar)
    > USING 'python add_mapper.py'
    > AS (foo string, bar map<string,int>)
    > FROM t3;
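9. The Python script also has to strip the spaces, quotes, curly braces and so on that are left over when the dict is converted to a string

#!/usr/bin/python
import sys

for line in sys.stdin:
    line = line.strip()
    foo, bar = line.split('\t')
    d = eval(bar)
    d['click'] += 1
    d['uid'] += 1
    # drop the spaces and single quotes produced by str(dict)
    # (the curly braces are left in place by this version)
    strmap = ''
    for x in str(d):
        if x in (' ', "'"):
            continue
        strmap += x
    print '\t'.join([foo, strmap])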
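Instead of filtering characters out of `str(d)`, the map text can be built directly from the dict, which also leaves no braces behind. This is a minimal Python 3 sketch, assuming the target layout is the one declared for t3 and t4 (',' between entries, ':' between key and value); whether Hive applies these delimiters to TRANSFORM output may depend on the Hive version and on any ROW FORMAT clause in the query. The name `to_map_text` is hypothetical.

```python
def to_map_text(d, item_sep=",", key_sep=":"):
    """Serialize a dict as 'k1:v1,k2:v2' with no spaces, quotes or braces."""
    return item_sep.join(
        "{}{}{}".format(key, key_sep, value) for key, value in d.items()
    )


print(to_map_text({"click": 14, "uid": 16}))  # click:14,uid:16
```

The mapper would then print `'\t'.join([foo, to_map_text(d)])` for each input row.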
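10. The result after running it

hive> select * from t4;
OK
jeffgeng    {"click":14,"uid":null}
Time taken: 0.146 seconds

Note that uid comes back as null rather than 16. The script in step 9 strips spaces and quotes but leaves the curly braces in place, which may be why the last value fails to parse as an INT.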