A small task: I didn't want to write a Java MapReduce program, just wanted to handle it with Hadoop Streaming + Python. I ran into a few problems along the way, so here are my notes.
Next time this scenario comes up, I can use this approach with confidence.
I wrote the mapper and reducer in PyCharm on Windows and uploaded them straight to the Linux server, where they refused to run at all, always reporting:
./maper.py file or directory not find
I couldn't find the cause at first; it turned out to be the line-ending difference between Windows (CRLF) and Linux (LF) files.
Lesson learned.
If a Python or shell script is written on Windows and run on Linux, convert the format first: dos2unix + filename
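The confusing "file or directory not find" message comes from the kernel: with a CRLF file the shebang line ends in a stray `\r`, so the system looks for an interpreter literally named `python\r`, which does not exist. A minimal demonstration of the fix, using `sed` as a stand-in where dos2unix is not installed (`maper_crlf.py` is a throwaway file name for illustration):

```shell
# Simulate a script saved on Windows: every line ends with \r\n (CRLF)
printf '#!/usr/bin/env python\r\nprint "hello"\r\n' > maper_crlf.py

# dos2unix maper_crlf.py would strip the \r; sed 's/\r$//' does the same
sed -i 's/\r$//' maper_crlf.py

# The shebang line is now clean, so the kernel can find the interpreter
head -n 1 maper_crlf.py
```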
The Python scripts:
mapper.py:
#!/usr/bin/env python
#coding:utf8
import json
import re
import sys

wrapper = ["fqa", "fqq", "gtt", "zxb", "zxa"]

for line in sys.stdin:
    try:
        line = line.decode("utf8")
        # the third whitespace-separated field carries the JSON payload
        text = re.split("\\s+", line)[2].strip()
        s = json.loads(text)
        if len(s["data"]["tts"]) == 0:
            continue
        # first pass: find the minimum price in this record
        minPr = sys.maxint
        for prline in s["data"]["tts"]:
            minPr = min(minPr, prline["pr"])
        # second pass: emit 1 for rows at the minimum price, 0 otherwise
        for l in s["data"]["tts"]:
            if l["cl"] in wrapper:
                if l["pr"] == minPr:
                    print "%s-%s,%s\t%s" % (l["dc"], l["ac"], l["cl"], "1")
                else:
                    print "%s-%s,%s\t%s" % (l["dc"], l["ac"], l["cl"], "0")
    except Exception, ex:
        # dirty line: skip it instead of crashing the task
        pass
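The two passes over s["data"]["tts"] (find the minimum pr, then flag wrapper rows sitting at that price) can be checked in isolation. A sketch in Python 3 syntax, with a made-up record:

```python
# hypothetical record in the shape the mapper expects
tts = [
    {"dc": "LAX", "ac": "AMS", "cl": "fqa", "pr": 300},
    {"dc": "LAX", "ac": "AMS", "cl": "zxb", "pr": 250},
    {"dc": "LAX", "ac": "AMS", "cl": "gtt", "pr": 250},
]
wrapper = {"fqa", "fqq", "gtt", "zxb", "zxa"}

min_pr = min(row["pr"] for row in tts)           # cheapest price in the record
labels = [(row["cl"], "1" if row["pr"] == min_pr else "0")
          for row in tts if row["cl"] in wrapper]
# each wrapper row gets "1" iff it sits at the minimum price
```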
reducer.py :
#!/usr/bin/env python
import sys

tcount = 0.0
lcount = 0.0
lastkey = ""

for line in sys.stdin:
    key, val = line.split('\t')
    if lastkey != key and lastkey != "":
        # key changed: flush the finished group before resetting the counters
        print "%s\t%d\t%d\t%f" % (lastkey, tcount, lcount, lcount / tcount)
        tcount = 0.0
        lcount = 0.0
    lastkey = key
    tcount += 1
    if val.strip() == "1":
        lcount += 1

# flush the last group, which the loop above never prints
if lastkey != "":
    print "%s\t%d\t%d\t%f" % (lastkey, tcount, lcount, lcount / tcount)
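The running-key bookkeeping in a streaming reducer is easy to get wrong (print before you reset the counters, and remember to flush the last group after the loop). A sketch of the same per-key counting with itertools.groupby, in Python 3 syntax, avoids the manual state; input is assumed already sorted by key, as it is after the shuffle:

```python
from itertools import groupby

def reduce_stream(lines):
    """Yield (key, total, ones, ratio) per key; lines must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        vals = [v for _, v in group]
        total = len(vals)
        ones = sum(1 for v in vals if v.strip() == "1")
        yield key, total, ones, ones / float(total)

# made-up sample of sorted mapper output
sample = ["a\t1\n", "a\t0\n", "b\t1\n"]
for row in reduce_stream(sample):
    print("%s\t%d\t%d\t%f" % row)
```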
Points in the code that need attention:
#!/usr/bin/env python
#coding:utf8
line = line.decode("utf8")
try:
except Exception,ex:
pass
Every one of these matters; otherwise a small problem will fail the whole job.
In particular, if the input contains dirty data and the Python script throws an exception that the code does not handle, the job fails with something like:
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=datadev
HADOOP_USER=null
last tool output: |LLA-AMS,zxl 0|
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
The unhandled exception is thrown up to the Hadoop framework, which of course treats it as an error: the task reports the failure and exits.
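The safer pattern is to contain the exception at the parse step and skip the dirty line, which is what the try/except in the mapper is for. A small sketch in Python 3 syntax (the sample lines are made up):

```python
import json

def parse_tts(line):
    """Return the tts list from one log line, or None if the line is dirty."""
    try:
        payload = line.split()[2]           # third whitespace-separated field
        return json.loads(payload)["data"]["tts"]
    except (IndexError, ValueError, KeyError):
        return None                         # dirty line: skip, don't crash the task

good = 'a b {"data":{"tts":[{"pr":1}]}}'
bad = 'garbage line without json'
# parse_tts(good) yields the list; parse_tts(bad) yields None instead of raising
```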
Now the job submission script, which also matters:
#!/bin/bash
export HADOOP_HOME=/home/q/hadoop-2.2.0
sudo -u flightdev hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-D mapred.job.queue.name=queue1 \
-D stream.map.input.ignoreKey=true \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-input /input/date=2014-11-17.lzo/* \
-output /output/20141117 \
-mapper maper.py \
-file maper.py \
-reducer reducer.py \
-file reducer.py
Two points to note:
-D stream.map.input.ignoreKey=true \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
The LZO text input format keys each line with its byte offset; stream.map.input.ignoreKey=true tells streaming not to pass that key to the mapper, so the offsets do not pollute the input lines.
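Concretely, without ignoreKey the streaming framework hands the mapper key\tvalue instead of the bare line, which shifts every field index the mapper splits on. A quick illustration in Python (the offset 10482 is made up):

```python
record = '2014-11-17 12:00:01 {"data":{}}'

# With stream.map.input.ignoreKey=true the mapper reads only the value:
with_ignore = record

# Without it, a hypothetical byte-offset key arrives tab-separated in front:
without_ignore = "10482\t" + record

# the JSON payload sits at index 2 only in the ignoreKey case
```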