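This comes from the Final Project of Udacity's Intro to Hadoop course.
The task asks whether the length of a post correlates with the length of its answers.
Approach:
Mapper: keep only questions and answers, and output id, type (question or answer), length. A question is keyed by its own id; an answer is keyed by its parent id, so both land on the same key.
Reducer: for each id, report the question's length and the average length of its answers.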
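Before the Hadoop Streaming version, the same idea can be tried in memory. The following is only a minimal sketch: the `posts` records are made up for illustration, and `map_post` and `reduce_posts` are hypothetical helper names, not part of the course material.

```python
from collections import defaultdict

# Hypothetical toy records: (node_id, node_type, parent_id, body)
posts = [
    ("1", "question", "", "How do I sort?"),
    ("2", "answer", "1", "Use sorted()."),
    ("3", "answer", "1", "Or call list.sort() in place."),
    ("4", "question", "", "Why?"),
    ("5", "comment", "1", "ignored"),
]


def map_post(node_id, node_type, parent_id, body):
    """Emit (key, type, length) for questions and answers only."""
    if node_type == "question":
        return (node_id, node_type, len(body))
    if node_type == "answer":
        # Answers are keyed by the question they belong to.
        return (parent_id, node_type, len(body))
    return None


def reduce_posts(records):
    """Return {question_id: (question_len, average_answer_len)}."""
    question_len = {}
    answer_lens = defaultdict(list)
    for key, node_type, length in records:
        if node_type == "question":
            question_len[key] = length
        else:
            answer_lens[key].append(length)
    result = {}
    for key in set(question_len) | set(answer_lens):
        lens = answer_lens.get(key, [])
        # Unanswered questions get an average of 0, as in the reducer.
        average = float(sum(lens)) / len(lens) if lens else 0
        result[key] = (question_len.get(key, 0), average)
    return result


mapped = [r for r in (map_post(*p) for p in posts) if r is not None]
summary = reduce_posts(mapped)
for key in sorted(summary):
    print("{0}\t{1}\t{2}".format(key, summary[key][0], summary[key][1]))
```

Unlike this sketch, a streaming reducer cannot hold everything in dictionaries; it relies on the input arriving sorted by key and flushes its counters whenever the key changes.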
# mapper
#!/usr/bin/python
import sys
import csv

reader = csv.reader(sys.stdin, delimiter="\t")
next(reader, None)  # skip the header row
for line in reader:
    # Only process complete records (19 fields)
    if len(line) == 19:
        node_type = line[5]
        body = line[4]
        node_id = line[0]
        parent_id = line[6]
        if node_type == "question":
            # A question is keyed by its own id
            print("{0}\t{1}\t{2}".format(node_id, node_type, len(body)))
        elif node_type == "answer":
            # An answer is keyed by the id of the question it answers
            print("{0}\t{1}\t{2}".format(parent_id, node_type, len(body)))
# reducer
#!/usr/bin/python
import sys

questionLen = 0
answerLen = 0
answerCount = 0
oldKey = None


def emit(key, questionLen, answerLen, answerCount):
    # Output: id, question length, average answer length (0 if unanswered)
    average = answerLen / answerCount if answerCount else 0
    print("{0}\t{1}\t{2}".format(key, questionLen, average))


for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 3:
        continue
    thisKey, node_type, length = data
    # Input is sorted by key, so a key change means the previous id is done
    if oldKey and oldKey != thisKey:
        emit(oldKey, questionLen, answerLen, answerCount)
        questionLen = 0
        answerLen = 0
        answerCount = 0
    oldKey = thisKey
    if node_type == "question":
        questionLen = int(length)
    elif node_type == "answer":
        answerCount += 1
        answerLen += float(length)

# Flush the last key
if oldKey is not None:
    emit(oldKey, questionLen, answerLen, answerCount)
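This post walks through a Hadoop data-analysis project that examines whether the length of a question post and the length of its answers are correlated in an online forum. A custom MapReduce job records the length of each question and of its answers, then computes the average answer length per question.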
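The job itself only produces per-question lengths; it does not compute a correlation. One way to follow up is to compute the Pearson correlation coefficient over the reducer output. The following is a minimal sketch under stated assumptions: `sample_output` is made-up data in the reducer's `id<TAB>question_len<TAB>average_answer_len` format, `parse_reducer_output` and `pearson` are hypothetical helpers, and dropping unanswered questions is a choice made here, not something the original project prescribes.

```python
import math


def parse_reducer_output(lines):
    """Return (question_lens, answer_lens) parsed from reducer output lines."""
    question_lens, answer_lens = [], []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue
        _, question_len, average_answer_len = fields
        # Assumption: questions without answers (average 0) are dropped.
        if float(average_answer_len) == 0:
            continue
        question_lens.append(float(question_len))
        answer_lens.append(float(average_answer_len))
    return question_lens, answer_lens


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up reducer output, for illustration only.
sample_output = [
    "1\t100\t50.0",
    "2\t200\t100.0",
    "3\t300\t150.0",
    "4\t400\t0",
]
q_lens, a_lens = parse_reducer_output(sample_output)
r = pearson(q_lens, a_lens)
print("pearson r = {0:.3f}".format(r))
```

A value near 1 or -1 would suggest a strong linear relationship and a value near 0 a weak one; on real forum data the result depends on the dataset, so no conclusion is implied here.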
