This comes from the Final Project of Udacity's Intro to Hadoop course.
The task: from the structured data of forum posts, find the hour of day at which each user posts most often.
For example, if user A posts the most at 8 o'clock each day,
then the reducer's output for that user is A 8 (the author id and the busiest hour).
Approach:
The mapper emits author_id and the hour of the post;
the reducer builds an {hour : count} tally per user, and finally outputs the hour with the highest count.
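A minimal sketch of the selection step at the heart of the reducer: `max()` with `key=hours.get` returns the dictionary key (the hour) whose count is largest. The posting hours below are made up for illustration.

```python
# Tally of posts per hour-of-day for one author (synthetic data).
hours = {h: 0 for h in range(24)}
for h in [8, 8, 8, 14, 23]:
    hours[h] += 1

# max() compares the keys by their counts and returns the busiest hour.
busiest = max(hours, key=hours.get)
print(busiest)  # 8
```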
# mapper
#!/usr/bin/python
import sys
import csv

reader = csv.reader(sys.stdin, delimiter="\t")
next(reader, None)                  # skip the header row
for line in reader:
    if len(line) == 19:             # ignore malformed rows
        author_id = line[3]
        hour = line[8][11:13]       # "YYYY-MM-DD HH:MM:SS..." -> "HH"
        print("{0}\t{1}".format(author_id, hour))
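A quick check of the mapper's field extraction on one hypothetical row (the id and timestamp below are invented; the real dataset has 19 tab-separated columns with author_id in column 3 and a timestamp in column 8):

```python
# Fake 19-column row, filled in only where the mapper looks.
fields = [""] * 19
fields[3] = "100"                        # hypothetical author_id
fields[8] = "2012-02-08 11:30:51.690"    # hypothetical timestamp

author_id = fields[3]
hour = fields[8][11:13]                  # characters 11-12 hold the hour
record = "{0}\t{1}".format(author_id, hour)
print(record)  # 100	11
```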
# reducer
#!/usr/bin/python
import sys

oldId = None
hours = dict.fromkeys(range(24), 0)     # per-author tally of posts per hour

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    thisAuthorId, hour = data
    if oldId and oldId != thisAuthorId:
        # Input is sorted by key, so a new author means the previous one
        # is finished: emit its busiest hour, then reset the tally.
        print("{0}\t{1}".format(oldId, max(hours, key=hours.get)))
        hours = dict.fromkeys(range(24), 0)
    oldId = thisAuthorId
    hours[int(hour)] += 1

if oldId is not None:
    print("{0}\t{1}".format(oldId, max(hours, key=hours.get)))
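The reducer can be sanity-checked locally without a cluster. The sketch below wraps the same logic in a function and feeds it synthetic, already-sorted mapper output (Hadoop sorts records by key between the map and reduce phases, which is what the grouping relies on); the ids and hours are made up.

```python
def reduce_stream(lines):
    """Same logic as the reducer above, as a function for local testing."""
    out = []
    old_id = None
    hours = {h: 0 for h in range(24)}
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        author_id, hour = data
        if old_id and old_id != author_id:
            # New key: flush the previous author's busiest hour.
            out.append((old_id, max(hours, key=hours.get)))
            hours = {h: 0 for h in range(24)}
        old_id = author_id
        hours[int(hour)] += 1
    if old_id is not None:
        out.append((old_id, max(hours, key=hours.get)))
    return out

# Synthetic mapper output, sorted by author id as Hadoop would deliver it.
sample = ["100\t08", "100\t08", "100\t21", "200\t14"]
result = reduce_stream(sample)
print(result)  # [('100', 8), ('200', 14)]
```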