[MapReduce] Correlation between the length of a post and the length of its answers

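This post walks through a data-analysis project that uses Hadoop to examine whether the length of a question post in an online forum is correlated with the length of its answers. A custom MapReduce job records the length of each question and of its answers, then computes the average answer length per question.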

This comes from the final project of Udacity's Intro to Hadoop course.

The task asks whether the length of a post is correlated with the length of its answers.

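Approach:

Mapper: keep only questions and answers, and emit the id, the type (question or answer), and the body length. For an answer, the id emitted is its parent_id, so each answer is grouped with the question it belongs to.

Reducer: for each id, output the question length and the average answer length.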
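
The approach above can be tried out in memory before writing the streaming scripts. The following is a minimal sketch in Python 3, not the course's reference solution: the sample records and the names map_record, reduce_group, and run are hypothetical, and a dictionary stands in for Hadoop's shuffle step.

```python
from collections import defaultdict

# Hypothetical sample records: (node_id, node_type, parent_id, body).
records = [
    ("1", "question", "", "How do I sort a list?"),
    ("2", "answer", "1", "Use sorted()."),
    ("3", "answer", "1", "Call list.sort() to sort in place."),
    ("4", "question", "", "What is a combiner?"),
    ("5", "comment", "2", "Thanks!"),
]


def map_record(node_id, node_type, parent_id, body):
    """Yield (question id, node type, body length) for questions and answers."""
    if node_type == "question":
        yield node_id, node_type, len(body)
    elif node_type == "answer":
        # Answers are keyed by their parent question's id.
        yield parent_id, node_type, len(body)


def reduce_group(values):
    """Return (question length, average answer length) for one question id."""
    question_len = 0
    answer_lens = []
    for node_type, length in values:
        if node_type == "question":
            question_len = length
        else:
            answer_lens.append(length)
    average = sum(answer_lens) / len(answer_lens) if answer_lens else 0
    return question_len, average


def run(records):
    # Stand-in for the shuffle step: group mapper output by key.
    groups = defaultdict(list)
    for record in records:
        for key, node_type, length in map_record(*record):
            groups[key].append((node_type, length))
    return {key: reduce_group(values) for key, values in sorted(groups.items())}


result = run(records)
for key, (question_len, average) in result.items():
    print("{0}\t{1}\t{2}".format(key, question_len, average))
```

The comment record is dropped by the map step, and the question with no answers gets an average of 0, matching what the reducer below does.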

# mapper
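#!/usr/bin/python
# Emits "<question id>\t<node type>\t<body length>" for every question and answer.

import sys
import csv

reader = csv.reader(sys.stdin, delimiter="\t")
next(reader, None)  # skip the first input line (the header row of the dataset)

for line in reader:
    # Only records with all 19 fields are processed.
    if len(line) == 19:
        node_id = line[0]
        body = line[4]
        node_type = line[5]
        parent_id = line[6]
        if node_type == "question":
            # A question is keyed by its own id.
            print("{0}\t{1}\t{2}".format(node_id, node_type, len(body)))
        elif node_type == "answer":
            # An answer is keyed by its parent id so that it groups with its question.
            print("{0}\t{1}\t{2}".format(parent_id, node_type, len(body)))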

# reducer
#!/usr/bin/python
# For each question id, prints the question length and the average answer length.
# Input lines are expected to arrive sorted by key.

import sys

questionLen = 0
answerLen = 0.0
answerCount = 0
oldKey = None


def emit(key, qLen, aLen, aCount):
    # The average answer length is 0 when a question has no answers.
    average = aLen / aCount if aCount else 0
    print("{0}\t{1}\t{2}".format(key, qLen, average))


for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 3:
        continue

    thisKey, node_type, length = data
    if oldKey and oldKey != thisKey:
        # The key changed: output the previous question's totals and reset.
        emit(oldKey, questionLen, answerLen, answerCount)
        questionLen = 0
        answerLen = 0.0
        answerCount = 0

    oldKey = thisKey
    if node_type == "question":
        questionLen = int(length)
    elif node_type == "answer":
        answerCount += 1
        answerLen += float(length)

# Output the totals for the last key.
if oldKey is not None:
    emit(oldKey, questionLen, answerLen, answerCount)
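
The reducer output is a list of (question length, average answer length) pairs, which still has to be turned into a correlation figure. One possible way to do that, sketched here in Python 3, is the Pearson correlation coefficient. The project description does not say which measure to use, so this choice is an assumption, as are the helper names, the sample lines, and the decision to skip questions whose average answer length is 0.

```python
import math


def parse_reducer_output(lines, skip_unanswered=True):
    """Parse 'id<TAB>question_len<TAB>avg_answer_len' lines into two lists."""
    question_lens, answer_lens = [], []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue
        average = float(fields[2])
        # Assumption: an average of 0 means the question has no answers.
        if skip_unanswered and average == 0:
            continue
        question_lens.append(float(fields[1]))
        answer_lens.append(average)
    return question_lens, answer_lens


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reducer output, not real results from the forum data.
sample_lines = ["1\t10\t20.0", "2\t20\t40.0", "3\t30\t60.0", "4\t15\t0"]
xs, ys = parse_reducer_output(sample_lines)
r = pearson(xs, ys)
print(r)
```

A value of r near 1 or -1 would indicate a strong linear relationship and a value near 0 a weak one. On the real data, the lines would come from the job's output files instead of sample_lines.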