[MapReduce] correlation between the length of a post and the length of answers

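This post walks through a Hadoop data-analysis project that asks whether the length of a question post in an online forum is correlated with the length of its answers. A custom MapReduce job records the length of each question and of each of its answers, then computes the average answer length per question.
The exercise comes from the Final Project of Udacity's Intro to Hadoop course.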

The assignment asks whether the length of a post is correlated with the length of its answers.

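Approach:
Mapper: keep only question and answer records and output id, type (question or answer), and body length. An answer is keyed by the id of its parent question so that it reaches the same reducer as the question.
Reducer: for each id, output the question length and the average length of its answers.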
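
The approach above can be tried locally before running anything on Hadoop. The following is a minimal in-memory sketch, not the code from the course project: it assumes records are already parsed into (node_id, parent_id, node_type, body) tuples, the sample records are made up, and the names map_record and reduce_pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_record(node_id, parent_id, node_type, body):
    """Emit (key, type, length); answers are keyed by their parent question."""
    if node_type == "question":
        return (node_id, node_type, len(body))
    if node_type == "answer":
        return (parent_id, node_type, len(body))
    return None  # comments and any other node types are dropped


def reduce_pairs(pairs):
    """Return {question_id: (question_len, average_answer_len)}."""
    result = {}
    # Sorting by key stands in for Hadoop's shuffle-and-sort phase.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        question_len = 0
        answer_lens = []
        for _, node_type, length in group:
            if node_type == "question":
                question_len = length
            else:
                answer_lens.append(length)
        # Questions without answers get an average of 0.
        avg = sum(answer_lens) / float(len(answer_lens)) if answer_lens else 0
        result[key] = (question_len, avg)
    return result


# Made-up sample records: (node_id, parent_id, node_type, body)
records = [
    ("1", "", "question", "How do I sort?"),
    ("2", "1", "answer", "Use sorted()."),
    ("3", "1", "answer", "Or list.sort() in place."),
    ("4", "1", "comment", "Thanks!"),
    ("5", "", "question", "Unanswered?"),
]
pairs = [p for p in (map_record(*r) for r in records) if p is not None]
summary = reduce_pairs(pairs)
print(summary)
```

Question 1 is 14 characters long and its two answers average 18.5 characters; question 5 has no answers, so its average is reported as 0.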


# mapper
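#!/usr/bin/python
# Hadoop Streaming mapper: emits "question_id <TAB> node_type <TAB> body_length".

import sys
import csv

reader = csv.reader(sys.stdin, delimiter="\t")
# The header row needs no explicit skip: its node_type column is neither
# "question" nor "answer", so the checks below drop it. Skipping the first
# line unconditionally would lose a real record in every input split that
# does not start with the header.

for line in reader:
    # A complete forum record has 19 tab-separated fields; skip anything else.
    if len(line) == 19:
        node_id = line[0]
        body = line[4]
        node_type = line[5]
        parent_id = line[6]
        if node_type == "question":
            # A question is keyed by its own id.
            print("{0}\t{1}\t{2}".format(node_id, node_type, len(body)))
        elif node_type == "answer":
            # An answer is keyed by the id of the question it belongs to.
            print("{0}\t{1}\t{2}".format(parent_id, node_type, len(body)))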

# reducer
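#!/usr/bin/python
# Hadoop Streaming reducer: for each question id, prints
# "id <TAB> question_length <TAB> average_answer_length".
# Input lines arrive sorted by key, so all records for one id are adjacent.

import sys


def emit(key, question_len, answer_len, answer_count):
    # Questions without answers get an average of 0.
    average = answer_len / answer_count if answer_count else 0
    print("{0}\t{1}\t{2}".format(key, question_len, average))


questionLen = 0
answerLen = 0.0  # float, so the average is not truncated under Python 2
answerCount = 0
oldKey = None

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 3:
        continue

    thisKey, node_type, length = data
    if oldKey is not None and oldKey != thisKey:
        # Key changed: flush the finished question and reset the accumulators.
        emit(oldKey, questionLen, answerLen, answerCount)
        questionLen = 0
        answerLen = 0.0
        answerCount = 0

    oldKey = thisKey
    if node_type == "question":
        questionLen = int(length)
    elif node_type == "answer":
        answerCount += 1
        answerLen += float(length)

# Flush the last key once the input is exhausted.
if oldKey is not None:
    emit(oldKey, questionLen, answerLen, answerCount)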
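
The reducer only lists the two lengths per question; deciding whether they are correlated still needs a correlation measure. The sketch below is an addition and not part of the course solution: it computes the Pearson coefficient from lines in the reducer's output format (id, question length, average answer length, tab-separated). The function name pearson_from_lines and the choice to skip unanswered questions are assumptions made for illustration.

```python
import math


def pearson_from_lines(lines, skip_unanswered=True):
    """Pearson r between question length and average answer length."""
    xs, ys = [], []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue  # ignore malformed lines
        q_len, a_len = float(fields[1]), float(fields[2])
        # Assumption: an average of 0 means the question had no answers.
        if skip_unanswered and a_len == 0:
            continue
        xs.append(q_len)
        ys.append(a_len)

    n = len(xs)
    if n < 2:
        return None  # not enough points to measure correlation
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


# Made-up lines in the reducer's output format.
sample = ["1\t10\t20.0", "2\t20\t40.0", "3\t30\t60.0", "4\t99\t0"]
r = pearson_from_lines(sample)
print(r)
```

On real output, the lines could be read from the job's result file instead of the sample list. A value near 0 would suggest little linear relationship, while values near 1 or -1 would suggest a strong one.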

