Writing a Simple MapReduce Program with mincemeat

This post shows how to use the mincemeat library to write a MapReduce program that counts, for each author appearing in a set of source files, how often each term occurs in that author's paper titles, excluding stop words and certain characters. It provides the source code and screenshots of the program running; later updates add sorting of each author's term frequencies and a dictionary sort, and share a link to a website for running MapReduce computations online.





Problem description:

Several source files are provided for analysis; every line in them has the following form:

journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.

which stands for:

paper-id:::author1::author2::…. ::authorN:::title

The task is to compute, for each author, the number of occurrences of every term in that author's paper titles. For example:

The result for the author Alberto Pettorossi is: program:3, transformation:2, transforming:2, using:2, programs:2, logic:2.

Note: each paper id corresponds to multiple authors, and each author corresponds to multiple terms. Terms exclude stop words; single letters and hyphens are not counted either.
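To make these rules concrete, here is a small sketch (not code from the original post) of how one such line could be split into its fields and the title tokenized; treating every non-alphanumeric character as a space is one reasonable reading of the requirements:

line = "journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2."
paper_id, authors, title = line.split(':::')
authors = authors.split('::')   # ['Michele Di Santo', 'Libero Nigro', 'Wilma Russo']
# lower-case, turn hyphens and punctuation into spaces, then split
terms = ''.join(c if c.isalnum() else ' ' for c in title.lower()).split()
# terms == ['programmer', 'defined', 'control', 'abstractions', 'in', 'modula', '2']
# dropping the stop word 'in' (a preposition) and the single-character token '2' leaves:
# ['programmer', 'defined', 'control', 'abstractions', 'modula']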


The source code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import mincemeat

# collect every data file in the hw3data directory
text_files = glob.glob('E:\\hw3data\\*')

# read an entire data file into memory as one string
def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

# the datasource for the job: map each file name to that file's contents
source = dict((file_name, file_contents(file_name))
              for file_name in text_files)

# set up the map and reduce functions

def mapfn(key, value):
    # mincemeat serializes this function and ships it to the workers,
    # so the stop-word list must live inside the function body rather
    # than at module level.
    stop_words=['all', 'herself', 'should', 'to', 'only', 'under', 'do', 'weve',
            'very', 'cannot', 'werent', 'yourselves', 'him', 'did', 'these',
            'she', 'havent', 'where', 'whens', 'up', 'are', 'further', 'what',
            'heres', 'above', 'between', 'youll', 'we', 'here', 'hers', 'both',
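            # ... the listing in the original post breaks off here; the rest
            # of the stop-word list comes from the course's stopwords.py and
            # is omitted. What follows is a minimal sketch of the missing
            # pieces, assuming the standard mincemeat.Server API; 'changeme'
            # is a placeholder password, not a value from the original post.
            ]
    stop_words = set(stop_words)        # set membership tests are O(1)
    for line in value.splitlines():
        parts = line.split(':::')
        if len(parts) != 3:
            continue                    # skip malformed lines
        authors = parts[1].split('::')
        # lower-case the title, treat hyphens as spaces, and keep only
        # letters and digits, as the assignment requires
        title = parts[2].lower().replace('-', ' ')
        title = ''.join(c if c.isalnum() else ' ' for c in title)
        for author in authors:
            for term in title.split():
                if len(term) > 1 and term not in stop_words:
                    yield (author, term), 1

def reducefn(key, values):
    # key is an (author, term) pair; values is a list of 1s from mapfn
    return sum(values)

# run the server; workers attach with:  python mincemeat.py -p changeme 127.0.0.1
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password='changeme')
print(results)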
I have recently been taking the Web Intelligence and Big Data course on Coursera. Last Friday the instructor assigned a homework exercise: write a MapReduce program, implemented in Python. The full assignment description follows:

Programming Assignment for HW3

Homework 3 (Programming Assignment A)

Download the data files bundled as a .zip file from hw3data.zip. Each file in this archive contains entries that look like:

journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.

that represent bibliographic information about publications, formatted as follows:

paper-id:::author1::author2::…. ::authorN:::title

Your task is to compute how many times every term occurs across titles, for each author. For example, for the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.

Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that 'terms' must exclude common stop words, such as prepositions etc. For the purpose of this assignment, the stop words that need to be omitted are listed in the script stopwords.py. In addition, single-letter words, such as "a", can be ignored; hyphens can also be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only letters and numbers can be part of a title term. Thus, "program" and "program." should both be counted as the term 'program', and "map-reduce" should be taken as 'map reduce'. Note: you do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.

The assignment is to write a parallel map-reduce program for the above task using either octo.py or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python. These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively. I strongly recommend mincemeat.py, which is much faster than octo.py, even though the latter was covered first in the lecture video as an example. Both are very similar.

Once you have computed the output, i.e. the term frequencies per author, go attempt Homework 3, where you will be asked questions that can be answered simply by using your computed output, such as the top terms that occur for some particular author.

Note: there is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task at hand, but then you won't learn anything about map-reduce.

Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour to run, there is probably something wrong.
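Because the Homework 3 questions ask for things like an author's most frequent title terms, it helps to sort the computed output. A minimal sketch, assuming results maps (author, term) pairs to counts as in the completed code above; top_terms is a hypothetical helper, not part of mincemeat:

def top_terms(results, author, n=10):
    # gather (term, count) pairs for one author, highest count first
    counts = [(term, c) for (a, term), c in results.items() if a == author]
    return sorted(counts, key=lambda tc: tc[1], reverse=True)[:n]

print(top_terms(results, 'Alberto Pettorossi'))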