[MapReduce]Filter Pattern

本博客详细介绍了Udacity的课程练习,内容涉及从大量的论坛帖子中筛选出只有一句话的帖子的过程。通过使用Python编程语言和csv模块,实现对数据文件进行解析并计数符合条件的帖子数量。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

这是Udacity的课程 intro to hadoop and mapReduce里面Lesson4的练习


该练习完成了从大量的论坛post中,过滤出只有一句话的post。以下是题目具体描述以及python代码

#!/usr/bin/python
import sys
import csv

# To run this code on the actual data, please download the additional dataset.
# You can find instructions in the course materials (wiki) and in the instructor notes.
# There are some things in this data file that are different from what you saw
# in Lesson 3. The dataset is more complicated and closer to what you might
# see in the real world. It was generated by exporting data from a SQL database.
# 
# The data in at least one of the fields (the body field) can include newline
# characters, and all the fields are enclosed in double quotes. Therefore, we
# will need to process the data file in a way other than using split(","). To do this, 
# we have provided sample code for using the csv module of Python. Each 'line'
# will be a list that contains each field in sequential order.
# 
# In this exercise, we are interested in the field 'body' (which is the 5th field, 
# line[4]). The objective is to count the number of forum nodes where 'body' either 
# contains none of the three punctuation marks: period ('.'), exclamation point ('!'), 
# question mark ('?'), or else 'body' contains exactly one such punctuation mark as the 
# last character. There is no need to parse the HTML inside 'body'. Also, do not pay
# special attention to newline characters.

punctualSet = ['.', '!', '?']

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:

        # YOUR CODE HERE
        if one_sentence(line[4]):
            writer.writerow(line)
            
def one_sentence(body):
    tokens = [".", "!", "?"]
    if body[-1] in tokens:
        body = body[:-1]
    if containsAny(body, tokens):
        return False
    # for token in tokens:
    #     if len(body.strip().split(token)) > 1:
    #         return False
    return True

def containsAny(body, tokens):
    """Check whether 'str' contains ANY of the chars in 'set'"""
    return 1 in [c in body for c in tokens]

test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"This is one sentence\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Also one sentence!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Hey!\nTwo sentences!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One. Two! Three?\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One Period. Two Sentences\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Three\nlines, one sentence\n\"\t\"\"
"""

# This function allows you to test the mapper with the provided test string
def main():
    import StringIO
    sys.stdin = StringIO.StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

if __name__ == "__main__":
    main()

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值