Hadoop Study Notes (1)

zoe9698

License: CC 4.0 BY-SA

Category: hadoop. Tags: hadoop, python

Article link: https://blog.youkuaiyun.com/zoe9698/article/details/79820999

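About MapReduce: map is the "split" step, which breaks the input data into pieces, and reduce performs further computation on the pieces that map produced.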

Example 1: "WordCount"

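First, write a map program that splits each input string into individual words. Then reduce those words: occurrences of the same word are counted together, and different words are output separately. The final result is the frequency of each word.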
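
Before looking at the two scripts, it may help to see the whole flow in one place. The following is a minimal in-memory sketch, not how Hadoop itself runs the job: the function names `map_words`, `reduce_counts`, and `word_count` are made up for illustration, and the call to `sorted` is an assumption standing in for the sort that happens between the map and reduce steps.

```python
from itertools import groupby


def map_words(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of adjacent pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for the sort between map and reduce (an assumption
    # of this sketch); it places identical words next to each other.
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, total in word_count(["foo bar foo", "bar baz"]):
        print("%s\t%d" % (word, total))
```

The sketch keeps the same split as the scripts that follow: one function only emits pairs, the other only sums them.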

1. mapper.py

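#!/usr/bin/env python
import sys

for line in sys.stdin:  # iterate over every line of the input
    line = line.strip()  # strip leading and trailing whitespace
    words = line.split()  # split the line into words on whitespace
    for word in words:
        # emit "word<TAB>1" as a tab-separated key/value pair
        print('%s\t%s' % (word, 1))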

2. reducer.py

#!/usr/bin/env python
import sys

current_word = None  # the word currently being counted
current_count = 0  # running count for current_word

# The loop assumes the input is sorted by word, so identical words are adjacent.
for line in sys.stdin:
    line = line.strip()  # strip leading and trailing whitespace
    try:
        word, count = line.split('\t', 1)  # split word and count at the first tab
        count = int(count)  # convert the string '1' to the integer 1
    except ValueError:
        continue  # skip blank or malformed lines

    if current_word == word:  # same word as on the previous line
        current_count += count  # add this count to the running total
    else:
        if current_word is not None:  # a new word starts: print the finished one
            print('%s\t%s' % (current_word, current_count))
        current_count = count  # start counting the new word
        current_word = word

# print the last word, which the loop itself never flushes
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
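
The reducer only compares each word with the one on the previous line, so it gives correct totals only when identical words arrive next to each other. The sketch below illustrates this under stated assumptions: `reduce_lines` is a hypothetical helper that copies the reducer's run-based counting into a function, and `mapper_output` is made-up sample data rather than real job output.

```python
def reduce_lines(lines):
    """Sum counts over runs of adjacent identical words, like reducer.py."""
    results = []
    current_word, current_count = None, 0
    for line in lines:
        try:
            word, count = line.strip().split("\t", 1)
            count = int(count)
        except ValueError:
            continue  # skip blank or malformed lines
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results


mapper_output = ["foo\t1", "bar\t1", "foo\t1"]  # hypothetical sample data

# Unsorted: the two "foo" lines are not adjacent, so "foo" is reported twice.
unsorted_result = reduce_lines(mapper_output)

# Sorted: identical words are adjacent, so each word is reported once.
sorted_result = reduce_lines(sorted(mapper_output))

print(unsorted_result)
print(sorted_result)
```

This is why a local test of the two scripts normally puts a sort step between the mapper and the reducer.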