Writing MapReduce in Three Languages with Hadoop Streaming
I won't explain how Streaming works internally (I don't fully understand it myself). What matters here is that Streaming feeds standard input to the Mapper and the Reducer and collects what they write to standard output. See the Hadoop website for the details; this post is just a hands-on memo.
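Roughly, all three versions below boil down to the following local pipeline, where mapper and reducer stand for any executables that read lines from stdin and write "key<TAB>value" lines to stdout, and sort stands in for Hadoop's shuffle (a simplified sketch, not the real execution model):
cat input_files | ./mapper | sort | ./reducer > output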
I. Python with Streaming
Step 1: write mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
Step 2: write reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
Step 3: make mapper.py and reducer.py executable
This is very, very important!
chmod +x /your/path/mapper.py
chmod +x /your/path/reducer.py
Step 4: test whether mapper.py and reducer.py work
This step does not require Hadoop.
echo "foo foo labs foo bar" | /your/path/mapper.py | sort | /your/path/reducer.py
If you skipped the chmod in Step 3, this step will fail.
Expected output:
bar 1
foo 3
labs 1
If the output is correct, move on to the next step.
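Besides the echo test, you can run the same local pipeline over the actual file you will upload in Step 5 (myfile.txt is just the example file name used there):
cat myfile.txt | /your/path/mapper.py | sort | /your/path/reducer.py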
Step 5: run the word count with Hadoop Streaming and Python
1. Start Hadoop
bin/start-all.sh
2. Create a directory named input on HDFS and put the file to be counted into it
hadoop fs -mkdir input
hadoop fs -put myfile.txt input
3. Delete any existing output directory on HDFS, otherwise the job will fail
hadoop fs -rmr output
4. Run the word count with mapper.py and reducer.py (a variant using -file is shown right after the command)
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper /your/path/mapper.py \
    -reducer /your/path/reducer.py
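If mapper.py and reducer.py exist only on the machine you submit from, they need to be shipped to the task nodes; the streaming jar's -file option does this. A variant of the same command (paths are placeholders):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file /your/path/mapper.py \
    -file /your/path/reducer.py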
Resulting output (the counts are the word frequencies over all files uploaded to input):
bar 1
foo 3
labs 1
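Those numbers live in the part-* files the job wrote under output on HDFS; to view them or copy them to the local disk:
hadoop fs -cat output/part-*
hadoop fs -get output ./output_local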
II. C++ with Streaming
Step 1: write mapper.cpp
// mapper: read whitespace-separated words from STDIN and emit "word<TAB>1"
#include <string>
#include <iostream>
using namespace std;

int main() {
    string key;
    string value = "1";
    while (cin >> key) {
        cout << key << "\t" << value << endl;
    }
    return 0;
}
Step 2: write reducer.cpp
// reducer: read "word<TAB>1" pairs from STDIN and count each word
#include <string>
#include <map>
#include <iostream>
using namespace std;

int main() {
    string key;
    string value;
    map<string, int> word2count;
    map<string, int>::iterator it;
    while (cin >> key) {
        cin >> value;
        it = word2count.find(key);
        if (it != word2count.end()) {
            (it->second)++;
        } else {
            word2count.insert(make_pair(key, 1));
        }
    }
    // emit "word<TAB>count" for every word seen
    for (it = word2count.begin(); it != word2count.end(); ++it) {
        cout << it->first << "\t" << it->second << endl;
    }
    return 0;
}
Step 3: make mapper.cpp and reducer.cpp executable
Very important! (Strictly speaking, the .cpp source files do not need the execute bit, since the binaries that g++ produces in Step 4 are already executable, but I set it anyway.)
chmod +x /your/path/mapper.cpp
chmod +x /your/path/reducer.cpp
Step 4: compile the mapper and reducer executables
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
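Optionally, if the binaries will later run on cluster nodes whose C++ runtime differs from the build machine, a statically linked build is safer (not needed for the local test below):
g++ -static -O2 -o mapper mapper.cpp
g++ -static -O2 -o reducer reducer.cpp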
Step 5: test whether mapper and reducer work
This step does not require Hadoop.
echo "foo foo labs foo bar" | /your/path/mapper | sort | /your/path/reducer
Expected output:
bar 1
foo 3
labs 1
If the output is correct, move on to the next step.
Step 6: run the word count with Hadoop Streaming and C++
1. Start Hadoop
bin/start-all.sh
2. Create a directory named input on HDFS and put the file to be counted into it
hadoop fs -mkdir input
hadoop fs -put myfile.txt input
3. Delete any existing output directory on HDFS, otherwise the job will fail
hadoop fs -rmr output
4. Run the word count with the mapper and reducer binaries (see the note after the command)
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper /your/path/mapper \
    -reducer /your/path/reducer
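As with the Python scripts, the mapper and reducer binaries can be shipped with the job instead of being pre-installed on every node: append -file /your/path/mapper -file /your/path/reducer to the command above.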
Resulting output (the counts are the word frequencies over all files uploaded to input):
bar 1
foo 3
labs 1
III. Shell with Streaming
Step 1: write mapper.sh
#!/bin/bash
# emit "word<TAB>1" for every word on every input line
while read LINE; do
    for word in $LINE; do
        echo -e "$word\t1"
    done
done
Step 2: write reducer.sh
#!/bin/bash
# the input arrives sorted by word (locally via sort, on Hadoop via the shuffle),
# so equal words are on adjacent lines and can be counted by comparing neighbours
count=0
started=0
word=""
while read LINE; do
    newword=$(echo "$LINE" | cut -f 1)
    if [ "$word" != "$newword" ]; then
        # a new word starts; flush the count of the previous one
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word="$newword"
        count=1
        started=1
    else
        count=$((count + 1))
    fi
done
# flush the last word
echo -e "$word\t$count"
Step 3: make mapper.sh and reducer.sh executable
This is very, very important!
chmod +x /your/path/mapper.sh
chmod +x /your/path/reducer.sh
Step 4: test whether mapper.sh and reducer.sh work
This step does not require Hadoop; the sort in the middle plays the part that Hadoop's shuffle plays on the cluster.
echo "foo foo labs foo bar" | /your/path/mapper.sh | sort | /your/path/reducer.sh
If you skipped the chmod in Step 3, this step will fail.
Expected output:
bar 1
foo 3
labs 1
If the output is correct, move on to the next step.
Step 5: run the word count with Hadoop Streaming and Shell
1. Start Hadoop
bin/start-all.sh
2. Create a directory named input on HDFS and put the file to be counted into it
hadoop fs -mkdir input
hadoop fs -put myfile.txt input
3. Delete any existing output directory on HDFS, otherwise the job will fail
hadoop fs -rmr output
4. Run the word count with mapper.sh and reducer.sh (a note on the number of reducers follows the command)
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper /your/path/mapper.sh \
    -reducer /your/path/reducer.sh
Resulting output (the counts are the word frequencies over all files uploaded to input):
bar 1
foo 3
labs 1
Ha, it is a bit repetitive, but it works nicely.
References:
[1] Hadoop Streaming Programming
http://dongxicheng.org/mapreduce/hadoop-streaming-programming/
[2] Hadoop Streaming in practice: writing map & reduce programs in C++
http://blog.youkuaiyun.com/yfkiss/article/details/6430154
[3] C++ Hadoop hands-on notes
http://blog.youkuaiyun.com/segen_jaa/article/details/8633939/
[4] Implementing a Hadoop MapReduce program in Python
http://my.oschina.net/sanpeterguo/blog/215922?fromerr=ZJLZeYoj