Writing MapReduce in Three Languages with Hadoop Streaming
I won't explain how Streaming works internally (I don't fully understand it myself). What matters here is that Streaming feeds standard input to the Mapper and the Reducer and collects what they write to standard output. See the Hadoop website for the details; this post is just a hands-on memo.
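Roughly, all three versions below boil down to the following local pipeline, where mapper and reducer stand for any executables that read lines from stdin and write "key<TAB>value" lines to stdout, and sort stands in for Hadoop's shuffle (a simplified sketch, not the real execution model):
cat input_files | ./mapper | sort | ./reducer > output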
I. Python with Streaming
Step 1: write mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
Step 2: write reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
Step 3: make mapper.py and reducer.py executable
This is very, very important!
chmod +x /your/path/mapper.py
chmod +x /your/path/reducer.py
Step 4: test whether mapper.py and reducer.py work
This step does not require Hadoop.
echo "foo foo labs foo bar" | /your/path/mapper.py | sort | /your/path/reducer.py
If you skipped the chmod in Step 3, this step will fail.
Expected output:
bar 1
foo 3
labs 1
If the output is correct, move on to the next step.
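Besides the echo test, you can run the same local pipeline over the actual file you will upload in Step 5 (myfile.txt is just the example file name used there):
cat myfile.txt | /your/path/mapper.py | sort | /your/path/reducer.py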
Step 5: run the word count with Hadoop Streaming and Python
1. Start Hadoop
bin/start-all.sh
2. Create a directory named input on HDFS and put the file to be counted into it
hadoop fs -mkdir input
hadoop fs -put myfile.txt input
3. Delete any existing output directory on HDFS, otherwise the job will fail
hadoop fs -rmr output
4. Run the word count with mapper.py and reducer.py (a variant using -file is shown right after the command)
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper /your/path/mapper.py \
    -reducer /your/path/reducer.py
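If mapper.py and reducer.py exist only on the machine you submit from, they need to be shipped to the task nodes; the streaming jar's -file option does this. A variant of the same command (paths are placeholders):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file /your/path/mapper.py \
    -file /your/path/reducer.py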
Resulting output (the counts are the word frequencies over all files uploaded to input):
bar 1
foo 3
labs 1
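Those numbers live in the part-* files the job wrote under output on HDFS; to view them or copy them to the local disk:
hadoop fs -cat output/part-*
hadoop fs -get output ./output_local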
II. C++ with Streaming
Step 1: write mapper.cpp
// mapper: read whitespace-separated words from STDIN and emit "word<TAB>1"
#include <string>
#include <iostream>
using namespace std;

int main() {
    string key;
    string value = "1";
    while (cin >> key) {
        cout << key << "\t" << value << endl;
    }
    return 0;
}
Step 2: write reducer.cpp
// reducer: read "word<TAB>1" pairs from STDIN and count each word
#include <string>
#include <map>
#include <iostream>
using namespace std;

int main() {
    string key;
    string value;
    map<string, int> word2count;
    map<string, int>::iterator it;
    while (cin >> key) {
        cin >> value;
        it = word2count.find(key);
        if (it != word2count.end()) {
            (it->second)++;
        } else {
            word2count.insert(make_pair(key, 1));
        }
    }
    // emit "word<TAB>count" for every word seen
    for (it = word2count.begin(); it != word2count.end(); ++it) {
        cout << it->first << "\t" << it->second << endl;
    }
    return 0;
}
Step 3: make mapper.cpp and reducer.cpp executable
Very important! (Strictly speaking, the .cpp source files do not need the execute bit, since the binaries that g++ produces in Step 4 are already executable, but I set it anyway.)
chmod +x /your/path/mapper.cpp
chmod +x /your/path/reducer.cpp
Step 4: compile the mapper and reducer executables
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
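Optionally, if the binaries will later run on cluster nodes whose C++ runtime differs from the build machine, a statically linked build is safer (not needed for the local test below):
g++ -static -O2 -o mapper mapper.cpp
g++ -static -O2 -o reducer reducer.cpp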
Step 5: test whether mapper and reducer work
This step does not require Hadoop.
echo "foo foo labs foo bar" | /your/path/mapper | sort | /your/path/reducer
Expected output:
bar 1
foo 3
labs 1
If the output is correct, move on to the next step.
Step 6: run the word count with Hadoop Streaming and C++
1. Start Hadoop
bin/start-all.sh
2. Create a directory named input on HDFS and put the file to be counted into it
hadoop fs -mkdir input
hadoop fs -put myfile.txt input
3. Delete any existing output directory on HDFS, otherwise the job will fail
hadoop fs -rmr output
4. Run the word count with the mapper and reducer binaries (see the note after the command)
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper /your/path/mapper \
    -reducer /your/path/reducer
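As with the Python scripts, the mapper and reducer binaries can be shipped with the job instead of being pre-installed on every node: append -file /your/path/mapper -file /your/path/reducer to the command above.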
Resulting output (the counts are the word frequencies over all files uploaded to input):
bar 1
foo 3
labs 1
III. Shell with Streaming
Step 1: write mapper.sh
#!/bin/bash
# emit "word<TAB>1" for every word on every input line
while read LINE; do
    for word in $LINE; do
        echo -e "$word\t1"
    done
done
Step 2: write reducer.sh
#!/bin/bash
# the input arrives sorted by word (locally via sort, on Hadoop via the shuffle),
# so equal words are on adjacent lines and can be counted by comparing neighbours
count=0
started=0
word=""
while read LINE; do
    newword=$(echo "$LINE" | cut -f 1)
    if [ "$word" != "$newword" ]; then
        # a new word starts; flush the count of the previous one
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word="$newword"
        count=1
        started=1
    else
        count=$((count + 1))
    fi
done
# flush the last word
echo -e "$word\t$count"
Step 3: make mapper.sh and reducer.sh executable
This is very, very important!
chmod +x /your/path/mapper.sh
chmod +x /your/path/reducer.sh
Step 4: test whether mapper.sh and reducer.sh work
This step does not require Hadoop; the sort in the middle plays the part that Hadoop's shuffle plays on the cluster.
echo "foo foo labs foo bar" | /your/path/mapper.sh | sort | /your/path/reducer.sh
If you skipped the chmod in Step 3, this step will fail.
Expected output:
bar 1
foo 3
labs 1
If the output is correct, move on to the next step.
Step 5: run the word count with Hadoop Streaming and Shell
1. Start Hadoop
bin/start-all.sh
2. Create a directory named input on HDFS and put the file to be counted into it
hadoop fs -mkdir input
hadoop fs -put myfile.txt input
3. Delete any existing output directory on HDFS, otherwise the job will fail
hadoop fs -rmr output
4. Run the word count with mapper.sh and reducer.sh (a note on the number of reducers follows the command)
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input input/* \
    -output output \
    -mapper /your/path/mapper.sh \
    -reducer /your/path/reducer.sh
Resulting output (the counts are the word frequencies over all files uploaded to input):
bar 1
foo 3
labs 1
Ha, it is a bit repetitive, but it works nicely.
References:
[1] Hadoop Streaming Programming
http://dongxicheng.org/mapreduce/hadoop-streaming-programming/
[2] Hadoop Streaming in practice: writing map & reduce programs in C++
http://blog.youkuaiyun.com/yfkiss/article/details/6430154
[3] C++ Hadoop hands-on notes
http://blog.youkuaiyun.com/segen_jaa/article/details/8633939/
[4] Implementing a Hadoop MapReduce program in Python
http://my.oschina.net/sanpeterguo/blog/215922?fromerr=ZJLZeYoj