Counting Movie Occurrences with Hadoop Streaming

This post shows how to use Hadoop Streaming with Python map/reduce scripts to count how many times a given movie appears in a log file, and walks through both the standalone (local) and cluster ways of running the job.


map.py
#!/usr/bin/python
# encoding:utf-8
import sys

# Mapper: the movie title is the first comma-separated field of each log line.
# Emit "title\t1" for every line whose title contains "捉妖记".
for line in sys.stdin:
    line = line.strip()
    splited = line.split(',')[0]
    if "捉妖记" in splited:
        print '%s\t%s' % (splited, 1)
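
The input format is only implied by the mapper: each log line is assumed to be comma-separated with the movie title in the first field. For the hypothetical lines below (made up for illustration, \t stands for a tab), the mapper emits one tab-separated pair per matching line:

捉妖记,20150204,xxx      ->  捉妖记\t1
某其他电影,20150204,yyy  ->  (no output, title does not contain "捉妖记")
捉妖记,20150204,zzz      ->  捉妖记\t1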
--------------------------------------------------------------------------------------------------
red.py
#!/usr/bin/python
# encoding:utf-8
from operator import itemgetter
import sys

word2count = {}

# Reducer: sum the counts of the "title\tcount" pairs emitted by the mapper.
for line in sys.stdin:
    line = line.strip()
    try:
        # split on the first tab only, so titles containing spaces still parse
        word, count = line.split('\t', 1)
        word2count[word] = word2count.get(word, 0) + int(count)
    except ValueError:
        # skip blank or malformed lines
        pass

# sorted_word2count = sorted(word2count.items(), lambda x, y: cmp(x[1], y[1]))
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))  # sort by title
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
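
The reducer reads these tab-separated pairs from stdin, sums the counts per title, and prints the totals sorted by title. For the two matching lines in the hypothetical input above, the final output would be:

捉妖记\t2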

---------------------------------------------------------------------------------
Data:
dat0204.log
----------------------------------------------------------------------------------
Run command (cluster):
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input /home/hduser/dat0204.log -output /home/hduser/output -file /home/hduser/map.py -file /home/hduser/red.py -mapper "python map.py" -reducer "python red.py"

submit.sh (in $hadoop_home; it removes any existing HDFS output directory first, since the job fails if /home/hduser/output already exists)
hadoop fs -rmr /home/hduser/ou*
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapred.reduce.tasks=1 \
-mapper "python map.py" \
-reducer "python red.py" \
-file /home/hduser/map.py \
-file /home/hduser/red.py \
-input /home/hduser/dat0204.log \
-output /home/hduser/output

hadoop fs -text /home/hduser/output/pa*

Standalone (local test):
cat /home/hduser/dat0204.log |python map.py|sort|python red.py
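
The scripts above rely on Python 2 print statements. If the cluster nodes only have Python 3, a minimal sketch of an equivalent pair (hypothetical file names map3.py and red3.py; the logic is unchanged) could look like this:

map3.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

# Same mapper logic as map.py, written for Python 3.
for line in sys.stdin:
    title = line.strip().split(',')[0]
    if "捉妖记" in title:
        print('%s\t%s' % (title, 1))

red3.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from operator import itemgetter
import sys

word2count = {}

# Same reducer logic as red.py, written for Python 3.
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        word2count[word] = word2count.get(word, 0) + int(count)
    except ValueError:
        # skip blank or malformed lines
        pass

for word, count in sorted(word2count.items(), key=itemgetter(0)):
    print('%s\t%s' % (word, count))

Local test, same pattern as above:
cat /home/hduser/dat0204.log | python3 map3.py | sort | python3 red3.py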


---------------------------------------------------------------------------------------------------------------------------------------------------------
map.py
#!/usr/bin/python
# encoding:utf-8
import sys

# Mapper: each log line holds several semicolon-separated records; within a
# record the fields are tab-separated, with the movie title first.
# Emit "title\t1" for every record whose title contains "道士下山".
for line in sys.stdin:
    line = line.strip()
    splited = line.split(';')
    for words in splited:
        word = words.split('\t')[0]
        if "道士下山" in word:
            print '%s\t%s' % (word, 1)
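
Here the record layout is again only implied by the mapper: each line carries several records separated by ';', and within a record the fields are tab-separated with the title first. For one hypothetical line (made up for illustration, \t stands for a tab):

道士下山\t20150203\txxx;某其他电影\t20150203\tyyy;道士下山\t20150203\tzzz

the mapper emits:

道士下山\t1
道士下山\t1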


red.py
#!/usr/bin/python
# encoding:utf-8
from operator import itemgetter
import sys

word2count = {}

# Reducer: identical to the one in the first example; sum the counts per title.
for line in sys.stdin:
    line = line.strip()
    try:
        # split on the first tab only, so titles containing spaces still parse
        word, count = line.split('\t', 1)
        word2count[word] = word2count.get(word, 0) + int(count)
    except ValueError:
        # skip blank or malformed lines
        pass

# sorted_word2count = sorted(word2count.items(), lambda x, y: cmp(x[1], y[1]))
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))  # sort by title
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
---------------------------------------------------------------------------------
Data:
dat0203.log

Run command (same form as above):
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input /home/hduser/dat0203.log -output /home/hduser/output -file /home/hduser/map.py -file /home/hduser/red.py -mapper "python map.py" -reducer "python red.py"

submit.sh
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-input /home/hduser/dat0203.log \
-output /home/hduser/output \
-file /home/hduser/map.py \
-file /home/hduser/red.py \
-mapper "python map.py" \
-reducer "python red.py" \
-jobconf mapred.reduce.tasks=1 \
-jobconf mapred.job.name="qianjc_test"
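
Two notes on this variant: on Hadoop 2.x the streaming jar reports -jobconf as deprecated, so the -D mapred.reduce.tasks=1 form used in the first submit.sh is the preferred equivalent; and as in the first example, /home/hduser/output must be removed on HDFS (hadoop fs -rmr /home/hduser/ou*) before resubmitting, or the job will fail because the output directory already exists.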
