1. Hadoop client environment
1. If you work directly on a machine that already runs Hadoop services, you are talking to the local cluster and no client configuration is needed.
2. To access a remote Hadoop cluster, you must configure the relevant client files, in the same way you would configure a node of the cluster itself.
For cluster setup, see https://blog.youkuaiyun.com/xzpdxz/article/details/86692631 and adjust the configuration accordingly.
Also make sure Java is installed in your environment.
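A quick sanity check of the client environment (assuming hadoop is already on your PATH):

java -version       # a JDK must be installed and visible
hadoop version      # shows the client's Hadoop version
hadoop fs -ls /     # confirms the client can reach the cluster's HDFS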
2. MapReduce
mapper: can be thought of as computation over individual input splits
reducer: can be thought of as merging and aggregating the mappers' partial results
The classic example is counting word frequencies.
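To make the flow concrete, here is what happens to one made-up input line:

mapper input:        the quick brown fox the
mapper output:       the 1, quick 1, brown 1, fox 1, the 1
after shuffle/sort:  brown 1, fox 1, quick 1, the 1, the 1
reducer output:      brown 1, fox 1, quick 1, the 2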
a) Prepare a long English article as input.txt
Upload input.txt to HDFS:
[hadoop_test@hserver1 hadoop_test]# hadoop fs -put input.txt /user/hadoop_test/xxxx/input.txt
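You can verify the upload with the usual HDFS commands (same path as above):

hadoop fs -ls /user/hadoop_test/xxxx/
hadoop fs -cat /user/hadoop_test/xxxx/input.txt | head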
b) mapper.py
Splits each input line into words on whitespace and emits one "word<TAB>1" pair per word:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

# Each line on stdin is a chunk of raw text from the article.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit tab-separated key/value pairs; Hadoop Streaming splits on the tab.
        print('%s\t%s' % (word, 1))
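Streaming scripts are plain stdin/stdout filters, so they are easy to smoke-test locally before submitting anything to the cluster:

cat input.txt | python mapper.py | head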
c) reducer.py
Sums the counts emitted by the mapper to produce each word's frequency:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

current_word = None
current_count = 0
word = None

# Hadoop Streaming sorts the mapper output by key before the reducer sees it,
# so all lines for the same word arrive consecutively.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    try:
        count = int(count)
    except ValueError:
        # Skip malformed lines whose count is not an integer.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the final word group (guard against completely empty input).
if current_word == word and current_word is not None:
    print('%s\t%s' % (current_word, current_count))
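The whole job can be simulated locally by putting a sort between the two scripts, which plays the role of Hadoop's shuffle phase:

cat input.txt | python mapper.py | sort -k1,1 | python reducer.py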
d) run.sh
Submits the MapReduce job to the cluster via the Hadoop Streaming jar:
#!/bin/bash
HADOOP=hadoop   # the hadoop command; use its full path if hadoop is not on your PATH
STREAM=~/hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar   # streaming jar shipped with this environment
task_name="lijiacai"        # job name
mapper_num=2                # number of mapper tasks
reducer_num=2               # number of reducer tasks
priority=HIGH               # job priority
capacity_mapper=5000        # maximum concurrent mappers
capacity_reducer=1000       # maximum concurrent reducers
mapper_file=./mapper.py     # mapper script; any name works as long as it matches the file above
reducer_file=./reducer.py   # reducer script
input_path=/user/hadoop_test/xxxx/input.txt   # input data on HDFS
output_path=/user/hadoop_test/xxx/output      # output directory for the reducer results
name="hadoop_test"          # hadoop user name
passwd="123456"             # hadoop user password

# Delete the previous output directory first; the job cannot write to a path that already exists.
$HADOOP fs -rm -r $output_path

$HADOOP jar $STREAM \
    -D mapred.job.name="$task_name" \
    -D mapred.job.priority=$priority \
    -D mapred.map.tasks=$mapper_num \
    -D mapred.reduce.tasks=$reducer_num \
    -D mapred.job.map.capacity=$capacity_mapper \
    -D mapred.job.reduce.capacity=$capacity_reducer \
    -D hadoop.job.ugi="${name},${passwd}" \
    -input ${input_path} \
    -output ${output_path} \
    -mapper $mapper_file \
    -reducer $reducer_file \
    -file $mapper_file \
    -file $reducer_file
Running it produces output like the following:
[hadoop_test@hserver1 hadoop_test]# sh run.sh
Deleted /user/hadoop_test/lijiacai/output
19/03/20 17:21:05 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-unjar3119144331866952236/] [] /tmp/streamjob6424092896909659797.jar tmpDir=null
19/03/20 17:21:06 INFO client.RMProxy: Connecting to ResourceManager at hserver1/10.58.107.38:8032
19/03/20 17:21:06 INFO client.RMProxy: Connecting to ResourceManager at hserver1/10.58.107.38:8032
19/03/20 17:21:07 INFO mapred.FileInputFormat: Total input files to process : 1
19/03/20 17:21:07 INFO mapreduce.JobSubmitter: number of splits:2
19/03/20 17:21:07 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
19/03/20 17:21:07 INFO Configuration.deprecation: mapred.job.priority is deprecated. Instead, use mapreduce.job.priority
19/03/20 17:21:07 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/20 17:21:07 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/03/20 17:21:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1548924494331_0031
19/03/20 17:21:07 INFO impl.YarnClientImpl: Submitted application application_1548924494331_0031
19/03/20 17:21:07 INFO mapreduce.Job: The url to track the job: http://hserver1:8088/proxy/application_1548924494331_0031/
19/03/20 17:21:07 INFO mapreduce.Job: Running job: job_1548924494331_0031
19/03/20 17:21:15 INFO mapreduce.Job: Job job_1548924494331_0031 running in uber mode : false
19/03/20 17:21:15 INFO mapreduce.Job: map 0% reduce 0%
19/03/20 17:21:22 INFO mapreduce.Job: map 100% reduce 0%
19/03/20 17:21:27 INFO mapreduce.Job: map 100% reduce 100%
19/03/20 17:21:28 INFO mapreduce.Job: Job job_1548924494331_0031 completed successfully
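Once the job finishes, the word counts can be read straight from HDFS (the output_path used in run.sh above):

hadoop fs -ls /user/hadoop_test/xxx/output
hadoop fs -cat /user/hadoop_test/xxx/output/part-* | head
hadoop fs -getmerge /user/hadoop_test/xxx/output result.txt   # or merge everything into one local file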