MapReduce Java API Programming Lab (for classroom demonstration only)
Lab steps:
1) On Windows 7, create a Java Project in Eclipse, then create the WordCount.java source file and write the source code
2) Import the jar packages needed for MapReduce API programming directly from the hadoop-2.6.0-cdh5.7.0.tar.gz distribution
Extract hadoop-2.6.0-cdh5.7.0.tar.gz on Windows 7; the jar directories to import are listed below (since it is not clear exactly which jars are needed, simply import them all):
hadoop-2.6.0-cdh5.7.0\share\hadoop\mapreduce2
hadoop-2.6.0-cdh5.7.0\share\hadoop\mapreduce2\lib
hadoop-2.6.0-cdh5.7.0\share\hadoop\common
hadoop-2.6.0-cdh5.7.0\share\hadoop\common\lib
hadoop-2.6.0-cdh5.7.0\share\hadoop\hdfs
hadoop-2.6.0-cdh5.7.0\share\hadoop\hdfs\lib
3) Run the jar package with the Hadoop command line on Linux
a) First generate the jar package in Eclipse on Windows 7
Right-click the project name --> Export --> JAR file --> Next --> set the export path and jar name --> select the Main Class --> Finish; this generates WordCount.jar
b) Start Hadoop on the pseudo-distributed host: run start-dfs.sh and then start-yarn.sh to start HDFS and YARN
c) Send the WordCount.jar file from Windows 7 to Linux
Use the XFtp feature of the XShell remote terminal tool, which transfers files over the SFTP protocol, to copy WordCount.jar to the Linux system
d) hadoop fs -put word.txt /    Upload the document to be word-counted (word.txt) to HDFS
e) hadoop jar WordCount.jar /word.txt /output    Run the WordCount.jar program, writing the word counts for word.txt to /output; the output directory /output must not already exist
f) hadoop fs -cat /output/part-r-00000    View the word-count results
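The part-r-00000 file printed in step f) is plain text with one word<TAB>count pair per line (the default for TextOutputFormat). As a side illustration, not part of the lab itself, this minimal plain-Java sketch parses that format back into a map, using a hard-coded sample string in place of a real HDFS file:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Parses TextOutputFormat-style "word<TAB>count" lines back into a map.
public class PartFileParser {
    public static Map<String, Integer> parse(String partFileText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : partFileText.split("\n")) {
            if (line.isEmpty()) continue;
            String[] kv = line.split("\t");  // key and value are tab-separated
            counts.put(kv[0], Integer.parseInt(kv[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines shaped like /output/part-r-00000
        String sample = "Hadoop\t5\nSpark\t1\na\t7";
        Map<String, Integer> counts = parse(sample);
        System.out.println(counts.get("Hadoop")); // prints 5
    }
}
```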
Contents of word.txt:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines each offering local computation and storage. Rather than rely on hardware to deliver high-availability the library itself is designed to detect and handle failures at the application layer so delivering a highly-available service on top of a cluster of computers each of which may be prone to failures.A web-based tool for provisioning managing and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS Hadoop MapReduce Hive HCatalog HBase ZooKeeper Oozie Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce Pig and Hive applications visually alongwith features to diagnose their performance characteristics in a user-friendly manner.A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications including ETL machine learning stream processing and graph computation.
Output of the WordCount program:
Ambari 1
Apache 2
ETL 1
HBase 1
HCatalog 1
HDFS 1
Hadoop 5
Hive 2
It 1
MapReduce 2
Oozie 1
Pig 2
Rather 1
Spark 1
Sqoop. 1
The 1
ZooKeeper 1
a 7
ability 1
across 1
allows 1
alongwith 1
also 1
and 9
application 1
applications 2
as 1
at 1
be 1
characteristics 1
cluster 2
clusters 2
computation 1
computation. 1
compute 1
computers 2
dashboard 1
data 1
data. 1
deliver 1
delivering 1
designed 2
detect 1
diagnose 1
distributed 1
each 2
engine 1
expressive 1
failures 1
failures.A 1
fast 1
features 1
for 5
framework 1
from 1
general 1
graph 1
handle 1
hardware 1
health 1
heatmaps 1
high-availability 1
highly-available 1
in 1
includes 1
including 1
is 3
itself 1
large 1
layer 1
learning 1
library 2
local 1
machine 1
machines 1
managing 1
manner.A 1
may 1
model 1
models. 1
monitoring 1
of 7
offering 1
on 2
performance 1
processing 2
programming 2
prone 1
provides 2
provisioning 1
range 1
rely 1
scale 1
servers 1
service 1
sets 1
simple 2
single 1
so 1
software 1
storage. 1
stream 1
such 1
support 1
supports 1
than 1
that 2
the 3
their 1
thousands 1
to 7
tool 1
top 1
up 1
user-friendly 1
using 1
view 1
viewing 1
visually 1
web-based 1
which 2
wide 1
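Note the entries such as "Sqoop. 1", "failures.A 1", and "computation. 1" above: the classic WordCount tokenizes on whitespace only (e.g. via StringTokenizer), so punctuation stays attached to words, and "data" and "data." are counted as different words. The counting logic can be reproduced without a cluster; a minimal plain-Java sketch, using an in-memory map in place of the map/shuffle/reduce pipeline:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Reproduces WordCount's counting logic locally: whitespace-only tokenization,
// then one count per distinct token (TreeMap yields sorted keys, like the job output).
public class LocalWordCount {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(text); // splits on whitespace only
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("a fast engine. a simple engine");
        // Punctuation stays attached: "engine." and "engine" are distinct tokens.
        System.out.println(counts); // prints {a=2, engine=1, engine.=1, fast=1, simple=1}
    }
}
```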
Sample source code 1 (word frequency count)
The WordCount.java file
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class Wor