This article walks through installing, deploying, and using Hadoop in detail, and is shared for the benefit of other Hadoop users.
Introduction
One machine in the cluster is designated as the NameNode and a different machine as the JobTracker; these are the masters. The remaining machines act as both DataNode and TaskTracker; these are the slaves.
1. Prerequisites
1. Make sure all required software is installed on every node of the cluster: JDK and ssh.
2. ssh must be installed and sshd must be kept running, with passwordless login configured, so that the Hadoop scripts can manage the remote Hadoop daemons.
2. Setting Up the Test Environment
2.1. Preparation
Operating system: CentOS 6.4
Virtualization software: VMware 9.0
Supporting software:
jdk-7u45-linux-x64.tar.gz
hadoop-1.1.2.tar.gz
zlib-1.2.8.tar.gz (used to upgrade openssh)
openssl-1.0.1e.tar.gz (used to upgrade openssh)
openssh-6.3p1.tar.gz (used to upgrade openssh)
After installing one CentOS 6.4 virtual machine in VMware 9.0, you can export or clone it to create the other two virtual machines.
Notes:
Make sure the virtual machines and the host can communicate with each other.
Prepare the machines: one master and two slaves. Configure /etc/hosts on every machine so that all machines can reach each other by hostname, for example:
192.168.3.200 hadoop-m01 (master)
192.168.3.201 hadoop-s01 (slave1)
192.168.3.202 hadoop-s02 (slave2)
Host information and roles in the cluster:
Machine name | IP address | Role in cluster | Remarks
hadoop-m01 | 192.168.3.200 | NameNode, JobTracker |
hadoop-s01 | 192.168.3.201 | DataNode, TaskTracker |
hadoop-s02 | 192.168.3.202 | DataNode, TaskTracker |
1. Configure the virtual machine's network adapter.
2. Start the virtual machine and configure the network interface ifcfg-eth0.
# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
#BOOTPROTO="dhcp"
BOOTPROTO="none"
HWADDR="00:0C:29:81:F0:07"
IPV6INIT="no"
#NM_CONTROLLED="yes"
ONBOOT="yes"
TYPE="Ethernet"
UUID="08dce555-8571-4357-902e-8797995fc32d"
IPADDR="192.168.3.200"
NETMASK="255.255.255.0"
GATEWAY="192.168.3.1"
# vi /etc/hosts
# add
192.168.3.200 hadoop-m01
Reboot:
# reboot
2.2. Create the Group and User
# groupadd hadoop
# useradd -g hadoop hadoop
# groups hadoop            ## show the groups the user belongs to
# cat /etc/passwd | grep "^hadoop"
# cat /etc/group | grep "^hadoop"
Note:
Create the same group (hadoop) and user (hadoop) on all machines. (It is best not to install as root, because root access between machines is not recommended.)
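To confirm that the user and group were created consistently, a quick check on each machine might look like this (a minimal sketch; run it on every node):
##verify the hadoop user and its primary group
# id hadoop
##verify the group entry
# getent group hadoop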
2.3. Install the JDK
See the JDK installation document.
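The JDK installation itself is only referenced here; a minimal sketch of one possible layout, assuming jdk-7u45-linux-x64.tar.gz was downloaded to /home/hadoop/ and flattened so that JAVA_HOME=/usr/local/java matches the path used later in this article (the extracted directory name jdk1.7.0_45 is an assumption):
##as root: extract the JDK and flatten it under /usr/local/java
# mkdir -p /usr/local/java
# tar -zxvf /home/hadoop/jdk-7u45-linux-x64.tar.gz -C /usr/local/java/
# mv /usr/local/java/jdk1.7.0_45/* /usr/local/java/ && rmdir /usr/local/java/jdk1.7.0_45
##make the JDK visible to the hadoop user
# echo 'export JAVA_HOME=/usr/local/java' >> /home/hadoop/.bashrc
# echo 'export PATH=$PATH:$JAVA_HOME/bin' >> /home/hadoop/.bashrc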
##grant ownership of the Java install path to the hadoop user
[root@hadoop-m01 java]# chown -R hadoop:hadoop /usr/local/java/
##check the ownership
[root@hadoop-m01 local]# ll /usr/local
total 44
... ...
drwxr-xr-x. 2 root root 4096 Sep 23 2011 include
drwxr-xr-x. 10 hadoop hadoop 4096 Nov 1 02:04 java
drwxr-xr-x. 2 root root 4096 Sep 23 2011 lib
... ...
##switch to the hadoop user
[root@hadoop-m01 local]# su hadoop
##run java -version to verify the installation
hadoop@hadoop-m01:~$ java -version
java version "1.X.X_XX"
Java(TM) SE Runtime Environment (build 1.X.X_XX-b06)
Java HotSpot(TM) Client VM (build 20.45-b01, mixed mode, sharing)
The installation succeeded.
2.4. Install and Configure ssh
##check whether ssh is installed
[hadoop@hadoop-m01 Desktop]$ ssh -V
OpenSSH_5.3p1, OpenSSL 1.0.0-fips 29 Mar 2010
If it is not installed, install it first (step 1); otherwise skip the installation and go straight to the configuration (step 2).
1) Installation
On CentOS 6.4, install it with yum:
# yum install openssh-server openssh-clients
Once the installation finishes, the ssh command can be used directly.
Run $ netstat -nat to check whether port 22 is open.
Test it with: ssh localhost.
Enter the current user's password and press Enter. If that works, the installation succeeded; note that at this point ssh login still requires a password.
(With this default installation, the configuration files are under /etc/ssh/; the sshd configuration file is /etc/ssh/sshd_config.)
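A compact version of the checks just described (a minimal sketch; the grep is only there to filter the output):
##confirm sshd is listening on port 22, then try a password login
$ netstat -nat | grep ":22"
$ ssh localhost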
Note: ssh needs to be installed on every machine.
2) Configuration
Once Hadoop is running, the NameNode starts and stops the daemons on each DataNode over SSH (Secure Shell). Commands therefore have to run between nodes without prompting for a password, so we configure SSH to use passwordless public-key authentication.
Taking the three machines in this article as an example: hadoop-m01 is the master node and needs to connect to hadoop-s01 and hadoop-s02. Make sure ssh is installed on every machine and that the sshd service is running on the DataNode machines.
Run the following command on hadoop-m01:
[hadoop@hadoop-m01 Desktop]$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
2f:bd:cf:1f:bb:93:fe:c1:3b:d3:a4:65:36:55:de:4f hadoop@hadoop-m01
The key's randomart image is:
+--[ RSA 2048]----+
| |
| .|
| .o|
| E|
| S .o|
| o ..*|
| . o .Xo|
| . o +=+|
| ..oo==+|
+-----------------+
This command generates a key pair with an empty passphrase for the hadoop user on hadoop-m01. When prompted for the save path, just press Enter to accept the default. The generated key pair, id_rsa and id_rsa.pub, is stored in /home/hadoop/.ssh by default.
[hadoop@hadoop-m01 Desktop]$ cd /home/hadoop/.ssh/
Then run:
[hadoop@hadoop-m01 .ssh]$ cat id_rsa.pub >> authorized_keys
Copy the generated authorized_keys file to the same location, /home/hadoop/.ssh/, on every machine (a sketch follows the chmod command below).
Finally, change the permissions of authorized_keys on every machine as follows:
[hadoop@hadoop-m01 .ssh]$ chmod 644 authorized_keys
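A minimal sketch of distributing the key to the two slaves from hadoop-m01 (hostnames and paths follow this article's setup; the first copy to each slave may still prompt for the hadoop user's credentials, since passwordless login is not in place yet):
##run on hadoop-m01 as the hadoop user
$ ssh hadoop-s01 "mkdir -p /home/hadoop/.ssh && chmod 700 /home/hadoop/.ssh"
$ scp /home/hadoop/.ssh/authorized_keys hadoop-s01:/home/hadoop/.ssh/
$ ssh hadoop-s02 "mkdir -p /home/hadoop/.ssh && chmod 700 /home/hadoop/.ssh"
$ scp /home/hadoop/.ssh/authorized_keys hadoop-s02:/home/hadoop/.ssh/
##apply the permissions described above on each slave
$ ssh hadoop-s01 "chmod 644 /home/hadoop/.ssh/authorized_keys"
$ ssh hadoop-s02 "chmod 644 /home/hadoop/.ssh/authorized_keys"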
3) Set up ssh login for the NameNode
First, configure the NameNode's ssh for passwordless, automatic login.
Switch to the hadoop user (make sure the hadoop user can log in without a password, because the Hadoop installation set up later is owned by the hadoop user).
##clear the hadoop user's password
[root@hadoop-m01 local]# passwd -d hadoop
Removing password for user hadoop.
passwd: Success
##check the hadoop user's password status
[root@hadoop-m01 local]# passwd -S hadoop
hadoop NP 2013-11-13 0 99999 7 -1 (Empty password.)
The first ssh connection shows a prompt:
The authenticity of host 'hadoop-m01 (192.168.3.200)' can't be established.
RSA key fingerprint is 03:e0:30:cb:6e:13:a8:70:c9:7e:cf:ff:33:2a:67:30.
Are you sure you want to continue connecting (yes/no)?
Type yes to continue. This adds the server to your list of known hosts, so the next connection goes through directly.
SSH installation and configuration are complete.
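Before continuing, it is worth confirming that passwordless login works from the master to every node, including itself; each command below should print the remote hostname without asking for a password (a minimal sketch):
$ ssh hadoop-m01 hostname
$ ssh hadoop-s01 hostname
$ ssh hadoop-s02 hostname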
2.5. Install Hadoop
Download hadoop-1.1.2.tar.gz from the official site into the /home/hadoop/ directory.
#switch to the root user and run the following commands
##create the install path
[root@hadoop-m01 Desktop]# mkdir /usr/local/hadoop
##change to the install path
[root@hadoop-m01 Desktop]# cd /usr/local/hadoop
##extract into the install path
[root@hadoop-m01 hadoop]# tar -zxvf /home/hadoop/hadoop-1.1.2.tar.gz -C /usr/local/hadoop/
##move the contents up one level
[root@hadoop-m01 hadoop]# mv hadoop-1.1.2/* .
##check that the contents were moved
[root@hadoop-m01 hadoop]# ls hadoop-1.1.2/
##remove the now-empty extraction directory
[root@hadoop-m01 hadoop]# rm -fr hadoop-1.1.2/
##change directory
[root@hadoop-m01 hadoop]# cd /usr/local
##grant ownership to the hadoop user
[root@hadoop-m01 local]# chown -R hadoop:hadoop hadoop
##edit the conf/hadoop-env.sh file
[root@hadoop-m01 hadoop]# vi conf/hadoop-env.sh
... ...
# The java implementation to use. Required.
# update zhangzhen
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/local/java
... ...
Save the file after making the change.
##configure environment variables
[hadoop@hadoop-m01 conf]$ vi /home/hadoop/.bashrc
... ...
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
... ...
After editing, run source /home/hadoop/.bashrc to make the changes take effect.
1) Installing a Hadoop cluster usually means extracting the software onto every machine in the cluster, with the same install path everywhere. If we use HADOOP_HOME for the installation root, then HADOOP_HOME is normally the same path on all machines in the cluster.
2) If the environments of the machines in the cluster are identical, you can finish the configuration on one machine and then copy the whole configured hadoop directory to the same location on the other machines.
3) You can copy the Hadoop directory from the Master to the same directory on each Slave with scp, adjusting hadoop-env.sh on any Slave whose JAVA_HOME differs.
4) For convenience when running the hadoop command, start-all.sh, and so on, add the HADOOP_HOME and PATH exports shown above to /home/hadoop/.bashrc on the Master (a quick check is sketched below).
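A minimal sketch of those additions plus a sanity check (run as the hadoop user; hadoop version should report release 1.1.2 if the PATH is correct):
##append to /home/hadoop/.bashrc (same content as shown above)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
##reload the environment and verify that the hadoop command resolves
$ source /home/hadoop/.bashrc
$ which hadoop
$ hadoop version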
2.6. Configure the Remaining Hadoop Servers
1) Clone the virtual machine
2) Configure the IP address
##delete the auto-eth1 interface that cloning adds automatically, and give eth0 the MAC address of auto-eth1
System=>>Preferences=>>Network Connections
##edit the rules: give eth0 the MAC address of auto-eth1 and delete the auto-eth1 entry
# vi /etc/udev/rules.d/70-persistent-net.rules
##edit the eth0 configuration
# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
#BOOTPROTO="dhcp"
BOOTPROTO="none"
#IPV6INIT="yes"
IPV6INIT="no"
#NM_CONTROLLED="yes"
ONBOOT="yes"
TYPE="Ethernet"
UUID="08dce555-8571-4357-902e-8797995fc32d"
IPADDR="192.168.3.201"
NETMASK="255.255.255.0"
GATEWAY="192.168.3.1"
HWADDR=00:0C:29:5C:EA:73
PREFIX=24
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
NAME=eth0
LAST_CONNECT=1384435317
#edit the hosts file
# vi /etc/hosts
# add
192.168.3.201 hadoop-s01
3. Cluster Configuration (identical on all nodes)
[root@hadoop-s01 Desktop]# mkdir /home/hadoop/hadoop_home
[root@hadoop-s01 Desktop]# mkdir /home/hadoop/hadoop_home/var
[root@hadoop-s01 Desktop]# chown -R hadoop:hadoop /home/hadoop/hadoop_home/
3.1. Configuration file conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-m01:49000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop_home/var</value>
</property>
</configuration>
1) fs.default.name is the URI of the NameNode, in the form hdfs://hostname:port/.
2) hadoop.tmp.dir is Hadoop's default temporary directory, and it is best to set it explicitly. If a DataNode mysteriously fails to start after adding a node or in similar situations, deleting this tmp directory on that node usually fixes it. However, if you delete this directory on the NameNode machine, you must re-run the NameNode format command (a sketch follows).
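A minimal sketch of that recovery procedure, using the paths configured in this article (re-formatting erases the HDFS metadata, so only do it when the NameNode's directory really was removed):
##run on the master from /usr/local/hadoop: stop the cluster first
$ bin/stop-all.sh
##run on the affected DataNode: clear its temporary directory
$ rm -rf /home/hadoop/hadoop_home/var/*
##only if the directory was removed on the NameNode machine: re-format it
$ bin/hadoop namenode -format
##run on the master: start the cluster again
$ bin/start-all.sh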
3.2. Configuration file conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-m01:49001</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/hadoop_home/var</value>
</property>
</configuration>
1) mapred.job.tracker is the JobTracker's host (or IP) and port, in the form host:port.
3.3. Configuration file conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/name1,/home/hadoop/name2</value> <!-- paths of Hadoop's name directories -->
<description> </description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data1,/home/hadoop/data2</value>
<description> </description>
</property>
<property>
<name>dfs.replication</name>
<!-- our cluster has two DataNodes, so keep two replicas -->
<value>2</value>
</property>
</configuration>
1) dfs.name.dir is the local filesystem path where the NameNode persistently stores the namespace and transaction log. When the value is a comma-separated list of directories, the name table is replicated to all of them for redundancy.
2) dfs.data.dir is the local filesystem path where the DataNode stores block data, given as a comma-separated list. When the value is a comma-separated list of directories, data is stored across all of them, typically on different devices.
3) dfs.replication is the number of replicas to keep for the data; the default is 3. If this number is larger than the number of machines in the cluster, errors will occur.
Note:
The name1, name2, data1, and data2 directories here must not be created in advance; Hadoop creates them automatically when formatting, and pre-creating them actually causes problems.
3.4. Configure the masters and slaves Files
Edit conf/masters and conf/slaves to define the master and slave nodes. It is best to use hostnames, making sure the machines can reach each other by hostname, with one hostname per line.
vi /etc/hosts
#
192.168.3.200 hadoop-m01
192.168.3.201 hadoop-s01
192.168.3.202 hadoop-s02
vi masters:
Enter:
hadoop-m01
vi slaves:
Enter:
hadoop-s01
hadoop-s02
With the configuration done, copy the configured hadoop directory to the other machines in the cluster and make sure the settings above are correct for each of them; for example, if another machine's Java installation path differs, adjust its conf/hadoop-env.sh.
$ scp -r /usr/local/hadoop hadoop@hadoop-s01:/usr/local/
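Since this article has two slaves, a minimal sketch of pushing the configured directory to both of them from the master (assuming /usr/local/hadoop on each slave is already owned by the hadoop user, as set up earlier):
##run on hadoop-m01 as the hadoop user; copy the configured directory to each slave
$ for node in hadoop-s01 hadoop-s02; do scp -r /usr/local/hadoop $node:/usr/local/; done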
4. Starting Hadoop
4.1. Format a New Distributed Filesystem
##switch to the hadoop user on the master node
# su hadoop
##change to the Hadoop directory
$ cd /usr/local/hadoop
##format a new distributed filesystem
$ bin/hadoop namenode -format
On success, the system outputs:
Warning: $HADOOP_HOME is deprecated.
13/11/14 07:04:36 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-m01/192.168.3.200
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.1.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
13/11/14 07:04:37 INFO util.GSet: VM type = 64-bit
13/11/14 07:04:37 INFO util.GSet: 2% max memory = 19.33375 MB
13/11/14 07:04:37 INFO util.GSet: capacity = 2^21 = 2097152 entries
13/11/14 07:04:37 INFO util.GSet: recommended=2097152, actual=2097152
13/11/14 07:04:37 INFO namenode.FSNamesystem: fsOwner=hadoop
13/11/14 07:04:37 INFO namenode.FSNamesystem: supergroup=supergroup
13/11/14 07:04:37 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/11/14 07:04:37 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/11/14 07:04:37 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/11/14 07:04:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/11/14 07:04:38 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/11/14 07:04:38 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hadoop/name1/current/edits
13/11/14 07:04:38 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hadoop/name1/current/edits
13/11/14 07:04:38 INFO common.Storage: Storage directory /home/hadoop/name1 has been successfully formatted.
13/11/14 07:04:38 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/11/14 07:04:38 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hadoop/name2/current/edits
13/11/14 07:04:38 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hadoop/name2/current/edits
13/11/14 07:04:38 INFO common.Storage: Storage directory /home/hadoop/name2 has been successfully formatted.
13/11/14 07:04:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-m01/192.168.3.200
************************************************************/
##check the directories to confirm the distributed filesystem was formatted successfully
[hadoop@hadoop-m01 hadoop]$ ls /home/hadoop/na*
/home/hadoop/name1:
current image
/home/hadoop/name2:
current image
After this completes, you can see the two directories /home/hadoop/name1 and /home/hadoop/name2 on the master machine.
4.2. Start All Nodes
Disable the firewall on all nodes:
#disable the firewall at boot
# chkconfig iptables off
#stop the firewall immediately
# service iptables stop
Start Hadoop on the master node; the master will start Hadoop on all slave nodes.
##switch to the hadoop user on the master node
# su hadoop
##change to the Hadoop directory
$ cd /usr/local/hadoop
##connect to hadoop-m01, hadoop-s01, and hadoop-s02 via ssh once (so the host keys are accepted)
$ ssh hadoop-m01
$ ssh hadoop-s01
$ ssh hadoop-s02
4.2.1. Startup Method 1 (start HDFS and Map/Reduce together)
$ bin/start-all.sh    (starts HDFS and Map/Reduce together)
System output:
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-namenode-hadoop-m01.out
hadoop-s01: Warning: $HADOOP_HOME is deprecated.
hadoop-s01:
hadoop-s02: Warning: $HADOOP_HOME is deprecated.
hadoop-s02:
hadoop-s01: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-datanode-hadoop-s01.out
hadoop-s02: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-datanode-hadoop-s02.out
hadoop-m01: Warning: $HADOOP_HOME is deprecated.
hadoop-m01:
hadoop-m01: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-hadoop-m01.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-hadoop-m01.out
hadoop-s01: Warning: $HADOOP_HOME is deprecated.
hadoop-s01:
hadoop-s01: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-hadoop-s01.out
hadoop-s02: Warning: $HADOOP_HOME is deprecated.
hadoop-s02:
hadoop-s02: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-hadoop-s02.out
After this completes, you can see the two directories /home/hadoop/data1 and /home/hadoop/data2 on the slave machines (hadoop-s01 and hadoop-s02).
4.2.2. Startup Method 2 (start HDFS and Map/Reduce separately)
Starting the Hadoop cluster means starting both the HDFS cluster and the Map/Reduce cluster.
1) On the designated NameNode, run the following command to start HDFS:
$ bin/start-dfs.sh    (starts only the HDFS cluster)
Note: the bin/start-dfs.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts a DataNode daemon on every listed slave.
2) On the designated JobTracker, run the following command to start Map/Reduce:
$ bin/start-mapred.sh    (starts only Map/Reduce)
Note: the bin/start-mapred.sh script consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts a TaskTracker daemon on every listed slave.
4.3. Stop All Nodes
Stop Hadoop from the master node; the master will stop Hadoop on all slave nodes.
$ bin/stop-all.sh
The Hadoop daemons write their logs to the ${HADOOP_LOG_DIR} directory (by default ${HADOOP_HOME}/logs).
${HADOOP_HOME} is the installation path.
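When a daemon fails to start, its log is usually the quickest way to see why; a minimal sketch of checking the NameNode log on the master in this setup (the log file name follows the hadoop-<user>-<daemon>-<host>.log pattern visible in the startup output above):
##list the logs and scan them for problems
$ ls /usr/local/hadoop/logs/
$ tail -n 50 /usr/local/hadoop/logs/hadoop-hadoop-namenode-hadoop-m01.log
$ grep -i error /usr/local/hadoop/logs/*.log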
5. Testing
5.1. Browse the NameNode and JobTracker Web Interfaces
By default they are available at:
NameNode -- http://hadoop-m01:50070/
JobTracker -- http://hadoop-m01:50030/
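If a browser on the host cannot reach these pages, a quick check from the master itself helps narrow the problem down (a minimal sketch; both requests should normally come back with HTTP 200):
##probe the two web interfaces from the command line
$ curl -s -o /dev/null -w "%{http_code}\n" http://hadoop-m01:50070/
$ curl -s -o /dev/null -w "%{http_code}\n" http://hadoop-m01:50030/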
5.2. Check Ports with netstat -nat
Check whether ports 49000 and 49001 are in use.
[hadoop@hadoop-m01 hadoop]$ netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 :::50030 :::* LISTEN
tcp 0 0 :::50070 :::* LISTEN
tcp 0 0 :::22 :::* LISTEN
tcp 0 0 ::ffff:192.168.3.200:49000 :::* LISTEN
tcp 0 0 ::ffff:192.168.3.200:49001 :::* LISTEN
tcp 0 0 ::ffff:192.168.3.200:49000 ::ffff:192.168.3.202:47128 ESTABLISHED
tcp 0 0 ::ffff:192.168.3.200:49000 ::ffff:192.168.3.201:35975 ESTABLISHED
5.3. Check Processes with jps
To check whether the daemons are running, use the jps command (a ps-style utility for JVM processes). Across the cluster it lists the five Hadoop daemons (NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker) and their process IDs.
On the NameNode:
[hadoop@hadoop-m01 hadoop]$ jps
8033 Jps
7763 JobTracker
7506 NameNode
On a DataNode:
[hadoop@hadoop-s01 hadoop]# jps
4833 Jps
4579 DataNode
5.4. Check the Cluster Status
[hadoop@hadoop-m01 hadoop]$ hadoop dfsadmin -report
Safe mode is ON
Configured Capacity: 117390180352 (109.33 GB)
Present Capacity: 98652184576 (91.88 GB)
DFS Remaining: 98651758592 (91.88 GB)
DFS Used: 425984 (416 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)
Name: 192.168.3.202:50010
Decommission Status : Normal
Configured Capacity: 58695090176 (54.66 GB)
DFS Used: 212992 (208 KB)
Non DFS Used: 9386442752 (8.74 GB)
DFS Remaining: 49308434432(45.92 GB)
DFS Used%: 0%
DFS Remaining%: 84.01%
Last contact: Mon Nov 18 00:01:29 PST 2013
Name: 192.168.3.201:50010
Decommission Status : Normal
Configured Capacity: 58695090176 (54.66 GB)
DFS Used: 212992 (208 KB)
Non DFS Used: 9351553024 (8.71 GB)
DFS Remaining: 49343324160(45.95 GB)
DFS Used%: 0%
DFS Remaining%: 84.07%
Last contact: Mon Nov 18 00:01:29 PST 2013
6. Prepare Input Data (test that Hadoop runs)
6.1. Create an input Directory in HDFS
List the HDFS home directory:
[hadoop@hadoop-m01 hadoop]$ hadoop fs -ls
Create the input directory in HDFS:
[hadoop@hadoop-m01 hadoop]$ hadoop fs -mkdir input
6.2. Create Two Input Files, file01 and file02, in /home/hadoop/test on the Local Disk
[hadoop@hadoop-m01 hadoop]$ mkdir -p /home/hadoop/test
[hadoop@hadoop-m01 hadoop]$ echo "Hello World Bye World" > /home/hadoop/test/file01
[hadoop@hadoop-m01 hadoop]$ echo "Hello Hadoop Goodbye Hadoop" > /home/hadoop/test/file02
6.3. Copy file01 and file02 into HDFS
[hadoop@hadoop-m01 hadoop]$ hadoop fs -copyFromLocal /home/hadoop/test/file0* input
[hadoop@hadoop-m01 hadoop]$ hadoop fs -ls /user/hadoop/input
... ...
-rw-r--r-- 2 hadoop supergroup 22 2013-11-18 01:05 /user/hadoop/input/file01
-rw-r--r-- 2 hadoop supergroup 28 2013-11-18 01:05 /user/hadoop/input/file02
... ...
[hadoop@hadoop-m01 hadoop]$ hadoop fs -ls /user/hadoop/input/file*
Warning: $HADOOP_HOME is deprecated.
-rw-r--r-- 2 hadoop supergroup 22 2013-11-18 01:05 /user/hadoop/input/file01
-rw-r--r-- 2 hadoop supergroup 28 2013-11-18 01:05 /user/hadoop/input/file02
[hadoop@hadoop-m01 hadoop]$ hadoop dfs -mv input/file0* input/test
6.4. Run wordcount
[hadoop@hadoop-m01 hadoop]$ hadoop jar hadoop-examples-1.1.2.jar wordcount input/test output
13/11/18 18:20:47 INFO input.FileInputFormat: Total input paths to process : 2
13/11/18 18:20:47 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/18 18:20:47 WARN snappy.LoadSnappy: Snappy native library not loaded
13/11/18 18:20:48 INFO mapred.JobClient: Running job: job_201311181820_0001
13/11/18 18:20:49 INFO mapred.JobClient: map 0% reduce 0%
13/11/18 18:21:26 INFO mapred.JobClient: map 50% reduce 0%
13/11/18 18:21:33 INFO mapred.JobClient: map 100% reduce 0%
13/11/18 18:21:38 INFO mapred.JobClient: map 100% reduce 33%
13/11/18 18:21:41 INFO mapred.JobClient: map 100% reduce 100%
13/11/18 18:21:43 INFO mapred.JobClient: Job complete: job_201311181820_0001
13/11/18 18:21:43 INFO mapred.JobClient: Counters: 29
13/11/18 18:21:43 INFO mapred.JobClient: Job Counters
13/11/18 18:21:43 INFO mapred.JobClient: Launched reduce tasks=1
13/11/18 18:21:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=72101
13/11/18 18:21:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/11/18 18:21:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/11/18 18:21:43 INFO mapred.JobClient: Launched map tasks=2
13/11/18 18:21:43 INFO mapred.JobClient: Data-local map tasks=2
13/11/18 18:21:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13349
13/11/18 18:21:43 INFO mapred.JobClient: File Output Format Counters
13/11/18 18:21:43 INFO mapred.JobClient: Bytes Written=41
13/11/18 18:21:43 INFO mapred.JobClient: FileSystemCounters
13/11/18 18:21:43 INFO mapred.JobClient: FILE_BYTES_READ=79
13/11/18 18:21:43 INFO mapred.JobClient: HDFS_BYTES_READ=286
13/11/18 18:21:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=153783
13/11/18 18:21:43 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
13/11/18 18:21:43 INFO mapred.JobClient: File Input Format Counters
13/11/18 18:21:43 INFO mapred.JobClient: Bytes Read=50
13/11/18 18:21:43 INFO mapred.JobClient: Map-Reduce Framework
13/11/18 18:21:43 INFO mapred.JobClient: Map output materialized bytes=85
13/11/18 18:21:43 INFO mapred.JobClient: Map input records=2
13/11/18 18:21:43 INFO mapred.JobClient: Reduce shuffle bytes=85
13/11/18 18:21:43 INFO mapred.JobClient: Spilled Records=12
13/11/18 18:21:43 INFO mapred.JobClient: Map output bytes=82
13/11/18 18:21:43 INFO mapred.JobClient: Total committed heap usage (bytes)=246685696
13/11/18 18:21:43 INFO mapred.JobClient: CPU time spent (ms)=45630
13/11/18 18:21:43 INFO mapred.JobClient: Combine input records=8
13/11/18 18:21:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=236
13/11/18 18:21:43 INFO mapred.JobClient: Reduce input records=6
13/11/18 18:21:43 INFO mapred.JobClient: Reduce input groups=5
13/11/18 18:21:43 INFO mapred.JobClient: Combine output records=6
13/11/18 18:21:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=431890432
13/11/18 18:21:43 INFO mapred.JobClient: Reduce output records=5
13/11/18 18:21:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2194870272
13/11/18 18:21:43 INFO mapred.JobClient: Map output records=8
6.5. View the Results After the Job Completes
[hadoop@hadoop-m01 hadoop]$ hadoop dfs -ls output
Found 3 items
-rw-r--r-- 2 hadoop supergroup 0 2013-11-18 18:21 /user/hadoop/output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2013-11-18 18:20 /user/hadoop/output/_logs
-rw-r--r-- 2 hadoop supergroup 41 2013-11-18 18:21 /user/hadoop/output/part-r-00000
6.5.1. View the Output File Directly in HDFS
[hadoop@hadoop-m01 hadoop]$ hadoop dfs -cat output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
6.5.2. Copy the Output from HDFS to Local Disk and View It
[hadoop@hadoop-m01 hadoop]$ hadoop dfs -get output /home/hadoop/test/
[hadoop@hadoop-m01 hadoop]$ ls /home/hadoop/test/
file01 file02 output
[hadoop@hadoop-m01 hadoop]$ cat /home/hadoop/test/output/*
cat: /home/hadoop/test/output/_logs: Is a directory
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2