Basic Server Configuration
This Hadoop platform uses a distributed HDFS storage layout deployed on three Linux virtual machines running CentOS 8.0 or later, with Hadoop 3.3.0 (64-bit) and VMware Workstation 16 Pro.
Cluster Environment Setup
On the Windows host, confirm that all VMware network services are fully started.
Task Manager → Services:
Confirm the gateway address generated by VMware.
In the VMware menu bar: Edit → Virtual Network Editor → select VMnet8 → NAT Settings to view the gateway; this gateway is the one the three virtual machines will use.
Confirm that the VMnet8 adapter has its IP address and DNS configured.
Windows Settings → Status → Change adapter options → select VMnet8 → right-click Properties → select the IPv4 protocol and open its properties.
Set an IP address in the same subnet as the gateway, use the gateway found in VMware earlier as the default gateway, and set the preferred DNS server to Google's 8.8.8.8.
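For reference, one possible VMnet8 adapter configuration that matches the subnet used later in this guide (the host-side address 192.168.189.1 is only an assumed example):
IP address:       192.168.189.1       (assumed example host-side address)
Subnet mask:      255.255.255.0
Default gateway:  192.168.189.2       (the NAT gateway shown in the Virtual Network Editor)
Preferred DNS:    8.8.8.8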
Virtual Machine IP Configuration
Install the three virtual machines quickly from the ISO image, with 2 GB of memory and 20 GB of storage each, and name them node01, node02, and node03.
Start the virtual machines and check each one's MAC (physical) address.
In the VMware window: Settings → Network Adapter (NAT) → Advanced:
Older versions of VMware could give all three virtual machines the same MAC address, in which case you would click Generate to create a new one. From version 16 Pro onward newly installed machines do not inherit the same address, so here we only need to note the MAC address and then write it into the server's network interface configuration file:
[root@node01 ~]# vim /etc/udev/rules.d/70-persistent-ipoib.rules
Change the ATTR{address} field to the MAC address noted above, keep only one interface rule line, and preferably name the interface "eth0".
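As a rough sketch, the remaining rule line might look like the following, using the example MAC address shown later (the exact attributes generated on your system may differ):
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:df:e6:14", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"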
Next, edit the IP address configuration file:
[root@node01 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens160
DEVICE is the interface name; set it to eth0.
ONBOOT controls whether the interface comes up at boot; set it to yes.
BOOTPROTO is the IP assignment method (dhcp means automatic); change it to static and add the following fields:
IPADDR=192.168.189.100
GATEWAY=192.168.189.2
NETMASK=255.255.255.0
DNS1=8.8.8.8
HWADDR=00:0C:29:DF:E6:14
These fields are, respectively, the machine's IP address, gateway, subnet mask, DNS server, and MAC address. Save, exit, and run reboot so the changes take effect.
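Putting the fields together, a minimal ifcfg file for node01 might look like this (a sketch based on the example values above):
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.189.100
GATEWAY=192.168.189.2
NETMASK=255.255.255.0
DNS1=8.8.8.8
HWADDR=00:0C:29:DF:E6:14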
After rebooting, ping www.baidu.com to confirm that the network is reachable, then run:
[root@node01 ~]# ifconfig
to check the network configuration; the MAC address and IP address should now show the new values.
Repeat the same steps on the other virtual machines to set their MAC and IP addresses.
Changing the Hostname
[root@node01 ~]# vim /etc/sysconfig/network
Change the HOSTNAME field to node01.
This was the approach up to CentOS 6; from CentOS 7 onward, use:
[root@node01 ~]# hostnamectl set-hostname <hostname>
Set the hostnames of the three machines to node01, node02, and node03, respectively.
Configure the IP-to-hostname mapping on each server:
[root@node01 ~]# vim /etc/hosts
Add the following entries to the file:
192.168.189.100 node01 node01.hadoop.com
192.168.189.110 node02 node02.hadoop.com
192.168.189.120 node03 node03.hadoop.com
Configure the mapping on all three servers, then shut down and restart.
Disabling the Firewall
Run on all three servers:
[root@node01 ~]# service iptables stop     #stop the firewall
[root@node01 ~]# chkconfig iptables off    #disable it at boot
From CentOS 7 onward the iptables service is not installed (so the service command above is unavailable) and the firewall is managed by firewalld, so use instead:
[root@node01 ~]# systemctl stop firewalld
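To keep the firewall from coming back after a reboot (the same intent as chkconfig iptables off above), firewalld can also be disabled at boot:
[root@node01 ~]# systemctl disable firewalld    #prevent firewalld from starting at boot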
Disabling SELinux
SELinux is a Linux security subsystem. Standard Linux permission management applies to files rather than processes; SELinux adds process-level restrictions on top of the file permissions, so that a process may only operate on resources within a defined scope. With SELinux enabled, the system needs fairly complex configuration to work normally. SELinux has three modes: enforcing, permissive, and disabled. On all three servers, open the SELinux configuration file and disable the subsystem:
[root@node01 ~]# vim /etc/selinux/config
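Inside the file, change the SELINUX line as follows (a reboot is required for this to take full effect):
SELINUX=disabled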
Setting Up Passwordless SSH Login
Run the following command on all three servers to generate a public/private key pair:
[root@node01 ~]# ssh-keygen -t rsa
Press Enter three times at the prompts; the public and private key files are generated under /root/.ssh.
Once every server has its key pair, copy all the public keys to one place:
[root@node01 ~]# ssh-copy-id node01    #node01 also copies its own key to itself
[root@node02 ~]# ssh-copy-id node01    #copy the second machine's key to node01
[root@node03 ~]# ssh-copy-id node01    #copy the third machine's key to node01
Each copy prompts for the root password. Once finished, node01 holds all of the public keys; distribute them to the other servers from the SecureCRT session (or any other SSH client):
[root@node01 ~]# scp /root/.ssh/authorized_keys node02:/root/.ssh
[root@node01 ~]# scp /root/.ssh/authorized_keys node03:/root/.ssh
Afterwards, any server can log in to any other without a password:
[root@node01 ~]# ssh node02    #connects straight into node02 with no password prompt
Activate the web console with: systemctl enable --now cockpit.socket
Last login: Fri Dec 31 19:14:25 2021 from 192.168.189.100
[root@node02 ~]# exit    #the prompt shows we are now on node02; exit
logout
Connection to node02 closed.
[root@node01 ~]#    #back on node01
Installing the JDK on All Three Servers
Taking one server as an example, list the bundled OpenJDK packages and remove them:
[root@node01 ~]# rpm -qa | grep java
java-1.6.0-openjdk-1.6.0.41-1.13.13.1.el6_8.x86_64
tzdata-java-2016j-1.el6.noarch
java-1.7.0-openjdk-1.7.0.131-2.6.9.0.el6_8.x86_64
You have new mail in /var/spool/mail/root
[root@node01 ~]# rpm -e java-1.6.0-openjdk-1.6.0.41-1.13.13.1.el6_8.x86_64 tzdata-java-2016j-1.el6.noarch java-1.7.0-openjdk-1.7.0.131-2.6.9.0.el6_8.x86_64 --nodeps
Create the installation directories:
[root@node01 ~]# mkdir -p /export/softwares    #directory for installation packages
[root@node01 ~]# mkdir -p /export/servers      #installation directory
Upload and extract the package:
[root@node01 ~]# rz -e
Alternatively, drag the file directly into the package directory on the server; then extract it into the servers directory:
[root@node01 ~]# tar -zxvf jdk-8u301-linux-x64.tar.gz -C ../servers
Configure the environment variables:
[root@node01 ~]# vim /etc/profile
Add the following lines:
export JAVA_HOME=/export/servers/jdk1.8.0_301-amd64
export PATH=$JAVA_HOME/bin:$PATH
Reload the file after editing:
[root@node01 ~]# source /etc/profile
Finally, verify that the JDK was installed successfully:
[root@node01 ~]# java -version
Clock Synchronization
One option is internal synchronization against one of the cluster's own hosts; this setup uses network time synchronization instead. From CentOS 7 onward the legacy ntp package is no longer shipped because of its own shortcomings, and CentOS 8 provides NTP service through chrony out of the box. First, change the time zone to Shanghai:
[root@node01 ~]# timedatectl list-timezones | grep Shanghai
Asia/Shanghai
[root@node01 ~]# timedatectl set-timezone "Asia/Shanghai"
Open the chrony configuration file:
[root@node01 ~]# vim /etc/chrony.conf
Add the Aliyun time server:
server ntp.aliyun.com iburst
Save, exit, and restart the time service:
[root@node01 ~]# systemctl restart chronyd
[root@node01 ~]# systemctl status chronyd    #check that the time service was added
[root@node01 ~]# chronyc sources -v          #check the offset of each time source
The Aliyun time source shows up successfully and its offset is very small.
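As an optional extra step beyond the original procedure, chronyd can be enabled at boot and the clock stepped immediately; both are standard chrony/systemd commands:
[root@node01 ~]# systemctl enable chronyd    #start the time service automatically at boot
[root@node01 ~]# chronyc makestep            #step the clock immediately instead of slewing slowly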
Installing and Configuring Hadoop
Installing Hadoop
Download the Hadoop package from the official site (Index of /hadoop/common on apache.org): choose hadoop-3.3.0, download hadoop-3.3.0.tar.gz, then upload it to the server and extract it into the /export/servers directory.
Modifying the Hadoop Configuration Files
Hadoop requires changes to four configuration files: core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml, all located under /export/servers/hadoop-3.3.0/etc/hadoop. The corresponding default files (core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml) live inside their respective jar packages and can be consulted for reference while editing.
Cluster Plan
IP | Hostname | HDFS | YARN |
---|---|---|---|
192.168.189.100 | node01 | DataNode, NameNode | NodeManager, JobHistoryServer |
192.168.189.110 | node02 | DataNode | ResourceManager, NodeManager |
192.168.189.120 | node03 | DataNode, SecondaryNameNode | NodeManager |
The files are edited with Notepad++ using the NppFTP plugin.
core-site.xml configuration file
<configuration>

<!-- Specify the file system type: a distributed file system -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://node01:8020</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<!-- Directory for temporary files -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/export/servers/hadoop-3.3.0/hadoopDatas/tempDatas</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- Buffer size; tune to the actual server. The default is 4096 bytes; here it is raised to 64 KB (65536) -->
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
  <description>The size of buffer for use in sequence files.
  The size of this buffer should probably be a multiple of hardware
  page size (4096 on Intel x86), and it determines how much data is
  buffered during read and write operations.</description>
</property>

<!-- Enable the HDFS trash mechanism so deleted data can be recovered from the trash (needed for the network-disk feature); the interval is set to about 7 days, in minutes -->
<property>
  <name>fs.trash.interval</name>
  <value>10080</value>
  <description>Number of minutes after which the checkpoint
  gets deleted. If zero, the trash feature is disabled.
  This option may be configured both on the server and the
  client. If trash is disabled server side then the client
  side configuration is checked. If trash is enabled on the
  server side then the value configured on the server is
  used and the client configuration value is ignored.
  </description>
</property>

</configuration>
hdfs-site.xml configuration file
<configuration>

<!-- SecondaryNameNode server address (node03, per the cluster plan) -->
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>node03:50090</value>
  <description>
  The secondary namenode http server address and port.
  </description>
</property>

<!-- NameNode web UI address -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>node01:50070</value>
</property>

<!-- Directories for NameNode metadata -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas,file:///export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas2</value>
</property>

<!-- Directories where the DataNode stores its data blocks -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas,file:///export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas2</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks. If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices. The directories should be tagged
  with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS
  storage policies. The default storage type will be DISK if the directory does
  not have a storage type tagged explicitly. Directories that do not exist will
  be created if local filesystem permission allows.
  </description>
</property>

<!-- Directory for the NameNode edit logs -->
<property>
  <name>dfs.namenode.edits.dir</name>
  <value>file:///export/servers/hadoop-3.3.0/hadoopDatas/nn/edits</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the transaction (edits) file. If this is a comma-delimited list
  of directories then the transaction file is replicated in all of the
  directories, for redundancy. Default value is same as dfs.namenode.name.dir
  </description>
</property>

<!-- Number of block replicas -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<!-- HDFS permission checking -->
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
  <description>
  If "true", enable permission checking in HDFS.
  If "false", permission checking is turned off,
  but all other behavior is unchanged.
  Switching from one parameter value to the other does not change the mode,
  owner or group of files or directories.
  </description>
</property>

<!-- Block size: 128 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>
  The default block size for new files, in bytes.
  You can use the following suffix (case insensitive):
  k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
  Or provide complete size in bytes (such as 134217728 for 128 MB).
  </description>
</property>

</configuration>
yarn-site.xml configuration file
<configuration>
<!-- Site specific YARN configuration properties -->

<!-- Location of the YARN master node (ResourceManager) -->
<property>
  <description>The hostname of the RM.</description>
  <name>yarn.resourcemanager.hostname</name>
  <value>node02</value>
</property>

<!-- Enable log aggregation -->
<property>
  <description>Whether to enable log aggregation. Log aggregation collects
  each container's logs and moves these logs onto a file-system, for e.g.
  HDFS, after the application completes. Users can configure the
  "yarn.nodemanager.remote-app-log-dir" and
  "yarn.nodemanager.remote-app-log-dir-suffix" properties to determine
  where these logs are moved to. Users can access the logs via the
  Application Timeline Server.
  </description>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<!-- Retention time for aggregated logs -->
<property>
  <description>How long to keep aggregation logs before deleting them. -1 disables.
  Be careful set this too small and you will spam the name node.</description>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>

<!-- Auxiliary shuffle service required by MapReduce -->
<property>
  <description>A comma separated list of services where service name should only
  contain a-zA-Z0-9_ and can not start with numbers</description>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<!-- The next three properties set the memory allocation scheme for the YARN cluster -->
<property>
  <description>Amount of physical memory, in MB, that can be allocated
  for containers. If set to -1 and
  yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
  automatically calculated(in case of Windows and Linux).
  In other cases, the default is 8192MB.
  </description>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>20480</value>
</property>

<property>
  <description>The minimum allocation for every container request at the RM
  in MBs. Memory requests lower than this will be set to the value of this
  property. Additionally, a node manager that is configured to have less memory
  than this value will be shut down by the resource manager.</description>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>

<property>
  <description>Ratio between virtual memory to physical memory when
  setting memory limits for containers. Container allocations are
  expressed in terms of physical memory, and virtual memory usage
  is allowed to exceed this allocation by this ratio.
  </description>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>

<!-- Environment variables that containers inherit -->
<property>
  <description>Environment variables that containers may override rather than use NodeManager's default.</description>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

</configuration>
mapred-site.xml configuration file
<configuration>

<!-- Enable the small-job (ubertask) mode -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
  <description>Whether to enable the small-jobs "ubertask" optimization,
  which runs "sufficiently small" jobs sequentially within a single JVM.
  "Small" is defined by the following maxmaps, maxreduces, and maxbytes
  settings. Note that configurations for application masters also affect
  the "Small" definition - yarn.app.mapreduce.am.resource.mb must be
  larger than both mapreduce.map.memory.mb and mapreduce.reduce.memory.mb,
  and yarn.app.mapreduce.am.resource.cpu-vcores must be larger than
  both mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores to enable
  ubertask. Users may override this value.
  </description>
</property>

<!-- JobHistory server host and port -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>node01:10020</value>
  <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<!-- JobHistory server web UI host and port -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>node01:19888</value>
  <description>MapReduce JobHistory Server Web UI host:port</description>
</property>

<!-- Run MapReduce jobs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.
  Can be one of local, classic or yarn.
  </description>
</property>

</configuration>
Hadoop Environment Variables
Also add the JDK path to hadoop-env.sh and mapred-env.sh in the same etc/hadoop directory. This is optional, but it guards against Hadoop failing to locate the JDK.
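A minimal sketch of the line to add to both scripts, assuming the JDK path used earlier:
export JAVA_HOME=/export/servers/jdk1.8.0_301-amd64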
Edit the workers file (called slaves in older Hadoop versions) and list the three servers, with no blank lines.
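Given the cluster plan above, the workers file would contain exactly these three lines:
node01
node02
node03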
Create the directories referenced in the configuration files:
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/tempDatas
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas2
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas2
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/nn/edits
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/snn/name
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/dfs/snn/edits
Distribute the configured installation across the cluster:
[root@node01 ~]# xsync /export/servers/hadoop-3.3.0
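xsync here is a custom distribution script; if it is not available, scp achieves the same result (a sketch, assuming /export/servers already exists on the other nodes):
[root@node01 ~]# scp -r /export/servers/hadoop-3.3.0 node02:/export/servers/
[root@node01 ~]# scp -r /export/servers/hadoop-3.3.0 node03:/export/servers/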
Once all three servers have the same configuration, add the environment variables:
[root@node01 ~]# vim /etc/profile
Right after the earlier Java environment variables, append the Hadoop environment variables:
export HADOOP_HOME=/export/servers/hadoop-3.3.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
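Reload the profile so the new variables take effect, and make sure node02 and node03 receive the same additions; one way to do this (a sketch that overwrites their /etc/profile with node01's copy) is:
[root@node01 ~]# source /etc/profile
[root@node01 ~]# scp /etc/profile node02:/etc/profile
[root@node01 ~]# scp /etc/profile node03:/etc/profile
Then run source /etc/profile on node02 and node03 as well.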
Starting the Cluster
Because Hadoop is started as the root user, the startup scripts need extra user environment variables. In the start-dfs.sh script, add:
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
In the start-yarn.sh script, add:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Change into the Hadoop directory and run the format operation there. This command should be run only once; formatting again later will completely wipe the existing files.
[root@node01 ~]# cd /export/servers/hadoop-3.3.0/
[root@node01 ~]# bin/hdfs namenode -format    #format the HDFS file system
After it completes, start the cluster. On the NameNode server (node01), start the HDFS file system:
[root@node01 ~]# sbin/start-dfs.sh    #start the HDFS file system
On the ResourceManager server, start the YARN resource management system:
[root@node02 ~]# sbin/start-yarn.sh    #start the YARN resource management system
Then start the job history server:
[root@node01 ~]# sbin/mr-jobhistory-daemon.sh start historyserver
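In Hadoop 3.x this script is deprecated (though it still works); the equivalent newer command is:
[root@node01 ~]# bin/mapred --daemon start historyserver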
Use the jps command to check whether Hadoop started successfully:
[root@node01 ~]# jps    #list the Java processes to confirm the Hadoop daemons are running
You can then visit node01:50070 in a browser to view the Hadoop status.
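Based on the cluster plan above, jps on node01 would be expected to show roughly the following daemons (a sketch; the actual output also includes process IDs and may vary):
NameNode
DataNode
NodeManager
JobHistoryServer
Jps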
In addition, starting or stopping a single daemon on one server requires different commands:
#Hadoop 2.x commands
[root@node01 ~]#hadoop-daemon.sh start|stop namenode|datanode|secondarynamenode
[root@node01 ~]#yarn-daemon.sh start|stop resourcemanager|nodemanager
#Hadoop 3.x commands
[root@node01 ~]#hdfs --daemon start|stop namenode|datanode|secondarynamenode
[root@node01 ~]#yarn --daemon start|stop resourcemanager|nodemanager
Hadoop also provides official one-click scripts to start and stop the whole cluster: start-all.sh and stop-all.sh.
Running Small Hadoop Examples
Estimating Pi
With the Hadoop cluster confirmed to be up, run the official examples jar. The pi argument selects the Pi estimation job, 3 is the number of parallel map tasks, and 1000 is the number of random Monte Carlo sample points per map.
[root@node01 ~]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar pi 3 1000
The run then proceeds as follows:
[root@node01 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar pi 3 1000
Number of Maps = 3
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Starting Job
2022-01-02 16:07:58,675 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node02/192.168.189.110:8032
2022-01-02 16:07:59,475 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1641108593632_0001
2022-01-02 16:07:59,671 INFO input.FileInputFormat: Total input files to process : 3
2022-01-02 16:08:00,166 INFO mapreduce.JobSubmitter: number of splits:3
2022-01-02 16:08:00,497 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1641108593632_0001
2022-01-02 16:08:00,497 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-02 16:08:00,728 INFO conf.Configuration: resource-types.xml not found
2022-01-02 16:08:00,728 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-02 16:08:01,259 INFO impl.YarnClientImpl: Submitted application application_1641108593632_0001
2022-01-02 16:08:01,308 INFO mapreduce.Job: The url to track the job: http://node02:8088/proxy/application_1641108593632_0001/
2022-01-02 16:08:01,308 INFO mapreduce.Job: Running job: job_1641108593632_0001
2022-01-02 16:08:15,052 INFO mapreduce.Job: Job job_1641108593632_0001 running in uber mode : true
2022-01-02 16:08:15,065 INFO mapreduce.Job: map 0% reduce 0%
2022-01-02 16:08:17,230 INFO mapreduce.Job: map 67% reduce 0%
2022-01-02 16:08:18,251 INFO mapreduce.Job: map 100% reduce 0%
2022-01-02 16:08:19,266 INFO mapreduce.Job: map 100% reduce 100%
2022-01-02 16:08:19,303 INFO mapreduce.Job: Job job_1641108593632_0001 completed successfully
2022-01-02 16:08:19,428 INFO mapreduce.Job: Counters: 57
File System Counters
FILE: Number of bytes read=252
FILE: Number of bytes written=612
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2680
HDFS: Number of bytes written=1099179
HDFS: Number of read operations=97
HDFS: Number of large read operations=0
HDFS: Number of write operations=23
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=3
Launched reduce tasks=1
Other local map tasks=3
Total time spent by all maps in occupied slots (ms)=0
Total time spent by all reduces in occupied slots (ms)=0
TOTAL_LAUNCHED_UBERTASKS=4
NUM_UBER_SUBMAPS=3
NUM_UBER_SUBREDUCES=1
Total time spent by all map tasks (ms)=2717
Total time spent by all reduce tasks (ms)=1493
Total vcore-milliseconds taken by all map tasks=0
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=0
Total megabyte-milliseconds taken by all reduce tasks=0
Map-Reduce Framework
Map input records=3
Map output records=6
Map output bytes=54
Map output materialized bytes=84
Input split bytes=426
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=84
Reduce input records=6
Reduce output records=0
Spilled Records=12
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=849
CPU time spent (ms)=1440
Physical memory (bytes) snapshot=1276268544
Virtual memory (bytes) snapshot=11396325376
Total committed heap usage (bytes)=684646400
Peak Map Physical memory (bytes)=318140416
Peak Map Virtual memory (bytes)=2847207424
Peak Reduce Physical memory (bytes)=343498752
Peak Reduce Virtual memory (bytes)=2856808448
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=354
File Output Format Counters
Bytes Written=97
Job Finished in 20.926 seconds
Estimated value of Pi is 3.14133333333333333333
The computed result is 3.14133333333333333333. Open the ResourceManager address on port 8088 (node02:8088) to view the job records:
WordCount Word Frequency Statistics
Create a test file with some arbitrary content.
Upload the file to the /input directory in the HDFS file system, creating the directory first if it does not exist:
[root@node01 ~]# hadoop fs -mkdir /input
[root@node01 ~]# hadoop fs -put 计数.txt /input
Alternatively, the file can be added to HDFS through the web UI: open the NameNode address on port 50070, click Utilities → Browse the file system in the menu bar, then create the directory and upload the file.
Back on the server, run the word count job:
[root@node01 ~]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar wordcount /input /output
The output path /output must not already exist; the job creates it and writes the results into it automatically. When the job finishes, go back to the HDFS browser and refresh, open the output directory, click the part-r-xxxx file, and choose to view its contents to see the word counts.
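The result can also be read directly from the command line with the HDFS shell:
[root@node01 ~]# hadoop fs -cat /output/part-r-*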