Hadoop Platform Setup Manual

Basic Server Configuration

This Hadoop platform is built as a fully distributed HDFS cluster on three Linux virtual machines running CentOS 8.0 or later. The Hadoop version is 3.3.0 (x64) and the VMware Workstation version is 16 Pro.
Cluster Environment Setup
On the Windows host, confirm that all of VMware's network services are running.
Task Manager → Services:
The VMware services should show a status of Running.
Confirm the gateway address generated by VMware.
In the VMware menu bar: Edit → Virtual Network Editor → select VMnet8 → NAT Settings to view the gateway; this gateway is the one used by all three virtual machines.
The subnet gateway shown here is 192.168.189.2.
Confirm that the VMnet8 adapter has its IP address and DNS configured.
Windows Settings → Status → Change adapter options → select VMnet8 → right-click Properties → open the IPv4 protocol properties.
Set an IP address in the same subnet as the gateway, use the gateway noted above as the default gateway, and set the preferred DNS server to Google's 8.8.8.8.
(Figure: the IPv4 protocol properties after configuration)

Virtual Machine IP Configuration

Install the three virtual machines quickly from the ISO image, with 2 GB of memory and 20 GB of storage each, and name them node01, node02, and node03.
Start the virtual machines and check each one's MAC address.
In the VMware VM window: Settings → Network Adapter (NAT) → Advanced:
The MAC address shown here is 00:0C:29:DF:E6:14.
Older versions of VMware could give all three virtual machines the same MAC address, in which case you had to click Generate to create a new one. From version 16 Pro onward the installed VMs no longer share an address, so here we only note the MAC address down. Next, write it into the server's network interface rules file:

[root@node01 ~]# vim /etc/udev/rules.d/70-persistent-ipoib.rules

Change the ATTR{address} field to the MAC address noted above, keep only a single interface rule line, and preferably name the interface "eth0".

(Figure: the rules file before and after modification)
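For reference, after editing, the single rule line in that file typically looks roughly like this (a sketch based on the classic persistent-net rule format; substitute your own MAC address):

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:df:e6:14", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"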
Then edit the IP address configuration file:

[root@node01 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens160

DEVICE is the interface name; set it to eth0.
ONBOOT controls whether the interface comes up at boot; set it to yes.
BOOTPROTO is the IP allocation method; dhcp means automatic allocation. Change it to static (static allocation) and add the following fields:

IPADDR=192.168.189.100
GATEWAY=192.168.189.2
NETMASK=255.255.255.0
DNS1=8.8.8.8
HWADDR=00:0C:29:DF:E6:14

These fields are, respectively, the machine's own IP address, gateway, subnet mask, DNS server, and MAC address. Save, exit, and reboot so the changes take effect.
(Figure: the configured file)
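Put together, the edited file ends up looking roughly like this (a reduced sketch for node01; node02 and node03 use .110 and .120, and CentOS may keep additional generated lines such as UUID or NAME):

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.189.100
GATEWAY=192.168.189.2
NETMASK=255.255.255.0
DNS1=8.8.8.8
HWADDR=00:0C:29:DF:E6:14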
After rebooting, ping www.baidu.com to confirm that the network is reachable, then run:

[root@node01 ~]# ifconfig

to check the network configuration; the MAC address and IP address should now reflect the changes.
(Figure: the updated network configuration)
Repeat the same steps on the other virtual machines to set their MAC and IP addresses.
Setting the Hostname

[root@node01 ~]# vim /etc/sysconfig/network

Change the HOSTNAME field to node01.
That is the approach used up to CentOS 6; from CentOS 7 onward use:

[root@node01 ~]# hostnamectl set-hostname <hostname>

Set the hostnames of the three machines to node01, node02, and node03 respectively.
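Concretely, run one command on each machine:

[root@node01 ~]# hostnamectl set-hostname node01
[root@node02 ~]# hostnamectl set-hostname node02
[root@node03 ~]# hostnamectl set-hostname node03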
Configure the IP-to-hostname mapping on every server:

[root@node01 ~]# vim /etc/hosts

Add the following entries to the file:

192.168.189.100   node01   node01.hadoop.com
192.168.189.110   node02   node02.hadoop.com
192.168.189.120   node03   node03.hadoop.com

Configure this mapping on all three servers, then reboot.

Disabling the Firewall

On all three servers run:

[root@node01 ~]# service iptables stop      # stop the firewall
[root@node01 ~]# chkconfig iptables off     # remove it from the boot sequence

From CentOS 7 onward the iptables service is not installed (so the service command above is unavailable) and the firewall is managed by firewalld instead, so use:

[root@node01 ~]# systemctl stop firewalld
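To keep firewalld from coming back after a reboot (the systemd counterpart of chkconfig off above), also disable it:

[root@node01 ~]# systemctl disable firewalld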

Disabling SELinux

SELinux is a Linux security subsystem. Standard Linux permissions apply to files rather than to processes; SELinux adds process-level restrictions on top of the file permissions, so that a process may only operate on resources within a defined scope. With SELinux enabled, fairly complex policy configuration is needed for the system to work normally. SELinux has three modes: enforcing, permissive, and disabled. On all three servers, open the SELinux configuration file and turn the subsystem off:

[root@node01 ~]# vim /etc/selinux/config

Change the SELINUX field from enforcing to disabled.
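The same change can be made non-interactively. Note that the disabled setting only takes full effect after a reboot, while setenforce 0 drops the running system to permissive mode immediately:

[root@node01 ~]# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
[root@node01 ~]# setenforce 0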

Setting Up Passwordless SSH Login

Run the following command on all three servers to generate a public/private key pair:

[root@node01 ~]# ssh-keygen -t rsa

Press Enter three times at the prompts; the public and private key files are then generated under /root/.ssh.
Once every server has its key pair, copy all of the public keys to one place:

[root@node01 ~]# ssh-copy-id node01      # node01 copies its own key as well
[root@node02 ~]# ssh-copy-id node01      # copy node02's key to node01
[root@node03 ~]# ssh-copy-id node01      # copy node03's key to node01

Each copy prompts for the root password. When this is done, node01 holds all of the public keys; distribute the combined file to the other servers from a SecureCRT session (or any other SSH client):

[root@node01 ~]# scp /root/.ssh/authorized_keys node02:/root/.ssh
[root@node01 ~]# scp /root/.ssh/authorized_keys node03:/root/.ssh

After this, any server can log in to any other server without being asked for a password:

[root@node01 ~]# ssh node02     # drops straight into node02 with no password prompt
Activate the web console with: systemctl enable --now cockpit.socket

Last login: Fri Dec 31 19:14:25 2021 from 192.168.189.100
[root@node02 ~]# exit                    # the session is now on node02; exit
logout
Connection to node02 closed.
[root@node01 ~]#                       # back on node01

Installing the JDK on All Three Servers

Taking one server as an example, list the bundled OpenJDK packages and remove them:

[root@node01 ~]# rpm -qa | grep java
java-1.6.0-openjdk-1.6.0.41-1.13.13.1.el6_8.x86_64
tzdata-java-2016j-1.el6.noarch
java-1.7.0-openjdk-1.7.0.131-2.6.9.0.el6_8.x86_64
You have new mail in /var/spool/mail/root
[root@node01 ~]# rpm -e --nodeps java-1.6.0-openjdk-1.6.0.41-1.13.13.1.el6_8.x86_64 \
    tzdata-java-2016j-1.el6.noarch java-1.7.0-openjdk-1.7.0.131-2.6.9.0.el6_8.x86_64
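The exact package names differ from release to release (the listing above comes from an older system). A generic sketch that removes whatever Java packages rpm reports is:

[root@node01 ~]# rpm -qa | grep -i java | xargs -r rpm -e --nodeps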

Create the installation directories:

[root@node01 ~]# mkdir -p /export/softwares           # directory for installation packages
[root@node01 ~]# mkdir -p /export/servers             # installation directory

Upload and extract:

[root@node01 ~]# rz -e

or simply drag the file into the package directory on the server, then extract it into the servers directory:

[root@node01 ~]# tar -zxvf /export/softwares/jdk-8u301-linux-x64.tar.gz -C /export/servers/

Configure the environment variables:

[root@node01 ~]# vim /etc/profile

Add the following lines:

export JAVA_HOME=/export/servers/jdk1.8.0_301-amd64
export PATH=$JAVA_HOME/bin:$PATH

After editing, reload the file:

[root@node01 ~]# source /etc/profile

Finally, verify that the JDK was installed successfully:

[root@node01 ~]# java -version

(Figure: output confirming a successful JDK installation)

Clock Synchronization

You can synchronize the cluster internally against one of your own hosts, but this setup uses network time synchronization. From CentOS 7 onward the old ntp package is no longer shipped because of its shortcomings; CentOS 8 provides time synchronization through the built-in chrony service. First set the time zone to Shanghai:

[root@node01 ~]# timedatectl list-timezones | grep Shanghai
Asia/Shanghai
[root@node01 ~]# timedatectl set-timezone "Asia/Shanghai"

Open the chrony configuration file:

[root@node01 ~]# vim /etc/chrony.conf

Add the Alibaba Cloud time server:

server ntp.aliyun.com iburst

Save, exit, and restart the time service:

[root@node01 ~]# systemctl restart chronyd
[root@node01 ~]# systemctl status chronyd      # check that the time service is running
[root@node01 ~]# chronyc sources -v            # show the offset of each time source

(Figure: checking the time configuration)
The Alibaba Cloud time source shows up in the list and its offset is small.
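As a non-interactive alternative to editing the file by hand, and to make sure chronyd also starts at boot, something like this works (a sketch):

[root@node01 ~]# echo "server ntp.aliyun.com iburst" >> /etc/chrony.conf
[root@node01 ~]# systemctl enable --now chronyd
[root@node01 ~]# chronyc sources -v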

Installing and Configuring Hadoop

Installing Hadoop

Download the Hadoop package from the official Apache site (Index of /hadoop/common (apache.org)), choose hadoop-3.3.0,
download hadoop-3.3.0.tar.gz, then upload it to the server and extract it into /export/servers.
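Alternatively, fetch and extract it directly on node01 (the URL below points at the Apache archive mirror and may need adjusting if the mirror layout changes):

[root@node01 ~]# cd /export/softwares
[root@node01 softwares]# wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
[root@node01 softwares]# tar -zxvf hadoop-3.3.0.tar.gz -C /export/servers/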

Modifying the Hadoop Configuration Files

Hadoop requires changes to four configuration files: core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml, all located under /export/servers/hadoop-3.3.0/etc/hadoop. Their default counterparts (core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml) live inside the corresponding jar files and can be consulted as references while editing.

Cluster plan

| IP              | Hostname | HDFS                        | YARN                              |
|-----------------|----------|-----------------------------|-----------------------------------|
| 192.168.189.100 | node01   | DataNode, NameNode          | NodeManager, MapReduce JobHistory |
| 192.168.189.110 | node02   | DataNode                    | ResourceManager, NodeManager      |
| 192.168.189.120 | node03   | DataNode, SecondaryNameNode | NodeManager                       |

During this setup, Notepad++ with the NppFTP plugin was used to edit the files.

The core-site.xml configuration file

<configuration>

<!-- File system type: distributed file system (HDFS), with node01 as the NameNode -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://node01:8020</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<!-- Directory for temporary files -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/export/servers/hadoop-3.3.0/hadoopDatas/tempDatas</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- Buffer size; tune it to the hardware. The default is 4096; here it is raised to 65536 (64 KB) -->
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
  <description>The size of buffer for use in sequence files.
  The size of this buffer should probably be a multiple of hardware
  page size (4096 on Intel x86), and it determines how much data is
  buffered during read and write operations.</description>
</property>

<!-- Enable the HDFS trash mechanism so deleted data can be recovered from the trash
     (needed for the network-disk feature). Retention is about 7 days, in minutes -->
<property>
  <name>fs.trash.interval</name>
  <value>10080</value>
  <description>Number of minutes after which the checkpoint
  gets deleted.  If zero, the trash feature is disabled.
  This option may be configured both on the server and the
  client. If trash is disabled server side then the client
  side configuration is checked. If trash is enabled on the
  server side then the value configured on the server is
  used and the client configuration value is ignored.
  </description>
</property>

</configuration>

The hdfs-site.xml configuration file

<configuration>

<!-- SecondaryNameNode HTTP address (node03, per the cluster plan) -->
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>node03:50090</value>
  <description>
    The secondary namenode http server address and port.
  </description>
</property>

<!-- NameNode web UI address -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>node01:50070</value>
</property>

<!-- Where the NameNode stores its metadata -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas,file:///export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas2</value>
</property>

<!-- Where the DataNode stores its data blocks -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas,file:///export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas2</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices. The directories should be tagged
  with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS
  storage policies. The default storage type will be DISK if the directory does
  not have a storage type tagged explicitly. Directories that do not exist will
  be created if local filesystem permission allows.
  </description>
</property>

<!-- Where the NameNode stores its edit log -->
<property>
  <name>dfs.namenode.edits.dir</name>
  <value>file:///export/servers/hadoop-3.3.0/hadoopDatas/nn/edits</value>
  <description>Determines where on the local filesystem the DFS name node
      should store the transaction (edits) file. If this is a comma-delimited list
      of directories then the transaction file is replicated in all of the 
      directories, for redundancy. Default value is same as dfs.namenode.name.dir
  </description>
</property>

<!-- Number of block replicas -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication. 
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<!-- HDFS permission checking -->
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
  <description>
    If "true", enable permission checking in HDFS.
    If "false", permission checking is turned off,
    but all other behavior is unchanged.
    Switching from one parameter value to the other does not change the mode,
    owner or group of files or directories.
  </description>
</property>

<!-- Block size: 128 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>
      The default block size for new files, in bytes.
      You can use the following suffix (case insensitive):
      k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
      Or provide complete size in bytes (such as 134217728 for 128 MB).
  </description>
</property>

</configuration>

The yarn-site.xml configuration file

<configuration>
<!-- Site specific YARN configuration properties -->

<!-- YARN master node (ResourceManager) location -->
  <property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>node02</value>
  </property>

  <!-- Enable log aggregation -->
  <property>
    <description>Whether to enable log aggregation. Log aggregation collects
      each container's logs and moves these logs onto a file-system, for e.g.
      HDFS, after the application completes. Users can configure the
      "yarn.nodemanager.remote-app-log-dir" and
      "yarn.nodemanager.remote-app-log-dir-suffix" properties to determine
      where these logs are moved to. Users can access the logs via the
      Application Timeline Server.
    </description>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <!-- How long to keep aggregated logs -->
  <property>
    <description>How long to keep aggregation logs before deleting them.  -1 disables. 
    Be careful set this too small and you will spam the name node.</description>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>

  <!-- Auxiliary shuffle service used by MapReduce -->
  <property>
    <description>A comma separated list of services where service name should only
      contain a-zA-Z0-9_ and can not start with numbers</description>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <!-- The next three properties set the memory allocation scheme for the YARN cluster -->
  <property>
    <description>Amount of physical memory, in MB, that can be allocated 
    for containers. If set to -1 and
    yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
    automatically calculated(in case of Windows and Linux).
    In other cases, the default is 8192MB.
    </description>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>20480</value>
  </property>

  <property>
    <description>The minimum allocation for every container request at the RM
    in MBs. Memory requests lower than this will be set to the value of this
    property. Additionally, a node manager that is configured to have less memory
    than this value will be shut down by the resource manager.</description>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>

  <property>
    <description>Ratio between virtual memory to physical memory when
    setting memory limits for containers. Container allocations are
    expressed in terms of physical memory, and virtual memory usage
    is allowed to exceed this allocation by this ratio.
    </description>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
  </property>

  <!-- Environment variables that containers inherit from the NodeManager -->
  <property>
    <description>Environment variables that containers may override rather than use NodeManager's default.</description>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>

</configuration>

The mapred-site.xml configuration file

<configuration>

<!-- Enable uber (small-job) mode -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
  <description>Whether to enable the small-jobs "ubertask" optimization,
  which runs "sufficiently small" jobs sequentially within a single JVM.
  "Small" is defined by the following maxmaps, maxreduces, and maxbytes
  settings. Note that configurations for application masters also affect
  the "Small" definition - yarn.app.mapreduce.am.resource.mb must be
  larger than both mapreduce.map.memory.mb and mapreduce.reduce.memory.mb,
  and yarn.app.mapreduce.am.resource.cpu-vcores must be larger than
  both mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores to enable
  ubertask. Users may override this value.
  </description>
</property>

<!-- JobHistory server host and port -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>node01:10020</value>
  <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<!-- JobHistory web UI host and port -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>node01:19888</value>
  <description>MapReduce JobHistory Server Web UI host:port</description>
</property>

<!-- Run MapReduce jobs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.
  Can be one of local, classic or yarn.
  </description>
</property>

</configuration>

Hadoop Environment Variables

Also add the JDK path to hadoop-env.sh and mapred-env.sh in the same etc/hadoop directory. This step is optional, but it prevents Hadoop from failing to locate the JDK.
(Figure: the added export line)
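The line added to both files is the same JAVA_HOME export that was put into /etc/profile earlier (adjust the path if your JDK lives elsewhere):

export JAVA_HOME=/export/servers/jdk1.8.0_301-amd64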
Edit the workers file (named slaves in older Hadoop versions) and add the three servers, with no blank lines.
(Figure: the workers file after editing)
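The workers file should contain exactly these three lines:

node01
node02
node03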
Create the directories referenced in the configuration files:

[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/tempDatas
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/namenodeDatas2
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/datanodeDatas2
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/nn/edits
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/snn/name
[root@node01 ~]# mkdir -p /export/servers/hadoop-3.3.0/hadoopDatas/dfs/snn/edits

Distribute the configured files across the cluster:

[root@node01 ~]# xsync /export/servers/hadoop-3.3.0
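xsync here is a commonly used custom rsync wrapper script rather than a standard command; if it is not available, plain scp does the same job (a sketch):

[root@node01 ~]# scp -r /export/servers/hadoop-3.3.0 node02:/export/servers/
[root@node01 ~]# scp -r /export/servers/hadoop-3.3.0 node03:/export/servers/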

Once all three servers hold the same configuration, add the environment variables:

[root@node01 ~]# vim /etc/profile

Immediately after the earlier Java environment variables, append the Hadoop variables:

export HADOOP_HOME=/export/servers/hadoop-3.3.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
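The profile changes are needed on all three nodes; one way (assuming the three machines' /etc/profile files are otherwise identical, since the VMs were set up the same way) is to copy the file over and reload it:

[root@node01 ~]# scp /etc/profile node02:/etc/profile
[root@node01 ~]# scp /etc/profile node03:/etc/profile
# then run 'source /etc/profile' in a shell on each node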

Starting the Cluster

Because Hadoop is started as the root user here, the startup scripts must be edited to declare which users run each daemon. In the start-dfs.sh script add:

HDFS_DATANODE_USER=root  
HDFS_DATANODE_SECURE_USER=hdfs  
HDFS_NAMENODE_USER=root  
HDFS_SECONDARYNAMENODE_USER=root

In the start-yarn.sh script add:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
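The matching stop scripts need the same declarations near the top, otherwise stop-dfs.sh and stop-yarn.sh will later refuse to run as root. Add to stop-dfs.sh:

HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

and to stop-yarn.sh:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root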

Go into the Hadoop directory and run the initialization from there. The format command must be run only once; formatting again later would completely wipe the existing file system data.

[root@node01 ~]# cd /export/servers/hadoop-3.3.0/
[root@node01 hadoop-3.3.0]# bin/hdfs namenode -format            # format the HDFS file system

Once formatting completes, start the cluster. On the NameNode server (node01) start the HDFS file system:

[root@node01 hadoop-3.3.0]# sbin/start-dfs.sh                    # start the HDFS file system

On the ResourceManager node (node02) start the YARN resource manager:

[root@node02 hadoop-3.3.0]# sbin/start-yarn.sh                   # start the YARN resource manager

Then start the JobHistory server (on node01, per the cluster plan):

[root@node01 hadoop-3.3.0]# sbin/mr-jobhistory-daemon.sh start historyserver
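In Hadoop 3.x the per-daemon helper scripts are deprecated in favour of the mapred/hdfs/yarn launchers; the equivalent (and warning-free) command is:

[root@node01 hadoop-3.3.0]# bin/mapred --daemon start historyserver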

Use the jps command to check whether Hadoop started successfully:

[root@node01 ~]# jps                  # list the Java processes to confirm the Hadoop daemons are up

(Figure: processes on node01)
(Figure: processes on node02)
(Figure: processes on node03)
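Given the cluster plan above, jps should report roughly the following daemons on each node (jps also prints a process ID in front of each name, omitted here):

# node01
NameNode
DataNode
NodeManager
JobHistoryServer
Jps

# node02
ResourceManager
NodeManager
DataNode
Jps

# node03
SecondaryNameNode
DataNode
NodeManager
Jps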
You can then open node01:50070 in a browser to check the state of the Hadoop cluster. (Figure: the Hadoop web UI)
If you need to start or stop an individual daemon on one server, use these commands instead:

# Hadoop 2.x commands
[root@node01 ~]# hadoop-daemon.sh start|stop namenode|datanode|secondarynamenode
[root@node01 ~]# yarn-daemon.sh start|stop resourcemanager|nodemanager
# Hadoop 3.x commands
[root@node01 ~]# hdfs --daemon start|stop namenode|datanode|secondarynamenode
[root@node01 ~]# yarn --daemon start|stop resourcemanager|nodemanager

Hadoop also ships scripts that start and stop the whole cluster in one go: start-all.sh and stop-all.sh.

Running Small Hadoop Examples

Estimating Pi

With the Hadoop cluster fully started, run the official examples jar. The pi argument selects the pi estimator, 3 is the number of map tasks run in parallel, and 1000 is the number of random Monte Carlo sample points per map.

[root@node01 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar pi 3 1000
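The examples jar is not in root's home directory; it ships with the Hadoop installation, so either cd into its directory first (as the prompt above suggests) or pass the full path:

[root@node01 ~]# cd /export/servers/hadoop-3.3.0/share/hadoop/mapreduce
# or, from anywhere:
[root@node01 ~]# hadoop jar /export/servers/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 3 1000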

The run then proceeds as follows:

[root@node01 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar pi 3 1000
Number of Maps  = 3
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Starting Job
2022-01-02 16:07:58,675 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node02/192.168.189.110:8032
2022-01-02 16:07:59,475 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1641108593632_0001
2022-01-02 16:07:59,671 INFO input.FileInputFormat: Total input files to process : 3
2022-01-02 16:08:00,166 INFO mapreduce.JobSubmitter: number of splits:3
2022-01-02 16:08:00,497 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1641108593632_0001
2022-01-02 16:08:00,497 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-02 16:08:00,728 INFO conf.Configuration: resource-types.xml not found
2022-01-02 16:08:00,728 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-02 16:08:01,259 INFO impl.YarnClientImpl: Submitted application application_1641108593632_0001
2022-01-02 16:08:01,308 INFO mapreduce.Job: The url to track the job: http://node02:8088/proxy/application_1641108593632_0001/
2022-01-02 16:08:01,308 INFO mapreduce.Job: Running job: job_1641108593632_0001
2022-01-02 16:08:15,052 INFO mapreduce.Job: Job job_1641108593632_0001 running in uber mode : true
2022-01-02 16:08:15,065 INFO mapreduce.Job:  map 0% reduce 0%
2022-01-02 16:08:17,230 INFO mapreduce.Job:  map 67% reduce 0%
2022-01-02 16:08:18,251 INFO mapreduce.Job:  map 100% reduce 0%
2022-01-02 16:08:19,266 INFO mapreduce.Job:  map 100% reduce 100%
2022-01-02 16:08:19,303 INFO mapreduce.Job: Job job_1641108593632_0001 completed successfully
2022-01-02 16:08:19,428 INFO mapreduce.Job: Counters: 57
        File System Counters
                FILE: Number of bytes read=252
                FILE: Number of bytes written=612
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2680
                HDFS: Number of bytes written=1099179
                HDFS: Number of read operations=97
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=23
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=3
                Launched reduce tasks=1
                Other local map tasks=3
                Total time spent by all maps in occupied slots (ms)=0
                Total time spent by all reduces in occupied slots (ms)=0
                TOTAL_LAUNCHED_UBERTASKS=4
                NUM_UBER_SUBMAPS=3
                NUM_UBER_SUBREDUCES=1
                Total time spent by all map tasks (ms)=2717
                Total time spent by all reduce tasks (ms)=1493
                Total vcore-milliseconds taken by all map tasks=0
                Total vcore-milliseconds taken by all reduce tasks=0
                Total megabyte-milliseconds taken by all map tasks=0
                Total megabyte-milliseconds taken by all reduce tasks=0
        Map-Reduce Framework
                Map input records=3
                Map output records=6
                Map output bytes=54
                Map output materialized bytes=84
                Input split bytes=426
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=84
                Reduce input records=6
                Reduce output records=0
                Spilled Records=12
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=849
                CPU time spent (ms)=1440
                Physical memory (bytes) snapshot=1276268544
                Virtual memory (bytes) snapshot=11396325376
                Total committed heap usage (bytes)=684646400
                Peak Map Physical memory (bytes)=318140416
                Peak Map Virtual memory (bytes)=2847207424
                Peak Reduce Physical memory (bytes)=343498752
                Peak Reduce Virtual memory (bytes)=2856808448
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=354
        File Output Format Counters 
                Bytes Written=97
Job Finished in 20.926 seconds
Estimated value of Pi is 3.14133333333333333333

The computed estimate is 3.14133333..., and the job record can be viewed in the ResourceManager web UI at node02:8088:
(Figure: the web UI shows the submitted MapReduce job)

WordCount Word-Frequency Statistics

Create a test file with some arbitrary text: 计数.txt
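For example (the file name is kept from the original write-up; its contents can be any words you like):

[root@node01 ~]# echo "hadoop hive spark hadoop flume spark hadoop" > 计数.txt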
Upload the file to the /input directory in HDFS, creating the directory first if it does not exist:

[root@node01 ~]#hadoop fs -mkdir  /input
[root@node01 ~]#hadoop fs -put 计数.txt  /input

Alternatively, the file can be added to HDFS through the web UI: open node01:50070, click Utilities → Browse the file system in the menu bar, create the directory, and upload the file there:
(Figure: uploading the file)
Back on the server, run the word-count job:

[root@node01 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar wordcount /input /output

The output path /output must not already exist; the job creates it and writes the results into it automatically. When the job finishes, go back to the HDFS browser, refresh, enter the /output directory, open part-r-xxxx, and choose to view the word counts.
(Figure: the word-count results)
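The result can also be read directly from the command line; with a single reducer the output file is part-r-00000:

[root@node01 ~]# hadoop fs -cat /output/part-r-00000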
