Hadoop 3.1.3 Cluster Installation
I. Preparation
1. Prepare the virtual machines
Create three virtual machines in VMware, each running a minimal CentOS 7 installation.
2. Install required packages
sudo yum install -y epel-release
sudo yum install -y psmisc nc net-tools rsync vim lrzsz ntp libzstd openssl-static tree iotop git
3. Set the hostnames
The three hostnames are:
- linux181
- linux182
- linux183
Set them with the following commands:
# run on linux181
hostnamectl --static set-hostname linux181
# run on linux182
hostnamectl --static set-hostname linux182
# run on linux183
hostnamectl --static set-hostname linux183
4. Disable the firewall
The firewall must be disabled on all three hosts.
# disable start on boot
systemctl disable firewalld
# stop the firewall
systemctl stop firewalld
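To confirm the firewall is really off, a quick check on each host (a minimal sketch using standard systemctl queries):
# should print "inactive"
systemctl is-active firewalld
# should print "disabled"
systemctl is-enabled firewalld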
5. Configure static IPs and host mappings
Static IP configuration, using linux181 as an example:
# edit the network configuration
[root@linux181 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static # change from dhcp to static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=6d4c656f-9a1d-4029-a8f3-3773afdcd31c
DEVICE=ens33
ONBOOT=yes # defaults to no; change to yes
# set the addresses below to match your own network; my VMs use bridged networking
IPADDR=192.168.31.181
NETMASK=255.255.255.0
GATEWAY=192.168.31.1
DNS1=192.168.31.1
Restart the network service after editing:
systemctl restart network
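After the restart, it is worth confirming that the static address took effect (a quick check, assuming the interface and gateway configured above):
# the inet line should show 192.168.31.181/24
ip addr show ens33
# the gateway should answer
ping -c 3 192.168.31.1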
Then configure the hostname mappings on each host via vim /etc/hosts:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.31.181 linux181
192.168.31.182 linux182
192.168.31.183 linux183
6. Write a distribution script
Create a distribution script under /usr/local/bin:
[root@linux181 software] vim /usr/local/bin/disync.sh
The script:
#!/bin/bash
# 1. Check the argument count
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi
hosts=(linux181 linux182 linux183)
# 2. Iterate over all files/directories given and send each one
for file in $@
do
    # 3. Get the parent directory
    pdir=$(cd -P $(dirname $file); pwd)
    # 4. Get the file name
    fname=$(basename $file)
    # 5. Copy to every machine in the cluster
    for host in ${hosts[@]}
    do
        echo ==================== $host ====================
        rsync -av $pdir/$fname $USER@$host:$pdir
    done
done
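The script must be executable before use; a usage sketch (rsync was installed in step 2; once passwordless SSH from step 7 is in place, no passwords are prompted):
chmod +x /usr/local/bin/disync.sh
# example: distribute the hosts file to all three machines
disync.sh /etc/hosts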
7. Configure passwordless SSH login
Using linux181 as an example.
① Generate the key pair
ssh-keygen -t rsa
Press Enter three times to accept the defaults; the /root/.ssh/ directory will then contain two files: id_rsa (the private key) and id_rsa.pub (the public key).
[root@linux181 software]# cd /root/.ssh/
② Copy the public key to each machine that should allow passwordless login
ssh-copy-id linux181
ssh-copy-id linux182
ssh-copy-id linux183
Under the hood this appends linux181's public key to /root/.ssh/authorized_keys on each target machine, authorizing the login.
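A quick way to verify the setup (a sketch; each command should print the target hostname without asking for a password):
for host in linux181 linux182 linux183; do
    ssh "$host" hostname
done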
8. Install the JDK
The JDK is required on all three hosts.
① Extract on linux181
[root@linux181 software] tar -zxvf /usr/local/software/jdk-8u212-linux-x64.tar.gz -C /usr/local/
② Configure environment variables
# inspect the profile
[root@linux181 software] vim /etc/profile
# /etc/profile
# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc
# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.
As its header comment says, CentOS 7 recommends creating a custom.sh under /etc/profile.d/ for environment variables, then running source /etc/profile to apply them.
Configure the environment variables in /etc/profile.d/custom.sh:
[root@linux181 software] vim /etc/profile.d/custom.sh
The configuration:
# JAVA_HOME
JAVA_HOME=/usr/local/jdk1.8.0_212
PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME
Apply it with source /etc/profile.
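A quick sanity check afterwards (a minimal sketch):
source /etc/profile
java -version     # should report version 1.8.0_212
echo $JAVA_HOME   # should print /usr/local/jdk1.8.0_212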
③ Distribute the JDK to linux182 and linux183 with the distribution script
[root@linux181 software] disync.sh /usr/local/jdk1.8.0_212/
④ Configure the Java environment variables on linux182 and linux183
Distribute /etc/profile.d/custom.sh and run source /etc/profile on linux182 and linux183, as sketched below.
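For example (a sketch reusing the distribution script; the remote shells source /etc/profile explicitly because ssh runs a non-login shell):
disync.sh /etc/profile.d/custom.sh
ssh linux182 "source /etc/profile && java -version"
ssh linux183 "source /etc/profile && java -version"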
9. Reboot
Reboot the virtual machines after completing the steps above.
II. Installing Hadoop
1. Cluster plan
|      | linux181 | linux182 | linux183 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
2. Install Hadoop on linux181
1) Extract the archive
Upload the archive to /usr/local/software, then extract it:
tar -zxvf /usr/local/software/hadoop-3.1.3.tar.gz -C /usr/local/
2) Configuration files
The configuration files live under etc/hadoop/ in the Hadoop installation directory:
[root@linux181 hadoop-3.1.3] cd /usr/local/hadoop-3.1.3/etc/hadoop/
① Core configuration file
Edit core-site.xml:
[root@linux181 hadoop] vim core-site.xml
The configuration:
<configuration>
<!-- Address of the HDFS NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://linux181:9820</value>
</property>
<!-- Directory for files generated by Hadoop at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-3.1.3/data</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
<description>User name used when accessing HDFS from the web UI</description>
</property>
</configuration>
Optional configuration:
<!-- Static user for operating HDFS from the web UI -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>zyx</value>
</property>
<!-- Compatibility settings for Hive, used later -->
<property>
<name>hadoop.proxyuser.zyx.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.zyx.groups</name>
<value>*</value>
</property>
② HDFS configuration file
Edit hdfs-site.xml:
[root@linux181 hadoop] vim hdfs-site.xml
The configuration:
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>linux183:9868</value>
</property>
</configuration>
Optional configuration:
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop-3.1.3/data/namenode</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transaction logs; defaults to ${hadoop.tmp.dir}/dfs/name</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop-3.1.3/data/datanode</value>
<description>Path on the local filesystem where the DataNode stores its blocks; defaults to ${hadoop.tmp.dir}/dfs/data</description>
</property>
③ YARN configuration file
Edit yarn-site.xml:
[root@linux181 hadoop] vim yarn-site.xml
The configuration:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>How reducers fetch data</description>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>linux182</value>
<description>Host of the YARN ResourceManager</description>
</property>
<!-- Environment variables that containers may inherit from the NodeManagers -->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
<description>Environment variables that containers inherit from the NodeManagers; for MapReduce applications, HADOOP_MAPRED_HOME must be added on top of the defaults</description>
</property>
<!-- Disable the physical-memory check so containers are not killed for exceeding memory limits -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Compatibility settings for Hive, used later -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
<description>Minimum memory the ResourceManager allocates per container request (512 MB)</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
<description>Maximum memory the ResourceManager allocates per container request (4 GB)</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Maximum memory a single worker node can allocate to containers (4 GB)</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Virtual-to-physical memory ratio; the default is 2.1, set to 4 here</description>
</property>
<!-- Log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation</description>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://linux181:19888/jobhistory/logs</value>
<description>URL for accessing aggregated logs</description>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
<description>Retain aggregated logs for 7 days</description>
</property>
</configuration>
Log aggregation:
- Concept: after an application finishes, its run logs are uploaded to HDFS.
- Purpose: makes it easy to inspect run details, which helps development and debugging.
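Once aggregation is enabled, a finished application's logs can also be fetched from the command line (a sketch; the application ID comes from the ResourceManager UI or the job output, e.g. the ID of the wordcount run in part III):
yarn logs -applicationId application_1613708282362_0002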
④ MapReduce configuration file
Edit mapred-site.xml:
[root@linux181 hadoop] vim mapred-site.xml
The configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Framework for executing MapReduce jobs: yarn or local</description>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.1.3</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.1.3</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.1.3</value>
</property>
<!-- Job history server -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>linux181:10020</value>
<description>Job history server address</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>linux181:19888</value>
<description>Job history server web address</description>
</property>
</configuration>
⑤ The workers file
Edit the workers file:
[root@linux181 hadoop] vim workers
The contents (one hostname per line, with no trailing spaces or blank lines):
linux181
linux182
linux183
⑥ Edit hadoop-env.sh
Edit the hadoop-env.sh file:
[root@linux181 hadoop] vim hadoop-env.sh
The changes:
# change line 54 (commented out in the stock hadoop-env.sh) to:
export JAVA_HOME=/usr/local/jdk1.8.0_212
# append the following at the end of the file to avoid permission problems caused by running as root
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
3. Modify the Hadoop one-click start scripts
Out of the box, Hadoop 3 refuses to start the cluster as root; attempting it fails with:
Starting resourcemanager
ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.
The one-click scripts must be modified so that root can start the cluster.
① Modify the HDFS scripts
The scripts live in the sbin directory of the Hadoop installation:
[root@linux181 hadoop] cd /usr/local/hadoop-3.1.3/sbin/
Edit start-dfs.sh and stop-dfs.sh in that directory,
adding the following at the top of both files:
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
② Modify the YARN scripts
Edit start-yarn.sh and stop-yarn.sh in the same sbin directory,
adding the following at the top of both files:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
4. Configure environment variables
The method is the same as for the Java environment variables; the resulting custom.sh:
# JAVA_HOME
JAVA_HOME=/usr/local/jdk1.8.0_212
PATH=$PATH:$JAVA_HOME/bin
# HADOOP_HOME
HADOOP_HOME=/usr/local/hadoop-3.1.3
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH JAVA_HOME HADOOP_HOME
Remember to run source /etc/profile afterwards to apply the changes.
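A quick check that the variables work (a minimal sketch):
source /etc/profile
hadoop version   # should report Hadoop 3.1.3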
5. Distribute the files
Distribute the configured Hadoop installation to the other two machines:
[root@linux181 hadoop] disync.sh /usr/local/hadoop-3.1.3/
Afterwards, remember to configure the Hadoop environment variables on linux182 and linux183 as well.
6. Format HDFS
[root@linux181 hadoop-3.1.3] hdfs namenode -format
......
2021-02-19 11:57:55,841 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2021-02-19 11:57:55,841 INFO util.GSet: VM type = 64-bit
2021-02-19 11:57:55,841 INFO util.GSet: 0.029999999329447746% max memory 839.5 MB = 257.9 KB
2021-02-19 11:57:55,841 INFO util.GSet: capacity = 2^15 = 32768 entries
2021-02-19 11:57:55,874 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1150173176-192.168.31.181-1613707075866
# the data directory has been formatted successfully
2021-02-19 11:57:55,889 INFO common.Storage: Storage directory /usr/local/hadoop-3.1.3/data/dfs/name has been successfully formatted.
2021-02-19 11:57:55,921 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop-3.1.3/data/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2021-02-19 11:57:56,005 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop-3.1.3/data/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds .
2021-02-19 11:57:56,023 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2021-02-19 11:57:56,029 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
2021-02-19 11:57:56,030 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at linux181/192.168.31.181
************************************************************/
Note: formatting the NameNode generates a new cluster ID. If the DataNodes still hold the old cluster ID, it no longer matches the NameNode's, and the DataNodes cannot find the NameNode. Therefore, before reformatting the NameNode, always delete the data and logs directories on every node first.
# inspect the NameNode's cluster ID
[root@linux181 current]# cd /usr/local/hadoop-3.1.3/data/dfs/name/current
[root@linux181 current]# cat VERSION
#Fri Feb 19 12:13:40 CST 2021
namespaceID=1274705931
clusterID=CID-33ddf8e2-70e5-4e82-a1dd-4b906d60071f
cTime=1613708020101
storageType=NAME_NODE
blockpoolID=BP-622750313-192.168.31.181-1613708020101
layoutVersion=-64
# inspect the DataNode's cluster ID
[root@linux181 current]# cd /usr/local/hadoop-3.1.3/data/dfs/data/current
[root@linux181 current]# cat VERSION
#Fri Feb 19 12:17:52 CST 2021
storageID=DS-330e42d1-8af5-4063-b2e7-20a3f9c0f423
clusterID=CID-33ddf8e2-70e5-4e82-a1dd-4b906d60071f
cTime=0
datanodeUuid=8c2ef087-efaf-454a-b5a9-d34812fb7af1
storageType=DATA_NODE
layoutVersion=-57
# the NameNode and DataNode clusterIDs match!
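If you ever need to reformat, a sketch of the cleanup described above (destructive: it wipes all HDFS data; paths assume the installation directory used in this article):
# run on linux181; requires passwordless SSH
for host in linux181 linux182 linux183; do
    ssh "$host" "rm -rf /usr/local/hadoop-3.1.3/data /usr/local/hadoop-3.1.3/logs"
done
hdfs namenode -format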
7. Start the Hadoop cluster
Start HDFS:
# on linux181, start the NameNode
hdfs --daemon start namenode
# on every node, start the DataNode
hdfs --daemon start datanode
# on linux183 (matching the configuration), start the SecondaryNameNode
hdfs --daemon start secondarynamenode
Start YARN:
# on linux182, start the ResourceManager
yarn --daemon start resourcemanager
# on every node, start the NodeManager
yarn --daemon start nodemanager
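After starting everything, jps on each node should show the daemons from the cluster plan (a sketch; /etc/profile is sourced because ssh runs a non-login shell):
for host in linux181 linux182 linux183; do
    echo "==== $host ===="
    ssh "$host" "source /etc/profile; jps"
done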
8. Stop the Hadoop cluster
Stop HDFS:
# on linux181, stop the NameNode
hdfs --daemon stop namenode
# on every node, stop the DataNode
hdfs --daemon stop datanode
# on linux183 (matching the configuration), stop the SecondaryNameNode
hdfs --daemon stop secondarynamenode
Stop YARN:
# on linux182, stop the ResourceManager
yarn --daemon stop resourcemanager
# on every node, stop the NodeManager
yarn --daemon stop nodemanager
9. One-click start/stop script
Write a one-click start/stop script for Hadoop under /usr/local/bin:
[root@linux181 current] vim /usr/local/bin/hdp.sh
The script:
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "wrong args number"
    exit
fi
NN_HOST=linux181
RM_HOST=linux182
case $1 in
start)
    echo ========== starting hdfs on $NN_HOST ==========
    ssh $NN_HOST "source /etc/profile ; start-dfs.sh"
    echo ========== starting yarn on $RM_HOST ==========
    ssh $RM_HOST "source /etc/profile ; start-yarn.sh"
    echo ========== starting the timeline server on $RM_HOST ==========
    ssh $RM_HOST "source /etc/profile ; yarn --daemon start timelineserver"
    echo ========== starting the job history server on $NN_HOST ==========
    ssh $NN_HOST "source /etc/profile ; mapred --daemon start historyserver"
    ;;
stop)
    echo ========== stopping yarn on $RM_HOST ==========
    ssh $RM_HOST "source /etc/profile ; stop-yarn.sh"
    echo ========== stopping hdfs on $NN_HOST ==========
    ssh $NN_HOST "source /etc/profile ; stop-dfs.sh"
    echo ========== stopping the job history server on $NN_HOST ==========
    ssh $NN_HOST "source /etc/profile ; mapred --daemon stop historyserver"
    echo ========== stopping the timeline server on $RM_HOST ==========
    ssh $RM_HOST "source /etc/profile ; yarn --daemon stop timelineserver"
    ;;
*)
    echo "usage: hdp.sh start|stop"
    echo "  start    start the hadoop cluster"
    echo "  stop     stop the hadoop cluster"
    ;;
esac
The cluster can now be started and stopped with a single command:
# start the cluster
hdp.sh start
# stop the cluster
hdp.sh stop
III. Testing Hadoop
1. Upload a file
Write a wc.txt file and place it in the local /usr/local/bin/data directory, for example as sketched below.
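For example (a sketch; any whitespace-separated words will do, and the counts in the results below come from the author's own file):
mkdir -p /usr/local/bin/data
cat > /usr/local/bin/data/wc.txt <<'EOF'
hello hadoop flume
hello flink java
json kafak hadoop
EOF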
Then upload it to HDFS:
hadoop fs -mkdir -p /user/root/input
hadoop fs -put /usr/local/bin/data/wc.txt /user/root/input
2. MapReduce test
Run the bundled wordcount example:
[root@linux181 ~] cd $HADOOP_HOME
# note: /user/root/output must not already exist on HDFS, or the job will fail
[root@linux181 hadoop-3.1.3] hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /user/root/input /user/root/output
The output:
......
2021-02-19 13:59:50,402 INFO mapreduce.Job: Running job: job_1613708282362_0002
2021-02-19 13:59:59,575 INFO mapreduce.Job: Job job_1613708282362_0002 running in uber mode : false
2021-02-19 13:59:59,577 INFO mapreduce.Job: map 0% reduce 0%
2021-02-19 14:00:05,682 INFO mapreduce.Job: map 100% reduce 0%
2021-02-19 14:00:09,718 INFO mapreduce.Job: map 100% reduce 100%
2021-02-19 14:00:10,743 INFO mapreduce.Job: Job job_1613708282362_0002 completed successfully
2021-02-19 14:00:10,834 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=89
FILE: Number of bytes written=435875
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=85770
HDFS: Number of bytes written=76
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=7214
Total time spent by all reduces in occupied slots (ms)=4640
Total time spent by all map tasks (ms)=3607
Total time spent by all reduce tasks (ms)=2320
Total vcore-milliseconds taken by all map tasks=3607
Total vcore-milliseconds taken by all reduce tasks=2320
Total megabyte-milliseconds taken by all map tasks=3693568
Total megabyte-milliseconds taken by all reduce tasks=2375680
Map-Reduce Framework
Map input records=6187
Map output records=14450
Map output bytes=143463
Map output materialized bytes=89
Input split bytes=108
Combine input records=14450
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=89
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=472
CPU time spent (ms)=2570
Physical memory (bytes) snapshot=490835968
Virtual memory (bytes) snapshot=5195370496
Total committed heap usage (bytes)=389021696
Peak Map Physical memory (bytes)=293011456
Peak Map Virtual memory (bytes)=2594807808
Peak Reduce Physical memory (bytes)=197824512
Peak Reduce Virtual memory (bytes)=2600562688
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=85662
File Output Format Counters
Bytes Written=76
3. View the results
View the job output:
[root@linux181 hadoop-3.1.3]# hadoop fs -cat /user/root/output/part-r-00000
2021-02-19 14:03:56,890 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
flink 1536
flume 2076
hadoop 2575
hello 2575
java 1536
json 2076
kafak 2076
4. View the logs
The logs can be viewed at:
http://linux181:19888/jobhistory
5. Delete the output directory
hadoop fs -rm -r /user/root/output
This article has walked through installing a Hadoop 3.1.3 cluster on virtual machines, covering hostname setup, static IP configuration, passwordless SSH, JDK installation, environment variables, Hadoop configuration, helper scripts, and the deployment and testing of HDFS and YARN: a one-stop installation and testing guide.