Setting up a Hadoop cluster on virtual machines
Create three virtual machines
- Create three virtual machines in VMware: centos1, centos2, centos3
- Configure each VM's network in /etc/sysconfig/network-scripts/ifcfg-ens32 (shown here for centos2; adjust IPADDR and UUID per machine):
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens32
UUID=98710bbd-6435-43b8-97ae-73532d6fe3b8
DEVICE=ens32
ONBOOT=yes
IPADDR=192.168.8.4
PREFIX=24
GATEWAY=192.168.8.2
DNS1=192.168.8.2
- Set each machine's hostname in /etc/hostname
- Add all nodes to /etc/hosts:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.8.1 rufeng
192.168.8.3 centos1
192.168.8.4 centos2
192.168.8.5 centos3
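After restarting the network service, every node should be reachable by name. A quick check (systemctl and ping are standard CentOS 7 tooling):
systemctl restart network
ping -c 1 centos2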
Cluster utility scripts
cluster_all.sh
#!/bin/bash
# /opt/cluster_bin/cluster_all.sh
# Run the given command on every node listed in /opt/cluster_nodes.
if [ $# -lt 1 ]
then
    echo "Usage: cluster_all.sh <command...>"
    exit 1
fi
command="$*"
# The first space-separated field of each line in /opt/cluster_nodes is a hostname.
hosts=($(cut -d " " -f 1 /opt/cluster_nodes))
for host in "${hosts[@]}"
do
    echo "========== $host ============"
    ssh "$host" "$command"
done
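For example, to list the running Java processes on every node:
cluster_all.sh jps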
The file /opt/cluster_nodes holds the hostname of every node, one per line.
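A minimal /opt/cluster_nodes for this cluster (the scripts read only the first space-separated field of each line, so anything after the hostname is ignored):
centos1
centos2
centos3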
File synchronization script xsync.sh
#!/bin/bash
# /opt/cluster_bin/xsync.sh
# Sync the given files/directories to the same absolute path on every node.
if [ $# -lt 1 ]
then
    echo "Usage: xsync.sh <file or directory>..."
    exit 1
fi
hosts=($(cut -d " " -f 1 /opt/cluster_nodes))
for host in "${hosts[@]}"
do
    echo "=========== $host ============"
    for file in "$@"
    do
        if [ ! -e "$file" ]
        then
            echo "file not exists: $file"
            break
        fi
        # Resolve the absolute parent directory (-P follows symlinks)
        # and the base name, so the remote path matches the local one.
        if [ -d "$file" ]
        then
            cur_dir=$(cd -P "$file"; pwd)
            pdir=$(cd -P "$(dirname "$cur_dir")"; pwd)
            fname=$(basename "$cur_dir")
        else
            pdir=$(cd -P "$(dirname "$file")"; pwd)
            fname=$(basename "$file")
        fi
        # Make sure the parent directory exists on the remote host.
        if [ "$pdir" != "/" ]
        then
            ssh "$host" "mkdir -p $pdir"
        fi
        # --delete keeps the remote copy an exact mirror of the local one.
        rsync --delete -av "$pdir/$fname" "$host:$pdir"
    done
done
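For example, after changing the Hadoop configuration on one node, push the whole directory to every node:
xsync.sh $HADOOP_HOME/etc/hadoop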
Configure passwordless SSH between the VMs
- Set up mutual SSH trust across the cluster for the root user
- Use the cluster scripts to create a bigdata user on every node, then set up mutual SSH trust for bigdata as well (a sketch follows)
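A minimal sketch of the key exchange, run once per user (root and bigdata) on every node; ssh-keygen and ssh-copy-id are standard OpenSSH tools:
ssh-keygen -t rsa
for host in centos1 centos2 centos3
do
    ssh-copy-id $host
done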
Install the JDK and Hadoop and configure environment variables
# /etc/profile.d/bigdata.sh
export JAVA_HOME=/opt/jdk8
export SPARK_HOME=/opt/spark3.0
export HADOOP_HOME=/opt/hadoop3.2
export PATH=$PATH:/opt/cluster_bin:$JAVA_HOME/bin
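A possible install sequence for these paths (archive names and versions are assumptions; substitute whatever you actually downloaded):
tar -zxf jdk-8u301-linux-x64.tar.gz -C /opt      # assumed JDK archive name
mv /opt/jdk1.8.0_301 /opt/jdk8
tar -zxf hadoop-3.2.2.tar.gz -C /opt             # assumed Hadoop archive name
mv /opt/hadoop-3.2.2 /opt/hadoop3.2
xsync.sh /opt/jdk8 /opt/hadoop3.2 /etc/profile.d/bigdata.sh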
Cluster layout
| | centos1 | centos2 | centos3 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
Configuration files
All of the following files live in $HADOOP_HOME/etc/hadoop.
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://centos1:8020</value>
    </property>
    <!-- Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop3.2/data</value>
    </property>
    <!-- Static user for the HDFS web UI -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>bigdata</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>centos1:9870</value>
    </property>
    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>centos3:9868</value>
    </property>
</configuration>
yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Run the MapReduce shuffle as an auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>centos2</value>
    </property>
    <!-- Environment variables inherited by containers -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>yarn.application.classpath</name>
        <value>/opt/hadoop3.2/etc/hadoop:/opt/hadoop3.2/share/hadoop/common/lib/*:/opt/hadoop3.2/share/hadoop/common/*:/opt/hadoop3.2/share/hadoop/hdfs:/opt/hadoop3.2/share/hadoop/hdfs/lib/*:/opt/hadoop3.2/share/hadoop/hdfs/*:/opt/hadoop3.2/share/hadoop/mapreduce/lib/*:/opt/hadoop3.2/share/hadoop/mapreduce/*:/opt/hadoop3.2/share/hadoop/yarn:/opt/hadoop3.2/share/hadoop/yarn/lib/*:/opt/hadoop3.2/share/hadoop/yarn/*</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Log aggregation server URL -->
    <property>
        <name>yarn.log.server.url</name>
        <value>http://centos1:19888/jobhistory/logs</value>
    </property>
    <!-- Retain aggregated logs for 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>
The value of yarn.application.classpath is simply the output of the hadoop classpath command.
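If Hadoop is installed at a different path, regenerate the value on any node and paste the output in:
$HADOOP_HOME/bin/hadoop classpath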
mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Run MapReduce jobs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- JobHistory server RPC address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>centos1:10020</value>
    </property>
    <!-- JobHistory server web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>centos1:19888</value>
    </property>
</configuration>
workers
centos1
centos2
centos3
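Note: the workers file must contain exactly one hostname per line, with no trailing spaces or blank lines; otherwise the start scripts try to ssh to an empty or malformed hostname.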
Starting and testing the cluster
- Distribute the configuration files to every node (e.g. xsync.sh $HADOOP_HOME/etc/hadoop)
- On centos1, format the NameNode (needed only before the first startup): $HADOOP_HOME/bin/hdfs namenode -format
- On centos1, start HDFS: $HADOOP_HOME/sbin/start-dfs.sh
- On centos2, start YARN: $HADOOP_HOME/sbin/start-yarn.sh
- On centos1, start the JobHistory server: $HADOOP_HOME/bin/mapred --daemon start historyserver
Note: formatting the NameNode generates a new cluster ID. If the DataNodes still hold the old one, the NameNode and DataNode cluster IDs no longer match and the cluster cannot find its existing data. If the cluster breaks at runtime and the NameNode has to be re-formatted, first stop the namenode and datanode processes, delete the data and logs directories on every machine, and only then format again.
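With the scripts from these notes, a safe re-format might look like this (destructive: it wipes all HDFS data; myhadoop.sh is the start/stop script defined below):
myhadoop.sh stop
cluster_all.sh rm -rf /opt/hadoop3.2/data /opt/hadoop3.2/logs
$HADOOP_HOME/bin/hdfs namenode -format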
Test the cluster: $HADOOP_HOME/bin/hadoop jar <jar file> <main class> <input path> <output path>
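For example, the wordcount job from the bundled examples jar (the jar path follows the standard Hadoop tarball layout; any small text file works as input, and /input and /output are HDFS paths):
$HADOOP_HOME/bin/hadoop fs -mkdir -p /input
$HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/README.txt /input
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output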
Cluster start/stop script
#!/bin/bash
# /opt/cluster_bin/myhadoop.sh
# Start or stop the whole Hadoop cluster from any node.
if [ $# -ne 1 ]
then
    echo "Usage: myhadoop.sh start|stop"
    exit 1
fi
if [ -z "$HADOOP_HOME" ]
then
    echo "HADOOP_HOME not set!"
    exit 1
fi
namenode="centos1"
resourcenode="centos2"
case $1 in
"start")
    echo "start hadoop cluster"
    echo "start hdfs on $namenode"
    ssh "$namenode" "${HADOOP_HOME}/sbin/start-dfs.sh"
    echo "start yarn on $resourcenode"
    ssh "$resourcenode" "${HADOOP_HOME}/sbin/start-yarn.sh"
    echo "start historyserver"
    ssh "$namenode" "${HADOOP_HOME}/bin/mapred --daemon start historyserver"
    ;;
"stop")
    echo "stop hadoop cluster"
    echo "stop hdfs on $namenode"
    ssh "$namenode" "${HADOOP_HOME}/sbin/stop-dfs.sh"
    echo "stop yarn on $resourcenode"
    ssh "$resourcenode" "${HADOOP_HOME}/sbin/stop-yarn.sh"
    echo "stop historyserver"
    ssh "$namenode" "${HADOOP_HOME}/bin/mapred --daemon stop historyserver"
    ;;
*)
    echo "Usage: myhadoop.sh start|stop"
    ;;
esac
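Usage, plus a quick check that each node is running the expected daemons (jps lists JVM processes; cluster_all.sh is the script from above):
myhadoop.sh start
cluster_all.sh jps
myhadoop.sh stop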
Common port numbers
Name | Hadoop 2.x | Hadoop 3.x |
---|---|---|
NameNode internal RPC | 8020/9000 | 8020/9000/9820 |
NameNode HTTP UI | 50070 | 9870 |
YARN web UI (job monitoring) | 8088 | 8088 |
JobHistory server web UI | 19888 | 19888 |