Building a Hadoop Cluster with Docker
Summary:
- This tutorial uses Docker to build a base image containing Hadoop and Spark; once you have the image, you can start the cluster containers from it directly.
- Versions used in this tutorial:
- Debian 12
- Hadoop 3.3.6
- Spark 3.3.0
- JDK 8u431
- Image download link: Baidu Netdisk, extraction code: 1ii9. If the file has expired, copy the image from a classmate who has already downloaded it.
- Update the package index and upgrade installed packages
apt-get update && apt-get upgrade
- Install the required tools
apt install --reinstall ca-certificates
apt install net-tools iputils-ping vim apt-transport-https openssh-server sudo rsync
- Switch to the USTC mirror
vim /etc/apt/sources.list
deb https://mirrors.ustc.edu.cn/debian/ bookworm main non-free non-free-firmware contrib
deb-src https://mirrors.ustc.edu.cn/debian/ bookworm main non-free non-free-firmware contrib
deb https://mirrors.ustc.edu.cn/debian-security/ bookworm-security main
deb-src https://mirrors.ustc.edu.cn/debian-security/ bookworm-security main
deb https://mirrors.ustc.edu.cn/debian/ bookworm-updates main non-free non-free-firmware contrib
deb-src https://mirrors.ustc.edu.cn/debian/ bookworm-updates main non-free non-free-firmware contrib
deb https://mirrors.ustc.edu.cn/debian/ bookworm-backports main non-free non-free-firmware contrib
deb-src https://mirrors.ustc.edu.cn/debian/ bookworm-backports main non-free non-free-firmware contrib
apt-get update && apt-get upgrade
- Create the hadoop user
adduser hadoop                  # add the hadoop account
usermod -a -G hadoop hadoop     # add it to the hadoop group
cat /etc/group | grep hadoop    # check the result
vim /etc/sudoers                # grant root privileges
- Add the following line to the file:
hadoop ALL=(root) NOPASSWD:ALL
- Set a password for the root user
passwd root
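To confirm that the sudoers entry works, you can switch to the hadoop user and run one sudo command; this is just a sanity check, not part of the original steps:
su - hadoop
sudo whoami    # should print "root" without asking for a password
exit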
- Install the JDK, Hadoop, and Spark
mkdir /opt/servers /opt/software
# upload the three archives into /opt/software, cd into it, then extract:
tar -zxvf jdk-8u431-linux-x64.tar.gz -C /opt/servers/
tar -zxvf hadoop-3.3.6.tar.gz -C /opt/servers/
tar -zxvf spark-3.3.0-bin-hadoop3.tgz -C /opt/servers/
- Go into /opt/servers and rename the extracted directories to jdk, hadoop, and spark (a sketch of this step follows).
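The rename might look like the following; the JDK directory name jdk1.8.0_431 is an assumption, so check the actual name produced by your tarball:
cd /opt/servers
mv jdk1.8.0_431 jdk                 # adjust to the real extracted JDK directory name
mv hadoop-3.3.6 hadoop
mv spark-3.3.0-bin-hadoop3 spark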
- Configure the environment variables, and remember to source the file afterwards
vim /etc/profile
export JAVA_HOME=/opt/servers/jdk
export CLASSPATH=.:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/opt/servers/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export SPARK_HOME=/opt/servers/spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PATH=~/bin:$PATH
- Add the following to ~/.bashrc so that sshd starts and /etc/profile is loaded in every new shell
vim ~/.bashrc
service ssh start > /dev/null 2>&1
source /etc/profile
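After sourcing, a quick way to check that all three tools are on the PATH (purely a sanity check, not part of the original steps):
source /etc/profile
java -version
hadoop version
spark-submit --version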
- Configure Hadoop (all of the files in this step live in Hadoop's configuration directory; see the note after this step)
- hadoop-env.sh
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export JAVA_HOME=/opt/servers/jdk
- yarn-env.sh
export JAVA_HOME=/opt/servers/jdk
- core-site.xml
<configuration>
  <!-- URI of the Hadoop file system; master is the master node's hostname, 9000 is the port -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Hadoop temporary directory -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/servers/hadoop/tmp</value>
  </property>
  <!-- Log directory for the Hadoop cluster -->
  <property>
    <name>hadoop.log.dir</name>
    <value>/opt/servers/hadoop/log</value>
  </property>
  <!-- Static user used for HDFS web UI access: root -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
  </property>
</configuration>
- hdfs-site.xml
<configuration>
  <!-- Default replication factor for files -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Local directory where the NameNode stores its metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/servers/hadoop/hdfs/name</value>
  </property>
  <!-- Local directory where DataNodes store data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/servers/hadoop/hdfs/data</value>
  </property>
  <!-- Default HDFS block size -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB -->
  </property>
  <!-- NameNode web UI address -->
  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:9870</value>
  </property>
  <!-- SecondaryNameNode web UI address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>slave1:9868</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
- mapred-site.xml
<configuration>
  <!-- MapReduce framework name, normally yarn -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- JobHistory Server address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>jobhistory:10020</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/servers/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/servers/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/servers/hadoop</value>
  </property>
</configuration>
- yarn-site.xml
<configuration>
  <!-- Auxiliary service required for the MapReduce shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Hostname of the YARN ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
- workers (one hostname per line)
master
slave1
slave2
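All of the files above sit in Hadoop's configuration directory, which with this layout is /opt/servers/hadoop/etc/hadoop; a minimal sketch of opening them for editing:
cd /opt/servers/hadoop/etc/hadoop
ls    # hadoop-env.sh, yarn-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers, ...
vim core-site.xml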
- Configure Spark
- Edit spark-env.sh
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
export JAVA_HOME=/opt/servers/jdk
SPARK_MASTER_HOST=master
SPARK_MASTER_PORT=7077
- Edit workers
cp workers.template workers
vim workers
master
slave1
slave2
- After obtaining the container image, first load it:
docker load -i hadoopbase.tar
- Then start the three containers (they join a user-defined network named cluster; see the note after this step):
docker run -itd --name master -h master --network cluster -p 8080:8080 -p 9870:9870 -p 4040:4040 -p 7077:7077 -p 2230:22 hadoopbase
docker run -itd --name slave1 -h slave1 --network cluster -p 2231:22 hadoopbase
docker run -itd --name slave2 -h slave2 --network cluster -p 2232:22 hadoopbase
- Connect to the three machines with an SSH client on ports 2230, 2231, and 2232 respectively; the IP is 127.0.0.1, the user is root, and the password is 000.
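The docker run commands above attach the containers to a network called cluster. If that network does not already exist on your host, it can be created first; a minimal sketch using the network name taken from the commands above:
docker network create cluster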
- Set up passwordless SSH (if it fails, retry until it works; it must succeed)
# on master, slave1, and slave2
ssh-keygen -t rsa
# on master, collect the public keys of master, slave1, and slave2
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh slave1 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh slave2 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# on slave1 and slave2, copy master's authorized_keys file
ssh master cat ~/.ssh/authorized_keys >> ~/.ssh/authorized_keys
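A quick way to confirm passwordless login from master (hostnames taken from the setup above); each command should print a hostname without prompting for a password:
for host in master slave1 slave2; do ssh $host hostname; done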
- Write the helper scripts (cd to the home directory, mkdir bin, and create the scripts below inside bin)
vim mjps
#!/bin/bash
for host in master slave1 slave2
do
    echo =============== $host ===============
    ssh $host jps
done
vim mrsync
#!/bin/bash
# argument check
if [ $# -lt 1 ]
then
    echo 'No arguments given!'
    exit
fi
# loop over the cluster machines and distribute the content to each
for host in slave1 slave2
do
    echo "===============$host==============="
    # distribute each argument in turn
    for file in $@
    do
        # check whether the current file exists
        if [ -e $file ]
        then
            # it exists
            # 1. get the directory the file lives in
            pdir=$(cd -P $(dirname $file); pwd)
            # 2. get the file name
            fname=$(basename $file)
            # 3. log in to the target machine and create the same directory structure
            ssh $host "mkdir -p $pdir"
            # 4. distribute the file or directory
            rsync -avh $pdir/$fname $host:$pdir
        else
            # it does not exist
            echo "$file does not exist"
            exit
        fi
    done
done
vim mhadoop
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No arguments given!"
    exit 1
fi
case $1 in
"start")
    echo "=========== starting the hadoop cluster ==========="
    $HADOOP_HOME/sbin/start-all.sh
    echo "=========== starting the spark cluster ==========="
    $SPARK_HOME/sbin/start-all.sh
;;
"stop")
    echo "=========== stopping the spark cluster ==========="
    $SPARK_HOME/sbin/stop-all.sh
    echo "=========== stopping the hadoop cluster ==========="
    $HADOOP_HOME/sbin/stop-all.sh
;;
*)
    echo "Invalid argument!"
;;
esac
chmod 777 ~/bin/*
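For instance, if you later change the Hadoop or Spark configuration on master only, mrsync can push the changed files (and the scripts themselves) to the slaves; a usage sketch with paths from this tutorial:
mrsync /etc/profile ~/bin /opt/servers/hadoop/etc/hadoop /opt/servers/spark/conf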
- Start the cluster
hdfs namenode -format
mhadoop start
mjps
- Open "127.0.0.1:9870" in a browser to check the Hadoop cluster status.
- Open "127.0.0.1:8080" in a browser to check the Spark cluster status.
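Once the web UIs are reachable, a simple smoke test could be to run the bundled example jobs; the jar paths below follow the standard layout of the Hadoop 3.3.6 and Spark 3.3.0 distributions (the Spark examples jar name assumes the default Scala 2.12 build):
hadoop jar /opt/servers/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 10
spark-submit --master spark://master:7077 --class org.apache.spark.examples.SparkPi /opt/servers/spark/examples/jars/spark-examples_2.12-3.3.0.jar 10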