Building a Spark on YARN Cluster
This guide uses Hadoop 3.2.2 and Spark 3.1.2 as the example versions.
Unless otherwise specified, run the commands below on every node.
I. System Resources and Component Planning
Node Role | Hostname | CPU/Memory | NIC | Disk | IP Address | OS |
---|---|---|---|---|---|---|
NameNode | namenode | 2C/4G | ens33 | 128G | 192.168.0.11 | CentOS7 |
Secondary NameNode | secondarynamenode | 2C/4G | ens33 | 128G | 192.168.0.12 | CentOS7 |
ResourceManager | resourcemanager | 2C/4G | ens33 | 128G | 192.168.0.13 | CentOS7 |
Spark | spark | 2C/4G | ens33 | 128G | 192.168.0.14 | CentOS7 |
Worker1 | worker1 | 2C/4G | ens33 | 128G | 192.168.0.21 | CentOS7 |
Worker2 | worker2 | 2C/4G | ens33 | 128G | 192.168.0.22 | CentOS7 |
Worker3 | worker3 | 2C/4G | ens33 | 128G | 192.168.0.23 | CentOS7 |
II. System Software Installation and Configuration
1. Install basic packages
yum -y install vim lrzsz bash-completion
2. Configure name resolution
echo 192.168.0.11 namenode >> /etc/hosts
echo 192.168.0.12 secondarynamenode >> /etc/hosts
echo 192.168.0.13 resourcemanager >> /etc/hosts
echo 192.168.0.14 spark >> /etc/hosts
echo 192.168.0.21 worker1 >> /etc/hosts
echo 192.168.0.22 worker2 >> /etc/hosts
echo 192.168.0.23 worker3 >> /etc/hosts
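Optionally, a quick sanity check (not in the original steps) that every planned hostname resolves on the current node; getent reads /etc/hosts:
# print the address resolved for each planned hostname
for host in namenode secondarynamenode resourcemanager spark worker1 worker2 worker3; do getent hosts $host; done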
3. Configure NTP
yum -y install chrony
systemctl start chronyd
systemctl enable chronyd
systemctl status chronyd
chronyc sources
4. Disable SELinux and the firewall
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
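To confirm the changes took effect, a quick check such as the following can be run on each node:
getenforce                       # should print Permissive (Disabled after a reboot)
systemctl is-active firewalld    # should print inactive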
III. Building the Spark on YARN Cluster
1. Set up passwordless SSH
On the NameNode and ResourceManager nodes, set up passwordless SSH to the NameNode, Secondary NameNode, ResourceManager, and Worker nodes:
ssh-keygen -t rsa
for host in namenode secondarynamenode resourcemanager worker1 worker2 worker3; do ssh-copy-id -i ~/.ssh/id_rsa.pub $host; done
On the Spark node, set up passwordless SSH to the Spark and Worker nodes:
ssh-keygen -t rsa
for host in spark worker1 worker2 worker3; do ssh-copy-id -i ~/.ssh/id_rsa.pub $host; done
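As a quick verification (not part of the original steps), a non-interactive remote command can be run against every target host; adjust the host list to match the loop used on the current node, and any password prompt means the key was not copied correctly:
# BatchMode makes ssh fail instead of prompting when key authentication is missing
for host in worker1 worker2 worker3; do ssh -o BatchMode=yes $host hostname; done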
2. Install the JDK
Download the JDK:
Reference: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
Extract the JDK archive:
tar -xf /root/jdk-8u291-linux-x64.tar.gz -C /usr/local/
Set the environment variables for the current shell:
export JAVA_HOME=/usr/local/jdk1.8.0_291/
export PATH=$PATH:/usr/local/jdk1.8.0_291/bin/
Add the environment variables to /etc/profile:
export JAVA_HOME=/usr/local/jdk1.8.0_291/
PATH=$PATH:/usr/local/jdk1.8.0_291/bin/
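One way to append these lines to /etc/profile and apply them in the current shell (a sketch; the paths match the JDK extracted above):
# quote EOF so that $PATH is written literally into /etc/profile
cat >> /etc/profile << 'EOF'
export JAVA_HOME=/usr/local/jdk1.8.0_291/
PATH=$PATH:/usr/local/jdk1.8.0_291/bin/
EOF
source /etc/profile
The same pattern applies to the Scala, Hadoop, and Spark PATH entries added later.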
Check the Java version:
java -version
3. Install Scala
Install Scala on the Spark and Worker nodes.
Download Scala:
Reference: https://www.scala-lang.org/download/all.html
Extract the Scala archive:
tar -xf /root/scala3-3.0.1.tar.gz -C /usr/local/
Set the environment variable for the current shell:
export PATH=$PATH:/usr/local/scala3-3.0.1/bin/
Add the environment variable to /etc/profile:
PATH=$PATH:/usr/local/scala3-3.0.1/bin/
Check the Scala version:
scala -version
4. Install Hadoop
Download Hadoop:
Reference: https://hadoop.apache.org/releases.html
Extract the Hadoop archive:
tar -xf /root/hadoop-3.2.2.tar.gz -C /usr/local/
Set the environment variable for the current shell:
export PATH=$PATH:/usr/local/hadoop-3.2.2/bin/:/usr/local/hadoop-3.2.2/sbin/
Add the environment variable to /etc/profile:
PATH=$PATH:/usr/local/hadoop-3.2.2/bin/:/usr/local/hadoop-3.2.2/sbin/
Check the Hadoop version:
hadoop version
5. Configure the Hadoop cluster
Edit core-site.xml:
cat > /usr/local/hadoop-3.2.2/etc/hadoop/core-site.xml << EOF
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:9000</value>
</property>
</configuration>
EOF
Edit hdfs-site.xml:
cat > /usr/local/hadoop-3.2.2/etc/hadoop/hdfs-site.xml << EOF
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>secondarynamenode:50090</value>
</property>
</configuration>
EOF
Edit yarn-site.xml:
cat > /usr/local/hadoop-3.2.2/etc/hadoop/yarn-site.xml << EOF
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resourcemanager:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resourcemanager:8031</value>
</property>
</configuration>
EOF
Edit mapred-site.xml:
cat > /usr/local/hadoop-3.2.2/etc/hadoop/mapred-site.xml << EOF
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
EOF
Edit hadoop-env.sh and set JAVA_HOME:
vim /usr/local/hadoop-3.2.2/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_291/
Edit the workers file to list the Worker nodes:
echo worker1 > /usr/local/hadoop-3.2.2/etc/hadoop/workers
echo worker2 >> /usr/local/hadoop-3.2.2/etc/hadoop/workers
echo worker3 >> /usr/local/hadoop-3.2.2/etc/hadoop/workers
Edit start-dfs.sh and specify the users that start the HDFS daemons:
vim /usr/local/hadoop-3.2.2/sbin/start-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Edit stop-dfs.sh and specify the same users:
vim /usr/local/hadoop-3.2.2/sbin/stop-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Edit start-yarn.sh and specify the users that start the YARN daemons:
vim /usr/local/hadoop-3.2.2/sbin/start-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Edit stop-yarn.sh and specify the same users:
vim /usr/local/hadoop-3.2.2/sbin/stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
6. Start the Hadoop cluster
On the NameNode node, format the NameNode:
hdfs namenode -format
On the NameNode node, start HDFS:
start-dfs.sh
On the ResourceManager node, start YARN:
start-yarn.sh
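Once both scripts finish, the running daemons can be checked with jps on each node; roughly the following processes should appear (a sanity check, not part of the original steps):
jps
# NameNode node:            NameNode
# Secondary NameNode node:  SecondaryNameNode
# ResourceManager node:     ResourceManager
# Worker nodes:             DataNode and NodeManager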
7. Access the Hadoop web UIs
NameNode UI:
http://192.168.0.11:9870
Secondary NameNode UI:
http://192.168.0.12:50090
ResourceManager UI:
http://192.168.0.13:8088
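The cluster state can also be checked from the command line on any node that has the Hadoop binaries on its PATH:
hdfs dfsadmin -report    # should report 3 live DataNodes
yarn node -list          # should list 3 running NodeManagers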
8. Install Spark
Install Spark on the Spark and Worker nodes.
Download Spark:
Reference: http://spark.apache.org/downloads.html
Extract the Spark archive:
tar -zxf /root/spark-3.1.2-bin-hadoop3.2.tgz -C /usr/local/
Set the environment variable for the current shell:
export PATH=$PATH:/usr/local/spark-3.1.2-bin-hadoop3.2/bin/:/usr/local/spark-3.1.2-bin-hadoop3.2/sbin/
Add the environment variable to /etc/profile:
PATH=$PATH:/usr/local/spark-3.1.2-bin-hadoop3.2/bin/:/usr/local/spark-3.1.2-bin-hadoop3.2/sbin/
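A quick check that the Spark binaries are on the PATH:
spark-submit --version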
9. Configure the Spark on YARN cluster
Configure Spark on the Spark and Worker nodes.
Create spark-env.sh:
cat > /usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh << EOF
JAVA_HOME=/usr/local/jdk1.8.0_291/
SCALA_HOME=/usr/local/scala3-3.0.1/
HADOOP_HOME=/usr/local/hadoop-3.2.2/
HADOOP_CONF_DIR=/usr/local/hadoop-3.2.2/etc/hadoop/
YARN_CONF_DIR=/usr/local/hadoop-3.2.2/etc/hadoop/
SPARK_MASTER_HOST=spark
SPARK_MASTER_PORT=7077
SPARK_HOME=/usr/local/spark-3.1.2-bin-hadoop3.2/
SPARK_LOCAL_DIRS=/usr/local/spark-3.1.2-bin-hadoop3.2/
SPARK_LIBRARY_PATH=/usr/local/jdk1.8.0_291/lib/:/usr/local/hadoop-3.2.2/lib/native/
EOF
Create the workers file to list the Worker nodes:
cat > /usr/local/spark-3.1.2-bin-hadoop3.2/conf/workers << EOF
worker1
worker2
worker3
EOF
10. Start the Spark cluster
On the Spark node, start the Spark Master:
start-master.sh
On each Worker node, start a Spark Worker:
start-worker.sh spark://spark:7077
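If startup succeeded, jps should show a Master process on the Spark node and a Worker process on each Worker node, and the three workers should appear as ALIVE in the Master web UI described in the next step:
jps
# Spark node:   Master
# Worker nodes: Worker (alongside the DataNode and NodeManager started earlier)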
11. Access the Spark web UIs
Master UI:
http://192.168.0.14:8080
Worker UI (Worker1 shown as an example):
http://192.168.0.21:8081
12. Verify the Spark on YARN cluster
On the Spark node, submit a test job:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn /usr/local/spark-3.1.2-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.1.2.jar 10
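With --master yarn and the default client deploy mode, the line "Pi is roughly 3.14..." is printed in the driver output on the Spark node. The application can also be inspected through YARN; <application_id> below is a placeholder for the ID shown by the first command, and the log command only works if YARN log aggregation is enabled, which this guide does not configure:
yarn application -list -appStates FINISHED
yarn logs -applicationId <application_id>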
Open the ResourceManager UI to check the submitted job:
http://192.168.0.13:8088
13. Stop the Spark cluster
On each Worker node, stop the Spark Worker:
stop-worker.sh
On the Spark node, stop the Spark Master:
stop-master.sh
14. Stop the Hadoop cluster
On the ResourceManager node, stop YARN:
stop-yarn.sh
On the NameNode node, stop HDFS:
stop-dfs.sh