1. Preparation
(1) It is best to get the Hadoop trio (HDFS, MapReduce, YARN) working first. For details you can refer to an earlier article of mine (it did not pass review, so only the author can see it):
黑马程序员 (itheima) course: installing and configuring the Hadoop trio (HDFS, MapReduce, YARN) and installing and configuring Hive - CSDN blog
(2) Download spark-3.4.4-bin-hadoop3.tgz ahead of time (I downloaded it with NDM).
Download link: apache-spark 3.4.4 installation package download - Alibaba Cloud open-source mirror site
(3) How to get the NDM downloader: "60 MB/s downloads, saturates your bandwidth! Latest NDM Chinese portable edition, with a detailed install and usage tutorial, a free alternative to IDM" - bilibili
The video covers it in detail.
2. Configuring Spark
(1) Open FinalShell and upload the spark-3.4.4-bin-hadoop3.tgz package to the /export/server directory on node1. (On node1 you should be logged in as the hadoop user at this point; node2 and node3 can simply stay on the root account throughout the whole configuration.)
su - hadoop
cd /export/server
rz
(2) Extract the spark-3.4.4-bin-hadoop3.tgz archive
tar -zxf spark-3.4.4-bin-hadoop3.tgz -C /export/server/
The effect of (1) and (2) is as follows (an optional integrity check is sketched after the listing):
[root@node1 ~]# su - hadoop
上一次登录:三 11月 20 09:13:46 CST 2024pts/0 上
[hadoop@node1 ~]$ cd /export/server
[hadoop@node1 server]$ rz
[hadoop@node1 server]$ tar -zxf spark-3.4.4-bin-hadoop3.tgz -C /export/server/
[hadoop@node1 server]$ ll
total 380148
drwxrwxr-x 11 hadoop hadoop 196 Nov 3 17:00 apache-hive-3.1.3-bin
-r-------- 1 hadoop hadoop 84 Jan 18 2018 dept.txt
drwxr-xr-x 5 root root 48 Nov 14 09:40 dockerkafka
-r-------- 1 hadoop hadoop 579 Jan 18 2018 emp.txt
-rw-r--r-- 1 root root 695 Nov 19 21:29 flink.yml
-rw-rw-r-- 1 hadoop hadoop 251 Nov 14 16:12 goods
lrwxrwxrwx 1 hadoop hadoop 27 Nov 3 15:32 hadoop -> /export/server/hadoop-3.3.4
drwxrwxr-x 11 hadoop hadoop 227 Nov 3 15:48 hadoop-3.3.4
drwxrwxr-x 8 hadoop hadoop 176 Nov 9 21:45 hbase-2.5.10
lrwxrwxrwx 1 hadoop hadoop 36 Nov 3 16:52 hive -> /export/server/apache-hive-3.1.3-bin
lrwxrwxrwx. 1 hadoop hadoop 27 Oct 21 17:46 jdk -> /export/server/jdk1.8.0_212
drwxr-xr-x. 7 hadoop hadoop 245 Apr 2 2019 jdk1.8.0_212
-rw-r--r-- 1 root root 2925 Nov 14 09:41 kafka.yml
drwxrwxr-x 2 hadoop hadoop 43 Nov 14 15:55 out
-rw-r--r-- 1 root root 51 Nov 14 15:19 p1.txt
drwxr-xr-x 8 root root 78 Nov 23 14:34 redis-cluster
-r-------- 1 hadoop hadoop 64 Jan 18 2018 salgrade.txt
drwxr-xr-x 13 hadoop hadoop 211 Oct 21 10:29 spark-3.4.4-bin-hadoop3
-r-------- 1 hadoop hadoop 388988563 Nov 24 21:55 spark-3.4.4-bin-hadoop3.tgz
-rw-r--r-- 1 root root 2140 Nov 23 14:48 start_redis.yml
-r-------- 1 hadoop hadoop 48 Nov 4 11:04 test.txt
-r-------- 1 hadoop hadoop 60222 Nov 3 2020 train.csv
-r-------- 1 hadoop hadoop 279 Nov 4 11:19 wordcount.hql
-r-------- 1 root root 88190 Jul 20 2017 XX.txt
-r-------- 1 root root 88380 Jul 20 2017 YY.txt
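If you suspect the upload was corrupted (the archive here is about 370 MB), you can verify it against the SHA-512 checksum published on the Apache Spark download page; a small sketch, not part of the original steps:
sha512sum /export/server/spark-3.4.4-bin-hadoop3.tgz
# compare the printed hash with the .sha512 value from the Apache download page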
(3) Go to the /export/server/spark-3.4.4-bin-hadoop3/conf directory
cd /export/server/spark-3.4.4-bin-hadoop3/conf
(4) In the conf directory, copy workers.template: cp workers.template workers
Edit workers: delete the localhost entry, then add:
node2
node3
cp workers.template workers
vi workers
node2
node3
(5) In the conf directory, copy spark-defaults.conf.template: cp spark-defaults.conf.template spark-defaults.conf
Edit spark-defaults.conf and add the following to the file (see the history-server note after this block):
spark.master spark://node1:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1:8020/spark-logs
spark.history.fs.logDirectory hdfs://node1:8020/spark-logs
cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf
spark.master spark://node1:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1:8020/spark-logs
spark.history.fs.logDirectory hdfs://node1:8020/spark-logs
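A note beyond the original steps: the spark.eventLog.* and spark.history.fs.logDirectory settings above only become useful once the Spark history server is running. A minimal sketch, assuming you start it on node1 as the hadoop user after /spark-logs has been created in HDFS (step (10) below):
/export/server/spark-3.4.4-bin-hadoop3/sbin/start-history-server.sh
# the history server web UI then listens on its default port: http://node1:18080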
(6) In the conf directory, copy spark-env.sh.template: cp spark-env.sh.template spark-env.sh
Edit spark-env.sh and add the following to the file:
JAVA_HOME=/export/server/jdk1.8.0_212
HADOOP_CONF_DIR=/export/server/hadoop-3.3.4/etc/hadoop
SPARK_MASTER_IP=node1
SPARK_MASTER_PORT=7077
SPARK_WORKER_MEMORY=512m
SPARK_WORKER_CORES=1
SPARK_EXECUTOR_MEMORY=512m
SPARK_EXECUTOR_CORES=1
SPARK_WORKER_INSTANCES=1
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
JAVA_HOME=/export/server/jdk1.8.0_212
HADOOP_CONF_DIR=/export/server/hadoop-3.3.4/etc/hadoop
SPARK_MASTER_IP=node1
SPARK_MASTER_PORT=7077
SPARK_WORKER_MEMORY=512m
SPARK_WORKER_CORES=1
SPARK_EXECUTOR_MEMORY=512m
SPARK_EXECUTOR_CORES=1
SPARK_WORKER_INSTANCES=1
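A hedged side note: SPARK_MASTER_IP is the pre-2.0 name for this setting, and current Spark documentation lists SPARK_MASTER_HOST instead. Because start-all.sh is run on node1 itself, the master comes up on node1 either way, but if you want to follow the current naming you can additionally add this line (my addition, not from the original guide):
SPARK_MASTER_HOST=node1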
The effect of (3) through (6) is as follows:
[hadoop@node1 server]$ cd /export/server/spark-3.4.4-bin-hadoop3/conf
[hadoop@node1 conf]$ cp workers.template workers
[hadoop@node1 conf]$ vi workers
[hadoop@node1 conf]$ cp spark-defaults.conf.template spark-defaults.conf
[hadoop@node1 conf]$ vi spark-defaults.conf
[hadoop@node1 conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@node1 conf]$ vi spark-env.sh
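Before moving on, a quick way to confirm the three edited files contain what you expect (a sketch, run from the conf directory):
cat workers                                        # should print node2 and node3
grep -v '^#' spark-defaults.conf | grep -v '^$'    # the four spark.* lines
grep -v '^#' spark-env.sh | grep -v '^$'           # the nine variables added above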
(7) Create a logs directory and change the owner and group of the spark-3.4.4-bin-hadoop3 directory
mkdir -p /export/server/spark-3.4.4-bin-hadoop3/logs
chown -R hadoop:hadoop /export/server/spark-3.4.4-bin-hadoop3
The effect of (7) is as follows:
[hadoop@node1 conf]$ mkdir -p /export/server/spark-3.4.4-bin-hadoop3/logs
[hadoop@node1 conf]$ chown -R hadoop:hadoop /export/server/spark-3.4.4-bin-hadoop3
[hadoop@node1 conf]$ cd ..
[hadoop@node1 spark-3.4.4-bin-hadoop3]$ ll
total 124
drwxr-xr-x 2 hadoop hadoop 4096 Oct 21 10:29 bin
drwxr-xr-x 2 hadoop hadoop 260 Nov 25 17:00 conf
drwxr-xr-x 5 hadoop hadoop 50 Oct 21 10:29 data
drwxr-xr-x 4 hadoop hadoop 29 Oct 21 10:29 examples
drwxr-xr-x 2 hadoop hadoop 12288 Oct 21 10:29 jars
drwxr-xr-x 4 hadoop hadoop 38 Oct 21 10:29 kubernetes
-rw-r--r-- 1 hadoop hadoop 22982 Oct 21 10:29 LICENSE
drwxr-xr-x 2 hadoop hadoop 4096 Oct 21 10:29 licenses
drwxrwxr-x 2 hadoop hadoop 6 Nov 25 17:00 logs
-rw-r--r-- 1 hadoop hadoop 57842 Oct 21 10:29 NOTICE
drwxr-xr-x 9 hadoop hadoop 311 Oct 21 10:29 python
drwxr-xr-x 3 hadoop hadoop 17 Oct 21 10:29 R
-rw-r--r-- 1 hadoop hadoop 4605 Oct 21 10:29 README.md
-rw-r--r-- 1 hadoop hadoop 166 Oct 21 10:29 RELEASE
drwxr-xr-x 2 hadoop hadoop 4096 Oct 21 10:29 sbin
drwxr-xr-x 2 hadoop hadoop 42 Oct 21 10:29 yarn
(8) Distribute the Spark installation to the other nodes (make sure node1 is still on the hadoop user):
scp -r /export/server/spark-3.4.4-bin-hadoop3/ node2:/export/server/
scp -r /export/server/spark-3.4.4-bin-hadoop3/ node3:/export/server/
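Optionally, confirm that the copies landed on both workers (run from node1 as the hadoop user; this assumes the passwordless SSH already configured for the Hadoop cluster):
ssh node2 "ls -d /export/server/spark-3.4.4-bin-hadoop3"
ssh node3 "ls -d /export/server/spark-3.4.4-bin-hadoop3"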
(9) Configure the Spark environment variables on all nodes (node1, node2, node3). Remember to switch node1 back to the root user for this step.
vi /etc/profile
Add the following at the end of the file:
export SPARK_HOME=/export/server/spark-3.4.4-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
Then run source /etc/profile so that the settings take effect.
vi /etc/profile
export SPARK_HOME=/export/server/spark-3.4.4-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
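After source /etc/profile, a quick check on each node (a sketch; spark-submit also needs a working JAVA_HOME, which the earlier Hadoop setup should already have added to /etc/profile):
echo $SPARK_HOME
spark-submit --version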
The effect of (9) can be checked on node1, node2, and node3 (the original screenshots are omitted here).
(10) On node1, switch back to the hadoop user, start the Hadoop trio first, and then start Spark.
Start the Hadoop trio and create the /spark-logs directory on HDFS:
su - hadoop
start-dfs.sh
start-yarn.sh
hdfs dfs -mkdir /spark-logs
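If you re-run this step later, the plain mkdir will complain that /spark-logs already exists; adding -p makes it safe to repeat (a small variation, not in the original):
hdfs dfs -mkdir -p /spark-logs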
Start Spark:
cd /export/server/spark-3.4.4-bin-hadoop3/sbin
./start-all.sh
The effect of (10) is as follows:
Starting the Hadoop trio and creating /spark-logs on HDFS:
[root@node1 ~]# su - hadoop
Last login: Mon Nov 25 16:55:57 CST 2024 on pts/0
[hadoop@node1 ~]$ start-dfs.sh
Starting namenodes on [node1]
Starting datanodes
Starting secondary namenodes [node1]
[hadoop@node1 ~]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
[hadoop@node1 ~]$ hdfs dfs -mkdir /spark-logs
[hadoop@node1 ~]$ hdfs dfs -ls /
Found 10 items
drwxrwx--- - hadoop supergroup 0 2024-11-03 16:09 /data
drwxr-xr-x - hadoop supergroup 0 2024-11-14 15:49 /export
drwxr-xr-x - hadoop supergroup 0 2024-11-17 19:50 /hbase
drwxr-xr-x - hadoop supergroup 0 2024-11-13 14:39 /hdfs_api2
drwxr-xr-x - hadoop supergroup 0 2024-11-14 15:56 /myhive2
drwxrwxrwx - hadoop supergroup 0 2024-11-20 09:11 /output
drwxr-xr-x - hadoop supergroup 0 2024-11-25 17:06 /spark-logs
-rw-r--r-- 3 hadoop supergroup 92 2024-11-18 23:27 /test.txt
drwx-wx-wx - hadoop supergroup 0 2024-11-03 17:07 /tmp
drwxr-xr-x - hadoop supergroup 0 2024-11-03 17:05 /user
After starting Spark:
On node1:
[hadoop@node1 ~]$ cd /export/server/spark-3.4.4-bin-hadoop3/sbin
[hadoop@node1 sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node1.out
node3: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node3.out
node2: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node2.out
[hadoop@node1 sbin]$ jps
13184 DataNode
14624 NodeManager
12981 NameNode
13637 SecondaryNameNode
17029 Jps
14422 ResourceManager
15099 WebAppProxyServer
16717 Master
[hadoop@node1 sbin]$
On node2:
[root@node2 ~]# jps
13142 DataNode
16856 Jps
16541 Worker
14287 NodeManager
[root@node2 ~]#
On node3:
[root@node3 ~]# jps
17138 Jps
14435 NodeManager
13309 DataNode
16686 Worker
[root@node3 ~]#
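As an optional smoke test beyond the original steps, you can submit the bundled SparkPi example to the standalone master; the examples jar name below is the one the 3.4.4 prebuilt package (Scala 2.12) normally ships with, so check $SPARK_HOME/examples/jars if it differs:
spark-submit \
  --master spark://node1:7077 \
  --class org.apache.spark.examples.SparkPi \
  /export/server/spark-3.4.4-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.4.jar 10
# look for a line like "Pi is roughly 3.14..." in the driver output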
(11) Open the Spark Master web UI (typing node1:8080 straight into the browser also works):
http://node1:8080
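If the page does not load, here is a quick reachability check from node1 (assuming curl is installed), plus the other default UI ports for reference; the history-server URL only responds if you started it as sketched in step (5):
curl -s -o /dev/null -w "%{http_code}\n" http://node1:8080   # Spark Master UI, expect 200
# worker UIs default to port 8081:  http://node2:8081  and  http://node3:8081
# history server (if started):      http://node1:18080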
(12) Shut down Spark and Hadoop
cd /export/server/spark-3.4.4-bin-hadoop3/sbin
./stop-all.sh
cd
stop-yarn.sh
stop-dfs.sh
jps
The effect is as follows:
[hadoop@node1 sbin]$ cd /export/server/spark-3.4.4-bin-hadoop3/sbin
[hadoop@node1 sbin]$ ./stop-all.sh
node2: stopping org.apache.spark.deploy.worker.Worker
node3: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[hadoop@node1 sbin]$ cd
[hadoop@node1 ~]$ stop-yarn.sh
Stopping nodemanagers
Stopping resourcemanager
Stopping proxy server [node1]
[hadoop@node1 ~]$ stop-dfs.sh
Stopping namenodes on [node1]
Stopping datanodes
Stopping secondary namenodes [node1]
[hadoop@node1 ~]$ jps
45472 Jps
[hadoop@node1 ~]$
3. Summary
If you run into errors, they are usually ownership or permission problems under /export/server/spark-3.4.4-bin-hadoop3 (the sbin scripts in particular); re-running the chown from step (7) normally sorts them out.
For everyday use, to start Hadoop and Spark just copy the commands below and run them on node1:
su - hadoop
start-dfs.sh
start-yarn.sh
cd /export/server/spark-3.4.4-bin-hadoop3/sbin
./start-all.sh
To shut down Hadoop and Spark, copy the commands below and run them on node1 (these also appear above):
cd /export/server/spark-3.4.4-bin-hadoop3/sbin
./stop-all.sh
cd
stop-yarn.sh
stop-dfs.sh
jps
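If you would rather not paste these commands every time, you could wrap them in two small helper scripts on node1 and run them as the hadoop user (hypothetical file names; a sketch rather than part of the original guide):
#!/usr/bin/env bash
# save as e.g. /home/hadoop/cluster-start.sh and chmod +x it;
# relies on start-dfs.sh / start-yarn.sh already being on the hadoop user's PATH
set -e
start-dfs.sh
start-yarn.sh
/export/server/spark-3.4.4-bin-hadoop3/sbin/start-all.sh
jps

#!/usr/bin/env bash
# save as e.g. /home/hadoop/cluster-stop.sh and chmod +x it
set -e
/export/server/spark-3.4.4-bin-hadoop3/sbin/stop-all.sh
stop-yarn.sh
stop-dfs.sh
jps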