Hadoop Big Data Framework Study (3): Deploying a Highly Available (HA) Spark Cluster

This article walks through setting up Spark on an existing Hadoop environment, including downloading and configuring Scala and Spark, starting and testing the Spark cluster, configuring the Spark History Server, and using the interactive analysis tool spark-shell. A WordCount task is used as a concrete example.


I recently deployed and tested the big data frameworks of the Hadoop ecosystem and combined them with the ArcGIS platform's vector big data analysis product for spatial data mining. This blog series organizes and summarizes the process in detail for discussion and mutual learning.

A. Scala Download and Configuration

1. Download scala-2.11.11.tgz:
https://www.scala-lang.org/download/2.11.11.html
2. Copy the archive to /home/hadoop/hadoop/ on node1 and extract it.

3. Edit the environment variables

[root@node1 bin]# vim /etc/profile
Add the Scala settings:
export SCALA_HOME=/home/hadoop/hadoop/scala-2.11.11
export PATH=$PATH:$SCALA_HOME/bin
[root@node1 bin]# source /etc/profile
[root@node1 bin]# scala -version


Copy the extracted Scala directory to the other hosts and apply the same /etc/profile configuration there (a scripted version is sketched below):
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node2.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node3.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node4.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node5.gisxy.com:/home/hadoop/hadoop/
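To avoid repeating the command for every host, the copy and a quick version check can be scripted. A minimal sketch, assuming passwordless SSH for the hadoop user and the same /home/hadoop/hadoop layout on every node (the /etc/profile edit still has to be done as root on each host):

for host in node2 node3 node4 node5; do
  # copy the extracted Scala directory to each node
  scp -r /home/hadoop/hadoop/scala-2.11.11 hadoop@${host}.gisxy.com:/home/hadoop/hadoop/
  # confirm the copy by printing the Scala version remotely
  ssh hadoop@${host}.gisxy.com '/home/hadoop/hadoop/scala-2.11.11/bin/scala -version'
done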

B. Spark Download and Configuration

1. Download spark-2.2.1-bin-hadoop2.7.tgz:
https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
2. Copy the archive to /home/hadoop/hadoop/ and extract it.
3. Edit the Spark configuration
[hadoop@node1 hadoop]$ cd spark-2.2.1-bin-hadoop2.7/conf

[hadoop@node1 conf]$ cp spark-env.sh.template spark-env.sh

export JAVA_HOME=/usr/local/jdk1.8.0_151
export SCALA_HOME=/home/hadoop/hadoop/scala-2.11.11
export HADOOP_HOME=/home/hadoop/hadoop/hadoop-2.7.5
export HADOOP_CONF_DIR=/home/hadoop/hadoop/hadoop-2.7.5/etc/hadoop
#export SPARK_MASTER_IP=node2.gisxy.com
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node3.gisxy.com:2181,node4.gisxy.com:2181,node5.gisxy.com:2181 -Dspark.deploy.zookeeper.dir=/spark"
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://node1.gisxy.com:9000/history"

In an HA environment, SPARK_DAEMON_JAVA_OPTS must be set so that the recovery mode is ZOOKEEPER, and SPARK_MASTER_IP must be commented out (see the sketch below for how clients address the two masters).
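With ZooKeeper recovery enabled there is no single fixed master, so client tools should list both masters in the spark:// URL and will fail over between them automatically. A minimal sketch, assuming both masters listen on the default port 7077:

[hadoop@node1 bin]$ ./spark-shell --master spark://node1.gisxy.com:7077,node2.gisxy.com:7077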

4. Edit the slaves file with the worker hosts
[hadoop@node1 conf]$ vim slaves
node3.gisxy.com
node4.gisxy.com
node5.gisxy.com
5. Copy Spark to the other nodes
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node2.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node3.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node4.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node5.gisxy.com:/home/hadoop/hadoop/

C. Spark Usage and Testing

1. As the hadoop user, start the Spark cluster on node1
[hadoop@node1 sbin]$ ./start-all.sh
[hadoop@node1 sbin]$ jps

4052 NameNode
24166 Master
4392 DFSZKFailoverController
24251 Jps
4494 ResourceManager
2. As the hadoop user, start a standby Master on node2
[hadoop@node2 sbin]$ ./start-master.sh
[hadoop@node2 sbin]$ jps
36048 DFSZKFailoverController
3332 NameNode
36101 Jps
3640 ResourceManager
35839 Master

3. Check the Worker processes on node3, node4, and node5

[hadoop@node3 hadoop]$ jps
3638 DataNode
3851 NodeManager
3516 QuorumPeerMain
17900 Worker
3743 JournalNode
18159 Jps

4. Open the web UI on port 8080 of node1 and node2 in a browser, then kill the Master process on node1 (PID 24166 above). After a short delay, the Master on node2 switches from STANDBY to ALIVE (a command-line check is sketched below).
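The failover can also be watched from the command line: the standalone Master web UI exposes a JSON view of its state at /json. A quick check, assuming the default 8080 UI port on both masters:

[hadoop@node1 ~]$ curl -s http://node1.gisxy.com:8080/json | grep status
[hadoop@node1 ~]$ curl -s http://node2.gisxy.com:8080/json | grep status

Before the kill one master reports ALIVE and the other STANDBY; afterwards node2 should report ALIVE.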



5. Add a worker machine

[hadoop@node1 sbin]$ ./start-slave.sh spark://192.168.10.100:7077
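In an HA setup the new worker can also be given both masters so that it re-registers after a failover; a sketch under that assumption (same default port 7077):

[hadoop@node1 sbin]$ ./start-slave.sh spark://node1.gisxy.com:7077,node2.gisxy.com:7077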


6. Spark History Server configuration
Edit spark-defaults.conf:
[hadoop@node1 conf]$ vim spark-defaults.conf

# spark.master                     spark://master:7077
spark.eventLog.enabled             true
spark.eventLog.dir                 hdfs://node1.gisxy.com:9000/history
spark.eventLog.compress            true
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

Create the HDFS directory for the event logs and start the history server:
[hadoop@node1 bin]$ hadoop fs -mkdir /history
[hadoop@node1 sbin]$ ./start-history-server.sh
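Once an application has run with event logging enabled, the setup can be verified quickly; a small sketch, assuming the configuration above (event logs under /history, UI on port 18080):

[hadoop@node1 bin]$ hadoop fs -ls /history

Completed applications should also be listed at http://node1.gisxy.com:18080.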

7. Start the interactive analysis tool spark-shell

[hadoop@node1 bin]$ ./spark-shell
The currently running jobs can be viewed in the web UI.


[hadoop@node1 bin]$ ./spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/23 13:35:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.10.100:4040
Spark context available as 'sc' (master = local[*], app id = local-1519364152828).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val file = sc.textFile("hdfs://HA/bigdatas/wordcount.txt")
file: org.apache.spark.rdd.RDD[String] = hdfs://HA/bigdatas/wordcount.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26
scala> counts.saveAsTextFile("hdfs://HA/bigdatas/output")
scala> :quit
View the results:
[hadoop@node1 bin]$ hadoop fs -cat /bigdatas/output/part-00000
Alternatively, download the output file from HDFS and open it in a text editor.
(gisxy,1)
(esrichina,1)
(world,4)
(hello,4)
(esri,1)
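If the job writes several part files, they can also be pulled to the local filesystem in one step; a sketch using getmerge (the local file name is arbitrary):

[hadoop@node1 bin]$ hadoop fs -getmerge /bigdatas/output ./wordcount_result.txt
[hadoop@node1 bin]$ cat ./wordcount_result.txt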

8. Running a Python script in the Spark environment

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
from operator import add
from pyspark import SparkContext

logging.basicConfig(format='%(message)s', level=logging.INFO)

# input and output paths on the HA HDFS nameservice
test_file_name = "hdfs://HA/bigdatas/wordcount.txt"
out_file_name = "hdfs://HA/bigdatas/outputs_py"

sc = SparkContext("local", "wordcount app")
text_file = sc.textFile(test_file_name)

# split each line into words, map each word to 1, and sum the counts per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(add)
counts.saveAsTextFile(out_file_name)
sc.stop()

[hadoop@node1 bin]$ ./spark-submit /home/hadoop/hadoop/hadoop275_tmp/wordcount.py
[hadoop@node1 bin]$ hadoop fs -ls /bigdatas
[hadoop@node1 bin]$ hadoop fs -cat /bigdatas/outputs_py/part-00000
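The script above hard-codes a local master in the SparkContext. To run the same job on the standalone HA cluster, the master can instead be supplied by spark-submit, which requires creating the context without a fixed master (for example SparkContext(appName="wordcount app")); a sketch under that assumption:

[hadoop@node1 bin]$ ./spark-submit --master spark://node1.gisxy.com:7077,node2.gisxy.com:7077 /home/hadoop/hadoop/hadoop275_tmp/wordcount.py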

