I have recently been deploying and testing big data frameworks from the Hadoop ecosystem, combined with the vector big data analysis products of the ArcGIS platform for spatial data mining. This blog series organizes and summarizes the process in detail for mutual learning and exchange.
A. Scala Download and Configuration
1. Download scala-2.11.11.tgz:
https://www.scala-lang.org/download/2.11.11.html
2. Copy the Scala package to /home/hadoop/hadoop/ and extract it (see the example below)
3. Edit the environment variables
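A minimal sketch of the extraction step, assuming the tarball has already been uploaded to /home/hadoop/hadoop/:
[hadoop@node1 hadoop]$ tar -zxvf scala-2.11.11.tgz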
[root@node1 bin]# vim /etc/profile
Add the Scala-related settings:
export SCALA_HOME=/home/hadoop/hadoop/scala-2.11.11
export PATH=$PATH:$SCALA_HOME/bin
[root@node1 bin]# source /etc/profile
[root@node1 bin]# scala -version
Copy Scala to the other hosts and apply the corresponding configuration (see the profile example after the scp commands):
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node2.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node3.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node4.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r scala-2.11.11 hadoop@node5.gisxy.com:/home/hadoop/hadoop/
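On each of the other nodes, repeat the environment variable configuration from node1, i.e. append the same entries to /etc/profile as root and reload it (shown here for node2; the same applies to node3 through node5):
[root@node2 ~]# vim /etc/profile
export SCALA_HOME=/home/hadoop/hadoop/scala-2.11.11
export PATH=$PATH:$SCALA_HOME/bin
[root@node2 ~]# source /etc/profile
[root@node2 ~]# scala -version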
B. Spark Download and Configuration
1. Download spark-2.2.1-bin-hadoop2.7.tgz:
https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
2. Copy the Spark package to /home/hadoop/hadoop/ and extract it (see the example below)
3. Modify the Spark configuration
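As with Scala, a sketch of the extraction step, assuming the tarball is already in /home/hadoop/hadoop/:
[hadoop@node1 hadoop]$ tar -zxvf spark-2.2.1-bin-hadoop2.7.tgz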
[hadoop@node1 hadoop]$ cd spark-2.2.1-bin-hadoop2.7/conf
[hadoop@node1 conf]$ cp spark-env.sh.template spark-env.sh
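Then append the following settings to spark-env.sh, for example:
[hadoop@node1 conf]$ vim spark-env.sh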
export JAVA_HOME=/usr/local/jdk1.8.0_151
export SCALA_HOME=/home/hadoop/hadoop/scala-2.11.11
export HADOOP_HOME=/home/hadoop/hadoop/hadoop-2.7.5
export HADOOP_CONF_DIR=/home/hadoop/hadoop/hadoop-2.7.5/etc/hadoop
#export SPARK_MASTER_IP=node2.gisxy.com
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node3.gisxy.com:2181,node4.gisxy.com:2181,node5.gisxy.com:2181 -Dspark.deploy.zookeeper.dir=/spark"
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://node1.gisxy.com:9000/history"
In an HA environment, SPARK_DAEMON_JAVA_OPTS must be exported to set the recovery mode to ZooKeeper, and SPARK_MASTER_IP should be commented out.
4. Edit the slaves file with the Worker host information
[hadoop@node1 conf]$ vim slaves
node3.gisxy.com
node4.gisxy.com
node5.gisxy.com
5. Copy Spark to the other nodes
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node2.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node3.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node4.gisxy.com:/home/hadoop/hadoop/
[hadoop@node1 hadoop]$ scp -r spark-2.2.1-bin-hadoop2.7 node5.gisxy.com:/home/hadoop/hadoop/
C. Spark Usage Test
1. As the hadoop user, start the Spark cluster on node1
[hadoop@node1 sbin]$ ./start-all.sh
[hadoop@node1 sbin]$ jps
4052 NameNode
24166 Master
4392 DFSZKFailoverController
24251 Jps
4494 ResourceManager
2. As the hadoop user, start the standby Master on node2
[hadoop@node2 sbin]$ ./start-master.sh
[hadoop@node2 sbin]$ jps
36048 DFSZKFailoverController
3332 NameNode
36101 Jps
3640 ResourceManager
35839 Master
3. Check the Worker process on node3, node4, and node5
[hadoop@node3 hadoop]$ jps
3638 DataNode
3851 NodeManager
3516 QuorumPeerMain
17900 Worker
3743 JournalNode
18159 Jps
4. Open port 8080 of node1 and port 8080 of node2 in a browser, then kill the Master process on node1 (PID 24166 above); after a short delay, the Master on node2 switches from STANDBY to ALIVE.
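A sketch of the kill step, using the Master PID from the jps output in step 1:
[hadoop@node1 sbin]$ kill -9 24166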
5. Add a Worker machine
[hadoop@node1 sbin]$ ./start-slave.sh spark://192.168.10.100:7077
6. Configure the Spark History Server
Modify spark-defaults.conf
[hadoop@node1 conf]$ vim spark-defaults.conf
# spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1.gisxy.com:9000/history
spark.eventLog.compress true
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
Create the HDFS directory for the event logs and start the History Server:
[hadoop@node1 bin]$ hadoop fs -mkdir /history
[hadoop@node1 sbin]$ ./start-history-server.sh
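With the SPARK_HISTORY_OPTS settings above, the History Server web UI should then be reachable on port 18080 of the node where it was started, e.g. http://node1.gisxy.com:18080.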
7. Launch the interactive data analysis tool spark-shell; the startup log prints the Web UI address (port 4040), which can be opened in a browser to view the currently running jobs.
[hadoop@node1 bin]$ ./spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/23 13:35:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.10.100:4040
Spark context available as 'sc' (master = local[*], app id = local-1519364152828).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val file = sc.textFile("hdfs://HA/bigdatas/wordcount.txt")
file: org.apache.spark.rdd.RDD[String] = hdfs://HA/bigdatas/wordcount.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26
scala> counts.saveAsTextFile("hdfs://HA/bigdatas/output")
scala> :quit
View the results
[hadoop@node1 bin]$ hadoop fs -cat /bigdatas/output/part-00000
Alternatively, the result file can be downloaded from HDFS and opened in a text editor (see the example after the output below).
(gisxy,1)
(esrichina,1)
(world,4)
(hello,4)
(esri,1)
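A sketch of downloading the result file instead of cat-ing it (the local file name is just an example):
[hadoop@node1 bin]$ hadoop fs -get /bigdatas/output/part-00000 ./wordcount_result.txt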
8. Run a Python script in the Spark environment
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
from operator import add
from pyspark import SparkContext

logging.basicConfig(format='%(message)s', level=logging.INFO)

# Input and output paths on the HA HDFS namespace
test_file_name = "hdfs://HA/bigdatas/wordcount.txt"
out_file_name = "hdfs://HA/bigdatas/outputs_py"

# Local master; the app name shows up in the Web UI
sc = SparkContext("local", "wordcount app")

# Word count: split each line into words, pair each word with 1, then sum per word
text_file = sc.textFile(test_file_name)
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(add)
counts.saveAsTextFile(out_file_name)
sc.stop()
[hadoop@node1 bin]$ ./spark-submit /home/hadoop/hadoop/hadoop275_tmp/wordcount.py
[hadoop@node1 bin]$ hadoop fs -ls /bigdatas
[hadoop@node1 bin]$ hadoop fs -cat /bigdatas/outputs_py/part-00000
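The script above hard-codes a local master in SparkContext("local", ...). To run it on the standalone cluster instead, one option (a sketch, not from the original walkthrough) is to create the context without a master, e.g. SparkContext(appName="wordcount app"), and pass the master URL on the command line:
[hadoop@node1 bin]$ ./spark-submit --master spark://node1.gisxy.com:7077,node2.gisxy.com:7077 /home/hadoop/hadoop/hadoop275_tmp/wordcount.py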