I. Spark Architecture Overview
Spark is a next-generation distributed big-data processing platform that emerged after Hadoop. It is an in-memory, fault-tolerant distributed compute engine that can run some workloads up to 100 times faster than Hadoop MapReduce. Its strong user experience and unified technology stack cover most of the core needs of the big-data space, which has made Spark one of the most popular big-data platforms today.
Spark supports multiple languages, including Scala, Python, Java, and R, and is particularly well suited to iterative computation and interactive analysis. On top of the RDD (Resilient Distributed Dataset, a fault-tolerant, parallel data structure) it provides Spark Streaming for stream processing, Spark SQL for structured data, the MLlib machine-learning library, and GraphX for graph computation; see the official Spark website for details.
Like Hadoop, Spark uses a master/slave architecture. A Spark cluster can be deployed either as a pure compute engine or integrated with Hadoop HDFS; in the latter case the Spark worker nodes run on the Hadoop HDFS DataNode machines. As shown in the figure below, the cluster contains three kinds of nodes: master nodes (the Spark master and the Hadoop NameNode), slave nodes (the Spark workers and Hadoop DataNodes), and the client (the driver node). This deployment uses Spark's standalone cluster manager for resource management; users interact with the Spark and Hadoop clusters through the client, and the user's driver program runs on the driver node.
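As an illustration of the driver-program model described above, a minimal standalone-mode driver might look roughly like the sketch below; the application name is arbitrary, and the master URL simply reuses the spark.master address configured later in this guide.

import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // Placeholder app name; the master URL points at the standalone master.
    val conf = new SparkConf()
      .setAppName("minimal-driver")
      .setMaster("spark://10.101.192.193:7077")
    val sc = new SparkContext(conf)

    // The driver builds the job; the executors on the worker nodes run the tasks.
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")

    sc.stop()
  }
}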
II. Building Spark
Build a deployable Spark distribution from source with Maven, enabling Hadoop 2.6, Hive, and YARN support:
$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
$ ./dev/make-distribution.sh --tgz -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DskipTests -Dhadoop.version=2.6.0
III. Spark Configuration and Installation
1. Configure slaves
List the worker hostnames in conf/slaves, one per line:
$ mv conf/slaves.template conf/slaves
$ vim conf/slaves
slave1
slave2
2. Configure spark-env.sh
export JAVA_HOME=/usr/ali/java
export HADOOP_HOME=/usr/local/lab/hadoop
export HBASE_HOME=/usr/local/lab/hbase
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
export SPARK_DIST_CLASSPATH=$HBASE_HOME/lib/*:$SPARK_DIST_CLASSPATH
SPARK_MASTER_HOST=10.101.192.193
SPARK_LOCAL_DIRS=/apsarapangu/disk3/spark
SPARK_EXECUTOR_CORES=4
SPARK_EXECUTOR_INSTANCES=4
SPARK_DRIVER_MEMORY=2G
SPARK_EXECUTOR_MEMORY=2G
SPARK_YARN_APP_NAME="spark 2.2.0"
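For reference, the executor resources set above can also be configured per application through SparkConf; a brief sketch (spark.executor.cores and spark.executor.memory are the standard configuration keys corresponding to the environment variables above; the application name is arbitrary):

import org.apache.spark.SparkConf

// Sketch: the same executor resources expressed as application-level configuration.
val conf = new SparkConf()
  .setAppName("resource-config-example")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "2g")
// Driver memory must be fixed before the driver JVM starts, so it is normally set
// in spark-defaults.conf or via --driver-memory rather than programmatically.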
3. Configure spark-defaults.conf
spark.master spark://10.101.192.193:7077
spark.yarn.historyServer.address 10.101.192.193:18080
spark.history.ui.port 18080
spark.eventLog.enabled true
spark.eventLog.dir hdfs://10.101.192.193:16100/spark/events
spark.history.fs.logDirectory hdfs://10.101.192.193:16100/spark/events
spark.driver.memory 2g
spark.serializer org.apache.spark.serializer.KryoSerializer
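With spark.serializer set to KryoSerializer, application classes that end up in cached or shuffled data can additionally be registered with Kryo; a brief sketch (the Event case class is hypothetical):

import org.apache.spark.SparkConf

// Hypothetical application class that will be serialized by Kryo.
case class Event(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a compact id instead of the full class name.
  .registerKryoClasses(Array(classOf[Event]))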
4. Deploy Spark
Distribute the built and configured Spark directory to every node listed in all_iplist:
$ pscp -r -h all_iplist spark /home/spark/
5. Start the Spark services
Once start-all.sh completes, the standalone master web UI (port 8080 by default) should list all of the registered workers.
$ ./sbin/start-all.sh
IV. Testing Spark
1. Sum the integers from 1 to 1,000,000
$ ./bin/spark-shell --master spark://master:7077
scala> val data = new Array[Int](1000000)
data: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...)
scala> for (i <- 0 until data.length)
| data(i) = i + 1
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> distData.reduce(_+_)
res1: Int = 1784293664
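Note that the exact sum 1 + 2 + ... + 1000000 = 500000500000 does not fit in an Int, so the result above has silently wrapped around. Widening the elements to Long in the same shell returns the correct value:
scala> distData.map(_.toLong).reduce(_ + _)
res2: Long = 500000500000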
2. Count the lines containing "Spark" in the Spark README.md
$ ../hadoop/bin/hdfs dfs -put README.md /in/
$ ./bin/spark-shell --master spark://master:7077
scala> val textFile = sc.textFile("hdfs://master:16100/in/README.md")
textFile: org.apache.spark.rdd.RDD[String] = hdfs://10.101.192.193:16100/in/README.md MapPartitionsRDD[2] at textFile at <console>:24
scala> textFile.filter(line => line.contains("Spark")).count()
res2: Long = 20
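The filter above counts the lines that contain "Spark" (mirroring the official quick start). To count individual word occurrences instead, a small variant can be run in the same shell; the whitespace split and case-insensitive match are arbitrary choices here, and the exact count depends on the README version:
scala> textFile.flatMap(_.split("\\s+")).filter(_.toLowerCase.contains("spark")).count()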
3. Spark SQL test
$ ../hadoop/bin/hdfs dfs -put examples/src/main/resources/people.json /in/
$ ./bin/spark-shell --master spark://master:7077
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5a4929f5
scala> val df = sqlContext.read.json("hdfs://master:16100/in/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> df.select("name").show()
+-------+
| name|
+-------+
|Michael|
| Andy|
| Justin|
+-------+
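Since Spark 2.x the shell also pre-creates a SparkSession named spark, so the same data can be read and queried without constructing a SQLContext explicitly; a sketch (the temporary view name is arbitrary):
scala> val df = spark.read.json("hdfs://master:16100/in/people.json")
scala> df.createOrReplaceTempView("people")
scala> spark.sql("SELECT name FROM people WHERE age >= 19").show()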
4. Submit a job via YARN
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/jars/spark-examples_2.11-2.2.0.jar 100
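In cluster deploy mode the driver runs inside YARN, so the "Pi is roughly ..." line ends up in the application logs (retrievable with yarn logs -applicationId <appId>) rather than on the submitting console. For reference, the bundled SparkPi example is essentially a Monte Carlo estimate along the lines of the simplified sketch below (not the exact shipped source):

import scala.math.random
import org.apache.spark.sql.SparkSession

object PiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PiSketch").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices

    // Sample points in the unit square and count those that fall inside the unit circle.
    val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)

    // Area ratio circle/square = pi/4, so pi is roughly 4 * (points inside) / (total points).
    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}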