Getting Started with Spark
Since I already have a working Hadoop and HBase cluster, I chose the "without-hadoop" build of Spark 1.5.2.
Installation
tar -xf /home/yuzx/data/download/spark-1.5.2-bin-without-hadoop.tgz -C /home/yuzx/server
ln -sf -T /home/yuzx/server/spark-1.5.2-bin-without-hadoop /home/yuzx/server/spark
Configuring spark-env.sh
The Spark installation directory ships with a template for each configuration file.
# the conf directory contains a template for each config file
cp ${SPARK_HOME}/conf/spark-env.sh.template ${SPARK_HOME}/conf/spark-env.sh
spark-env.sh
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
export SCALA_HOME=/home/yuzx/server/scala
export JAVA_HOME=/home/yuzx/server/jdk7
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
export HADOOP_CONF_DIR=/home/yuzx/server/hadoop/etc/hadoop
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
export SPARK_WORKER_MEMORY=2g
export SPARK_MASTER_IP=10.0.3.242
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# Pick whichever one of the following three approaches fits your setup
# http://spark.apache.org/docs/latest/hadoop-provided.html
# If 'hadoop' binary is on your PATH
#export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/home/yuzx/server/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
#export SPARK_DIST_CLASSPATH=$(hadoop --config /home/yuzx/server/hadoop/etc/hadoop classpath)
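Because the without-hadoop build gets every Hadoop class through SPARK_DIST_CLASSPATH, it is worth checking what that variable actually expands to before starting any daemons. A minimal sketch, assuming the paths used above:
# Print the classpath reported by the local Hadoop install; this is exactly what
# Spark appends to its own classpath at launch time
/home/yuzx/server/hadoop/bin/hadoop classpath
# Or source the file and inspect the value Spark will see
source ${SPARK_HOME}/conf/spark-env.sh && echo ${SPARK_DIST_CLASSPATH}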
slaves
# A Spark Worker will be started on each of the machines listed below.
dn1
dn2
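Like spark-env.sh, the slaves file can be created from the template shipped with the distribution; start-all.sh will ssh into every host listed in it, so passwordless ssh from the master to dn1 and dn2 is assumed here. A sketch:
cp ${SPARK_HOME}/conf/slaves.template ${SPARK_HOME}/conf/slaves
# replace the default 'localhost' entry with the worker hostnames, e.g. dn1 and dn2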
Starting the cluster
Run the following command on the master node:
sbin/start-all.sh
After startup, use jps on the remote nodes to check the Java processes: the master node should show a Master process, and each worker (slave) node should show a Worker process.
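A quick way to confirm this from the master node, assuming passwordless ssh to the workers and jps on their PATH (dn1/dn2 are the hostnames from the slaves file above):
# On the master itself, expect a Master process
jps
# On each worker, expect a Worker process
ssh dn1 jps
ssh dn2 jps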
Verifying the Spark cluster installation
On a local machine (not a cluster node), set up the client environment. spark-env.sh still needs to be configured there; at a minimum it needs export HADOOP_CONF_DIR=XXX.
Running on the Spark cluster
# Launch in client mode: the driver runs locally on the client, the executors run on the cluster's worker nodes, and the cluster manager is Spark's own standalone manager
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://10.0.3.242:7077 \
--deploy-mode client \
--executor-memory 2g \
--executor-cores 1 \
--total-executor-cores 100 \
lib/spark-examples-1.5.2-hadoop2.2.0.jar \
500
While the job is running, you can open
http://127.0.0.1:4040/
to monitor its progress.
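The same information is also exposed as JSON through the driver's REST monitoring API (available since Spark 1.4), which is handy for scripted checks; a sketch, with a hypothetical application id placeholder:
# List the application currently served by the driver UI on port 4040
curl http://127.0.0.1:4040/api/v1/applications
# Drill into the jobs of one application, using an id returned by the call above
curl http://127.0.0.1:4040/api/v1/applications/<app-id>/jobs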
Running on YARN (the much-hyped Spark on YARN), i.e. running on the Hadoop cluster
Reference (which I have not had time to read yet):
http://spark.apache.org/docs/latest/running-on-yarn.html
First copy the Hadoop configuration files from the cluster to the local machine. Note that the local hosts file must contain entries for every Hadoop node.
# this can also be set in spark-env.sh
export HADOOP_CONF_DIR=/home/yuzx/data/hadoop-etc
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 50 \
--executor-memory 2g \
lib/spark-examples-1.5.2-hadoop2.2.0.jar \
500
# yarn-client mode
export HADOOP_CONF_DIR=/home/yuzx/data/hadoop-etc
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 50 \
--executor-memory 2g \
lib/spark-examples-1.5.2-hadoop2.2.0.jar \
500
How do you view the logs in yarn-cluster mode?
In my environment: http://dn2:8042/node/containerlogs/container_1449649503862_0004_01_000001/yuzx/stdout/?start=-4096
Open the YARN ResourceManager web UI, find the finished application, open it, and in the list of application attempts there is a Logs link on the far right; follow it and locate stdout.
In yarn-client mode the output is printed directly to the local console.
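If YARN log aggregation is enabled (yarn.log-aggregation-enable), the container logs can also be fetched from the command line once the application has finished, instead of clicking through the web UI. A sketch, using the application id corresponding to the container shown above:
# The application id is printed by spark-submit and listed in the ResourceManager UI
yarn logs -applicationId application_1449649503862_0004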
Walking through the execution
Reference:
http://spark.apache.org/docs/latest/cluster-overview.html
- org.apache.spark.examples.SparkPi is a Spark application: much like a small Java program, it has a main program
- Its main program is the driver program; it contains a SparkContext, and the SparkContext coordinates the execution of the Spark application
- The SparkContext can connect to several kinds of cluster managers, namely:
- Spark's own standalone cluster manager
- Apache Mesos
- Hadoop YARN
- Once connected to a cluster manager, it acquires executors on the nodes; judging from the diagram these are processes on the worker nodes (watching a SparkPi run with jps, they appear as CoarseGrainedExecutorBackend processes that start after the Spark application starts and exit when it finishes; see the sketch after this list)
- The executors run the computation and store the application's data
- Next, the SparkContext sends the application code to the executors (for the SparkPi application, that is spark-examples-1.5.2-hadoop2.2.0.jar)
- Finally, the SparkContext sends tasks to the executors, which run them (heh, those wicked tasks)
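To watch the executor lifecycle described above, keep an eye on the Java processes of a worker node while the SparkPi job is running; a rough sketch, assuming passwordless ssh to dn1:
# CoarseGrainedExecutorBackend processes appear shortly after the application
# starts and exit again once it finishes
watch -n 2 'ssh dn1 jps'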
Spark terminology
- Application: a user program built to run on Spark (you could also call it a Spark app); at runtime it consists of one driver program and a set of executor processes
- Application jar: the user program packaged as a jar, like an executable jar. If it has dependencies, bundle everything into a single assembly jar (a.k.a. an uber jar, a jar that contains all of its dependencies). Note that the uber jar must not include the Hadoop or Spark jars themselves; those are supplied by the framework at runtime, so declare them as runtime/provided dependencies.
- Driver program: the process that runs your application's main and creates the SparkContext; with deploy-mode=client, this driver runs locally on the client
- Cluster manager: an external service that allocates resources on the cluster (Standalone, Mesos, YARN)
- Deploy mode: determines where the driver runs; with cluster, the framework launches the driver inside the cluster, with client it usually runs locally on the client machine
- Worker node: a node that does the actual work, like a programmer, while the driver is the manager in charge of monitoring and coordination
- Executor: a process on a worker node that does the work; just as a programmer can take on jobs from different clients, an executor serves exactly one application. Each Spark application has its own set of executors, usually one or more per worker node, at the process level
- Task: a running Spark application is broken down into many tasks, each of which is sent to one of the executors for execution
- Job: a parallel computation made up of multiple tasks
- Stage: each job is split into smaller sets of tasks that depend on one another, similar to the map and reduce stages in MapReduce
Translating all of this is pretty tiring ~~