Getting Started with Spark
Since I already have a working Hadoop and HBase cluster, I chose the "without-hadoop" build of Spark 1.5.2.
Installation
tar -xf /home/yuzx/data/download/spark-1.5.2-bin-without-hadoop.tgz -C /home/yuzx/server
ln -sf -T /home/yuzx/server/spark-1.5.2-bin-without-hadoop /home/yuzx/server/spark
Configuring spark-env.sh
The Spark installation directory ships with a template for each configuration file.
# the conf directory contains a template for each config file
cp ${SPARK_HOME}/conf/spark-env.sh.template ${SPARK_HOME}/conf/spark-env.sh
spark-env.sh
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
export SCALA_HOME=/home/yuzx/server/scala
export JAVA_HOME=/home/yuzx/server/jdk7
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
export HADOOP_CONF_DIR=/home/yuzx/server/hadoop/etc/hadoop
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
export SPARK_WORKER_MEMORY=2g
export SPARK_MASTER_IP=10.0.3.242
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# Pick whichever one of the following three approaches fits your setup
# http://spark.apache.org/docs/latest/hadoop-provided.html
# If 'hadoop' binary is on your PATH
#export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/home/yuzx/server/hadoop/bin/hadoop classpath)
# Passing a Hadoop configuration directory
#export SPARK_DIST_CLASSPATH=$(hadoop --config /home/yuzx/server/hadoop/etc/hadoop classpath)
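Because the without-hadoop build gets every Hadoop class through SPARK_DIST_CLASSPATH, it is worth checking what that variable actually expands to before starting any daemons. A minimal sketch, assuming the paths used above:
# Print the classpath reported by the local Hadoop install; this is exactly what
# Spark appends to its own classpath at launch time
/home/yuzx/server/hadoop/bin/hadoop classpath
# Or source the file and inspect the value Spark will see
source ${SPARK_HOME}/conf/spark-env.sh && echo ${SPARK_DIST_CLASSPATH}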
slaves
# A Spark Worker will be started on each of the machines listed below.
dn1
dn2
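Like spark-env.sh, the slaves file can be created from the template shipped with the distribution; start-all.sh will ssh into every host listed in it, so passwordless ssh from the master to dn1 and dn2 is assumed here. A sketch:
cp ${SPARK_HOME}/conf/slaves.template ${SPARK_HOME}/conf/slaves
# replace the default 'localhost' entry with the worker hostnames, e.g. dn1 and dn2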
Starting the cluster
Run the following command on the master node:
sbin/start-all.sh
After startup, use jps on the remote nodes to check the Java processes: the master node should show a Master process, and each worker (slave) node should show a Worker process.
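A quick way to confirm this from the master node, assuming passwordless ssh to the workers and jps on their PATH (dn1/dn2 are the hostnames from the slaves file above):
# On the master itself, expect a Master process
jps
# On each worker, expect a Worker process
ssh dn1 jps
ssh dn2 jps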
Verifying the Spark cluster installation
On a local machine (not a cluster node), set up the client environment. spark-env.sh still needs to be configured there; at a minimum it needs export HADOOP_CONF_DIR=XXX.
Running on the Spark cluster
# Launch in client mode: the driver runs locally on the client, the executors run on the cluster's worker nodes, and the cluster manager is Spark's own standalone manager
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://10.0.3.242:7077 \
--deploy-mode client \
--executor-memory 2g \
--executor-cores 1 \
--total-executor-cores 100 \
lib/spark-examples-1.5.2-hadoop2.2.0.jar \
500
While the job is running, you can open
http://127.0.0.1:4040/
to monitor its progress.
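The same information is also exposed as JSON through the driver's REST monitoring API (available since Spark 1.4), which is handy for scripted checks; a sketch, with a hypothetical application id placeholder:
# List the application currently served by the driver UI on port 4040
curl http://127.0.0.1:4040/api/v1/applications
# Drill into the jobs of one application, using an id returned by the call above
curl http://127.0.0.1:4040/api/v1/applications/<app-id>/jobs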
Running on YARN (the much-hyped Spark on YARN), i.e. running on the Hadoop cluster
Reference (which I have not had time to read yet):
http://spark.apache.org/docs/latest/running-on-yarn.html
First copy the Hadoop configuration files from the cluster to the local machine. Note that the local hosts file must contain entries for every Hadoop node.
# this can also be set in spark-env.sh
export HADOOP_CONF_DIR=/home/yuzx/data/hadoop-etc
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 50 \
--executor-memory 2g \
lib/spark-examples-1.5.2-hadoop2.2.0.jar \
500
# yarn-client mode
export HADOOP_CONF_DIR=/home/yuzx/data/hadoop-etc
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 50 \
--executor-memory 2g \
lib/spark-examples-1.5.2-hadoop2.2.0.jar \
500
How do you view the logs in yarn-cluster mode?
In my environment: http://dn2:8042/node/containerlogs/container_1449649503862_0004_01_000001/yuzx/stdout/?start=-4096
Open the YARN ResourceManager web UI, find the finished application, open it, and in the list of application attempts there is a Logs link on the far right; follow it and locate stdout.
In yarn-client mode the output is printed directly to the local console.
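If YARN log aggregation is enabled (yarn.log-aggregation-enable), the container logs can also be fetched from the command line once the application has finished, instead of clicking through the web UI. A sketch, using the application id corresponding to the container shown above:
# The application id is printed by spark-submit and listed in the ResourceManager UI
yarn logs -applicationId application_1449649503862_0004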
Walking through the execution
Reference:
http://spark.apache.org/docs/latest/cluster-overview.html
- org.apache.spark.examples.SparkPi is a Spark application: much like a small Java program, it has a main program
- Its main program is the driver program; it contains a SparkContext, and the SparkContext coordinates the execution of the Spark application
- The SparkContext can connect to several kinds of cluster managers, namely:
- Spark's own standalone cluster manager
- Apache Mesos
- Hadoop YARN
- Once connected to a cluster manager, it acquires executors on the nodes; judging from the diagram these are processes on the worker nodes (watching a SparkPi run with jps, they appear as CoarseGrainedExecutorBackend processes that start after the Spark application starts and exit when it finishes; see the sketch after this list)
- The executors run the computation and store the application's data
- Next, the SparkContext sends the application code to the executors (for the SparkPi application, that is spark-examples-1.5.2-hadoop2.2.0.jar)
- Finally, the SparkContext sends tasks to the executors, which run them (heh, those wicked tasks)
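To watch the executor lifecycle described above, keep an eye on the Java processes of a worker node while the SparkPi job is running; a rough sketch, assuming passwordless ssh to dn1:
# CoarseGrainedExecutorBackend processes appear shortly after the application
# starts and exit again once it finishes
watch -n 2 'ssh dn1 jps'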
Spark terminology
- Application: a user program built to run on Spark (you could also call it a Spark app); at runtime it consists of one driver program and a set of executor processes
- Application jar: the user program packaged as a jar, like an executable jar. If it has dependencies, bundle everything into a single assembly jar (a.k.a. an uber jar, a jar that contains all of its dependencies). Note that the uber jar must not include the Hadoop or Spark jars themselves; those are supplied by the framework at runtime, so declare them as runtime/provided dependencies.
- Driver program: the process that runs your application's main and creates the SparkContext; with deploy-mode=client, this driver runs locally on the client
- Cluster manager: an external service that allocates resources on the cluster (Standalone, Mesos, YARN)
- Deploy mode: determines where the driver runs; with cluster, the framework launches the driver inside the cluster, with client it usually runs locally on the client machine
- Worker node: a node that does the actual work, like a programmer, while the driver is the manager in charge of monitoring and coordination
- Executor: a process on a worker node that does the work; just as a programmer can take on jobs from different clients, an executor serves exactly one application. Each Spark application has its own set of executors, usually one or more per worker node, at the process level
- Task: a running Spark application is broken down into many tasks, each of which is sent to one of the executors for execution
- Job: a parallel computation made up of multiple tasks
- Stage: each job is split into smaller sets of tasks that depend on one another, similar to the map and reduce stages in MapReduce
Translating all of this is pretty tiring ~~