Spark Study Notes (2): Spark Job Submission Flow and Task Generation

Article link: https://blog.youkuaiyun.com/u010737756/article/details/118414408

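This article walks through the parameters used when submitting a Spark job, such as `--conf` and the `--driver` options, and explains how Spark submits a job on YARN in client and cluster mode, including the roles of the ResourceManager, ApplicationMaster, and Executors and how they interact. It also outlines how Spark tasks are generated, covering the full life cycle from SparkSubmit to an Executor receiving its Tasks.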


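The Spark job submission flow

1. Parameters in the submit script
2. Spark job submission flow (on YARN)
- 2.1 yarn-client (the Spark driver runs on the machine that submits the job)
- 2.2 yarn-cluster (the Spark driver runs on YARN)
3. Spark task generation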

Because the submission flow is fairly abstract, I start with the common parameters in a submit script and then walk through the flow itself.

1. Parameters in the submit script

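My project uses Spark 2.0.2. The script below is the one actually used in the project to submit a Spark Streaming job, with job-specific details removed. This article starts from the parameters configured in that script.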

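# --master           run mode (here: YARN)
# --name             name shown in the YARN UI; defaults to the --class value if unset
# --driver-memory    memory for the driver
# --executor-memory  memory for each executor
# --executor-cores   number of cores per executor
# --jars             third-party jars (e.g. HBase, Flume, Apache utility jars), comma-separated with no spaces
# --conf             a Spark configuration property; PROP=VALUE is a placeholder
# --files            config files shipped to the cluster; readable via new FileInputStream("conf.properties")
# --class            main class
# last argument      the project's own jar
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --name spark-test \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    --jars xxx.jar,xxx.jar,xxx.jar \
    --conf PROP=VALUE \
    --files /conf.properties \
    --class WordCount \
    /WordCount-20210702-1.0.0.jar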
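
The same command can also be assembled programmatically, which avoids the two mistakes that are easy to make by hand: spaces inside the --jars list and a missing line continuation. The following is a minimal sketch, not part of the original project. The helper name build_spark_submit_args and the jar names a.jar and b.jar are hypothetical; the flag names come from the script above.

```python
import shlex


def build_spark_submit_args(app_jar, main_class, jars=(), conf=None, files=(), **options):
    """Return an argv list for spark-submit; keyword names mirror the CLI flags."""
    args = ["spark-submit"]
    for key, value in options.items():
        # driver_memory="1g" becomes "--driver-memory", "1g"
        args += ["--" + key.replace("_", "-"), str(value)]
    if jars:
        # --jars takes ONE comma-separated value with no spaces
        args += ["--jars", ",".join(jars)]
    for prop, value in (conf or {}).items():
        # each property gets its own --conf PROP=VALUE pair
        args += ["--conf", f"{prop}={value}"]
    if files:
        args += ["--files", ",".join(files)]
    # the application jar comes after all options
    args += ["--class", main_class, app_jar]
    return args


args = build_spark_submit_args(
    "/WordCount-20210702-1.0.0.jar", "WordCount",
    jars=["a.jar", "b.jar"],                      # hypothetical jar names
    conf={"spark.app.name": "WordCount-20210702"},
    files=["/conf.properties"],
    master="yarn", deploy_mode="cluster", name="spark-test",
    driver_memory="1g", executor_memory="1g", executor_cores=1,
)
# Print a copy-pasteable command line.
print(" ".join(shlex.quote(a) for a in args))
```

A list built this way can be handed to subprocess.run directly, so no shell quoting is involved.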

1.1 Meaning of common parameters

Run spark-submit --help to see the built-in descriptions:

$ bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output
  --version,                  Print the version of current Spark

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

--conf

Sets Spark's built-in configuration properties.
(Note the configuration precedence: SparkConf > spark-submit/spark-shell flags > spark-defaults.conf.)
Example: spark.app.name=WordCount-20210702
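
The precedence rule can be pictured as three dictionaries merged from lowest to highest priority. This is only an illustrative sketch of the ordering stated above, not Spark's own loading code. The function name resolve_conf and the sample values are hypothetical.

```python
def resolve_conf(defaults_file, submit_flags, spark_conf):
    """Merge three config layers; later updates win, so the highest priority goes last."""
    merged = {}
    merged.update(defaults_file)  # lowest: spark-defaults.conf
    merged.update(submit_flags)   # middle: flags passed to spark-submit / spark-shell
    merged.update(spark_conf)     # highest: values set on SparkConf in application code
    return merged


effective = resolve_conf(
    defaults_file={"spark.app.name": "from-defaults", "spark.executor.memory": "2g"},
    submit_flags={"spark.app.name": "WordCount-20210702", "spark.executor.cores": "1"},
    spark_conf={"spark.executor.cores": "2"},
)
print(effective)
```

In this example the application name comes from the submit flags, the core count from SparkConf, and the executor memory falls through to the defaults file because no higher layer sets it.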

--driver

Option	Meaning
--driver-cores	Number of cores for the driver
--driver-memory	Total memory for the driver process

--executor

Option	Meaning
--executor-cores	Number of cores for each executor
--executor-memory	Memory for each executor

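--num-executors

Number of executors to launch (YARN only).
Day-to-day idle configuration in the test environment: 2 cores and 2G in total (1 driver with 1 core/1G + 1 executor with 1 core/1G).
Load-test and production configuration: 5 cores and 5G in total (1 driver with 1 core/1G + 4 executors with 4 cores/4G combined; production input per 10-second batch averages 12,000 records and peaks at 15,000-20,000).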
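
The totals quoted above follow from simple arithmetic: the driver's share plus the number of executors multiplied by each executor's share. The sketch below reproduces the two configurations. The function name requested_resources is hypothetical, and the per-container memory overhead that YARN adds on top of the requested memory is deliberately not modeled.

```python
def requested_resources(num_executors, executor_cores, executor_memory_gb,
                        driver_cores=1, driver_memory_gb=1):
    """Return (total cores, total memory in GB) requested by one application."""
    total_cores = driver_cores + num_executors * executor_cores
    total_memory_gb = driver_memory_gb + num_executors * executor_memory_gb
    return total_cores, total_memory_gb


idle = requested_resources(num_executors=1, executor_cores=1, executor_memory_gb=1)
production = requested_resources(num_executors=4, executor_cores=1, executor_memory_gb=1)
print(idle, production)
```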

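--files

A list of files to place in the application's working directory, i.e. files distributed to every executor node (this project passes the path of its configuration file here).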
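
Because a file shipped with --files lands in the working directory, the application can open it by its bare file name. The sketch below shows the idea with a tiny key=value parser. A temporary directory stands in for the container's working directory, and the property names batch.interval and topic are hypothetical.

```python
import os
import tempfile


def load_properties(path):
    """Parse a simple key=value .properties file, skipping blank lines and # comments."""
    props = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


# A temporary directory simulates the working directory of an executor.
with tempfile.TemporaryDirectory() as workdir:
    conf_path = os.path.join(workdir, "conf.properties")
    with open(conf_path, "w", encoding="utf-8") as fh:
        fh.write("# hypothetical settings\nbatch.interval=10\ntopic = events\n")
    props = load_properties(conf_path)

print(props)
```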

--principal

Kerberos authentication: the principal (user name) that the keytab belongs to.

--keytab

Kerberos authentication: the keytab file to use.

--queue

Specific to Spark on YARN: the YARN queue the job is submitted to.

--master

The master URL, i.e. where the job is submitted, for example yarn or local.

--deploy-mode

Where the driver runs: cluster or client (default: client).

--class

The application's main class; applies to Java and Scala applications only.

--jars

A comma-separated list of local jars (absolute local paths) that are added to the classpath of the driver and the executors. A small number of third-party jars can be passed this way.

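1.2 What the spark-submit script does

spark-submit is the script Spark provides for submitting jobs. Its --master parameter makes it easy to submit a job to the target platform, such as YARN, standalone, or Mesos. The rest of this article focuses on the mode my project uses: Spark on YARN in cluster mode.
At submission time, spark-submit packs most of the cluster's configuration files into __spark_conf__.zip, including core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, hbase-site.xml, and hive-site.xml. It then uploads the archive, together with the project's third-party jars, to Spark's resource staging directory.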
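
To make the packaging step concrete, the sketch below zips the XML files of a configuration directory into a single archive. It only illustrates the idea and is not how spark-submit is implemented. The function name pack_conf_dir is hypothetical, and the configuration files are empty stand-ins created in a temporary directory.

```python
import os
import tempfile
import zipfile


def pack_conf_dir(conf_dir, zip_path):
    """Zip every *.xml file in conf_dir and return the names stored in the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(conf_dir)):
            if name.endswith(".xml"):
                zf.write(os.path.join(conf_dir, name), arcname=name)
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


with tempfile.TemporaryDirectory() as tmp:
    conf_dir = os.path.join(tmp, "conf")
    os.mkdir(conf_dir)
    # Empty stand-ins for real cluster configuration files.
    for name in ("core-site.xml", "hdfs-site.xml", "yarn-site.xml"):
        with open(os.path.join(conf_dir, name), "w", encoding="utf-8") as fh:
            fh.write("<configuration/>\n")
    packed = pack_conf_dir(conf_dir, os.path.join(tmp, "__spark_conf__.zip"))

print(packed)
```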

2. Spark job submission flow (on YARN)

Spark on YARN has two modes: yarn-client and yarn-cluster.

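Before going into the YARN modes in detail, a few terms need defining.
ResourceManager (RM): there is one per cluster; it manages and schedules the cluster's resources centrally. Because there is only one, it is also a single point of failure.
NodeManager: each worker (slave) machine in the cluster runs one.
AM (ApplicationMaster): every application has its own AM. The AM requests resources from the RM (a resource is a Container, which currently means CPU cores and memory), launches Executors on the NodeManagers that host those Containers (further dividing resources among internal tasks), and monitors and tracks the application's progress.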

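This brings in YARN's scheduling model, a two-level scheduling framework:
(1) The RM manages all cluster resources and allocates them to AMs.
(2) Each AM further allocates its resources to Tasks.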
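
The two levels can be simulated in a few lines. In the sketch below, the first level is a resource manager that hands out containers until the cluster runs out of capacity, and the second level is an application master that spreads tasks over the containers it was granted. The class and method names are hypothetical and the round-robin policy is an assumption chosen for simplicity; real YARN schedulers are far more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    free_cores: int
    free_memory_gb: int

    def allocate(self, count, cores, memory_gb):
        """Level 1: grant up to `count` containers while cluster resources last."""
        granted = []
        for _ in range(count):
            if self.free_cores < cores or self.free_memory_gb < memory_gb:
                break
            self.free_cores -= cores
            self.free_memory_gb -= memory_gb
            granted.append({"cores": cores, "memory_gb": memory_gb})
        return granted


@dataclass
class ApplicationMaster:
    containers: list = field(default_factory=list)

    def request(self, rm, count, cores, memory_gb):
        """Ask the RM for containers and keep whatever it grants."""
        self.containers += rm.allocate(count, cores, memory_gb)

    def assign(self, tasks):
        """Level 2: spread tasks round-robin over the containers this AM holds."""
        if not self.containers:
            raise RuntimeError("no containers granted yet")
        plan = {i: [] for i in range(len(self.containers))}
        for n, task in enumerate(tasks):
            plan[n % len(self.containers)].append(task)
        return plan


rm = ResourceManager(free_cores=3, free_memory_gb=8)
am = ApplicationMaster()
am.request(rm, count=4, cores=1, memory_gb=1)  # asks for 4, but only 3 cores are free
plan = am.assign(["t0", "t1", "t2", "t3", "t4"])
print(plan)
```

The application asked for four containers but received three, which shows that the RM decides how much an application gets while the AM only decides how to use it.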


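2.1 yarn-client (the Spark driver runs on the machine that submits the job)

(1) On receiving the request, the ResourceManager picks a NodeManager in the cluster, allocates a Container on it, and starts the ApplicationMaster process in that Container.
(2) The driver process runs in the client and initializes the SparkContext.
(3) Once initialized, the SparkContext communicates with the ApplicationMaster, which requests Containers from the ResourceManager on its behalf; the ApplicationMaster then tells the NodeManagers to start executor processes in the Containers obtained.
(4) The SparkContext assigns Tasks to the executors, and the executors report their running status to the driver.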

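2.2 yarn-cluster (the Spark driver runs on YARN)

(1) On receiving the request, the ResourceManager picks a NodeManager in the cluster, allocates a Container on it, and starts the ApplicationMaster process in that Container.
(2) The SparkContext is initialized inside the ApplicationMaster process.
(3) After obtaining Containers from the ResourceManager, the ApplicationMaster tells the NodeManagers to start executor processes in them.
(4) The SparkContext assigns Tasks to the executors, and the executors report their running status to the ApplicationMaster.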
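
The difference between the two modes comes down to where the driver lives and who the executors report to. The lookup below restates the steps of sections 2.1 and 2.2 as data; the function name describe_deploy_mode and the dictionary keys are hypothetical.

```python
def describe_deploy_mode(deploy_mode):
    """Return where the driver lives and who hears from the executors."""
    modes = {
        "client": {
            "driver_runs_in": "client process on the submitting machine",
            "application_master_role": "requests Containers on the driver's behalf",
            "executors_report_to": "driver",
        },
        "cluster": {
            "driver_runs_in": "ApplicationMaster process in a YARN Container",
            "application_master_role": "hosts the SparkContext and requests Containers",
            "executors_report_to": "ApplicationMaster",
        },
    }
    if deploy_mode not in modes:
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    return modes[deploy_mode]


for mode in ("client", "cluster"):
    print(mode, describe_deploy_mode(mode))
```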


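3. Spark task generation

How does an Executor get its tasks? (The component names below, such as Master and Worker, are those of standalone mode.)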
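1. The SparkSubmit class is invoked; internally it runs submit --> doRunMain --> obtains the application's main class through reflection --> invokes that class's main method.
2. The SparkConf and SparkContext objects are built. The SparkContext entry point does three things: it creates the SparkEnv object (which creates the ActorSystem), the TaskScheduler (which generates tasks and sends them to Executors), and the DAGScheduler (which splits a job into Stages).
3. ClientActor wraps the job information in an ApplicationDescription object and submits it to the Master.
4. On receiving the job information from ClientActor, the Master stores it in memory and then places it in a queue.
5. When the job's turn comes, the Master calls its schedule method to schedule resources.
6. The scheduled resources are wrapped in a LaunchExecutor message and sent to the corresponding Workers.
7. On receiving the scheduling message (LaunchExecutor) from the Master, a Worker wraps the information in an ExecutorRunner object.
8. The Worker then calls the ExecutorRunner's start method, which launches the CoarseGrainedExecutorBackend process.
9. Once started, the Executor registers itself back with DriverActor.
10. After registering successfully with DriverActor, the Executor creates a thread pool (ThreadPool) for running tasks.
11. Once all Executors have registered, the job environment is ready and the driver finishes initializing the SparkContext object.
12. With the driver initialized (the sc instance exists), the rest of the submitted App code runs. When an action operator is invoked on an RDD, a job is triggered and the DAGScheduler is called to split it into Stages.
13. The DAGScheduler splits the job into Stages.
14. Each Stage is turned into tasks, one per partition; the tasks are wrapped in a TaskSet object, which is submitted to the TaskScheduler.
15. The TaskScheduler receives the TaskSet, obtains a serializer, serializes the tasks, wraps the serialized tasks in LaunchTask messages, and hands them to DriverActor.
16. DriverActor sends the LaunchTask messages to the Executors.
17. On receiving a task (LaunchTask) from DriverActor, an Executor wraps it in a TaskRunner and takes a thread from the thread pool to run it.
18. The TaskRunner obtains a deserializer, deserializes the task, and runs the App code, i.e. the operators and user-defined functions applied to an RDD partition.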
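
Steps 14 to 18 can be condensed into a toy pipeline: create one task per partition, serialize each task, and let a thread pool deserialize and run them. The sketch below uses only the Python standard library and is an analogy, not Spark code. All function names, the OPERATIONS registry, and the use of pickle as the serializer are assumptions made for illustration.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

# Operations the simulated executor knows how to run (stands in for App code).
OPERATIONS = {"sum": sum, "count": len}


def make_task_set(stage_id, partitions, op_name):
    """DAGScheduler side: one task per partition, grouped into a task set."""
    return [{"stage": stage_id, "partition": i, "op": op_name, "data": part}
            for i, part in enumerate(partitions)]


def serialize_tasks(task_set):
    """TaskScheduler side: serialize each task before it is sent out."""
    return [pickle.dumps(task) for task in task_set]


def run_task(payload):
    """Executor side: deserialize the task, then apply the operation to its partition."""
    task = pickle.loads(payload)
    return task["partition"], OPERATIONS[task["op"]](task["data"])


def run_stage(partitions, op_name, threads=2):
    """Run one stage end to end and return (partition index, result) pairs."""
    payloads = serialize_tasks(make_task_set(0, partitions, op_name))
    with ThreadPoolExecutor(max_workers=threads) as pool:  # the executor's thread pool
        return sorted(pool.map(run_task, payloads))


results = run_stage([[1, 2, 3], [4, 5], [6]], "sum")
print(results)
```

Three partitions produce three tasks, and each result is tagged with its partition index so the driver side can tell which task produced it.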