Spark Source Code Walkthrough (1): Job Submission

This article walks through the whole Spark job-submission flow: starting from the spark-class script, it shows how the java command that runs the SparkSubmit class is assembled, then enters SparkSubmit's doSubmit method, explains how prepareSubmitEnvironment parses the arguments, how the user code is wrapped as a SparkApplication (converted to a JavaMainApplication for a plain main class), and finally how the main method is invoked via reflection to complete the submission.



1: Submitting the Code

The steps to submit a Spark job are:

	1): Package the Scala/Java code written in the IDE into a jar (a minimal example of such an application is sketched right after this list).

	2): Upload the jar to the server.

	3): Run a spark-submit command such as " ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn ./examples/jars/spark-examples_2.12-3.0.0.jar ", which hands our code to the Spark cluster through a shell script.
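
For reference, the code that ends up in that jar is just an ordinary Scala program with a main method. Below is a minimal, hypothetical example (package, class, and path names are made up for illustration; any Spark application with a main method works the same way):

    // Hypothetical example application -- not taken from this article's source walkthrough.
    package com.example

    import org.apache.spark.sql.SparkSession

    object WordCount {
        def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
                .appName("WordCount")
                .getOrCreate()

            // args(0) is whatever we pass after the jar on the spark-submit command line
            val counts = spark.sparkContext
                .textFile(args(0))
                .flatMap(_.split("\\s+"))
                .map(word => (word, 1))
                .reduceByKey(_ + _)

            counts.take(10).foreach(println)
            spark.stop()
        }
    }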

So let's start from the scripts. Here is the content of spark-submit:
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

As you can see, this shell script simply delegates to the spark-class script, passing in org.apache.spark.deploy.SparkSubmit as the class to run, followed by all of our original arguments.

The spark-class script:

#!/usr/bin/env bash

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi
# Load the Spark environment by sourcing load-spark-env.sh
. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  # Use the java binary under JAVA_HOME to build the launch command
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
build_command() {
   
  "$RUNNER" -Xmx128m $SPARK_LAUNCHER_OPTS -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
DELIM=$'\n'
CMD_START_FLAG="false"
while IFS= read -d "$DELIM" -r ARG; do
  if [ "$CMD_START_FLAG" == "true" ]; then
    CMD+=("$ARG")
  else
    if [ "$ARG" == $'\0' ]; then
      # After NULL character is consumed, change the delimiter and consume command string.
      DELIM=''
      CMD_START_FLAG="true"
    elif [ "$ARG" != "" ]; then
      echo "$ARG"
    fi
  fi
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi
if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

2: The SparkSubmit Class

From the two scripts above we can see that submitting a job ultimately means assembling a java command and running the class org.apache.spark.deploy.SparkSubmit with our original arguments.

Let's look at its main method, the entry point:

    override def main(args: Array[String]): Unit = {

        val submit = new SparkSubmit() {
            self =>

            override protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
                new SparkSubmitArguments(args) {
                    override protected def logInfo(msg: => String): Unit = self.logInfo(msg)

                    override protected def logWarning(msg: => String): Unit = self.logWarning(msg)
                }
            }

            override protected def logInfo(msg: => String): Unit = printMessage(msg)

            override protected def logWarning(msg: => String): Unit = printMessage(s"Warning: $msg")

            override def doSubmit(args: Array[String]): Unit = {
                try {
                    /**
                     * Still just delegates to the parent class's doSubmit
                     */
                    super.doSubmit(args)
                } catch {
                    case e: SparkUserAppException => exitFn(e.exitCode)
                }
            }
        }

        /**
         * Call doSubmit
         */
        submit.doSubmit(args)
    }

main in turn calls SparkSubmit's doSubmit method:

    def doSubmit(args: Array[String]): Unit = {
        // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
        // be reset before the application starts.
        val uninitLog = initializeLogIfNecessary(true, silent = true)

        val appArgs = parseArguments(args)
        if (appArgs.verbose) {
            logInfo(appArgs.toString)
        }

        /**
         * The action parsed from your spark-submit command line decides
         * which of the following methods actually runs
         */
        appArgs.action match {
            // Submit the application
            case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
            // Kill a running submission
            case SparkSubmitAction.KILL => kill(appArgs)
            // Request the status of a submission
            case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
            // Print the version number
            case SparkSubmitAction.PRINT_VERSION => printVersion()
        }
    }
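
We never type the action explicitly; SparkSubmitArguments derives it while parsing the flags. The snippet below is a simplified, hypothetical sketch of that mapping only (the real parsing lives in SparkSubmitArguments and is more involved):

    // Simplified, hypothetical sketch of how the action follows from the flags.
    object SparkSubmitActionSketch extends Enumeration {
        val SUBMIT, KILL, REQUEST_STATUS, PRINT_VERSION = Value

        def derive(args: Seq[String]): Value =
            if (args.contains("--kill")) KILL                   // spark-submit --kill <submissionId>
            else if (args.contains("--status")) REQUEST_STATUS  // spark-submit --status <submissionId>
            else if (args.contains("--version")) PRINT_VERSION  // spark-submit --version
            else SUBMIT                                         // the normal "run my jar" case
    }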

Since we are following the submission flow, we step into submit(appArgs, uninitLog):

    /**
     * Submit the application using the provided parameters, ensuring to first wrap
     * in a doAs when --proxy-user is specified.
     */
    @tailrec private def submit(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {

        def doRunMain(): Unit = {
            if (args.proxyUser != null) {
                val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser, UserGroupInformation.getCurrentUser())
                try {
                    proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
                        override def run(): Unit = {
                            runMain(args, uninitLog)
                        }
                    })
                } catch {
                    case e: Exception =>
                        // Hadoop's AuthorizationException suppresses the exception's stack trace, which
                        // makes the message printed to the output by the JVM not very helpful. Instead,
                        // detect exceptions with empty stack traces here, and treat them differently.
                        if (e.getStackTrace().length == 0) {
                            error(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
                        } else {
                            throw e
                        }
                }
            } else {
                runMain(args, uninitLog)
            }
        }

        // In standalone cluster mode, there are two submission gateways:
        //   (1) The traditional RPC gateway using o.a.s.deploy.Client as a wrapper
        //   (2) The new REST-based gateway introduced in Spark 1.3
        // The latter is the default behavior as of Spark 1.3, but Spark submit will fail over
        // to use the legacy gateway if the master endpoint turns out to be not a REST server.
        if (args.isStandaloneCluster && args.useRest) {
            try {
                logInfo("Running Spark using the REST application submission protocol.")
                doRunMain()
            } catch {
                // Fail over to use the legacy submission gateway
                case e: SubmitRestConnectionException =>
                    logWarning(s"Master endpoint ${args.master} was not a REST server. " +
                        "Falling back to legacy submission gateway instead.")
                    args.useRest = false
                    submit(args, false)
            }
        // In all other modes, just run the main class as prepared
        } else {
            doRunMain()
        }
    }
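
The article's summary points at what runMain does from here: prepareSubmitEnvironment works out the child arguments, classpath, and main class; the main class is wrapped in a SparkApplication (a JavaMainApplication when it is a plain main class); and the user's main is finally invoked via reflection. As a simplified sketch of that last step only (not a verbatim quote of the Spark source):

    // Simplified sketch of the JavaMainApplication idea: wrap a plain main class
    // and call its static main(String[]) reflectively. Not the exact Spark code.
    import java.lang.reflect.Modifier

    class JavaMainApplicationSketch(klass: Class[_]) {
        def start(args: Array[String]): Unit = {
            val mainMethod = klass.getMethod("main", classOf[Array[String]])
            require(Modifier.isStatic(mainMethod.getModifiers),
                "The main method in the given main class must be static")
            // A static method has no receiver, so the first argument to invoke is null;
            // the String[] is passed through as the single parameter of main.
            mainMethod.invoke(null, args)
        }
    }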
           