how-to-configure-and-use-spark-history-server

This article describes how to configure and use the Spark History Server, including enabling Spark event logging, configuring Spark environment variables, starting the History Server, and viewing the history of finished applications through the Web UI.




References

Spark Configuration

Spark Monitoring: Viewing After the Fact


Basics

  • how to configure Spark?
    Locations where Spark can be configured:

    • Spark properties, using val sparkConf=new SparkConf().set… in Spark application code (see the sketch after this list)

      • Dynamically Loading Spark Properties using spark-submit options
      ./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
      • Viewing Spark Properties using spark app WebUI http://<driver>:4040 in the “Environment” tab
    • environment variables, using conf/spark-env.sh
    • logging, using conf/log4j.properties
  • Available properties related to the history server

| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.eventLog.compress | false | Whether to compress logged events, if spark.eventLog.enabled is true. |
| spark.eventLog.dir | file:///tmp/spark-events | Base directory in which Spark events are logged, if spark.eventLog.enabled is true. Within this base directory, Spark creates a sub-directory for each application, and logs the events specific to the application in this directory. Users may want to set this to a unified location like an HDFS directory so history files can be read by the history server. |
| spark.eventLog.enabled | false | Whether to log Spark events, useful for reconstructing the Web UI after the application has finished. |

  • Environment variables related to the history server (e.g. SPARK_HISTORY_OPTS), set in conf/spark-env.sh
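
To make the "Spark properties in application code" option above concrete, here is a minimal sketch (the object name is illustrative, not from the original article). Properties hard-coded on SparkConf take precedence over spark-submit --conf flags and spark-defaults.conf; anything left unset can still be supplied at submit time, and the effective values can be inspected through sc.getConf, which corresponds to the Spark Properties section of the "Environment" tab at http://<driver>:4040.

import org.apache.spark.{SparkConf, SparkContext}

object ConfDemo {
  def main(args: Array[String]): Unit = {
    // Properties set here win over spark-submit --conf and spark-defaults.conf;
    // spark.master and the eventLog settings are left to be supplied externally.
    val conf = new SparkConf().setAppName("ConfDemo") // illustrative app name

    val sc = new SparkContext(conf)

    // Print the effective Spark properties for this application; the same values
    // appear in the "Environment" tab of the application Web UI.
    sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }

    sc.stop()
  }
}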


enable spark eventlog

  • method 1:

    bin/spark-submit --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=file:/data01/data_tmp/spark-events ...
  • method 2:
    vi conf/spark-env.sh

    SPARK_DAEMON_JAVA_OPTS=
    SPARK_MASTER_OPTS=
    SPARK_WORKER_OPTS=
    SPARK_JAVA_OPTS="-Dspark.eventLog.enabled=true -Dspark.eventLog.dir=file:/data01/data_tmp/spark-events"

    NOTE:

    • a. spark.eventLog.dir=file:/data01/data_tmp/spark-events

      • The event log is written by the driver. This is usually not a problem in local/standalone/yarn-client mode, but it needs special attention in cluster deploy mode; an HDFS path is the safest choice.
      • The directory specified by spark.eventLog.dir is not created automatically; it must be created manually with the appropriate permissions.
    • b. These environment variables are picked up by bin/spark-class.

test spark app after enabling eventlog

test case:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount extends App {

  // spark.master and, in this test, the eventLog settings are supplied externally
  // (via spark-submit or -D VM options) rather than hard-coded here
  val sparkConf = new SparkConf().setAppName("WordCount")
  val sc = new SparkContext(sparkConf)

  val lines = sc.textFile("file:/data01/data/datadir_github/spark/README.md")
  val words = lines.flatMap(_.split("\\s+"))

  val wordsCount = words.map(word => (word, 1)).reduceByKey(_ + _)

  wordsCount.foreach(println)

  sc.stop()
}
  • Problem 1: running Spark in local mode from IDEA did not generate an event log; testing the bundled examples.SparkPi gave the same result.

    • Attempt 1: test from the CLI

      bin/run-example SparkPi

      Result: it failed with an error:

      Exception in thread "main" java.lang.IllegalArgumentException: Log directory file:/data01/data_tmp/spark-events does not exist.

    • Attempt 2:

      mkdir -p /data01/data_tmp/spark-events
      bin/run-example SparkPi

      Result: confirmed that an event log was generated under /data01/data_tmp/spark-events.

      Running Spark in local mode from IDEA, however, still did not generate an event log.
      Cause: although the SPARK_CONF_DIR path was added to the project's dependencies, running inside IDEA does not read and parse conf/spark-env.sh the way submitting the app with bin/spark-submit on the CLI does.

    • Attempt 3:
      Set VM options in the IDEA run configuration: -Dspark.master="local[2]" -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=file:/data01/data_tmp/spark-events (the same properties can also be set directly on SparkConf; see the sketch after this list)
      Result: local mode now works as expected.

  • Problem 2: testing spark-on-yarn from IDEA fails (not yet resolved)
    Setting VM options -Dspark.master="yarn-client" -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=file:/data01/data_tmp/spark-events in the IDEA run configuration results in an error.
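
As an alternative to setting VM options in the IDEA run configuration, the same properties can be set directly on SparkConf before the SparkContext is created. A minimal sketch for local IDE runs, assuming the same event-log directory used throughout this article (the object name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object LocalRunWithEventLog {
  def main(args: Array[String]): Unit = {
    // Equivalent to -Dspark.master="local[2]" -Dspark.eventLog.enabled=true
    // -Dspark.eventLog.dir=... in the run configuration; values set on SparkConf
    // take the highest precedence.
    val conf = new SparkConf()
      .setAppName("LocalRunWithEventLog")
      .setMaster("local[2]")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "file:/data01/data_tmp/spark-events") // must already exist

    val sc = new SparkContext(conf)
    try {
      // run a trivial job so the application leaves an event log behind
      println(sc.parallelize(1 to 100).sum())
    } finally {
      sc.stop()
    }
  }
}

Note that this only covers local runs; it does not address Problem 2 (yarn-client from IDEA), which remains unresolved above.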


configure, start and use spark history server

  • configure
    vi conf/spark-env.sh

    SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:/data01/data_tmp/spark-events" # set when using the spark history server

    NOTE

    • spark.history.fs.logDirectory and spark.eventLog.dir may point to different locations, which means event log files can be moved elsewhere and still be served by the history server, e.g. to help diagnose a job on another machine
    • spark.history.ui.port sets the port the history server Web UI listens on (18080 by default)
  • start

    sbin/start-history-server.sh

access the Spark history WebUI at http://<server-url>:18080

[Figure: Spark history server WebUI, applications list]

[Figure: Spark history server WebUI, a specific application]
