Regarding Spark parameters (executors, memory)

This post looks at how to sensibly configure Spark's --num-executors, --executor-memory and --executor-cores parameters, analysing best practice for cluster resource allocation through three different strategies (tiny executors, fat executors, and a balanced approach).

Ever wondered how to configure the --num-executors, --executor-memory and --executor-cores Spark config params for your cluster?

Let’s find out how…

  1. Lil bit theory: Let’s see some key recommendations that will help us understand it better
  2. Hands on: Next, we’ll take an example cluster and come up with recommended numbers for these Spark params

Lil bit theory:

Following list captures some recommendations to keep in mind while configuring them:

  • Hadoop/YARN/OS Daemons: When we run a Spark application using a cluster manager like YARN, there’ll be several daemons running in the background like NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. So, while specifying num-executors, we need to make sure that we leave aside enough cores (~1 core per node) for these daemons to run smoothly.
  • YARN ApplicationMaster (AM): The ApplicationMaster is responsible for negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the containers and their resource consumption. If we are running Spark on YARN, then we need to budget in the resources that the AM would need (~1024MB and 1 executor).
  • HDFS Throughput: The HDFS client has trouble with tons of concurrent threads. It was observed that HDFS achieves full write throughput with ~5 tasks per executor, so it’s good to keep the number of cores per executor below that number.
  • MemoryOverhead: The following picture depicts Spark-on-YARN memory usage per executor (figure: executor memory plus YARN memory overhead). Two things to make note of from this picture:
    Full memory requested from YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead
    spark.yarn.executor.memoryOverhead = max(384MB, 7% of spark.executor.memory)

So, if we request 20GB per executor, YARN will actually allocate 20GB + memoryOverhead = 20GB + 7% of 20GB ≈ 21.4GB of memory for us (see the sketch after this list).

  • Running executors with too much memory often results in excessive garbage collection delays.
  • Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.
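
To double-check that overhead arithmetic, here is a minimal Scala sketch (my own helper, not a Spark API) applying the same max(384MB, 7%) rule quoted above:

```scala
// Minimal sketch: memory YARN is asked for per executor, given the defaults
// quoted above (overhead = max(384MB, 7% of executor memory)).
def yarnMemoryPerExecutorMB(executorMemoryMB: Long): Long = {
  val overheadMB = math.max(384L, (0.07 * executorMemoryMB).toLong)
  executorMemoryMB + overheadMB
}

// Requesting 20GB (20480MB) per executor:
// 20480 + 1433 = 21913MB, i.e. roughly 21.4GB per container.
println(yarnMemoryPerExecutorMB(20 * 1024))
```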

Enough theory… Let’s go hands-on…

Now, let’s consider a 10-node cluster with the following config and analyse different possibilities of executor-core-memory distribution:

Cluster Config:
10 Nodes
16 cores per Node
64GB RAM per Node

First Approach: Tiny executors [One Executor per core]:

Tiny executors essentially means one executor per core. The following shows the values of our spark-config params with this approach:

- `--num-executors` = `In this approach, we'll assign one executor per core`
                    = `total-cores-in-cluster`
                    = `num-cores-per-node * total-nodes-in-cluster` 
                    = 16 x 10 = 160
- `--executor-cores` = 1 (one executor per core)
- `--executor-memory` = `amount of memory per executor`
                      = `mem-per-node/num-executors-per-node`
                      = 64GB/16 = 4GB
                

Analysis: With only one executor per core, as we discussed above, we won’t be able to take advantage of running multiple tasks in the same JVM. Also, shared/cached variables like broadcast variables and accumulators will be replicated in each of the 16 executors on every node. Moreover, we are not leaving any headroom for Hadoop/YARN daemon processes and we are not budgeting for the YARN ApplicationMaster. NOT GOOD!

Second Approach: Fat executors (One Executor per node):

Fat executors essentially means one executor per node. The following shows the values of our spark-config params with this approach:

- `--num-executors` = `In this approach, we'll assign one executor per node`
                    = `total-nodes-in-cluster`
                    = 10
- `--executor-cores` = `one executor per node means all the cores of the node are assigned to one executor`
                     = `total-cores-in-a-node`
                     = 16
- `--executor-memory` = `amount of memory per executor`
                      = `mem-per-node/num-executors-per-node`
                      = 64GB/1 = 64GB

Analysis: With all 16 cores per executor, apart from the fact that the ApplicationMaster and daemon processes are not accounted for, HDFS throughput will suffer and it will result in excessive garbage collection delays. Again, NOT GOOD!

Third Approach: Balance between Fat (vs) Tiny

According to the recommendations which we discussed above:

  • Assign 5 cores per executor => --executor-cores = 5 (for good HDFS throughput)
  • Leave 1 core per node for Hadoop/YARN daemons => Num cores available per node = 16 - 1 = 15
  • So, total available cores in the cluster = 15 x 10 = 150
  • Number of available executors = (total cores / num-cores-per-executor) = 150 / 5 = 30
  • Leaving 1 executor for the YARN ApplicationMaster => --num-executors = 29
  • Number of executors per node = 30 / 10 = 3
  • Memory per executor = 64GB / 3 = ~21GB
  • Counting off the memory overhead: 7% of 21GB ≈ 1.5GB, which leaves about 19GB of heap per executor; rounding down a little for extra headroom gives --executor-memory = 18GB

So, recommended config is: 29 executors, 18GB memory each and 5 cores each!!

Analysis: It is obvious how this third approach finds the right balance between the fat and tiny approaches. Needless to say, it achieves the parallelism of a fat executor and the best throughput of a tiny executor!!
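
To make the recipe reusable, here is a rough Scala sketch (my own helper, not part of Spark) that encodes the same heuristics: reserve one core per node for daemons, target ~5 cores per executor, reserve one executor for the ApplicationMaster, and leave roughly 7% of the per-executor memory for overhead:

```scala
// Rough sizing sketch of the balanced approach above (assumptions: 1 core per
// node reserved for OS/Hadoop daemons, 1 executor reserved for the YARN AM,
// ~7% of executor memory left for memoryOverhead).
def suggestExecutorLayout(nodes: Int, coresPerNode: Int, memPerNodeGB: Int,
                          coresPerExecutor: Int = 5): (Int, Int, Int) = {
  val usableCoresPerNode = coresPerNode - 1                    // 1 core for daemons
  val totalExecutors     = (usableCoresPerNode * nodes) / coresPerExecutor
  val numExecutors       = totalExecutors - 1                  // 1 executor for the AM
  val executorsPerNode   = totalExecutors / nodes
  val memPerExecutorGB   = memPerNodeGB / executorsPerNode     // e.g. 64 / 3 = 21
  val executorMemoryGB   = (memPerExecutorGB * 0.93).toInt     // keep ~7% for overhead
  (numExecutors, coresPerExecutor, executorMemoryGB)
}

// For the 10-node, 16-core, 64GB cluster above this prints (29,5,19):
// 29 executors with 5 cores each and ~19GB of heap (18GB is a safe round-down).
println(suggestExecutorLayout(nodes = 10, coresPerNode = 16, memPerNodeGB = 64))
```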

Parameters’ Upper Limit

We all want to maximize the program’s performance, but the upper limits of these parameters are just as important (a small sanity-check sketch follows the list):

1. --executor-memory has to stay below the memory capacity of a node (including the memory overhead), otherwise YARN won’t be able to allocate a big enough container for each executor.
2. --executor-cores has to stay below the core count of a node.
3. --num-executors has no hard limit, but don’t request a huge amount: not all executors need to run at the same time, so the extra executors simply wait until enough resources free up.
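
Here is the small sanity-check sketch mentioned above (a hypothetical helper, assuming the same max(384MB, 7%) overhead rule as before):

```scala
// Sanity check for the upper limits: executor memory plus overhead must fit in
// one node's RAM, and executor cores must not exceed one node's core count.
def fitsOnNode(executorMemGB: Double, executorCores: Int,
               nodeMemGB: Double, nodeCores: Int): Boolean = {
  val overheadGB = math.max(0.384, 0.07 * executorMemGB)
  (executorMemGB + overheadGB) <= nodeMemGB && executorCores <= nodeCores
}

println(fitsOnNode(18, 5, nodeMemGB = 64, nodeCores = 16))   // true  -- the balanced config
println(fitsOnNode(64, 16, nodeMemGB = 64, nodeCores = 16))  // false -- no room for overhead
```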

Conclusion:

We’ve seen:

  • A couple of recommendations to keep in mind while configuring these params for a Spark application, like:

    Budget in the resources that YARN’s ApplicationMaster would need
    How we should spare some cores for Hadoop/YARN/OS daemon processes
    What Spark-on-YARN memory usage looks like (executor memory + memory overhead)

  • Also, checked out and analysed three different approaches to configure these params:

    1. Tiny Executors - One Executor per Core
    2. Fat Executors - One executor per Node
    3. Recommended approach - the right balance between Tiny (vs) Fat, coupled with the recommendations above.

--num-executors, --executor-cores and --executor-memory… these three params play a very important role in Spark performance as they control the amount of CPU & memory your Spark application gets. This makes it very crucial for users to understand the right way to configure them. Hope this blog helped you in getting that perspective…
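
If you prefer to set these values in code rather than on the spark-submit command line, the same knobs are exposed as the configs spark.executor.instances, spark.executor.cores and spark.executor.memory. Here is a minimal Scala sketch with the recommended numbers from above (the app name and master are made-up placeholders; paste into spark-shell or wrap in an object with a main method):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the recommended balanced config expressed programmatically.
// spark.executor.instances / cores / memory correspond to --num-executors /
// --executor-cores / --executor-memory; appName and master are placeholders.
val spark = SparkSession.builder()
  .appName("executor-sizing-demo")            // hypothetical application name
  .master("yarn")
  .config("spark.executor.instances", "29")   // 29 executors
  .config("spark.executor.cores", "5")        // 5 cores per executor
  .config("spark.executor.memory", "18g")     // 18GB heap per executor
  .getOrCreate()
```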
