Spark Dynamic Allocation Analysis

javartisan

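Starting with Spark 1.5, the Dynamic Allocation mechanism is available for Mesos coarse-grained mode and standalone mode. It raises cluster resource utilization by adding and removing executors on demand. This article covers how dynamic allocation works, how to configure it, and how it behaves in use.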

Original source: http://blog.youkuaiyun.com/lsshlsw/article/details/49888773

Since Spark 1.5, Dynamic Allocation is provided for Mesos coarse-grained mode and standalone mode. It improves resource utilization by removing idle executors.

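1. Dynamic resource allocation

Spark 1.5 adds dynamic executor management for standalone mode and Mesos coarse-grained mode. In practice, an executor that stays idle for a set period is removed.

Requesting executors dynamically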

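If tasks are pending and have waited longer than spark.dynamicAllocation.schedulerBacklogTimeout (default 1s), Spark starts executors in rounds, requesting 1, 2, 4, 8, ... executors per round (as long as resources are available). The interval between rounds is controlled by spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (default: same as schedulerBacklogTimeout).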
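The doubling ramp-up described above can be sketched as a small simulation. This is a simplified model for illustration only, not Spark's scheduler code: the function name ramp_up_schedule and its parameters are hypothetical, and it assumes the backlog persists until every needed executor has been requested.

```python
from typing import List, Tuple


def ramp_up_schedule(
    executors_needed: int,
    backlog_timeout_s: float = 1.0,
    sustained_timeout_s: float = 1.0,
) -> List[Tuple[float, int]]:
    """Return (time_in_seconds, executors_requested_in_this_round) pairs.

    Simplified model: the first round fires after backlog_timeout_s, later
    rounds fire every sustained_timeout_s, and the batch size doubles each
    round until the remaining demand is covered.
    """
    schedule: List[Tuple[float, int]] = []
    remaining = executors_needed
    batch = 1
    now = backlog_timeout_s
    while remaining > 0:
        requested = min(batch, remaining)  # never ask for more than needed
        schedule.append((now, requested))
        remaining -= requested
        batch *= 2                         # 1, 2, 4, 8, ...
        now += sustained_timeout_s
    return schedule


if __name__ == "__main__":
    for t, n in ramp_up_schedule(10):
        print(f"t={t:.0f}s request {n} executor(s)")
```

With the default timeouts, a demand of 10 executors is covered in four rounds, one second apart; the last round is smaller because only the remaining demand is requested.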

Removing executors dynamically

An executor that has been idle longer than spark.dynamicAllocation.executorIdleTimeout (default 60s) is removed, unless it holds cached data.
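The removal rule can be illustrated with a short sketch. It is an assumption-laden model rather than Spark's implementation: ExecutorState and executors_to_remove are hypothetical names, and the handling of the minimum executor count (keep at least min_executors alive, remove the longest-idle ones first) is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecutorState:
    executor_id: str
    idle_seconds: float
    has_cached_data: bool = False


def executors_to_remove(
    executors: List[ExecutorState],
    idle_timeout_s: float = 60.0,
    cached_idle_timeout_s: Optional[float] = None,  # None models "infinity"
    min_executors: int = 0,
) -> List[str]:
    """Return ids of executors whose idle time exceeds their timeout."""
    candidates = []
    for ex in executors:
        timeout = cached_idle_timeout_s if ex.has_cached_data else idle_timeout_s
        if timeout is not None and ex.idle_seconds > timeout:
            candidates.append(ex)
    # Remove the longest-idle executors first, but keep min_executors alive.
    candidates.sort(key=lambda ex: ex.idle_seconds, reverse=True)
    max_removable = max(0, len(executors) - min_executors)
    return [ex.executor_id for ex in candidates[:max_removable]]


if __name__ == "__main__":
    state = [
        ExecutorState("0", idle_seconds=75),
        ExecutorState("1", idle_seconds=10),
        ExecutorState("2", idle_seconds=300, has_cached_data=True),
    ]
    print(executors_to_remove(state))
```

With the defaults, the executor holding cached data is never selected, which mirrors the "unless it holds cached data" exception.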

2. Configuration

Set the following in conf/spark-defaults.conf:

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true

Start the shuffle service (on every worker node):

sbin/start-shuffle-service.sh

Start the worker:

sbin/start-slave.sh -h hostname sparkURL

If the shuffle service is not running on a node, tasks scheduled on that node fail with:

ExecutorLostFailure

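Related configuration

Property	Default	Description
spark.dynamicAllocation.executorIdleTimeout	60s	An executor that has been idle for this long is removed.
spark.dynamicAllocation.cachedExecutorIdleTimeout	infinity	Idle timeout for executors holding cached data; by default they are never removed.
spark.dynamicAllocation.maxExecutors	infinity	Upper bound on the number of executors; by default, as many as the application requests.
spark.dynamicAllocation.minExecutors	0	Lower bound on the number of executors kept alive.
spark.dynamicAllocation.schedulerBacklogTimeout	1s	Once tasks have been pending longer than this, Spark starts requesting executors.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout	schedulerBacklogTimeout	Interval between subsequent executor requests while the backlog persists.
spark.dynamicAllocation.initialExecutors	spark.dynamicAllocation.minExecutors	Initial number of executors to run when dynamic allocation is enabled.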
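As a rough sanity check for the settings above, the following sketch parses spark-defaults.conf style text and flags two common mistakes. The helper names parse_spark_conf and check_dynamic_allocation are hypothetical, the parser assumes whitespace-separated "key value" lines only, and the rule that the shuffle service must be enabled reflects the setup described in this article.

```python
from typing import Dict, List


def parse_spark_conf(text: str) -> Dict[str, str]:
    """Parse whitespace-separated 'key value' lines, skipping blanks and # comments."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def check_dynamic_allocation(conf: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: List[str] = []
    if conf.get("spark.dynamicAllocation.enabled", "false").lower() != "true":
        return problems  # dynamic allocation is off, nothing to check
    if conf.get("spark.shuffle.service.enabled", "false").lower() != "true":
        problems.append("spark.shuffle.service.enabled must be true")
    min_execs = int(conf.get("spark.dynamicAllocation.minExecutors", "0"))
    max_raw = conf.get("spark.dynamicAllocation.maxExecutors")
    if max_raw is not None and int(max_raw) < min_execs:
        problems.append("maxExecutors is smaller than minExecutors")
    return problems


if __name__ == "__main__":
    sample = """
    spark.dynamicAllocation.enabled true
    spark.shuffle.service.enabled true
    """
    print(check_dynamic_allocation(parse_spark_conf(sample)))
```

An empty list means the checks passed; the function does not verify that the shuffle service process is actually running on each worker.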

3. Usage

Start a spark-shell with 5 executors, each using 2 cores:

bin/spark-shell --total-executor-cores 10 --executor-cores 2
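The executor count follows from the two flags in the command above. A minimal sketch of the arithmetic, assuming the workers have enough free cores to satisfy the request (executor_count is a hypothetical helper, not a Spark API):

```python
def executor_count(total_executor_cores: int, executor_cores: int) -> int:
    """Executors launched when each executor takes executor_cores cores."""
    if executor_cores <= 0:
        raise ValueError("executor_cores must be positive")
    return total_executor_cores // executor_cores


if __name__ == "__main__":
    # --total-executor-cores 10 --executor-cores 2
    print(executor_count(10, 2))
```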

If the shell stays idle for 60s, the terminal shows messages like the following:

scala> 15/11/17 15:40:47 ERROR TaskSchedulerImpl: Lost executor 0 on spark047213: remote Rpc client disassociated
15/11/17 15:40:47 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark047213:50015] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
15/11/17 15:40:50 ERROR TaskSchedulerImpl: Lost executor 1 on spark047213: remote Rpc client disassociated
15/11/17 15:40:50 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark047213:49847] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
...