Hive on Spark: Do You Really Understand Task Assignment and Executor Requests? (With Source Code)

License: CC 4.0 BY-SA

Tags: #hive #spark #hadoop

Hive on Spark Task Assignment and Executor Request Mechanism Explained

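Overview

This document walks through the Hive source code to explain how Hive on Spark assigns tasks from input splits, both with split combining (combine) enabled and with it disabled, and how executors are requested from YARN.

1. Relationship Between Splits and Tasks

1.1 Basic Concepts

In Hive on Spark, each input split becomes one partition of the input RDD and therefore one task in the map stage (called a Map Task below). The number of splits directly determines the number of Map Tasks.

1.2 How Splits Are Generated

Hive uses an InputFormat to generate splits. Key classes: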

org.apache.hadoop.mapred.InputFormat
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat (when combine is enabled)
org.apache.hadoop.hive.ql.io.HiveInputFormat (when combine is disabled)

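2. Combine Disabled

2.1 Split Generation

With HiveInputFormat, split computation is delegated to the table's underlying InputFormat, so for splittable files:

Each HDFS block corresponds to roughly one split
Number of splits ≈ number of HDFS blocks
HDFS block size: 128MB by default in Hadoop 2.x and later; many clusters configure 256MB

Source location:

// org.apache.hadoop.hive.ql.io.HiveInputFormat
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Simplified: roughly one split is generated per file block
    // number of splits ≈ ceil(file size / block size), summed over all files
}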

2.2 Task Assignment

Task count:

Number of Map Tasks = number of splits ≈ number of HDFS blocks

Example:

Total data: 10GB
HDFS block size: 128MB
Number of splits = 10 * 1024 / 128 = 80
Number of Map Tasks = 80
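
The arithmetic above can be sketched in a few lines of Java. This is only an illustration: BlockSplitEstimator is a hypothetical helper, not a Hive class, and it assumes one split per block with at least one split per file, ignoring details such as the split slop factor of the real InputFormat.

```java
import java.util.List;

// Hypothetical helper, not a Hive class: estimates the number of splits
// when every HDFS block of every file becomes its own split.
class BlockSplitEstimator {
    static final long MB = 1024L * 1024L;

    // Ceiling of fileBytes / blockBytes, with a floor of one split per file
    // (an assumption of this sketch).
    static long splitsForFile(long fileBytes, long blockBytes) {
        long byBlocks = (fileBytes + blockBytes - 1) / blockBytes;
        return Math.max(1, byBlocks);
    }

    // Sums the per-file split counts.
    static long totalSplits(List<Long> fileSizes, long blockBytes) {
        long total = 0;
        for (long size : fileSizes) {
            total += splitsForFile(size, blockBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        // One 10 GB file stored with 128 MB blocks.
        long splits = totalSplits(List.of(10L * 1024 * MB), 128 * MB);
        System.out.println("Map tasks without combine: " + splits);
    }
}
```

For the 10GB example this yields 80, and for 1000 files of 1MB each it yields 1000, since every file needs its own split.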

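2.3 Characteristics

Pros: fine-grained tasks and high parallelism; well suited to large files
Cons: many small files produce a large number of tasks and heavy scheduling overhead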

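3. Combine Enabled

3.1 How CombineHiveInputFormat Works

With CombineHiveInputFormat:

Multiple small splits are merged into one larger split
The number of Map Tasks goes down
Task scheduling overhead is reduced

Key configuration parameters:

# Enable combine (CombineHiveInputFormat is already the default value of hive.input.format in Hive)
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

# Minimum split size (default: 1)
mapreduce.input.fileinputformat.split.minsize=134217728  # 128MB

# Maximum split size (Hadoop FileInputFormat default: Long.MAX_VALUE; HiveConf default: 256000000)
mapreduce.input.fileinputformat.split.maxsize=268435456  # 256MB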

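3.2 Split Merging Logic

Source logic (simplified pseudocode, not the verbatim source):

// org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    long minSize = job.getLong("mapreduce.input.fileinputformat.split.minsize", 1);
    long maxSize = job.getLong("mapreduce.input.fileinputformat.split.maxsize", Long.MAX_VALUE);
    
    // Merging strategy:
    // 1. If a block is smaller than minSize, try to merge several blocks
    // 2. A merged split does not exceed maxSize
    // 3. Blocks on the same node are merged first
}

Merging rules:

Minimum split size: a merged split should reach at least split.minsize
Maximum split size: a merged split does not exceed split.maxsize
Node locality: blocks on the same node are merged first
File boundaries: blocks from different files can share one split (this is what makes combining effective for small files); only inputs that cannot be read together, for example paths with different input formats, are kept apart

CombineHiveInputFormat builds on Hadoop's CombineFileInputFormat. There, the hard cap on a combined split is split.maxsize, while split.minsize.per.node and split.minsize.per.rack decide whether the leftover blocks of a node or rack form a split of their own. The split.minsize-based rules in this article are a simplified model of that behavior.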

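3.3 Task Assignment

Task count (an approximation that treats max(block size, split.minsize) as the target size of a combined split):

Number of Map Tasks ≈ ceil(total data size / max(block size, split.minsize))

Example 1: large files

Total data: 10GB
HDFS block size: 128MB
split.minsize: 128MB
Number of splits = 10 * 1024 / 128 = 80
Number of Map Tasks = 80 (same as with combine disabled)

Example 2: small files

Number of small files: 1000
Size of each file: 1MB
HDFS block size: 128MB
split.minsize: 128MB

Combine disabled:

Number of splits = 1000 (one split per file)
Number of Map Tasks = 1000

Combine enabled:

Number of splits after combining ≈ ceil(1000 * 1MB / 128MB) = 8
Number of Map Tasks = 8 (a large reduction)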
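
To make the small-file case concrete, here is a minimal sketch of greedy packing. It is not the CombineFileInputFormat algorithm: CombinePackingSketch is a hypothetical class that ignores node and rack locality and simply fills each combined split up to an assumed target size (128MB here, matching the example).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Hive code: packs files into combined splits
// in input order, ignoring node and rack locality.
class CombinePackingSketch {
    static final long MB = 1024L * 1024L;

    // Returns the byte size of every combined split. A split is closed as
    // soon as adding the next file would push it past targetBytes. A file
    // larger than targetBytes ends up alone in its split (it is not cut
    // into pieces in this sketch).
    static List<Long> pack(List<Long> fileSizes, long targetBytes) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            if (current > 0 && current + size > targetBytes) {
                splits.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // 1000 files of 1 MB each, packed into splits of at most 128 MB.
        List<Long> files = Collections.nCopies(1000, MB);
        System.out.println("Combined splits: " + pack(files, 128 * MB).size());
    }
}
```

With 1000 files of 1MB the sketch produces 8 combined splits (seven full 128MB splits plus one of 104MB), which agrees with the ceil-based estimate above.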

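3.4 Characteristics

Pros: fewer tasks in small-file scenarios and lower scheduling overhead
Cons: may reduce parallelism; not suitable when very high parallelism is needed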

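4. Executor Request Mechanism

4.1 SparkTaskPlanUtil Source Analysis

This article groups the task-planning and executor-request logic of Hive on Spark under the name SparkTaskPlanUtil for illustration. In upstream Apache Hive, the SparkWork is translated into a SparkPlan by SparkPlanGenerator, and the executors themselves are requested by Spark, based on the spark.* properties that Hive passes through. The snippets in sections 4.4 and 10.2 are therefore best read as conceptual pseudocode.

Key classes:

org.apache.hive.spark.client.SparkClient
org.apache.hadoop.hive.ql.exec.spark.SparkTaskPlanUtil (illustrative name used in this article)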
4.2 Computing the Number of Executors

4.2.1 Static Configuration Mode

Configuration parameters:

# Number of executors
spark.executor.instances=10

# Cores per executor
spark.executor.cores=4

# Memory per executor
spark.executor.memory=14g

Number of executors:

Number of executors = spark.executor.instances

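4.2.2 Dynamic Resource Allocation Mode

Configuration parameters:

# Enable dynamic resource allocation
spark.dynamicAllocation.enabled=true

# Minimum number of executors
spark.dynamicAllocation.minExecutors=2

# Maximum number of executors
spark.dynamicAllocation.maxExecutors=50

# Initial number of executors
spark.dynamicAllocation.initialExecutors=5

On YARN, dynamic allocation typically also requires the external shuffle service (spark.shuffle.service.enabled=true).

Executor count logic (simplified sketch, not the verbatim Spark source):

// Spark dynamic resource allocation logic
int calculateExecutorCount(int pendingTasks, int runningTasks,
                           int coresPerExecutor, int minExecutors, int maxExecutors) {
    int totalTasks = pendingTasks + runningTasks;
    int totalCoresNeeded = totalTasks; // each task needs one core (spark.task.cpus=1)
    
    // Number of executors needed, rounded up (cast avoids integer division)
    int executorsNeeded = (int) Math.ceil((double) totalCoresNeeded / coresPerExecutor);
    
    // Clamp between min and max
    return Math.max(minExecutors, 
           Math.min(maxExecutors, executorsNeeded));
}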

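4.3 Relationship Between Tasks and Executors

Parallelism:

Total parallelism = number of executors × spark.executor.cores

Task scheduling:

Each executor can run spark.executor.cores tasks at the same time
If the number of tasks exceeds the total parallelism, the remaining tasks queue until a slot frees up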
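
As a quick illustration of the two rules above, the following sketch computes the number of task slots and the number of scheduling "waves" a stage needs. WaveEstimator is a hypothetical helper; it assumes one core per task (spark.task.cpus=1) and tasks of equal duration, and the 100-task figure is made up for the example.

```java
// Hypothetical helper: relates task count to executor slots.
class WaveEstimator {
    // Total number of tasks that can run at the same time.
    static int totalSlots(int executors, int coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    // Number of rounds needed when every slot runs one task per round
    // (ceiling of tasks / slots).
    static int waves(int tasks, int executors, int coresPerExecutor) {
        int slots = totalSlots(executors, coresPerExecutor);
        return (tasks + slots - 1) / slots;
    }

    public static void main(String[] args) {
        // 10 executors with 4 cores each, as in the static configuration above.
        System.out.println("Slots: " + totalSlots(10, 4));
        // A hypothetical stage with 100 tasks.
        System.out.println("Waves for 100 tasks: " + waves(100, 10, 4));
    }
}
```

With 10 executors of 4 cores there are 40 slots, so 100 tasks need 3 waves, and the last wave leaves half of the slots idle.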

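4.4 Actual Request Logic

Source location (for reference; conceptual pseudocode in which class, method, and helper names are illustrative rather than actual Hive APIs):

// org.apache.hadoop.hive.ql.exec.spark.SparkTaskPlanUtil
public SparkPlan createSparkPlan(SparkClient sparkClient, 
                                  List<Task<? extends Serializable>> tasks) {
    // 1. Compute the number of executors needed
    int taskCount = tasks.size();
    int coresPerExecutor = getExecutorCores();
    int executorsNeeded = (int) Math.ceil((double) taskCount / coresPerExecutor);
    
    // 2. Take dynamic resource allocation into account
    if (isDynamicAllocationEnabled()) {
        executorsNeeded = adjustForDynamicAllocation(executorsNeeded);
    }
    
    // 3. Request executors
    sparkClient.requestExecutors(executorsNeeded);
    
    // 4. Build and return the plan (hypothetical helper; construction omitted here)
    return buildPlan(tasks);
}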

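5. End-to-End Example

5.1 Scenario

Total data: 100GB
HDFS block size: 128MB
File type: large files (each file > 128MB)
Configuration:
- spark.executor.cores=4
- spark.executor.memory=14g
- spark.dynamicAllocation.enabled=false

5.2 Combine Disabled

Split calculation:

Number of splits = 100 * 1024 / 128 = 800
Number of Map Tasks = 800

Executor request (assuming spark.executor.instances=50):

Number of executors = 50
Total parallelism = 50 × 4 = 200

Execution:

The 800 tasks run in 4 waves (800 / 200 = 4)
Each wave runs 200 tasks in parallel

5.3 Combine Enabled

Split calculation (assuming combined splits of about 256MB, for example split.minsize=256MB):

Number of splits = 100 * 1024 / 256 = 400
Number of Map Tasks = 400

Executor request (same configuration):

Number of executors = 50
Total parallelism = 50 × 4 = 200

Execution:

The 400 tasks run in 2 waves (400 / 200 = 2)
Each wave runs 200 tasks in parallel
Compared with combine disabled, the number of waves is halved, although each task now processes twice as much data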

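6. Small-File Scenario Comparison

6.1 Scenario

Number of small files: 10000
Size of each file: 1MB
Total data: 10GB
HDFS block size: 128MB

6.2 Combine Disabled

Number of splits = 10000 (one split per file)
Number of Map Tasks = 10000

Problems:

Far too many tasks, so scheduling overhead is very high
Each task processes very little data, which is inefficient

6.3 Combine Enabled

Configuration: split.minsize=128MB

Number of splits after combining ≈ ceil(10000 * 1MB / 128MB) = 79
Number of Map Tasks = 79

Benefits:

The task count drops by 99.2% (from 10000 to 79)
Scheduling overhead is greatly reduced
Each task processes a reasonable amount of data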

7. Key Configuration Parameters

7.1 Combine-Related

Parameter	Default	Description
`hive.input.format`	`CombineHiveInputFormat`	InputFormat used to compute splits
`mapreduce.input.fileinputformat.split.minsize`	1	Minimum split size (bytes)
`mapreduce.input.fileinputformat.split.maxsize`	Long.MAX_VALUE (Hadoop FileInputFormat); 256000000 (HiveConf)	Maximum split size (bytes)

7.2 Executor-Related

Parameter	Description
`spark.executor.instances`	Number of executor instances (static mode)
`spark.executor.cores`	Cores per executor
`spark.executor.memory`	Memory per executor
`spark.dynamicAllocation.enabled`	Whether dynamic resource allocation is enabled
`spark.dynamicAllocation.minExecutors`	Minimum number of executors
`spark.dynamicAllocation.maxExecutors`	Maximum number of executors

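8. Best Practices

8.1 When to Enable Combine

Recommended:

Workloads with many small files
When scheduling overhead needs to be reduced
When data is unevenly distributed across files

Not recommended:

Large-file workloads (little effect)
Workloads that need very high parallelism
Data that is already split into reasonable sizes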

8.2 Executor Configuration Recommendations

Static configuration:

# Suitable for environments with fixed resources
spark.executor.instances=50
spark.executor.cores=4
spark.executor.memory=14g

Dynamic configuration:

# Suitable for multi-tenant shared clusters
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=100
spark.dynamicAllocation.initialExecutors=10

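8.3 Performance Tuning Recommendations

Choose split.minsize sensibly:
- Many small files: set it to 128MB or 256MB
- Mostly large files: keep the default or disable combine
Number of executors:
- Ideally total parallelism >= expected task count; when resources do not allow this, aim for a small number of waves
- Take cluster resource limits into account
Metrics to monitor:
- Task count
- Executor utilization
- Task execution time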

9. Key Source Locations

9.1 Hive Source

Split generation:
- org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
- org.apache.hadoop.hive.ql.io.HiveInputFormat
Task planning:
- org.apache.hadoop.hive.ql.exec.spark.SparkTaskPlanUtil (illustrative name used in this article; upstream Hive builds the plan in SparkPlanGenerator)
- org.apache.hadoop.hive.ql.exec.spark.SparkTask

9.2 Spark Source

Executor requests:
- org.apache.spark.scheduler.cluster.YarnScheduler
- org.apache.spark.deploy.yarn.YarnAllocator
Dynamic resource allocation:
- org.apache.spark.ExecutorAllocationManager

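10. Source Code Deep Dive

10.1 CombineHiveInputFormat in Detail

Source location: org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

Core method: getSplits()

// Conceptual pseudocode: the real implementation delegates the packing
// to Hadoop's CombineFileInputFormat; helper names here are illustrative
public InputSplit[] getSplits(JobConf job, int numSplits) 
    throws IOException {
    
    // 1. Read the configuration parameters
    long minSize = Math.max(
        job.getLong("mapreduce.input.fileinputformat.split.minsize", 1),
        job.getLong("mapreduce.input.fileinputformat.split.minsize.per.node", 0)
    );
    long maxSize = job.getLong(
        "mapreduce.input.fileinputformat.split.maxsize", 
        Long.MAX_VALUE
    );
    
    // 2. List all input files
    List<FileStatus> files = listStatus(job);
    
    // 3. Group file blocks by node and rack
    Map<String, List<OneBlockInfo>> nodeToBlocks = new HashMap<>();
    Map<OneBlockInfo, String[]> blockToNodes = new HashMap<>();
    Map<String, Set<OneBlockInfo>> rackToBlocks = new HashMap<>();
    
    // 4. Merging strategy
    List<InputSplit> splits = new ArrayList<>();
    for (FileStatus file : files) {
        // 4.1 Files smaller than minSize are candidates for merging
        if (file.getLen() < minSize) {
            // Try to merge with other small files on the same node
            combineSmallFiles(file, minSize, maxSize, splits);
        } else {
            // 4.2 Large files are cut into pieces of at most maxSize
            createSplitsForFile(file, minSize, maxSize, splits);
        }
    }
    
    return splits.toArray(new InputSplit[0]);
}

Key logic:

Small-file merging: a file smaller than minSize is merged with other files on the same node where possible
Large-file splitting: a file larger than maxSize is cut into pieces of at most maxSize
Node locality first: blocks on the same node are merged before blocks from other nodes or racks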

10.2 SparkTaskPlanUtil in Detail

Source location: org.apache.hadoop.hive.ql.exec.spark.SparkTaskPlanUtil (illustrative name; the code below is conceptual pseudocode, and its helper methods are not actual Hive APIs)

Core method: createSparkPlan()

public SparkPlan createSparkPlan(
    SparkClient sparkClient,
    List<Task<? extends Serializable>> tasks,
    SparkSession sparkSession) throws Exception {
    
    // 1. Collect task statistics
    int totalTasks = tasks.size();
    int mapTasks = countMapTasks(tasks);
    int reduceTasks = countReduceTasks(tasks);
    
    // 2. Work out the resources needed per executor
    int coresPerExecutor = getExecutorCores(sparkSession);
    int memoryPerExecutor = getExecutorMemory(sparkSession);
    
    // 3. Compute the number of executors
    int executorsNeeded;
    if (isDynamicAllocationEnabled(sparkSession)) {
        // Dynamic allocation mode
        executorsNeeded = calculateDynamicExecutors(
            totalTasks, coresPerExecutor, sparkSession);
    } else {
        // Static allocation mode
        executorsNeeded = getStaticExecutorCount(sparkSession);
    }
    
    // 4. Create the Spark plan
    SparkPlan plan = new SparkPlan();
    plan.setExecutorCount(executorsNeeded);
    plan.setCoresPerExecutor(coresPerExecutor);
    plan.setMemoryPerExecutor(memoryPerExecutor);
    plan.setTasks(tasks);
    
    // 5. Request executors
    requestExecutors(sparkClient, executorsNeeded, 
                     coresPerExecutor, memoryPerExecutor);
    
    return plan;
}

private int calculateDynamicExecutors(
    int totalTasks, 
    int coresPerExecutor,
    SparkSession sparkSession) {
    
    // Number of executors needed in theory, rounded up
    int theoreticalExecutors = (int) Math.ceil(
        (double) totalTasks / coresPerExecutor);
    
    // Read the configured min and max
    int minExecutors = getMinExecutors(sparkSession);
    int maxExecutors = getMaxExecutors(sparkSession);
    
    // Clamp to the configured range
    return Math.max(minExecutors, 
           Math.min(maxExecutors, theoreticalExecutors));
}

10.3 YarnAllocator Resource Request Logic

Source location: org.apache.spark.deploy.yarn.YarnAllocator

Core method: allocateResources()

// Java-style pseudocode; the actual YarnAllocator is written in Scala
private void allocateResources() {
    // 1. Work out how many executors still have to be requested
    int pendingExecutors = calculatePendingExecutors();
    
    // 2. Check the YARN limit, which applies to each container individually
    Resource maxResource = yarnClient.getMaxResourceCapability();
    if (executorMemory > maxResource.getMemory()) {
        // An executor larger than the per-container maximum can never be granted
        throw new IllegalArgumentException(
            "executor memory exceeds the YARN per-container maximum");
    }
    
    // 3. Request containers from YARN
    for (int i = 0; i < pendingExecutors; i++) {
        ContainerRequest request = new ContainerRequest(
            resource, null, null, priority);
        amRMClient.addContainerRequest(request);
    }
}

11. Real-World Scenarios in Depth

11.1 Mixed File Sizes

Scenario:

100 large files of 1GB each (100GB in total)
10000 small files of 1MB each (10GB in total)
HDFS block size: 128MB
Total data: 110GB

Combine disabled:

Large-file splits = 100 * 1024 / 128 = 800
Small-file splits = 10000
Total splits = 10800
Number of Map Tasks = 10800

Combine enabled (split.minsize=128MB):

Large-file splits = 100 * 1024 / 128 = 800 (unchanged)
Small-file splits = ceil(10000 * 1MB / 128MB) = 79
Total splits = 879
Number of Map Tasks = 879 (a 92% reduction)
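
The mixed scenario can be checked with a short calculation. MixedWorkloadEstimate is a hypothetical helper that applies the same simplified model as the text: large files are cut per block (or per target size), and small files are packed up to an assumed target size of 128MB.

```java
// Hypothetical helper: reproduces the mixed-workload arithmetic.
class MixedWorkloadEstimate {
    static final long MB = 1024L * 1024L;

    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    // Combine disabled: every file is cut per block, at least one split per file.
    static long withoutCombine(long largeFiles, long largeBytes,
                               long smallFiles, long smallBytes, long blockBytes) {
        long large = largeFiles * ceilDiv(largeBytes, blockBytes);
        long small = smallFiles * Math.max(1, ceilDiv(smallBytes, blockBytes));
        return large + small;
    }

    // Combine enabled: large files are cut per target size, small files
    // are packed together up to the target size.
    static long withCombine(long largeFiles, long largeBytes,
                            long smallFiles, long smallBytes, long targetBytes) {
        long large = largeFiles * ceilDiv(largeBytes, targetBytes);
        long small = ceilDiv(smallFiles * smallBytes, targetBytes);
        return large + small;
    }

    public static void main(String[] args) {
        long before = withoutCombine(100, 1024 * MB, 10000, MB, 128 * MB);
        long after = withCombine(100, 1024 * MB, 10000, MB, 128 * MB);
        double reduction = 100.0 * (before - after) / before;
        System.out.printf("before=%d after=%d reduction=%.1f%%%n",
                before, after, reduction);
    }
}
```

The sketch gives 10800 tasks without combine and 879 with it, a reduction of roughly 92%, nearly all of which comes from the small files.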

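11.2 Dynamic Resource Allocation Scenario

Scenario:

Number of tasks: 1000
Configuration:
- spark.executor.cores=4
- spark.dynamicAllocation.enabled=true
- spark.dynamicAllocation.minExecutors=10
- spark.dynamicAllocation.maxExecutors=100

Executor request process:

Initial phase:

Initial executors = 10 (initialExecutors defaults to minExecutors)
Total parallelism = 10 × 4 = 40

Resource demand:

Executors needed in theory = ceil(1000 / 4) = 250
Capped by maxExecutors = min(250, 100) = 100

Dynamic adjustment:
- Spark monitors the pending tasks
- If tasks stay backlogged longer than spark.dynamicAllocation.schedulerBacklogTimeout, more executors are requested
- If an executor stays idle longer than spark.dynamicAllocation.executorIdleTimeout, it is released

Final state:

Actual executors = 100 (the upper limit)
Total parallelism = 100 × 4 = 400
Number of waves = ceil(1000 / 400) = 3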
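
The ramp from 10 to 100 executors does not happen in one step. The sketch below models the exponential ramp-up described in Spark's dynamic allocation documentation (1, 2, 4, 8, ... executors added per round while a backlog persists). DynamicAllocationRamp is a hypothetical class, and it assumes spark.task.cpus=1 and a backlog that lasts through every round.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the executor target over successive request rounds.
class DynamicAllocationRamp {
    // Upper bound on useful executors: enough slots for all tasks,
    // clamped to the configured [min, max] range.
    static int cap(int tasks, int coresPerExecutor, int minExecutors, int maxExecutors) {
        int needed = (tasks + coresPerExecutor - 1) / coresPerExecutor;
        return Math.max(minExecutors, Math.min(maxExecutors, needed));
    }

    // Executor target after each round; the increment doubles every round
    // until the cap is reached.
    static List<Integer> rampUp(int start, int cap) {
        List<Integer> targets = new ArrayList<>();
        int current = start;
        int toAdd = 1;
        while (current < cap) {
            current = Math.min(cap, current + toAdd);
            targets.add(current);
            toAdd *= 2;
        }
        return targets;
    }

    public static void main(String[] args) {
        int cap = cap(1000, 4, 10, 100);
        System.out.println("Cap: " + cap);
        System.out.println("Targets per round: " + rampUp(10, cap));
    }
}
```

Under these assumptions the target moves through 11, 13, 17, 25, 41, 73 and reaches 100 in the seventh round, so the first waves of a short job run with far fewer than 400 slots.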

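11.3 Resource Contention Scenario

Scenario:

Total cluster capacity: 100 executor slots
3 Hive jobs running at the same time
Each job needs 50 executors

Resource allocation:

Job 1: requests 50 executors ✓
Job 2: requests 50 executors ✓
Job 3: requests 50 executors ✗ (no capacity left; the request stays pending)

Actual allocation:
- Job 1: 50 executors
- Job 2: 50 executors
- Job 3: 0 executors (waiting for resources to be released)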
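
The allocation above corresponds to a simple first-come, first-served model, sketched below. FifoSlotAllocation is a hypothetical class; real YARN schedulers (Capacity or Fair) apply queue limits and fairness rules that this sketch deliberately leaves out.

```java
import java.util.Arrays;

// Hypothetical sketch: grants executor slots to jobs in arrival order.
class FifoSlotAllocation {
    // Each job receives as many slots as are still free, up to its request.
    static int[] allocate(int totalSlots, int[] requests) {
        int[] granted = new int[requests.length];
        int free = totalSlots;
        for (int i = 0; i < requests.length; i++) {
            granted[i] = Math.min(requests[i], free);
            free -= granted[i];
        }
        return granted;
    }

    public static void main(String[] args) {
        // Three jobs asking for 50 executors each on a 100-slot cluster.
        int[] granted = allocate(100, new int[] {50, 50, 50});
        System.out.println(Arrays.toString(granted));
    }
}
```

For three requests of 50 on 100 slots the result is [50, 50, 0]: the third job gets nothing until one of the first two releases executors.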

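Solutions:

Use dynamic resource allocation so that jobs share resources
Set a reasonable maxExecutors so that no single job takes too many resources
Use queue management to limit the resources available to each queue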

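12. Performance Tuning in Depth

12.1 Split Size Tuning

Question: how do you choose the best split.minsize?

Analysis:

Too small (for example 1MB):
- Small files are not merged effectively
- The task count stays high
- Scheduling overhead is large
Too large (for example 1GB):
- Parallelism drops
- Individual tasks run for a long time
- Resource utilization is low
Reasonable (128MB-256MB):
- Balances parallelism against scheduling overhead
- Matches the HDFS block size
- Makes good use of network bandwidth

Recommended rule of thumb:

split.minsize = max(HDFS block size, average file size × 2)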
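
The rule of thumb translates directly into code. MinSplitSizeRule is a hypothetical helper that turns a block size and an average file size into a value for the property; the file sizes used in main are made up for the example.

```java
// Hypothetical helper: applies the rule of thumb for split.minsize.
class MinSplitSizeRule {
    static final long MB = 1024L * 1024L;

    // max(HDFS block size, 2 x average file size), in bytes.
    static long recommendedMinSize(long blockBytes, long avgFileBytes) {
        return Math.max(blockBytes, 2 * avgFileBytes);
    }

    public static void main(String[] args) {
        // Small files (1 MB on average): the block size wins.
        System.out.println("mapreduce.input.fileinputformat.split.minsize="
                + recommendedMinSize(128 * MB, 1 * MB));
        // Medium files (100 MB on average): twice the average wins.
        System.out.println("mapreduce.input.fileinputformat.split.minsize="
                + recommendedMinSize(128 * MB, 100 * MB));
    }
}
```

For 1MB files the rule returns 134217728 (128MB); for 100MB files it returns 209715200 (200MB), so that about two files share a split.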

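12.2 Executor Configuration Tuning

Question: how do you choose the best executor configuration?

Dimensions to consider:

Executor cores:
- Too few (1-2): low resource utilization and high network overhead
- Too many (8+): heavy GC pressure and more complex task scheduling
- Recommended: 4-6 cores
Executor memory:
- Account for off-heap memory: spark.executor.memoryOverhead
- Total memory = executor.memory + memoryOverhead
- Recommended: 14-16GB in total
Number of executors:
- Static mode: fixed according to cluster resources
- Dynamic mode: set a reasonable min/max range

Recommended configuration:

# Medium cluster (10-20 nodes)
spark.executor.cores=4
spark.executor.memory=14g
spark.executor.memoryOverhead=2g
spark.executor.instances=50  # or use dynamic allocation

# Large cluster (50+ nodes)
spark.executor.cores=5
spark.executor.memory=16g
spark.executor.memoryOverhead=3g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=20
spark.dynamicAllocation.maxExecutors=200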
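
To sanity-check such a configuration against a node, the sketch below computes the YARN container size of one executor and how many executors fit on a node. ExecutorSizing is a hypothetical helper, and the node capacity (64GB and 16 vcores available to YARN) is an assumption made for the example; rounding to the scheduler's allocation increment is ignored.

```java
// Hypothetical helper: checks how many executors fit on one node.
class ExecutorSizing {
    // Container size requested from YARN for one executor:
    // heap (spark.executor.memory) plus spark.executor.memoryOverhead.
    static int containerMb(int executorMemoryMb, int overheadMb) {
        return executorMemoryMb + overheadMb;
    }

    // Executors per node, limited by whichever runs out first: memory or vcores.
    static int executorsPerNode(int nodeMemoryMb, int nodeVcores,
                                int containerMb, int executorCores) {
        return Math.min(nodeMemoryMb / containerMb, nodeVcores / executorCores);
    }

    public static void main(String[] args) {
        int container = containerMb(14 * 1024, 2 * 1024); // 14g heap + 2g overhead
        // Assumed node capacity: 64 GB and 16 vcores managed by YARN.
        int perNode = executorsPerNode(64 * 1024, 16, container, 4);
        System.out.println("Container MB: " + container);
        System.out.println("Executors per node: " + perNode);
    }
}
```

With the medium-cluster settings each executor occupies a 16GB container, so the assumed node hosts 4 executors and both memory and vcores are fully used.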

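12.3 Parallelism Tuning

Key metrics:

Parallelism = number of executors × spark.executor.cores
Ideal case: parallelism ≈ task count

Tuning strategies:

Too many tasks:
- Enable combine to reduce the task count
- Increase split.minsize
- Merge small files
Too few tasks:
- Disable combine
- Decrease split.minsize
- Add executors (if resources allow)
Moderate task count:
- When resources allow, keep parallelism >= task count
- This avoids tasks queuing for a free slot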

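13. Monitoring and Diagnosis

13.1 Key Monitoring Metrics

Hive level:

-- List compaction requests (applies to transactional/ACID tables only)
SHOW COMPACTIONS;

-- List the partitions of a table
SHOW PARTITIONS table_name;

-- Show table metadata and statistics (numFiles, totalSize, and so on)
DESCRIBE FORMATTED table_name;

Spark level:

Spark UI: http://spark-master:8080 (the standalone master UI; for jobs running on YARN, follow the application's tracking URL from the ResourceManager UI instead)
Key metrics:
- Number of executors
- Number of tasks
- Task execution time
- Executor utilization
- Data skew

YARN level:

# Show the resource usage of an application
yarn application -status <application_id>

# List node resources
yarn node -list

# Show queue resources
yarn queue -status <queue_name>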

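13.2 Diagnosis

Problem 1: too many tasks

Symptoms:

High scheduling overhead
Low executor utilization
Each task is short, but the total run time is long

Diagnosis:

# Count files and directories under the path (a rough proxy for the split count when files are small)
hadoop fs -ls -R /path/to/data | wc -l

# Inspect the file size distribution
hadoop fs -du -h /path/to/data | sort -h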
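
Once the sizes are listed, counting how many entries fall below a threshold gives a quick measure of the small-file problem. The sketch below is one possible way to do this; SmallFileCounter is a hypothetical helper, it assumes the output of hadoop fs -du without -h (so that the first column is a size in bytes), and the sample lines are made up.

```java
import java.util.List;

// Hypothetical helper: counts small entries in `hadoop fs -du` output.
class SmallFileCounter {
    // Counts lines whose first whitespace-separated column (size in bytes)
    // is below thresholdBytes. Blank lines are skipped.
    static long countSmall(List<String> duLines, long thresholdBytes) {
        long count = 0;
        for (String line : duLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            long size = Long.parseLong(trimmed.split("\\s+")[0]);
            if (size < thresholdBytes) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample lines: size, disk space consumed, path.
        List<String> lines = List.of(
                "1048576  3145728  /path/to/data/part-00000",
                "2097152  6291456  /path/to/data/part-00001",
                "268435456  805306368  /path/to/data/part-00002");
        long small = countSmall(lines, 128L * 1024 * 1024);
        System.out.println("Entries below 128 MB: " + small);
    }
}
```

A high share of entries below the block size is a strong hint that combine, or a compaction of the files themselves, will pay off.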

Solutions:

Enable combine
Set a reasonable split.minsize
Merge small files

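Problem 2: not enough executor resources

Symptoms:

Tasks queue up
Executor utilization is at 100%
Jobs take a long time to finish

Diagnosis:

# Show YARN resource usage
yarn top

# Show the resources requested by an application
yarn application -status <app_id>

Solutions:

Add executors
Enable dynamic resource allocation
Optimize the resource usage of other jobs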

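Problem 3: data skew

Symptoms:

Some tasks run much longer than others
Executor utilization is uneven
Some executors sit idle

Diagnosis:

-- Inspect how rows are distributed across partitions
-- (per-partition byte sizes are not a table column; check them with hadoop fs -du on the partition directories)
SELECT 
    partition_col,
    COUNT(*) AS row_count
FROM table_name
GROUP BY partition_col
ORDER BY row_count DESC;

Solutions:

Repartition the data
Use dynamic partitioning when rewriting the data into better-balanced partitions
Adjust the split strategy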

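14. Common Problems and Solutions

14.1 Combine Has No Effect

Problem: combine is configured but the task count does not go down

Possible causes:

Files are already large enough (> split.minsize)
The configuration did not take effect
The InputFormat in use does not support combining

Solution:

-- Check the current setting
SET hive.input.format;

-- Check file sizes (dfs runs an HDFS command from the Hive CLI or Beeline)
dfs -du -h /path/to/data;

-- Set explicitly
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapreduce.input.fileinputformat.split.minsize=134217728;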

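14.2 Executor Requests Fail

Problem: YARN cannot allocate enough executors

Possible causes:

Insufficient cluster resources
Queue resource limits
Per-user resource quotas

Solution:

# Check cluster resources
yarn top

# Check the queue configuration
yarn queue -status default

# Adjust the configuration
# 1. Reduce the number of executors
# 2. Reduce the resources of each executor
# 3. Use dynamic resource allocation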

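14.3 Tasks Run Slowly

Problem: tasks take too long to execute

Possible causes:

Data skew
Splits are too large
Not enough executor resources
Network or disk I/O bottlenecks

Solutions:

Data skew:

-- Enable dynamic partitioning (helps only when the data is rewritten into better-balanced partitions)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

Splits too large:

-- Decrease split.minsize
SET mapreduce.input.fileinputformat.split.minsize=67108864;  -- 64MB

Insufficient resources:

-- Increase the number of executors or cores
SET spark.executor.instances=100;
SET spark.executor.cores=5;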

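15. Comparison with Other Execution Engines

15.1 Hive on MapReduce

Feature	Hive on Spark	Hive on MapReduce
Task assignment	Split-based, supports combine	Split-based, supports combine
Resource management	YARN + Spark	YARN + MapReduce
Parallelism	Executors × cores	Concurrent map and reduce containers
Dynamic adjustment	Supports dynamic resource allocation	Not supported
Typical use	Interactive queries, ETL	Batch jobs

15.2 Hive on Tez

Feature	Hive on Spark	Hive on Tez
Task assignment	Split-based	Split-based
Resource management	YARN + Spark	YARN + Tez
DAG execution	Spark DAG	Tez DAG
Dynamic adjustment	Supported	Partially supported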

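16. Summary

16.1 Key Points

Task count:
- Combine disabled: task count ≈ number of HDFS blocks
- Combine enabled: task count ≈ ceil(total data size / max(block size, split.minsize))
- Combine mainly helps in small-file scenarios
Executor requests:
- Static mode: spark.executor.instances is used directly
- Dynamic mode: adjusted at run time according to the task count and the configured min/max range
- Total parallelism = number of executors × spark.executor.cores
Performance impact:
- Combine mainly helps with small files by reducing scheduling overhead
- The number of executors should match the task count to avoid wasted resources or queued tasks
- A well-chosen configuration can improve performance significantly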

16.2 Configuration Recommendations

Small-file scenario:

hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
mapreduce.input.fileinputformat.split.minsize=134217728  # 128MB
spark.executor.cores=4
spark.executor.memory=14g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=100

Large-file scenario:

hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
# Or keep combine but set a larger minsize
mapreduce.input.fileinputformat.split.minsize=268435456  # 256MB
spark.executor.cores=5
spark.executor.memory=16g
spark.executor.instances=50