1. spark-env.sh
export SPARK_LOCAL_DIRS=/home/hadoop/spark/tmp
export SPARK_HOME=/usr/install/spark
2. spark-defaults.conf
// This requires spark.shuffle.service.enabled to be set. The following
// configurations are also relevant: spark.dynamicAllocation.minExecutors,
// spark.dynamicAllocation.maxExecutors, and spark.dynamicAllocation.initialExecutors.
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 20
spark.dynamicAllocation.executorIdleTimeout 120s
spark.dynamicAllocation.cachedExecutorIdleTimeout 1800s
spark.shuffle.service.port 7338
spark.shuffle.io.connectionTimeout 600s
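The same dynamic-allocation settings can also be applied programmatically. A minimal Scala sketch mirroring the values above (it assumes the external shuffle service is already running on each NodeManager; the values are the ones from this section, not new recommendations):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Programmatic equivalent of the spark-defaults.conf entries above.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // required by dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "120s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "1800s")

val spark = SparkSession.builder().config(conf).getOrCreate()
```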
spark.yarn.jars hdfs://master:9000/user/yarn_jars/spark2.0/*
spark.yarn.executor.memoryOverhead 3g
spark.driver.memory 3g
spark.yarn.am.memory 3g
spark.executor.memory 8g
spark.executor.cores 3
spark.yarn.queue test
spark.ui.enabled true
spark.port.maxRetries 50
spark.locality.wait 0s
spark.master yarn
// HDFS replication factor for files uploaded by the application
spark.yarn.submit.file.replication 3
spark.yarn.am.waitTime 100s
// Set to true to keep stage-related files after the job finishes instead of deleting them (usually not needed; leave it false)
spark.preserve.staging.files false
// Interval (ms) at which the Spark application master sends heartbeats to the YARN ResourceManager
spark.yarn.scheduler.heartbeat.interval-ms 5000
// Applies only to the HashShuffleManager implementation. It also targets the too-many-files problem, by reusing shuffle output files across map tasks that run in different waves, i.e. merging the output of map tasks from different batches. Each map task still needs one file per reduce partition, so this does not reduce the number of simultaneously open output files and therefore does not help reduce memory usage; it is only a compromise within HashShuffleManager.
spark.shuffle.consolidateFiles true
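The file-count arithmetic is easy to sanity-check. A back-of-the-envelope Scala sketch (the task and slot counts below are assumptions for illustration, not values from this config):

```scala
// HashShuffleManager shuffle-file counts, illustrative numbers only.
val numMapTasks        = 1000 // assumed total map tasks
val numReducers        = 200  // assumed reduce partitions
val concurrentMapSlots = 60   // assumed, e.g. 20 executors * 3 cores

// Without consolidation: one file per (map task, reducer) pair.
val filesWithout = numMapTasks * numReducers     // 200,000 files
// With consolidation: files are reused across waves of map tasks,
// so the total depends on concurrent slots, not total map tasks.
val filesWith = concurrentMapSlots * numReducers // 12,000 files
println(s"without=$filesWithout, with=$filesWith")
```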
spark.serializer org.apache.spark.serializer.KryoSerializer
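Kryo benefits from registering application classes up front; a minimal sketch (ClickEvent is a made-up class for illustration):

```scala
import org.apache.spark.SparkConf

// Hypothetical application class, purely for illustration.
case class ClickEvent(userId: Long, url: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration lets Kryo emit compact numeric class IDs instead of
  // writing the full class name with every record.
  .registerKryoClasses(Array(classOf[ClickEvent]))
```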
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.cores 1
spark.driver.maxResultSize 1g
spark.driver.memory 1g
spark.executor.memory 1g
// Directory for scratch space, including map output files and RDDs that get stored on disk
spark.local.dir /tmp
spark.submit.deployMode client/cluster
spark.reducer.maxSizeInFlight 48m
spark.shuffle.compress true
spark.shuffle.file.buffer 32k
spark.shuffle.io.maxRetries 3
spark.shuffle.io.preferDirectBufs true
spark.shuffle.io.retryWait 5s
// This must be enabled if spark.dynamicAllocation.enabled is "true".
spark.shuffle.service.enabled false
spark.shuffle.service.port 7337
// In sort-based shuffle, when there is no map-side aggregation, skip merge-sorting the data as long as there are at most this many reduce partitions.
spark.shuffle.sort.bypassMergeThreshold 200
spark.shuffle.spill.compress true
spark.io.compression.codec lz4
// Options: org.apache.spark.io.LZ4CompressionCodec, org.apache.spark.io.LZFCompressionCodec, and org.apache.spark.io.SnappyCompressionCodec.
spark.broadcast.compress true
spark.io.compression.snappy.blockSize 32k
spark.io.compression.lz4.blockSize 32k
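Either a fully qualified class name or its short alias can be passed when setting the codec; a brief sketch:

```scala
import org.apache.spark.SparkConf

// "lz4", "lzf" and "snappy" are accepted shorthands for the codec classes above.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "snappy")
  .set("spark.io.compression.snappy.blockSize", "32k")
```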
spark.kryoserializer.buffer.max 64m
spark.kryoserializer.buffer 64k
spark.rdd.compress false
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
spark.memory.offHeap.enabled false
spark.memory.offHeap.size 0
spark.executor.cores 1
spark.default.parallelism 2
spark.executor.heartbeatInterval 10s
spark.files.useFetchCache true
spark.storage.memoryMapThreshold 2m
// This config will be used in place of spark.core.connection.ack.wait.timeout,
// spark.storage.blockManagerSlaveTimeoutMs, spark.shuffle.io.connectionTimeout,
// spark.rpc.askTimeout or spark.rpc.lookupTimeout.
spark.network.timeout 120s
spark.cores.max (not set)
spark.locality.wait 3s
// The scheduling mode between jobs; FAIR is useful for multi-user services.
spark.scheduler.mode FIFO
// Speculative execution of tasks
spark.speculation false
// How often to check for tasks to speculate
spark.speculation.interval 100ms
// Fraction of tasks that must complete before speculation is enabled for a stage
spark.speculation.quantile 0.75
// How many times slower than the median a task must be before it is speculated
spark.speculation.multiplier 1.5
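Putting the four speculation knobs together, a short Scala sketch that enables the feature with the values listed above (the listing itself keeps it disabled):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")           // enable speculative execution
  .set("spark.speculation.interval", "100ms") // how often to check
  .set("spark.speculation.quantile", "0.75")  // wait until 75% of tasks finish
  .set("spark.speculation.multiplier", "1.5") // 1.5x slower than median => speculate
```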
spark.sql.autoBroadcastJoinThreshold -1
spark.sql.shuffle.partitions 800
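With the threshold set to -1, automatic broadcast joins are disabled, but a broadcast can still be forced per join with a hint; SQL confs can also be changed at runtime. A minimal sketch, assuming an existing SparkSession named spark (the DataFrames are made up for illustration):

```scala
import org.apache.spark.sql.functions.broadcast

// Runtime equivalents of the two spark-defaults entries above.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.shuffle.partitions", "800")

// Hypothetical tables: auto-broadcast is off, but the hint still
// forces the small side to be broadcast.
val big    = spark.range(1000000L).toDF("id")
val small  = spark.range(100L).toDF("id")
val joined = big.join(broadcast(small), Seq("id"))
joined.explain()
```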
spark.shuffle.manager tungsten-sort
// Spark SQL compiles each SQL query to Java bytecode before executing it. For long-running or frequently executed queries this speeds things up, since specialized bytecode is generated; for very short queries it can add overhead, because every query must first be compiled.
spark.sql.codegen true
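Note the key above is the Spark 1.x name; on Spark 2.x (which the yarn_jars path earlier suggests this cluster runs), the corresponding switch is spark.sql.codegen.wholeStage. A short sketch for toggling it per session and checking the plan, again assuming an existing SparkSession named spark:

```scala
// Whole-stage codegen on/off for this session only (Spark 2.x key).
spark.conf.set("spark.sql.codegen.wholeStage", "true")

val df = spark.range(1000000L).selectExpr("sum(id) AS total")
// Operators compiled by whole-stage codegen are prefixed with '*'
// in the physical plan output.
df.explain()
```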
// By default a shuffle produces (map tasks * reduce tasks) files; setting this to true makes Spark consolidate intermediate shuffle files down to the number of reduce tasks.
spark.shuffle.consolidateFiles true