Tuning common Spark configuration parameters

This article covers methods and worked examples of Spark parameter tuning, including fixes for out-of-memory and connection-reset problems, along with configuration advice for resource sizing, the adaptive execution framework, and dynamic resource allocation.


Spark parameter tuning

spark.sql.hive.metastore.version=1.2.1


III. Errors

Problem 1:

ERROR YarnScheduler: Lost executor 53 on node100p32: Container killed by YARN for exceeding memory limits.
10.0 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

Solution:

Workaround: temporarily run this job on Hive instead, capping the number of reducers at 750. During the run, peak allocated memory was about 3.3 TB and the 20 jobs finished in exactly 8 hours.

-- Map/reduce container sizes (commented out; not changed for this run):
-- set mapreduce.map.memory.mb=3000;
-- set mapreduce.reduce.memory.mb=6000;

-- Combine small input splits so fewer map tasks are launched.
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=512000000;
set mapred.min.split.size.per.node=128000000;
set mapred.min.split.size.per.rack=128000000;

-- Merge small output files and enable map-side aggregation.
set hive.merge.mapfiles=true;
set hive.map.aggr=true;
set hive.merge.smallfiles.avgsize=128000000;

-- Cap the number of reducers at 750 (matches the run described above).
set hive.exec.reducers.max=750;

-- Dynamic partitioning limits for the insert.
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1500;
set hive.exec.max.dynamic.partitions.pernode=1500;

Root-cause analysis (to be written). Parameters relevant to the analysis are listed below; a Spark-side sketch for the memoryOverhead setting follows the list.

set mapreduce.map.memory.mb=2048;
set mapreduce.reduce.memory.mb=6000;
spark.yarn.executor.memoryOverhead          (off-heap headroom per Spark executor)
yarn.nodemanager.vmem-check-enabled         (NodeManager property in yarn-site.xml)
set hive.groupby.skewindata=true;
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=5000000;
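
As a hedged illustration of the Spark-side knobs above (not the Hive workaround actually used for this job), the Scala sketch below shows where those settings would go when building a session on YARN. The application name and the concrete values are assumptions, not values taken from this job. spark.yarn.executor.memoryOverhead must be fixed before executors are requested, so it belongs on the builder or on the spark-submit command line, and yarn.nodemanager.vmem-check-enabled is a NodeManager property that cannot be changed from application code.

import org.apache.spark.sql.SparkSession

object MemoryOverheadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("memory-overhead-sketch")                      // hypothetical application name
      .config("spark.executor.memory", "6g")                   // executor heap
      .config("spark.yarn.executor.memoryOverhead", "2048")    // extra off-heap headroom per executor, in MB (assumed value)
      .enableHiveSupport()
      .getOrCreate()

    // yarn.nodemanager.vmem-check-enabled lives in yarn-site.xml on the NodeManagers;
    // it is not a Spark conf and cannot be set here.
    spark.sql("select 1").show()
    spark.stop()
  }
}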

I. Common Spark configuration

1.spark-sql:

spark-sql --name "$0" \
  --master yarn --deploy-mode client --queue deve \
  --driver-memory 4g --executor-memory 6g --num-executors 50 --executor-cores 3 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=20 \
  --conf spark.dynamicAllocation.maxExecutors=56 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.maxNumPostShufflePartitions=500 \
  --conf spark.sql.adaptive.shuffle.targetPostShuffleInputSize=256000000 \
  --conf spark.yarn.executor.memoryOverhead=1200m \
  -i /opt/data/dev/util/spark_com.sql \
  --hiveconf hive.cli.print.header=true \
  --hiveconf hive.resultset.use.unique.column.names=false \
  --conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/data/dev/spark/log4j.properties' \
  -v -e " ${sql_query_insert} "

2.spark-submit:

spark-submit --master yarn --queue deve \
  --driver-memory 6G --executor-memory 7G --num-executors 32 --executor-cores 3 \
  --conf spark.yarn.executor.memoryOverhead=8096M \
  --conf spark.sql.shuffle.partitions=1000 \
  --conf spark.default.parallelism=150 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.shuffle.service.port= \
  --class com.ecnomic.test \
  /package/package.jar 2 2020-10-01 2020-10-01 > /log.log 2>&1

3. Loading UDFs (a Hive UDF does not have to be thread-safe, but a Spark UDF does; see the Scala sketch after this example):

Option 1, init file:  spark-sql -i /opt/data/dev/util/spark_com.sql
Option 2, source:     source /opt/data/dev/util/spark_com.sql;

Example:

add jar /opt/data/lib/udf.jar;
create temporary function udf_date_format as 'com.hive.udf.DateFormat';

spark-sql/hive -e "source /opt/data/dev/util/spark_com.sql; select * from table_test limit 5;"
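
To make the thread-safety remark concrete, here is a minimal Scala sketch; the function name and the demo value are hypothetical and unrelated to the com.hive.udf.DateFormat class above. A Spark UDF can be called concurrently by multiple tasks inside one executor JVM, so sharing a mutable java.text.SimpleDateFormat across calls is unsafe, while java.time.format.DateTimeFormatter is immutable and can be shared.

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

object UdfThreadSafetySketch {
  // Unsafe pattern: one mutable SimpleDateFormat shared by every task in the executor JVM.
  // private val sdf = new java.text.SimpleDateFormat("yyyyMMdd")

  // Safe pattern: DateTimeFormatter is immutable and thread-safe, so a shared constant is fine.
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-thread-safety-sketch").getOrCreate()

    // Registers a temporary function usable from SQL, e.g. "2020-10-01" -> "20201001".
    spark.udf.register("udf_date_format_demo", (s: String) => LocalDate.parse(s).format(fmt))

    spark.sql("select udf_date_format_demo('2020-10-01') as d").show()
    spark.stop()
  }
}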


II. Resource tuning

mapreduce.map.memory.mb=3000            container memory for each map task (MB)
mapreduce.reduce.memory.mb=6000         container memory for each reduce task (MB)
spark.yarn.executor.memoryOverhead=6000 mitigates the OOM above by enlarging the off-heap memory reserved for the JVM's own overhead
spark.shuffle.service.enabled=true      a long-running auxiliary service inside the NodeManager that improves shuffle performance; default false (disabled)
    (1) While an application with shuffle stages runs, an Executor not only executes tasks but also writes shuffle data and serves it to other Executors.
        If the Executor is overloaded and stuck in GC, it cannot serve that data and tasks on other Executors are held up.
    (2) The external shuffle service is a long-running auxiliary service inside the NodeManager process. Shuffle data is fetched through it instead,
        which relieves the Executors, and an Executor that is in GC no longer blocks tasks running on other Executors.

Reference: https://blog.youkuaiyun.com/zuodaoyong/article/details/107172810 (Spark shuffle parameter tuning)

1. Adaptive execution framework

spark.sql.adaptive.enabled                              switch for the adaptive execution framework; default false. Enabling it turns on Adaptive Execution, which automatically sets the number of shuffle reducers.
spark.sql.adaptive.minNumPostShufflePartitions          default 1; lower bound of the reducer-count range
spark.sql.adaptive.maxNumPostShufflePartitions          default 500; upper bound of the reducer-count range
spark.sql.adaptive.shuffle.targetPostShuffleInputSize   default 67108864 (64 MB); target input size per reducer used when adjusting the reducer count, i.e. each reduce task handles at least this much data; usually set to the cluster block size
spark.sql.adaptive.shuffle.targetPostShuffleRowCount    default 20000000; row-count counterpart of the above, i.e. each reduce task handles at least this many rows
Reference: https://blog.youkuaiyun.com/qq_14950717/article/details/105302842 (Spark SQL adaptive execution framework)
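
A minimal sketch of turning the adaptive settings above on from code, assuming a Spark 2.x build that recognises these keys (as this document does; Spark 3.x renamed most of them, e.g. spark.sql.adaptive.advisoryPartitionSizeInBytes). These are runtime SQL confs, so they can be set in spark-shell or in any code that already holds a SparkSession named spark; the values are illustrative.

// Runtime SQL confs; paste into spark-shell or run against an existing SparkSession.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.maxNumPostShufflePartitions", "500")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "256000000")  // ~256 MB per reducer, matching the spark-sql command in section I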

2. Dynamic resource allocation:

spark.dynamicAllocation.enabled                          whether to enable dynamic allocation, which adds or removes executors based on the workload; default false
spark.shuffle.service.enabled=true                       must also be enabled for dynamic allocation to work
spark.dynamicAllocation.minExecutors                     lower bound on the number of executors; default 0
spark.dynamicAllocation.maxExecutors                     upper bound on the number of executors; default infinity (unbounded)
spark.dynamicAllocation.initialExecutors                 initial number of executors; defaults to spark.dynamicAllocation.minExecutors. If --num-executors is set to a larger value, that value is used as the initial count.
spark.dynamicAllocation.executorIdleTimeout              an executor idle for longer than this is removed; default 60s
spark.dynamicAllocation.cachedExecutorIdleTimeout        an executor holding cached data and idle for longer than this is removed; default infinity
spark.dynamicAllocation.schedulerBacklogTimeout          how long the task queue must be backlogged before additional executors are requested; default 1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout like schedulerBacklogTimeout, but governs the interval between subsequent requests after the first; default = schedulerBacklogTimeout
Reference: https://blog.youkuaiyun.com/zyzzxycj/article/details/82256893
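
A minimal builder sketch for the dynamic-allocation settings above (the application name and values are illustrative; in practice they are usually passed as --conf flags, as in the spark-sql command in section I). Unlike the adaptive settings, these are application-startup settings, not runtime SQL confs, and spark.shuffle.service.enabled=true relies on the external shuffle service described at the top of this section being deployed on the NodeManagers.

import org.apache.spark.sql.SparkSession

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-sketch")                 // hypothetical application name
      .config("spark.dynamicAllocation.enabled", "true")    // scale executors with the workload
      .config("spark.shuffle.service.enabled", "true")      // required so shuffle data from removed executors stays readable
      .config("spark.dynamicAllocation.minExecutors", "20")
      .config("spark.dynamicAllocation.maxExecutors", "56")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()

    spark.range(0, 1000000L).count()                        // trivial job just to exercise the session
    spark.stop()
  }
}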

3. Data skew

spark.sql.adaptive.enabled                              default false; switch for the adaptive execution framework
spark.sql.adaptive.skewedJoin.enabled                   default false; switch for skewed-join handling
spark.sql.adaptive.skewedPartitionFactor                default 10; a partition is treated as skewed only if its size exceeds this factor times the median partition size and also exceeds spark.sql.adaptive.skewedPartitionSizeThreshold, or its row count exceeds this factor times the median partition row count and also exceeds spark.sql.adaptive.skewedPartitionRowCountThreshold
spark.sql.adaptive.skewedPartitionSizeThreshold         default 67108864; minimum size for a partition to be considered skewed; take the HDFS compression codec and file format (ORC, Parquet, etc.) into account when choosing it
spark.sql.adaptive.skewedPartitionRowCountThreshold     default 10000000; minimum row count for a partition to be considered skewed
spark.shuffle.statistics.verbose                        default false; when enabled, MapStatus records per-partition row counts, which the skew handling needs

Reference: https://blog.youkuaiyun.com/qq_14950717/article/details/105302842 (Spark SQL adaptive execution framework)
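
The skew-detection rule above is easier to read as code, so here is a small Scala sketch of my reading of it (an illustration, not Spark source code): a partition counts as skewed if it beats both the factor-times-median test and the absolute threshold, on either bytes or rows. Paste into a Scala REPL to try it.

// Defaults mirror the configuration values listed above.
def isSkewedPartition(sizeBytes: Long, rowCount: Long,
                      medianSizeBytes: Long, medianRowCount: Long,
                      factor: Long = 10L,
                      sizeThreshold: Long = 67108864L,      // skewedPartitionSizeThreshold
                      rowCountThreshold: Long = 10000000L   // skewedPartitionRowCountThreshold
                     ): Boolean = {
  val skewedBySize = sizeBytes > factor * medianSizeBytes && sizeBytes > sizeThreshold
  val skewedByRows = rowCount > factor * medianRowCount && rowCount > rowCountThreshold
  skewedBySize || skewedByRows
}

// A 900 MB partition against a 60 MB median is flagged; a 100 MB one is not.
isSkewedPartition(900L << 20, 1000000L, 60L << 20, 900000L)   // true
isSkewedPartition(100L << 20, 1000000L, 60L << 20, 900000L)   // false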

4. Memory management

See:
https://www.iteblog.com/archives/2342.html
https://blog.youkuaiyun.com/zyzzxycj/article/details/81011540
https://my.oschina.net/freelili/blog/1853714
https://blog.yoodb.com/sugarliny/article/detail/1307

III. Errors (continued)

Problem 2:

WARN TaskSetManager: Lost task 90.0 in stage 17.0 (TID 8770, n20p191,
executor 136): FetchFailed(BlockManagerId(65, n20p193, 7337, None),
shuffleId=3, mapId=247, reduceId=90, message=
org.apache.spark.shuffle.FetchFailedException: Connection reset by
peer
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:…)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:811)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:770)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:934)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:970)
at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.sort_addToSorter_0(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:80)
at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$13.apply(RDD.scala:845)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$13.apply(RDD.scala:845)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
… 1 more

Solution:

deep sleep
