Setting the default number of partitions in Spark with spark.sql.shuffle.partitions and spark.default.parallelism

Code_LT

Last modified 2023-04-19 20:22:23

License: CC 4.0 BY-SA

Category: Spark. Tags: default, partition, rdd, spark, dataframe

First published 2019-10-26 19:12:56

Article link: https://blog.youkuaiyun.com/Code_LT/article/details/102759932

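This article explains the role of, and the difference between, the Spark shuffle parameters spark.sql.shuffle.partitions and spark.default.parallelism, describes how they affect the number of partitions of DataFrames and RDDs, and shows how to set them in code or when submitting a job.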

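When a shuffle occurs that does not inherit the partition count of its parent object, the partition count of the result changes. These two parameters control the partition count of the object returned by that kind of shuffle.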

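Testing led to the following conclusions:

spark.sql.shuffle.partitions applies to DataFrames: for val df2 = df1.<shuffle operator> (for example df1.orderBy()), the number of partitions of df2 is the value of this parameter.

spark.default.parallelism applies to RDDs: for val rdd2 = rdd1.<shuffle operator> (for example rdd1.reduceByKey()), the number of partitions of rdd2 is the value of this parameter, provided the parameter has been set explicitly and the operator is not given its own numPartitions argument.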
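
The two conclusions above can be checked with a small local experiment. The following is a minimal sketch, not the author's original test: the object name ShufflePartitionDemo, the local[2] master, and the values 5 and 7 are illustrative assumptions, and adaptive query execution is switched off because in recent Spark versions it may coalesce DataFrame shuffle partitions below the configured number.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names and values are illustrative only.
object ShufflePartitionDemo {
  // Returns (partition count of df2, partition count of rdd2).
  def partitionCounts(): (Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-partition-demo")
      .config("spark.sql.shuffle.partitions", "5") // DataFrame shuffles
      .config("spark.default.parallelism", "7")    // RDD shuffles
      .config("spark.sql.adaptive.enabled", "false") // keep the partition count fixed
      .getOrCreate()
    import spark.implicits._
    try {
      // DataFrame side: 3 input partitions, then a shuffle caused by groupBy.
      val df1 = spark.range(0, 1000, 1, 3).withColumn("key", $"id" % 10)
      val df2 = df1.groupBy("key").count()

      // RDD side: 3 input partitions, then a shuffle caused by reduceByKey.
      val rdd1 = spark.sparkContext.parallelize(1 to 1000, 3).map(i => (i % 10, 1))
      val rdd2 = rdd1.reduceByKey(_ + _)

      (df2.rdd.getNumPartitions, rdd2.getNumPartitions)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (dfParts, rddParts) = partitionCounts()
    println(s"df2 partitions: $dfParts")   // expected 5 with the settings above
    println(s"rdd2 partitions: $rddParts") // expected 7 with the settings above
  }
}
```

Both inputs start with 3 partitions, so a result of 5 and 7 shows that each shuffle took its partition count from its own parameter rather than from the parent.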

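How can you tell whether an operation involves a shuffle? Make good use of the RDD's toDebugString method; see the article "Shuffle operators in Spark" for details.

For a DataFrame, you can likewise inspect df.rdd.toDebugString to see whether a shuffle occurs.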
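
As an illustration of the toDebugString check, here is a minimal sketch. The object name ShuffleLineageDemo and the sample data are assumptions made for the example. A shuffle should show up in the printed lineage as a ShuffledRDD entry for RDDs, and as a similarly named shuffled RDD for DataFrames.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names and data are illustrative only.
object ShuffleLineageDemo {
  // Returns (lineage of a reduceByKey result, lineage of a groupBy result).
  def lineages(): (String, String) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-lineage-demo")
      .getOrCreate()
    import spark.implicits._
    try {
      // RDD: reduceByKey is a shuffle operator.
      val rdd = spark.sparkContext
        .parallelize(1 to 100, 3)
        .map(i => (i % 10, 1))
        .reduceByKey(_ + _)

      // DataFrame: groupBy + count also needs a shuffle.
      val df = spark.range(0, 100, 1, 3)
        .withColumn("key", $"id" % 10)
        .groupBy("key")
        .count()

      (rdd.toDebugString, df.rdd.toDebugString)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (rddLineage, dfLineage) = lineages()
    println(rddLineage) // look for a ShuffledRDD line
    println(dfLineage)  // look for a shuffled RDD line here as well
  }
}
```

For DataFrames, df.explain() is another option: an Exchange node in the physical plan marks a shuffle.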

Put another way:

spark.default.parallelism takes effect only when working with RDDs.
spark.sql.shuffle.partitions takes effect only for Spark SQL (which produces DataFrames).

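How to change them:

In code:

sqlContext.setConf("spark.sql.shuffle.partitions", "500")
// spark.default.parallelism is read when the SparkContext starts, so set it on the SparkConf beforehand:
val conf = new org.apache.spark.SparkConf().set("spark.default.parallelism", "500")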

When submitting a job:

./bin/spark-submit --conf spark.sql.shuffle.partitions=500 --conf spark.default.parallelism=500
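
The difference between setting a value at startup and changing it at runtime can be sketched as follows. This is an illustrative example under stated assumptions: the object name RuntimeConfDemo and the values 4, 5 and 8 are made up, and adaptive query execution is disabled so the DataFrame partition counts stay predictable. It changes spark.sql.shuffle.partitions on a running session and reads back the SparkContext's default parallelism, which was fixed when the context was created.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names and values are illustrative only.
object RuntimeConfDemo {
  // Returns (partitions before the change, partitions after, default parallelism).
  def partitionsAfterRuntimeChange(): (Int, Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("runtime-conf-demo")
      .config("spark.sql.shuffle.partitions", "5")
      .config("spark.default.parallelism", "4") // fixed at context creation
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._
    try {
      // Builds a fresh aggregation each time so the current setting is used.
      def aggPartitions(): Int =
        spark.range(0, 1000, 1, 3)
          .withColumn("key", $"id" % 10)
          .groupBy("key")
          .count()
          .rdd
          .getNumPartitions

      val before = aggPartitions()
      spark.conf.set("spark.sql.shuffle.partitions", "8") // runtime change
      val after = aggPartitions()

      (before, after, spark.sparkContext.defaultParallelism)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    // Expected (5, 8, 4) with the settings above.
    println(partitionsAfterRuntimeChange())
  }
}
```

The SQL setting can be adjusted between queries, while the RDD default is best supplied through spark-submit --conf or the SparkConf used to start the application.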

Official description and default values:

spark.default.parallelism

Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

spark.sql.shuffle.partitions

Default: 200

Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

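This parameter also determines the parallelism of queries run through sqlContext.sql(...) whenever the query involves a shuffle.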

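What does an operation that follows the parent's partition count look like? One example is a DataFrame join: in the author's tests, df1.join(df2) returned a result whose partition count was determined by df1. This is the behavior to expect when the join runs as a broadcast join, which does not shuffle df1; a sort-merge join shuffles both sides into spark.sql.shuffle.partitions partitions.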
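
The two join outcomes can be compared in a minimal sketch. It is not the author's test: the object name JoinPartitionDemo and all values are assumptions. Automatic broadcasting and adaptive query execution are disabled so that the unhinted join is planned with a shuffle, and the broadcast function from org.apache.spark.sql.functions requests a broadcast join explicitly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical demo object; names and values are illustrative only.
object JoinPartitionDemo {
  // Returns (partitions of the shuffled join, partitions of the broadcast join).
  def joinPartitionCounts(): (Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("join-partition-demo")
      .config("spark.sql.shuffle.partitions", "5")
      .config("spark.sql.autoBroadcastJoinThreshold", "-1") // no automatic broadcast
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._
    try {
      // df1 has 3 partitions, df2 has 2; both join on a derived key column.
      val df1 = spark.range(0, 1000, 1, 3).withColumn("key", $"id" % 10)
      val df2 = spark.range(0, 10, 1, 2).select(($"id" % 10).as("key"))

      // Shuffled join: both sides are repartitioned by key.
      val shuffled = df1.join(df2, "key")
      // Broadcast join: df2 is copied to every task, df1 is not shuffled.
      val broadcasted = df1.join(broadcast(df2), "key")

      (shuffled.rdd.getNumPartitions, broadcasted.rdd.getNumPartitions)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    // Expected (5, 3) with the settings above.
    println(joinPartitionCounts())
  }
}
```

The first count comes from spark.sql.shuffle.partitions, and the second matches the 3 partitions of df1.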

References:

Configuration - Spark 2.1.0 Documentation

Performance Tuning - Spark 3.4.0 Documentation

https://www.jianshu.com/p/7442deb21ae0

performance - What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? - Stack Overflow