New Features in Spark 3.2.0

Article link: https://blog.youkuaiyun.com/IreneByron/article/details/125258093

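Apache Spark 3.2.0 introduces many new features, including adaptive query execution (AQE) enabled by default, Pandas API support, a RocksDB StateStore, ANSI SQL compatibility, and performance improvements. It also enhances the connectors, for example by upgrading Parquet and ORC and supporting more compression formats. The release focuses on SQL compatibility, query optimization, and shuffle efficiency.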

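Official documentation: Spark Release 3.2.0 | Apache Spark

Reference: https://www.modb.pro/db/183007

Spark 3.2.0 is the third release of the 3.x line. It resolves more than 1,700 Jira tickets.

This release adds the Pandas API on PySpark, a RocksDB StateStore, session windows, push-based shuffle, and ANSI SQL INTERVAL types. It also enables adaptive query execution (AQE) by default and brings ANSI SQL mode to GA.

The detailed changes are in the JIRA list: Release Notes - ASF JIRA

By module:

Highlights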

- Support Pandas API layer on PySpark (SPARK-34849)
- Enable adaptive query execution by default (SPARK-33679)
- Support push-based shuffle to improve shuffle efficiency (SPARK-30602)
- Add RocksDB StateStore implementation (SPARK-34198)
- EventTime based sessionization (session window) (SPARK-10816)
- ANSI SQL mode GA (SPARK-35030)
- Support for ANSI SQL INTERVAL types (SPARK-27790)
- Query compilation latency reduction (SPARK-35042, SPARK-35103, SPARK-34989)
- Support Scala 2.13 (SPARK-34218)
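
To make the session window item more concrete, here is a minimal pure-Python sketch of gap-based sessionization. It does not use Spark and is not Spark's implementation; the function name `sessionize`, the sample timestamps, and the 300-second gap are all assumptions for illustration. The sketch treats an event as part of the open session only when it arrives strictly before the session's end.

```python
from typing import List, Tuple

def sessionize(timestamps: List[int], gap: int) -> List[Tuple[int, int]]:
    """Group event timestamps (in seconds) into sessions.

    A session starts at its first event and ends `gap` seconds after its
    last event; an event arriving before that end extends the session.
    Returns (start, end) pairs where `end` is exclusive.
    """
    sessions: List[Tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the open session: push its end forward.
            start, _ = sessions[-1]
            sessions[-1] = (start, ts + gap)
        else:
            # First event, or the gap was exceeded: open a new session.
            sessions.append((ts, ts + gap))
    return sessions

# Hypothetical event times: three close together, then two more much later.
events = [0, 60, 200, 1000, 1100]
print(sessionize(events, gap=300))  # [(0, 500), (1000, 1400)]
```

In Spark 3.2.0 itself the feature is exposed through the `session_window` function, which takes a time column and a gap duration.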

Core and Spark SQL

ANSI SQL Compatibility Enhancements

- Support for ANSI SQL INTERVAL types (SPARK-27790)
- New type coercion syntax rules in ANSI mode (SPARK-34246)
- Support LATERAL subqueries (SPARK-34382)
- ANSI mode: IntegralDivide throws an exception on overflow (SPARK-35152)
- ANSI mode: Check for overflow in Average (SPARK-35955)
- Block count(table.*) to follow ANSI standard and other SQL engines (SPARK-34199)
- Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE (SPARK-30789)
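
The IntegralDivide item is easier to picture with numbers. For 64-bit integers, the only quotient that does not fit is the minimum value divided by -1. The sketch below emulates that check in plain Python; the function name `ansi_integral_divide` is hypothetical and the code only illustrates the behaviour, it is not Spark's implementation.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

def ansi_integral_divide(dividend: int, divisor: int) -> int:
    """Integer division on 64-bit values that raises instead of wrapping."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    # Truncate toward zero (Python's // operator floors instead).
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    if not INT64_MIN <= quotient <= INT64_MAX:
        # Reached only for INT64_MIN divided by -1, whose result is 2**63.
        raise OverflowError("integral divide overflow")
    return quotient

print(ansi_integral_divide(7, -2))  # -3
try:
    ansi_integral_divide(INT64_MIN, -1)
except OverflowError as exc:
    print("raised:", exc)
```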

Performance

Query compilation latency
- Support traversal pruning in transform/resolve functions and their call sites (SPARK-35042)
- Improve the performance of mapChildren and withNewChildren methods (SPARK-34989)
- Improve the performance of type coercion rules (SPARK-35103)

Query optimization
- Remove redundant aggregates in the Optimizer (SPARK-33122)
- Push down limit through Project with Join (SPARK-34622)
- Push down limit for LEFT SEMI and LEFT ANTI join (SPARK-36404, SPARK-34514)
- Push down limit through WINDOW when partition spec is empty (SPARK-34575)
- Use a relative cost comparison function in the CBO (SPARK-34922)
- Cardinality estimation of union, sort, and range operator (SPARK-33411)
- Only push down LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join (SPARK-34081)
- UnwrapCastInBinaryComparison support In/InSet predicate (SPARK-35316)
- Subexpression elimination enhancements (SPARK-35448)
- Keep necessary stats after partition pruning (SPARK-34119)
- Decouple bucket filter pruning and bucket table scan (SPARK-32985)
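
The list above does not spell out what a "relative cost comparison" looks like, so the following is only a sketch of one way to build it: compare two plans by the ratios of their row counts and sizes, combined as a weighted geometric mean. Ratios are unitless, so neither rows nor bytes dominates just because of its scale. The function name `better_than`, the 0.7 weight, and the sample numbers are assumptions, and the exact formula Spark uses may differ.

```python
def better_than(card_a: float, size_a: float,
                card_b: float, size_b: float,
                card_weight: float = 0.7) -> bool:
    """Return True if plan A looks cheaper than plan B.

    card_* are estimated row counts, size_* are estimated sizes in bytes.
    card_weight (assumed value) sets how much the row-count ratio matters.
    """
    if card_b == 0 or size_b == 0:
        # Cannot form a ratio against an empty plan; keep plan B.
        return False
    relative_rows = card_a / card_b
    relative_size = size_a / size_b
    # A combined ratio below 1 means A is cheaper overall.
    return (relative_rows ** card_weight) * (relative_size ** (1 - card_weight)) < 1

# Plan A has half the rows but 1.5x the bytes of plan B.
print(better_than(500, 3e6, 1000, 2e6))   # True
print(better_than(1000, 2e6, 500, 3e6))   # False
```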

Query execution
- Adaptive query execution
  - Enable adaptive query execution by default (SPARK-33679)
  - Support Dynamic Partition Pruning (DPP) in AQE when the join is broadcast hash join at the beginning or there is no reused broadcast exchange (SPARK-34168, SPARK-35710)
  - Optimize skew join before coalescing shuffle partitions (SPARK-35447)
  - Support AQE side shuffled hash join formula using rule (SPARK-35282)
  - Support AQE side broadcast hash join threshold (SPARK-35264)
  - Allow custom plugin for AQE cost evaluator (SPARK-35794)
- Enable Zstandard buffer pool by default (SPARK-34340, SPARK-34390)
- Add code-gen for all join types of sort-merge join (SPARK-34705)
- Whole plan exchange and subquery reuse (SPARK-29375)
- Broadcast nested loop join improvement (SPARK-34706)
- Allow concurrent writers for writing dynamic partitions and bucket table (SPARK-26164)
- Improve performance of processing FETCH_PRIOR in Spark Thrift server (SPARK-33655)
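
Several of the AQE items above surface as configuration keys. The sketch below collects a few of them in a Python dict and renders them as `--conf` arguments for `spark-submit`. The values and the script name `app.py` are placeholders, not recommendations, and the helper `to_submit_args` is hypothetical.

```python
from typing import Dict, List

# Illustrative values only; tune them for the actual workload.
aqe_conf: Dict[str, str] = {
    # Already the default in 3.2.0; listed here to make it explicit.
    "spark.sql.adaptive.enabled": "true",
    # AQE-side broadcast hash join threshold (SPARK-35264).
    "spark.sql.adaptive.autoBroadcastJoinThreshold": "20m",
    # AQE-side shuffled hash join setting (SPARK-35282).
    "spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold": "64m",
}

def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a config dict into a flat list of --conf key=value arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

# app.py is a hypothetical application script.
print(" ".join(["spark-submit"] + to_submit_args(aqe_conf) + ["app.py"]))
```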

Connector Enhancements

Parquet
- Upgrade Apache Parquet used to version 1.12.1 (SPARK-36726)
- Support column index in Parquet vectorized reader (SPARK-34289)
- Add new parquet data source options to control datetime rebasing in read (SPARK-34377)
- Read parquet unsigned types that are stored as int32 physical type in parquet (SPARK-34817)
- Read Parquet unsigned int64 logical type that is stored as signed int64 physical type to decimal(20, 0) (SPARK-34786)
- Handle column index when using vectorized Parquet reader (SPARK-34859)
- Improve Parquet In filter pushdown (SPARK-32792)
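
The decimal(20, 0) in the unsigned int64 item follows from the value range: the largest unsigned 64-bit value, 2^64 - 1, has 20 decimal digits and does not fit in a signed 64-bit integer. The sketch below shows the reinterpretation in plain Python; the function name `uint64_from_signed` is hypothetical and this is not the Parquet reader's code.

```python
from decimal import Decimal

def uint64_from_signed(raw: int) -> Decimal:
    """Reinterpret a signed 64-bit value as unsigned and return a Decimal."""
    if not -(2 ** 63) <= raw < 2 ** 63:
        raise ValueError("value does not fit in a signed 64-bit integer")
    # Masking with 2**64 - 1 maps negative values to their unsigned bit pattern.
    return Decimal(raw & 0xFFFFFFFFFFFFFFFF)

print(uint64_from_signed(-1))  # 18446744073709551615 (20 digits)
print(uint64_from_signed(42))  # 42
```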

ORC
- Upgrade Apache ORC used to version 1.6.11 (SPARK-36482)
- Support Apache ORC forced positional evolution (SPARK-32864)
- Support nested column in ORC vectorized reader (SPARK-34862)
- Support ZSTD, LZ4 compression in ORC data source (SPARK-33978, SPARK-35612)
- Set the list of read columns in the task configuration to reduce reading of ORC data (SPARK-35783)
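
Forced positional evolution means that reader columns are matched to file columns by position instead of by name. The difference shows up when a column has been renamed in the table schema. The following pure-Python sketch contrasts the two strategies; the function names and sample data are hypothetical, and this is not the ORC reader's actual logic.

```python
from typing import Dict, List, Optional

def match_by_name(file_cols: List[str], row: List[object],
                  reader_cols: List[str]) -> Dict[str, Optional[object]]:
    """Look each reader column up by name; missing names read as None."""
    by_name = dict(zip(file_cols, row))
    return {col: by_name.get(col) for col in reader_cols}

def match_by_position(row: List[object],
                      reader_cols: List[str]) -> Dict[str, Optional[object]]:
    """Ignore file column names; the i-th reader column takes the i-th value."""
    return {col: (row[i] if i < len(row) else None)
            for i, col in enumerate(reader_cols)}

file_cols = ["id", "name"]          # column names stored in the file
row = [1, "alice"]
reader_cols = ["id", "full_name"]   # "name" was renamed in the table schema

print(match_by_name(file_cols, row, reader_cols))   # {'id': 1, 'full_name': None}
print(match_by_position(row, reader_cols))          # {'id': 1, 'full_name': 'alice'}
```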
Avro
- Upgrade Apache Avro used to version 1.10.2 (SPARK-34778)
- Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal" (SPARK-26836)
- Add new Avro datasource options to control datetime rebasing in read (SPARK-34404)
- Adding support for user provided schema url in Avro (SPARK-34416)
- Add support for positional Catalyst-to-Avro schema matching (SPARK-34365)
JSON
- Upgrade Jackson used to version 2.12.3 (SPARK-35550)
- Allow JSON data sources to write non-ASCII characters as codepoints (SPARK-35047)
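
Writing non-ASCII characters "as codepoints" means emitting `\uXXXX` escapes so that the output contains only ASCII. The Spark option name is not given in this list, so the sketch below only shows the same effect with Python's standard `json` module; the sample record is made up.

```python
import json

record = {"city": "café"}

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
as_codepoints = json.dumps(record, ensure_ascii=True)
# ensure_ascii=False writes the characters unchanged.
as_utf8 = json.dumps(record, ensure_ascii=False)

print(as_codepoints)  # {"city": "caf\u00e9"}
print(as_utf8)        # {"city": "café"}
```

Both strings decode to the same record; only the on-disk representation differs.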
CSV
- Upgrade univocity-parsers to 2.9.1 (SPARK-33940)
JDBC
- Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone (SPARK-34357)
- Calculate more precise partition stride in JDBCRelation (SPARK-34843)
- Support refreshKrb5Config option in JDBC data sources (SPARK-35226)
Hive Metastore support filter by NOT IN (SPARK-34538)