The REFRESH TABLE tableName problem caused by InMemoryFileIndex file caching in Spark

Latest recommended article published on 2023-09-20 11:48:15

Original

License: CC 4.0 BY-SA

Tags: #spark #cache #big-data

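This article discusses the error that Spark sometimes raises while processing data, the one that recommends running REFRESH TABLE tableName. It is usually caused by InMemoryFileIndex caching file listings. Reading and writing the same table within a single JVM does not trigger the error, because a refresh mechanism invalidates the file cache. The key parameters are spark.sql.hive.filesourcePartitionFileCacheSize, spark.sql.hive.manageFilesourcePartitions and spark.sql.metadataCacheTTLSeconds, all of which affect how the file cache behaves. In some cases, such as a read that follows an insert, a cache refresh is triggered and the error is avoided.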

Background

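In Spark, a job sometimes fails with an error whose message ends in "running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved." One less obvious cause of this error is that InMemoryFileIndex caches in memory the list of files that need to be scanned.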
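To make the failure mode concrete, here is a minimal sketch in plain Scala (2.13 assumed, JDK only, no Spark dependency). The class CachedListing and the object StaleListingDemo are hypothetical names invented for this illustration; they are not Spark's InMemoryFileIndex, they only imitate the idea of listing files once and reusing the in-memory result after the files on disk have changed.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._
import scala.util.Try

// Hypothetical stand-in for a file index: lists a directory once
// and keeps the result in memory until refresh() is called.
final class CachedListing(dir: Path) {
  private var cached: Option[List[Path]] = None

  private def listDir(): List[Path] = {
    val stream = Files.list(dir)
    try stream.iterator().asScala.toList.sortBy(_.toString)
    finally stream.close()
  }

  // Returns the cached listing, populating it on first use.
  def files: List[Path] = {
    if (cached.isEmpty) cached = Some(listDir())
    cached.get
  }

  // Drops the cached listing so the next call to `files` re-lists the directory.
  def refresh(): Unit = {
    cached = None
  }
}

object StaleListingDemo {
  // Returns (read with stale cache failed, read after refresh succeeded).
  def run(): (Boolean, Boolean) = {
    val dir = Files.createTempDirectory("table")
    val oldFile = Files.write(dir.resolve("part-0"), "a".getBytes("UTF-8"))
    val index = new CachedListing(dir)
    index.files // populate the cache while part-0 still exists

    // Simulate another writer replacing the table's files.
    Files.delete(oldFile)
    Files.write(dir.resolve("part-1"), "b".getBytes("UTF-8"))

    def readAll(): Try[List[String]] =
      Try(index.files.map(p => new String(Files.readAllBytes(p), "UTF-8")))

    val staleFails = readAll().isFailure // cached listing still points at part-0
    index.refresh()
    val freshOk = readAll().toOption.contains(List("b"))
    (staleFails, freshOk)
  }
}
```

In Spark itself, the counterpart of refresh() in this sketch is the command named in the error message, REFRESH TABLE tableName, or equivalently spark.catalog.refreshTable("tableName") from the Dataset API.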

Analysis

When files are scanned, the class most involved is CatalogFileIndex, whose filterPartitions method creates the InMemoryFileIndex:

def filterPartitions(filters: Seq[Expression]): InMemoryFileIndex = {
    if (table.partitionColumnNames.nonEmpty) {
      val startTime = System.nanoTime()
      val selectedPartitions = sparkSession.sessionState.catalog.listPartitionsByFilter(
        table.identifier, filters)
      val partitions = selectedPartitions.map { p =>
        val path = new Path(p.location)
        val fs = path.getFileSystem(hadoopConf)
        PartitionPath(
          p.toRow(partitionSchema, sparkSession.sessionState.conf.sessionLocalTimeZone),
          path.makeQualified(fs.getUri, fs.getWorkingDirectory))
      }
      val