A First Look at Spark SQL's Catalog Schema Caching

This post looks at how Spark SQL caches a table's schema in its catalog: when the underlying table structure changes (e.g., a column is added or dropped), the cached schema is not updated automatically, which can lead to SQL query errors.


The conclusion up front: the table schema cached in the Spark SQL catalog is generally not updated automatically.
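A quick way to inspect the schema Spark has cached for a registered table (a sketch — the table name `c1v` comes from the experiment below, and the exact output format varies across Spark versions):

```sql
-- Show the columns Spark currently believes the table has;
-- for a JDBC-backed table this reflects the catalog's cached schema,
-- not the live schema in PostgreSQL.
DESCRIBE TABLE c1v;
```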

The experiment:

  1. In PostgreSQL, create a table t1 with a single column: c1 int
  2. Register this table in Spark SQL and query it
    1. ./bin/spark-sql --driver-class-path postgresql-42.7.1.jar --jars postgresql-42.7.1.jar
    2. spark-sql (default)> create table c1v using org.apache.spark.sql.jdbc options (url "jdbc:postgresql://localhost:5432/postgres", dbtable "public.t1", user 'postgres', password 'postgres');
    3. spark-sql (default)> select * from c1v;
    4. Result: one row
  3. In PostgreSQL, add a column c2 to t1 and insert a row (2, 2)
  4. Query again in Spark SQL
    1. spark-sql (default)> select * from c1v;
    2. Result: two rows, but no c2 column
  5. In PostgreSQL, drop column c1
  6. Query again in Spark SQL
    1. spark-sql (default)> select * from c1v;
    2. Result: error — column c1 does not exist
  7. The PostgreSQL query log confirms this: Spark keeps sending SELECT "c1" FROM public.wkt, i.e., Spark is entirely unaware of the schema changes above.
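For reference, the PostgreSQL-side steps above can be consolidated into the following sketch (table and column names are taken from the experiment; the concrete values inserted in step 1 are an assumption — the post only states that one row existed):

```sql
-- Step 1: create the table with a single column
CREATE TABLE t1 (c1 int);
INSERT INTO t1 VALUES (1);          -- hypothetical value; the post only says "one row"

-- Step 3: add a column and insert the row (2, 2)
ALTER TABLE t1 ADD COLUMN c2 int;
INSERT INTO t1 (c1, c2) VALUES (2, 2);

-- Step 5: drop the original column
ALTER TABLE t1 DROP COLUMN c1;
```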

The error reported by Spark:

spark-sql (default)> select * from c1v;
24/01/09 20:18:04 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.postgresql.util.PSQLException: ERROR: column "c1" does not exist
  Hint: Perhaps you meant to reference the column "wkt.c2".
  Position: 8
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2712)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2400)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:367)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
        at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:320)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
24/01/09 20:18:04 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5) (172.18.203.110 executor driver): org.postgresql.util.PSQLException: ERROR: column "c1" does not exist
  Hint: Perhaps you meant to reference the column "wkt.c2".
  Position: 8
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:
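One possible way out, as a hedged sketch rather than a verified fix: for a JDBC-backed table the schema is resolved when the table is registered, and REFRESH TABLE mainly invalidates cached data rather than re-deriving the schema from the source, so the straightforward approach is to drop and re-register the table so Spark re-reads the current schema from PostgreSQL:

```sql
-- Re-register the JDBC table so Spark picks up the current schema
DROP TABLE IF EXISTS c1v;
CREATE TABLE c1v USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://localhost:5432/postgres",
  dbtable "public.t1",
  user 'postgres',
  password 'postgres'
);
```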