Hive nested queries: ERROR operation.ExecuteStatement: Error operating EXECUTE_STATEMENT: org.apache.spark.sql

This post describes an error in Apache Spark SQL where the column `datediff(checkout_date,checkin_date)` fails to resolve in a nested query. The fix is to give the expression an alias and reference that alias in the outer query.

The likely cause is a missing alias; getting into the habit of aliasing computed columns avoids this kind of mistake.

Error type: ERROR operation.ExecuteStatement: Error operating EXECUTE_STATEMENT: org.apache.spark.sql.AnalysisException: cannot resolve '.......' given input columns

with zibiao as (
    SELECT  dt
            ,wrapper_name
            ,uid
            ,datediff(checkout_date,checkin_date)
    FROM    表1
)

SELECT  dt
        ,wrapper_name
        ,uid
        ,datediff(checkout_date,checkin_date)
FROM    zibiao

In the code above, the CTE zibiao selects from 表1 and uses datediff(checkout_date,checkin_date) to compute the gap in days between the two dates. That expression text, however, is not a name you can simply re-type to reference the resulting column.

When the outer query over zibiao repeats datediff(checkout_date,checkin_date) as if it were the column name, Spark parses it as a new function call and looks for checkout_date and checkin_date among the CTE's output columns, which produces this error:

ERROR operation.ExecuteStatement: Error operating EXECUTE_STATEMENT: org.apache.spark.sql.AnalysisException: cannot resolve 'checkout_date' given input columns
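
As a quick check (a sketch reusing the same CTE and table names from the example above), previewing the CTE shows what the computed column is actually called: Spark auto-generates a name from the expression, typically something like `datediff(checkout_date, checkin_date)`, which is why re-typing the bare expression in the outer query does not resolve.

WITH zibiao AS (
    SELECT  dt
            ,wrapper_name
            ,uid
            ,datediff(checkout_date,checkin_date)
    FROM    表1
)
SELECT * FROM zibiao LIMIT 5   -- the result header reveals the auto-generated column name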

Once you understand one point, the problem solves itself: in the result set the computed column is indeed labelled datediff(checkout_date,checkin_date), but writing datediff(checkout_date,checkin_date) again in the nested query does not refer to it. The simplest fix is to give the expression an alias and use that alias in the outer query:

with zibiao as (
    SELECT  dt
            ,wrapper_name
            ,uid
            ,datediff(checkout_date,checkin_date)   `取的别名`
    FROM    表1
)

SELECT  dt
        ,wrapper_name
        ,uid
        ,`取的别名`
FROM    zibiao
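
As a slightly fuller sketch (the alias stay_days is an illustrative name; the table 表1 is reused from the example above), a descriptive alias also lets the outer query filter or aggregate on the computed value directly:

with zibiao as (
    SELECT  dt
            ,wrapper_name
            ,uid
            ,datediff(checkout_date,checkin_date) AS stay_days   -- descriptive alias
    FROM    表1
)

SELECT  dt
        ,wrapper_name
        ,uid
        ,stay_days
FROM    zibiao
WHERE   stay_days >= 2    -- the alias can be referenced anywhere in the outer query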

A related column-naming error, also surfaced through Kyuubi, occurs when a CREATE TABLE AS SELECT statement produces two output columns with the same name:

org.apache.kyuubi.KyuubiSQLException: Error operating ExecuteStatement: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name oppty_spark_id in the table definition.
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:346)
    at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectBase.run(CreateHiveTableAsSelectCommand.scala:64)
    at org.apache.spark.sql.hive.execution.OptimizedCreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:141)
    ...
    at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.executeStatement(ExecuteStatement.scala:81)
    ...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name oppty_spark_id in the table definition.
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:863)
    at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:763)
    at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:287)
    ... 36 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name oppty_spark_id in the table definition.
    at org.apache.hadoop.hive.ql.metadata.Table.validateColumns(Table.java:979)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:221)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:836)
    ... 47 more
The error reported here is:

```
org.apache.kyuubi.KyuubiSQLException: Error operating ExecuteStatement: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name oppty_spark_id in the table definition.
```

### 🔍 Error analysis

The core message of this exception is:

> **"Duplicate column name oppty_spark_id in the table definition."**

It means that the Hive table being created ends up with **two columns named `oppty_spark_id`** in its definition, and Hive does not allow two columns with the same name in one table.

This typically happens in the following situations:

- A `CREATE TABLE AS SELECT ...` (CTAS) statement whose SELECT result contains several fields with the same name.
- Several columns in the source query are explicitly or implicitly named `oppty_spark_id`.
- Conflicting columns from a multi-table JOIN are selected without being renamed, so `oppty_spark_id` from both sides is included.

---

### ✅ Solutions

#### ✅ Option 1: find and rename the duplicated columns in the SELECT (recommended)

If you are running SQL like this:

```sql
CREATE TABLE target_table AS
SELECT
    t1.oppty_spark_id,
    t2.oppty_spark_id,
    t1.name,
    t2.status
FROM table_a t1
JOIN table_b t2 ON t1.id = t2.id;
```

then both `t1.oppty_spark_id` and `t2.oppty_spark_id` produce a column named `oppty_spark_id`, which causes the conflict.

✅ The correct approach is to use aliases so the names are unique:

```sql
CREATE TABLE target_table AS
SELECT
    t1.oppty_spark_id AS oppty_spark_id_srcA,
    t2.oppty_spark_id AS oppty_spark_id_srcB,
    t1.name,
    t2.status
FROM table_a t1
JOIN table_b t2 ON t1.id = t2.id;
```

With a unique name for every column, the statement no longer fails.

---

#### ✅ Option 2: select only the columns you need

If you do not actually need both copies of `oppty_spark_id`, keep just one:

```sql
CREATE TABLE target_table AS
SELECT
    t1.oppty_spark_id,
    t1.name,
    t2.status
FROM table_a t1
JOIN table_b t2 ON t1.id = t2.id;
```

---

#### ✅ Option 3: fix the column names on the Spark DataFrame before writing

If you write to the Hive table from a Spark program, for example:

```python
df.write.mode("overwrite").saveAsTable("target_table")
```

then `df` must not contain duplicate column names. Check and rename them before writing.

##### Example (PySpark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("FixDuplicateColumn").enableHiveSupport().getOrCreate()

# Assume df is your original DataFrame
df = spark.sql("""
    SELECT t1.oppty_spark_id, t2.oppty_spark_id, t1.name, t2.status
    FROM table_a t1
    JOIN table_b t2 ON t1.id = t2.id
""")

# Inspect the column names (for debugging)
print("Original columns:", df.columns)

# Attempt to rename the duplicated columns
df_fixed = df.withColumnRenamed("oppty_spark_id", "oppty_spark_id_1") \
             .withColumnRenamed("oppty_spark_id", "oppty_spark_id_2")
# ⚠️ Note: withColumnRenamed only matches on the current name, so the calls cannot target the two copies separately

# Another attempt: name the columns explicitly through select
df_correct = df.select(
    col("oppty_spark_id").alias("oppty_spark_id_srcA"),
    col("oppty_spark_id").alias("oppty_spark_id_srcB"),  # ❌ still wrong: the two identically named fields cannot be told apart
)
```

⚠️ Both attempts above still fail, because once two columns share a name Spark cannot distinguish them.

✅ So resolve the ambiguity in the SQL itself:

```python
df = spark.sql("""
    SELECT
        t1.oppty_spark_id AS oppty_spark_id_a,
        t2.oppty_spark_id AS oppty_spark_id_b,
        t1.name,
        t2.status
    FROM table_a t1
    JOIN table_b t2 ON t1.id = t2.id
""")

# Write to the Hive table
df.write.mode("overwrite").saveAsTable("target_table")
```

---

### 🛠 How to troubleshoot this kind of problem

1. **Print the DataFrame's column names**:
   ```python
   print(df.columns)
   ```
   and check for duplicates.

2. **Preview the result with a LIMIT in Hive/Spark SQL before creating the table**:
   ```sql
   SELECT ... FROM join_tables LIMIT 5;
   ```
   and check whether any output columns share a name.

3. **Raise the Spark log level to DEBUG** and inspect the table structure that is actually generated.

4. **Use DESCRIBE to inspect the existing table structure** and confirm whether a naming conflict already exists.

---

### 💡 Summary

| Problem | Cause |
|------|------|
| `Duplicate column name oppty_spark_id` | Several fields in the SELECT clause end up named `oppty_spark_id` |

| Fix | Notes |
|--------|------|
| Rename columns with `AS` aliases | The most direct and effective method |
| Avoid building the table straight from a multi-JOIN `SELECT *` | Easily pulls in duplicate columns |
| Rename columns in Spark before writing | Keeps DataFrame column names unique |
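
To make troubleshooting step 2 concrete (a sketch; `table_a`, `table_b`, and the join keys are the same illustrative names used above), previewing the join before the CTAS exposes the duplicated output name immediately:

```sql
-- Preview the join output before running CTAS; the result header shows
-- two columns both named oppty_spark_id, which is exactly what Hive rejects.
SELECT  t1.oppty_spark_id,
        t2.oppty_spark_id,
        t1.name,
        t2.status
FROM    table_a t1
JOIN    table_b t2 ON t1.id = t2.id
LIMIT   5;
```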