2. Spark SQL, Dataset, and DataFrame Basics

Spark SQL is Spark's core module for structured data processing, exposing both a SQL interface and the Dataset API. Users can perform data operations through SQL queries or compute with strongly typed Datasets. A DataFrame is a particular kind of Dataset, analogous to a table in a relational database. Spark SQL can load data from many sources, such as JSON files, and interoperates with other programming languages. Users can also create and register temporary views, run SQL queries, and convert RDDs into DataFrames. In addition, Spark SQL supports scalar and aggregate functions, enabling complex data analysis.

1. Overview

Spark SQL is the Spark module for structured data processing.

  1. The Spark SQL APIs give Spark more information about the structure of the data and of the computation being performed.
  2. Internally, Spark SQL uses this extra information to perform additional optimizations.
  3. There are several ways to interact with Spark SQL, including SQL and the Dataset API.
  4. When computing a result, the same execution engine is used, regardless of which API/language you use to express the computation (see the brief sketch below).
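
As an illustration of point 4, the same aggregation can be written either in SQL or with the DataFrame API, and both are run by the same execution engine. This is only a sketch; it assumes a SparkSession named spark, a DataFrame df over the people data, and a temporary view named people, all of which are set up in section 2 below.

// Equivalent computations: one expressed in SQL, one with the DataFrame API
Dataset<Row> viaSql = spark.sql("SELECT age, count(*) AS cnt FROM people GROUP BY age");
Dataset<Row> viaApi = df.groupBy("age").count();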

1.1. SQL

One use of Spark SQL is to execute SQL queries.
When SQL is run from within another programming language, the results are returned as a Dataset/DataFrame.
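
A minimal sketch of this in Java, assuming a SparkSession named spark (section 2.3 shows a fuller example against a registered view):

Dataset<Row> greeting = spark.sql("SELECT 'hello' AS greeting");
greeting.show();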

1.2. Datasets and DataFrames

A Dataset is a distributed collection of data.
The Dataset interface, added in Spark 1.6, provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
A DataFrame is a Dataset organized into named columns (Dataset<Row>).
It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
We refer to a Dataset<Row> as a DataFrame.
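
The relationship between the two is visible directly in the types. The following sketch is illustrative only; it assumes a SparkSession named spark, the people.json sample file, and the Person bean introduced in section 2.4:

// A typed Dataset of Person objects
Dataset<Person> typed = spark.read()
  .json("examples/src/main/resources/people.json")
  .as(Encoders.bean(Person.class));

// Dropping the static type gives back a DataFrame, i.e. a Dataset<Row>
Dataset<Row> untyped = typed.toDF();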

2. Getting Started
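
The code samples in this section assume an existing SparkSession named spark. A minimal sketch of creating one (the application name is arbitrary):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .getOrCreate();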

2.1. Creating a DataFrame

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

2.2. Untyped Dataset Operations (aka DataFrame Operations)

// The col(...) calls below require: import static org.apache.spark.sql.functions.col;

// Print the schema in a tree format
df.printSchema();

// Select only the "name" column
df.select("name").show();

// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();

// Select people older than 21
df.filter(col("age").gt(21)).show();

// Count people by age
df.groupBy("age").count().show();

For a complete list of the operations that can be performed on a Dataset, refer to the API documentation.

2.3. Running SQL Queries Programmatically

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people");

// The sql(...) method on SparkSession returns the result as a Dataset<Row>
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();

2.4. Creating Datasets

Datasets use specialized Encoders to serialize objects, which allows Spark to perform many operations (such as filtering, sorting, and hashing) without deserializing the bytes back into objects.
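
The examples below, and those in section 2.5, use a Person JavaBean that this document never defines. A sketch of what it is assumed to look like, modeled on the Spark examples (age is declared as long because that is how Spark reads JSON numbers by default):

import java.io.Serializable;

public class Person implements Serializable {
  private String name;
  private long age;

  public String getName() { return name; }
  public void setName(String name) { this.name = name; }

  public long getAge() { return age; }
  public void setAge(long age) { this.age = age; }
}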

// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);

// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder
);

// Encoders for most common types are provided in class Encoders
Encoder<Long> longEncoder = Encoders.LONG();
Dataset<Long> primitiveDS = spark.createDataset(Arrays.asList(1L, 2L, 3L), longEncoder);
Dataset<Long> transformedDS = primitiveDS.map(
    (MapFunction<Long, Long>) value -> value + 1L,
    longEncoder);
transformedDS.collect(); 

// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = "examples/src/main/resources/people.json";
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);

2.5. Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets:

  • The first method uses reflection to infer the schema of an RDD that contains objects of a specific type.
  • The second method is through a programmatic interface that lets you construct a schema and then apply it to an existing RDD.

2.5.1. Inferring the Schema Using Reflection

Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The BeanInfo, obtained via reflection, defines the schema of the table.

// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(line -> {
    String[] parts = line.split(",");
    Person person = new Person();
    person.setName(parts[0]);
    person.setAge(Integer.parseInt(parts[1].trim()));
    return person;
  });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");

// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    stringEncoder);

// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.<String>getAs("name"),
    stringEncoder);

2.5.2. Programmatically Specifying the Schema

A Dataset<Row> can be created programmatically in three steps:

  1. Create an RDD of Rows (RDD<Row>) from the original RDD;
  2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD created in step 1;
  3. Apply that schema to the RDD<Row> via the createDataFrame method.

// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
  String[] attributes = record.split(",");
  return RowFactory.create(attributes[0], attributes[1].trim());
});

// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");

// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    Encoders.STRING());

3. Scalar Functions

A scalar function is a function that returns a single value per row, as opposed to an aggregate function, which returns a single value for a group of rows. Spark SQL supports a variety of built-in scalar functions, and it also supports user-defined scalar functions.
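
A sketch of a user-defined scalar function in Java; the function name plusOne and the column names are made up for illustration, and the people view is the one registered in section 2.3:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a scalar UDF that adds one to its (possibly null) argument
spark.udf().register("plusOne",
    (UDF1<Long, Long>) x -> x == null ? null : x + 1,
    DataTypes.LongType);

// Use it like any built-in scalar function
spark.sql("SELECT name, plusOne(age) AS age_next_year FROM people").show();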

4. Aggregate Functions

An aggregate function returns a single value for a group of rows. The built-in aggregate functions cover common aggregations such as count(), count_distinct(), avg(), max(), min(), and so on. Users are not limited to the predefined aggregate functions and can create their own; for more details, see the documentation on user-defined aggregate functions.
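
A sketch of the built-in aggregates used through the DataFrame API, assuming the DataFrame df from section 2.1 and the people view from section 2.3:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.max;

// Built-in aggregations over the whole DataFrame
df.agg(count(col("name")), avg(col("age")), max(col("age"))).show();

// The same aggregations expressed in SQL against the temporary view
spark.sql("SELECT count(name), avg(age), max(age) FROM people").show();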
