Construct a DataFrame
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val data = Array(
  List("Category A", 100, "This is category A"),
  List("Category B", 120, "This is category B"),
  List("Category C", 150, "This is category C"))

// Create a schema for the DataFrame
val schema = StructType(
  StructField("Category", StringType, true) ::
  StructField("Count", IntegerType, true) ::
  StructField("Description", StringType, true) :: Nil)

// Convert each inner list to a Row
val rows = data.map(t => Row(t(0), t(1), t(2))).toList

// Create an RDD of Rows
val rdd = spark.sparkContext.parallelize(rows)

// Create the DataFrame from the RDD and the schema
val df = spark.createDataFrame(rdd, schema)
println(df.schema)
df.show()
+----------+-----+------------------+
| Category|Count| Description|
+----------+-----+------------------+
|Category A| 100|This is category A|
|Category B| 120|This is category B|
|Category C| 150|This is category C|
+----------+-----+------------------+
Rename a single column:
val df2 = df.withColumnRenamed("Category", "category_new")
df2.show()
+------------+-----+------------------+
|category_new|Count| Description|
+------------+-----+------------------+
| Category A| 100|This is category A|
| Category B| 120|This is category B|
| Category C| 150|This is category C|
+------------+-----+------------------+
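`withColumnRenamed` handles one column at a time. If you need to rename only a subset of columns, one pattern (a sketch, not from the original post; the `renames` mapping and `applyRenames` helper are hypothetical names) is to compute the new name list in plain Scala and hand it to `toDF`:

```scala
object RenameSubset {
  // Hypothetical old -> new name mapping; columns absent from the map keep their name
  val renames = Map("Category" -> "category_new", "Count" -> "count_new")

  // Rewrite a column-name list according to the mapping
  def applyRenames(columns: Seq[String]): Seq[String] =
    columns.map(c => renames.getOrElse(c, c))

  def main(args: Array[String]): Unit = {
    val cols = Seq("Category", "Count", "Description")
    println(applyRenames(cols))  // List(category_new, count_new, Description)
    // Against a DataFrame you would then call: df.toDF(applyRenames(df.columns).toSeq: _*)
  }
}
```

Keeping the renaming logic as a pure function makes it easy to test without a SparkSession.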
Rename all columns (lowercase each column name and append _new):
// Rename all columns
val newColumnNames = df.columns.map(c => c.toLowerCase + "_new")
val df3 = df.toDF(newColumnNames: _*)
df3.show()
+------------+---------+------------------+
|category_new|count_new| description_new|
+------------+---------+------------------+
| Category A| 100|This is category A|
| Category B| 120|This is category B|
| Category C| 150|This is category C|
+------------+---------+------------------+
You can also replace dots or other characters in the column names, for example:
c.toLowerCase().replaceAll("\\.", "_") + "_new"
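Putting that cleanup rule into a standalone function makes it easy to verify before applying it to a DataFrame. A minimal sketch (the `sanitize` helper and the dotted column names below are hypothetical, not from the original post):

```scala
object SanitizeColumns {
  // Cleanup rule from the post: lowercase, dots -> underscores, "_new" suffix
  def sanitize(name: String): String =
    name.toLowerCase.replaceAll("\\.", "_") + "_new"

  def main(args: Array[String]): Unit = {
    // Hypothetical raw column names containing dots
    val raw = Array("Category.Id", "Count", "Description.Text")
    println(raw.map(sanitize).mkString(", "))
    // -> category_id_new, count_new, description_text_new
    // Against a DataFrame you would then call: df.toDF(df.columns.map(sanitize): _*)
  }
}
```

Dots in column names are worth removing because Spark interprets them as nested-field accessors unless the name is escaped with backticks.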
Source: https://kontext.tech/column/spark/527/scala-change-data-frame-column-names-in-spark
This post shows how to create a DataFrame in Scala/Spark and then rename its columns. First, a DataFrame is built from a list of data and a matching schema. Next, a single column is renamed with `withColumnRenamed`, changing 'Category' to 'category_new'. Finally, all columns are renamed by mapping over the original names to append the '_new' suffix and applying the new names with `toDF`.