PySpark DataFrame 的列操作

旺仔的算法coding笔记

已于 2023-01-31 19:29:52 修改

阅读量1.8k

点赞数 1

CC 4.0 BY-SA版权

分类专栏：大数据处理hadoop/spark 文章标签：大数据

于 2023-01-31 17:34:48 首次发布

本文链接：https://blog.youkuaiyun.com/wangwangstone/article/details/128818645

大数据处理hadoop/spark 专栏收录该内容

30 篇文章

订阅专栏

本文介绍了 PySpark 中 DataFrame 的常用操作方法，包括新增列、删除列、重命名列、缺失值填充、值替换、类型转换及排序等。通过具体实例展示了如何高效地处理数据。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

.withColumn 新增一列

.drop丢弃指定列

.alias重命名列名的别名

.withColumnRenamed 重命名列名，得到新的df

DataFrame.fillna 指定列的缺失值填充

DataFrame.replace 替换各列中对应的值

Column.astype(dataType) 和.cast (dataType)# 类型转换

column.asc 整个df按照某列的升序排序，常结合df.orderBy使用

Column.between(lowerBound, upperBound) 某字段是否在指定区间范围内

Column.contains(other) # 是否包含某个字符，直接返回的是列，用filter或者[]筛选出得到对应的df行即可。

Column.endswith(other) # 字段以什么结束的值，直接返回的是列，用filter或者[]筛选出得到对应的df行即可。

Column.isNotNull() # 筛选列中非空的，得到对应的内容

Column.isin(*cols) # 返回包含某些值的行

Column.like(other) # 返回含有关键词的行, rlike 会进行正则匹配，

如果你觉得下面的资料有用，还请点赞、收藏，支持下，你的鼓励是我继续整理的一个动力，非常感谢

样例数据，简单建一个DataFrame

rdd = sc.parallelize([("Sam", 28, 88.52, "M"),
                      ("Flora", 28, 90.55, "F"),
                      ("Mey", 1, None, "M"),
                      ("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test.show()

+-----+---+-----+---+
| name|age|score|sex|
+-----+---+-----+---+
|  Sam| 28|88.52|  M|
|Flora| 28|90.55|  F|
|  Mey|  1| null|  M|
|Chery|  7|80.23|  F|

.withColumn 新增一列

新增列,注意原dataframe并未改变

test2 = test.withColumn('big_score',test['score']+1)
test2.show()

结果显示：

+-----+---+-----+---+---------+
| name|age|score|sex|big_score|
+-----+---+-----+---+---------+
|  Sam| 28|88.52|  M|    89.52|
|Flora| 28|90.55|  F|    91.55|
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|    81.23|
+-----+---+-----+---+---------+

1）新增一列常数项 withColumn 结合 functions.lit(value)

from pyspark.sql import functions
test2 = test.withColumn("add_constant",functions.lit(10))
test2.show()

结果如：

+-----+---+-----+---+------------+
| name|age|score|sex|add_constant|
+-----+---+-----+---+------------+
|  Sam| 28|88.52|  M|          10|
|Flora| 28|90.55|  F|          10|
|  Mey|  1| null|  M|          10|
|Chery|  7|80.23|  F|          10|
+-----+---+-----+---+------------+

2）简单根据某列进行计算得到新的列

from pyspark.sql import functions
test_3 = test.withColumn("name_length",functions.length(test.name))
test_3.show()

结果如：

+-----+---+-----+---+-----------+
| name|age|score|sex|name_length|
+-----+---+-----+---+-----------+
|  Sam| 28|88.52|  M|          3|
|Flora| 28|90.55|  F|          5|
|  Mey|  1| null|  M|          3|
|Chery|  7|80.23|  F|          5|

3）使用selectExpr 里面的sql计算

test_4 = test.selectExpr(["name","age","score","sex","length(name) as name_length"])
test_4.show()

结果如：

+-----+---+-----+---+-----------+
| name|age|score|sex|name_length|
+-----+---+-----+---+-----------+
|  Sam| 28|88.52|  M|          3|
|Flora| 28|90.55|  F|          5|
|  Mey|  1| null|  M|          3|
|Chery|  7|80.23|  F|          5|
此处参考： pyspark 给dataframe增加新的一列

.drop丢弃指定列

删除列，返回一个新的dataframe，注意原dataframe的列并未改变。如果要改变原df，需要赋值覆盖。

test.drop('age').show() # 得到的新的df删除了列
print(test.show()) # 原df中age仍然存在。

test= test.drop('age')

test.show() # age被删除

.alias重命名列名的别名

输入列的操作，在select等里面嵌套着用

test.select('name',test.age.alias('gender')).show()
## test本身的列age并没有改变

输出：

+-----+------+
| name|gender|
+-----+------+
|  Sam|    28|
|Flora|    28|
|  Mey|     1|
|Chery|     7|
+-----+------+

.withColumnRenamed 重命名列名，得到新的df

重命名列名，原注意原dataframe并未改变,得到的包含了没有提及的其他列和新命名列的新的dataframe，

rdd = sc.parallelize([("Sam", 28, 88.52, "M"),
                      ("Flora", 28, 90.55, "F"),
                      ("Mey", 1, None, "M"),
                      ("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test.withColumnRenamed("sex","gender").show() #包含新列名和其他的列的新的df
# test.show() #里面仍然是原来的列名

结果：

+-----+---+-----+------+
| name|age|score|gender|
+-----+---+-----+------+
|  Sam| 28|88.52|     M|
|Flora| 28|90.55|     F|
|  Mey|  1| null|     M|
|Chery|  7|80.23|     F|
+-----+---+-----+------+

DataFrame.fillna 指定列的缺失值填充

空值填充操作，以字典形式传值，可以指定各列中空值的填充值。返回新的df
test2 = test.fillna({"score":61})
test2.show()

结果：

+-----+---+-----+---+
| name|age|score|sex|
+-----+---+-----+---+
|  Sam| 28|88.52|  M|
|Flora| 28|90.55|  F|
|  Mey|  1| 61.0|  M|
|Chery|  7|80.23|  F|
+-----+---+-----+---+

等价于： test2 = test.na.fill({"score":61})

DataFrame.replace 替换各列中对应的值

注意所有的列中符合条件的值都会被替换，一次replace里面传参的数据类型要一样。原来的df并未改变。可以通过subset指定替换关系应用在哪些列。

test2 = test.replace({"M":"Male"})
test2.replace({68:58}).show() #所有列中值大小为为68(包括int和float)的都被替换为了58(58.0)

结果：

+-----+---+-----+----+
| name|age|score| sex|
+-----+---+-----+----+
|  Sam| 28|88.23|Male|
|Flora| 58| 58.0|   F|
|  Mey|  1| null|Male|
|Chery|  7|80.23|   F|
+-----+---+-----+----+

通过subset指定替换关系应用在哪些列

rdd = sc.parallelize([("Sam", 28, 88.23, "Sam"),
("Flora", 68, 68.0, "F"),
("Mey", 1, None, "M"),
("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test3 = test.replace({"Sam":"aareplace"},subset=['name','sex']) ## 指定要应用的列
test3.show()

结果：

+---------+---+-----+---------+
|     name|age|score|      sex|
+---------+---+-----+---------+
|aareplace| 28|88.23|aareplace|
|    Flora| 68| 68.0|        F|
|      Mey|  1| null|        M|
|    Chery|  7|80.23|        F|
+---------+---+-----+---------+

Column.astype(dataType) 和.cast (dataType)# 类型转换

# 类型转换 ,常结合withColumn，得到新的列。.astype和.cast在pysparkDataframe中功能一样。示例如下

rdd= sc.parallelize([("Sam", 28, 88.23, "M"),
("Flora", 68, 68.0, "F"),
("Mey", 1, None, "M"),
("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])

test = test.withColumn('score_int',test.score.astype('int'))
test.drop('score').show()# 如果不想要原来的数据类型列，加个drop即可

结果：

+-----+---+---+---------+
| name|age|sex|score_int|
+-----+---+---+---------+
|  Sam| 28|  M|       88|
|Flora| 68|  F|       68|
|  Mey|  1|  M|     null|
|Chery|  7|  F|       80|
+-----+---+---+---------+

test = test.withColumn('score_int',test.score.cast('int')) # # 类型转换 ,常结合withColumn，得到新的列。
test.drop('score').show()# 如果不想要原来的数据类型列，加个drop即可

column.asc 整个df按照某列的升序排序，常结合df.orderBy使用

test.orderBy(test.age.asc()).show()

+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|       80|
|  Sam| 28|88.23|  M|       88|
|Flora| 68| 68.0|  F|       68|
+-----+---+-----+---+---------+

Column.between(lowerBound, upperBound) 某字段是否在指定区间范围内

test[test.age.between(5,10)].show() # 返回符合条件的行
print(type(test.age.between(5,10))) # 列

输出：

+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+

<class 'pyspark.sql.column.Column'>

Column.contains(other) # 是否包含某个字符，直接返回的是列，用filter或者[]筛选出得到对应的df行即可。

test.filter(test.name.contains('e')).show()
#等价于
# test[test.name.contains('e')].show()

结果：

+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+

Column.endswith(other) # 字段以什么结束的值，直接返回的是列，用filter或者[]筛选出得到对应的df行即可。

test[test.name.endswith('am')].show()

结果：

+----+---+-----+---+---------+
|name|age|score|sex|score_int|
+----+---+-----+---+---------+
| Sam| 28|88.23|  M|       88|
+----+---+-----+---+---------+

Column.isNotNull() # 筛选列中非空的，得到对应的内容

test[test.score.isNotNull()].show() # score这列非空的行

+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Sam| 28|88.23|  M|       88|
|Flora| 68| 68.0|  F|       68|
|Chery|  7|80.23|  F|       80|

Column.isin(*cols) # 返回包含某些值的行

print(test[test.name.isin(['Flora','Chery'])].show())
print(test[test.age.isin([1,7])].show())

结果：

+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|Flora| 68| 68.0|  F|       68|
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+

+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+