Table of Contents
.withColumnRenamed: rename a column, returning a new DataFrame
Column.astype(dataType) and .cast(dataType): type conversion
Column.asc: sort the DataFrame by a column in ascending order, usually with df.orderBy
Column.between(lowerBound, upperBound): whether a field falls within a given range
Column.contains(other): whether the value contains a substring; returns a Column, so use filter or [] to select the matching rows
Column.endswith(other): whether the value ends with a given string; returns a Column, so use filter or [] to select the matching rows
Column.isNotNull(): keep the rows where the column is not null
Column.isin(*cols): return the rows whose value is among the given values
Column.like(other): return the rows matching the pattern; rlike does regex matching
Sample data: build a simple DataFrame
# `sc` is an existing SparkContext (e.g. spark.sparkContext)
rdd = sc.parallelize([("Sam", 28, 88.52, "M"),
                      ("Flora", 28, 90.55, "F"),
                      ("Mey", 1, None, "M"),
                      ("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test.show()
+-----+---+-----+---+
| name|age|score|sex|
+-----+---+-----+---+
|  Sam| 28|88.52|  M|
|Flora| 28|90.55|  F|
|  Mey|  1| null|  M|
|Chery|  7|80.23|  F|
+-----+---+-----+---+
.withColumn: add a new column
Adds a column; note that the original DataFrame is not changed.
test2 = test.withColumn('big_score', test['score'] + 1)
test2.show()
Output:
+-----+---+-----+---+---------+
| name|age|score|sex|big_score|
+-----+---+-----+---+---------+
|  Sam| 28|88.52|  M|    89.52|
|Flora| 28|90.55|  F|    91.55|
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|    81.23|
+-----+---+-----+---+---------+
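Two points above are easy to check without a Spark cluster: withColumn returns a new DataFrame rather than mutating the old one, and arithmetic on a null stays null. A plain-Python sketch of the same behavior (the explicit None check stands in for Spark's automatic null propagation; this is a local analogue, not PySpark code):

```python
# Rows mirroring the sample DataFrame, as plain Python dicts.
rows = [{"name": "Sam", "score": 88.52},
        {"name": "Flora", "score": 90.55},
        {"name": "Mey", "score": None},
        {"name": "Chery", "score": 80.23}]

# Build a new list instead of mutating `rows`, just as withColumn
# returns a new DataFrame and leaves the original untouched.
big = [{**r, "big_score": None if r["score"] is None else r["score"] + 1}
       for r in rows]

assert big[2]["big_score"] is None   # null + 1 stays null
assert "big_score" not in rows[0]    # the original rows are unchanged
```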
1) Add a constant column: withColumn combined with functions.lit(value)
from pyspark.sql import functions
test2 = test.withColumn("add_constant", functions.lit(10))
test2.show()
Output:
+-----+---+-----+---+------------+
| name|age|score|sex|add_constant|
+-----+---+-----+---+------------+
|  Sam| 28|88.52|  M|          10|
|Flora| 28|90.55|  F|          10|
|  Mey|  1| null|  M|          10|
|Chery|  7|80.23|  F|          10|
+-----+---+-----+---+------------+
2) Compute a new column from an existing one
from pyspark.sql import functions
test_3 = test.withColumn("name_length", functions.length(test.name))
test_3.show()
Output:
+-----+---+-----+---+-----------+
| name|age|score|sex|name_length|
+-----+---+-----+---+-----------+
|  Sam| 28|88.52|  M|          3|
|Flora| 28|90.55|  F|          5|
|  Mey|  1| null|  M|          3|
|Chery|  7|80.23|  F|          5|
+-----+---+-----+---+-----------+
3) Use a SQL expression inside selectExpr
test_4 = test.selectExpr(["name", "age", "score", "sex", "length(name) as name_length"])
test_4.show()
Output:
+-----+---+-----+---+-----------+
| name|age|score|sex|name_length|
+-----+---+-----+---+-----------+
|  Sam| 28|88.52|  M|          3|
|Flora| 28|90.55|  F|          5|
|  Mey|  1| null|  M|          3|
|Chery|  7|80.23|  F|          5|
+-----+---+-----+---+-----------+
Reference: "pyspark: adding a new column to a DataFrame"
.drop: drop a given column
Drops a column and returns a new DataFrame; note that the original DataFrame's columns are unchanged. To change the original df, assign the result back to it.
test.drop('age').show()  # the returned df no longer has the column
test.show()  # age still exists in the original df
test = test.drop('age')
test.show()  # age is gone
.alias: give a column an alias
Operates on an input column; used nested inside select and the like.
test.select('name', test.age.alias('age_alias')).show()
## test's own column age is not changed
Output:
+-----+---------+
| name|age_alias|
+-----+---------+
|  Sam|       28|
|Flora|       28|
|  Mey|        1|
|Chery|        7|
+-----+---------+
.withColumnRenamed: rename a column, returning a new DataFrame
Renames a column; note that the original DataFrame is unchanged. The result is a new DataFrame containing the renamed column along with all the other, untouched columns.
rdd = sc.parallelize([("Sam", 28, 88.52, "M"),
                      ("Flora", 28, 90.55, "F"),
                      ("Mey", 1, None, "M"),
                      ("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test.withColumnRenamed("sex", "gender").show()  # new df with the renamed column plus the other columns
# test.show()  # still has the original column names
Output:
+-----+---+-----+------+
| name|age|score|gender|
+-----+---+-----+------+
|  Sam| 28|88.52|     M|
|Flora| 28|90.55|     F|
|  Mey|  1| null|     M|
|Chery|  7|80.23|     F|
+-----+---+-----+------+
DataFrame.fillna: fill missing values in given columns
Fills null values; pass a dict to specify the fill value per column. Returns a new df.
test2 = test.fillna({"score": 61})
test2.show()
Output:
+-----+---+-----+---+
| name|age|score|sex|
+-----+---+-----+---+
|  Sam| 28|88.52|  M|
|Flora| 28|90.55|  F|
|  Mey|  1| 61.0|  M|
|Chery|  7|80.23|  F|
+-----+---+-----+---+
Equivalent to: test2 = test.na.fill({"score": 61})
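The per-column behavior of the dict form (only the named columns are filled, nulls elsewhere stay null) can be sketched in plain Python; fill_na here is a hypothetical helper written for illustration, not part of PySpark:

```python
def fill_na(rows, fills):
    """Replace None per column, like df.fillna({...}); returns new rows."""
    return [{k: fills[k] if v is None and k in fills else v
             for k, v in r.items()}
            for r in rows]

rows = [{"name": "Mey", "age": 1, "score": None},
        {"name": None, "age": 7, "score": 80.23}]
filled = fill_na(rows, {"score": 61})

assert filled[0]["score"] == 61    # named column filled
assert filled[1]["name"] is None   # other columns left alone
assert rows[0]["score"] is None    # original rows unchanged
```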
DataFrame.replace: replace matching values across columns
Note that matching values in every column are replaced, and the values passed in a single replace call must share one data type. The original df is unchanged. Use subset to restrict the replacement to specific columns. (The sample data is re-created here so that 68 appears in both an int and a float column.)
rdd = sc.parallelize([("Sam", 28, 88.23, "M"),
                      ("Flora", 68, 68.0, "F"),
                      ("Mey", 1, None, "M"),
                      ("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test2 = test.replace({"M": "Male"})
test2.replace({68: 58}).show()  # every value equal to 68 (int or float) in any column becomes 58 (58.0)
Output:
+-----+---+-----+----+
| name|age|score| sex|
+-----+---+-----+----+
|  Sam| 28|88.23|Male|
|Flora| 58| 58.0|   F|
|  Mey|  1| null|Male|
|Chery|  7|80.23|   F|
+-----+---+-----+----+
Use subset to restrict which columns the replacement applies to:
rdd = sc.parallelize([("Sam", 28, 88.23, "Sam"),
                      ("Flora", 68, 68.0, "F"),
                      ("Mey", 1, None, "M"),
                      ("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test3 = test.replace({"Sam": "aareplace"}, subset=['name', 'sex'])  # only applied to these columns
test3.show()
Output:
+---------+---+-----+---------+
|     name|age|score|      sex|
+---------+---+-----+---------+
|aareplace| 28|88.23|aareplace|
|    Flora| 68| 68.0|        F|
|      Mey|  1| null|        M|
|    Chery|  7|80.23|        F|
+---------+---+-----+---------+
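The subset behavior can be mirrored in plain Python; replace_values below is a hypothetical sketch of the semantics (exact-match replacement, optionally restricted to some columns), not PySpark code:

```python
def replace_values(rows, mapping, subset=None):
    """Replace exact value matches, optionally only in `subset` columns."""
    return [{k: mapping.get(v, v) if (subset is None or k in subset) else v
             for k, v in r.items()}
            for r in rows]

rows = [{"name": "Sam", "sex": "Sam", "note": "Sam"}]
out = replace_values(rows, {"Sam": "aareplace"}, subset=["name", "sex"])

assert out[0]["name"] == "aareplace"
assert out[0]["sex"] == "aareplace"
assert out[0]["note"] == "Sam"   # outside the subset, left as-is
```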
Column.astype(dataType) and .cast(dataType): type conversion
# Type conversion, usually combined with withColumn to produce a new column. .astype and .cast behave identically on a PySpark DataFrame column. Example:
rdd = sc.parallelize([("Sam", 28, 88.23, "M"),
                      ("Flora", 68, 68.0, "F"),
                      ("Mey", 1, None, "M"),
                      ("Chery", 7, 80.23, "F")])
test = rdd.toDF(["name", "age", "score", "sex"])
test = test.withColumn('score_int', test.score.astype('int'))
test.drop('score').show()  # drop the original column if you no longer want it
Output:
+-----+---+---+---------+
| name|age|sex|score_int|
+-----+---+---+---------+
|  Sam| 28|  M|       88|
|Flora| 68|  F|       68|
|  Mey|  1|  M|     null|
|Chery|  7|  F|       80|
+-----+---+---+---------+
test = test.withColumn('score_int', test.score.cast('int'))  # .cast works the same way
test.drop('score').show()
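Note that casting a double to 'int' truncates the fractional part (toward zero) rather than rounding, which is why 80.23 above became 80. Python's built-in int() truncates the same way, so the effect is easy to check locally:

```python
# cast('int') truncates toward zero, like Python's int() on a float.
assert int(88.23) == 88
assert int(80.99) == 80   # truncation, not rounding
assert int(-1.7) == -1    # toward zero, not floor
```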
Column.asc: sort the DataFrame by a column in ascending order, usually with df.orderBy
test.orderBy(test.age.asc()).show()
+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|       80|
|  Sam| 28|88.23|  M|       88|
|Flora| 68| 68.0|  F|       68|
+-----+---+-----+---+---------+
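One detail worth knowing: with asc(), Spark sorts nulls first by default (asc_nulls_last() is available when you want them at the end). The age column above has no nulls, but the rule is easy to mimic in plain Python with a sort key that ranks None lowest (a local sketch, not Spark):

```python
scores = [88.23, 68.0, None, 80.23]
# (is not None, value) ranks None before every real value, like asc() in Spark.
ordered = sorted(scores, key=lambda s: (s is not None, s if s is not None else 0.0))
assert ordered == [None, 68.0, 80.23, 88.23]
```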
Column.between(lowerBound, upperBound): whether a field falls within a given range
test[test.age.between(5, 10)].show()  # returns the matching rows
print(type(test.age.between(5, 10)))  # a Column
Output:
+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+
<class 'pyspark.sql.column.Column'>
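between is inclusive on both ends: it evaluates to (col >= lowerBound) AND (col <= upperBound). The boundary behavior in plain Python terms:

```python
def between(x, lo, hi):
    # same semantics as Column.between: both bounds included
    return lo <= x <= hi

assert between(5, 5, 10)    # lower bound itself matches
assert between(10, 5, 10)   # upper bound itself matches
assert not between(4, 5, 10)

ages = [28, 68, 1, 7]
assert [a for a in ages if between(a, 5, 10)] == [7]
```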
Column.contains(other): whether the value contains a substring; returns a Column, so use filter or [] to select the matching rows
test.filter(test.name.contains('e')).show()
# equivalent to
# test[test.name.contains('e')].show()
Output:
+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+
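contains is a plain substring test, the same as Python's `in` on strings, which is why both 'Mey' and 'Chery' match 'e' above:

```python
names = ["Sam", "Flora", "Mey", "Chery"]
hits = [n for n in names if "e" in n]   # mirrors test.name.contains('e')
assert hits == ["Mey", "Chery"]
```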
Column.endswith(other): whether the value ends with a given string; returns a Column, so use filter or [] to select the matching rows
test[test.name.endswith('am')].show()
Output:
+----+---+-----+---+---------+
|name|age|score|sex|score_int|
+----+---+-----+---+---------+
| Sam| 28|88.23|  M|       88|
+----+---+-----+---+---------+
Column.isNotNull(): keep the rows where the column is not null
test[test.score.isNotNull()].show()  # rows where score is not null
+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Sam| 28|88.23|  M|       88|
|Flora| 68| 68.0|  F|       68|
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+
Column.isin(*cols): return the rows whose value is among the given values
test[test.name.isin(['Flora', 'Chery'])].show()
test[test.age.isin([1, 7])].show()
Output:
+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|Flora| 68| 68.0|  F|       68|
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+

+-----+---+-----+---+---------+
| name|age|score|sex|score_int|
+-----+---+-----+---+---------+
|  Mey|  1| null|  M|     null|
|Chery|  7|80.23|  F|       80|
+-----+---+-----+---+---------+
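isin is ordinary membership testing, the same as Python's `in` against a collection:

```python
rows = [("Sam", 28), ("Flora", 68), ("Mey", 1), ("Chery", 7)]
by_name = [r for r in rows if r[0] in {"Flora", "Chery"}]   # mirrors name.isin([...])
by_age = [r for r in rows if r[1] in {1, 7}]                # mirrors age.isin([...])

assert by_name == [("Flora", 68), ("Chery", 7)]
assert by_age == [("Mey", 1), ("Chery", 7)]
```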
Column.like(other): return the rows matching the pattern; rlike does regex matching
test[test.name.like('Chery')].show()  # like without % wildcards requires an exact match
# rlike does regex matching, and a match anywhere in the string counts
test[test.name.rlike('am')].show()  # matches 'Sam', since it contains 'am'
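The difference between the two is easiest to see outside Spark: like without % wildcards behaves as an exact match (re.fullmatch is the closest plain-Python analogue), while rlike counts a regex match anywhere in the string (re.search). A stdlib sketch of that distinction:

```python
import re

names = ["Sam", "Flora", "Mey", "Chery"]

# like('am'): no wildcards, so the whole value must equal 'am' -- nothing matches
like_hits = [n for n in names if re.fullmatch("am", n)]

# rlike('am'): a match anywhere counts, so 'Sam' matches
rlike_hits = [n for n in names if re.search("am", n)]

assert like_hits == []
assert rlike_hits == ["Sam"]
```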