一、Python
For more usage, see:
"Data transformation in pyspark" – Fesgrome's blog, CSDN
Common imports:
import pyspark.sql.functions as F
from pyspark.sql.functions import row_number, rank, col, explode, percentile_approx
from pyspark.sql.window import Window
PS: row_number() gives rows with the same score distinct ranks (no ties);
rank() gives rows with the same score the same rank (ties share a rank).
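A minimal sketch of the difference, assuming a SparkSession named spark and a made-up scores DataFrame:
df = spark.createDataFrame([("a", 90), ("b", 90), ("c", 80)], ["sku_id", "score"])
w = Window.orderBy(F.desc("score"))
df.select("sku_id", "score",
          F.row_number().over(w).alias("row_number"),  # 1, 2, 3 -- tied scores still get distinct ranks
          F.rank().over(w).alias("rank")               # 1, 1, 3 -- tied scores share rank, next rank skips
         ).show()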
1、Add & modify columns:
data.withColumn("new_col_name", expr)   # equivalent to data.select('*', expr.alias('new_col_name'))
data.withColumn("new_col_name", F.lit("default_value"))   # new column with a constant default
data.withColumnRenamed("old_col_name","new_col_name")
data.selectExpr("old_col_name as new_col_name")
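A short worked sketch of the four operations above, assuming a SparkSession named spark and a hypothetical df with sku_id and score columns:
df = spark.createDataFrame([("a", 90), ("b", 80)], ["sku_id", "score"])
df2 = (df
       .withColumn("score_x2", F.col("score") * 2)      # new column from an expression
       .withColumn("source", F.lit("offline"))          # new column with a constant default
       .withColumnRenamed("score", "raw_score"))        # rename an existing column
df3 = df.selectExpr("sku_id", "score as raw_score")     # same rename, selectExpr style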
2、Sorting
data.orderBy(cols, ascending=False)
data.orderBy(F.desc('dt'))   # descending
data.orderBy('dt')   # ascending, or data.orderBy(F.asc('dt'))
data.orderBy(F.col('dt').desc())   # descending
data.orderBy(F.col('dt'))   # ascending, or data.orderBy(F.col('dt').asc())
Multiple columns, mixing descending and ascending:
data.orderBy(['count','column4'],ascending=[0,1])
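A runnable sketch of the mixed-order form, assuming a SparkSession named spark and made-up dt/cnt columns (0 = descending, 1 = ascending):
df = spark.createDataFrame(
    [("2024-01-02", 3), ("2024-01-01", 5), ("2024-01-01", 1)], ["dt", "cnt"])
df.orderBy(["dt", "cnt"], ascending=[0, 1]).show()
# dt descending first, then cnt ascending within rows sharing the same dt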
3、Window functions [groupBy and groupby are interchangeable]
1）Rank by a column
data.withColumn("rank_score", F.row_number().over(Window.orderBy(F.desc("score")))) \
.where("rank_score<={0}".format(target_quantity))
2）Sorting after count()
# data.groupBy("sku_id").count()
#=data.groupBy("sku_id").agg(F.count('sku_id'))
#= select sku_id,count(1) as count from xx group by sku_id
data.groupBy("sku_id").count().orderBy(F.desc("count"))
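Per the equivalence in the comments above, the agg form lets you pick the column name instead of the auto-generated "count"; a sketch with a hypothetical alias cnt:
agg_df = (data.groupBy("sku_id")
              .agg(F.count("sku_id").alias("cnt"))   # same result as .count(), with a chosen name
              .orderBy(F.desc("cnt")))               # most frequent sku_id first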