Ridiculously Handy SparkSQL Window Functions

License: CC 4.0 BY-SA

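This article covers window functions in Spark SQL: the ranking functions rank, dense_rank, and row_number, plus the bucketing and distribution functions ntile and percent_rank. It also shows how to use window functions for aggregation and analysis, such as computing department sales and cumulative sales, and lists the analytic functions cume_dist, first_value, and last_value.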


To use window functions, users need to mark that a function is used as a window function by either
Adding an OVER clause after a supported function in SQL, e.g. avg(revenue) OVER (…); or
(SparkSQL) Calling the over method on a supported function in the DataFrame API, e.g. rank().over(…).

For example: rank() over(partition by ... order by ...) ranks
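
To make the OVER clause concrete, here is a minimal sketch that runs a windowed average. It uses Python's built-in sqlite3 module as a stand-in for a Spark session (an assumption made so the example is self-contained; it needs an SQLite build with window function support, 3.25 or later). The table `product_revenue` and its rows are hypothetical. The SQL itself uses only standard window syntax and is expected to behave the same way in Spark SQL.

```python
import sqlite3

# SQLite stands in for Spark here; table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_revenue (product TEXT, category TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO product_revenue VALUES (?, ?, ?)",
    [("a", "x", 10), ("b", "x", 30), ("c", "y", 5)],
)

# Unlike GROUP BY, avg(...) OVER (...) keeps every input row and
# attaches the average of that row's category to it.
rows = conn.execute(
    """
    SELECT product, category, revenue,
           avg(revenue) OVER (PARTITION BY category) AS avg_revenue
    FROM product_revenue
    ORDER BY product
    """
).fetchall()

for row in rows:
    print(row)
```

Each product row survives in the output, which is the main difference from a plain GROUP BY aggregate.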

Ranking:

	rank -- ranking with gaps after ties
	dense_rank -- consecutive ranking with no gaps
	row_number
	percent_rank
	ntile 
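
The difference between the ranking functions listed above is easiest to see side by side on data with a tie. The sketch below again uses Python's sqlite3 module as a stand-in for Spark (an assumption, requiring SQLite 3.25 or later); the table `emp_sales` and its values are hypothetical.

```python
import sqlite3

# SQLite stands in for Spark here; table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_sales (dept TEXT, empno INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO emp_sales VALUES (?, ?, ?)",
    [("D1", 101, 500), ("D1", 102, 500), ("D1", 103, 300), ("D1", 104, 100)],
)

# One named window shared by all four ranking functions.
ranked = conn.execute(
    """
    SELECT sales,
           rank()         OVER w AS rnk,
           dense_rank()   OVER w AS dense_rnk,
           row_number()   OVER w AS row_num,
           percent_rank() OVER w AS pct_rnk
    FROM emp_sales
    WINDOW w AS (PARTITION BY dept ORDER BY sales DESC)
    ORDER BY row_num
    """
).fetchall()

for row in ranked:
    print(row)
```

The two rows tied at 500 share rank 1. rank then jumps to 3, dense_rank continues with 2, row_number keeps counting 1, 2, 3, 4 regardless of ties, and percent_rank is (rank - 1) / (rows in partition - 1).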
	
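	ntile(expr) over([partition_clause] order_by_clause)
		Think of it as distributing the ordered rows as evenly as possible into the given number (expr) of buckets and assigning each row its bucket number.
		If the rows cannot be divided evenly, the lower-numbered buckets receive the extra rows, and bucket sizes differ by at most 1.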
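
The uneven case can be checked with a small sketch: 7 rows split into 3 buckets. As before, Python's sqlite3 module stands in for Spark (an assumption, requiring SQLite 3.25 or later), and the table `scores` is hypothetical.

```python
import sqlite3

# SQLite stands in for Spark here; table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (val INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(v,) for v in range(1, 8)])

# 7 rows into 3 buckets: the extra row goes to the lowest-numbered bucket.
buckets = [
    bucket
    for (_val, bucket) in conn.execute(
        "SELECT val, ntile(3) OVER (ORDER BY val) AS bucket FROM scores ORDER BY val"
    )
]
print(buckets)
```

Bucket 1 ends up with three rows and buckets 2 and 3 with two each, so the sizes differ by at most one.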
	
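	When using rank() over(), add nulls last so that null values are ranked at the end. In Oracle, null sorts as the largest value, so a descending order would otherwise put nulls first; Spark SQL already defaults to nulls last for descending order, so spelling it out there mainly makes the intent explicit.
		rank() over(partition by empno order by sales desc nulls last)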
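
A minimal sketch of ranking with an explicit nulls last. Python's sqlite3 module stands in for Spark (an assumption; the NULLS LAST keyword needs SQLite 3.30 or later), and the table `emp_sales` with its rows is hypothetical.

```python
import sqlite3

# SQLite stands in for Spark here; table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_sales (empno INTEGER, month INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO emp_sales VALUES (?, ?, ?)",
    [(1, 1, 300), (1, 2, None), (1, 3, 100)],
)

# The month with a missing sales figure is ranked after all real values.
rows = conn.execute(
    """
    SELECT month, sales,
           rank() OVER (PARTITION BY empno ORDER BY sales DESC NULLS LAST) AS rnk
    FROM emp_sales
    ORDER BY rnk
    """
).fetchall()

for row in rows:
    print(row)
```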

aggregate:

	count
	max
	min
	sum
	avg
    
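	With small data volumes you can compute, for example, each department's current-month and cumulative sales directly (the query below keys on empno):
	  -- distinct collapses the per-row window results to one row per (empno, month)
	  select distinct 
	  		empno,month
	  		,sum(sales) over(partition by empno,month) sum_sales              -- total for the current month
	  		,sum(sales) over(partition by empno order by month) acc_sum_sales -- running total up to and including the current month
	  from Table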
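
To see what the query above returns, here is a runnable sketch on a tiny data set. Python's sqlite3 module stands in for Spark (an assumption, requiring SQLite 3.25 or later), and the table name `emp_sales` and its rows are hypothetical replacements for the placeholder `Table`.

```python
import sqlite3

# SQLite stands in for Spark here; table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_sales (empno INTEGER, month INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO emp_sales VALUES (?, ?, ?)",
    [(1, 1, 100), (1, 1, 50), (1, 2, 200), (2, 1, 70)],
)

# Same shape as the query in the text: monthly total plus running total.
rows = conn.execute(
    """
    SELECT DISTINCT
           empno, month,
           sum(sales) OVER (PARTITION BY empno, month) AS sum_sales,
           sum(sales) OVER (PARTITION BY empno ORDER BY month) AS acc_sum_sales
    FROM emp_sales
    ORDER BY empno, month
    """
).fetchall()

for row in rows:
    print(row)
```

With an ORDER BY and no explicit frame, the default frame covers all rows up to the current row plus its peers with the same month, so both month-1 rows of employee 1 carry the same running total and distinct can collapse them into one.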

analytic:

	cume_dist
	first_value
	last_value
	lag
	lead
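
The analytic functions listed above can be tried together in one query. This is a sketch under the same assumptions as before: Python's sqlite3 module stands in for Spark (SQLite 3.25 or later), and the table `emp_sales` with its rows is hypothetical.

```python
import sqlite3

# SQLite stands in for Spark here; table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_sales (empno INTEGER, month INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO emp_sales VALUES (?, ?, ?)",
    [(1, 1, 100), (1, 2, 200), (1, 3, 200), (1, 4, 400)],
)

rows = conn.execute(
    """
    SELECT month, sales,
           lag(sales)         OVER w AS prev_sales,   -- previous month, NULL on the first row
           lead(sales)        OVER w AS next_sales,   -- next month, NULL on the last row
           first_value(sales) OVER w AS first_sales,
           -- The default frame stops at the current row, so last_value
           -- needs an explicit frame to reach the end of the partition.
           last_value(sales)  OVER (PARTITION BY empno ORDER BY month
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                             AND UNBOUNDED FOLLOWING) AS last_sales,
           -- Fraction of rows in the partition with sales <= this row's sales.
           cume_dist()        OVER (PARTITION BY empno ORDER BY sales) AS cume
    FROM emp_sales
    WINDOW w AS (PARTITION BY empno ORDER BY month)
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```

Without the explicit frame, last_value would simply return the current row's own value, which is a common source of confusion.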

References:

Introducing Window Functions in Spark SQL: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Oracle查询优化改写 技巧与案例 (Oracle Query Optimization and Rewriting: Techniques and Cases), by 有教无类 and 落落