Spark Window Functions for DataFrames and SQL

Reposted from http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html

Introduced in Spark 1.4, Spark window functions improved the expressiveness of Spark DataFrames and Spark SQL. With window functions, you can easily calculate a moving average or cumulative sum, or reference a value in a previous row of a table. Window functions allow you to do many common calculations with DataFrames, without having to resort to RDD manipulation.

 

Aggregates and UDFs vs. Window Functions

Window functions are complementary to existing DataFrame operations: aggregates, such as sum and avg, and UDFs. To review, aggregates calculate one result, such as a sum or average, for each group of rows, whereas UDFs calculate one result for each row based only on data in that row. In contrast, window functions calculate one result for each row based on a window of rows. For example, in a moving average, you calculate for each row the average of the rows surrounding the current row; this can be done with window functions.
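
To make the contrast concrete, here is a minimal sketch (not from the original post); it assumes a hypothetical DataFrame df with columns "group" and "value", and the column names it adds are purely illustrative:

// Sketch only: df, "group", and "value" are hypothetical placeholders.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Aggregate: one result per group of rows.
df.groupBy("group").agg(avg("value"))

// UDF: one result per row, using only that row's data.
val doubleIt = udf((x: Double) => x * 2)
df.withColumn("valueDoubled", doubleIt(df("value")))

// Window function: one result per row, computed over a window of related rows.
df.withColumn("groupAvg", avg(df("value")).over(Window.partitionBy("group")))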

 

Moving Average Example

Let us dive right into the moving average example. In this example dataset, there are two customers who have spent different amounts of money each day.

 

// Building the customer DataFrame. All examples are written in Scala with Spark 1.6.1, but the same can be done in Python or SQL.
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00),
                                    ("Alice", "2016-05-03", 45.00),
                                    ("Alice", "2016-05-04", 55.00),
                                    ("Bob", "2016-05-01", 25.00),
                                    ("Bob", "2016-05-04", 29.00),
                                    ("Bob", "2016-05-06", 27.00))).
                               toDF("name", "date", "amountSpent")

 

// Import the window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)

 

In this window spec, the data is partitioned by customer, and each customer’s data is ordered by date. The window frame is defined as starting at -1 (one row before the current row) and ending at 1 (one row after the current row), for a total of 3 rows in the sliding window.

 

// Calculate the moving average.
customers.withColumn("movingAvg",
                     avg(customers("amountSpent")).over(wSpec1)).show()

 

This code adds a new column, “movingAvg”, by applying the avg function on the sliding window defined in the window spec:

 

name   date      amountSpent  movingAvg
Alice  5/1/2016  50           47.5
Alice  5/3/2016  45           50
Alice  5/4/2016  55           50
Bob    5/1/2016  25           27
Bob    5/4/2016  29           27
Bob    5/6/2016  27           28
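
The same moving average can also be written as a SQL window expression. A minimal sketch, assuming that sqlContext in the spark-shell is a HiveContext (which, as far as I know, Spark 1.6 requires for window functions in SQL) and registering the DataFrame as a temporary table first:

// Sketch only: assumes sqlContext is a HiveContext.
customers.registerTempTable("customers")
sqlContext.sql("""
  SELECT name, `date`, amountSpent,
         AVG(amountSpent) OVER (PARTITION BY name ORDER BY `date`
                                ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS movingAvg
  FROM customers""").show()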


 

Window function and Window Spec definition

As shown in the above example, there are two parts to applying a window function: (1) specifying the window function, such as avg in the example, and (2) specifying the window spec, or wSpec1 in the example. For (1), you can find a full list of the window functions here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
 You can use functions listed under “Aggregate Functions” and “Window Functions”.

 

For (2) specifying a window spec, there are three components: partition by, order by, and frame (see the sketch after this list).

  1. “Partition by” defines how the data is grouped; in the above example, it was by customer. You have to specify a reasonable grouping because all data within a group will be collected to the same machine. Ideally, the DataFrame has already been partitioned by the desired grouping.
  2. “Order by” defines how rows are ordered within a group; in the above example, it was by date.
  3. “Frame” defines the boundaries of the window with respect to the current row; in the above example, the window ranged between the previous row and the next row.
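
To make the three components explicit, here is a small sketch that builds the same window spec step by step; the intermediate variable names are just illustrative:

val byCustomer     = Window.partitionBy("name")        // 1. partition by
val byCustomerDate = byCustomer.orderBy("date")        // 2. order by
val threeRowFrame  = byCustomerDate.rowsBetween(-1, 1) // 3. frame
// threeRowFrame is equivalent to wSpec1 above.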



 

Cumulative Sum

Next, let us calculate the cumulative sum of the amount spent per customer.

 

// Window spec: the frame ranges from the beginning of the partition (Long.MinValue) to the current row (0).
val wSpec2 = Window.partitionBy("name").orderBy("date").rowsBetween(Long.MinValue, 0)

// Create a new column which calculates the sum over the defined window frame.
customers.withColumn("cumSum",
                     sum(customers("amountSpent")).over(wSpec2)).show()

 

name   date      amountSpent  cumSum
Alice  5/1/2016  50           50
Alice  5/3/2016  45           95
Alice  5/4/2016  55           150
Bob    5/1/2016  25           25
Bob    5/4/2016  29           54
Bob    5/6/2016  27           81
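
In SQL terms, rowsBetween(Long.MinValue, 0) corresponds to the frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. A minimal sketch of the same cumulative sum in Spark SQL, under the same assumptions as the earlier SQL example (HiveContext, customers registered as a temporary table):

sqlContext.sql("""
  SELECT name, `date`, amountSpent,
         SUM(amountSpent) OVER (PARTITION BY name ORDER BY `date`
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumSum
  FROM customers""").show()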


 

Data from previous row

In the next example, we want to see the amount spent by the customer in their previous visit.

 

// Window spec. No need to specify a frame in this case.
val wSpec3 = Window.partitionBy("name").orderBy("date")

// Use the lag function to look backwards by one row.
customers.withColumn("prevAmountSpent",
                     lag(customers("amountSpent"), 1).over(wSpec3)).show()

 

name   date      amountSpent  prevAmountSpent
Alice  5/1/2016  50           null
Alice  5/3/2016  45           50
Alice  5/4/2016  55           45
Bob    5/1/2016  25           null
Bob    5/4/2016  29           25
Bob    5/6/2016  27           29
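
A natural follow-up, not covered in the original post, is the change in spending since the previous visit; a minimal sketch, where "spentDiff" is just an illustrative column name (it is null for each customer's first visit):

customers.withColumn("spentDiff",
                     customers("amountSpent") - lag(customers("amountSpent"), 1).over(wSpec3)).show()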


 

Rank

In this example, we want to know the order of a customer’s visit (whether this is their first, second, or third visit).

 

// The rank function returns what we want.
customers.withColumn("rank", rank().over(wSpec3)).show()

 

name   date      amountSpent  rank
Alice  5/1/2016  50           1
Alice  5/3/2016  45           2
Alice  5/4/2016  55           3
Bob    5/1/2016  25           1
Bob    5/4/2016  29           2
Bob    5/6/2016  27           3
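
A side note: rank leaves gaps in the numbering after ties, whereas the related functions dense_rank and row_number (also in org.apache.spark.sql.functions) behave differently. A minimal sketch comparing the three over the same window spec; "denseRank" and "rowNumber" are illustrative column names:

// rank() repeats ranks for ties and then skips; dense_rank() repeats without skipping;
// row_number() assigns a unique sequential number to every row.
customers.withColumn("rank", rank().over(wSpec3))
         .withColumn("denseRank", dense_rank().over(wSpec3))
         .withColumn("rowNumber", row_number().over(wSpec3))
         .show()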


 

Conclusion

 

I hope these examples have helped you understand Spark’s window functions. There is more functionality that was not covered here. To learn more, please see the Databricks article on this topic: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

If you have any questions, you can contact me through the QQ groups R语言&大数据分析 (456726635) or Python & Spark大数 (636866908).
