疑问:max+struct的作用?
test.map(x => tester(x._1, x._2, x._3, x._4, x._5))
.toDS
.select($"ac", $"sk", struct($"num1", struct($"time", $"num1")).as("grp"))
.groupBy($"ac", $"sk")
.agg(max($"grp")).show(false)
解答
Struct结构用于按字典顺序排序,从左往右地按字段比较。在上述例子中:
- 首先会比较第一顺位的字段
num1
-
- 如果
num1
不相等,那么会返回更大的num1
- 如果
-
- 如果
num1
相等,那么会跳到第二顺位struct($"time", $"num1")
继续比较,先看第二顺位的第一个字段time
- 如果
- 第二顺位的第二个字段
num1
,和第一顺位都是num1
,而num1
已经比较过了,所以第二顺位的第二个字段num1
不会影响到最终的排序结果
原文链接🔗——
The StructTypes are compared lexicographically - field by field, from left to right and all fields have to recursively orderable. So in your case:
- It will compare the first element of the struct.
-
- If the elements are not equal it will return the struct with higher value.
-
- Otherwise it will proceed to the point 2.
- Since the second field is complex as well, it will repeat procedure
from point 1 this time comparing time fields first.
Note that nested num1 can be evaluated on if top level num1 fields are equal, therefore it doesn’t affect the ordering in practice.
实战
首局/首条/最后一局/最后一条
一般来说,对于这类查询的方式有两种
- 先通过MAX/MIN聚合函数找到最大/最小的那条数据,再通过该结果和原表做联合查询得到结果
- 通过窗口函数row_number来查询结果
在Spark中对于struct的MAX/MIN函数会按照struct的第一个字段进行判断,因此提供了一个tricky的做法,MAX(STRUCT(xxxx)),这样的好处是,这是无需排序或关联,有更好的运行效率.
但这个方式只有Spark SQL有效
为了更直观的了解三个方法的差别,下面举一个例子
ts | account_id | kills |
---|---|---|
1 | 1 | 2 |
2 | 1 | 2 |
对于表match,希望得到account_id第一场比赛的记录,也就是输出
ts | account_id | kills |
---|---|---|
1 | 1 | 2 |
- 方案1:
SELECT *
FROM
(
SELECT account_id,
MIN(ts) ts
FROM match
GROUP BY account_id
) t1
JOIN
match
USING(account_id,ts)
- 方案2:
SELECT *
FROM
(
SELECT *,
row_number() over(partition by account_id order by ts) rnum
FROM match
) t1
WHERE rnum = 1
- 使用Spark tricky:
SELECT account_id,
first.*
FROM
(
SELECT account_id,
MIN(STRUCT(ts, kills)) first
FROM match
GROUP BY account_id
) t1