【Python笔记】Spark Sql max+structの作用

最新推荐文章于 2025-02-08 18:45:50 发布

阳光快乐普信男

最新推荐文章于 2025-02-08 18:45:50 发布

阅读量784

点赞数

分类专栏： Python笔记工作札记文章标签： python

本文链接：https://blog.youkuaiyun.com/qq_36056219/article/details/117421392

版权

Python笔记同时被 2 个专栏收录

52 篇文章

订阅专栏

工作札记

5 篇文章

订阅专栏

疑问：max+struct的作用？

test.map(x => tester(x._1, x._2, x._3, x._4, x._5))
    .toDS
    .select($"ac", $"sk", struct($"num1", struct($"time", $"num1")).as("grp"))
    .groupBy($"ac", $"sk")
    .agg(max($"grp")).show(false)

解答

Struct结构用于按字典顺序排序，从左往右地按字段比较。在上述例子中：

首先会比较第一顺位的字段num1

- 如果num1不相等，那么会返回更大的num1
- 如果num1相等，那么会跳到第二顺位struct($"time", $"num1")继续比较，先看第二顺位的第一个字段time

第二顺位的第二个字段num1，和第一顺位都是num1，而num1已经比较过了，所以第二顺位的第二个字段num1不会影响到最终的排序结果

原文链接🔗——

The StructTypes are compared lexicographically - field by field, from left to right and all fields have to recursively orderable. So in your case:

It will compare the first element of the struct.

- If the elements are not equal it will return the struct with higher value.
- Otherwise it will proceed to the point 2.

Since the second field is complex as well, it will repeat procedure
from point 1 this time comparing time fields first.

Note that nested num1 can be evaluated on if top level num1 fields are equal, therefore it doesn’t affect the ordering in practice.

实战

首局/首条/最后一局/最后一条

一般来说，对于这类查询的方式有两种

先通过MAX/MIN聚合函数找到最大/最小的那条数据，再通过该结果和原表做联合查询得到结果
通过窗口函数row_number来查询结果

在Spark中对于struct的MAX/MIN函数会按照struct的第一个字段进行判断，因此提供了一个tricky的做法，MAX(STRUCT(xxxx))，这样的好处是，这是无需排序或关联，有更好的运行效率.

但这个方式只有Spark SQL有效

为了更直观的了解三个方法的差别，下面举一个例子

ts	account_id	kills
1	1	2
2	1	2

对于表match，希望得到account_id第一场比赛的记录，也就是输出

ts	account_id	kills
1	1	2

方案1：

SELECT * 
FROM 
	(
		SELECT account_id, 
		 	   MIN(ts) ts 
		FROM match 
		GROUP BY account_id
	) t1 
JOIN 
	match 
USING(account_id,ts)

方案2：

SELECT * 
FROM 
	(
		SELECT *, 
			   row_number() over(partition by account_id order by ts) rnum 
	    FROM match
	 ) t1 
WHERE rnum = 1

使用Spark tricky：

SELECT account_id, 
	   first.* 
FROM 
	(
		SELECT account_id,
   			   MIN(STRUCT(ts, kills)) first 
   		FROM match 
   		GROUP BY account_id
   	) t1