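How to get the max/first/least row per group in SQL (repost, kept as a backup)

This article explains how to use SQL to select the minimum row, the maximum row, or a given number of rows from each group. It covers several methods, including self-joins, correlated subqueries, and user variables, with a detailed look at MySQL.

A few days ago, while working on an Oracle-related project, I ran into a problem: I needed to fetch the latest record of every group in a table in a single query. GROUP BY came to mind, and it gave me the timestamp of the latest record in each group, but I got stuck on the next step. I later found a solution online, and I am recording it here as a technical backup.

Original link: http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/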

 

PS: I am using Oracle, whereas the article targets MySQL. The first solution,

select f.type, f.variety, f.price
from (
   select type, min(price) as minprice
   from fruits group by type
) as x inner join fruits as f on f.type = x.type and f.price = x.minprice;

ran on Oracle once I removed the as and inner keywords (Oracle does not accept as in front of a table alias).
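As a rough check that dropping the two keywords does not change the meaning of the query, the sketch below runs both spellings against an in-memory SQLite database from Python. SQLite is an assumption made here purely so the example is runnable; it says nothing about Oracle's parser. The rows are copied from the fruits table shown later in the article, and the column types are guesses.

```python
import sqlite3

# Sample rows copied from the article's fruits table.
ROWS = [
    ("apple", "gala", 2.79), ("apple", "fuji", 0.24), ("apple", "limbertwig", 2.87),
    ("orange", "valencia", 3.59), ("orange", "navel", 9.36),
    ("pear", "bradford", 6.05), ("pear", "bartlett", 2.14),
    ("cherry", "bing", 2.55), ("cherry", "chelan", 6.33),
]

conn = sqlite3.connect(":memory:")
# Assumed column types; the article does not show the DDL.
conn.execute("create table fruits (type text, variety text, price real)")
conn.executemany("insert into fruits values (?, ?, ?)", ROWS)

# Spelling used in the article (MySQL).
with_keywords = """
    select f.type, f.variety, f.price
    from (
       select type, min(price) as minprice
       from fruits group by type
    ) as x inner join fruits as f on f.type = x.type and f.price = x.minprice
"""

# Same query with the table-alias "as" and the "inner" keyword removed.
without_keywords = """
    select f.type, f.variety, f.price
    from (
       select type, min(price) as minprice
       from fruits group by type
    ) x join fruits f on f.type = x.type and f.price = x.minprice
"""

# Row order is unspecified without ORDER BY, so compare sorted results.
original = sorted(conn.execute(with_keywords).fetchall())
stripped = sorted(conn.execute(without_keywords).fetchall())
print(original == stripped)
```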

How to select the first/least/max row per group in SQL


Here are some common SQL problems, all of which have related solutions: how do I find the most recent log entry for each program? How do I find the most popular item from each category? How do I find the top score for each player? In general, these types of “select the extreme from each group” queries can be solved with the same techniques. I’ll explain how to do that in this article, including the harder problem of selecting the top N entries, not just the top 1.

This topic is related to numbering rows, which I just wrote about (see my articles about MySQL-specific and generic techniques to assign a number to each row in a group). Therefore I’ll use nearly the same table and data as I used in those articles, with the addition of a price column:


+--------+------------+-------+
| type   | variety    | price |
+--------+------------+-------+
| apple  | gala       |  2.79 |
| apple  | fuji       |  0.24 |
| apple  | limbertwig |  2.87 |
| orange | valencia   |  3.59 |
| orange | navel      |  9.36 |
| pear   | bradford   |  6.05 |
| pear   | bartlett   |  2.14 |
| cherry | bing       |  2.55 |
| cherry | chelan     |  6.33 |
+--------+------------+-------+
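To follow along, the sample table can be rebuilt in a scratch database. The sketch below uses Python's sqlite3 module with an in-memory database, which is a substitution made for convenience rather than the MySQL server the article uses. The column types are assumptions, since the article never shows the DDL; the primary key on (type, variety) follows the index the article mentions later.

```python
import sqlite3

# Rows copied from the table above.
ROWS = [
    ("apple", "gala", 2.79),
    ("apple", "fuji", 0.24),
    ("apple", "limbertwig", 2.87),
    ("orange", "valencia", 3.59),
    ("orange", "navel", 9.36),
    ("pear", "bradford", 6.05),
    ("pear", "bartlett", 2.14),
    ("cherry", "bing", 2.55),
    ("cherry", "chelan", 6.33),
]

conn = sqlite3.connect(":memory:")
# Assumed schema: text/real column types are a guess.
conn.execute(
    "create table fruits ("
    " type text not null,"
    " variety text not null,"
    " price real not null,"
    " primary key (type, variety))"
)
conn.executemany("insert into fruits values (?, ?, ?)", ROWS)

row_count = conn.execute("select count(*) from fruits").fetchone()[0]
type_count = conn.execute("select count(distinct type) from fruits").fetchone()[0]
print(row_count, type_count)
```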

Selecting the one maximum row from each group

Let’s say I want to select the most recent log entry for each program, or the most recent changes in an audit table, or something of the sort. This question comes up over and over on IRC channels and mailing lists. I’ll re-phrase the question in terms of fruits. I want to select the cheapest fruit from each type. Here’s the desired result:

+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
+--------+----------+-------+


There are a few common solutions to this problem. All involve two steps: finding the desired value of price, and then selecting the rest of the row based on that.

One common solution is a so-called self-join. Step one is to group the fruits by type (apple, cherry etc) and choose the minimum price:

select type, min(price) as minprice
from fruits
group by type;
+--------+----------+
| type   | minprice |
+--------+----------+
| apple  |     0.24 | 
| cherry |     2.55 | 
| orange |     3.59 | 
| pear   |     2.14 | 
+--------+----------+


Step two is to select the rest of the row by joining these results back to the same table. Since the first query is grouped, it needs to be put into a subquery so it can be joined against the non-grouped table:

select f.type, f.variety, f.price
from (
   select type, min(price) as minprice
   from fruits group by type
) as x inner join fruits as f on f.type = x.type and f.price = x.minprice;
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| cherry | bing     |  2.55 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
+--------+----------+-------+


Another common way to do this is with a correlated subquery. This can be much less efficient, depending on how good your system’s query optimizer is. You might find it clearer, though.

select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type);
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
+--------+----------+-------+


Both queries are logically equivalent, though they may not perform the same.
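The equivalence can be checked mechanically. The following is a small sketch, assuming an in-memory SQLite database driven from Python instead of MySQL, that runs both queries and compares their results. It also inserts one hypothetical extra row (a "honeycrisp" apple priced the same as the fuji, which is not in the article's data) to show that both forms return every row tied for the minimum, not exactly one row per group.

```python
import sqlite3

ROWS = [
    ("apple", "gala", 2.79), ("apple", "fuji", 0.24), ("apple", "limbertwig", 2.87),
    ("orange", "valencia", 3.59), ("orange", "navel", 9.36),
    ("pear", "bradford", 6.05), ("pear", "bartlett", 2.14),
    ("cherry", "bing", 2.55), ("cherry", "chelan", 6.33),
]

conn = sqlite3.connect(":memory:")
# Assumed column types; the article does not show the DDL.
conn.execute("create table fruits (type text, variety text, price real)")
conn.executemany("insert into fruits values (?, ?, ?)", ROWS)

self_join = """
    select f.type, f.variety, f.price
    from (
       select type, min(price) as minprice
       from fruits group by type
    ) as x inner join fruits as f on f.type = x.type and f.price = x.minprice
"""

correlated = """
    select type, variety, price
    from fruits
    where price = (select min(price) from fruits as f where f.type = fruits.type)
"""

# Row order is unspecified without ORDER BY, so compare sorted results.
via_join = sorted(conn.execute(self_join).fetchall())
via_subquery = sorted(conn.execute(correlated).fetchall())
print(via_join == via_subquery)

# Hypothetical tie: a second apple variety at the minimum apple price.
conn.execute("insert into fruits values ('apple', 'honeycrisp', 0.24)")
tied_join = sorted(conn.execute(self_join).fetchall())
tied_subquery = sorted(conn.execute(correlated).fetchall())
print(len(tied_join), len(tied_subquery))
```

If exactly one row per group is required, a tie-breaking rule has to be added to either query.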

Select the top N rows from each group

This is a slightly harder problem to solve. Finding a single row from each group is easy with SQL’s aggregate functions (MIN(), MAX(), and so on). Finding the first several from each group is not possible with that method because aggregate functions only return a single value. Still, it’s possible to do.

Let’s say I want to select the two cheapest fruits from each type. Here’s a first try:

select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type)
   or price = (select min(price) from fruits as f where f.type = fruits.type
      and price > (select min(price) from fruits as f2 where f2.type = fruits.type));
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | gala     |  2.79 | 
| apple  | fuji     |  0.24 | 
| orange | valencia |  3.59 | 
| orange | navel    |  9.36 | 
| pear   | bradford |  6.05 | 
| pear   | bartlett |  2.14 | 
| cherry | bing     |  2.55 | 
| cherry | chelan   |  6.33 | 
+--------+----------+-------+


Yuck! That can be written as a self-join, but it’s just as bad (I leave it as an exercise for the reader). This gets worse as you go to higher numbers (top 3, top 4…). There are other ways to phrase the statement, but they all boil down to the same thing, and they’re all pretty unwieldy and inefficient.

There’s a better way: select the variety from each type where the variety is no more than the second-cheapest of that type.

select type, variety, price
from fruits
where (
   select count(*) from fruits as f
   where f.type = fruits.type and f.price <= fruits.price
) <= 2;


This is elegant, and lets you vary N without rewriting your query (a very good thing!), but it’s functionally the same as the previous query. Both are essentially a quadratic algorithm relative to the number of varieties in each type. And again, some query optimizers may not do well with this and make it quadratic with respect to the number of rows in the table overall (especially if no useful index is defined), and the server might get clobbered. Are there better ways? Can it be done with one pass through the data, instead of the many passes required by a correlated subquery? You know it can, or I wouldn’t be writing this, now would I?
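Because N appears only as the right-hand side of the final comparison, it can be supplied as a bound parameter. The following is a minimal sketch of that idea, assuming SQLite driven from Python instead of MySQL; the helper name cheapest_per_type is made up for illustration. The inner comparison uses <= so that each row counts itself, which makes "<= N" select exactly the N cheapest rows when prices within a type are distinct.

```python
import sqlite3

ROWS = [
    ("apple", "gala", 2.79), ("apple", "fuji", 0.24), ("apple", "limbertwig", 2.87),
    ("orange", "valencia", 3.59), ("orange", "navel", 9.36),
    ("pear", "bradford", 6.05), ("pear", "bartlett", 2.14),
    ("cherry", "bing", 2.55), ("cherry", "chelan", 6.33),
]

conn = sqlite3.connect(":memory:")
# Assumed column types; the article does not show the DDL.
conn.execute("create table fruits (type text, variety text, price real)")
conn.executemany("insert into fruits values (?, ?, ?)", ROWS)

# For each row, count the rows of the same type priced at or below it,
# and keep the row if that count does not exceed N.
TOP_N = """
    select type, variety, price
    from fruits
    where (
       select count(*) from fruits as f
       where f.type = fruits.type and f.price <= fruits.price
    ) <= ?
    order by type, price
"""


def cheapest_per_type(n):
    """Return the n cheapest varieties of each type (hypothetical helper)."""
    return conn.execute(TOP_N, (n,)).fetchall()


top1 = cheapest_per_type(1)
top2 = cheapest_per_type(2)
for row in top2:
    print(row)
```

With tied prices the count-based filter can return more or fewer than N rows for a group, so it is best suited to data where the ordering column is unique within each group.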

Use UNION

If there’s an index on (type, price), and there are many more records to eliminate than to include in each group, a more efficient single-pass method (especially for MySQL, but also for some other RDBMSs) is to break the queries out separately and put a limit on each, then UNION them all back together. Here’s the syntax you need for MySQL:

(select * from fruits where type = 'apple' order by price limit 2)
union all
(select * from fruits where type = 'orange' order by price limit 2)
union all
(select * from fruits where type = 'pear' order by price limit 2)
union all
(select * from fruits where type = 'cherry' order by price limit 2)


Peter Zaitsev has written in detail about this technique, so I won’t go into it too much more here. If it suits your purposes, it can be a very good solution.

One note: use UNION ALL, not just UNION. It prevents the server sorting the results to eliminate duplicates before returning them. In this case there will be no duplicates, so I’m telling the server to skip that (useless, expensive) step.
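Writing one branch per type by hand does not scale, so the statement is usually generated. The sketch below shows one way to do that, assuming SQLite driven from Python instead of MySQL. SQLite does not accept the parenthesized branches shown above, so each branch is wrapped in a derived table instead; that wrapping is an adaptation for this example, not part of the original technique.

```python
import sqlite3

ROWS = [
    ("apple", "gala", 2.79), ("apple", "fuji", 0.24), ("apple", "limbertwig", 2.87),
    ("orange", "valencia", 3.59), ("orange", "navel", 9.36),
    ("pear", "bradford", 6.05), ("pear", "bartlett", 2.14),
    ("cherry", "bing", 2.55), ("cherry", "chelan", 6.33),
]

conn = sqlite3.connect(":memory:")
# Assumed column types; the article does not show the DDL.
conn.execute("create table fruits (type text, variety text, price real)")
conn.executemany("insert into fruits values (?, ?, ?)", ROWS)

LIMIT_PER_TYPE = 2

# One extra query to discover the groups.
types = [row[0] for row in conn.execute("select distinct type from fruits order by type")]

# One branch per type; the derived table lets ORDER BY and LIMIT apply per branch.
branch = (
    "select * from ("
    "select type, variety, price from fruits where type = ? order by price limit ?"
    ")"
)
query = " union all ".join([branch] * len(types))

# Parameters are bound in branch order: (type, limit), (type, limit), ...
params = [value for t in types for value in (t, LIMIT_PER_TYPE)]
rows = conn.execute(query, params).fetchall()
for row in rows:
    print(row)
```

Binding the type values as parameters, as opposed to pasting them into the SQL text, keeps the generated statement safe when the group names come from user data.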

Do it with user variables on MySQL

The UNION trick is an especially good idea when the results are a small fraction of the rows in the table and there is an index that can be used for sorting the rows. Another linear-time technique, which might be a good option in cases where you are selecting most of the rows from the table anyway, is user variables. This is MySQL-specific. Please refer to my previous post on how to number rows in MySQL for the gory details of why this works:

set @num := 0, @type := '';

select type, variety, price
from (
   select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
  from fruits
  order by type, price
) as x where x.row_number <= 2;


This isn’t one pass through the table, by the way. The subquery is implemented as a temporary table behind the scenes, so filling it with data is one pass; then selecting every row from it and applying the WHERE clause is another. However, twice through is still O(n) with respect to the table size. That’s a lot better than correlated subqueries, which are O(n²) with respect to the group size; even moderate group sizes cause bad performance. Say there are five varieties of each fruit: that’s on the order of 25 passes through the table, all told.
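The logic the two user variables carry is easier to see outside SQL. The sketch below is a plain Python rendering of the same two passes, not MySQL code: the first pass walks the rows in (type, price) order and restarts a counter whenever the type changes, and the second keeps rows numbered at most 2. The function name number_rows is made up for illustration.

```python
# Rows copied from the article's fruits table.
ROWS = [
    ("apple", "gala", 2.79), ("apple", "fuji", 0.24), ("apple", "limbertwig", 2.87),
    ("orange", "valencia", 3.59), ("orange", "navel", 9.36),
    ("pear", "bradford", 6.05), ("pear", "bartlett", 2.14),
    ("cherry", "bing", 2.55), ("cherry", "chelan", 6.33),
]


def number_rows(ordered_rows):
    """Mimic @num := if(@type = type, @num + 1, 1) over rows sorted by (type, price)."""
    num, current_type = 0, ""  # same starting values as the SET statement
    for type_, variety, price in ordered_rows:
        # Continue counting within the same type; restart at 1 on a new type.
        num = num + 1 if type_ == current_type else 1
        current_type = type_
        yield type_, variety, price, num


# Equivalent of "order by type, price" in the inner query.
ordered = sorted(ROWS, key=lambda row: (row[0], row[2]))

# First pass: number every row (the temporary table in the MySQL version).
numbered = list(number_rows(ordered))

# Second pass: the outer WHERE x.row_number <= 2.
top2 = [row[:3] for row in numbered if row[3] <= 2]
for row in top2:
    print(row)
```

The numbering is only meaningful because the rows arrive sorted; feed the same function unsorted rows and the counter restarts at arbitrary points.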

One-pass technique on MySQL… maybe?

If you want to leave your queries up to the query optimizer’s whims, you can try this technique, which builds no temporary tables and makes just one pass through:

set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits
group by type, price, variety
having row_number <= 2;


This theoretically ought to work if MySQL orders by the GROUP BY criteria, which it sometimes does for efficiency and to produce the expected results. Does it work? Here’s what it returns on MySQL 5.0.27 on Windows:

+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | gala     |  2.79 |          1 | apple  |
| apple  | fuji     |  0.24 |          3 | apple  |
| orange | valencia |  3.59 |          1 | orange |
| orange | navel    |  9.36 |          3 | orange |
| pear   | bradford |  6.05 |          1 | pear   |
| pear   | bartlett |  2.14 |          3 | pear   |
| cherry | bing     |  2.55 |          1 | cherry |
| cherry | chelan   |  6.33 |          3 | cherry |
+--------+----------+-------+------------+--------+
Look closely… it’s returning rows one and three from each group, and they’re not numbered in order of increasing price? Huh? But the HAVING clause says the row_number should be no greater than 2! Here’s what it returns on version 5.0.24a on Ubuntu:

+--------+------------+-------+------------+--------+
| type   | variety    | price | row_number | dummy  |
+--------+------------+-------+------------+--------+
| apple  | fuji       |  0.24 |          1 | apple  |
| apple  | gala       |  2.79 |          1 | apple  |
| apple  | limbertwig |  2.87 |          1 | apple  |
| cherry | bing       |  2.55 |          1 | cherry |
| cherry | chelan     |  6.33 |          1 | cherry |
| orange | valencia   |  3.59 |          1 | orange |
| orange | navel      |  9.36 |          1 | orange |
| pear   | bartlett   |  2.14 |          1 | pear   |
| pear   | bradford   |  6.05 |          1 | pear   |
+--------+------------+-------+------------+--------+
Look, this time everything is numbered 1 and every row is returned. Wonky. This is exactly what the MySQL manual page on user variables warns about.

This technique is pretty much non-deterministic, because it relies on things that you and I don’t get to control directly, such as which indexes MySQL decides to use for grouping. However, if you need to use it — and I know there are some folks out there who do, because I’ve consulted for them — you can still tweak it. We’re getting into the realm of really bastardizing SQL, but the results above came from a table without indexes other than the primary key on (type, variety). What happens if I add an index MySQL can use for grouping?

alter table fruits add key(type, price);
Nothing changes, and EXPLAIN shows why: the query doesn’t use the index I just added. Why? Because the grouping is on three columns, and the index is only on two. In fact, the query is using a temp table and filesort anyway, so this is still not achieving the once-through goal. I can force it to use the index:

set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits force index(type)
group by type, price, variety
having row_number <= 2;


Let’s see if that works:

+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | fuji     |  0.24 |          1 | apple  |
| apple  | gala     |  2.79 |          2 | apple  |
| cherry | bing     |  2.55 |          1 | cherry |
| cherry | chelan   |  6.33 |          2 | cherry |
| orange | valencia |  3.59 |          1 | orange |
| orange | navel    |  9.36 |          2 | orange |
| pear   | bartlett |  2.14 |          1 | pear   |
| pear   | bradford |  6.05 |          2 | pear   |
+--------+----------+-------+------------+--------+
Ah, now we’re cooking! It did what I wanted, without a filesort or temporary table. Another way to do this, by the way, is to take variety out of the GROUP BY so it uses the index on its own. Because this selects a non-grouped column from a grouped query, this only works if you are running with ONLY_FULL_GROUP_BY mode turned off, which I hope you are not doing without good reason.

Other methods

Be sure to check the comments for user-contributed methods. There are some really novel approaches. I always learn so much from your comments… thank you!

Conclusion

Well, that’s it. I’ve shown you several ways of solving the common “get the extreme row from each group” query, and then moved on to how you can get the top N rows from each group in various ways. Then I dove into MySQL-specific techniques which some (including myself, depending on my mood) would regard as mildly foolish to utterly stupid. But if you need the last bit of speed out of your server, you sometimes have to know when to break the rules. And for those who think this is just MySQL foolishness, it’s not; I’ve seen people desperately do these types of things on other platforms too, such as SQL Server. There are hacks and tweaks on every platform, and people who need to use them.

 
