文章目录
学无止境,从细节开始
聚合函数:
聚合函数(Aggregate Functions)是关系型数据库(如 MySQL、PostgreSQL、Oracle)中用于对一组值执行计算并返回单一结果的函数。它们常用于数据统计、汇总和分析。聚合函数一般与 GROUP BY
子句一起使用,但在不分组的情况下也可以直接作用于整个表。
我们先看下错误例子:
业务场景:查询业务库中不同客户对应产品总数,我们来提起前大于100的数量。
select appid,count(1) as total from gkb_sample_info where isright=0 and total>100 GROUP BY appid ORDER BY total DESC
以下代码看似人畜无害,其实查询出来结果是错误的
看正确例子:
SELECT
appid,
COUNT(1) AS total
FROM
gkb_sample_info
WHERE
isright = 0
GROUP BY
appid
HAVING
total > 100
ORDER BY
total DESC;
在 SQL 中,WHERE
子句用于筛选原始表数据,而 HAVING
子句用于筛选聚合函数的结果。因此,total > 100
不能写在 WHERE
子句中,因为 count(1)
是聚合计算的结果,在执行 GROUP BY
之后才会生成。
这里延伸到聚合函数的结果(如 COUNT
, SUM
等)不能在 WHERE
中直接过滤,只能通过 HAVING
过滤。
1. 常用的聚合函数
SUM(列)
-
场景:计算某列的总和,例如销售额、库存总量。
-
示例:
SELECT department, SUM(salary) AS total_salary FROM employees GROUP BY department;
AVG(列)
-
场景:计算某列的平均值,例如平均薪资、平均评分。
-
示例:
SELECT AVG(price) AS average_price FROM products;
COUNT(列)
-
场景:统计记录的数量,例如计算订单数、用户数。
-
示例:
SELECT customer_id, COUNT(order_id) AS order_count FROM orders GROUP BY customer_id;
MAX(列) 和 MIN(列)
-
场景:查找某列的最大值或最小值,例如最高分、最低库存量。
-
示例:
-- 找到销售额最高的订单 SELECT MAX(sales_amount) AS max_sales FROM sales;
涉及到聚合函数查询,也会存在查询优化的情况,本节延伸一些优化方面内容:
2.聚合函数常见优化技巧
(1)聚合函数的索引优化
-
原理:聚合查询会遍历大量数据,为了提高性能,可以基于索引优化,尤其是范围查询或分组计算时。
-
优化方式:
- 对用于
GROUP BY
的列建立索引。 - 对
WHERE
条件中涉及的列建立索引。
- 对用于
-
示例:针对用户登录统计的查询:
CREATE INDEX idx_user_login ON login_records(user_id); SELECT user_id, COUNT(*) AS login_count FROM login_records WHERE login_date >= '2024-01-01' GROUP BY user_id;
(2)避免全表扫描
在大表中,使用条件 (WHERE
子句) 减少参与聚合计算的数据。
-
示例:筛选某一时间段内的订单:
SELECT customer_id, SUM(order_amount) AS total_spent FROM orders WHERE order_date >= '2024-01-01' AND order_date <= '2024-12-31' GROUP BY customer_id;
(3)部分聚合
针对大规模数据量,可以使用分片聚合或分步聚合的方式,提高性能。
-
示例:将大数据集划分为多个小集群分别计算,然后汇总:
-- 在节点 A 聚合 SELECT customer_id, SUM(order_amount) AS total_orders FROM orders WHERE region = 'East' GROUP BY customer_id; -- 在节点 B 聚合 SELECT customer_id, SUM(order_amount) AS total_orders FROM orders WHERE region = 'West' GROUP BY customer_id; -- 最后汇总 SELECT customer_id, SUM(total_orders) AS final_orders FROM regional_totals GROUP BY customer_id;
3. 聚合函数的组合使用
(1)聚合 + CASE
通过 CASE
实现条件聚合,能够按分类统计不同的结果。
-
示例:统计男女用户的订单数量:
SELECT SUM(CASE WHEN gender = 'male' THEN 1 ELSE 0 END) AS male_orders, SUM(CASE WHEN gender = 'female' THEN 1 ELSE 0 END) AS female_orders FROM orders;
(2)带条件的聚合函数
有些数据库支持直接在聚合函数中添加过滤条件,如 PostgreSQL 和 MySQL 的 FILTER
子句或 COUNT(IF)
语法。
-
PostgreSQL 示例:
SELECT COUNT(*) AS total_users, COUNT(*) FILTER (WHERE gender = 'male') AS male_users, COUNT(*) FILTER (WHERE gender = 'female') AS female_users FROM users;
-
MySQL 示例:
SELECT COUNT(*) AS total_users, COUNT(IF(gender = 'male', 1, NULL)) AS male_users, COUNT(IF(gender = 'female', 1, NULL)) AS female_users FROM users;
3. 窗口函数扩展
窗口函数将聚合和行级别数据结合,使得每行都可以保留原始值,同时计算某些聚合结果。
(1)累计和/滚动总和
-
场景:按照时间顺序,计算逐日的累计销量。
-
示例:
SELECT order_date, SUM(order_amount) OVER (ORDER BY order_date) AS cumulative_sales FROM sales;
(2)分区聚合
-
场景:对数据进行分组,但不减少结果行数。
-
示例:计算每个部门员工工资的最高值,同时保留所有行。
SELECT employee_id, department_id, salary, MAX(salary) OVER (PARTITION BY department_id) AS max_salary_in_dept FROM employees;
(3)行间比较
-
场景:使用窗口函数找到每行数据相对前一行的差值。
-
示例:计算每个订单金额与上一订单金额的差距。
SELECT order_id, order_amount, order_amount - LAG(order_amount) OVER (ORDER BY order_id) AS difference FROM orders;
4. 聚合结果的数据处理
(1)TOP-N 查询
通过聚合和排序,筛选出每组的前 N 名。
-
示例:统计每个部门工资最高的 3 位员工。
SELECT employee_id, department_id, salary FROM ( SELECT employee_id, department_id, salary, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank FROM employees ) ranked WHERE rank <= 3;
(2)按百分比分组
将数据按总数划分为百分比分组,可以帮助分析数据的分布情况。
-
示例:将员工工资划分为 25%、50%、75%、100% 四个分位区间。
SELECT employee_id, salary, NTILE(4) OVER (ORDER BY salary) AS salary_quartile FROM employees;
5. 特殊聚合函数扩展
(1)GROUP_CONCAT
将分组后的多行值合并为一行字符串(MySQL 支持)。
-
示例:获取每个部门的员工姓名列表。
SELECT department_id, GROUP_CONCAT(employee_name) AS employee_names FROM employees GROUP BY department_id;
(2)LISTAGG
将多个值合并为一个分隔字符串(Oracle 和 PostgreSQL 支持)。
-
PostgreSQL 示例:
SELECT department_id, STRING_AGG(employee_name, ', ') AS employee_names FROM employees GROUP BY department_id;
(3)MEDIAN
部分数据库(如 PostgreSQL 和 Oracle)支持计算中位数,MySQL 可通过窗口函数模拟。
-
模拟中位数(MySQL 示例):
SELECT AVG(salary) AS median_salary FROM ( SELECT salary FROM employees ORDER BY salary LIMIT 2 OFFSET (SELECT FLOOR(COUNT(*)/2) FROM employees) - 1 ) median;
6. 聚合函数在实际业务场景中的应用
用户行为分析
按日期统计活跃用户数、总数和留存率:
SELECT
login_date,
COUNT(DISTINCT user_id) AS daily_active_users,
SUM(new_user_flag) AS new_users
FROM user_login
GROUP BY login_date;
财务分析
按月统计收入、退款、净利润:
SELECT
DATE_FORMAT(transaction_date, '%Y-%m') AS month,
SUM(CASE WHEN transaction_type = 'income' THEN amount ELSE 0 END) AS total_income,
SUM(CASE WHEN transaction_type = 'refund' THEN amount ELSE 0 END) AS total_refund,
SUM(CASE WHEN transaction_type = 'income' THEN amount ELSE 0 END) -
SUM(CASE WHEN transaction_type = 'refund' THEN amount ELSE 0 END) AS net_profit
FROM transactions
GROUP BY DATE_FORMAT(transaction_date, '%Y-%m');