[Chapter 4] 4.4 Within-Group Computation for Grouped Aggregation

Latest recommended article published on 2025-11-08 11:18:11

License: CC 4.0 BY-SA

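This article describes the pandas methods used for grouped aggregation: groupby, agg, apply, and transform. With them you can split data, compute aggregate statistics, apply custom functions, and transform data, for example by computing the within-group mean, standard deviation, and median, or by performing within-group min-max normalization.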

Grouped aggregation

1. Splitting data with the groupby method

groupby splits the data into groups.
DataFrame.groupby(by=None, axis=0, level=None…)

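Common values for the by parameter	Description
dict or Series	Its values are used as the grouping keys
NumPy array	Its elements are used as the grouping keys
String or list of strings	The columns named by the strings are used as the grouping keys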
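
The kinds of by values in the table above can be tried on a small in-memory frame. This is a minimal sketch: the DataFrame, its column names, and the label_map and keys objects are made up for illustration and are not part of the original order-detail data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the order-detail table.
df = pd.DataFrame({
    "order_id": ["A", "A", "B", "B"],
    "dish": ["soup", "rice", "soup", "tea"],
    "counts": [1, 2, 3, 4],
})

# String: group by the column with that name.
by_column = df.groupby(by="order_id")["counts"].sum()

# List of strings: group by several columns at once.
by_columns = df.groupby(by=["order_id", "dish"])["counts"].sum()

# Dict: each index label is mapped to a group key through the dict's values.
label_map = {0: "first", 1: "first", 2: "second", 3: "second"}
by_dict = df.groupby(by=label_map)["counts"].sum()

# NumPy array: one group key per row, taken from the array's elements.
keys = np.array(["x", "y", "x", "y"])
by_array = df.groupby(by=keys)["counts"].sum()

print(by_column, by_columns, by_dict, by_array, sep="\n\n")
```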

Grouping the detail table by order ID

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://root:981221@localhost/testdb?charset=utf8mb4')
detail = pd.read_sql_table('meal_order_detail1',con=engine)
detailGroup = detail[['order_id','counts','amounts']].groupby(by = 'order_id')
print('Grouped detail table:', detailGroup)  # prints only the object repr with its memory address

Grouped detail table: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0C2B7D90>

The result of grouping is a GroupBy object.
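
Since printing a GroupBy object shows only its address, it can help to look inside it another way. The sketch below uses a hypothetical three-row frame rather than the database table, and shows ngroups, get_group, and iteration over the groups.

```python
import pandas as pd

# Hypothetical toy data; the real table is read from MySQL in the article.
df = pd.DataFrame({
    "order_id": ["A", "A", "B"],
    "counts": [1, 2, 3],
    "amounts": [10.0, 20.0, 30.0],
})
grouped = df.groupby(by="order_id")

# Number of distinct groups.
n_groups = grouped.ngroups

# All rows that belong to one group, as an ordinary DataFrame.
group_a = grouped.get_group("A")

# Iterating yields (key, sub-frame) pairs.
group_sizes = {key: len(frame) for key, frame in grouped}

print(n_groups)
print(group_a)
print(group_sizes)
```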

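Common descriptive statistics methods of GroupBy

Method	Description
count	Number of non-missing values in each group (missing values are not counted)
head	First n rows of each group; n defaults to 5 and can be specified
max	Maximum of each group
mean	Mean of each group
median	Median of each group
cumcount	Numbers the members of each group from 0 to n-1
size	Size of each group (missing values included)
min	Minimum of each group
std	Standard deviation of each group
sum	Sum of each group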
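
The difference between count, size, and cumcount is easiest to see on a group that contains a missing value. This is a small sketch on invented data, not output from the article's table.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "A" has one missing amount.
df = pd.DataFrame({
    "order_id": ["A", "A", "A", "B"],
    "amounts": [10.0, np.nan, 30.0, 5.0],
})
grouped = df.groupby("order_id")

non_missing = grouped["amounts"].count()  # NaN is not counted
rows = grouped.size()                     # every row is counted
position = grouped.cumcount()             # 0..n-1 within each group

print(non_missing, rows, position, sep="\n\n")
```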

Mean, standard deviation, and median of a GroupBy object

print('Mean of the first 5 groups:\n', detailGroup.mean().head())

Mean of the first 5 groups:
           counts  amounts
order_id                 
1002      1.0000   32.000
1003      1.2500   30.125
1004      1.0625   43.875
1008      1.0000   63.000
1011      1.0000   57.700

print('Standard deviation of the first 5 groups:\n', detailGroup.std().head())

Standard deviation of the first 5 groups:
            counts    amounts
order_id                    
1002      0.00000  16.000000
1003      0.46291  21.383822
1004      0.25000  31.195886
1008      0.00000  64.880660
1011      0.00000  50.077828

print('Median of the first 5 groups:\n', detailGroup.median().head())

Median of the first 5 groups:
           counts  amounts
order_id                 
1002         1.0     30.0
1003         1.0     28.0
1004         1.0     35.0
1008         1.0     37.0
1011         1.0     33.5

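2. Aggregating data with the agg method

agg is an alias for aggregate; the two methods behave identically.
DataFrame.agg is available from pandas 0.20 onward; earlier versions do not provide it.
DataFrame.agg(func, axis=0, *args, **kwargs)
DataFrame.aggregate(func, axis=0, *args, **kwargs)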

Using agg to compute statistics for the current data

print('Sum and mean of sales quantity and of price:\n',
      detail[['counts','amounts']].agg([np.sum,np.mean])) # pass a list of functions

Sum and mean of sales quantity and of price:
            counts        amounts
sum   3088.000000  125992.000000
mean     1.111191      45.337172

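Using agg to compute a different statistic for each column

Pass a dict with column names as keys and functions as values.

print('Sum of sales quantity and mean of price:\n',
     detail.agg({'counts':np.sum,'amounts':np.mean}))

Sum of sales quantity and mean of price:
 counts     3088.000000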
amounts      45.337172
dtype: float64

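Using agg to compute a different number of statistics for each column

Make the dict value a list; each element of the list becomes one statistic for that column.

print('Sum of sales quantity, and sum and mean of price:\n',
     detail.agg({'counts':np.sum,'amounts':[np.sum,np.mean]}))

Sum of sales quantity, and sum and mean of price: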
       counts        amounts
mean     NaN      45.337172
sum   3088.0  125992.000000
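
The three calling styles shown so far (a list, a dict, and a dict of lists) also accept function names as strings. The sketch below is an illustration on invented numbers; it uses strings because recent pandas releases may emit a warning when NumPy functions such as np.sum are passed to agg.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the detail table.
df = pd.DataFrame({"counts": [1.0, 2.0, 3.0], "amounts": [10.0, 20.0, 60.0]})

# List of function names: every statistic for every column.
both = df.agg(["sum", "mean"])

# Dict: one statistic per column, returned as a Series.
per_column = df.agg({"counts": "sum", "amounts": "mean"})

# Dict of lists: a different number of statistics per column.
# Cells that were not requested are filled with NaN.
mixed = df.agg({"counts": ["sum"], "amounts": ["sum", "mean"]})

print(both, per_column, mixed, sep="\n\n")
```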

Using a custom function in agg

# Custom function: twice the sum
def DoubleSum(data):
    s = data.sum()*2
    return s
print('Twice the sum of sales quantity:\n',detail.agg({'counts':DoubleSum},axis=0))

Twice the sum of sales quantity:
 counts    6176.0
dtype: float64

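Custom functions in agg that call NumPy functions

Applied to a single column through a dict: does not aggregate as intended. In the pandas version used here, agg first tries the function on each element, and np.sum accepts a scalar, so every value is simply doubled instead of the column being summed.
Applied to several columns at once: works as intended.

def DoubleSum1(data):
    s = np.sum(data)*2
    return s
print('Twice the sum of sales quantity:\n',detail.agg({'counts':DoubleSum1},axis=0).head())

Twice the sum of sales quantity: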
    counts
0     2.0
1     2.0
2     2.0
3     2.0
4     2.0

print('Twice the sum of sales quantity and of price:\n',detail[['counts','amounts']].agg(DoubleSum1))

Twice the sum of sales quantity and of price:
 counts       6176.0
amounts    251984.0
dtype: float64
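
One way to sidestep the single-column surprise is to make sure the function always receives a whole column. The sketch below is an illustration on made-up data, with a hypothetical double_sum helper: it calls the function directly on the Series, and uses DataFrame.apply, which passes each column as a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the detail table.
df = pd.DataFrame({"counts": [1.0, 1.0, 2.0], "amounts": [10.0, 20.0, 30.0]})

def double_sum(data):
    """Return twice the sum of a whole column."""
    return np.sum(data) * 2

# Calling the function directly guarantees it sees the whole column.
single = double_sum(df["counts"])

# DataFrame.apply hands each column to the function as a Series.
multi = df[["counts", "amounts"]].apply(double_sum)

print(single)
print(multi)
```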

Simple aggregation of the grouped object detailGroup with agg

print('Mean of the first 3 groups after grouping:\n',detailGroup.agg(np.mean).head(3))

Mean of the first 3 groups after grouping:
           counts  amounts
order_id                 
1002      1.0000   32.000
1003      1.2500   30.125
1004      1.0625   43.875

print('Standard deviation of the first 3 groups after grouping:\n',detailGroup.agg(np.std).head(3))

Standard deviation of the first 3 groups after grouping:
            counts    amounts
order_id                    
1002      0.00000  16.000000
1003      0.46291  21.383822
1004      0.25000  31.195886

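Applying different aggregation functions to the grouped object detailGroup with agg

print('Total sales quantity and mean price of the first 3 groups:\n',
     detailGroup.agg({'counts':np.sum,'amounts':np.mean}).head(3))

Total sales quantity and mean price of the first 3 groups: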
           counts  amounts
order_id                 
1002         7.0   32.000
1003        10.0   30.125
1004        17.0   43.875
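
The same per-column aggregation can also be written with named aggregation, which lets the output columns carry descriptive names. This is a sketch on invented data; it assumes pandas 0.25 or later, and the output names total_counts and mean_amount are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the detail table.
df = pd.DataFrame({
    "order_id": ["A", "A", "B"],
    "counts": [1, 2, 3],
    "amounts": [10.0, 20.0, 30.0],
})

# Each keyword is an output column: (input column, aggregation name).
summary = df.groupby("order_id").agg(
    total_counts=("counts", "sum"),
    mean_amount=("amounts", "mean"),
)

print(summary)
```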

3. Aggregating data with the apply method

apply is similar to agg.
Unlike agg, it cannot apply different functions to different columns.
DataFrame.apply(func, axis=0, …)

Basic usage of apply

print('Mean of sales quantity and price:\n',detail[['counts','amounts']].apply(np.mean))

Mean of sales quantity and price:
 counts      1.111191
amounts    45.337172
dtype: float64

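Aggregation with apply (group first, then aggregate)

The mean is used as the example; other functions work similarly. Here apply passes every column of each group to the function, including order_id, which is stored as strings. The order_id values in the output are therefore an artifact: summing strings concatenates them, and the concatenated number is then divided by the group size.

print('Mean of the first 5 groups after grouping:\n',detailGroup.apply(np.mean).head())

Mean of the first 5 groups after grouping: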
               order_id  counts  amounts
order_id                               
1002      1.431572e+26  1.0000   32.000
1003      1.253875e+30  1.2500   30.125
1004      6.275628e+61  1.0625   43.875
1008      2.016202e+18  1.0000   63.000
1011      1.011101e+38  1.0000   57.700
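
The order_id column in the output above can be avoided by selecting the numeric columns before calling apply. The following is a minimal sketch on invented rows, not the article's data.

```python
import pandas as pd

# Hypothetical toy data; order_id is a string column, as in the article.
df = pd.DataFrame({
    "order_id": ["1002", "1002", "1003"],
    "counts": [1.0, 1.0, 2.0],
    "amounts": [30.0, 34.0, 28.0],
})

# Select the numeric columns first so the function never sees order_id.
group_means = (
    df.groupby("order_id")[["counts", "amounts"]]
      .apply(lambda g: g.mean())
)

print(group_means)
```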

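4. Transforming data with the transform method

transform takes a function func as its main argument and returns a result with the same length as its input.
DataFrame.transform(lambda x: <some operation on x>)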

Doubling the data with transform

print('Twice the sales quantity and price (first 5 rows):\n',detail[['counts','amounts']].transform(lambda x: x*2).head())

Twice the sales quantity and price (first 5 rows):
    counts  amounts
0     2.0     98.0
1     2.0     96.0
2     2.0     60.0
3     2.0     50.0
4     2.0     26.0
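
On a grouped object, transform keeps one output row per input row, which distinguishes it from agg. The sketch below, on made-up data, broadcasts each group's mean back onto its rows; the column name group_mean is chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the detail table.
df = pd.DataFrame({
    "order_id": ["A", "A", "B"],
    "amounts": [10.0, 20.0, 30.0],
})

# agg returns one row per group ...
per_group = df.groupby("order_id")["amounts"].agg("mean")

# ... while transform returns one value per original row.
df["group_mean"] = df.groupby("order_id")["amounts"].transform("mean")

print(per_group)
print(df)
```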

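Within-group min-max normalization of detailGroup with transform

Note that the expression below uses x.mean() in the numerator, so every row of a group receives the same value; element-wise min-max normalization would use (x - x.min())/(x.max() - x.min()). The NaN values in counts, and the RuntimeWarning, come from groups whose counts are all equal, which gives 0/0.

print('Within-group min-max normalization after grouping (first 5 rows):\n',detailGroup.transform(lambda x: (x.mean()-x.min())/(x.max()-x.min())).head())

c:\python37\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in double_scalars
  """Entry point for launching an IPython kernel.


Within-group min-max normalization after grouping (first 5 rows):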
    counts   amounts
0     NaN  0.555556
1     NaN  0.555556
2     NaN  0.555556
3     NaN  0.555556
4     NaN  0.555556
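
For comparison, here is a sketch of element-wise min-max normalization within each group, scaling each value as (x - min) / (max - min). It runs on invented data, and the min_max helper is hypothetical. Mapping constant groups to 0 instead of NaN is an assumption made here to avoid the 0/0 case, not something the article specifies.

```python
import pandas as pd

# Hypothetical toy data; group "A" has constant counts.
df = pd.DataFrame({
    "order_id": ["A", "A", "A", "B", "B"],
    "counts": [1.0, 1.0, 1.0, 1.0, 2.0],
    "amounts": [10.0, 20.0, 30.0, 5.0, 15.0],
})

def min_max(x):
    """Scale one group's values to [0, 1]; constant groups become 0 (assumption)."""
    value_range = x.max() - x.min()
    if value_range == 0:
        return pd.Series(0.0, index=x.index)
    return (x - x.min()) / value_range

# Transform one column at a time so min_max always receives a Series.
scaled = df.copy()
for column in ["counts", "amounts"]:
    scaled[column] = df.groupby("order_id")[column].transform(min_max)

print(scaled)
```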