第四章 分组
import numpy as np
import pandas as pd
一、分组模式及其对象
df.groupby(分组依据)[数据来源].使用操作
例如第一个例子中的代码就应该如下:
df.groupby('Gender')['Longevity'].mean()
df = pd.read_csv('../data/learn_pandas.csv')
df.groupby('Gender')['Height'].median()
Gender
Female 159.6
Male 173.4
Name: Height, dtype: float64
2. 分组依据的本质
根据学校和性别进行分组,统计身高的均值就可以如下写出:
df.groupby(['School', 'Gender'])['Height'].mean()
School Gender
Fudan University Female 158.776923
Male 174.212500
Peking University Female 158.666667
Male 172.030000
Shanghai Jiao Tong University Female 159.122500
Male 176.760000
Tsinghua University Female 159.753333
Male 171.638889
Name: Height, dtype: float64
首先写出分组条件:
condition = df.Weight > df.Weight.mean()
然后传入groupby
中:
df.groupby(condition)['Height'].mean()
Weight
False 159.034646
True 172.705357
Name: Height, dtype: float64
3. Groupby对象
gb = df.groupby(['School', 'Grade'])
gb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024E9F837288>
ngroups
可以访问分为了多少组:
gb.ngroups
16
gb.size()
School Grade
Fudan University Freshman 9
Junior 12
Senior 11
Sophomore 8
Peking University Freshman 13
Junior 8
Senior 8
Sophomore 5
Shanghai Jiao Tong University Freshman 13
Junior 17
Senior 22
Sophomore 5
Tsinghua University Freshman 17
Junior 22
Senior 14
Sophomore 16
dtype: int64
通过get_group
方法可以直接获取所在组对应的行,此时必须知道组的具体名字:
gb.get_group(('Fudan University', 'Freshman'))
School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record | |
---|---|---|---|---|---|---|---|---|---|---|
15 | Fudan University | Freshman | Changqiang Yang | Female | 156.0 | 49.0 | N | 3 | 2020/1/1 | 0:05:25 |
28 | Fudan University | Freshman | Gaoqiang Qin | Female | 170.2 | 63.0 | N | 2 | 2020/1/7 | 0:05:24 |
63 | Fudan University | Freshman | Gaofeng Zhao | Female | 152.2 | 43.0 | N | 2 | 2019/10/31 | 0:04:00 |
70 | Fudan University | Freshman | Yanquan Wang | Female | 163.5 | 55.0 | N | 1 | 2019/11/19 | 0:04:07 |
73 | Fudan University | Freshman | Feng Wang | Male | 176.3 | 74.0 | N | 1 | 2019/9/26 | 0:03:31 |
105 | Fudan University | Freshman | Qiang Shi | Female | 164.5 | 52.0 | N | 1 | 2019/12/11 | 0:04:23 |
108 | Fudan University | Freshman | Yanqiang Xu | Female | 152.4 | 38.0 | N | 1 | 2019/12/8 | 0:05:03 |
157 | Fudan University | Freshman | Xiaoli Lv | Female | 152.5 | 45.0 | N | 2 | 2019/9/11 | 0:04:17 |
186 | Fudan University | Freshman | Yanjuan Zhao | Female | NaN | 53.0 | N | 2 | 2019/10/9 | 0:04:21 |
4. 分组的三大操作
- 第一个例子中,每一个组返回一个标量值,可以是平均值、中位数、组容量
size
等 - 第二个例子中,做了原序列的标准化处理,也就是说每组返回的是一个
Series
类型 - 第三个例子中,既不是标量也不是序列,返回的整个组所在行的本身,即返回了
DataFrame
类型
二、聚合函数
1. 内置聚合函数
包括如下函数:max/min/mean/median/count/all/any/idxmax/idxmin/mad/nunique/skew/quantile/sum/std/var/sem/size/prod
。
gb = df.groupby('Gender')['Height']
gb.idxmin()
Gender
Female 143
Male 199
Name: Height, dtype: int64
gb.quantile(0.95)
Gender
Female 166.8
Male 185.9
Name: Height, dtype: float64
gb = df.groupby('Gender')[['Height', 'Weight']]
gb.max()
Height | Weight | |
---|---|---|
Gender | ||
Female | 170.2 | 63.0 |
Male | 193.9 | 89.0 |
2. agg方法
【a】使用多个函数
gb.agg(['sum', 'idxmax', 'skew'])
Height | Weight | |||||
---|---|---|---|---|---|---|
sum | idxmax | skew | sum | idxmax | skew | |
Gender | ||||||
Female | 21014.0 | 28 | -0.219253 | 6469.0 | 28 | -0.268482 |
Male | 8854.9 | 193 | 0.437535 | 3929.0 | 2 | -0.332393 |
【b】对特定的列使用特定的聚合函数
gb.agg({'Height':['mean','max'], 'Weight':'count'})
Height | Weight | ||
---|---|---|---|
mean | max | count | |
Gender | |||
Female | 159.19697 | 170.2 | 135 |
Male | 173.62549 | 193.9 | 54 |
【c】使用自定义函数
gb.agg(lambda x: x.mean()-x.min())
Height | Weight | |
---|---|---|
Gender | ||
Female | 13.79697 | 13.918519 |
Male | 17.92549 | 21.759259 |
def my_func(s):
res = 'High'
if s.mean() <= df[s.name].mean():
res = 'Low'
return res
gb.agg(my_func)
Height | Weight | |
---|---|---|
Gender | ||
Female | Low | Low |
Male | High | High |
【d】聚合结果重命名
gb.agg([('range', lambda x: x.max()-x.min()), ('my_sum', 'sum')])
Height | Weight | |||
---|---|---|---|---|
range | my_sum | range | my_sum | |
Gender | ||||
Female | 24.8 | 21014.0 | 29.0 | 6469.0 |
Male | 38.2 | 8854.9 | 38.0 | 3929.0 |
gb.agg({'Height': [('my_func', my_func), 'sum'], 'Weight': lambda x:x.max()})
Height | Weight | ||
---|---|---|---|
my_func | sum | <lambda> | |
Gender | |||
Female | Low | 21014.0 | 63.0 |
Male | High | 8854.9 | 89.0 |
gb.agg([('my_sum', 'sum')])
Height | Weight | |
---|---|---|
my_sum | my_sum | |
Gender | ||
Female | 21014.0 | 6469.0 |
Male | 8854.9 | 3929.0 |
gb.agg({'Height': [('my_func', my_func), 'sum'], 'Weight': [('range', lambda x:x.max())]})
Height | Weight | ||
---|---|---|---|
my_func | sum | range | |
Gender | |||
Female | Low | 21014.0 | 63.0 |
Male | High | 8854.9 | 89.0 |
三、变换和过滤
1. 变换函数与transform方法
gb.cummax().head()
Height | Weight | |
---|---|---|
0 | 158.9 | 46.0 |
1 | 166.5 | 70.0 |
2 | 188.9 | 89.0 |
3 | NaN | 46.0 |
4 | 188.9 | 89.0 |
现对身高和体重进行分组标准化,即减去组均值后除以组的标准差:
gb.transform(lambda x: (x-x.mean())/x.std()).head()
Height | Weight | |
---|---|---|
0 | -0.058760 | -0.354888 |
1 | -1.010925 | -0.355000 |
2 | 2.167063 | 2.089498 |
3 | NaN | -1.279789 |
4 | 0.053133 | 0.159631 |
gb.transform('mean').head() # 传入返回标量的函数也是可以的
Height | Weight | |
---|---|---|
0 | 159.19697 | 47.918519 |
1 | 173.62549 | 72.759259 |
2 | 173.62549 | 72.759259 |
3 | 159.19697 | 47.918519 |
4 | 173.62549 | 72.759259 |
2. 组索引与过滤
gb.filter(lambda x: x.shape[0] > 100).head()
Height | Weight | |
---|---|---|
0 | 158.9 | 46.0 |
3 | NaN | 41.0 |
5 | 158.0 | 51.0 |
6 | 162.5 | 52.0 |
7 | 161.9 | 50.0 |
四、跨列分组
1. apply的引入
之前几节介绍了三大分组操作,但事实上还有一种常见的分组场景,无法用前面介绍的任何一种方法处理,例如现在如下定义身体质量指数BMI:
B
M
I
=
W
e
i
g
h
t
H
e
i
g
h
t
2
{\rm BMI} = {\rm\frac{Weight}{Height^2}}
BMI=Height2Weight
其中体重和身高的单位分别为千克和米,需要分组计算组BMI的均值。
首先,这显然不是过滤操作,因此filter
不符合要求;其次,返回的均值是标量而不是序列,因此transform
不符合要求;最后,似乎使用agg
函数能够处理,但是之前强调过聚合函数是逐列处理的,而不能够
多
列
数
据
同
时
处
理
\color{#FF0000}{多列数据同时处理}
多列数据同时处理。由此,引出了apply
函数来解决这一问题。
2. apply的使用
在设计上,apply
的自定义函数传入参数与filter
完全一致,只不过后者只允许返回布尔值。现如下解决上述计算问题:
def BMI(x):
Height = x['Height']/100
Weight = x['Weight']
BMI_value = Weight/Height**2
return BMI_value.mean()
gb.apply(BMI)
Gender
Female 18.860930
Male 24.318654
dtype: float64
apply
方法还可以返回一维Series
和二维DataFrame
:
【a】标量情况:结果得到的是 Series
,索引与 agg
的结果一致
gb = df.groupby(['Gender','Test_Number'])[['Height','Weight']]
gb.apply(lambda x: 0)
Gender Test_Number
Female 1 0
2 0
3 0
Male 1 0
2 0
3 0
dtype: int64
gb.apply(lambda x: [0, 0]) # 虽然是列表,但是作为返回值仍然看作标量
Gender Test_Number
Female 1 [0, 0]
2 [0, 0]
3 [0, 0]
Male 1 [0, 0]
2 [0, 0]
3 [0, 0]
dtype: object
【b】Series
情况:得到的是DataFrame
,行索引与标量情况一致,列索引为Series
的索引
gb.apply(lambda x: pd.Series([0,0],index=['a','b']))
a | b | ||
---|---|---|---|
Gender | Test_Number | ||
Female | 1 | 0 | 0 |
2 | 0 | 0 | |
3 | 0 | 0 | |
Male | 1 | 0 | 0 |
2 | 0 | 0 | |
3 | 0 | 0 |
gb.apply(lambda x: pd.DataFrame(np.ones((2,2)), index = ['a','b'], columns=pd.Index([('w','x'),('y','z')])))
w | y | |||
---|---|---|---|---|
x | z | |||
Gender | Test_Number | |||
Female | 1 | a | 1.0 | 1.0 |
b | 1.0 | 1.0 | ||
2 | a | 1.0 | 1.0 | |
b | 1.0 | 1.0 | ||
3 | a | 1.0 | 1.0 | |
b | 1.0 | 1.0 | ||
Male | 1 | a | 1.0 | 1.0 |
b | 1.0 | 1.0 | ||
2 | a | 1.0 | 1.0 | |
b | 1.0 | 1.0 | ||
3 | a | 1.0 | 1.0 | |
b | 1.0 | 1.0 |