一、groupby的基础操作
In [2]: df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
...: 'B': [2, 8, 1, 4, 3, 2, 5, 9],
...: 'C': [102, 98, 107, 104, 115, 87, 92, 123]})
...: df
...:
Out[2]:
A B C
0 a 2 102
1 b 8 98
2 a 1 107
3 c 4 104
4 a 3 115
5 c 2 87
6 b 5 92
7 c 9 123
1、按A列分组(groupby),获取其他列的均值
df.groupby('A').mean()
Out[3]:
B C
A
a 2.0 108.000000
b 6.5 95.000000
c 5.0 104.666667
2、按多列进行分组(groupby)
df.groupby(['A','B']).mean()
Out[4]:
C
A B
a 1 107
2 102
3 115
b 5 92
8 98
c 2 87
4 104
9 123
合并相同的列, 对其他列进行操作运算
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
...: 'B': [2, 8, 1, 4, 3, 2, 5, 9],
...: 'C': [102, 98, 107, 104, 115, 87, 92, 123]})
...:
In [3]: df.groupby(["A","B" ]).agg({'C':'sum'})
Out[3]:
C
A B
a 1 107
2 102
3 115
b 5 92
8 98
c 2 87
4 104
9 123
In [4]: df.groupby(["A","B" ]).agg({'C':'sum'}).reset_index()
Out[4]:
A B C
0 a 1 107
1 a 2 102
2 a 3 115
3 b 5 92
4 b 8 98
5 c 2 87
6 c 4 104
7 c 9 123
3、 可以针对不同的列选用不同的聚合方法
df = df.groupby("A").agg({'B':'mean', 'C':'sum'})
print(df)
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
"City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
"Val": [4, 3, 3, np.nan, np.nan, 4]})
print(df)
"""
Name City Val
0 Alice Seattle 4.0
1 Bob Seattle 3.0
2 Mallory Portland 3.0
3 Mallory Seattle NaN
4 Bob Seattle NaN
5 Mallory Portland 4.0
"""
print("*" * 20)
print(df.groupby(["Name", "City"], as_index=False)['Val'].count()) # 这个加上有 index
"""
输出结果:
Name City Val
0 Alice Seattle 1
1 Bob Seattle 1
2 Mallory Portland 2
3 Mallory Seattle 0
"""
print("*" * 20)
print(df.groupby(["Name", "City"], as_index=True)['Val'].count())
"""
输出结果:
Name City
Alice Seattle 1
Bob Seattle 1
Mallory Portland 2
Seattle 0
Name: Val, dtype: int64
"""
print("*" * 20)
print(df.groupby(["Name", "City"])['Val'].size().reset_index(name='Size'))
"""
输出结果:
Name City Size
0 Alice Seattle 1
1 Bob Seattle 2
2 Mallory Portland 2
3 Mallory Seattle 1
"""
4 # 分组运算方法 agg()
df = pd.DataFrame({'A': list('XYZXYZXYZX'), 'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2], 'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})
print(df)
"""
A B C
0 X 1 12
1 Y 2 14
2 Z 1 11
3 X 3 12
4 Y 1 13
5 Z 2 14
6 X 3 16
7 Y 3 12
8 Z 1 10
9 X 2 19
"""
5 、 agg 针对这一列求不同的值
print(df.groupby('A')['B'].agg({'mean':np.mean, 'standard deviation': np.std}))
"""
mean standard deviation
A
X 2.250000 0.957427
Y 2.000000 1.000000
Z 1.333333 0.577350
"""
6 、针对不同的列应用多种不同的统计方法
print(df.groupby('A').agg({'B':[np.mean, 'sum'], 'C':['count',np.std]}))
"""
B C
mean sum count std
A
X 2.250000 9 4 3.403430
Y 2.000000 6 3 1.000000
Z 1.333333 4 3 2.081666
"""
7、分组运算方法 apply()
print(df.groupby('A').apply(np.mean))
# 跟下面的方法的运行结果是一致的
# df.groupby('A').mean()
"""
B C
A
X 2.250000 14.750000
Y 2.000000 13.000000
Z 1.333333 11.666667
"""
8、 apply()方法可以应用lambda函数,举例如下:
print(df.groupby('A').apply(lambda x: x['C']-x['B']))
"""
A
X 0 11
3 9
6 13
9 17
Y 1 12
4 12
7 9
Z 2 10
5 12
8 9
dtype: int64
"""
print(df.groupby('A').apply(lambda x: (x['C']-x['B']).mean()))
"""
A
X 12.500000
Y 11.000000
Z 10.333333
dtype: float64
"""
9、 分组运算方法 transform()
df = pd.DataFrame({'group1' : ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'group2' : ['C', 'C', 'C', 'D','E', 'E', 'F', 'F'],
'B': ['one', np.NaN, np.NaN, np.NaN,np.NaN, 'two', np.NaN, np.NaN],
'C': [np.NaN, 1, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, 4]})
print(df)
"""
B C group1 group2
0 one NaN A C
1 NaN 1.0 A C
2 NaN NaN A C
3 NaN NaN A D
4 NaN NaN B E
5 two NaN B E
6 NaN NaN B F
7 NaN 4.0 B F
"""
print(df.groupby(['group1', 'group2'])['B'].transform('count'))
"""
0 1
1 1
2 1
3 0
4 1
5 1
6 0
7 0
Name: B, dtype: int64
"""
df['count_B']=df.groupby(['group1', 'group2'])['B'].transform('count')
print(df)
"""
B C group1 group2 count_B
0 one NaN A C 1
1 NaN 1.0 A C 1
2 NaN NaN A C 1
3 NaN NaN A D 0
4 NaN NaN B E 1
5 two NaN B E 1
6 NaN NaN B F 0
7 NaN 4.0 B F 0
"""