Pandas实现groupby分组统计
类似SQL:
select city.max(temperature) from city_weather groupby by city
groupby: 先对数据分组,然后在每个分组上应用聚合函数、转换函数
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
| A | B | C | D |
---|
0 | foo | one | -0.190741 | 1.612948 |
---|
1 | bar | one | -0.781209 | 1.002117 |
---|
2 | foo | two | 0.858491 | -0.178429 |
---|
3 | bar | three | -0.280559 | 0.083816 |
---|
4 | foo | two | 0.214302 | -0.107717 |
---|
5 | bar | two | -0.134638 | -1.604365 |
---|
6 | foo | one | -2.108731 | -0.653819 |
---|
7 | foo | three | 0.414685 | -1.717435 |
---|
1、分组使用聚合函数做数据统计
1、单个列groupby,查询所有数据列的统计
df.groupby('A').sum()
| C | D |
---|
A | | |
---|
bar | -1.196406 | -0.518431 |
---|
foo | -0.811994 | -1.044452 |
---|
1、groupby中的A变成了数据的索引列
2、B列不是数字,所以被自动忽略
2、多个列groupby,查询所有数据列的统计
df.groupby(['A','B']).mean()
| | C | D |
---|
A | B | | |
---|
bar | one | -0.781209 | 1.002117 |
---|
three | -0.280559 | 0.083816 |
---|
two | -0.134638 | -1.604365 |
---|
foo | one | -1.149736 | 0.479564 |
---|
three | 0.414685 | -1.717435 |
---|
two | 0.536397 | -0.143073 |
---|
(A,B)成对变成了二级索引,不想改变原来索引。加一个as_index=False
df.groupby(['A','B'],as_index=False).mean()
| A | B | C | D |
---|
0 | bar | one | -0.781209 | 1.002117 |
---|
1 | bar | three | -0.280559 | 0.083816 |
---|
2 | bar | two | -0.134638 | -1.604365 |
---|
3 | foo | one | -1.149736 | 0.479564 |
---|
4 | foo | three | 0.414685 | -1.717435 |
---|
5 | foo | two | 0.536397 | -0.143073 |
---|
3、同时查看多种数据统计
df.groupby('A').agg([np.sum,np.mean,np.std])
| C | D |
---|
| sum | mean | std | sum | mean | std |
---|
A | | | | | | |
---|
bar | -1.196406 | -0.398802 | 0.339116 | -0.518431 | -0.17281 | 1.322055 |
---|
foo | -0.811994 | -0.162399 | 1.151755 | -1.044452 | -0.20889 | 1.204567 |
---|
列变成了多级索引
4、查看单列的结果数据统计
df.groupby('A')['C'].agg([np.sum,np.mean,np.std])
| sum | mean | std |
---|
A | | | |
---|
bar | -1.196406 | -0.398802 | 0.339116 |
---|
foo | -0.811994 | -0.162399 | 1.151755 |
---|
df.groupby('A').agg([np.sum,np.mean,np.std])['C']
| sum | mean | std |
---|
A | | | |
---|
bar | -1.196406 | -0.398802 | 0.339116 |
---|
foo | -0.811994 | -0.162399 | 1.151755 |
---|
5、不同列使用不同的聚合函数
df.groupby('A').agg({'C':np.sum,'D':np.std})
| C | D |
---|
A | | |
---|
bar | -1.196406 | 1.322055 |
---|
foo | -0.811994 | 1.204567 |
---|