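What groupby does
DataFrame.groupby splits the rows of a DataFrame into groups according to the values of one or more features (columns), so that each group can be summarized separately.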
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_train = pd.read_csv("train.csv")  # Titanic dataset
View the counts
df_train.groupby(['Sex','Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
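For readers without train.csv at hand, the same counting pattern can be tried on a tiny hand-made table. This is only a sketch: the `toy` frame and its values are made up for illustration and are not the Titanic data. It also shows `unstack`, which turns the inner index level into columns for a more compact table.

```python
import pandas as pd

# Toy stand-in for the Titanic training data (values are invented for illustration).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Count the rows in every (Sex, Survived) combination; the result is a Series
# with a two-level index, just like the output above.
counts = toy.groupby(["Sex", "Survived"])["Survived"].count()
print(counts)

# Move the inner index level (Survived) into columns to get a 2x2 table.
table = counts.unstack("Survived")
print(table)
```

A single cell can then be read with a tuple label on the Series, for example `counts.loc[("female", 1)]`, or with row and column labels on the table, for example `table.loc["male", 0]`.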
Plot the survival rate by sex
df_train[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1f8d3c93198>
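The bar chart works because Survived is coded as 0/1, so the mean of each group is the fraction of survivors in that group. A minimal sketch of the numbers behind such a chart, using an invented `toy` table rather than the real data:

```python
import pandas as pd

# Toy stand-in for the Titanic training data (values are invented for illustration).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group.
rate = toy.groupby("Sex")["Survived"].mean()
print(rate)

# rate.plot.bar() would draw these values as bars (requires matplotlib).
```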

df = pd.DataFrame(data={'books':['bk1','bk1','bk1','bk2','bk2','bk3'], 'price': [12,12,12,15,15,17]})
df
| | books | price |
|---|---|---|
| 0 | bk1 | 12 |
| 1 | bk1 | 12 |
| 2 | bk1 | 12 |
| 3 | bk2 | 15 |
| 4 | bk2 | 15 |
| 5 | bk3 | 17 |
df0 = df.groupby('books',as_index=True).sum()
print (df0.loc['bk1'])
price 36
Name: bk1, dtype: int64
df0.loc[0]  # raises KeyError: the index holds the labels 'bk1', 'bk2', 'bk3', not integers
df1 = df.groupby('books',as_index=False).sum()
print (df1.loc[df1.books == 'bk1'])
books price
0 bk1 36
df1.loc['bk1']  # raises KeyError: the index is 0, 1, 2 and 'books' is an ordinary column
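With as_index=True, the group keys become the index of the result, so df.loc[] must be given a key label such as 'bk1'.
With as_index=False, the group keys stay in an ordinary column and the result gets a default integer index, so df.loc[] must be given an integer label such as 0, 1, or 2.
In both cases df.iloc[] works, because it selects by position; it picks the same row either way (with as_index=False the row also contains the books column).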
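The three points above can be checked in one short script. This is a sketch that reuses the books table from this section; the variable names `by_label` and `flat` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "books": ["bk1", "bk1", "bk1", "bk2", "bk2", "bk3"],
    "price": [12, 12, 12, 15, 15, 17],
})

by_label = df.groupby("books", as_index=True).sum()   # index: bk1, bk2, bk3
flat = df.groupby("books", as_index=False).sum()      # index: 0, 1, 2

# Label-based access uses whatever the index contains.
print(by_label.loc["bk1", "price"])
print(flat.loc[0, "price"])

# Using the other kind of label fails with a KeyError in both directions.
for frame, label in [(by_label, 0), (flat, "bk1")]:
    try:
        frame.loc[label]
    except KeyError:
        print(f"KeyError for label {label!r}")

# Position-based access works on both results.
print(by_label.iloc[0]["price"], flat.iloc[0]["price"])
```

If the label index is needed only temporarily, `by_label.reset_index()` moves the keys back into a regular books column.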
agg vs filter vs transform
A detailed tutorial is available at the link.
Basic usage
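# df is assumed here to be a table with 'day' and 'total_bill' columns
# (such as seaborn's tips dataset), not the books table above
df.groupby('day')['total_bill'].mean()
df.groupby('day').filter(lambda x: x['total_bill'].mean() > 20)
df.groupby('day')['total_bill'].transform(lambda x: x / x.mean())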
When to use each
- if we want to get a single value for each group -> use aggregate()
- if we want to get a subset of the input rows -> use filter()
- if we want to get a new value for each input row -> use transform()
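The difference is easiest to see in the shape of each result. The sketch below applies the three calls from the basic usage lines to a small invented table; `tips` and its values are assumptions made for illustration and only mimic the 'day' and 'total_bill' columns used above.

```python
import pandas as pd

# Invented data with the same column names as the usage lines above.
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Sat", "Sat", "Sun"],
    "total_bill": [10.0, 20.0, 30.0, 20.0, 18.0],
})

# aggregate: one value per group (3 days -> 3 values).
per_day = tips.groupby("day")["total_bill"].mean()

# filter: keeps or drops whole groups, returning the original rows
# of every group whose mean bill exceeds 20 (only Sat here).
busy = tips.groupby("day").filter(lambda g: g["total_bill"].mean() > 20)

# transform: one value per input row, aligned with the original index
# (each bill divided by the mean bill of its day).
relative = tips.groupby("day")["total_bill"].transform(lambda x: x / x.mean())

print(per_day, busy, relative, sep="\n\n")
```

Because the transform result keeps the original index, it can be assigned straight back as a new column, for example `tips["relative_bill"] = relative`.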
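This article shows how to group data with pandas groupby and apply follow-up operations: counting, averaging, filtering subsets, and transforming data. Using survival rates in the Titanic dataset as a worked example, it demonstrates groupby and the effect of its as_index parameter, and compares when to use agg, filter, and transform.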