Pandas provides the groupby() and agg() methods for grouped data analysis, such as counts, means, and min/max values. This article uses pandas.DataFrame.groupby() to group rows by one or more columns and then compute the size and count of each group. groupby() splits rows that share the same values into groups, and aggregation functions are then applied to each group.
Preparing the sample data
First create the sample data, then run the example code to verify the results. The sample data contains the columns Courses, Fee, Duration, and Discount:
# Create a DataFrame.
import pandas as pd
technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","Hadoop","Hadoop","Spark","Python","Spark"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration': ['30days','50days','55days','40days','55days','35days','30days','40days','40days'],
    'Discount': [1000,2300,1000,1200,2500,1200,1400,1000,1200]
}
df = pd.DataFrame(technologies)
print(df)
The output is as follows:
Courses Fee Duration Discount
0 Spark 22000 30days 1000
1 PySpark 25000 50days 2300
2 Hadoop 23000 55days 1000
3 Python 24000 40days 1200
4 Hadoop 26000 55days 2500
5 Hadoop 25000 35days 1200
6 Spark 25000 30days 1400
7 Python 22000 40days 1000
8 Spark 25000 40days 1200
Grouping and calling size()
Group the rows with DataFrame.groupby(), then count each group with size(). On a GroupBy object, size() returns a Series holding the number of rows in each group. (By contrast, the size attribute of a plain Series is its number of rows, and for a DataFrame it is the product rows * columns.)
# Use DataFrameGroupBy.size()
df2 = df.groupby(['Courses', 'Duration']).size().reset_index(name='counts')
print(df2)
The output is as follows; reset_index(name='counts') sets the header of the counts column:
Courses Duration counts
0 Hadoop 35days 1
1 Hadoop 55days 2
2 PySpark 50days 1
3 Python 40days 2
4 Spark 30days 2
5 Spark 40days 1
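If you want the largest groups listed first, you can chain sort_values() after reset_index(). A minimal sketch using the same sample data as above:

```python
# Recreate the sample data from above.
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
                "Hadoop", "Spark", "Python", "Spark"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 25000],
    'Duration': ['30days', '50days', '55days', '40days', '55days',
                 '35days', '30days', '40days', '40days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1200, 1400, 1000, 1200],
})

# Count rows per group and list the largest groups first.
counts = (df.groupby(['Courses', 'Duration'])
            .size()
            .reset_index(name='counts')
            .sort_values('counts', ascending=False))
print(counts)
```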
Grouping with agg()
The groupby method can also be combined with agg(). agg() applies one or more aggregation functions along an axis of the DataFrame; axis defaults to 0, i.e. along the index (down the rows). Below we compute the count and mean for each group:
# Pandas .groupby() and using agg() Method.
df2 = df.groupby(['Courses', 'Duration']).agg(['mean', 'count'])
print(df2)
The output is:
Fee Discount
mean count mean count
Courses Duration
Hadoop 35days 25000.0 1 1200.0 1
55days 24500.0 2 1750.0 2
PySpark 50days 25000.0 1 2300.0 1
Python 40days 23000.0 2 1100.0 2
Spark 30days 23500.0 2 1200.0 2
40days 25000.0 1 1200.0 1
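agg() also accepts a dict mapping column names to functions, so different columns can receive different aggregations. A sketch using the sample data above (the chosen functions are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
                "Hadoop", "Spark", "Python", "Spark"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 25000],
    'Duration': ['30days', '50days', '55days', '40days', '55days',
                 '35days', '30days', '40days', '40days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1200, 1400, 1000, 1200],
})

# Different aggregations per column: mean of Fee, min and max of Discount.
df2 = df.groupby(['Courses', 'Duration']).agg({'Fee': 'mean',
                                               'Discount': ['min', 'max']})
print(df2)
```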
Getting statistics with groupby().describe()
This section uses groupby().describe() to get summary statistics for each group. describe() can be applied to any variable or group; its output depends on whether it is applied to numeric or string columns:
# Pandas Get Statistics Using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
print(df2)
The output is as follows:
count mean std ... 50% 75% max
Courses Duration ...
Hadoop 35days 1.0 1200.0 NaN ... 1200.0 1200.0 1200.0
55days 2.0 1750.0 1060.660172 ... 1750.0 2125.0 2500.0
PySpark 50days 1.0 2300.0 NaN ... 2300.0 2300.0 2300.0
Python 40days 2.0 1100.0 141.421356 ... 1100.0 1150.0 1200.0
Spark 30days 2.0 1200.0 282.842712 ... 1200.0 1300.0 1400.0
40days 1.0 1200.0 NaN ... 1200.0 1200.0 1200.0
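If you need quantiles other than the default 25%/50%/75%, describe() accepts a percentiles argument, which (per the pandas docs) the groupby version accepts as well. A hedged sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
                "Hadoop", "Spark", "Python", "Spark"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 25000],
    'Duration': ['30days', '50days', '55days', '40days', '55days',
                 '35days', '30days', '40days', '40days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1200, 1400, 1000, 1200],
})

# Request the 10th and 90th percentiles (the median is always included).
df2 = df.groupby(['Courses', 'Duration'])['Discount'].describe(percentiles=[0.1, 0.9])
print(df2)
```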
We can also select specific statistics from the describe() output:
# Pandas DataFrame.groupby() and describe() function.
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()[['count', 'mean']]
print(df2)
The output is:
count mean
Courses Duration
Hadoop 35days 1.0 1200.0
55days 2.0 1750.0
PySpark 50days 1.0 2300.0
Python 40days 2.0 1100.0
Spark 30days 2.0 1200.0
40days 1.0 1200.0
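When only a couple of statistics are needed, computing them directly with agg() avoids building the full describe() table first. A sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
                "Hadoop", "Spark", "Python", "Spark"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 25000],
    'Duration': ['30days', '50days', '55days', '40days', '55days',
                 '35days', '30days', '40days', '40days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1200, 1400, 1000, 1200],
})

# Compute only count and mean per group, skipping the other describe() stats.
stats = df.groupby(['Courses', 'Duration'])['Discount'].agg(['count', 'mean'])
print(stats)
```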
Counting with DataFrame.value_counts
DataFrame.value_counts can also produce per-group counts. value_counts returns a Series containing the count of each unique combination of values:
# Get statistics by DataFrame.value_counts.
df2=df.value_counts(subset=['Courses', 'Duration'])
print(df2)
The output is:
Courses Duration
Hadoop 55days 2
Python 40days 2
Spark 30days 2
Hadoop 35days 1
PySpark 50days 1
Spark 40days 1
dtype: int64
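value_counts() can also return proportions instead of raw counts via normalize=True. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
                "Hadoop", "Spark", "Python", "Spark"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 25000],
    'Duration': ['30days', '50days', '55days', '40days', '55days',
                 '35days', '30days', '40days', '40days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1200, 1400, 1000, 1200],
})

# Each group's share of the total rows instead of its raw count.
props = df.value_counts(subset=['Courses', 'Duration'], normalize=True)
print(props)
```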
Other examples
The following example computes multiple statistics while flattening the MultiIndex column headers into single-level names and restoring the group keys as regular columns:
df2 = df.groupby(['Courses', 'Duration']).agg(['mean', 'count'])
# Flatten the MultiIndex columns, e.g. ('Fee', 'mean') -> 'Fee mean'.
df2.columns = [' '.join(str(i) for i in col) for col in df2.columns]
# Turn the group keys back into regular columns.
df2.reset_index(inplace=True)
print(df2)
The output is:
Courses Duration Fee mean Fee count Discount mean Discount count
0 Hadoop 35days 25000 1 1200 1
1 Hadoop 55days 24500 2 1750 2
2 PySpark 50days 25000 1 2300 1
3 Python 40days 23000 2 1100 2
4 Spark 30days 23500 2 1200 2
5 Spark 40days 25000 1 1200 1
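Since pandas 0.25, named aggregation achieves the same flat result in one step, with caller-chosen column names and no manual join. A sketch (the output column names here are my own choice):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
                "Hadoop", "Spark", "Python", "Spark"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 25000],
    'Duration': ['30days', '50days', '55days', '40days', '55days',
                 '35days', '30days', '40days', '40days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1200, 1400, 1000, 1200],
})

# Named aggregation: each keyword becomes a flat output column.
df2 = (df.groupby(['Courses', 'Duration'], as_index=False)
         .agg(fee_mean=('Fee', 'mean'),
              fee_count=('Fee', 'count'),
              discount_mean=('Discount', 'mean'),
              discount_count=('Discount', 'count')))
print(df2)
```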
Summary
In this article we learned how to group by single and multiple columns with groupby, applying aggregation functions such as count, size, max, min, mean, and describe to each group.