Translated from: Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?
I have a data frame df and I use several columns from it to groupby:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()
In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many values were used to compute each mean. For example, in the first group there are 8 values, in the second one 10, and so on.
In short: How do I get group-wise statistics for a dataframe?
Reply #1
Reference: https://stackoom.com/question/1JKnk/使用pandas-GroupBy获取每个组的统计信息-例如计数-均值等
Reply #2
On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
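As a minimal, self-contained sketch of the call above: the small frame here is made up for illustration and is not the asker's data. It shows that passing a list to agg produces one output column per (source column, statistic) pair.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'C', 'C'],
    'col2': ['B', 'B', 'B', 'D', 'D'],
    'col3': [1.0, 2.0, 3.0, 4.0, 6.0],
    'col4': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Passing a list to agg applies every listed function to every non-key column.
stats = (df[['col1', 'col2', 'col3', 'col4']]
         .groupby(['col1', 'col2'])
         .agg(['mean', 'count']))

# The columns form a two-level MultiIndex: (source column, statistic).
print(stats.columns.tolist())
print(stats)
```

Note that 'count' here is computed per column and only counts non-null values, so it can differ between col3 and col4 if either contains NaN.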
Reply #3
Quick Answer:
The simplest way to get row counts per group is by calling .size(), which returns a Series:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame (instead of a Series), so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17
First let's use .size() to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64
Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
   ...:  .groupby(['col1', 'col2'])
   ...:  .agg({
   ...:      'col3': ['mean', 'count'],
   ...:      'col4': ['median', 'min', 'count']
   ...:  }))
Out[5]:
            col4                  col3
          median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per-column basis.
To gain more control over the output, I usually split the statistics into individual aggregations that I then combine using join. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...:
Out[6]:
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
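A possible alternative to the join pattern, not part of the original answer: pandas 0.25 and later support named aggregation, where each keyword passed to agg is an output column name mapped to an (input column, function) pair. The sketch below uses made-up data and assumes a pandas version with that feature.

```python
import pandas as pd

# Hypothetical toy data standing in for the example frame.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C', 'C', 'C'],
    'col2': ['B', 'B', 'D', 'D', 'D'],
    'col3': [1.0, 3.0, 2.0, 4.0, 6.0],
    'col4': [5.0, 7.0, 1.0, 2.0, 9.0],
})

# Named aggregation: output_name=(input_column, function).
# 'size' counts all rows in the group, including rows where the column is NaN.
summary = (df
           .groupby(['col1', 'col2'])
           .agg(counts=('col3', 'size'),
                col3_mean=('col3', 'mean'),
                col4_median=('col4', 'median'),
                col4_min=('col4', 'min'))
           .reset_index())

print(summary)
```

This yields flat, explicitly named columns in one call, which avoids both the nested column labels and the chain of joins.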
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: keys = np.array([
   ...:     ['A', 'B'],
   ...:     ['A', 'B'],
   ...:     ['A', 'B'],
   ...:     ['A', 'B'],
   ...:     ['C', 'D'],
   ...:     ['C', 'D'],
   ...:     ['C', 'D'],
   ...:     ['E', 'F'],
   ...:     ['E', 'F'],
   ...:     ['G', 'H']
   ...: ])
   ...:
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys, np.random.randn(10, 4).round(2)]),
   ...:     columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...:
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to look at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas will drop NaN entries in the mean calculation without telling you about it.
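To make the caveat concrete, here is a small sketch with made-up data (the names key and val are illustrative only). It puts size, count, and mean side by side for a column that contains one NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value in group 'A' is missing.
df = pd.DataFrame({
    'key': ['A', 'A', 'A', 'B', 'B'],
    'val': [1.0, np.nan, 3.0, 4.0, 6.0],
})

gb = df.groupby('key')['val']

report = pd.DataFrame({
    'rows': gb.size(),       # all rows in the group, NaN included
    'non_null': gb.count(),  # only the values that enter the mean
    'mean': gb.mean(),       # NaN is skipped silently
})

print(report)
```

For group A, rows is 3 but non_null is 2, and the mean is computed from the two non-null values only.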
Reply #4
We can easily do it by using groupby and count. But we should remember to use reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
Reply #5
One Function to Rule Them All: GroupBy.describe
Returns count, mean, std, and other useful statistics per group.
df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()
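# Setup
import numpy as np
import pandas as pd
from IPython.display import display

np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

# 'display.precision' is the full option name; the bare 'precision' pattern
# can match more than one option in newer pandas versions.
with pd.option_context('display.precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())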
          count  mean   std   min   25%   50%   75%   max
A   B
bar one     1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three   1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two     1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one     2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three   1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two     2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87
To get specific statistics, just select them:
df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]
          count      mean
A   B
bar one     1.0  0.400157
    three   1.0  2.240893
    two     1.0 -0.977278
foo one     2.0  1.357070
    three   1.0 -0.151357
    two     2.0  1.423148
describe works for multiple columns (change ['C'] to [['C', 'D']], or remove it altogether, and see what happens; the result is a DataFrame with MultiIndexed columns).
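As a rough sketch of what the multi-column result looks like and how to pull statistics back out of it, consider the following. The data is made up and deterministic, so it does not match the random frame in this answer.

```python
import pandas as pd

# Hypothetical toy data, deterministic so the numbers are easy to check.
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': [1.0, 3.0, 5.0, 7.0],
    'D': [10.0, 30.0, 50.0, 70.0],
})

desc = df.groupby(['A', 'B'])[['C', 'D']].describe()

# Columns are (source column, statistic) pairs, i.e. a two-level MultiIndex.
print(desc.columns.nlevels)

# Select one statistic for one column with a tuple ...
c_mean = desc[('C', 'mean')]

# ... or the same statistic across all columns with a cross-section
# on the second column level.
means = desc.xs('mean', axis=1, level=1)
print(means)
```

The cross-section drops the statistic level, so means has plain columns C and D.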
You also get different statistics for string data. Here's an example:
df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)
with pd.option_context('display.precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))
              C                                                  D
          count  mean      std   min   25%   50%   75%   max count unique top freq
A   B
bar one    14.0  0.40 5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24 4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98 0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43 4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15 0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49 4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15
For more information, see the documentation.
Reply #6
Create a group object and call methods on it, as in the example below:
grp = df.groupby(['col1', 'col2', 'col3'])
grp.max()
grp.mean()
grp.describe()
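A small runnable sketch of the same idea, using made-up data. One deliberate difference, which is my assumption rather than the answer's code: it groups on the two key columns only, so that col3 is left over to be aggregated.

```python
import pandas as pd

# Hypothetical toy data, not from the question.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
})

# Build the GroupBy object once and reuse it for several statistics.
grp = df.groupby(['col1', 'col2'])

group_max = grp.max()    # per-group maximum of each non-key column
group_mean = grp.mean()  # per-group mean of each non-key column
group_size = grp.size()  # number of rows in each group

print(group_max)
print(group_mean)
print(group_size)
```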