Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?

License: CC 4.0 BY-SA

Original link: https://oldbug.net/q/1JKnk/Get-statistics-for-each-group-such-as-count-mean-etc-using-pandas-GroupBy

Tags: #python #pandas #dataframe #group-by #pandas-groupby

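This article shows how to use pandas GroupBy to get per-group statistics for a DataFrame, such as counts and means. Combining GroupBy with agg() or count() gives the number of rows and other statistics for each group, and the sample code shows how.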

Translated from: Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

I have a data frame df and I use several columns from it to group by:

df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()

This almost gives me the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I would also like to know how many values were used to compute them. For example, the first group has 8 values, the second has 10, and so on.

In short: How do I get group-wise statistics for a dataframe?

Reply #1

Reference: https://stackoom.com/question/1JKnk/使用pandas-GroupBy获取每个组的统计信息-例如计数-均值等

Reply #2

On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
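
As a minimal, self-contained sketch of the list form of agg, the snippet below runs the same call on a small made-up frame. The data values are illustrative assumptions, not taken from the question; only the column names follow it.

```python
import pandas as pd

# Hypothetical toy data; column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "C"],
    "col2": ["B", "B", "D"],
    "col3": [1.0, 3.0, 5.0],
    "col4": [2.0, 4.0, 6.0],
})

# Passing a list to agg applies every listed aggregation to every column.
stats = (
    df[["col1", "col2", "col3", "col4"]]
    .groupby(["col1", "col2"])
    .agg(["mean", "count"])
)

# The columns form a two-level MultiIndex: (original column, statistic).
print(stats)

# A single cell is addressed by a row tuple and a column tuple.
print(stats.loc[("A", "B"), ("col3", "mean")])
```

Each original column gets one sub-column per statistic, so the count is repeated for every aggregated column.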

Reply #3

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()

Usually you want this result as a DataFrame (instead of a Series), so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')

If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.

Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1

Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per-column basis.

To gain more control over the output, I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
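
As an alternative sketch, assuming pandas 0.25 or later, named aggregation can build a flat table like the one above in a single agg call. Each keyword becomes an output column and maps to a (source column, function) pair. The toy data below is an illustrative assumption, not the frame used in this answer.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the example.
df = pd.DataFrame({
    "col1": ["A", "A", "C"],
    "col2": ["B", "B", "D"],
    "col3": [1.0, 3.0, 5.0],
    "col4": [2.0, 4.0, 6.0],
})

# Named aggregation: output_name=(source_column, aggregation).
summary = (
    df.groupby(["col1", "col2"])
      .agg(
          counts=("col3", "size"),        # rows per group, NaN included
          col3_mean=("col3", "mean"),
          col4_median=("col4", "median"),
          col4_min=("col4", "min"),
      )
      .reset_index()
)

print(summary)
```

This avoids both the nested column labels and the chain of joins, at the cost of naming every output column explicitly.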

Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...:

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas will drop NaN entries in the mean calculation without telling you about it.
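
The point can be illustrated with a small sketch on made-up data (the values and the output column names rows, non_null and mean are assumptions for illustration): size counts every row in a group, while count only counts the non-null values that actually enter the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group A has one missing value in col3.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "C"],
    "col3": [1.0, np.nan, 3.0, 5.0],
})

gb = df.groupby("col1")["col3"]

result = pd.DataFrame({
    "rows": gb.size(),       # every row in the group, NaN included
    "non_null": gb.count(),  # only the values that enter the mean
    "mean": gb.mean(),       # NaN entries are skipped silently
})

print(result)
```

For group A the two counts differ (3 rows, 2 non-null values), which is exactly the case where a single row count would overstate how many values the mean is based on.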

Reply #4

We can do this easily with groupby and count, followed by reset_index() to turn the group keys back into regular columns. Note that count reports the number of non-null values in each column.

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

Reply #5

One Function to Rule Them All: `GroupBy.describe`

Returns count, mean, std, and other useful statistics per group.

df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()

# Setup
import numpy as np
import pandas as pd
from IPython.display import display

np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

# Use the full option name; the short 'precision' pattern can match
# more than one option in recent pandas versions.
with pd.option_context('display.precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them:

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe works for multiple columns (change ['C'] to [['C', 'D']], or remove the selection altogether, and see what happens; the result is a DataFrame with MultiIndex columns).
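
A short sketch of the multi-column case, reusing the setup frame from above. The variable names desc, c_stats and means are illustrative assumptions. The first column level holds the original column and the second holds the statistic.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

# Describing two columns yields MultiIndex columns: (column, statistic).
desc = df.groupby(['A', 'B'])[['C', 'D']].describe()

# Selecting the first level gives the familiar single-column layout.
c_stats = desc['C'][['count', 'mean']]

# xs pulls one statistic across all described columns.
means = desc.xs('mean', axis=1, level=1)

print(c_stats)
print(means)
```

Selecting on the first level or taking a cross-section with xs is usually enough to get back to a flat table without renaming columns by hand.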

You also get different statistics for string data. Here's an example:

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

# Full option name, since the short 'precision' pattern can be ambiguous in recent pandas.
with pd.option_context('display.precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the pandas documentation for GroupBy.describe.

Reply #6

Create a groupby object and call methods on it, as in the example below:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe()