Intro
df.groupby() is often used together with other functions for data analysis or preprocessing. This blog documents the functions that I think work really well with groupby(). I will keep adding to it as I learn more applications of df.groupby().
Applications
- df.groupby().agg()

Example df:
# example df
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5],
'rev':[100,330,220,111,567]
})
df
     brand style  rating  rev
0  Yum Yum   cup     4.0  100
1  Yum Yum   cup     4.0  330
2  Indomie   cup     3.5  220
3  Indomie  pack    15.0  111
4  Indomie  pack     5.0  567
Using df.groupby().agg(), we can easily calculate statistics of the variable we are studying within the data groups we assigned. For example, if we are trying to calculate the total revenue for each brand, we can write:

df.groupby('brand').agg({'rev': 'sum'})
         rev
brand
Indomie  898
Yum Yum  430
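As a side note, the same per-group sum can also be obtained by selecting the column before aggregating, which returns a Series instead of a DataFrame. A minimal sketch using the toy DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'rev': [100, 330, 220, 111, 567],
})

# selecting the column first gives a Series of per-group sums
rev_by_brand = df.groupby('brand')['rev'].sum()
print(rev_by_brand)
```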
Even better, we can apply more than one aggregation method to more than one column inside agg(), so that we can obtain several pieces of information at once:
import numpy as np

# for each brand, calculate several statistics at once
df.groupby('brand').agg({'rating': ['mean', 'max', np.min],
                         'rev': ['sum', 'mean']})
           rating            rev
             mean   max amin  sum        mean
brand
Indomie  7.833333  15.0  3.5  898  299.333333
Yum Yum  4.000000   4.0  4.0  430  215.000000
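Newer pandas versions (0.25+) also support named aggregation, which lets you name the output columns directly instead of getting a MultiIndex header. A sketch using the same toy DataFrame (the output column names here are my own choices):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'rating': [4, 4, 3.5, 15, 5],
    'rev': [100, 330, 220, 111, 567],
})

# named aggregation: output_column=(input_column, aggfunc)
stats = df.groupby('brand').agg(
    rating_mean=('rating', 'mean'),
    rating_max=('rating', 'max'),
    rev_total=('rev', 'sum'),
)
print(stats)
```

This produces a flat set of columns (rating_mean, rating_max, rev_total), which is often easier to work with downstream than the two-level header above.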
- df.groupby().size()
❗️Notice that the size() here is pandas' groupby version (DataFrameGroupBy.size()), not the one in Numpy.
Using groupby().size() can be extremely helpful for learning the distribution of a data set across different groups. It becomes especially powerful when grouping our data by multiple variables, as it outputs a highly organized summary of the distribution.
Let's see an example from a df I have been studying recently:
enrollee_id  city      city_development_index  gender  relevent_experience      enrolled_university  education_level  major_discipline  experience  company_size  company_type    last_new_job  training_hours  target
8949         city_103  0.920                   Male    Has relevent experience  no_enrollment        Graduate         STEM              >20         NaN           NaN             1             36              1.0
29725        city_40   0.776                   Male    No relevent experience   no_enrollment        Graduate         STEM              15          50-99         Pvt Ltd         >4            47              0.0
11561        city_21   0.624                   NaN     No relevent experience   Full time course     Graduate         STEM              5           NaN           NaN             never         83              0.0
33241        city_115  0.789                   NaN     No relevent experience   NaN                  Graduate         Business Degree   <1          NaN           Pvt Ltd         never         52              1.0
666          city_162  0.767                   Male    Has relevent experience  no_enrollment        Masters          STEM              >20         50-99         Funded Startup  4             8               0.0
While studying it, I realized that many examples report their gender as 'Other'. I wondered whether this phenomenon is related to education level, so I decided to group those examples by the education-related variables ('enrolled_university', 'education_level', 'major_discipline') and wrote the following code:
# check 'other' distribution on enrolled_university, education_level, major_discipline
group = df[df['gender']=='Other'].groupby(['enrolled_university','education_level','major_discipline']).size()
group
In the output, we can clearly see how the gender == 'Other' examples are distributed across the education groups. The resulting Series can also be used very easily in data visualization later.
enrolled_university  education_level  major_discipline
Full time course     Graduate         Arts                2
                                      Humanities          2
                                      No Major            1
                                      Other               2
                                      STEM               23
                     Masters          STEM                7
                     Phd              Other               1
Part time course     Graduate         Arts                1
                                      Humanities          1
                                      STEM                9
                     Masters          STEM                1
no_enrollment        Graduate         Arts                4
                                      Business Degree     3
                                      Humanities          4
                                      No Major            2
                                      Other               4
                                      STEM               66
                     Masters          Humanities          2
                                      No Major            1
                                      STEM               17
                     Phd              STEM                1
dtype: int64
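To show how such a Series feeds into later analysis, here is a minimal sketch on a made-up miniature of the dataset (the column names match the real one, but the rows are invented for illustration). unstack() pivots one index level into columns, which is a convenient shape for plotting:

```python
import pandas as pd

# hypothetical mini version of the HR dataset, just enough to group on
df = pd.DataFrame({
    'gender': ['Other', 'Other', 'Other', 'Male'],
    'enrolled_university': ['no_enrollment', 'no_enrollment',
                            'Full time course', 'no_enrollment'],
    'education_level': ['Graduate', 'Graduate', 'Masters', 'Graduate'],
    'major_discipline': ['STEM', 'Arts', 'STEM', 'STEM'],
})

group = (df[df['gender'] == 'Other']
         .groupby(['enrolled_university', 'education_level',
                   'major_discipline'])
         .size())

# pivot the last index level into columns for inspection or plotting
table = group.unstack('major_discipline', fill_value=0)
print(table)
```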
- df.groupby() + Clustering Imputation
Basically, df.groupby() can be followed by any user-defined function. Clustering imputation is one of the most useful applications of combining df.groupby() with a self-defined function.
Here is the example df:
rank  ranking-institution-title              ranking-institution-title href                      location       overall  resources  engagement  outcomes  enviroment
1     Harvard University                     https://www.timeshighereducation.com/world-uni...  Massachusetts  91.9     29.8       15.6        39.5      7.0
2     Columbia University                    https://www.timeshighereducation.com/world-uni...  New York       90.6     27.0       16.7        39.0      7.8
3     Massachusetts Institute of Technology  https://www.timeshighereducation.com/world-uni...  Massachusetts  90.4     29.2       15.8        38.2      7.2
3     Stanford University                    https://www.timeshighereducation.com/world-uni...  California     90.4     26.2       17.4        38.9      7.9
Several columns are numeric, and they all contain missing values:
df.isna().sum()
rank                                0
ranking-institution-title           0
ranking-institution-title href      0
location                            0
overall                             0
resources                         470
engagement                        341
outcomes                          482
enviroment                        296
dtype: int64
To fill these NaN values, I decided to apply clustering imputation. In this case, I grouped the data by 'overall', calculated each group's mean for the corresponding variables, and filled the NaNs in each group with those means.
# replace multiple columns in the original dataset with imputed data;
# select the columns before transform so non-numeric columns are left untouched
cols = ['resources', 'engagement', 'outcomes', 'enviroment']
df[cols] = df.groupby('overall')[cols].transform(lambda x: x.fillna(x.mean()))
A general syntax can be:

df[['col1','col2',...]] = df.groupby('col_group')[['col1','col2',...]].transform(lambda x: x.fillna(x.mean()))
And we can check that the missing values are all gone:
rank                              0
ranking-institution-title         0
ranking-institution-title href    0
location                          0
overall                           0
resources                         0
engagement                        0
outcomes                          0
enviroment                        0
dtype: int64
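The whole pattern can be verified end-to-end on a tiny made-up DataFrame (the column names and values here are invented for illustration, loosely mirroring the ranking data above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'overall': [90, 90, 80, 80],
    'resources': [30.0, np.nan, 20.0, 22.0],
    'engagement': [np.nan, 16.0, 14.0, np.nan],
})

cols = ['resources', 'engagement']
# within each 'overall' group, fill NaNs with that group's column mean
df[cols] = df.groupby('overall')[cols].transform(lambda x: x.fillna(x.mean()))
print(df)
```

Each NaN is replaced by the mean of the non-missing values in its own 'overall' group, so no information leaks between groups.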
References:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.size.html