df.groupby() Applications | Python Learning | Updating

This post shows how to use pandas' df.groupby().agg() to group data and compute statistics, such as total revenue per brand. It also discusses the role of df.groupby().size() in understanding data distributions, and how combining groupby() with user-defined functions (such as clustering-based imputation of missing values) helps with data preprocessing.


Intro

df.groupby() is often used together with other functions for data analysis or preprocessing. This blog documents the functions that I think work really well with groupby(). I will keep adding to it as I learn more applications of df.groupby().

Applications

  • df.groupby().agg()

    • Example df:

# example df
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
    'rev':[100,330,220,111,567]
})
df
     brand style  rating  rev
0  Yum Yum   cup     4.0  100
1  Yum Yum   cup     4.0  330
2  Indomie   cup     3.5  220
3  Indomie  pack    15.0  111
4  Indomie  pack     5.0  567

Using df.groupby().agg(), we can easily calculate statistics of the variable we are studying within the data groups we assigned. For example, to calculate the total revenue for each brand, we can write:

df.groupby('brand').agg({'rev': 'sum'})

         rev
brand
Indomie  898
Yum Yum  430

Even better, we can incorporate more than one aggregation method into agg() across more than one column, so that we can obtain several pieces of information at once:

import numpy as np

# for each brand, calculate several statistics at once
# (np.min shows up as the 'amin' column label in the output)
df.groupby('brand').agg({'rating': ['mean', 'max', np.min],
                         'rev': ['sum', 'mean']})

           rating                  rev
             mean   max  amin      sum        mean
brand
Indomie  7.833333  15.0   3.5      898  299.333333
Yum Yum  4.000000   4.0   4.0      430  215.000000
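The MultiIndex columns that this dict-of-lists style produces can be awkward to work with downstream. As a sketch of an alternative (the output column names here are my own choice, not from the original post), pandas also supports named aggregation, which yields flat column names directly:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
    'rev': [100, 330, 220, 111, 567],
})

# named aggregation: each output column is (source column, aggregation)
stats = df.groupby('brand').agg(
    rating_mean=('rating', 'mean'),
    rating_max=('rating', 'max'),
    rev_sum=('rev', 'sum'),
)
print(stats)
```

The results are the same numbers as above, just with single-level column names such as 'rev_sum' instead of the ('rev', 'sum') pair.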

  • df.groupby().size()

❗️Notice that the size() here is the pandas GroupBy.size method, which counts the rows in each group, not the size in NumPy.

Using groupby().size() can be extremely helpful for learning the distribution across different groups of a data set. And it becomes especially powerful when grouping our data by multiple variables: it outputs a highly organized summary of the distribution.

Let's see an example from a df I have been studying recently:

enrollee_id | city     | city_development_index | gender | relevent_experience     | enrolled_university | education_level | major_discipline | experience | company_size | company_type   | last_new_job | training_hours | target
8949        | city_103 | 0.920                  | Male   | Has relevent experience | no_enrollment       | Graduate        | STEM             | >20        | NaN          | NaN            | 1            | 36             | 1.0
29725       | city_40  | 0.776                  | Male   | No relevent experience  | no_enrollment       | Graduate        | STEM             | 15         | 50-99        | Pvt Ltd        | >4           | 47             | 0.0
11561       | city_21  | 0.624                  | NaN    | No relevent experience  | Full time course    | Graduate        | STEM             | 5          | NaN          | NaN            | never        | 83             | 0.0
33241       | city_115 | 0.789                  | NaN    | No relevent experience  | NaN                 | Graduate        | Business Degree  | <1         | NaN          | Pvt Ltd        | never        | 52             | 1.0
666         | city_162 | 0.767                  | Male   | Has relevent experience | no_enrollment       | Masters         | STEM             | >20        | 50-99        | Funded Startup | 4            | 8              | 0.0

While studying this dataset, I realized that many examples report their gender as 'Other'. I wondered whether this phenomenon is related to education level, so I decided to group the examples by the education-related variables ('enrolled_university', 'education_level', 'major_discipline') and wrote the following code:

# check 'other' distribution on enrolled_university, education_level, major_discipline
group = df[df['gender']=='Other'].groupby(['enrolled_university','education_level','major_discipline']).size()
group

In the output, we can clearly see the distribution of gender == 'Other' across the different education groups. And the resulting group object can easily be reused for data visualization later.

enrolled_university  education_level  major_discipline
Full time course     Graduate         Arts                 2
                                      Humanities           2
                                      No Major             1
                                      Other                2
                                      STEM                23
                     Masters          STEM                 7
                     Phd              Other                1
Part time course     Graduate         Arts                 1
                                      Humanities           1
                                      STEM                 9
                     Masters          STEM                 1
no_enrollment        Graduate         Arts                 4
                                      Business Degree      3
                                      Humanities           4
                                      No Major             2
                                      Other                4
                                      STEM                66
                     Masters          Humanities           2
                                      No Major             1
                                      STEM                17
                     Phd              STEM                 1
dtype: int64
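On the visualization point: the Series that size() returns carries a MultiIndex, which most plotting APIs don't accept directly. Here is a minimal sketch (using a made-up toy frame, since the original survey dataset isn't included in the post) of flattening it into a tidy DataFrame with reset_index:

```python
import pandas as pd

# toy stand-in for the survey data (values are illustrative only)
df = pd.DataFrame({
    'enrolled_university': ['no_enrollment', 'no_enrollment', 'Full time course',
                            'Full time course', 'no_enrollment'],
    'education_level': ['Graduate', 'Graduate', 'Graduate', 'Masters', 'Masters'],
})

group = df.groupby(['enrolled_university', 'education_level']).size()

# reset_index(name=...) turns the MultiIndex Series into flat, plottable columns
counts = group.reset_index(name='count')
print(counts)
```

Each row of counts now holds one (enrolled_university, education_level) combination plus its row count, ready for a bar chart or a pivot.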
  • df.groupby() + Clustering Imputation

Basically, df.groupby() can be followed by any user-defined function. Clustering imputation is one of the most useful applications of combining df.groupby() with self-defined functions.

Here is the example df:        

rank | ranking-institution-title             | ranking-institution-title href                     | location      | overall | resources | engagement | outcomes | enviroment
1    | Harvard University                    | https://www.timeshighereducation.com/world-uni... | Massachusetts | 91.9    | 29.8      | 15.6       | 39.5     | 7.0
2    | Columbia University                   | https://www.timeshighereducation.com/world-uni... | New York      | 90.6    | 27.0      | 16.7       | 39.0     | 7.8
3    | Massachusetts Institute of Technology | https://www.timeshighereducation.com/world-uni... | Massachusetts | 90.4    | 29.2      | 15.8       | 38.2     | 7.2
3    | Stanford University                   | https://www.timeshighereducation.com/world-uni... | California    | 90.4    | 26.2      | 17.4       | 38.9     | 7.9

There are several numeric columns, and they all contain some missing values:

df.isna().sum()
rank                                0
ranking-institution-title           0
ranking-institution-title href      0
location                            0
overall                             0
resources                         470
engagement                        341
outcomes                          482
enviroment                        296
dtype: int64

To fill in these NaN values, I decided to apply clustering imputation. In this case, I clustered the data by 'overall' and filled the missing values with group means. That is, I grouped the data by 'overall', calculated the mean of each corresponding variable, and filled those means into the NaNs within each group.

# replacing multiple columns in the original dataset with imputed data
# selecting the columns before transform keeps the shapes aligned and
# avoids applying mean() to non-numeric columns
cols = ['resources', 'engagement', 'outcomes', 'enviroment']
df[cols] = df.groupby('overall')[cols].transform(lambda x: x.fillna(x.mean()))

A general syntax can be:

df[['col1','col2',...]] = df.groupby('col_group')[['col1','col2',...]].transform(lambda x: x.fillna(x.mean()))
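As a self-contained sketch of this pattern (on toy data, not the rankings dataset above), each NaN gets replaced by the mean of its own group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'overall': ['A', 'A', 'A', 'B', 'B'],
    'resources': [10.0, np.nan, 20.0, 5.0, np.nan],
    'engagement': [1.0, 2.0, np.nan, np.nan, 4.0],
})

cols = ['resources', 'engagement']
# fill each NaN with the mean of its 'overall' group
df[cols] = df.groupby('overall')[cols].transform(lambda x: x.fillna(x.mean()))
print(df)
```

For group 'A', the missing resources value becomes mean(10, 20) = 15, and the missing engagement value becomes mean(1, 2) = 1.5; group 'B' is filled from its own means the same way.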

And running df.isna().sum() again confirms that the missing values are all gone:

rank                              0
ranking-institution-title         0
ranking-institution-title href    0
location                          0
overall                           0
resources                         0
engagement                        0
outcomes                          0
enviroment                        0
dtype: int64

References:

My own GitHub page

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.size.html

Pandas Groupby Count Using Size() and Count() Method
