Intro
df.groupby() is often used together with other functions for data analysis or preprocessing. This blog documents the functions that I think work really well with groupby(). I will keep adding to it as I learn more applications of df.groupby().
Applications
- df.groupby().agg()

Example df:
# example df
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5],
'rev':[100,330,220,111,567]
})
df
     brand style  rating  rev
0  Yum Yum   cup     4.0  100
1  Yum Yum   cup     4.0  330
2  Indomie   cup     3.5  220
3  Indomie  pack    15.0  111
4  Indomie  pack     5.0  567
Using df.groupby().agg(), we can easily calculate statistics of the variable we are studying within the data groups we assigned. For example, if we are trying to calculate the total revenue for each brand, we can write:

df.groupby('brand').agg({'rev': 'sum'})
         rev
brand
Indomie  898
Yum Yum  430
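As a side note, the same per-group sum can also be obtained by selecting the column before aggregating, which returns a Series instead of a DataFrame. A minimal sketch using the toy DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'rev': [100, 330, 220, 111, 567],
})

# selecting the column first gives a Series of per-group sums
rev_by_brand = df.groupby('brand')['rev'].sum()
print(rev_by_brand)
```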
Even better, we can apply more than one aggregation method to more than one column inside agg(), so that we can obtain several pieces of information at once:
import numpy as np

# for each brand, calculate several statistics at once
df.groupby('brand').agg({'rating': ['mean', 'max', np.min],
                         'rev': ['sum', 'mean']})
           rating            rev
             mean   max amin  sum        mean
brand
Indomie  7.833333  15.0  3.5  898  299.333333
Yum Yum  4.000000   4.0  4.0  430  215.000000
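Newer pandas versions (0.25+) also support named aggregation, which lets you name the output columns directly instead of getting a MultiIndex header. A sketch using the same toy DataFrame (the output column names here are my own choices):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'rating': [4, 4, 3.5, 15, 5],
    'rev': [100, 330, 220, 111, 567],
})

# named aggregation: output_column=(input_column, aggfunc)
stats = df.groupby('brand').agg(
    rating_mean=('rating', 'mean'),
    rating_max=('rating', 'max'),
    rev_total=('rev', 'sum'),
)
print(stats)
```

This produces a flat set of columns (rating_mean, rating_max, rev_total), which is often easier to work with downstream than the two-level header above.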
- df.groupby().size()
❗️Notice that the size() here is pandas' groupby version (DataFrameGroupBy.size()), not the one in Numpy.
Using groupby().size() can be extremely helpful for learning the distribution of a data set across different groups. It becomes especially powerful when grouping our data by multiple variables, as it outputs a highly organized summary of the distribution.
Let's see an example from a df I have been studying recently:
enrollee_id  city      city_development_index  gender  relevent_experience      enrolled_university  education_level  major_discipline  experience  company_size  company_type    last_new_job  training_hours  target
8949         city_103  0.920                   Male    Has relevent experience  no_enrollment        Graduate         STEM              >20         NaN           NaN             1             36              1.0
29725        city_40   0.776                   Male    No relevent experience   no_enrollment        Graduate         STEM              15          50-99         Pvt Ltd         >4            47              0.0
11561        city_21   0.624                   NaN     No relevent experience   Full time course     Graduate         STEM              5           NaN           NaN             never         83              0.0
33241        city_115  0.789                   NaN     No relevent experience   NaN                  Graduate         Business Degree   <1          NaN           Pvt Ltd         never         52              1.0
666          city_162  0.767                   Male    Has relevent experience  no_enrollment        Masters          STEM              >20         50-99         Funded Startup  4             8               0.0
While studying it, I realized that many examples report their gender as 'Other'. I wondered whether this phenomenon is related to education level, so I decided to group those examples by the education-related variables ('enrolled_university', 'education_level', 'major_discipline') and wrote the following code:
# check 'other' distribution on enrolled_university, education_level, major_discipline
group = df[df['gender']=='Other'].groupby(['enrolled_university','education_level','major_discipline']).size()
group
In the output, we can clearly see how the gender == 'Other' examples are distributed across the education groups. The resulting Series can also be used very easily in data visualization later.
enrolled_university  education_level  major_discipline
Full time course     Graduate         Arts                2
                                      Humanities          2
                                      No Major            1
                                      Other               2
                                      STEM               23
                     Masters          STEM                7
                     Phd              Other               1
Part time course     Graduate         Arts                1
                                      Humanities          1
                                      STEM                9
                     Masters          STEM                1
no_enrollment        Graduate         Arts                4
                                      Business Degree     3
                                      Humanities          4
                                      No Major            2
                                      Other               4
                                      STEM               66
                     Masters          Humanities          2
                                      No Major            1
                                      STEM               17
                     Phd              STEM                1
dtype: int64
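To show how such a Series feeds into later analysis, here is a minimal sketch on a made-up miniature of the dataset (the column names match the real one, but the rows are invented for illustration). unstack() pivots one index level into columns, which is a convenient shape for plotting:

```python
import pandas as pd

# hypothetical mini version of the HR dataset, just enough to group on
df = pd.DataFrame({
    'gender': ['Other', 'Other', 'Other', 'Male'],
    'enrolled_university': ['no_enrollment', 'no_enrollment',
                            'Full time course', 'no_enrollment'],
    'education_level': ['Graduate', 'Graduate', 'Masters', 'Graduate'],
    'major_discipline': ['STEM', 'Arts', 'STEM', 'STEM'],
})

group = (df[df['gender'] == 'Other']
         .groupby(['enrolled_university', 'education_level',
                   'major_discipline'])
         .size())

# pivot the last index level into columns for inspection or plotting
table = group.unstack('major_discipline', fill_value=0)
print(table)
```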
- df.groupby() + Clustering Imputation
Basically, df.groupby() can be followed by any user-defined function. Clustering imputation is one of the most useful applications of combining df.groupby() with a self-defined function.
Here is the example df:
rank  ranking-institution-title              ranking-institution-title href                      location       overall  resources  engagement  outcomes  enviroment
1     Harvard University                     https://www.timeshighereducation.com/world-uni...  Massachusetts  91.9     29.8       15.6        39.5      7.0
2     Columbia University                    https://www.timeshighereducation.com/world-uni...  New York       90.6     27.0       16.7        39.0      7.8
3     Massachusetts Institute of Technology  https://www.timeshighereducation.com/world-uni...  Massachusetts  90.4     29.2       15.8        38.2      7.2
3     Stanford University                    https://www.timeshighereducation.com/world-uni...  California     90.4     26.2       17.4        38.9      7.9
Several columns are numeric, and they all contain missing values:
df.isna().sum()
rank                                0
ranking-institution-title           0
ranking-institution-title href      0
location                            0
overall                             0
resources                         470
engagement                        341
outcomes                          482
enviroment                        296
dtype: int64
To fill these NaN values, I decided to apply clustering imputation. In this case, I grouped the data by 'overall', calculated each group's mean for the corresponding variables, and filled the NaNs in each group with those means.
# replace multiple columns in the original dataset with imputed data;
# select the columns before transform so non-numeric columns are left untouched
cols = ['resources', 'engagement', 'outcomes', 'enviroment']
df[cols] = df.groupby('overall')[cols].transform(lambda x: x.fillna(x.mean()))
A general syntax can be:

df[['col1','col2',...]] = df.groupby('col_group')[['col1','col2',...]].transform(lambda x: x.fillna(x.mean()))
And we can check that the missing values are all gone:
rank                              0
ranking-institution-title         0
ranking-institution-title href    0
location                          0
overall                           0
resources                         0
engagement                        0
outcomes                          0
enviroment                        0
dtype: int64
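The whole pattern can be verified end-to-end on a tiny made-up DataFrame (the column names and values here are invented for illustration, loosely mirroring the ranking data above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'overall': [90, 90, 80, 80],
    'resources': [30.0, np.nan, 20.0, 22.0],
    'engagement': [np.nan, 16.0, 14.0, np.nan],
})

cols = ['resources', 'engagement']
# within each 'overall' group, fill NaNs with that group's column mean
df[cols] = df.groupby('overall')[cols].transform(lambda x: x.fillna(x.mean()))
print(df)
```

Each NaN is replaced by the mean of the non-missing values in its own 'overall' group, so no information leaks between groups.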
References:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.size.html