grouping and sorting
group
The example data in this article comes from Kaggle; the data format is shown below.
It is a table of wine reviews.
Using groupby
reviews.groupby('points').points.count() # returns a Series: the number of reviews for each points value
>>>
points
80 397
81 692
...
99 33
100 19
Name: points, Length: 21, dtype: int64
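As a minimal sketch of what this count does, the snippet below builds a small toy table (hypothetical data standing in for the Kaggle reviews) and checks that grouping by points and counting gives the same numbers as value_counts() sorted by index.

```python
import pandas as pd

# Hypothetical toy data standing in for the Kaggle wine reviews table.
reviews = pd.DataFrame({
    "points": [80, 80, 81, 99, 81, 80],
    "price": [5.0, 7.0, 5.0, 44.0, 9.0, 6.0],
})

# Group rows by their points value, then count the rows in each group.
counts = reviews.groupby("points").points.count()

# value_counts() produces the same counts; sort_index() orders them by points.
via_value_counts = reviews.points.value_counts().sort_index()

print(counts.to_dict())
print(via_value_counts.to_dict())
```

Both approaches should agree; groupby becomes the more useful one once a different column (such as price) is summarized per group.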
Code:
reviews.groupby('points').price.min() # returns a Series: the lowest price within each points group
>>>
points
80 5.0
81 5.0
...
99 44.0
100 80.0
Name: price, Length: 21, dtype: float64
reviews.groupby('winery').apply(lambda df: df.title.iloc[0]) # apply receives each group's DataFrame; here it returns the first title per winery
>>>
winery
1+1=3 1+1=3 NV Rosé Sparkling (Cava)
10 Knots 10 Knots 2010 Viognier (Paso Robles)
...
àMaurice àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
# Select the highest-rated wine for each country and province
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
The result is shown in the figure below.
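The same "best wine per country and province" selection can also be sketched without apply, by asking each group for the row label of its highest score and then looking those rows up. This is only an illustration on hypothetical toy data, not the article's own method.

```python
import pandas as pd

# Hypothetical toy data; column names follow the wine reviews table.
reviews = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Argentina", "Uruguay", "Uruguay"],
    "province": ["Mendoza Province", "Mendoza Province", "Other",
                 "San Jose", "San Jose"],
    "title": ["A", "B", "C", "D", "E"],
    "points": [85, 92, 88, 90, 87],
})

# For each (country, province) group, get the row label of the highest score.
best_rows = reviews.groupby(["country", "province"]).points.idxmax()

# Look those rows up and index the result by the group keys.
best = reviews.loc[best_rows.to_numpy()].set_index(["country", "province"])
print(best)
```

When several wines in a group share the top score, idxmax keeps the first one it encounters.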
Using groupby together with agg
reviews.groupby(['country']).price.agg([len, min, max])
>>>
            len   min    max
country
Argentina  3800   4.0  230.0
Armenia       2  14.0   15.0
...         ...   ...    ...
Ukraine      14   6.0   13.0
Uruguay     109  10.0  130.0
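One detail worth checking when combining several aggregations is how missing prices are treated. The sketch below uses hypothetical toy data and string aggregation names: "size" counts every row in a group (like len), while "count" skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
reviews = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Argentina", "Armenia", "Armenia"],
    "price": [4.0, 230.0, np.nan, 14.0, 15.0],
})

# Run several aggregations at once; each name becomes a column in the result.
summary = reviews.groupby("country").price.agg(["size", "count", "min", "max"])
print(summary)
```

Here Argentina has three rows but only two known prices, so its size and count columns differ.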
## Multi-level index
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
>>>
                             len
country   province
Argentina Mendoza Province  3264
          Other              536
...       ...                ...
Uruguay   San Jose             3
          Uruguay             24
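Grouping by two columns produces a result indexed by a MultiIndex. A minimal sketch on hypothetical toy data shows two common ways to work with it: selecting one outer level with loc, and flattening back to ordinary columns with reset_index().

```python
import pandas as pd

# Hypothetical toy data; column names follow the wine reviews table.
reviews = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Argentina", "Uruguay", "Uruguay"],
    "province": ["Mendoza Province", "Mendoza Province", "Other",
                 "San Jose", "Uruguay"],
    "description": ["d1", "d2", "d3", "d4", "d5"],
})

countries_reviewed = reviews.groupby(["country", "province"]).description.agg([len])

# The index has two levels: country (outer) and province (inner).
is_multi = isinstance(countries_reviewed.index, pd.MultiIndex)

# Selecting on the outer level leaves a frame indexed by province only.
argentina = countries_reviewed.loc["Argentina"]

# reset_index() turns both index levels back into regular columns.
flat = countries_reviewed.reset_index()
print(flat)
```

Flattening is often convenient before sorting or merging, since most DataFrame operations are simpler on plain columns.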