Document version: 0.20.3
These examples were written for Python 3.4; earlier Python versions may need small adjustments to the code.
pandas (pd) and NumPy (np) are the only two packages imported by default; every other package is imported explicitly so that new users can see where it comes from.
If any part of the translation is inaccurate, corrections are welcome.
The examples in this document are distilled and summarized by the author from classic questions asked on Stack Overflow and GitHub.
Missing values
Reverse a time series and forward-fill, so that each missing value is replaced by the value before it in the reversed order (i.e. the value that follows it in time). df.index[::-1] reverses df's index: the empty slice :: selects the whole index, and the step of -1 walks it from back to front.
In [82]: df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A'))
In [83]: df.loc[df.index[3], 'A'] = np.nan
In [84]: df
Out[84]:
A
2013-08-01 -1.054874
2013-08-02 -0.179642
2013-08-05 0.639589
2013-08-06 NaN
2013-08-07 1.906684
2013-08-08 0.104050
In [85]: df.reindex(df.index[::-1]).ffill()
Out[85]:
A
2013-08-08 0.104050
2013-08-07 1.906684
2013-08-06 1.906684
2013-08-05 0.639589
2013-08-02 -0.179642
2013-08-01 -1.054874
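As a side note (a minimal alternative sketch, not part of the original example): if the goal is simply to fill each NaN with the value that follows it in time, bfill() on the original, unreversed frame gives the same result.

df.bfill()  # fills 2013-08-06 with 1.906684 without reversing the index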
Grouping
Unlike agg, apply receives the entire sub-DataFrame for each group, so it can use all of its columns in the computation.
In [86]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
....: 'size': list('SSMMMLL'),
....: 'weight': [8, 10, 11, 1, 20, 12, 12],
....: 'adult' : [False] * 5 + [True] * 2}); df
....:
Out[86]:
adult animal size weight
0 False cat S 8
1 False dog S 10
2 False cat M 11
3 False fish M 1
4 False dog M 20
5 True cat L 12
6 True cat L 12
# List the size of the animal with the highest weight in each group.
In [87]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[87]:
animal
cat L
dog M
fish M
dtype: object
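For contrast, a minimal sketch of the agg counterpart (illustrative only): agg works on one column at a time, so it can return the largest weight per animal but has no access to the corresponding size column.

df.groupby('animal')['weight'].agg('max')  # cat 12, dog 20, fish 1 -- no way to look up 'size' here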
Using get_group
In [88]: gb = df.groupby(['animal'])
In [89]: gb.get_group('cat')
Out[89]:
adult animal size weight
0 False cat S 8
2 False cat M 11
5 True cat L 12
6 True cat L 12
After grouping, apply a different operation to each of the categories within a column (here, a different weight multiplier for each size).
In [90]: def GrowUp(x):
....: avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
....: avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
....: avg_weight += sum(x[x['size'] == 'L'].weight)
....: avg_weight /= len(x)
....: return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult'])
....:
In [91]: expected_df = gb.apply(GrowUp)
In [92]: expected_df
Out[92]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
Combining expanding() with apply (this example uses functools.reduce, so functools must be imported first: import functools).
In [93]: S = pd.Series([i / 100.0 for i in range(1,11)])
In [94]: def CumRet(x,y):
....: return x * (1 + y)
....:
In [95]: def Red(x):
....: return functools.reduce(CumRet,x,1.0)
....:
In [96]: S.expanding().apply(Red)
Out[96]:
0 1.010000
1 1.030200
2 1.061106
3 1.103550
4 1.158728
5 1.228251
6 1.314229
7 1.419367
8 1.547110
9 1.701821
dtype: float64
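Equivalently (a minimal vectorized sketch, not part of the original example), the same cumulative return can be computed without apply:

(1 + S).cumprod()  # same values as the expanding apply above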
After grouping, replace certain values (here, the negative ones) with the mean of their group (note the difference between apply and transform here).
In [97]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})
In [98]: gb = df.groupby('A')
In [99]: def replace(g):
....: mask = g < 0
....: g.loc[mask] = g[~mask].mean()
....: return g
....:
In [100]: gb.transform(replace)
Out[100]:
B
0 1.0
1 1.0
2 1.0
3 2.0
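To make the apply/transform difference concrete (a minimal illustrative sketch using the same gb): transform always returns a result aligned with the original index, whereas apply may reduce each group to a single value.

gb['B'].transform('mean')   # length 4, indexed 0..3 like df
gb['B'].apply(np.mean)      # length 2, indexed by the group keys in column A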
Sort groups by their aggregated data
In [101]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
.....: 'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
.....: 'flag': [False, True] * 3})
.....:
In [102]: code_groups = df.groupby('code')
In [103]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
In [104]: sorted_df = df.loc[agg_n_sort_order.index]
In [105]: sorted_df
Out[105]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
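The same ordering can also be produced in a single expression (a minimal equivalent sketch, assuming the same df):

df.loc[df.groupby('code')['data'].transform(sum).sort_values().index]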
Create multiple aggregated columns
In [106]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')
In [107]: ts = pd.Series(data = list(range(10)), index = rng)
In [108]: def MyCust(x):
.....: if len(x) > 2:
.....: return x[1] * 1.234
.....: return pd.NaT
.....:
In [109]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}
In [110]: ts.resample("5min").apply(mhc)
Out[110]:
Custom 2014-10-07 00:00:00 1.234
2014-10-07 00:05:00 NaT
2014-10-07 00:10:00 7.404
2014-10-07 00:15:00 NaT
Max 2014-10-07 00:00:00 2
2014-10-07 00:05:00 4
2014-10-07 00:10:00 7
2014-10-07 00:15:00 9
Mean 2014-10-07 00:00:00 1
2014-10-07 00:05:00 3.5
2014-10-07 00:10:00 6
2014-10-07 00:15:00 8.5
dtype: object
In [111]: ts
Out[111]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
Create a count column and add it to the DataFrame
In [112]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
.....: 'Value': [100, 150, 50, 50]}); df
.....:
Out[112]:
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [113]: df['Counts'] = df.groupby(['Color']).transform(len)
In [114]: df
Out[114]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
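A shorter equivalent (a minimal sketch, illustrative only) that counts on a single selected column instead of transforming the whole grouped frame:

df.groupby('Color')['Value'].transform('count')  # 3, 3, 3, 1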
Group by the index and shift the values within each group
In [115]: df = pd.DataFrame(
.....: {u'line_race': [10, 10, 8, 10, 10, 8],
.....: u'beyer': [99, 102, 103, 103, 88, 100]},
.....: index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
.....: u'Paynter', u'Paynter', u'Paynter']); df
.....:
Out[115]:
beyer line_race
Last Gunfighter 99 10
Last Gunfighter 102 10
Last Gunfighter 103 8
Paynter 103 10
Paynter 88 10
Paynter 100 8
In [116]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
In [117]: df
Out[117]:
beyer line_race beyer_shifted
Last Gunfighter 99 10 NaN
Last Gunfighter 102 10 99.0
Last Gunfighter 103 8 102.0
Paynter 103 10 NaN
Paynter 88 10 103.0
Paynter 100 8 88.0
Select the row with the maximum value from each group
In [118]: df = pd.DataFrame({'host':['other','other','that','this','this'],
.....: 'service':['mail','web','mail','mail','web'],
.....: 'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service'])
.....:
In [119]: mask = df.groupby(level=0).agg('idxmax')
In [120]: df_count = df.loc[mask['no']].reset_index()
In [121]: df_count
Out[121]:
host service no
0 other web 2
1 that mail 1
2 this mail 2
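A slightly shorter equivalent (a minimal sketch, assuming the same df): idxmax on the grouped 'no' column returns the (host, service) index label of each group's maximum, which can be passed straight to .loc.

df.loc[df.groupby(level=0)['no'].idxmax()].reset_index()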
Grouping like Python's itertools.groupby, where consecutive identical values form one group. The expression (df.A != df.A.shift()).cumsum() increments a counter every time the value changes, so each run of identical values gets its own group id.
In [122]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
In [123]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
Out[123]:
{1: Int64Index([0], dtype='int64'),
2: Int64Index([1], dtype='int64'),
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5], dtype='int64'),
5: Int64Index([6], dtype='int64'),
6: Int64Index([7, 8], dtype='int64')}
In [124]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
Out[124]:
0 0
1 1
2 0
3 1
4 2
5 3
6 0
7 1
8 2
Name: A, dtype: int64