Document version: 0.20.3
These examples were written for Python 3.4; earlier Python versions may need small adjustments to the code.
pandas (pd) and NumPy (np) are the only two packages imported by default; every other package is imported explicitly so that new users can see where it comes from.
If any part of the translation is inaccurate, corrections are welcome.
The examples in this document are distilled and summarized by the author from classic questions asked on Stack Overflow and GitHub.
Missing values
Reverse a time series and forward-fill, so that each missing value is replaced by the value before it in the reversed order (i.e. the value that follows it in time). df.index[::-1] reverses df's index: the empty slice :: selects the whole index, and the step of -1 walks it from back to front.
In [82]: df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A'))
In [83]: df.loc[df.index[3], 'A'] = np.nan
In [84]: df
Out[84]:
A
2013-08-01 -1.054874
2013-08-02 -0.179642
2013-08-05 0.639589
2013-08-06 NaN
2013-08-07 1.906684
2013-08-08 0.104050
In [85]: df.reindex(df.index[::-1]).ffill()
Out[85]:
A
2013-08-08 0.104050
2013-08-07 1.906684
2013-08-06 1.906684
2013-08-05 0.639589
2013-08-02 -0.179642
2013-08-01 -1.054874
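As a side note (a minimal alternative sketch, not part of the original example): if the goal is simply to fill each NaN with the value that follows it in time, bfill() on the original, unreversed frame gives the same result.

df.bfill()  # fills 2013-08-06 with 1.906684 without reversing the index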
Grouping
Unlike agg, apply receives the entire sub-DataFrame for each group, so it can use all of its columns in the computation.
In [86]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
....: 'size': list('SSMMMLL'),
....: 'weight': [8, 10, 11, 1, 20, 12, 12],
....: 'adult' : [False] * 5 + [True] * 2}); df
....:
Out[86]:
adult animal size weight
0 False cat S 8
1 False dog S 10
2 False cat M 11
3 False fish M 1
4 False dog M 20
5 True cat L 12
6 True cat L 12
# List the size of the animal with the highest weight in each group.
In [87]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[87]:
animal
cat L
dog M
fish M
dtype: object
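For contrast, a minimal sketch of the agg counterpart (illustrative only): agg works on one column at a time, so it can return the largest weight per animal but has no access to the corresponding size column.

df.groupby('animal')['weight'].agg('max')  # cat 12, dog 20, fish 1 -- no way to look up 'size' here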
Using get_group
In [88]: gb = df.groupby(['animal'])
In [89]: gb.get_group('cat')
Out[89]:
adult animal size weight
0 False cat S 8
2 False cat M 11
5 True cat L 12
6 True cat L 12
After grouping, apply a different operation to each of the categories within a column (here, a different weight multiplier for each size).
In [90]: def GrowUp(x):
....: avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
....: avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
....: avg_weight += sum(x[x['size'] == 'L'].weight)
....: avg_weight /= len(x)
....: return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult'])
....:
In [91]: expected_df = gb.apply(GrowUp)
In [92]: expected_df
Out[92]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
Combining expanding() with apply (this example uses functools.reduce, so functools must be imported first: import functools).
In [93]: S = pd.Series([i / 100.0 for i in range(1,11)])
In [94]: def CumRet(x,y):
....: return x * (1 + y)
....:
In [95]: def Red(x):
....: return functools.reduce(CumRet,x,1.0)
....:
In [96]: S.expanding().apply(Red)
Out[96]:
0 1.010000
1 1.030200
2 1.061106
3 1.103550
4 1.158728
5 1.228251
6 1.314229
7 1.419367
8 1.547110
9 1.701821
dtype: float64
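Equivalently (a minimal vectorized sketch, not part of the original example), the same cumulative return can be computed without apply:

(1 + S).cumprod()  # same values as the expanding apply above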
After grouping, replace certain values (here, the negative ones) with the mean of their group (note the difference between apply and transform here).
In [97]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})
In [98]: gb = df.groupby('A')
In [99]: def replace(g):
....: mask = g < 0
....: g.loc[mask] = g[~mask].mean()
....: return g
....:
In [100]: gb.transform(replace)
Out[100]:
B
0 1.0
1 1.0
2 1.0
3 2.0
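To make the apply/transform difference concrete (a minimal illustrative sketch using the same gb): transform always returns a result aligned with the original index, whereas apply may reduce each group to a single value.

gb['B'].transform('mean')   # length 4, indexed 0..3 like df
gb['B'].apply(np.mean)      # length 2, indexed by the group keys in column A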
Sort groups by their aggregated data
In [101]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
.....: 'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
.....: 'flag': [False, True] * 3})
.....:
In [102]: code_groups = df.groupby('code')
In [103]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
In [104]: sorted_df = df.loc[agg_n_sort_order.index]
In [105]: sorted_df
Out[105]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
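The same ordering can also be produced in a single expression (a minimal equivalent sketch, assuming the same df):

df.loc[df.groupby('code')['data'].transform(sum).sort_values().index]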
Create multiple aggregated columns
In [106]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')
In [107]: ts = pd.Series(data = list(range(10)), index = rng)
In [108]: def MyCust(x):
.....: if len(x) > 2:
.....: return x[1] * 1.234
.....: return pd.NaT
.....:
In [109]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}
In [110]: ts.resample("5min").apply(mhc)
Out[110]:
Custom 2014-10-07 00:00:00 1.234
2014-10-07 00:05:00 NaT
2014-10-07 00:10:00 7.404
2014-10-07 00:15:00 NaT
Max 2014-10-07 00:00:00 2
2014-10-07 00:05:00 4
2014-10-07 00:10:00 7
2014-10-07 00:15:00 9
Mean 2014-10-07 00:00:00 1
2014-10-07 00:05:00 3.5
2014-10-07 00:10:00 6
2014-10-07 00:15:00 8.5
dtype: object
In [111]: ts
Out[111]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
Create a count column and add it to the DataFrame
In [112]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
.....: 'Value': [100, 150, 50, 50]}); df
.....:
Out[112]:
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [113]: df['Counts'] = df.groupby(['Color']).transform(len)
In [114]: df
Out[114]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
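A shorter equivalent (a minimal sketch, illustrative only) that counts on a single selected column instead of transforming the whole grouped frame:

df.groupby('Color')['Value'].transform('count')  # 3, 3, 3, 1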
Group by the index and shift the values within each group
In [115]: df = pd.DataFrame(
.....: {u'line_race': [10, 10, 8, 10, 10, 8],
.....: u'beyer': [99, 102, 103, 103, 88, 100]},
.....: index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
.....: u'Paynter', u'Paynter', u'Paynter']); df
.....:
Out[115]:
beyer line_race
Last Gunfighter 99 10
Last Gunfighter 102 10
Last Gunfighter 103 8
Paynter 103 10
Paynter 88 10
Paynter 100 8
In [116]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
In [117]: df
Out[117]:
beyer line_race beyer_shifted
Last Gunfighter 99 10 NaN
Last Gunfighter 102 10 99.0
Last Gunfighter 103 8 102.0
Paynter 103 10 NaN
Paynter 88 10 103.0
Paynter 100 8 88.0
Select the row with the maximum value from each group
In [118]: df = pd.DataFrame({'host':['other','other','that','this','this'],
.....: 'service':['mail','web','mail','mail','web'],
.....: 'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service'])
.....:
In [119]: mask = df.groupby(level=0).agg('idxmax')
In [120]: df_count = df.loc[mask['no']].reset_index()
In [121]: df_count
Out[121]:
host service no
0 other web 2
1 that mail 1
2 this mail 2
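A slightly shorter equivalent (a minimal sketch, assuming the same df): idxmax on the grouped 'no' column returns the (host, service) index label of each group's maximum, which can be passed straight to .loc.

df.loc[df.groupby(level=0)['no'].idxmax()].reset_index()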
Grouping like Python's itertools.groupby, where consecutive identical values form one group. The expression (df.A != df.A.shift()).cumsum() increments a counter every time the value changes, so each run of identical values gets its own group id.
In [122]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
In [123]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
Out[123]:
{1: Int64Index([0], dtype='int64'),
2: Int64Index([1], dtype='int64'),
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5], dtype='int64'),
5: Int64Index([6], dtype='int64'),
6: Int64Index([7, 8], dtype='int64')}
In [124]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
Out[124]:
0 0
1 1
2 0
3 1
4 2
5 3
6 0
7 1
8 2
Name: A, dtype: int64