Handling time series frequencies in pandas

License: CC 4.0 BY-SA

Tags: #python #pandas #data-processing

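This article covers pandas for time series analysis: generating date ranges, frequencies and date offsets, and shifting data. The focus is resampling, split into downsampling and upsampling: what to watch for when downsampling, how upsampling works, and an example of downsampling with groupby. It also touches on period arithmetic, converting between Timestamp and Period, and resampling periods.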

Reference: "Python for Data Analysis"

Generating date ranges

pd.date_range()

In [15]: rng = pd.date_range('2000-01-01', '2000-06-30', freq='BM')

In [16]: rng
Out[16]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30'],
              dtype='datetime64[ns]', freq='BM')

In [17]: pd.Series(np.random.randn(6), index=rng)
Out[17]:
2000-01-31    0.586341
2000-02-29   -0.439679
2000-03-31    0.853946
2000-04-28   -0.740858
2000-05-31   -0.114699
2000-06-30   -0.529631
Freq: BM, dtype: float64
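
Besides a start and an end, pd.date_range also accepts a start plus a number of periods. The following is a minimal sketch with made-up dates, using the 'D', '3D' and 'B' (business day) frequency aliases; it is an illustration, not part of the original session.

```python
import pandas as pd

# Five consecutive calendar days, defined by a start date and a count.
daily = pd.date_range(start='2000-01-01', periods=5, freq='D')

# Every third day between two endpoints; the end is included
# only because it happens to fall on the 3-day grid.
every_third = pd.date_range('2000-01-01', '2000-01-10', freq='3D')

# Business days only. 2000-01-01 is a Saturday, so the range
# starts on Monday 2000-01-03.
business = pd.date_range('2000-01-01', periods=3, freq='B')

print(daily)
print(every_third)
print(business)
```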

Frequencies and date offsets

from pandas.tseries.offsets import Hour, Minute
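
Offset objects such as Hour and Minute can be multiplied by an integer count, added together, and passed as the freq argument. A small sketch of that idea, with illustrative values:

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# Fixed-length ("tick") offsets can be combined by addition.
combined = Hour(2) + Minute(30)   # equivalent to 150 minutes

# An offset object can be used anywhere a frequency string is accepted.
rng = pd.date_range('2000-01-01', periods=3, freq=combined)

print(combined)
print(rng)
```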

Shifting data

ts.shift()
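
shift can be read in two ways: without freq it moves the values and leaves the index alone (introducing NaN at the edge), and with freq it moves the index and leaves the values alone. A minimal sketch on a made-up four-day series:

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=4, freq='D')
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=rng)

# Shift the values forward by one step; the index is unchanged
# and the first value becomes NaN.
lagged = ts.shift(1)

# Shift the timestamps forward by one day; no data is discarded.
moved = ts.shift(1, freq='D')

print(lagged)
print(moved)
```

A common use of the first form is computing period-over-period changes, for example ts / ts.shift(1) - 1.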

Periods and period arithmetic

The Period and PeriodIndex classes

pd.period_range(): creates a regular range of periods.

In [20]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
    ...: rng
    ...:
Out[20]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')
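
Period arithmetic itself is not shown in the session above, so here is a short sketch. Adding or subtracting an integer moves a Period by that many units of its frequency. The getattr fallback is an assumption made for portability: recent pandas versions return an offset object for the difference of two periods, while older ones returned a plain integer.

```python
import pandas as pd

p = pd.Period('2000-01', freq='M')

later = p + 2      # two months later
earlier = p - 1    # one month earlier

# Difference between two periods of the same frequency.
gap = pd.Period('2000-06', freq='M') - p
months_apart = getattr(gap, 'n', gap)   # offset object or int, depending on version

print(later, earlier, months_apart)
```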

Constructor:
pd.PeriodIndex()
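
As a sketch of the constructor, a PeriodIndex can be built from an array of strings plus a frequency. The quarter labels below are illustrative values only.

```python
import pandas as pd

# Quarterly periods for a fiscal year ending in December.
quarters = pd.PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], freq='Q-DEC')

print(quarters)
print(quarters[0].year, quarters[0].quarter)
```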

Period frequency conversion

ts.asfreq()
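
When converting a period to a higher frequency, the how argument decides whether the result sits at the start or the end of the original span. A minimal sketch with a made-up quarterly period and series:

```python
import pandas as pd

q = pd.Period('2000Q1', freq='Q-DEC')

first_month = q.asfreq('M', how='start')   # first month of the quarter
last_month = q.asfreq('M', how='end')      # last month of the quarter

# The same conversion applied to a Series indexed by periods.
ts = pd.Series([1.0, 2.0],
               index=pd.period_range('2000Q1', periods=2, freq='Q-DEC'))
monthly = ts.asfreq('M', how='end')

print(first_month, last_month)
print(monthly)
```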

Converting between Timestamp and Period

In [21]: rng = pd.date_range('2000-01-01', '2000-06-30', freq='M')

In [22]: rng
Out[22]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30'],
              dtype='datetime64[ns]', freq='M')

In [23]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')

In [24]: rng
Out[24]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')

to_period() and to_timestamp()

In [25]: rng = pd.date_range('2000-01-01', periods=3, freq='M')
    ...: ts = pd.Series(np.random.randn(3), index=rng)
    ...: ts
    ...:
Out[25]:
2000-01-31    0.455968
2000-02-29    1.720553
2000-03-31    1.695834
Freq: M, dtype: float64

In [26]: pts = ts.to_period()
    ...: pts
    ...:
Out[26]:
2000-01    0.455968
2000-02    1.720553
2000-03    1.695834
Freq: M, dtype: float64
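
The reverse direction is not shown above. The sketch below, on made-up daily data, converts timestamps to monthly periods and back again. Note that several timestamps can map to the same period, so the resulting PeriodIndex may contain duplicates.

```python
import pandas as pd

rng = pd.date_range('2000-01-29', periods=4, freq='D')
ts = pd.Series(range(4), index=rng)

# Timestamps -> monthly periods (three days fall in January, one in February).
pts = ts.to_period('M')

# Periods -> timestamps, placed at the start of each period.
back = pts.to_timestamp(how='start')

print(pts)
print(back)
```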

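Resampling and frequency conversion

Resampling is the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling; converting lower-frequency data to a higher frequency is called upsampling, which usually involves interpolation.

resample(): the workhorse function for frequency conversion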

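Parameter	Description
freq	String or DateOffset giving the target resampling frequency, e.g. 'M', '5min', Second(15)
how='mean'	Function name or array function used to produce the aggregated values. Defaults to 'mean'. Deprecated: FutureWarning: how in .resample() is deprecated the new syntax is .resample(…).mean()
axis=0	Axis to resample on
fill_method=None	How to interpolate when upsampling, e.g. 'ffill' or 'bfill'. No interpolation by default.
closed='right'	Which end of each interval is closed (inclusive) when downsampling
label='right'	How to label the aggregated result when downsampling: with the left or right bin edge
loffset=None	Time adjustment applied to the bin labels, e.g. '-1s' or Second(-1) to shift the aggregate labels one second earlier
limit=None	Maximum number of periods to fill when forward or backward filling
kind=None	Aggregate to periods ('period') or timestamps ('timestamp'); defaults to the index type of the time series
convention=None	When resampling periods, the convention ('start' or 'end') for converting low frequency to high frequency. Defaults to 'end'.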

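Downsampling

Using `resample`

As the examples below show, downsampling with resample requires two decisions:

Which side of each interval is closed.
How to label each aggregated bin: with the start of the interval or the end.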

In [27]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
    ...: ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ...: ts
    ...:
Out[27]:
2000-01-01   -0.189731
                ...
2000-04-09    0.283110
Freq: D, dtype: float64

In [28]: ts.resample('M').mean()
Out[28]:
2000-01-31   -0.019276
2000-02-29   -0.041192
2000-03-31   -0.214551
2000-04-30    0.411190
Freq: M, dtype: float64

In [29]: ts.resample('M', kind='period').mean()
Out[29]:
2000-01   -0.019276
2000-02   -0.041192
2000-03   -0.214551
2000-04    0.411190
Freq: M, dtype: float64

In [31]: rng = pd.date_range('2000-01-01', periods=12, freq='T')
    ...: ts = pd.Series(np.arange(12), index=rng)
    ...: ts
    ...:
Out[31]:
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

In [32]: ts.resample('5min', closed='right', label='right').sum()
Out[32]:
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

In [33]: ts.resample('5min', closed='right',
    ...:             label='right', loffset='-1s').sum()
Out[33]:
1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int32
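
To make the effect of closed and label easier to see, the sketch below runs the same 12-minute series through both settings. It also shows one way to get the one-second label shift without loffset, by adjusting the result index directly; this is offered as an assumption-level workaround for pandas releases in which the loffset argument is no longer accepted.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Left-closed bins [00:00, 00:05), [00:05, 00:10), ... labeled by their left edge.
left = ts.resample('5min', closed='left', label='left').sum()

# Right-closed bins (23:55, 00:00], (00:00, 00:05], ... labeled by their right edge.
right = ts.resample('5min', closed='right', label='right').sum()

# Move the labels one second earlier by editing the index of the result.
shifted = right.copy()
shifted.index = shifted.index - pd.Timedelta(seconds=1)

print(left)
print(right)
print(shifted)
```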

Downsampling with groupby

To group by month or by weekday, pass a function that accesses those fields on the time series index.

In [35]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
    ...: ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ...: ts

In [36]: ts.groupby(lambda x : x.month).mean()
Out[36]:
1   -0.126008
2    0.079132
3    0.026093
4    0.321457
dtype: float64

In [37]: ts.groupby(lambda x : x.dayofweek).mean()
Out[37]:
0    0.280289
1    0.174452
2    0.166102
3   -0.779489
4   -0.036195
5    0.086394
6    0.234831
dtype: float64
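
An equivalent way to write these groupings is to pass the index field itself instead of a function. The sketch below uses deterministic values (0 to 99) rather than random data so the group means are easy to check by hand.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100, dtype=float), index=rng)

# Group by calendar month (1 = January, ...).
by_month = ts.groupby(ts.index.month).mean()

# Group by day of the week (0 = Monday, ..., 6 = Sunday).
by_weekday = ts.groupby(ts.index.dayofweek).mean()

print(by_month)
print(by_weekday)
```

Unlike resample, this kind of grouping pools the same month or weekday across the whole series, so it is only a substitute for downsampling when the data does not span repeated months or years.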

Upsampling

In [38]: import pandas as pd
    ...: import numpy as np
    ...: frame = pd.DataFrame(np.random.randn(2, 4),
    ...:                      index=pd.date_range('1/1/2000', periods=2,
    ...:                                          freq='W-WED'),
    ...:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])
    ...: frame
    ...:
Out[38]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-12  1.075744  0.237922 -0.907699  0.592211

In [39]: df_daily = frame.resample('D').asfreq()
    ...: df_daily
    ...:
Out[39]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  1.075744  0.237922 -0.907699  0.592211

In [40]: frame.resample('D').ffill()
Out[40]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-06 -0.925525 -0.434350  1.037349 -1.532790
2000-01-07 -0.925525 -0.434350  1.037349 -1.532790
2000-01-08 -0.925525 -0.434350  1.037349 -1.532790
2000-01-09 -0.925525 -0.434350  1.037349 -1.532790
2000-01-10 -0.925525 -0.434350  1.037349 -1.532790
2000-01-11 -0.925525 -0.434350  1.037349 -1.532790
2000-01-12  1.075744  0.237922 -0.907699  0.592211
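
Forward filling can be capped so that an observation is only carried a limited number of steps, and linear interpolation is another option for filling the gaps. A minimal sketch on a one-column frame with made-up values:

```python
import pandas as pd

frame = pd.DataFrame(
    {'Colorado': [1.0, 2.0]},
    index=pd.date_range('2000-01-05', periods=2, freq='W-WED'),
)

# Carry each weekly observation forward by at most two days.
filled = frame.resample('D').ffill(limit=2)

# Fill the gaps by linear interpolation between the weekly observations.
interpolated = frame.resample('D').interpolate()

print(filled)
print(interpolated)
```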

# Previously: frame.resample('D', how='mean')

In [41]: df_daily = frame.resample('D').mean()
    ...: df_daily
    ...:
Out[41]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  1.075744  0.237922 -0.907699  0.592211

Resampling with periods

In [42]: frame = pd.DataFrame(np.random.randn(24, 4),
    ...:                      index=pd.period_range('1-2000', '12-2001',
    ...:                                            freq='M'),
    ...:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])
    ...: frame[:5]
    ...: annual_frame = frame.resample('A-DEC').mean()
    ...: annual_frame
    ...:
Out[42]:
      Colorado     Texas  New York      Ohio
2000  0.442672  0.104870 -0.067043 -0.128942
2001 -0.263757 -0.399865 -0.423485  0.026256

In [43]: annual_frame.resample('Q-DEC', convention='end').ffill()
Out[43]:
        Colorado     Texas  New York      Ohio
2000Q4  0.442672  0.104870 -0.067043 -0.128942
2001Q1  0.442672  0.104870 -0.067043 -0.128942
2001Q2  0.442672  0.104870 -0.067043 -0.128942
2001Q3  0.442672  0.104870 -0.067043 -0.128942
2001Q4 -0.263757 -0.399865 -0.423485  0.026256
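
The annual means can also be obtained without resample, by grouping on the year field of the PeriodIndex. This is an alternative sketch, not the method used in the session above, and it uses deterministic values so the result can be verified by hand.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(48, dtype=float).reshape(24, 2),
    index=pd.period_range('2000-01', '2001-12', freq='M'),
    columns=['Colorado', 'Texas'],
)

# Group the 24 monthly rows by calendar year and average each column.
annual = frame.groupby(frame.index.year).mean()

print(annual)
```

The result here is indexed by integer years rather than by annual periods, which is the main practical difference from resampling.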