Handling Time Series Frequencies in pandas

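This article covers the use of pandas for time series analysis: generating date ranges, frequencies and date offsets, and shifting data. The focus is resampling, both downsampling and upsampling: what to watch for when downsampling, how upsampling works, and an example of downsampling with groupby. It also touches on period arithmetic, conversion between Timestamp and Period, and resampling with periods.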


Source: Python for Data Analysis

Generating Date Ranges

pd.date_range()

In [15]: rng = pd.date_range('2000-01-01', '2000-06-30', freq='BM')

In [16]: rng
Out[16]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30'],
              dtype='datetime64[ns]', freq='BM')

In [17]: pd.Series(np.random.randn(6), index=rng)
Out[17]:
2000-01-31    0.586341
2000-02-29   -0.439679
2000-03-31    0.853946
2000-04-28   -0.740858
2000-05-31   -0.114699
2000-06-30   -0.529631
Freq: BM, dtype: float64
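pd.date_range can be called with a start and an end, or with one endpoint plus a number of periods. A minimal sketch with a daily frequency (the dates are chosen only for illustration):

```python
import pandas as pd

# Start and end given: every calendar day in the closed interval.
by_bounds = pd.date_range(start='2000-01-01', end='2000-01-10', freq='D')

# Start and number of periods given: the end is computed from the frequency.
by_count = pd.date_range(start='2000-01-01', periods=10, freq='D')

print(by_bounds.equals(by_count))
print(len(by_bounds))
```

Both calls describe the same ten days, so the two indexes compare equal.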

Frequencies and Date Offsets

from pandas.tseries.offsets import Hour, Minute
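A short sketch of how these offset objects might be used: they can be combined by addition, passed as the freq of a date range, and added to a Timestamp. The specific values are illustrative only.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# Fixed-length offsets can be added together; this one is 90 minutes long.
offset = Hour(1) + Minute(30)

# Use the combined offset as the step of a date range.
rng = pd.date_range('2000-01-01', periods=4, freq=offset)

# Adding an offset to a Timestamp moves it forward in time.
moved = pd.Timestamp('2000-01-01') + Hour(2)

print(offset)
print(list(rng))
print(moved)
```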

Shifting Data

ts.shift()
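shift() behaves differently depending on whether a freq is passed. Without one, the values move and the index stays put; with one, the index moves and the values stay put. A minimal sketch on a made-up four-day series:

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=4, freq='D')
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=rng)

# Shift the values: the index is unchanged and the first two slots become NaN.
shifted_values = ts.shift(2)

# Shift the index: the values are unchanged and each timestamp moves by 2 days.
shifted_index = ts.shift(2, freq='D')

# A common use of shift: percent change from one observation to the next.
pct_change = ts / ts.shift(1) - 1

print(shifted_values)
print(shifted_index)
print(pct_change)
```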

Periods and Period Arithmetic

The Period and PeriodIndex classes

pd.period_range(): creates a regular range of periods.

In [20]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
    ...: rng
    ...:
Out[20]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')

Constructor:
pd.PeriodIndex()
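A small sketch of period arithmetic and of the PeriodIndex constructor. The period values are arbitrary examples.

```python
import pandas as pd

# A Period stands for a whole span of time, here the month of January 2000.
p = pd.Period('2000-01', freq='M')

# Adding an integer shifts the period by that many units of its frequency.
later = p + 2

# A PeriodIndex built directly from strings with an explicit frequency.
idx = pd.PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], freq='Q-DEC')

print(later)
print(idx)
```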

Period Frequency Conversion

ts.asfreq()
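When a low-frequency period is converted to a higher frequency, the how argument chooses which sub-period to keep. A minimal sketch with a quarterly period and a quarterly series (the values are made up):

```python
import pandas as pd

q = pd.Period('2000Q1', freq='Q-DEC')

# The first and the last month contained in the quarter.
first_month = q.asfreq('M', how='start')
last_month = q.asfreq('M', how='end')

# The same conversion applied to a whole series with a PeriodIndex.
rng = pd.period_range('2000Q1', '2000Q4', freq='Q-DEC')
ts = pd.Series([1, 2, 3, 4], index=rng)
monthly = ts.asfreq('M', how='end')

print(first_month, last_month)
print(monthly)
```

The values are untouched; only the index labels change frequency.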

Converting Between Timestamp and Period

In [21]: rng = pd.date_range('2000-01-01', '2000-06-30', freq='M')

In [22]: rng
Out[22]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30'],
              dtype='datetime64[ns]', freq='M')

In [23]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')

In [24]: rng
Out[24]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')

to_period() and to_timestamp()

In [25]: rng = pd.date_range('2000-01-01', periods=3, freq='M')
    ...: ts = pd.Series(np.random.randn(3), index=rng)
    ...: ts
    ...:
Out[25]:
2000-01-31    0.455968
2000-02-29    1.720553
2000-03-31    1.695834
Freq: M, dtype: float64

In [26]: pts = ts.to_period()
    ...: pts
    ...:
Out[26]:
2000-01    0.455968
2000-02    1.720553
2000-03    1.695834
Freq: M, dtype: float64
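The reverse direction uses to_timestamp(). A sketch of the round trip, assuming month-end data; the MonthEnd offset object is used instead of a frequency string only so that the sketch does not depend on which month-end alias a given pandas version accepts.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

rng = pd.date_range('2000-01-31', periods=3, freq=MonthEnd())
ts = pd.Series([1.0, 2.0, 3.0], index=rng)

# Timestamps -> monthly periods.
pts = ts.to_period('M')

# Periods -> timestamps, anchored at the start of each period.
back = pts.to_timestamp(how='start')

print(pts)
print(back)
```

Note that the round trip does not restore the original month-end stamps: with how='start' each period maps to the first day of its month.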

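Resampling and Frequency Conversion

Resampling is the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling, while converting lower-frequency data to a higher frequency is called upsampling (usually accompanied by interpolation).

resample(): the workhorse function for frequency conversion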

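Parameter          Description
freq               String or DateOffset giving the target frequency, e.g. 'M', '5min', Second(15)
how='mean'         Name of a function, or an array function, that produces the aggregated values. Defaults to 'mean'. Deprecated (FutureWarning: how in .resample() is deprecated, the new syntax is .resample(...).mean())
axis=0             Axis to resample along
fill_method=None   How to interpolate when upsampling, e.g. 'ffill' or 'bfill'. No interpolation by default
closed='right'     Which end of each interval is closed when downsampling
label='right'      How to label the aggregated result when downsampling
loffset=None       Time adjustment applied to the bin labels, e.g. '-1s' or Second(-1) to move the labels one second earlier
limit=None         Maximum number of periods to fill when forward- or backward-filling
kind=None          Aggregate to periods (Period) or timestamps (Timestamp); defaults to the index type of the time series
convention=None    Convention for converting low frequency to high frequency when resampling periods; listed here as defaulting to 'end'

Note: this table follows the older pandas API. In current pandas, closed and label default to 'left' for most frequencies (month-end, quarter-end, year-end and weekly frequencies default to 'right'), convention defaults to 'start', and how, fill_method, limit and loffset no longer exist in pandas 2.x.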
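In the method-chaining syntax that replaced how, fill_method and limit, the aggregation or fill is a method called on the result of resample(). A minimal sketch on made-up daily data:

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=6, freq='D')
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=rng)

# Old style: ts.resample('2D', how='sum')
two_day_sum = ts.resample('2D').sum()

# Several aggregations at once, one column per function.
stats = ts.resample('3D').agg(['mean', 'max'])

# Old style: resample('D', fill_method='ffill', limit=1)
coarse = ts.resample('3D').sum()
filled = coarse.resample('D').ffill(limit=1)

print(two_day_sum)
print(stats)
print(filled)
```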

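Downsampling

Using resample

As the examples below show, two things need to be decided when downsampling with resample:

  • Which side of each interval is closed.
  • How to label each aggregated bin: with the start of the interval or with its end.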
In [27]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
    ...: ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ...: ts
    ...:
Out[27]:
2000-01-01   -0.189731
                ...
2000-04-09    0.283110
Freq: D, dtype: float64

In [28]: ts.resample('M').mean()
Out[28]:
2000-01-31   -0.019276
2000-02-29   -0.041192
2000-03-31   -0.214551
2000-04-30    0.411190
Freq: M, dtype: float64

In [29]: ts.resample('M', kind='period').mean()
Out[29]:
2000-01   -0.019276
2000-02   -0.041192
2000-03   -0.214551
2000-04    0.411190
Freq: M, dtype: float64
In [31]: rng = pd.date_range('2000-01-01', periods=12, freq='T')
    ...: ts = pd.Series(np.arange(12), index=rng)
    ...: ts
    ...:
Out[31]:
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

In [32]: ts.resample('5min', closed='right', label='right').sum()
Out[32]:
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

In [33]: ts.resample('5min', closed='right',
    ...:             label='right', loffset='-1s').sum()
Out[33]:
1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int32
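The effect of closed and label is easiest to see by running both settings side by side on the same twelve-minute series. The sketch below also shows one way to reproduce the loffset='-1s' adjustment in pandas versions where loffset is no longer available, by subtracting a Timedelta from the result's index.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Bins [00:00, 00:05), [00:05, 00:10), ... labeled by their left edge.
left = ts.resample('5min', closed='left', label='left').sum()

# Bins (23:55, 00:00], (00:00, 00:05], ... labeled by their right edge.
right = ts.resample('5min', closed='right', label='right').sum()

# Move the labels one second earlier, as loffset='-1s' used to do.
right_adj = right.copy()
right_adj.index = right_adj.index - pd.Timedelta(seconds=1)

print(left)
print(right)
print(right_adj)
```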
Downsampling with groupby

To group by month or by day of the week, pass a function that accesses those fields on the time series index.

In [35]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
    ...: ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ...: ts

In [36]: ts.groupby(lambda x : x.month).mean()
Out[36]:
1   -0.126008
2    0.079132
3    0.026093
4    0.321457
dtype: float64

In [37]: ts.groupby(lambda x: x.dayofweek).mean()
Out[37]:
0    0.280289
1    0.174452
2    0.166102
3   -0.779489
4   -0.036195
5    0.086394
6    0.234831
dtype: float64
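One caveat with grouping on x.month: once the data spans more than one year, the same month of different years lands in the same group. A sketch that contrasts this with resample and with pd.Grouper, using a made-up 400-day series (the MonthEnd offset object stands in for a month-end frequency string):

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

rng = pd.date_range('2000-01-01', periods=400, freq='D')
ts = pd.Series(np.arange(400, dtype=float), index=rng)

# Groups by month number: January 2000 and January 2001 are merged.
by_month_number = ts.groupby(lambda x: x.month).mean()

# One row per calendar month.
by_calendar_month = ts.resample(MonthEnd()).mean()

# The same calendar-month grouping expressed through groupby.
by_grouper = ts.groupby(pd.Grouper(freq=MonthEnd())).mean()

print(len(by_month_number), len(by_calendar_month), len(by_grouper))
```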

Upsampling

In [38]: import pandas as pd
    ...: import numpy as np
    ...: frame = pd.DataFrame(np.random.randn(2, 4),
    ...:                      index=pd.date_range('1/1/2000', periods=2,
    ...:                                          freq='W-WED'),
    ...:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])
    ...: frame
    ...:
Out[38]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-12  1.075744  0.237922 -0.907699  0.592211

In [39]: df_daily = frame.resample('D').asfreq()
    ...: df_daily
    ...:
Out[39]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  1.075744  0.237922 -0.907699  0.592211

In [40]: frame.resample('D').ffill()
Out[40]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-06 -0.925525 -0.434350  1.037349 -1.532790
2000-01-07 -0.925525 -0.434350  1.037349 -1.532790
2000-01-08 -0.925525 -0.434350  1.037349 -1.532790
2000-01-09 -0.925525 -0.434350  1.037349 -1.532790
2000-01-10 -0.925525 -0.434350  1.037349 -1.532790
2000-01-11 -0.925525 -0.434350  1.037349 -1.532790
2000-01-12  1.075744  0.237922 -0.907699  0.592211

# Previously: frame.resample('D', how='mean')

In [41]: df_daily = frame.resample('D').mean()
    ...: df_daily
    ...:
Out[41]:
            Colorado     Texas  New York      Ohio
2000-01-05 -0.925525 -0.434350  1.037349 -1.532790
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  1.075744  0.237922 -0.907699  0.592211
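Besides asfreq() and an unrestricted ffill(), the gaps created by upsampling can be filled only part of the way, or interpolated. A minimal sketch with a single made-up column:

```python
import pandas as pd

idx = pd.date_range('2000-01-05', periods=2, freq='W-WED')
frame = pd.DataFrame({'Colorado': [1.0, 8.0]}, index=idx)

# Forward-fill at most two days after each observation; the rest stays NaN.
limited = frame.resample('D').ffill(limit=2)

# Insert the missing days as NaN, then interpolate linearly between observations.
interpolated = frame.resample('D').asfreq().interpolate()

print(limited)
print(interpolated)
```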

Resampling with periods

In [42]: frame = pd.DataFrame(np.random.randn(24, 4),
    ...:                      index=pd.period_range('1-2000', '12-2001',
    ...:                                            freq='M'),
    ...:                      columns=['Colorado', 'Texas', 'New York', 'Ohio'])
    ...: frame[:5]
    ...: annual_frame = frame.resample('A-DEC').mean()
    ...: annual_frame
    ...:
Out[42]:
      Colorado     Texas  New York      Ohio
2000  0.442672  0.104870 -0.067043 -0.128942
2001 -0.263757 -0.399865 -0.423485  0.026256

In [43]: annual_frame.resample('Q-DEC', convention='end').ffill()
Out[43]:
        Colorado     Texas  New York      Ohio
2000Q4  0.442672  0.104870 -0.067043 -0.128942
2001Q1  0.442672  0.104870 -0.067043 -0.128942
2001Q2  0.442672  0.104870 -0.067043 -0.128942
2001Q3  0.442672  0.104870 -0.067043 -0.128942
2001Q4 -0.263757 -0.399865 -0.423485  0.026256
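Some recent pandas releases emit a deprecation warning when resample is called on a PeriodIndex. As a hedged alternative for the downsampling case, and not the approach used above, the index can be converted to the coarser frequency with asfreq and used as a groupby key. The monthly data below is made up.

```python
import numpy as np
import pandas as pd

idx = pd.period_range('2000-01', '2000-12', freq='M')
frame = pd.DataFrame({'Colorado': np.arange(12, dtype=float)}, index=idx)

# Map every month to the quarter that contains it.
quarters = frame.index.asfreq('Q-DEC')

# Aggregate the months of each quarter.
quarterly = frame.groupby(quarters).mean()

print(quarterly)
```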