Time Series Data Processing Methods

  • All code in this article comes from: Electricity_Demand_and_Price_forecasting/Electricity_Demand_and_Price_forecasting.ipynb at main · ritikdhame/Electricity_Demand_and_Price_forecasting · GitHub

  • If the visualization images fail to load, they are all available in the GitHub repository

# Basic data handling
import pandas as pd
import numpy as np

# Preprocessing tools
from sklearn.preprocessing import StandardScaler

# Date and time handling
from datetime import datetime, timedelta

# Visualization (optional, for exploratory analysis)
import matplotlib.pyplot as plt
import seaborn as sns

Processing the energy generation data

Read the data and display its basic information

# Read the data
df_energy = pd.read_csv('E:\\python_project\\informer\\Informer2020-main\\data\\energy_dataset.csv')

# Preview the data
df_energy.head()
  | time | generation biomass | generation fossil brown coal/lignite | generation fossil coal-derived gas | generation fossil gas | generation fossil hard coal | generation fossil oil | generation fossil oil shale | generation fossil peat | generation geothermal | ... | generation waste | generation wind offshore | generation wind onshore | forecast solar day ahead | forecast wind offshore eday ahead | forecast wind onshore day ahead | total load forecast | total load actual | price day ahead | price actual
0 | 2015-01-01 00:00:00+01:00 | 447.0 | 329.0 | 0.0 | 4844.0 | 4821.0 | 162.0 | 0.0 | 0.0 | 0.0 | ... | 196.0 | 0.0 | 6378.0 | 17.0 | NaN | 6436.0 | 26118.0 | 25385.0 | 50.10 | 65.41
1 | 2015-01-01 01:00:00+01:00 | 449.0 | 328.0 | 0.0 | 5196.0 | 4755.0 | 158.0 | 0.0 | 0.0 | 0.0 | ... | 195.0 | 0.0 | 5890.0 | 16.0 | NaN | 5856.0 | 24934.0 | 24382.0 | 48.10 | 64.92
2 | 2015-01-01 02:00:00+01:00 | 448.0 | 323.0 | 0.0 | 4857.0 | 4581.0 | 157.0 | 0.0 | 0.0 | 0.0 | ... | 196.0 | 0.0 | 5461.0 | 8.0 | NaN | 5454.0 | 23515.0 | 22734.0 | 47.33 | 64.48
3 | 2015-01-01 03:00:00+01:00 | 438.0 | 254.0 | 0.0 | 4314.0 | 4131.0 | 160.0 | 0.0 | 0.0 | 0.0 | ... | 191.0 | 0.0 | 5238.0 | 2.0 | NaN | 5151.0 | 22642.0 | 21286.0 | 42.27 | 59.32
4 | 2015-01-01 04:00:00+01:00 | 428.0 | 187.0 | 0.0 | 4130.0 | 3840.0 | 156.0 | 0.0 | 0.0 | 0.0 | ... | 189.0 | 0.0 | 4935.0 | 9.0 | NaN | 4861.0 | 21785.0 | 20264.0 | 38.41 | 56.04

5 rows × 29 columns

# Show summary statistics for every feature (column)
df_energy.describe().T
feature | count | mean | std | min | 25% | 50% | 75% | max
generation biomass | 35045.0 | 383.513540 | 85.353943 | 0.00 | 333.0000 | 367.00 | 433.00 | 592.00
generation fossil brown coal/lignite | 35046.0 | 448.059208 | 354.568590 | 0.00 | 0.0000 | 509.00 | 757.00 | 999.00
generation fossil coal-derived gas | 35046.0 | 0.000000 | 0.000000 | 0.00 | 0.0000 | 0.00 | 0.00 | 0.00
generation fossil gas | 35046.0 | 5622.737488 | 2201.830478 | 0.00 | 4126.0000 | 4969.00 | 6429.00 | 20034.00
generation fossil hard coal | 35046.0 | 4256.065742 | 1961.601013 | 0.00 | 2527.0000 | 4474.00 | 5838.75 | 8359.00
generation fossil oil | 35045.0 | 298.319789 | 52.520673 | 0.00 | 263.0000 | 300.00 | 330.00 | 449.00
generation fossil oil shale | 35046.0 | 0.000000 | 0.000000 | 0.00 | 0.0000 | 0.00 | 0.00 | 0.00
generation fossil peat | 35046.0 | 0.000000 | 0.000000 | 0.00 | 0.0000 | 0.00 | 0.00 | 0.00
generation geothermal | 35046.0 | 0.000000 | 0.000000 | 0.00 | 0.0000 | 0.00 | 0.00 | 0.00
generation hydro pumped storage aggregated | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN
generation hydro pumped storage consumption | 35045.0 | 475.577343 | 792.406614 | 0.00 | 0.0000 | 68.00 | 616.00 | 4523.00
generation hydro run-of-river and poundage | 35045.0 | 972.116108 | 400.777536 | 0.00 | 637.0000 | 906.00 | 1250.00 | 2000.00
generation hydro water reservoir | 35046.0 | 2605.114735 | 1835.199745 | 0.00 | 1077.2500 | 2164.00 | 3757.00 | 9728.00
generation marine | 35045.0 | 0.000000 | 0.000000 | 0.00 | 0.0000 | 0.00 | 0.00 | 0.00
generation nuclear | 35047.0 | 6263.907039 | 839.667958 | 0.00 | 5760.0000 | 6566.00 | 7025.00 | 7117.00
generation other | 35046.0 | 60.228585 | 20.238381 | 0.00 | 53.0000 | 57.00 | 80.00 | 106.00
generation other renewable | 35046.0 | 85.639702 | 14.077554 | 0.00 | 73.0000 | 88.00 | 97.00 | 119.00
generation solar | 35046.0 | 1432.665925 | 1680.119887 | 0.00 | 71.0000 | 616.00 | 2578.00 | 5792.00
generation waste | 35045.0 | 269.452133 | 50.195536 | 0.00 | 240.0000 | 279.00 | 310.00 | 357.00
generation wind offshore | 35046.0 | 0.000000 | 0.000000 | 0.00 | 0.0000 | 0.00 | 0.00 | 0.00
generation wind onshore | 35046.0 | 5464.479769 | 3213.691587 | 0.00 | 2933.0000 | 4849.00 | 7398.00 | 17436.00
forecast solar day ahead | 35064.0 | 1439.066735 | 1677.703355 | 0.00 | 69.0000 | 576.00 | 2636.00 | 5836.00
forecast wind offshore eday ahead | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN
forecast wind onshore day ahead | 35064.0 | 5471.216689 | 3176.312853 | 237.00 | 2979.0000 | 4855.00 | 7353.00 | 17430.00
total load forecast | 35064.0 | 28712.129962 | 4594.100854 | 18105.00 | 24793.7500 | 28906.00 | 32263.25 | 41390.00
total load actual | 35028.0 | 28696.939905 | 4574.987950 | 18041.00 | 24807.7500 | 28901.00 | 32192.00 | 41015.00
price day ahead | 35064.0 | 49.874341 | 14.618900 | 2.06 | 41.4900 | 50.52 | 60.53 | 101.99
price actual | 35064.0 | 57.884023 | 14.204083 | 9.33 | 49.3475 | 58.02 | 68.01 | 116.80
# Show basic information: dtypes, non-null counts and memory usage
df_energy.info()
# Count the missing values in each column
df_energy.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35064 entries, 0 to 35063
Data columns (total 29 columns):
time                                           35064 non-null object
generation biomass                             35045 non-null float64
generation fossil brown coal/lignite           35046 non-null float64
generation fossil coal-derived gas             35046 non-null float64
generation fossil gas                          35046 non-null float64
generation fossil hard coal                    35046 non-null float64
generation fossil oil                          35045 non-null float64
generation fossil oil shale                    35046 non-null float64
generation fossil peat                         35046 non-null float64
generation geothermal                          35046 non-null float64
generation hydro pumped storage aggregated     0 non-null float64
generation hydro pumped storage consumption    35045 non-null float64
generation hydro run-of-river and poundage     35045 non-null float64
generation hydro water reservoir               35046 non-null float64
generation marine                              35045 non-null float64
generation nuclear                             35047 non-null float64
generation other                               35046 non-null float64
generation other renewable                     35046 non-null float64
generation solar                               35046 non-null float64
generation waste                               35045 non-null float64
generation wind offshore                       35046 non-null float64
generation wind onshore                        35046 non-null float64
forecast solar day ahead                       35064 non-null float64
forecast wind offshore eday ahead              0 non-null float64
forecast wind onshore day ahead                35064 non-null float64
total load forecast                            35064 non-null float64
total load actual                              35028 non-null float64
price day ahead                                35064 non-null float64
price actual                                   35064 non-null float64
dtypes: float64(28), object(1)
memory usage: 7.8+ MB





time                                               0
generation biomass                                19
generation fossil brown coal/lignite              18
generation fossil coal-derived gas                18
generation fossil gas                             18
generation fossil hard coal                       18
generation fossil oil                             19
generation fossil oil shale                       18
generation fossil peat                            18
generation geothermal                             18
generation hydro pumped storage aggregated     35064
generation hydro pumped storage consumption       19
generation hydro run-of-river and poundage        19
generation hydro water reservoir                  18
generation marine                                 19
generation nuclear                                17
generation other                                  18
generation other renewable                        18
generation solar                                  18
generation waste                                  19
generation wind offshore                          18
generation wind onshore                           18
forecast solar day ahead                           0
forecast wind offshore eday ahead              35064
forecast wind onshore day ahead                    0
total load forecast                                0
total load actual                                 36
price day ahead                                    0
price actual                                       0
dtype: int64

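Processing the energy generation data

Drop the features (columns) that carry no usable information (all 0 or all NULL), together with the day-ahead generation forecast columns

# Names of the columns to drop
col_names = [
    # Fossil-fuel generation
    'generation fossil coal-derived gas',  # coal-derived gas
    'generation fossil oil shale',         # oil shale
    'generation fossil peat',              # peat
    
    # Renewable generation
    'generation geothermal',               # geothermal
    'generation hydro pumped storage aggregated',  # pumped-storage hydro
    'generation marine',                   # marine energy
    'generation wind offshore',            # offshore wind
    
    # Forecast data
    'forecast wind offshore eday ahead',   # day-ahead offshore wind forecast
    'forecast solar day ahead',            # day-ahead solar forecast
    'forecast wind onshore day ahead'      # day-ahead onshore wind forecast
]

# Drop these columns; columns= already selects the column axis, so no axis argument is needed. inplace=True modifies the DataFrame directly
df_energy.drop(columns=col_names,inplace=True)
# Print the null count of each column and the number of duplicate rows
def check_Nas_Dups(df_input):
    print("Null count per column:")
    print(df_input.isnull().sum())
    print("-"*50)
    print("Number of duplicate rows:")
    print(df_input.duplicated().sum())

check_Nas_Dups(df_energy)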

Null count per column:
time                                            0
generation biomass                             19
generation fossil brown coal/lignite           18
generation fossil gas                          18
generation fossil hard coal                    18
generation fossil oil                          19
generation hydro pumped storage consumption    19
generation hydro run-of-river and poundage     19
generation hydro water reservoir               18
generation nuclear                             17
generation other                               18
generation other renewable                     18
generation solar                               18
generation waste                               19
generation wind onshore                        18
total load forecast                             0
total load actual                              36
price day ahead                                 0
price actual                                    0
dtype: int64
--------------------------------------------------
Number of duplicate rows:
0
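The columns dropped above were picked by reading the describe() and isnull() output by eye. A minimal sketch of doing the same check programmatically is shown below. The helper name find_uninformative_columns and the toy data are my own assumptions, not part of the original notebook, and the forecast columns would still have to be listed by hand because they contain valid values.

```python
import numpy as np
import pandas as pd


def find_uninformative_columns(df):
    """Return numeric columns whose values are all NaN or all zero (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    # Treat NaN as 0, then keep the columns in which every value equals 0
    empty = numeric.fillna(0).eq(0).all()
    return sorted(empty[empty].index)


# Toy data (invented for illustration)
toy = pd.DataFrame({
    "generation solar": [49.0, 50.0, 50.0],
    "generation marine": [0.0, 0.0, np.nan],       # only zeros and NaN
    "forecast offshore": [np.nan, np.nan, np.nan],  # only NaN
    "time": ["t0", "t1", "t2"],                     # non-numeric, ignored
})
useless = find_uninformative_columns(toy)
print(useless)
```

The returned list can then be passed to df.drop(columns=...).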
# Convert the time column to datetime
df_energy["time"]=pd.to_datetime(df_energy["time"])
# Use the time column as the index (the index was the row number before)
df_energy=df_energy.set_index("time")
df_energy
time | generation biomass | generation fossil brown coal/lignite | generation fossil gas | generation fossil hard coal | generation fossil oil | generation hydro pumped storage consumption | generation hydro run-of-river and poundage | generation hydro water reservoir | generation nuclear | generation other | generation other renewable | generation solar | generation waste | generation wind onshore | total load forecast | total load actual | price day ahead | price actual
2015-01-01 00:00:00+01:00 | 447.0 | 329.0 | 4844.0 | 4821.0 | 162.0 | 863.0 | 1051.0 | 1899.0 | 7096.0 | 43.0 | 73.0 | 49.0 | 196.0 | 6378.0 | 26118.0 | 25385.0 | 50.10 | 65.41
2015-01-01 01:00:00+01:00 | 449.0 | 328.0 | 5196.0 | 4755.0 | 158.0 | 920.0 | 1009.0 | 1658.0 | 7096.0 | 43.0 | 71.0 | 50.0 | 195.0 | 5890.0 | 24934.0 | 24382.0 | 48.10 | 64.92
2015-01-01 02:00:00+01:00 | 448.0 | 323.0 | 4857.0 | 4581.0 | 157.0 | 1164.0 | 973.0 | 1371.0 | 7099.0 | 43.0 | 73.0 | 50.0 | 196.0 | 5461.0 | 23515.0 | 22734.0 | 47.33 | 64.48
2015-01-01 03:00:00+01:00 | 438.0 | 254.0 | 4314.0 | 4131.0 | 160.0 | 1503.0 | 949.0 | 779.0 | 7098.0 | 43.0 | 75.0 | 50.0 | 191.0 | 5238.0 | 22642.0 | 21286.0 | 42.27 | 59.32
2015-01-01 04:00:00+01:00 | 428.0 | 187.0 | 4130.0 | 3840.0 | 156.0 | 1826.0 | 953.0 | 720.0 | 7097.0 | 43.0 | 74.0 | 42.0 | 189.0 | 4935.0 | 21785.0 | 20264.0 | 38.41 | 56.04
...
2018-12-31 19:00:00+01:00 | 297.0 | 0.0 | 7634.0 | 2628.0 | 178.0 | 1.0 | 1135.0 | 4836.0 | 6073.0 | 63.0 | 95.0 | 85.0 | 277.0 | 3113.0 | 30619.0 | 30653.0 | 68.85 | 77.02
2018-12-31 20:00:00+01:00 | 296.0 | 0.0 | 7241.0 | 2566.0 | 174.0 | 1.0 | 1172.0 | 3931.0 | 6074.0 | 62.0 | 95.0 | 33.0 | 280.0 | 3288.0 | 29932.0 | 29735.0 | 68.40 | 76.16
2018-12-31 21:00:00+01:00 | 292.0 | 0.0 | 7025.0 | 2422.0 | 168.0 | 50.0 | 1148.0 | 2831.0 | 6076.0 | 61.0 | 94.0 | 31.0 | 286.0 | 3503.0 | 27903.0 | 28071.0 | 66.88 | 74.30
2018-12-31 22:00:00+01:00 | 293.0 | 0.0 | 6562.0 | 2293.0 | 163.0 | 108.0 | 1128.0 | 2068.0 | 6075.0 | 61.0 | 93.0 | 31.0 | 287.0 | 3586.0 | 25450.0 | 25801.0 | 63.93 | 69.89
2018-12-31 23:00:00+01:00 | 290.0 | 0.0 | 6926.0 | 2166.0 | 163.0 | 108.0 | 1069.0 | 1686.0 | 6075.0 | 61.0 | 92.0 | 31.0 | 287.0 | 3651.0 | 24424.0 | 24455.0 | 64.27 | 69.88

35064 rows × 18 columns
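The time strings in this dataset carry two different UTC offsets (+01:00 and +02:00, as the rows with missing values further down show). To my knowledge, recent pandas releases warn about or reject such mixed offsets unless utc=True is passed to pd.to_datetime, so the call above may behave differently on a newer environment than the one used here. A small sketch of the utc=True variant on two sample strings taken from the data:

```python
import pandas as pd

# Two timestamps with different offsets, copied from the dataset
raw = pd.Series([
    "2015-01-01 00:00:00+01:00",
    "2015-04-05 03:00:00+02:00",
])

# utc=True converts every value to UTC, giving one timezone-aware datetime dtype
parsed = pd.to_datetime(raw, utc=True)
print(parsed)
```

Note that with this variant the index is displayed in UTC, so the printed times shift by one or two hours compared with the output shown in this article.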

# Plot two weeks of the total load actual column
plt.figure(figsize=(12,6))  # figure size
plt.plot(df_energy["total load actual"][:24*7*2])
plt.xlabel("Time")
plt.ylabel("Total Load Actual")
plt.title("Total Load Actual of Two Weeks")
plt.show()


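Fill the null values by linear interpolation

# Return every row that contains at least one null value; axis=1 evaluates the condition row by row
df_energy[df_energy.isna().any(axis=1)]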
time | generation biomass | generation fossil brown coal/lignite | generation fossil gas | generation fossil hard coal | generation fossil oil | generation hydro pumped storage consumption | generation hydro run-of-river and poundage | generation hydro water reservoir | generation nuclear | generation other | generation other renewable | generation solar | generation waste | generation wind onshore | total load forecast | total load actual | price day ahead | price actual
2015-01-05 03:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21912.0 | 21182.0 | 35.20 | 59.68
2015-01-05 12:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23209.0 | NaN | 35.50 | 79.14
2015-01-05 13:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23725.0 | NaN | 36.80 | 73.95
2015-01-05 14:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23614.0 | NaN | 32.50 | 71.93
2015-01-05 15:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 22381.0 | NaN | 30.00 | 71.50
2015-01-05 16:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21371.0 | NaN | 30.00 | 71.85
2015-01-05 17:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20760.0 | NaN | 30.60 | 80.53
2015-01-19 19:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38642.0 | 39304.0 | 70.01 | 88.95
2015-01-19 20:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38758.0 | 39262.0 | 69.00 | 87.94
2015-01-27 19:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38968.0 | 38335.0 | 66.00 | 83.97
2015-01-28 13:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 36239.0 | NaN | 65.00 | 77.62
2015-02-01 07:00:00+01:00 | 449.0 | 312.0 | 4765.0 | 5269.0 | 222.0 | 480.0 | 980.0 | 1174.0 | 7101.0 | 44.0 | 75.0 | 48.0 | 208.0 | 3289.0 | 24379.0 | NaN | 56.10 | 16.98
2015-02-01 08:00:00+01:00 | 453.0 | 312.0 | 4938.0 | 5652.0 | 288.0 | 0.0 | 1031.0 | 3229.0 | 7099.0 | 44.0 | 75.0 | 73.0 | 207.0 | 3102.0 | 27389.0 | NaN | 57.69 | 19.56
2015-02-01 09:00:00+01:00 | 452.0 | 302.0 | 4997.0 | 5770.0 | 296.0 | 0.0 | 1083.0 | 4574.0 | 7097.0 | 43.0 | 71.0 | 809.0 | 204.0 | 2838.0 | 30619.0 | NaN | 60.01 | 23.13
2015-02-01 12:00:00+01:00 | 405.0 | 317.0 | 5247.0 | 6008.0 | 333.0 | 0.0 | 1119.0 | 4416.0 | 7095.0 | 42.0 | 72.0 | 3817.0 | 200.0 | 1413.0 | 31357.0 | NaN | 59.97 | 22.51
2015-02-01 13:00:00+01:00 | 402.0 | 317.0 | 5449.0 | 6005.0 | 318.0 | 0.0 | 1171.0 | 4475.0 | 7096.0 | 41.0 | 73.0 | 3836.0 | 193.0 | 1347.0 | 31338.0 | NaN | 59.69 | 23.44
2015-02-01 14:00:00+01:00 | 400.0 | 317.0 | 5266.0 | 5995.0 | 327.0 | 0.0 | 1216.0 | 4412.0 | 7098.0 | 42.0 | 79.0 | 3701.0 | 199.0 | 1345.0 | 30874.0 | NaN | 58.69 | 24.10
2015-02-01 15:00:00+01:00 | 393.0 | 321.0 | 5209.0 | 5939.0 | 345.0 | 0.0 | 1204.0 | 3403.0 | 7097.0 | 41.0 | 79.0 | 3475.0 | 204.0 | 1487.0 | 30124.0 | NaN | 58.13 | 21.12
2015-02-01 16:00:00+01:00 | 413.0 | 325.0 | 5642.0 | 6000.0 | 345.0 | 0.0 | 1193.0 | 3333.0 | 7097.0 | 40.0 | 77.0 | 2742.0 | 203.0 | 1648.0 | 29714.0 | NaN | 59.00 | 21.73
2015-02-01 17:00:00+01:00 | 465.0 | 321.0 | 6127.0 | 5912.0 | 346.0 | 0.0 | 1214.0 | 4684.0 | 7096.0 | 41.0 | 77.0 | 1281.0 | 207.0 | 1857.0 | 29801.0 | NaN | 59.69 | 25.93
2015-02-01 18:00:00+01:00 | 482.0 | 326.0 | 7386.0 | 6002.0 | 340.0 | 0.0 | 1299.0 | 6187.0 | 7095.0 | 41.0 | 79.0 | 328.0 | 208.0 | 1864.0 | 32257.0 | NaN | 63.76 | 54.13
2015-02-01 19:00:00+01:00 | 474.0 | 326.0 | 7963.0 | 6026.0 | 343.0 | 0.0 | 1313.0 | 6895.0 | 7096.0 | 42.0 | 82.0 | 161.0 | 207.0 | 1813.0 | 33183.0 | NaN | 65.01 | 68.53
2015-04-05 03:00:00+02:00 | 371.0 | 0.0 | 5015.0 | 3248.0 | 257.0 | 799.0 | 1233.0 | 2531.0 | 4027.0 | 81.0 | 67.0 | 31.0 | 142.0 | 3153.0 | 20016.0 | NaN | 42.55 | 29.04
2015-04-16 09:00:00+02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 31001.0 | NaN | 56.54 | 67.55
2015-04-20 08:00:00+02:00 | 424.0 | 642.0 | 5614.0 | 5784.0 | 369.0 | 0.0 | 1122.0 | 4050.0 | 6954.0 | 41.0 | 62.0 | 636.0 | 147.0 | 797.0 | 29287.0 | NaN | 62.00 | 72.92
2015-04-23 21:00:00+02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 31421.0 | NaN | 69.49 | 82.57
2015-05-02 10:00:00+02:00 | 497.0 | 0.0 | 5502.0 | 5677.0 | 375.0 | 0.0 | 1425.0 | 5289.0 | 6353.0 | 93.0 | 72.0 | 2535.0 | 205.0 | 10903.0 | 39644.0 | NaN | 58.49 | 59.09
2015-05-29 03:00:00+02:00 | 569.0 | 756.0 | 4239.0 | 4635.0 | 365.0 | 755.0 | 667.0 | 1277.0 | 5035.0 | 85.0 | 69.0 | 662.0 | 201.0 | 6503.0 | 23132.0 | NaN | 45.93 | 55.07
2015-06-15 09:00:00+02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29899.0 | 30047.0 | 62.48 | 73.82
2015-10-02 08:00:00+02:00 | 483.0 | 961.0 | 6545.0 | 8250.0 | 385.0 | 0.0 | 1323.0 | 5378.0 | 7013.0 | 87.0 | 70.0 | 140.0 | 205.0 | 4362.0 | 36798.0 | NaN | 66.19 | 70.13
2015-10-02 11:00:00+02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38921.0 | NaN | 70.09 | 70.49
2015-12-02 09:00:00+01:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 37413.0 | NaN | 75.71 | 80.44
2016-04-13 05:00:00+02:00 | 220.0 | 0.0 | 3390.0 | 1242.0 | 243.0 | 2270.0 | 1622.0 | 4515.0 | 7097.0 | 53.0 | 69.0 | 150.0 | NaN | 8596.0 | 23514.0 | 23614.0 | 18.69 | 25.14
2016-04-25 05:00:00+02:00 | 190.0 | 0.0 | 2969.0 | 886.0 | 151.0 | 1340.0 | 1564.0 | 5389.0 | 7094.0 | 50.0 | 59.0 | 454.0 | 195.0 | 5989.0 | 21471.0 | NaN | 15.00 | 22.65
2016-04-25 07:00:00+02:00 | 206.0 | 0.0 | 3673.0 | 1143.0 | 185.0 | 162.0 | 1648.0 | 6807.0 | 7095.0 | 51.0 | 62.0 | 283.0 | 214.0 | 5682.0 | 27635.0 | NaN | 32.97 | 40.18
2016-05-10 23:00:00+02:00 | 348.0 | 960.0 | 6800.0 | 5219.0 | 299.0 | 0.0 | 443.0 | 1750.0 | 7002.0 | 50.0 | 91.0 | 58.0 | 280.0 | 3311.0 | 26641.0 | NaN | 51.57 | 39.11
2016-06-12 01:00:00+02:00 | 356.0 | 595.0 | 5719.0 | 6165.0 | 274.0 | 382.0 | NaN | 1325.0 | 5056.0 | 56.0 | 86.0 | 30.0 | 291.0 | 2019.0 | 24715.0 | 24155.0 | 60.23 | 48.72
2016-07-09 22:00:00+02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6923.0 | NaN | NaN | NaN | NaN | NaN | 34985.0 | NaN | 45.72 | 51.72
2016-07-12 00:00:00+02:00 | 346.0 | 595.0 | 5951.0 | 6131.0 | NaN | 494.0 | 709.0 | 1215.0 | 5058.0 | 49.0 | 83.0 | 31.0 | 309.0 | 2031.0 | 25313.0 | 25103.0 | 64.99 | 47.49
2016-09-28 09:00:00+02:00 | 347.0 | 594.0 | 5522.0 | 6272.0 | 292.0 | 0.0 | 524.0 | 2494.0 | 6997.0 | 61.0 | 86.0 | 982.0 | 300.0 | 5478.0 | 31072.0 | NaN | 49.72 | 56.40
2016-10-27 23:00:00+02:00 | 351.0 | 554.0 | 7176.0 | 5690.0 | 321.0 | NaN | 417.0 | 1295.0 | 6967.0 | 58.0 | 91.0 | 70.0 | 299.0 | 3193.0 | 26423.0 | 26583.0 | 55.70 | 62.84
2016-11-23 04:00:00+01:00 | NaN | 900.0 | 4838.0 | 4547.0 | 269.0 | 1413.0 | 795.0 | 435.0 | 5040.0 | 60.0 | 85.0 | 15.0 | 227.0 | 4598.0 | 23469.0 | 23112.0 | 43.19 | 49.11
2017-11-14 12:00:00+01:00 | 0.0 | 0.0 | 10064.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 33805.0 | NaN | 60.53 | 66.17
2017-11-14 19:00:00+01:00 | 0.0 | 0.0 | 12336.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 35592.0 | NaN | 68.05 | 75.45
2018-06-11 18:00:00+02:00 | 331.0 | 506.0 | 7538.0 | 5360.0 | 300.0 | 1.0 | 1134.0 | 4258.0 | 5856.0 | 52.0 | 96.0 | 170.0 | 269.0 | 9165.0 | 34752.0 | NaN | 69.87 | 64.93
2018-07-11 09:00:00+02:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 33938.0 | NaN | 63.01 | 69.79
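# Fill the nulls by linear interpolation: method="linear" interpolates linearly, limit_direction="forward" fills in the forward direction, inplace=True modifies the DataFrame directly
df_energy.interpolate(method="linear",limit_direction="forward",inplace=True)
# Print the per-column null counts after filling
check_Nas_Dups(df_energy)
Null count per column: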
generation biomass                             0
generation fossil brown coal/lignite           0
generation fossil gas                          0
generation fossil hard coal                    0
generation fossil oil                          0
generation hydro pumped storage consumption    0
generation hydro run-of-river and poundage     0
generation hydro water reservoir               0
generation nuclear                             0
generation other                               0
generation other renewable                     0
generation solar                               0
generation waste                               0
generation wind onshore                        0
total load forecast                            0
total load actual                              0
price day ahead                                0
price actual                                   0
dtype: int64
--------------------------------------------------
Number of duplicate rows:
0
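To make the effect of the linear fill concrete, here is a small worked example on an invented series (the numbers are not from the dataset). Gaps between two valid values are filled on a straight line; with limit_direction="forward" a trailing gap repeats the last valid value, and a leading gap is left untouched.

```python
import numpy as np
import pandas as pd

# Invented values: one leading gap, one two-step gap, one trailing gap
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 16.0, np.nan])

filled = s.interpolate(method="linear", limit_direction="forward")
print(filled.tolist())
```

The two interior gaps become 12.0 and 14.0, evenly spaced between 10.0 and 16.0.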

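Generate a heatmap showing the correlation between the columns

# Heatmap function
def feat_corr(input_df):
    corr=input_df.corr()  # correlation matrix (Pearson coefficient)
    plt.figure(figsize=(15,12))  # figure size

    # draw the heatmap
    g=sns.heatmap(
        corr,
        annot=True,  # print the coefficient in each cell
        fmt=".2f",  # two decimal places
        cmap="RdYlBu",  # colour map
        center=0,  # centre value of the colour scale
        vmin=-1,vmax=1,  # range of the correlation coefficient
    )
    plt.show()

feat_corr(df_energy)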


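Merge and drop features based on the correlation matrix
  • generation fossil hard coal and generation fossil brown coal/lignite have a correlation coefficient of 0.77 and both are coal-fired generation, so they can be merged into a single feature and the original features dropped
  • When the correlation coefficient of two features is close to 1, the information they carry overlaps heavily, so one of them can be dropped to reduce computation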
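Reading such pairs off a large heatmap by eye is error-prone. A minimal sketch of listing them instead is given below; the helper high_corr_pairs, the 0.75 threshold and the toy columns are assumptions of mine for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.75):
    """List feature pairs whose absolute Pearson correlation reaches the threshold (hypothetical helper)."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold]


# Toy data (invented): the first two columns move together, the third does not
toy = pd.DataFrame({
    "hard_coal": [1.0, 2.0, 3.0, 4.0],
    "lignite": [2.0, 4.0, 6.0, 8.1],
    "solar": [4.0, 1.0, 3.0, 2.0],
})
strong = high_corr_pairs(toy)
print(strong)
```

Whether a flagged pair should be merged, as with the two coal features, or reduced to one feature is still a judgment call that depends on what the features mean.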
# Create a new feature for total coal generation (hard coal + brown coal/lignite) by summing the two features above
df_energy["generation fossil total"]=df_energy["generation fossil hard coal"]+df_energy["generation fossil brown coal/lignite"]
# Drop the original features
df_energy.drop(["generation fossil hard coal",
                "generation fossil brown coal/lignite"],
                axis=1,  # axis=1 drops columns
                inplace=True  # modify the DataFrame directly
                )

# Redraw the heatmap
feat_corr(df_energy)


Processing the weather data

# Read the weather data
df_weather=pd.read_csv("E:\\python_project\\informer\\Informer2020-main\\data\\weather_features.csv")
# Preview the data
df_weather.head()
  | dt_iso | city_name | temp | temp_min | temp_max | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all | weather_id | weather_main | weather_description | weather_icon
0 | 2015-01-01 00:00:00+01:00 | Valencia | 270.475 | 270.475 | 270.475 | 1001 | 77 | 1 | 62 | 0.0 | 0.0 | 0.0 | 0 | 800 | clear | sky is clear | 01n
1 | 2015-01-01 01:00:00+01:00 | Valencia | 270.475 | 270.475 | 270.475 | 1001 | 77 | 1 | 62 | 0.0 | 0.0 | 0.0 | 0 | 800 | clear | sky is clear | 01n
2 | 2015-01-01 02:00:00+01:00 | Valencia | 269.686 | 269.686 | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0 | 800 | clear | sky is clear | 01n
3 | 2015-01-01 03:00:00+01:00 | Valencia | 269.686 | 269.686 | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0 | 800 | clear | sky is clear | 01n
4 | 2015-01-01 04:00:00+01:00 | Valencia | 269.686 | 269.686 | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0 | 800 | clear | sky is clear | 01n
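  • Drop some data that is not useful
    • weather_icon: weather icon, for display only
    • weather_description: textual weather description
    • weather_main: main weather category
from sklearn.preprocessing import LabelEncoder

df_temp=df_weather.copy(deep=True)
labels=["weather_id","weather_main","weather_description","weather_icon"]
for col in labels:
    df_temp[col]=LabelEncoder().fit_transform(df_weather[col])  # encode the categorical columns (strings or numeric codes) as integers with LabelEncoder
Compare the correlations between the weather features and drop redundant information
feat_corr(df_temp)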


# List the distinct values of weather_id (these are the original codes in df_weather; only the df_temp copy was label-encoded)
df_weather["weather_id"].unique()
array([800, 801, 802, 803, 804, 500, 501, 502, 701, 522, 521, 503, 202,
       200, 201, 211, 520, 300, 741, 301, 711, 302, 721, 310, 600, 616,
       615, 601, 210, 602, 611, 311, 612, 620, 531, 731, 761, 771],
      dtype=int64)
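# Drop some features: weather_id, weather_main, weather_description and weather_icon all describe the weather, and that information is already reflected in the other columns
# temp_min and temp_max are dropped because they are redundant with the average temperature temp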
col_drop_name=[
    "weather_id",
    "weather_main",
    "weather_description",
    "weather_icon",
    "temp_min",
    "temp_max"
]

df_weather.drop(col_drop_name,axis=1,inplace=True)
# Check every feature for invalid values (NULL) and check for duplicate rows
check_Nas_Dups(df_weather)
Null count per column:
dt_iso        0
city_name     0
temp          0
pressure      0
humidity      0
wind_speed    0
wind_deg      0
rain_1h       0
rain_3h       0
snow_3h       0
clouds_all    0
dtype: int64
--------------------------------------------------
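Number of duplicate rows:
3076
# Intended to drop duplicate rows: reset_index() resets the index to 0, 1, 2, ... and drop_duplicates() keeps the first row of each set of duplicates
# Note: because reset_index() runs first and adds a unique "index" column, no two rows are identical any more, so nothing is removed here (the output below still has 178396 rows); the duplicates are actually removed later, by time and city_name
df_weather=df_weather.reset_index().drop_duplicates()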
df_weather
  | index | dt_iso | city_name | temp | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all
0 | 0 | 2015-01-01 00:00:00+01:00 | Valencia | 270.475 | 1001 | 77 | 1 | 62 | 0.0 | 0.0 | 0.0 | 0
1 | 1 | 2015-01-01 01:00:00+01:00 | Valencia | 270.475 | 1001 | 77 | 1 | 62 | 0.0 | 0.0 | 0.0 | 0
2 | 2 | 2015-01-01 02:00:00+01:00 | Valencia | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0
3 | 3 | 2015-01-01 03:00:00+01:00 | Valencia | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0
4 | 4 | 2015-01-01 04:00:00+01:00 | Valencia | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0
...
178391 | 178391 | 2018-12-31 19:00:00+01:00 | Seville | 287.760 | 1028 | 54 | 3 | 30 | 0.0 | 0.0 | 0.0 | 0
178392 | 178392 | 2018-12-31 20:00:00+01:00 | Seville | 285.760 | 1029 | 62 | 3 | 30 | 0.0 | 0.0 | 0.0 | 0
178393 | 178393 | 2018-12-31 21:00:00+01:00 | Seville | 285.150 | 1028 | 58 | 4 | 50 | 0.0 | 0.0 | 0.0 | 0
178394 | 178394 | 2018-12-31 22:00:00+01:00 | Seville | 284.150 | 1029 | 57 | 4 | 60 | 0.0 | 0.0 | 0.0 | 0
178395 | 178395 | 2018-12-31 23:00:00+01:00 | Seville | 283.970 | 1029 | 70 | 3 | 50 | 0.0 | 0.0 | 0.0 | 0

178396 rows × 12 columns
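The row count above is still 178396 even though 3076 duplicate rows were reported, which suggests that the order of reset_index() and drop_duplicates() matters. A small sketch on invented data shows the difference between the two orders:

```python
import pandas as pd

# Toy frame (invented): the first two rows are exact duplicates
df = pd.DataFrame({
    "city_name": ["Madrid", "Madrid", "Bilbao"],
    "temp": [280.0, 280.0, 285.0],
})

# reset_index() first: the new unique "index" column makes every row distinct
n_reset_first = len(df.reset_index().drop_duplicates())

# drop_duplicates() first: the duplicate is removed, then the index is renumbered
n_drop_first = len(df.drop_duplicates().reset_index(drop=True))

print(n_reset_first, n_drop_first)
```

In this article the duplicates are removed later anyway, when rows are de-duplicated by time and city_name.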

# Build the time index
df_weather["time"]=pd.to_datetime(df_weather["dt_iso"]) # convert the timestamp column to pandas datetime
df_weather.drop(["dt_iso"],axis=1,inplace=True)
df_weather=df_weather.set_index('time') # use time as the index
df_weather.drop(["index"],axis=1,inplace=True)   # drop the leftover index column (created by the earlier reset_index())
df_weather
time | city_name | temp | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all
2015-01-01 00:00:00+01:00 | Valencia | 270.475 | 1001 | 77 | 1 | 62 | 0.0 | 0.0 | 0.0 | 0
2015-01-01 01:00:00+01:00 | Valencia | 270.475 | 1001 | 77 | 1 | 62 | 0.0 | 0.0 | 0.0 | 0
2015-01-01 02:00:00+01:00 | Valencia | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0
2015-01-01 03:00:00+01:00 | Valencia | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0
2015-01-01 04:00:00+01:00 | Valencia | 269.686 | 1002 | 78 | 0 | 23 | 0.0 | 0.0 | 0.0 | 0
...
2018-12-31 19:00:00+01:00 | Seville | 287.760 | 1028 | 54 | 3 | 30 | 0.0 | 0.0 | 0.0 | 0
2018-12-31 20:00:00+01:00 | Seville | 285.760 | 1029 | 62 | 3 | 30 | 0.0 | 0.0 | 0.0 | 0
2018-12-31 21:00:00+01:00 | Seville | 285.150 | 1028 | 58 | 4 | 50 | 0.0 | 0.0 | 0.0 | 0
2018-12-31 22:00:00+01:00 | Seville | 284.150 | 1029 | 57 | 4 | 60 | 0.0 | 0.0 | 0.0 | 0
2018-12-31 23:00:00+01:00 | Seville | 283.970 | 1029 | 70 | 3 | 50 | 0.0 | 0.0 | 0.0 | 0

178396 rows × 10 columns

Handling outliers
# Show summary statistics for the weather features, rounded to two decimals
df_weather.describe().round(2)
      | temp | pressure | humidity | wind_speed | wind_deg | rain_1h | rain_3h | snow_3h | clouds_all
count | 178396.00 | 178396.00 | 178396.00 | 178396.00 | 178396.00 | 178396.00 | 178396.00 | 178396.00 | 178396.00
mean | 289.62 | 1069.26 | 68.42 | 2.47 | 166.59 | 0.08 | 0.00 | 0.00 | 25.07
std | 8.03 | 5969.63 | 21.90 | 2.10 | 116.61 | 0.40 | 0.01 | 0.22 | 30.77
min | 262.24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
25% | 283.67 | 1013.00 | 53.00 | 1.00 | 55.00 | 0.00 | 0.00 | 0.00 | 0.00
50% | 289.15 | 1018.00 | 72.00 | 2.00 | 177.00 | 0.00 | 0.00 | 0.00 | 20.00
75% | 295.15 | 1022.00 | 87.00 | 4.00 | 270.00 | 0.00 | 0.00 | 0.00 | 40.00
max | 315.60 | 1008371.00 | 100.00 | 133.00 | 360.00 | 12.00 | 2.32 | 21.50 | 100.00
# Create a 2x2 grid of plots to show four weather features
fig,axes=plt.subplots(nrows=2,ncols=2,figsize=(12,8))   # create the figure and axes

columns_to_plot=["pressure","wind_speed","rain_1h","rain_3h"]   # the four features to show

for i,ax in enumerate(axes.flat):
    if i < len(columns_to_plot):
        # draw the time series
        ax.plot(
            df_weather.index,   # x axis: the time index
            df_weather[columns_to_plot[i]]  # y axis: the corresponding feature
        )
        ax.set_title(columns_to_plot[i])
    else:
        ax.set_visible(False)   # hide the subplot if there is no data for it

plt.tight_layout()  # adjust the spacing between subplots

plt.show()



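# Create a 2x2 grid of box plots to show the distribution of the four weather features
# Box plot: shows how the data is distributed and helps spot outliers. The line inside the box is the median.
# The bottom and top of the box are the lower quartile (Q1) and the upper quartile (Q3), so the box holds the middle 50% of the data and its height reflects the spread.
# With matplotlib's default settings the whiskers extend to the farthest data points within 1.5×IQR of the box; points beyond the whiskers are drawn individually as outliers
fig,axes=plt.subplots(nrows=2,ncols=2,figsize=(12,8))

columns_to_plot=["pressure","wind_speed","rain_1h","rain_3h"]  # the four features

for i ,ax in enumerate(axes.flat):  # configure each box plot
    if i< len(columns_to_plot):
        ax.boxplot(x=df_weather[columns_to_plot[i]])
        ax.set_title(columns_to_plot[i])
    else:
        ax.set_visible(False)

plt.tight_layout()  # adjust the spacing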
plt.show()
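The box plots flag outliers with the 1.5×IQR rule. A minimal sketch of that rule as code is shown below, so the flagged points can be counted rather than only seen; iqr_outlier_mask is a hypothetical helper and the wind speed values are invented.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Invented wind speeds: the last value is implausibly large
wind_speed = [1, 2, 3, 4, 133]
mask = iqr_outlier_mask(wind_speed)
print(mask.tolist())
```

For strongly skewed features such as rainfall this rule flags many legitimate values, which is presumably why the article falls back on physically motivated thresholds instead.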

    


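# Handle the outliers using common-sense limits: values beyond a threshold are set to NaN; .loc selects data by condition
df_weather.loc[df_weather["pressure"] > 1080 ,"pressure"]=np.nan  # pressure values above 1080 become NaN
df_weather.loc[df_weather["pressure"] < 870 ,"pressure"]=np.nan 

df_weather.loc[df_weather["wind_speed"] > 113 ,"wind_speed"]=np.nan 

# Fill all the removed outliers by linear interpolation
df_weather.interpolate(
    method="linear",  # linear interpolation
    limit_direction="forward",  # fill missing values in the forward direction
    inplace=True   # modify the DataFrame directly
)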
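At this point df_weather still stacks all cities in one frame, so interpolating the whole frame could in principle bridge a gap across the boundary between two cities. A hedged sketch of the same mask-then-interpolate idea applied per city is given below; the toy frame and its values are invented, and the 870 to 1080 range simply reuses the thresholds from the code above.

```python
import pandas as pd

# Toy long-format frame (invented values) with one impossible reading per city
df = pd.DataFrame({
    "city_name": ["A", "A", "A", "B", "B", "B"],
    "pressure": [1010.0, 1008371.0, 1014.0, 1020.0, 0.0, 1024.0],
})

# Keep only readings inside the plausible range; everything else becomes NaN
df["pressure"] = df["pressure"].where(df["pressure"].between(870, 1080))

# Interpolate within each city so values never leak from one city into another
df["pressure"] = df.groupby("city_name")["pressure"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="forward")
)
print(df)
```

Each masked reading is replaced by the midpoint of its two neighbours from the same city.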
# Check the box plots again
fig,axes=plt.subplots(nrows=2,ncols=2,figsize=(12,8))

columns_to_plot=["pressure","wind_speed","rain_1h","rain_3h"]  # the four features

for i ,ax in enumerate(axes.flat):  # configure each box plot
    if i< len(columns_to_plot):
        ax.boxplot(x=df_weather[columns_to_plot[i]])
        ax.set_title(columns_to_plot[i])
    else:
        ax.set_visible(False)

plt.tight_layout()  # adjust the spacing
plt.show()


# Drop the rain_3h column because the data is unreliable: the 3-hour rainfall is smaller than the 1-hour rainfall
df_weather.drop(['rain_3h'],axis=1,inplace=True)
Handling the different number of records per city
# Print the number of samples in the energy data and in each city's weather data
print(f"Number of samples in df_energy is {df_energy.shape[0]}")    # number of energy samples

city_list=df_weather['city_name'].unique()  # get the city names
grouped_weather=df_weather.groupby("city_name")

for city in city_list:
    print(f"Number of samples in df_weather in {city} is {grouped_weather.get_group(city).shape[0]}")
Number of samples in df_energy is 35064
Number of samples in df_weather in Valencia is 35145
Number of samples in df_weather in Madrid is 36267
Number of samples in df_weather in Bilbao is 35951
Number of samples in df_weather in  Barcelona is 35476
Number of samples in df_weather in Seville is 35557
# Drop repeated records based on time and city name
df_weather_cleanned=df_weather.reset_index().drop_duplicates(
    subset=["time","city_name"],  # rows count as duplicates when both of these match
    keep="first"   # keep the first occurrence
).set_index("time")
df_weather_cleanned
time | city_name | temp | pressure | humidity | wind_speed | wind_deg | rain_1h | snow_3h | clouds_all
2015-01-01 00:00:00+01:00 | Valencia | 270.475 | 1001.0 | 77 | 1.0 | 62 | 0.0 | 0.0 | 0
2015-01-01 01:00:00+01:00 | Valencia | 270.475 | 1001.0 | 77 | 1.0 | 62 | 0.0 | 0.0 | 0
2015-01-01 02:00:00+01:00 | Valencia | 269.686 | 1002.0 | 78 | 0.0 | 23 | 0.0 | 0.0 | 0
2015-01-01 03:00:00+01:00 | Valencia | 269.686 | 1002.0 | 78 | 0.0 | 23 | 0.0 | 0.0 | 0
2015-01-01 04:00:00+01:00 | Valencia | 269.686 | 1002.0 | 78 | 0.0 | 23 | 0.0 | 0.0 | 0
...
2018-12-31 19:00:00+01:00 | Seville | 287.760 | 1028.0 | 54 | 3.0 | 30 | 0.0 | 0.0 | 0
2018-12-31 20:00:00+01:00 | Seville | 285.760 | 1029.0 | 62 | 3.0 | 30 | 0.0 | 0.0 | 0
2018-12-31 21:00:00+01:00 | Seville | 285.150 | 1028.0 | 58 | 4.0 | 50 | 0.0 | 0.0 | 0
2018-12-31 22:00:00+01:00 | Seville | 284.150 | 1029.0 | 57 | 4.0 | 60 | 0.0 | 0.0 | 0
2018-12-31 23:00:00+01:00 | Seville | 283.970 | 1029.0 | 70 | 3.0 | 50 | 0.0 | 0.0 | 0

175320 rows × 9 columns

# After cleaning, every city has the same number of records as the energy data
print(f"Number of samples in df_energy is {df_energy.shape[0]}")    # number of energy samples

city_list=df_weather['city_name'].unique()  # get the city names
grouped_weather=df_weather_cleanned.groupby("city_name")

for city in city_list:
    print(f"Number of samples in df_weather in {city} is {grouped_weather.get_group(city).shape[0]}")
Number of samples in df_energy is 35064
Number of samples in df_weather in Valencia is 35064
Number of samples in df_weather in Madrid is 35064
Number of samples in df_weather in Bilbao is 35064
Number of samples in df_weather in  Barcelona is 35064
Number of samples in df_weather in Seville is 35064
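Printing the counts works, but the same check can be turned into assertions so a later change to the cleaning steps cannot silently break it. A small sketch on invented data follows; the frame and values are assumptions for illustration only.

```python
import pandas as pd

# Toy long-format weather table (invented values); one (time, city) pair is repeated
times = pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00"])
weather = pd.DataFrame({
    "time": [times[0], times[0], times[1], times[0], times[1]],
    "city_name": ["Madrid", "Madrid", "Madrid", "Bilbao", "Bilbao"],
    "temp": [270.0, 270.0, 271.0, 275.0, 276.0],
})

# Keep the first record for every (time, city_name) key
deduped = weather.drop_duplicates(subset=["time", "city_name"], keep="first")

# After de-duplication no key may repeat and every city should have the same count
counts = deduped.groupby("city_name").size()
has_duplicate_keys = deduped.duplicated(subset=["time", "city_name"]).any()
same_count_everywhere = counts.nunique() == 1
print(counts.to_dict(), has_duplicate_keys, same_count_everywhere)
```

On the real data the expected per-city count would be the number of rows in df_energy.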
# Extract the data of each city; the result is a list of DataFrames
df_weather_all_cities=[grouped_weather.get_group(x) for x in grouped_weather.groups]
df_weather_all_cities[0]
time | city_name | temp | pressure | humidity | wind_speed | wind_deg | rain_1h | snow_3h | clouds_all
2015-01-01 00:00:00+01:00 | Barcelona | 281.625 | 1035.0 | 100 | 7.0 | 58 | 0.0 | 0.0 | 0
2015-01-01 01:00:00+01:00 | Barcelona | 281.625 | 1035.0 | 100 | 7.0 | 58 | 0.0 | 0.0 | 0
2015-01-01 02:00:00+01:00 | Barcelona | 281.286 | 1036.0 | 100 | 7.0 | 48 | 0.0 | 0.0 | 0
2015-01-01 03:00:00+01:00 | Barcelona | 281.286 | 1036.0 | 100 | 7.0 | 48 | 0.0 | 0.0 | 0
2015-01-01 04:00:00+01:00 | Barcelona | 281.286 | 1036.0 | 100 | 7.0 | 48 | 0.0 | 0.0 | 0
...
2018-12-31 19:00:00+01:00 | Barcelona | 284.130 | 1027.0 | 71 | 1.0 | 250 | 0.0 | 0.0 | 0
2018-12-31 20:00:00+01:00 | Barcelona | 282.640 | 1027.0 | 62 | 3.0 | 270 | 0.0 | 0.0 | 0
2018-12-31 21:00:00+01:00 | Barcelona | 282.140 | 1028.0 | 53 | 4.0 | 300 | 0.0 | 0.0 | 0
2018-12-31 22:00:00+01:00 | Barcelona | 281.130 | 1028.0 | 50 | 5.0 | 320 | 0.0 | 0.0 | 0
2018-12-31 23:00:00+01:00 | Barcelona | 280.130 | 1028.0 | 100 | 5.0 | 310 | 0.0 | 0.0 | 0

35064 rows × 9 columns

Merging the data

df_weather_energy=df_energy

for df_city in df_weather_all_cities:
    city_name=df_city.iloc[0]["city_name"].replace(' ','')  # get the city name without spaces; iloc[0] returns the first row as label/value pairs (e.g. city_name: 'Beijing', temp: 20, humidity: 65)
    df_temp_city=df_city.add_suffix(f"_{city_name}") # append the city name to every column label of this city's data
    df_weather_energy=pd.concat([df_weather_energy,df_temp_city],axis=1)    # merge the data column-wise
    df_weather_energy=df_weather_energy.drop(f"city_name_{city_name}",axis=1)  # drop the redundant city name column

df_weather_energy.columns
Index(['generation biomass', 'generation fossil gas', 'generation fossil oil',
       'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore', 'total load forecast',
       'total load actual', 'price day ahead', 'price actual',
       'generation fossil total', 'temp_Barcelona', 'pressure_Barcelona',
       'humidity_Barcelona', 'wind_speed_Barcelona', 'wind_deg_Barcelona',
       'rain_1h_Barcelona', 'snow_3h_Barcelona', 'clouds_all_Barcelona',
       'temp_Bilbao', 'pressure_Bilbao', 'humidity_Bilbao',
       'wind_speed_Bilbao', 'wind_deg_Bilbao', 'rain_1h_Bilbao',
       'snow_3h_Bilbao', 'clouds_all_Bilbao', 'temp_Madrid', 'pressure_Madrid',
       'humidity_Madrid', 'wind_speed_Madrid', 'wind_deg_Madrid',
       'rain_1h_Madrid', 'snow_3h_Madrid', 'clouds_all_Madrid', 'temp_Seville',
       'pressure_Seville', 'humidity_Seville', 'wind_speed_Seville',
       'wind_deg_Seville', 'rain_1h_Seville', 'snow_3h_Seville',
       'clouds_all_Seville', 'temp_Valencia', 'pressure_Valencia',
       'humidity_Valencia', 'wind_speed_Valencia', 'wind_deg_Valencia',
       'rain_1h_Valencia', 'snow_3h_Valencia', 'clouds_all_Valencia'],
      dtype='object')
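The loop above builds the wide table one city at a time. As an alternative, the same layout can be sketched with a single unstack followed by a join; this is not the method used in the original notebook, the toy frames below are invented, and the column order differs (grouped by feature rather than by city).

```python
import pandas as pd

idx = pd.DatetimeIndex(["2015-01-01 00:00", "2015-01-01 01:00"], name="time")

# Toy energy table (invented values)
energy = pd.DataFrame({"total load actual": [25385.0, 24382.0]}, index=idx)

# Toy long-format weather table; note the leading space in " Barcelona", as in the raw data
weather = pd.DataFrame(
    {
        "city_name": ["Madrid", "Madrid", " Barcelona", " Barcelona"],
        "temp": [270.0, 271.0, 281.0, 282.0],
        "humidity": [77, 78, 100, 100],
    },
    index=pd.DatetimeIndex([idx[0], idx[1], idx[0], idx[1]], name="time"),
)

# Move city_name into the index, then pivot it into the columns
wide = weather.set_index("city_name", append=True).unstack("city_name")

# Flatten the (feature, city) column pairs into names like temp_Madrid
wide.columns = [f"{feature}_{city.replace(' ', '')}" for feature, city in wide.columns]

merged = energy.join(wide)
print(merged.columns.tolist())
```

This relies on every (time, city_name) pair being unique, which is what the de-duplication step guarantees.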
# Check the number of invalid values
check_Nas_Dups(df_weather_energy)
Null count per column:
generation biomass                             0
generation fossil gas                          0
generation fossil oil                          0
generation hydro pumped storage consumption    0
generation hydro run-of-river and poundage     0
generation hydro water reservoir               0
generation nuclear                             0
generation other                               0
generation other renewable                     0
generation solar                               0
generation waste                               0
generation wind onshore                        0
total load forecast                            0
total load actual                              0
price day ahead                                0
price actual                                   0
generation fossil total                        0
temp_Barcelona                                 0
pressure_Barcelona                             0
humidity_Barcelona                             0
wind_speed_Barcelona                           0
wind_deg_Barcelona                             0
rain_1h_Barcelona                              0
snow_3h_Barcelona                              0
clouds_all_Barcelona                           0
temp_Bilbao                                    0
pressure_Bilbao                                0
humidity_Bilbao                                0
wind_speed_Bilbao                              0
wind_deg_Bilbao                                0
rain_1h_Bilbao                                 0
snow_3h_Bilbao                                 0
clouds_all_Bilbao                              0
temp_Madrid                                    0
pressure_Madrid                                0
humidity_Madrid                                0
wind_speed_Madrid                              0
wind_deg_Madrid                                0
rain_1h_Madrid                                 0
snow_3h_Madrid                                 0
clouds_all_Madrid                              0
temp_Seville                                   0
pressure_Seville                               0
humidity_Seville                               0
wind_speed_Seville                             0
wind_deg_Seville                               0
rain_1h_Seville                                0
snow_3h_Seville                                0
clouds_all_Seville                             0
temp_Valencia                                  0
pressure_Valencia                              0
humidity_Valencia                              0
wind_speed_Valencia                            0
wind_deg_Valencia                              0
rain_1h_Valencia                               0
snow_3h_Valencia                               0
clouds_all_Valencia                            0
dtype: int64
--------------------------------------------------
Number of duplicate rows:
0

Adding time features to the data

# Derive hour, weekday, month, and year features from the pandas datetime index
df_weather_energy["hour"]=df_weather_energy.index.map(lambda x : x.hour)
df_weather_energy["weekday"]=df_weather_energy.index.map(lambda x : x.weekday())
df_weather_energy["month"]=df_weather_energy.index.map(lambda x : x.month)
df_weather_energy["year"]=df_weather_energy.index.map(lambda x : x.year)

# Inspect the feature (column) names
df_weather_energy.columns
Index(['generation biomass', 'generation fossil gas', 'generation fossil oil',
       'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore', 'total load forecast',
       'total load actual', 'price day ahead', 'price actual',
       'generation fossil total', 'temp_Barcelona', 'pressure_Barcelona',
       'humidity_Barcelona', 'wind_speed_Barcelona', 'wind_deg_Barcelona',
       'rain_1h_Barcelona', 'snow_3h_Barcelona', 'clouds_all_Barcelona',
       'temp_Bilbao', 'pressure_Bilbao', 'humidity_Bilbao',
       'wind_speed_Bilbao', 'wind_deg_Bilbao', 'rain_1h_Bilbao',
       'snow_3h_Bilbao', 'clouds_all_Bilbao', 'temp_Madrid', 'pressure_Madrid',
       'humidity_Madrid', 'wind_speed_Madrid', 'wind_deg_Madrid',
       'rain_1h_Madrid', 'snow_3h_Madrid', 'clouds_all_Madrid', 'temp_Seville',
       'pressure_Seville', 'humidity_Seville', 'wind_speed_Seville',
       'wind_deg_Seville', 'rain_1h_Seville', 'snow_3h_Seville',
       'clouds_all_Seville', 'temp_Valencia', 'pressure_Valencia',
       'humidity_Valencia', 'wind_speed_Valencia', 'wind_deg_Valencia',
       'rain_1h_Valencia', 'snow_3h_Valencia', 'clouds_all_Valencia', 'hour',
       'weekday', 'month', 'year'],
      dtype='object')
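
As a side note, a `DatetimeIndex` exposes these fields directly as vectorized attributes, so the `map(lambda ...)` calls are not strictly required. The sketch below assumes the index is a `DatetimeIndex`; the frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with a DatetimeIndex; timestamps and prices are invented.
idx = pd.to_datetime(["2015-01-01 00:00", "2015-01-04 13:00", "2016-02-29 23:00"])
demo = pd.DataFrame({"price actual": [65.41, 50.0, 40.0]}, index=idx)

# Vectorized equivalents of index.map(lambda x: x.hour) and friends.
demo["hour"] = demo.index.hour
demo["weekday"] = demo.index.dayofweek  # Monday is 0, Sunday is 6
demo["month"] = demo.index.month
demo["year"] = demo.index.year

print(demo[["hour", "weekday", "month", "year"]])
```
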

Data visualization

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig=make_subplots() # create the figure

fig.add_trace(
    go.Scatter(
        x=df_weather_energy.index,  # x-axis: timestamps
        y=df_weather_energy["price actual"], # y-axis: actual price
        name="price actual"
    )
)

fig.add_trace(
    go.Scatter(
        x=df_weather_energy.index,
        y=df_weather_energy["price actual"].rolling(window=24).mean(),  # mean over the past 24 hours; smooths hourly fluctuations
        name="rolling window = daily avg"
    )
)

fig.add_trace(
    go.Scatter(
        x=df_weather_energy.index,
        y=df_weather_energy["price actual"].rolling(window=24*7).mean(),  # mean over the past 7 days; shows the longer-term trend
        name="rolling window = weekly avg"
    )
)

fig.show()
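
To make the rolling averages concrete, here is a minimal sketch on a five-point toy series (the numbers are made up). It shows that a window of size `n` yields `NaN` for the first `n - 1` points unless `min_periods` is lowered, which is why the smoothed traces start slightly later than the raw one.

```python
import pandas as pd

# Toy price series; values are invented for illustration.
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Default behaviour: the first window - 1 entries are NaN.
rolled = prices.rolling(window=3).mean()

# With min_periods=1 the mean is taken over however many points are available.
rolled_partial = prices.rolling(window=3, min_periods=1).mean()

print(rolled.tolist())
print(rolled_partial.tolist())
```
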

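  • Show how the electricity price varies by month and by day of the week
    • This helps reveal seasonal changes in price
    • It also shows whether the price fluctuates periodically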
sns.set(style='darkgrid')   # dark-grid style; set before creating the axes so that it applies to them

fig,axes=plt.subplots(
    ncols=2,    # two plots side by side
    figsize=(14,6)  # figure size
)

sns.barplot(
    x="month",
    y="price actual",    # y-axis
    data=df_weather_energy,  # data source
    estimator=sum,  # statistic: sum of prices within each month
    color='royalblue',  # bar color
    ax=axes[0]  # left panel
)
axes[0].set_title("Monthly actual price (1 is Jan)")  # title

sns.barplot(
    x="weekday",
    y="price actual",    # y-axis
    data=df_weather_energy,  # data source
    estimator=sum,  # statistic: sum of prices within each weekday
    color='royalblue',  # bar color
    ax=axes[1]  # right panel
)
axes[1].set_title("Daily actual price (0 is Monday)")  # title

plt.show()  # display the figure
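
One caveat about `estimator=sum`: a group's sum grows with the number of rows in it, so months (or weekdays) with more observations get taller bars even if their typical price is no higher. seaborn's `barplot` uses the mean by default. The sketch below uses a made-up five-row frame to show how the two statistics can rank groups differently.

```python
import pandas as pd

# Toy data: month 1 has more rows but a lower price than month 2 (values invented).
demo = pd.DataFrame({
    "month": [1, 1, 1, 2, 2],
    "price actual": [60.0, 60.0, 60.0, 70.0, 70.0],
})

# Compare the sum, the mean, and the row count per month.
by_month = demo.groupby("month")["price actual"].agg(["sum", "mean", "count"])
print(by_month)
```

Here month 1 has the larger sum only because it has more rows, while month 2 has the higher mean price.
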


  • Draw two histograms to compare actual values with forecasts
    • Forecast price vs. actual price
    • Forecast load vs. actual load
# Histograms of the actual price and the forecast price
plt.figure(figsize=(14,6))  # figure size

gr=sns.histplot(    # histogram of the forecast (day-ahead) price
    df_weather_energy["price day ahead"],   # forecast price data
    bins=100,  # number of bins
    label="TSO Prediction",
    element="step",  # step-style histogram
    color="lightcoral",
    kde=True  # overlay a kernel density estimate
)

gr=sns.histplot(    # histogram of the actual price
    df_weather_energy["price actual"],   # actual price data
    bins=100,  # number of bins
    label="Actual Price",
    element="step",  # step-style histogram
    color="royalblue",
    kde=True  # overlay a kernel density estimate
)

gr.set(xlabel="Price of Electricity (€/MWh)" , ylabel="Frequency")  # axis labels
plt.legend()  # show the legend
plt.show()  # display the figure


# Forecast load vs. actual load
plt.figure(figsize=(14,6))  # figure size

gr=sns.histplot(    # histogram of the actual load
    df_weather_energy["total load actual"],   # actual load data
    bins=100,  # number of bins
    label="Load actual",
    element="step",  # step-style histogram
    color="dimgrey",
    kde=True  # overlay a kernel density estimate
)

gr=sns.histplot(    # histogram of the forecast load
    df_weather_energy["total load forecast"],   # forecast load data
    bins=100,  # number of bins
    label="Load forecast",
    element="step",  # step-style histogram
    color="lightyellow",
    kde=True  # overlay a kernel density estimate
)

gr.set(xlabel="Load (MWh)" , ylabel="Frequency")  # axis labels
plt.legend()  # show the legend
plt.show()  # display the figure


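  • Compute the mean absolute error (MAE) of the TSO's day-ahead price forecast
    • First normalize the actual price and the forecast price
    • Then compute the MAE on the normalized values
  • Why is the scaler fitted only on the training data but applied to all of the data?
    • The validation and test data are treated as unseen, so only the known training set may determine the scaling parameters (the fit)
    • The model's predictions are also produced on the normalized scale, using the same scaling as the training set, so the training-set parameters must be used to transform all of the data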
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error

y_scaler_actual=MinMaxScaler()  # scales the actual price to [0,1] over the range seen during fit
y_scaler_dayahead=MinMaxScaler()  # scales the forecast price to [0,1] over the range seen during fit

# Dataset split
train_cutoff=int(0.8*df_weather_energy.shape[0])  # first 80% of rows: training data
val_cutoff=int(0.9*df_weather_energy.shape[0])   # rows from 80% to 90%: validation data

# Data preparation
y_price_actual=df_weather_energy[["price actual"]]  # all actual prices
y_price_dayahead=df_weather_energy[["price day ahead"]]  # all day-ahead forecast prices

# Normalize the actual price
y_scaler_actual.fit(y_price_actual[:train_cutoff])   # fit the scaler on the training rows only
actual_norm=y_scaler_actual.transform(y_price_actual)   # transform all actual prices

# Normalize the forecast price
# Note: this scaler is fitted separately, so the two normalized series are not on exactly the same scale
y_scaler_dayahead.fit(y_price_dayahead[:train_cutoff])   # fit the scaler on the training rows only
dayahead_norm=y_scaler_dayahead.transform(y_price_dayahead)   # transform all forecast prices

# Print the mean absolute error, rounded to three decimals
print(f"mean absolute error for normalized actual price and TSO prediction is : {round(mean_absolute_error(actual_norm,dayahead_norm),3)}")
mean absolute error for normalized actual price and TSO prediction is : 0.071
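
The fit-on-train, transform-everything idea can be illustrated without scikit-learn. The sketch below is an assumption-laden toy: the arrays are made up, the min-max scaling is written by hand to mimic `MinMaxScaler` with its default range, and, unlike the code above, one shared scaling (fitted on the actual price) is applied to both series so that the normalized MAE is simply the raw MAE divided by the training range.

```python
import numpy as np

# Toy actual and forecast prices; all values are invented for illustration.
actual = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
forecast = np.array([48.0, 63.0, 69.0, 84.0, 85.0])
train_cutoff = 4  # the first 4 points play the role of the training set

# MAE in the original units.
mae_raw = np.mean(np.abs(actual - forecast))

# Fit: min and max come from the training part of the actual price only.
lo = actual[:train_cutoff].min()
hi = actual[:train_cutoff].max()

def scale(x):
    """Apply the training-set min-max scaling to any array."""
    return (x - lo) / (hi - lo)

# Transform: the same parameters are applied to all points of both series.
mae_shared = np.mean(np.abs(scale(actual) - scale(forecast)))

print(mae_raw, mae_shared)
print(scale(actual))  # the last value exceeds 1 because it lies outside the training range
```

Values outside the training range map outside [0, 1], which is expected when the scaler has only seen the training data.
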
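# Drop the total load forecast; it is treated as unavailable at prediction time
df_weather_energy.drop("total load forecast",axis=1,inplace=True)

Decomposing the price series into trend, seasonality, and residual components, and plotting them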

How are these decomposed components obtained?

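  • Observed = trend + seasonality + residual

  • The trend shows the long-term behavior of the data

  • The seasonal component shows the periodic pattern

  • The residual shows the remaining random fluctuation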
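
As a rough answer to where the three components come from, the classical additive procedure can be sketched by hand: the trend is a centered moving average over one period, the seasonal component is the average of the detrended values at each position within the period (shifted to sum to zero), and the residual is whatever remains. The function `additive_decompose` below is a hypothetical helper written for this illustration; it handles odd periods only and is not the statsmodels implementation, although it is intended to follow the same classical idea.

```python
import numpy as np

def additive_decompose(values, period):
    """Classical additive decomposition for an odd period (illustrative sketch)."""
    x = np.asarray(values, dtype=float)
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2

    # Trend: centered moving average over one full period (NaN at both edges).
    trend = np.full(x.shape, np.nan)
    kernel = np.ones(period) / period
    trend[half:len(x) - half] = np.convolve(x, kernel, mode="valid")

    # Seasonality: mean detrended value at each phase, centered to sum to zero.
    detrended = x - trend
    phase_means = np.array([np.nanmean(detrended[p::period]) for p in range(period)])
    phase_means -= phase_means.mean()
    seasonal = np.tile(phase_means, len(x) // period + 1)[:len(x)]

    # Residual: what is left after removing trend and seasonality.
    resid = x - trend - seasonal
    return trend, seasonal, resid

# Synthetic series: linear trend plus a repeating zero-mean pattern of length 5.
t = np.arange(30)
pattern = np.array([2.0, -1.0, 0.0, 1.0, -2.0])
series = 0.5 * t + np.tile(pattern, 6)

trend, seasonal, resid = additive_decompose(series, period=5)
print(seasonal[:5])
```

On this noise-free series the recovered seasonal pattern matches the one that was added and the interior residuals are zero, which is a quick sanity check that observed = trend + seasonality + residual.
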

from statsmodels.tsa.seasonal import seasonal_decompose

fig,axes=plt.subplots(5,1,figsize=(20,12))  # 5 rows, 1 column of plots

decom_data = df_weather_energy[['price actual']].copy()

decompose_result=seasonal_decompose(   # decompose the time series
    decom_data,
    period=5,   # seasonal period of 5, as in the source notebook; for hourly data, period=24 would correspond to a daily cycle
    model='additive'  # additive model
)

# Extract the decomposition results
observed=decompose_result.observed  # original observations
trend=decompose_result.trend  # trend component
seasonal=decompose_result.seasonal # seasonal component
residual=decompose_result.resid   # residual component

# Plot each component
axes[0].plot(observed,color='dimgrey')  # original data
axes[0].set_title('Observed')

axes[1].plot(trend,color='royalblue')  # trend
axes[1].set_title('Trend')

axes[2].plot(seasonal,color='lightcoral')  # seasonal component
axes[2].set_title('Seasonality')

axes[3].plot(seasonal[:100],color='lightcoral')  # zoom in: first 100 points of the seasonal component
axes[3].set_title('Zoomed Seasonality')

axes[4].plot(residual,color='lightcoral')  # residual
axes[4].set_title('Residual')

fig.tight_layout()
plt.show()


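Three key concepts for time series

  • Dickey-Fuller test (ADF test): tests whether a time series is stationary
  • Stationarity: the statistical properties of the series (such as its mean and variance) do not change over time
  • ARMA model: a statistical model for stationary time series
    (My personal understanding: the more negative the ADF statistic, the stronger the evidence that the series is stationary, which makes a simple model such as ARMA a reasonable fit.)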
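# Run the ADF test to check whether the time series is stationary
from statsmodels.tsa.stattools import adfuller

result=adfuller(df_weather_energy['price actual'])  # pass the column as a 1-D Series
print('ADF Statistic',result[0])    # the more negative, the stronger the evidence against a unit root
print('P-value',result[1])   # p < 0.05 rejects the unit-root null hypothesis, suggesting stationarity
print('Critical Values',result[4])  # a statistic below these critical values suggests stationarity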
ADF Statistic -9.147016232851161
P-value 2.7504934849347068e-15
Critical Values {'1%': -3.4305367814665044, '5%': -2.8616225527935106, '10%': -2.566813940257257}
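
The decision rule can be spelled out in a few lines. The helper `interpret_adf` below is hypothetical (it is not part of statsmodels); it simply applies the two checks described in the comments above to the numbers printed by the test.

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Summarize an ADF result: reject the unit-root null if p < alpha,
    and list the significance levels whose critical value the statistic falls below."""
    rejected_levels = [
        level for level, crit in critical_values.items() if statistic < crit
    ]
    return {"reject_unit_root": p_value < alpha, "rejected_at": rejected_levels}

# Numbers copied from the output above.
summary = interpret_adf(
    statistic=-9.147016232851161,
    p_value=2.7504934849347068e-15,
    critical_values={
        "1%": -3.4305367814665044,
        "5%": -2.8616225527935106,
        "10%": -2.566813940257257,
    },
)
print(summary)
```

For the price series the statistic lies below all three critical values and the p-value is far below 0.05, so the unit-root hypothesis is rejected and the series can be treated as stationary for the purposes of this test.
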