Time series data processing methods: electricity price data (forecast and actual)

All code in this article comes from: Electricity_Demand_and_Price_forecasting/Electricity_Demand_and_Price_forecasting.ipynb at main · ritikdhame/Electricity_Demand_and_Price_forecasting · GitHub
If the visualization images fail to load, they are all available in the GitHub repository.

# Basic data handling
import pandas as pd
import numpy as np

# Data preprocessing tools
from sklearn.preprocessing import StandardScaler

# Time handling
from datetime import datetime, timedelta

# Data visualization (optional, for exploratory analysis)
import matplotlib.pyplot as plt
import seaborn as sns

Processing the generation data

Read the data and display its basic information

# Read the data
df_energy = pd.read_csv('E:\\python_project\\informer\\Informer2020-main\\data\\energy_dataset.csv')

# Preview the data
df_energy.head()

	time	generation biomass	generation fossil brown coal/lignite	generation fossil gas	generation fossil hard coal	generation fossil oil	...	generation waste	generation wind onshore	forecast solar day ahead	forecast wind offshore eday ahead	forecast wind onshore day ahead	total load forecast	total load actual	price day ahead	price actual
0	2015-01-01 00:00:00+01:00	447.0	329.0	4844.0	4821.0	162.0	...	196.0	6378.0	17.0	NaN	6436.0	26118.0	25385.0	50.10	65.41
1	2015-01-01 01:00:00+01:00	449.0	328.0	5196.0	4755.0	158.0	...	195.0	5890.0	16.0	NaN	5856.0	24934.0	24382.0	48.10	64.92
2	2015-01-01 02:00:00+01:00	448.0	323.0	4857.0	4581.0	157.0	...	196.0	5461.0	8.0	NaN	5454.0	23515.0	22734.0	47.33	64.48
3	2015-01-01 03:00:00+01:00	438.0	254.0	4314.0	4131.0	160.0	...	191.0	5238.0	2.0	NaN	5151.0	22642.0	21286.0	42.27	59.32
4	2015-01-01 04:00:00+01:00	428.0	187.0	4130.0	3840.0	156.0	...	189.0	4935.0	9.0	NaN	4861.0	21785.0	20264.0	38.41	56.04

5 rows × 29 columns

# Summary statistics for each feature (column)
df_energy.describe().T

	count	mean	std	min	25%	50%	75%	max
generation biomass	35045.0	383.513540	85.353943	0.00	333.0000	367.00	433.00	592.00
generation fossil brown coal/lignite	35046.0	448.059208	354.568590	0.00	0.0000	509.00	757.00	999.00
generation fossil coal-derived gas	35046.0	0.000000	0.000000	0.00	0.0000	0.00	0.00	0.00
generation fossil gas	35046.0	5622.737488	2201.830478	0.00	4126.0000	4969.00	6429.00	20034.00
generation fossil hard coal	35046.0	4256.065742	1961.601013	0.00	2527.0000	4474.00	5838.75	8359.00
generation fossil oil	35045.0	298.319789	52.520673	0.00	263.0000	300.00	330.00	449.00
generation fossil oil shale	35046.0	0.000000	0.000000	0.00	0.0000	0.00	0.00	0.00
generation fossil peat	35046.0	0.000000	0.000000	0.00	0.0000	0.00	0.00	0.00
generation geothermal	35046.0	0.000000	0.000000	0.00	0.0000	0.00	0.00	0.00
generation hydro pumped storage aggregated	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
generation hydro pumped storage consumption	35045.0	475.577343	792.406614	0.00	0.0000	68.00	616.00	4523.00
generation hydro run-of-river and poundage	35045.0	972.116108	400.777536	0.00	637.0000	906.00	1250.00	2000.00
generation hydro water reservoir	35046.0	2605.114735	1835.199745	0.00	1077.2500	2164.00	3757.00	9728.00
generation marine	35045.0	0.000000	0.000000	0.00	0.0000	0.00	0.00	0.00
generation nuclear	35047.0	6263.907039	839.667958	0.00	5760.0000	6566.00	7025.00	7117.00
generation other	35046.0	60.228585	20.238381	0.00	53.0000	57.00	80.00	106.00
generation other renewable	35046.0	85.639702	14.077554	0.00	73.0000	88.00	97.00	119.00
generation solar	35046.0	1432.665925	1680.119887	0.00	71.0000	616.00	2578.00	5792.00
generation waste	35045.0	269.452133	50.195536	0.00	240.0000	279.00	310.00	357.00
generation wind offshore	35046.0	0.000000	0.000000	0.00	0.0000	0.00	0.00	0.00
generation wind onshore	35046.0	5464.479769	3213.691587	0.00	2933.0000	4849.00	7398.00	17436.00
forecast solar day ahead	35064.0	1439.066735	1677.703355	0.00	69.0000	576.00	2636.00	5836.00
forecast wind offshore eday ahead	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
forecast wind onshore day ahead	35064.0	5471.216689	3176.312853	237.00	2979.0000	4855.00	7353.00	17430.00
total load forecast	35064.0	28712.129962	4594.100854	18105.00	24793.7500	28906.00	32263.25	41390.00
total load actual	35028.0	28696.939905	4574.987950	18041.00	24807.7500	28901.00	32192.00	41015.00
price day ahead	35064.0	49.874341	14.618900	2.06	41.4900	50.52	60.53	101.99
price actual	35064.0	57.884023	14.204083	9.33	49.3475	58.02	68.01	116.80

# Basic information: dtypes, non-null counts and memory usage
df_energy.info()
# Count the missing values in each column
df_energy.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35064 entries, 0 to 35063
Data columns (total 29 columns):
time                                           35064 non-null object
generation biomass                             35045 non-null float64
generation fossil brown coal/lignite           35046 non-null float64
generation fossil coal-derived gas             35046 non-null float64
generation fossil gas                          35046 non-null float64
generation fossil hard coal                    35046 non-null float64
generation fossil oil                          35045 non-null float64
generation fossil oil shale                    35046 non-null float64
generation fossil peat                         35046 non-null float64
generation geothermal                          35046 non-null float64
generation hydro pumped storage aggregated     0 non-null float64
generation hydro pumped storage consumption    35045 non-null float64
generation hydro run-of-river and poundage     35045 non-null float64
generation hydro water reservoir               35046 non-null float64
generation marine                              35045 non-null float64
generation nuclear                             35047 non-null float64
generation other                               35046 non-null float64
generation other renewable                     35046 non-null float64
generation solar                               35046 non-null float64
generation waste                               35045 non-null float64
generation wind offshore                       35046 non-null float64
generation wind onshore                        35046 non-null float64
forecast solar day ahead                       35064 non-null float64
forecast wind offshore eday ahead              0 non-null float64
forecast wind onshore day ahead                35064 non-null float64
total load forecast                            35064 non-null float64
total load actual                              35028 non-null float64
price day ahead                                35064 non-null float64
price actual                                   35064 non-null float64
dtypes: float64(28), object(1)
memory usage: 7.8+ MB





time                                               0
generation biomass                                19
generation fossil brown coal/lignite              18
generation fossil coal-derived gas                18
generation fossil gas                             18
generation fossil hard coal                       18
generation fossil oil                             19
generation fossil oil shale                       18
generation fossil peat                            18
generation geothermal                             18
generation hydro pumped storage aggregated     35064
generation hydro pumped storage consumption       19
generation hydro run-of-river and poundage        19
generation hydro water reservoir                  18
generation marine                                 19
generation nuclear                                17
generation other                                  18
generation other renewable                        18
generation solar                                  18
generation waste                                  19
generation wind offshore                          18
generation wind onshore                           18
forecast solar day ahead                           0
forecast wind offshore eday ahead              35064
forecast wind onshore day ahead                    0
total load forecast                                0
total load actual                                 36
price day ahead                                    0
price actual                                       0
dtype: int64

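Processing the generation data

Drop the features (columns) that contain only invalid values (0/NULL), along with the day-ahead generation forecasts

# Columns to drop
col_names = [
    # Fossil-fuel generation
    'generation fossil coal-derived gas',  # coal-derived gas
    'generation fossil oil shale',         # oil shale
    'generation fossil peat',              # peat

    # Renewable generation
    'generation geothermal',               # geothermal
    'generation hydro pumped storage aggregated',  # pumped-storage hydro (aggregated)
    'generation marine',                   # marine
    'generation wind offshore',            # offshore wind

    # Forecast data
    'forecast wind offshore eday ahead',   # offshore wind day-ahead forecast
    'forecast solar day ahead',            # solar day-ahead forecast
    'forecast wind onshore day ahead'      # onshore wind day-ahead forecast
]

# Drop these columns; columns= already selects the column axis, and inplace=True modifies the DataFrame in place
df_energy.drop(columns=col_names, inplace=True)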
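
The all-zero and all-null columns in the list above were picked by reading the describe() and isnull() output. As a rough sketch, the same candidates could be found programmatically. The helper name `uninformative_columns` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def uninformative_columns(df):
    """Names of numeric columns whose values are all zero and/or null (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    # Treat NaN as 0 so that all-null and all-zero columns are caught together
    is_empty = numeric.fillna(0).eq(0).all()
    return is_empty[is_empty].index.tolist()

# Toy data, not the real energy dataset
demo = pd.DataFrame({
    "all_zero": [0.0, 0.0, np.nan],
    "all_null": [np.nan, np.nan, np.nan],
    "useful": [1.0, 0.0, 2.0],
})
empty_cols = uninformative_columns(demo)
print(empty_cols)
```

Applied to df_energy before the drop, a helper like this would only flag the columns that carry no information. The two day-ahead forecast columns that hold real values would still have to be listed by hand.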

# Print the null count of each column and the number of duplicate rows
def check_Nas_Dups(df_input):
    print("Null count per column:")
    print(df_input.isnull().sum())
    print("-"*50)
    print("Number of duplicate rows:")
    print(df_input.duplicated().sum())

check_Nas_Dups(df_energy)

Null count per column:
time                                            0
generation biomass                             19
generation fossil brown coal/lignite           18
generation fossil gas                          18
generation fossil hard coal                    18
generation fossil oil                          19
generation hydro pumped storage consumption    19
generation hydro run-of-river and poundage     19
generation hydro water reservoir               18
generation nuclear                             17
generation other                               18
generation other renewable                     18
generation solar                               18
generation waste                               19
generation wind onshore                        18
total load forecast                             0
total load actual                              36
price day ahead                                 0
price actual                                    0
dtype: int64
--------------------------------------------------
Number of duplicate rows:
0

# Convert the time column to datetime
df_energy["time"]=pd.to_datetime(df_energy["time"])
# Use the time column as the index (previously the row number was the index)
df_energy=df_energy.set_index("time")
df_energy

	generation biomass	generation fossil brown coal/lignite	generation fossil gas	generation fossil hard coal	generation fossil oil	generation hydro pumped storage consumption	generation hydro run-of-river and poundage	generation hydro water reservoir	generation nuclear	generation other	generation other renewable	generation solar	generation waste	generation wind onshore	total load forecast	total load actual	price day ahead	price actual
time
2015-01-01 00:00:00+01:00	447.0	329.0	4844.0	4821.0	162.0	863.0	1051.0	1899.0	7096.0	43.0	73.0	49.0	196.0	6378.0	26118.0	25385.0	50.10	65.41
2015-01-01 01:00:00+01:00	449.0	328.0	5196.0	4755.0	158.0	920.0	1009.0	1658.0	7096.0	43.0	71.0	50.0	195.0	5890.0	24934.0	24382.0	48.10	64.92
2015-01-01 02:00:00+01:00	448.0	323.0	4857.0	4581.0	157.0	1164.0	973.0	1371.0	7099.0	43.0	73.0	50.0	196.0	5461.0	23515.0	22734.0	47.33	64.48
2015-01-01 03:00:00+01:00	438.0	254.0	4314.0	4131.0	160.0	1503.0	949.0	779.0	7098.0	43.0	75.0	50.0	191.0	5238.0	22642.0	21286.0	42.27	59.32
2015-01-01 04:00:00+01:00	428.0	187.0	4130.0	3840.0	156.0	1826.0	953.0	720.0	7097.0	43.0	74.0	42.0	189.0	4935.0	21785.0	20264.0	38.41	56.04
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2018-12-31 19:00:00+01:00	297.0	0.0	7634.0	2628.0	178.0	1.0	1135.0	4836.0	6073.0	63.0	95.0	85.0	277.0	3113.0	30619.0	30653.0	68.85	77.02
2018-12-31 20:00:00+01:00	296.0	0.0	7241.0	2566.0	174.0	1.0	1172.0	3931.0	6074.0	62.0	95.0	33.0	280.0	3288.0	29932.0	29735.0	68.40	76.16
2018-12-31 21:00:00+01:00	292.0	0.0	7025.0	2422.0	168.0	50.0	1148.0	2831.0	6076.0	61.0	94.0	31.0	286.0	3503.0	27903.0	28071.0	66.88	74.30
2018-12-31 22:00:00+01:00	293.0	0.0	6562.0	2293.0	163.0	108.0	1128.0	2068.0	6075.0	61.0	93.0	31.0	287.0	3586.0	25450.0	25801.0	63.93	69.89
2018-12-31 23:00:00+01:00	290.0	0.0	6926.0	2166.0	163.0	108.0	1069.0	1686.0	6075.0	61.0	92.0	31.0	287.0	3651.0	24424.0	24455.0	64.27	69.88

35064 rows × 18 columns
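
The time strings mix two UTC offsets (+01:00 in winter, +02:00 in summer). Depending on the pandas version, parsing such strings without `utc=True` may emit a warning, return an object-dtype index instead of a proper DatetimeIndex, or raise. A minimal sketch of one way around this is to parse to UTC first and then convert to a single named time zone. The choice of "Europe/Madrid" is an assumption based on the Spanish cities in the weather data.

```python
import pandas as pd

# Two consecutive hours around the 2015 switch to summer time (toy sample)
raw = pd.Series([
    "2015-03-29 01:00:00+01:00",  # last hour on the +01:00 offset
    "2015-03-29 03:00:00+02:00",  # first hour on the +02:00 offset
])

# Parse everything to UTC so the result has one uniform, tz-aware dtype
ts_utc = pd.to_datetime(raw, utc=True)

# Optionally convert back to local wall-clock time (assumed zone)
ts_local = ts_utc.dt.tz_convert("Europe/Madrid")

print(ts_utc.tolist())
print(ts_local.tolist())
```

If this route is taken, the energy and weather frames should be parsed the same way so that their indexes still align when merged.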

# Plot two weeks of the total load actual column
plt.figure(figsize=(12,6))  # figure size
plt.plot(df_energy["total load actual"][:24*7*2])
plt.xlabel("Time")
plt.ylabel("Total Load Actual")
plt.title("Total Load Actual of Two Weeks")
plt.show()



Fill null values by linear interpolation

# Return every row that contains a null value; axis=1 checks across the columns of each row
df_energy[df_energy.isna().any(axis=1)]

	generation biomass	generation fossil brown coal/lignite	generation fossil gas	generation fossil hard coal	generation fossil oil	generation hydro pumped storage consumption	generation hydro run-of-river and poundage	generation hydro water reservoir	generation nuclear	generation other	generation other renewable	generation solar	generation waste	generation wind onshore	total load forecast	total load actual	price day ahead	price actual
time
2015-01-05 03:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	21912.0	21182.0	35.20	59.68
2015-01-05 12:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	23209.0	NaN	35.50	79.14
2015-01-05 13:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	23725.0	NaN	36.80	73.95
2015-01-05 14:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	23614.0	NaN	32.50	71.93
2015-01-05 15:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22381.0	NaN	30.00	71.50
2015-01-05 16:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	21371.0	NaN	30.00	71.85
2015-01-05 17:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	20760.0	NaN	30.60	80.53
2015-01-19 19:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38642.0	39304.0	70.01	88.95
2015-01-19 20:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38758.0	39262.0	69.00	87.94
2015-01-27 19:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38968.0	38335.0	66.00	83.97
2015-01-28 13:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	36239.0	NaN	65.00	77.62
2015-02-01 07:00:00+01:00	449.0	312.0	4765.0	5269.0	222.0	480.0	980.0	1174.0	7101.0	44.0	75.0	48.0	208.0	3289.0	24379.0	NaN	56.10	16.98
2015-02-01 08:00:00+01:00	453.0	312.0	4938.0	5652.0	288.0	0.0	1031.0	3229.0	7099.0	44.0	75.0	73.0	207.0	3102.0	27389.0	NaN	57.69	19.56
2015-02-01 09:00:00+01:00	452.0	302.0	4997.0	5770.0	296.0	0.0	1083.0	4574.0	7097.0	43.0	71.0	809.0	204.0	2838.0	30619.0	NaN	60.01	23.13
2015-02-01 12:00:00+01:00	405.0	317.0	5247.0	6008.0	333.0	0.0	1119.0	4416.0	7095.0	42.0	72.0	3817.0	200.0	1413.0	31357.0	NaN	59.97	22.51
2015-02-01 13:00:00+01:00	402.0	317.0	5449.0	6005.0	318.0	0.0	1171.0	4475.0	7096.0	41.0	73.0	3836.0	193.0	1347.0	31338.0	NaN	59.69	23.44
2015-02-01 14:00:00+01:00	400.0	317.0	5266.0	5995.0	327.0	0.0	1216.0	4412.0	7098.0	42.0	79.0	3701.0	199.0	1345.0	30874.0	NaN	58.69	24.10
2015-02-01 15:00:00+01:00	393.0	321.0	5209.0	5939.0	345.0	0.0	1204.0	3403.0	7097.0	41.0	79.0	3475.0	204.0	1487.0	30124.0	NaN	58.13	21.12
2015-02-01 16:00:00+01:00	413.0	325.0	5642.0	6000.0	345.0	0.0	1193.0	3333.0	7097.0	40.0	77.0	2742.0	203.0	1648.0	29714.0	NaN	59.00	21.73
2015-02-01 17:00:00+01:00	465.0	321.0	6127.0	5912.0	346.0	0.0	1214.0	4684.0	7096.0	41.0	77.0	1281.0	207.0	1857.0	29801.0	NaN	59.69	25.93
2015-02-01 18:00:00+01:00	482.0	326.0	7386.0	6002.0	340.0	0.0	1299.0	6187.0	7095.0	41.0	79.0	328.0	208.0	1864.0	32257.0	NaN	63.76	54.13
2015-02-01 19:00:00+01:00	474.0	326.0	7963.0	6026.0	343.0	0.0	1313.0	6895.0	7096.0	42.0	82.0	161.0	207.0	1813.0	33183.0	NaN	65.01	68.53
2015-04-05 03:00:00+02:00	371.0	0.0	5015.0	3248.0	257.0	799.0	1233.0	2531.0	4027.0	81.0	67.0	31.0	142.0	3153.0	20016.0	NaN	42.55	29.04
2015-04-16 09:00:00+02:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	31001.0	NaN	56.54	67.55
2015-04-20 08:00:00+02:00	424.0	642.0	5614.0	5784.0	369.0	0.0	1122.0	4050.0	6954.0	41.0	62.0	636.0	147.0	797.0	29287.0	NaN	62.00	72.92
2015-04-23 21:00:00+02:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	31421.0	NaN	69.49	82.57
2015-05-02 10:00:00+02:00	497.0	0.0	5502.0	5677.0	375.0	0.0	1425.0	5289.0	6353.0	93.0	72.0	2535.0	205.0	10903.0	39644.0	NaN	58.49	59.09
2015-05-29 03:00:00+02:00	569.0	756.0	4239.0	4635.0	365.0	755.0	667.0	1277.0	5035.0	85.0	69.0	662.0	201.0	6503.0	23132.0	NaN	45.93	55.07
2015-06-15 09:00:00+02:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	29899.0	30047.0	62.48	73.82
2015-10-02 08:00:00+02:00	483.0	961.0	6545.0	8250.0	385.0	0.0	1323.0	5378.0	7013.0	87.0	70.0	140.0	205.0	4362.0	36798.0	NaN	66.19	70.13
2015-10-02 11:00:00+02:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38921.0	NaN	70.09	70.49
2015-12-02 09:00:00+01:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	37413.0	NaN	75.71	80.44
2016-04-13 05:00:00+02:00	220.0	0.0	3390.0	1242.0	243.0	2270.0	1622.0	4515.0	7097.0	53.0	69.0	150.0	NaN	8596.0	23514.0	23614.0	18.69	25.14
2016-04-25 05:00:00+02:00	190.0	0.0	2969.0	886.0	151.0	1340.0	1564.0	5389.0	7094.0	50.0	59.0	454.0	195.0	5989.0	21471.0	NaN	15.00	22.65
2016-04-25 07:00:00+02:00	206.0	0.0	3673.0	1143.0	185.0	162.0	1648.0	6807.0	7095.0	51.0	62.0	283.0	214.0	5682.0	27635.0	NaN	32.97	40.18
2016-05-10 23:00:00+02:00	348.0	960.0	6800.0	5219.0	299.0	0.0	443.0	1750.0	7002.0	50.0	91.0	58.0	280.0	3311.0	26641.0	NaN	51.57	39.11
2016-06-12 01:00:00+02:00	356.0	595.0	5719.0	6165.0	274.0	382.0	NaN	1325.0	5056.0	56.0	86.0	30.0	291.0	2019.0	24715.0	24155.0	60.23	48.72
2016-07-09 22:00:00+02:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6923.0	NaN	NaN	NaN	NaN	NaN	34985.0	NaN	45.72	51.72
2016-07-12 00:00:00+02:00	346.0	595.0	5951.0	6131.0	NaN	494.0	709.0	1215.0	5058.0	49.0	83.0	31.0	309.0	2031.0	25313.0	25103.0	64.99	47.49
2016-09-28 09:00:00+02:00	347.0	594.0	5522.0	6272.0	292.0	0.0	524.0	2494.0	6997.0	61.0	86.0	982.0	300.0	5478.0	31072.0	NaN	49.72	56.40
2016-10-27 23:00:00+02:00	351.0	554.0	7176.0	5690.0	321.0	NaN	417.0	1295.0	6967.0	58.0	91.0	70.0	299.0	3193.0	26423.0	26583.0	55.70	62.84
2016-11-23 04:00:00+01:00	NaN	900.0	4838.0	4547.0	269.0	1413.0	795.0	435.0	5040.0	60.0	85.0	15.0	227.0	4598.0	23469.0	23112.0	43.19	49.11
2017-11-14 12:00:00+01:00	0.0	0.0	10064.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	33805.0	NaN	60.53	66.17
2017-11-14 19:00:00+01:00	0.0	0.0	12336.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	35592.0	NaN	68.05	75.45
2018-06-11 18:00:00+02:00	331.0	506.0	7538.0	5360.0	300.0	1.0	1134.0	4258.0	5856.0	52.0	96.0	170.0	269.0	9165.0	34752.0	NaN	69.87	64.93
2018-07-11 09:00:00+02:00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	33938.0	NaN	63.01	69.79

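# Fill nulls by linear interpolation: method="linear" interpolates linearly, limit_direction="forward" fills consecutive NaNs in the forward direction, inplace=True modifies the DataFrame in place
df_energy.interpolate(method="linear",limit_direction="forward",inplace=True)
# Print the null count of each column after filling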
check_Nas_Dups(df_energy)

Null count per column:
generation biomass                             0
generation fossil brown coal/lignite           0
generation fossil gas                          0
generation fossil hard coal                    0
generation fossil oil                          0
generation hydro pumped storage consumption    0
generation hydro run-of-river and poundage     0
generation hydro water reservoir               0
generation nuclear                             0
generation other                               0
generation other renewable                     0
generation solar                               0
generation waste                               0
generation wind onshore                        0
total load forecast                            0
total load actual                              0
price day ahead                                0
price actual                                   0
dtype: int64
--------------------------------------------------
Number of duplicate rows:
0
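
To make the effect of the interpolation call concrete, here is a small sketch on made-up numbers (not taken from the dataset). With `limit_direction="forward"`, gaps between valid values are filled linearly and trailing NaNs repeat the last valid value, but a NaN before the first valid value is left untouched.

```python
import numpy as np
import pandas as pd

# Six hourly toy readings with a leading gap, an inner gap and a trailing gap
idx = pd.date_range("2015-01-05 10:00", periods=6, freq=pd.Timedelta(hours=1))
load = pd.Series([np.nan, 100.0, np.nan, np.nan, 130.0, np.nan], index=idx)

filled = load.interpolate(method="linear", limit_direction="forward")
print(filled)
```

The energy data happens to start with valid values, which is consistent with the check above reporting zero nulls after filling.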

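Draw a heatmap of the correlations between columns

# Heatmap helper
def feat_corr(input_df):
    # Pearson correlation matrix over the numeric columns only (newer pandas versions raise on string columns)
    corr=input_df.select_dtypes(include="number").corr()
    plt.figure(figsize=(15,12))  # figure size

    # Draw the heatmap
    sns.heatmap(
        corr,
        annot=True,  # show the correlation coefficients
        fmt=".2f",  # number of decimals for the coefficients
        cmap="RdYlBu",  # colour map
        center=0,  # centre of the colour scale
        vmin=-1,vmax=1,  # range of the correlation coefficients
    )
    plt.show()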

feat_corr(df_energy)


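Merge and drop features based on the correlation matrix

generation fossil hard coal and generation fossil brown coal/lignite have a correlation coefficient of 0.77 and both describe coal-fired generation, so they can be merged into a single feature and the original features dropped.
When the correlation coefficient of two features is close to 1, the information they carry overlaps heavily, so one of them can be dropped to reduce computation.

# Create a new feature for coal-fired generation by adding the two features above
df_energy["generation fossil total"]=df_energy["generation fossil hard coal"]+df_energy["generation fossil brown coal/lignite"]
# Drop the original features
df_energy.drop(["generation fossil hard coal",
                "generation fossil brown coal/lignite"],
                axis=1,  # axis=1 drops columns
                inplace=True  # modify the DataFrame in place
                )

# Redraw the heatmap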
feat_corr(df_energy)
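
Reading strongly correlated pairs off a large heatmap is error-prone. The following sketch lists them directly from the correlation matrix. The function name `top_correlated_pairs`, the 0.7 threshold and the toy columns are illustrative assumptions rather than anything from the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.7):
    """Feature pairs whose absolute Pearson correlation is at least `threshold` (hypothetical helper)."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy data: "a" and "b" move together, "c" does not
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
pairs = top_correlated_pairs(demo)
print(pairs)
```

Whether a flagged pair should be merged, as done with the two coal features, or reduced to one feature remains a judgment call based on what the columns mean.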


Processing the weather data

# Read the weather data
df_weather=pd.read_csv("E:\\python_project\\informer\\Informer2020-main\\data\\weather_features.csv")
# Preview the data
df_weather.head()

	dt_iso	city_name	temp	temp_min	temp_max	pressure	humidity	wind_speed	wind_deg	weather_id	weather_main	weather_description	weather_icon
0	2015-01-01 00:00:00+01:00	Valencia	270.475	270.475	270.475	1001	77	1	62	800	clear	sky is clear	01n
1	2015-01-01 01:00:00+01:00	Valencia	270.475	270.475	270.475	1001	77	1	62	800	clear	sky is clear	01n
2	2015-01-01 02:00:00+01:00	Valencia	269.686	269.686	269.686	1002	78	0	23	800	clear	sky is clear	01n
3	2015-01-01 03:00:00+01:00	Valencia	269.686	269.686	269.686	1002	78	0	23	800	clear	sky is clear	01n
4	2015-01-01 04:00:00+01:00	Valencia	269.686	269.686	269.686	1002	78	0	23	800	clear	sky is clear	01n

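Drop some uninformative columns
- weather_icon: weather icon code, used for display only
- weather_description: textual weather description
- weather_main: main weather category

from sklearn.preprocessing import LabelEncoder

df_temp=df_weather.copy(deep=True)
labels=["weather_id","weather_main","weather_description","weather_icon"]
for col in labels:
    df_temp[col]=LabelEncoder().fit_transform(df_weather[col])  # encode the categorical variables (strings or numeric codes) as integers

Compare the correlations between the weather features and drop redundant information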

feat_corr(df_temp)


# Inspect the distinct weather_id values (the original codes in df_weather, not the label-encoded ones in df_temp)
df_weather["weather_id"].unique()

array([800, 801, 802, 803, 804, 500, 501, 502, 701, 522, 521, 503, 202,
       200, 201, 211, 520, 300, 741, 301, 711, 302, 721, 310, 600, 616,
       615, 601, 210, 602, 611, 311, 612, 620, 531, 731, 761, 771],
      dtype=int64)

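# Drop some features: weather_id, weather_main, weather_description and weather_icon all describe the weather, which is already captured by the other columns
# temp_min and temp_max are dropped because they are redundant with temp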
col_drop_name=[
    "weather_id",
    "weather_main",
    "weather_description",
    "weather_icon",
    "temp_min",
    "temp_max"
]

df_weather.drop(col_drop_name,axis=1,inplace=True)

# Check every feature for invalid values (NULL) and count the duplicate rows
check_Nas_Dups(df_weather)

Null count per column:
dt_iso        0
city_name     0
temp          0
pressure      0
humidity      0
wind_speed    0
wind_deg      0
rain_1h       0
rain_3h       0
snow_3h       0
clouds_all    0
dtype: int64
--------------------------------------------------
Number of duplicate rows:
3076

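# Intended to drop duplicate rows (drop_duplicates() keeps the first occurrence). Note that reset_index() runs first and adds a unique index column, so no two rows are fully identical any more and nothing is removed here (the row count stays at 178396); the duplicates are removed later by time and city_name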
df_weather=df_weather.reset_index().drop_duplicates()
df_weather

	index	dt_iso	city_name	temp	pressure	humidity	wind_speed	wind_deg	rain_1h	rain_3h	snow_3h	clouds_all
0	0	2015-01-01 00:00:00+01:00	Valencia	270.475	1001	77	1	62	0.0	0.0	0.0	0
1	1	2015-01-01 01:00:00+01:00	Valencia	270.475	1001	77	1	62	0.0	0.0	0.0	0
2	2	2015-01-01 02:00:00+01:00	Valencia	269.686	1002	78	0	23	0.0	0.0	0.0	0
3	3	2015-01-01 03:00:00+01:00	Valencia	269.686	1002	78	0	23	0.0	0.0	0.0	0
4	4	2015-01-01 04:00:00+01:00	Valencia	269.686	1002	78	0	23	0.0	0.0	0.0	0
...	...	...	...	...	...	...	...	...	...	...	...	...
178391	178391	2018-12-31 19:00:00+01:00	Seville	287.760	1028	54	3	30	0.0	0.0	0.0	0
178392	178392	2018-12-31 20:00:00+01:00	Seville	285.760	1029	62	3	30	0.0	0.0	0.0	0
178393	178393	2018-12-31 21:00:00+01:00	Seville	285.150	1028	58	4	50	0.0	0.0	0.0	0
178394	178394	2018-12-31 22:00:00+01:00	Seville	284.150	1029	57	4	60	0.0	0.0	0.0	0
178395	178395	2018-12-31 23:00:00+01:00	Seville	283.970	1029	70	3	50	0.0	0.0	0.0	0

178396 rows × 12 columns
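
The check above reported 3076 duplicate rows, yet the table still has 178396 rows. A small sketch on toy data (not the real weather file) suggests why the order of the two calls matters: once `reset_index()` has added a unique `index` column, no row is an exact copy of another.

```python
import pandas as pd

# Toy frame with one exact duplicate row
demo = pd.DataFrame({
    "city_name": ["Madrid", "Madrid", "Bilbao"],
    "temp": [280.0, 280.0, 285.0],
})

# Order used in the article: the new unique "index" column hides the duplicate
n_reset_first = len(demo.reset_index().drop_duplicates())

# Deduplicate first, then renumber the rows without keeping the old index
n_dedup_first = len(demo.drop_duplicates().reset_index(drop=True))

print(n_reset_first, n_dedup_first)
```

In this walkthrough the duplicates are still removed in a later step that deduplicates on time and city_name, so the final result is unaffected.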

# Build the time index
df_weather["time"]=pd.to_datetime(df_weather["dt_iso"]) # convert the time feature to pandas datetime
df_weather.drop(["dt_iso"],axis=1,inplace=True)
df_weather=df_weather.set_index('time') # use time as the index
df_weather.drop(["index"],axis=1,inplace=True)   # drop the extra index column created by the earlier reset_index()
df_weather

	city_name	temp	pressure	humidity	wind_speed	wind_deg	rain_1h	rain_3h	snow_3h	clouds_all
time
2015-01-01 00:00:00+01:00	Valencia	270.475	1001	77	1	62	0.0	0.0	0.0	0
2015-01-01 01:00:00+01:00	Valencia	270.475	1001	77	1	62	0.0	0.0	0.0	0
2015-01-01 02:00:00+01:00	Valencia	269.686	1002	78	0	23	0.0	0.0	0.0	0
2015-01-01 03:00:00+01:00	Valencia	269.686	1002	78	0	23	0.0	0.0	0.0	0
2015-01-01 04:00:00+01:00	Valencia	269.686	1002	78	0	23	0.0	0.0	0.0	0
...	...	...	...	...	...	...	...	...	...	...
2018-12-31 19:00:00+01:00	Seville	287.760	1028	54	3	30	0.0	0.0	0.0	0
2018-12-31 20:00:00+01:00	Seville	285.760	1029	62	3	30	0.0	0.0	0.0	0
2018-12-31 21:00:00+01:00	Seville	285.150	1028	58	4	50	0.0	0.0	0.0	0
2018-12-31 22:00:00+01:00	Seville	284.150	1029	57	4	60	0.0	0.0	0.0	0
2018-12-31 23:00:00+01:00	Seville	283.970	1029	70	3	50	0.0	0.0	0.0	0

178396 rows × 10 columns

Handling outliers

# Summary statistics of the weather features, rounded to two decimals
df_weather.describe().round(2)

	temp	pressure	humidity	wind_speed	wind_deg	rain_1h	rain_3h	snow_3h	clouds_all
count	178396.00	178396.00	178396.00	178396.00	178396.00	178396.00	178396.00	178396.00	178396.00
mean	289.62	1069.26	68.42	2.47	166.59	0.08	0.00	0.00	25.07
std	8.03	5969.63	21.90	2.10	116.61	0.40	0.01	0.22	30.77
min	262.24	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
25%	283.67	1013.00	53.00	1.00	55.00	0.00	0.00	0.00	0.00
50%	289.15	1018.00	72.00	2.00	177.00	0.00	0.00	0.00	20.00
75%	295.15	1022.00	87.00	4.00	270.00	0.00	0.00	0.00	40.00
max	315.60	1008371.00	100.00	133.00	360.00	12.00	2.32	21.50	100.00

# Create a 2x2 grid of plots for four weather features
fig,axes=plt.subplots(nrows=2,ncols=2,figsize=(12,8))   # create the figure and axes

columns_to_plot=["pressure","wind_speed","rain_1h","rain_3h"]   # the four features to plot

for i,ax in enumerate(axes.flat):
    if i < len(columns_to_plot):
        # Plot the time series
        ax.plot(
            df_weather.index,   # x-axis: time index
            df_weather[columns_to_plot[i]]  # y-axis: the feature
        )
        ax.set_title(columns_to_plot[i])
    else:
        ax.set_visible(False)   # hide the subplot if there is nothing to plot

plt.tight_layout()  # adjust the spacing between subplots

plt.show()


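# Create a 2x2 grid of box plots showing the distribution of the four weather features
# Box plot: shows the distribution of the data and helps to spot outliers. The line inside the box is the median.
# The top and bottom of the box are the upper quartile (Q3) and lower quartile (Q1), so the box contains 50% of the data
# and its height reflects how much the data varies.
# With matplotlib's defaults the whiskers extend to the most extreme points within 1.5 x IQR of the box;
# points beyond the whiskers are drawn individually as potential outliers.
fig,axes=plt.subplots(nrows=2,ncols=2,figsize=(12,8))

columns_to_plot=["pressure","wind_speed","rain_1h","rain_3h"]  # the four features

for i ,ax in enumerate(axes.flat):  # draw each box plot
    if i< len(columns_to_plot):
        ax.boxplot(x=df_weather[columns_to_plot[i]])
        ax.set_title(columns_to_plot[i])
    else:
        ax.set_visible(False)

plt.tight_layout()  # adjust the spacing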
plt.show()
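
As a complement to the box plots, the bounds that separate regular points from potential outliers can be computed by hand. This is a minimal sketch of the common 1.5 x IQR rule on made-up numbers. The helper name `iqr_fences` is an assumption for illustration.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Lower and upper bounds of the k * IQR rule (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy sample with one obviously extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
low, high = iqr_fences(sample)
outliers = sample[(sample < low) | (sample > high)]
print(low, high, outliers)
```

For heavily skewed features such as rainfall, this rule tends to flag many legitimate values, which may be one reason to prefer fixed physical thresholds for pressure and wind speed.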


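# Based on domain knowledge, set values beyond plausible thresholds to NaN; .loc selects rows by condition
df_weather.loc[df_weather["pressure"] > 1080 ,"pressure"]=np.nan  # pressure values above 1080 become NaN
df_weather.loc[df_weather["pressure"] < 870 ,"pressure"]=np.nan

df_weather.loc[df_weather["wind_speed"] > 113 ,"wind_speed"]=np.nan

# Fill all the masked outliers by linear interpolation
df_weather.interpolate(
    method="linear",  # linear interpolation
    limit_direction="forward",  # fill missing values in the forward direction
    inplace=True   # modify the DataFrame in place
)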
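
At this point df_weather still stacks all cities in one frame, so interpolating the whole frame could in principle blend the last rows of one city with the first rows of the next. A hedged alternative sketch is shown below: mask out-of-range values, then interpolate within each city separately. The helper `mask_outside` and the toy values are assumptions; the 870 to 1080 range is the one used above.

```python
import pandas as pd

def mask_outside(series, low, high):
    """Copy of `series` with values outside [low, high] replaced by NaN (hypothetical helper)."""
    return series.where(series.between(low, high))

# Toy data: one implausible pressure reading per city
demo = pd.DataFrame({
    "city_name": ["A", "A", "A", "B", "B", "B"],
    "pressure": [1010.0, 1008371.0, 1014.0, 1020.0, 0.0, 1024.0],
})

demo["pressure"] = mask_outside(demo["pressure"], 870, 1080)

# Interpolate inside each city so values never leak across city boundaries
demo["pressure"] = demo.groupby("city_name")["pressure"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="forward")
)
print(demo)
```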

# Check the box plots again
fig,axes=plt.subplots(nrows=2,ncols=2,figsize=(12,8))

columns_to_plot=["pressure","wind_speed","rain_1h","rain_3h"]  # the four features

for i ,ax in enumerate(axes.flat):  # draw each box plot
    if i< len(columns_to_plot):
        ax.boxplot(x=df_weather[columns_to_plot[i]])
        ax.set_title(columns_to_plot[i])
    else:
        ax.set_visible(False)

plt.tight_layout()  # adjust the spacing
plt.show()


# Drop the rain_3h column; the data is unreliable: the 3-hour rainfall is lower than the 1-hour rainfall
df_weather.drop(['rain_3h'],axis=1,inplace=True)

Handling the different number of observations per city

# Print the number of generation records and the number of weather records for each city
print(f"Number of samples in df_energy is {df_energy.shape[0]}")    # number of generation records

city_list=df_weather['city_name'].unique()  # city names
grouped_weather=df_weather.groupby("city_name")

for city in city_list:
    print(f"Number of samples in df_weather in {city} is {grouped_weather.get_group(city).shape[0]}")

Number of samples in df_energy is 35064
Number of samples in df_weather in Valencia is 35145
Number of samples in df_weather in Madrid is 36267
Number of samples in df_weather in Bilbao is 35951
Number of samples in df_weather in  Barcelona is 35476
Number of samples in df_weather in Seville is 35557

# Drop duplicate records based on time and city name
df_weather_cleanned=df_weather.reset_index().drop_duplicates(
    subset=["time","city_name"],  # rows count as duplicates when both of these match
    keep="first"   # keep the first occurrence
).set_index("time")
df_weather_cleanned

	city_name	temp	pressure	humidity	wind_speed	wind_deg	rain_1h	snow_3h	clouds_all
time
2015-01-01 00:00:00+01:00	Valencia	270.475	1001.0	77	1.0	62	0.0	0.0	0
2015-01-01 01:00:00+01:00	Valencia	270.475	1001.0	77	1.0	62	0.0	0.0	0
2015-01-01 02:00:00+01:00	Valencia	269.686	1002.0	78	0.0	23	0.0	0.0	0
2015-01-01 03:00:00+01:00	Valencia	269.686	1002.0	78	0.0	23	0.0	0.0	0
2015-01-01 04:00:00+01:00	Valencia	269.686	1002.0	78	0.0	23	0.0	0.0	0
...	...	...	...	...	...	...	...	...	...
2018-12-31 19:00:00+01:00	Seville	287.760	1028.0	54	3.0	30	0.0	0.0	0
2018-12-31 20:00:00+01:00	Seville	285.760	1029.0	62	3.0	30	0.0	0.0	0
2018-12-31 21:00:00+01:00	Seville	285.150	1028.0	58	4.0	50	0.0	0.0	0
2018-12-31 22:00:00+01:00	Seville	284.150	1029.0	57	4.0	60	0.0	0.0	0
2018-12-31 23:00:00+01:00	Seville	283.970	1029.0	70	3.0	50	0.0	0.0	0

175320 rows × 9 columns

# After cleaning, every city has the same number of records as the generation data
print(f"Number of samples in df_energy is {df_energy.shape[0]}")    # number of generation records

city_list=df_weather['city_name'].unique()  # city names
grouped_weather=df_weather_cleanned.groupby("city_name")

for city in city_list:
    print(f"Number of samples in df_weather in {city} is {grouped_weather.get_group(city).shape[0]}")

Number of samples in df_energy is 35064
Number of samples in df_weather in Valencia is 35064
Number of samples in df_weather in Madrid is 35064
Number of samples in df_weather in Bilbao is 35064
Number of samples in df_weather in  Barcelona is 35064
Number of samples in df_weather in Seville is 35064
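
Before dropping rows it can help to see how many repeated (time, city_name) records each city has. The sketch below does this on toy data. All names and values in it are illustrative assumptions.

```python
import pandas as pd

# Toy weather frame: Madrid has two records for the same hour
times = pd.to_datetime([
    "2015-01-01 00:00", "2015-01-01 00:00", "2015-01-01 01:00",
    "2015-01-01 00:00", "2015-01-01 01:00",
]).rename("time")
demo = pd.DataFrame(
    {
        "city_name": ["Madrid", "Madrid", "Madrid", "Bilbao", "Bilbao"],
        "temp": [280.0, 280.5, 281.0, 285.0, 286.0],
    },
    index=times,
)

flat = demo.reset_index()
# True for every repeated (time, city_name) pair after its first occurrence
dup_mask = flat.duplicated(subset=["time", "city_name"], keep="first")
dups_per_city = dup_mask.groupby(flat["city_name"]).sum()

cleaned = flat[~dup_mask].set_index("time")
print(dups_per_city)
print(cleaned)
```

Note that `keep="first"` silently discards the later reading even when its values differ, as with the 280.5 reading here. Averaging the repeated rows would be another option.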

# Extract the data of each city; returns a list of DataFrames
df_weather_all_cities=[grouped_weather.get_group(x) for x in grouped_weather.groups]
df_weather_all_cities[0]

	city_name	temp	pressure	humidity	wind_speed	wind_deg	rain_1h	snow_3h	clouds_all
time
2015-01-01 00:00:00+01:00	Barcelona	281.625	1035.0	100	7.0	58	0.0	0.0	0
2015-01-01 01:00:00+01:00	Barcelona	281.625	1035.0	100	7.0	58	0.0	0.0	0
2015-01-01 02:00:00+01:00	Barcelona	281.286	1036.0	100	7.0	48	0.0	0.0	0
2015-01-01 03:00:00+01:00	Barcelona	281.286	1036.0	100	7.0	48	0.0	0.0	0
2015-01-01 04:00:00+01:00	Barcelona	281.286	1036.0	100	7.0	48	0.0	0.0	0
...	...	...	...	...	...	...	...	...	...
2018-12-31 19:00:00+01:00	Barcelona	284.130	1027.0	71	1.0	250	0.0	0.0	0
2018-12-31 20:00:00+01:00	Barcelona	282.640	1027.0	62	3.0	270	0.0	0.0	0
2018-12-31 21:00:00+01:00	Barcelona	282.140	1028.0	53	4.0	300	0.0	0.0	0
2018-12-31 22:00:00+01:00	Barcelona	281.130	1028.0	50	5.0	320	0.0	0.0	0
2018-12-31 23:00:00+01:00	Barcelona	280.130	1028.0	100	5.0	310	0.0	0.0	0

35064 rows × 9 columns

Merging the data

df_weather_energy=df_energy

for df_city in df_weather_all_cities:
    city_name=df_city.iloc[0]["city_name"].replace(' ','')  # city name with spaces removed; iloc[0] returns the first row as labels and values (e.g. city_name: 'Beijing', temp: 20, humidity: 65)
    df_temp_city=df_city.add_suffix(f"_{city_name}") # append the city name to every column label of that city's data
    df_weather_energy=pd.concat([df_weather_energy,df_temp_city],axis=1)    # merge the data
    df_weather_energy=df_weather_energy.drop(f"city_name_{city_name}",axis=1)  # drop the redundant city name column

df_weather_energy.columns

Index(['generation biomass', 'generation fossil gas', 'generation fossil oil',
       'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore', 'total load forecast',
       'total load actual', 'price day ahead', 'price actual',
       'generation fossil total', 'temp_Barcelona', 'pressure_Barcelona',
       'humidity_Barcelona', 'wind_speed_Barcelona', 'wind_deg_Barcelona',
       'rain_1h_Barcelona', 'snow_3h_Barcelona', 'clouds_all_Barcelona',
       'temp_Bilbao', 'pressure_Bilbao', 'humidity_Bilbao',
       'wind_speed_Bilbao', 'wind_deg_Bilbao', 'rain_1h_Bilbao',
       'snow_3h_Bilbao', 'clouds_all_Bilbao', 'temp_Madrid', 'pressure_Madrid',
       'humidity_Madrid', 'wind_speed_Madrid', 'wind_deg_Madrid',
       'rain_1h_Madrid', 'snow_3h_Madrid', 'clouds_all_Madrid', 'temp_Seville',
       'pressure_Seville', 'humidity_Seville', 'wind_speed_Seville',
       'wind_deg_Seville', 'rain_1h_Seville', 'snow_3h_Seville',
       'clouds_all_Seville', 'temp_Valencia', 'pressure_Valencia',
       'humidity_Valencia', 'wind_speed_Valencia', 'wind_deg_Valencia',
       'rain_1h_Valencia', 'snow_3h_Valencia', 'clouds_all_Valencia'],
      dtype='object')
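
As a rough illustration of what the merge loop does, here is a minimal sketch on made-up data. It is a variant, not the article's code: it uses `DataFrame.join` instead of `pd.concat`, drops `city_name` before adding the suffix, and starts from a copy of the energy frame. All names and values are hypothetical.

```python
import pandas as pd

idx = pd.date_range("2015-01-01", periods=3, freq="h", tz="UTC", name="time")
energy = pd.DataFrame({"price actual": [65.41, 64.92, 64.48]}, index=idx)
weather = pd.DataFrame(
    {
        "city_name": ["Madrid"] * 3 + [" Barcelona"] * 3,  # note the leading space
        "temp": [270.0, 270.5, 271.0, 281.6, 281.6, 281.3],
        "humidity": [77, 77, 78, 100, 100, 100],
    },
    index=idx.append(idx),
)

merged = energy.copy()  # work on a copy so the energy frame itself is untouched
for city, df_city in weather.groupby("city_name"):
    suffix = city.replace(" ", "")
    # Drop the city column first, then tag every remaining column with the city
    df_city = df_city.drop(columns="city_name").add_suffix(f"_{suffix}")
    # join aligns on the time index; "left" keeps exactly the energy timestamps
    merged = merged.join(df_city, how="left")

print(merged.columns.tolist())
```

Using a left join makes the row count of the result depend only on the energy frame, which is one way to guard against stray weather timestamps.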

# Check the number of missing values and duplicate rows
check_Nas_Dups(df_weather_energy)

Number of null values in each column:
generation biomass                             0
generation fossil gas                          0
generation fossil oil                          0
generation hydro pumped storage consumption    0
generation hydro run-of-river and poundage     0
generation hydro water reservoir               0
generation nuclear                             0
generation other                               0
generation other renewable                     0
generation solar                               0
generation waste                               0
generation wind onshore                        0
total load forecast                            0
total load actual                              0
price day ahead                                0
price actual                                   0
generation fossil total                        0
temp_Barcelona                                 0
pressure_Barcelona                             0
humidity_Barcelona                             0
wind_speed_Barcelona                           0
wind_deg_Barcelona                             0
rain_1h_Barcelona                              0
snow_3h_Barcelona                              0
clouds_all_Barcelona                           0
temp_Bilbao                                    0
pressure_Bilbao                                0
humidity_Bilbao                                0
wind_speed_Bilbao                              0
wind_deg_Bilbao                                0
rain_1h_Bilbao                                 0
snow_3h_Bilbao                                 0
clouds_all_Bilbao                              0
temp_Madrid                                    0
pressure_Madrid                                0
humidity_Madrid                                0
wind_speed_Madrid                              0
wind_deg_Madrid                                0
rain_1h_Madrid                                 0
snow_3h_Madrid                                 0
clouds_all_Madrid                              0
temp_Seville                                   0
pressure_Seville                               0
humidity_Seville                               0
wind_speed_Seville                             0
wind_deg_Seville                               0
rain_1h_Seville                                0
snow_3h_Seville                                0
clouds_all_Seville                             0
temp_Valencia                                  0
pressure_Valencia                              0
humidity_Valencia                              0
wind_speed_Valencia                            0
wind_deg_Valencia                              0
rain_1h_Valencia                               0
snow_3h_Valencia                               0
clouds_all_Valencia                            0
dtype: int64
--------------------------------------------------
Number of duplicate rows:
0

Adding time features

# Extract hour, weekday, month, and year features from the time index
df_weather_energy["hour"]=df_weather_energy.index.map(lambda x : x.hour)
df_weather_energy["weekday"]=df_weather_energy.index.map(lambda x : x.weekday())
df_weather_energy["month"]=df_weather_energy.index.map(lambda x : x.month)
df_weather_energy["year"]=df_weather_energy.index.map(lambda x : x.year)

# Inspect the feature (column) names
df_weather_energy.columns

Index(['generation biomass', 'generation fossil gas', 'generation fossil oil',
       'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore', 'total load forecast',
       'total load actual', 'price day ahead', 'price actual',
       'generation fossil total', 'temp_Barcelona', 'pressure_Barcelona',
       'humidity_Barcelona', 'wind_speed_Barcelona', 'wind_deg_Barcelona',
       'rain_1h_Barcelona', 'snow_3h_Barcelona', 'clouds_all_Barcelona',
       'temp_Bilbao', 'pressure_Bilbao', 'humidity_Bilbao',
       'wind_speed_Bilbao', 'wind_deg_Bilbao', 'rain_1h_Bilbao',
       'snow_3h_Bilbao', 'clouds_all_Bilbao', 'temp_Madrid', 'pressure_Madrid',
       'humidity_Madrid', 'wind_speed_Madrid', 'wind_deg_Madrid',
       'rain_1h_Madrid', 'snow_3h_Madrid', 'clouds_all_Madrid', 'temp_Seville',
       'pressure_Seville', 'humidity_Seville', 'wind_speed_Seville',
       'wind_deg_Seville', 'rain_1h_Seville', 'snow_3h_Seville',
       'clouds_all_Seville', 'temp_Valencia', 'pressure_Valencia',
       'humidity_Valencia', 'wind_speed_Valencia', 'wind_deg_Valencia',
       'rain_1h_Valencia', 'snow_3h_Valencia', 'clouds_all_Valencia', 'hour',
       'weekday', 'month', 'year'],
      dtype='object')
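
When the index is a timezone-aware `DatetimeIndex`, the same four features can be read directly from its attributes instead of mapping a lambda over every timestamp. The sketch below assumes the timestamps are parsed with `utc=True` and then converted to `Europe/Madrid`; both the sample strings and that timezone choice are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical timestamps with mixed UTC offsets (winter and summer time)
raw = ["2015-01-01 00:00:00+01:00", "2015-06-01 12:00:00+02:00"]

# Parse to UTC first, then convert to local time so that "hour" means local hour
idx = pd.to_datetime(raw, utc=True).tz_convert("Europe/Madrid")
df = pd.DataFrame({"price actual": [65.41, 50.0]}, index=idx)

df["hour"] = df.index.hour
df["weekday"] = df.index.weekday  # Monday is 0
df["month"] = df.index.month
df["year"] = df.index.year

print(df)
```

Whether local or UTC hours are the better feature depends on the model; local time tends to line up with daily demand patterns.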

Data visualization

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig=make_subplots() # create the figure

fig.add_trace(
    go.Scatter(
        x=df_weather_energy.index,  # x-axis: timestamps
        y=df_weather_energy["price actual"], # y-axis: actual price
        name="price actual"
    )
)

fig.add_trace(
    go.Scatter(
        x=df_weather_energy.index,
        y=df_weather_energy["price actual"].rolling(window=24).mean(),  # mean of the past 24 hours; smoother than the raw series
        name="rolling window = daily avg"
    )
)

fig.add_trace(
    go.Scatter(
        x=df_weather_energy.index,
        y=df_weather_energy["price actual"].rolling(window=24*7).mean(),  # mean of the past 7 days; shows the longer-term trend
        name="rolling window = weekly avg"
    )
)

fig.show()
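
To make the rolling averages in the plot concrete, here is a tiny sketch with made-up numbers. It shows that `rolling(window=n).mean()` is a trailing average and that the first `n-1` values are `NaN` unless `min_periods` is lowered.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical prices

# Trailing mean over the current value and the two before it
trailing = s.rolling(window=3).mean()

# Same window, but allow shorter windows at the start instead of NaN
partial = s.rolling(window=3, min_periods=1).mean()

print(trailing.tolist())
print(partial.tolist())
```

In the plot above this means the daily curve starts 23 hours after the raw series and the weekly curve 167 hours after it.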

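Show how the electricity price varies by month and by day of the week
- helps reveal seasonal changes in the price
- shows whether the price fluctuates periodically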

sns.set(style='darkgrid')   # set the darkgrid style before creating the axes so that it applies to them

fig,axes=plt.subplots(
    ncols=2,    # two plots side by side
    figsize=(14,6)  # figure size
)

sns.barplot(
    x="month",
    y="price actual",    # y-axis
    data=df_weather_energy,  # data source
    estimator=sum,  # bar height is the sum of the prices within each month
    color='royalblue',  # bar color
    ax=axes[0]  # left panel
)
axes[0].set_title("Monthly actual price (1 is Jan)")  # title

sns.barplot(
    x="weekday",
    y="price actual",    # y-axis
    data=df_weather_energy,  # data source
    estimator=sum,  # bar height is the sum of the prices within each weekday
    color='royalblue',  # bar color
    ax=axes[1]  # right panel
)
axes[1].set_title("Actual price by weekday (0 is Monday)")  # title

plt.show()  # show the figure
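
One thing to keep in mind when reading bar charts built with `estimator=sum`: the bar height depends on how many rows fall into each group as well as on the price level. The sketch below, on made-up numbers, contrasts the sum with the mean for the same grouping.

```python
import pandas as pd

# Hypothetical data: month 1 has more rows but a lower price than month 2
df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2],
    "price actual": [60.0, 60.0, 60.0, 70.0, 70.0],
})

by_month = df.groupby("month")["price actual"].agg(["sum", "mean", "count"])
print(by_month)
```

Here month 1 has the larger sum even though its average price is lower, so a mean-based chart may be the safer choice when groups differ in size (months with 28 versus 31 days, for example).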

Plot two histograms that compare actual values with forecasts
- day-ahead price forecast vs. actual price
- load forecast vs. actual load

# Histogram of the actual price vs. the forecast price
plt.figure(figsize=(14,6))  # figure size

gr=sns.histplot(    # histogram of the forecast price
    df_weather_energy["price day ahead"],   # day-ahead (forecast) price
    bins=100,  # number of bins
    label="TSO Prediction",
    element="step",  # step-style histogram
    color="lightcoral",
    kde=True  # overlay a kernel density estimate
)

gr=sns.histplot(    # histogram of the actual price
    df_weather_energy["price actual"],   # actual price
    bins=100,  # number of bins
    label="Actual Price",
    element="step",  # step-style histogram
    color="royalblue",
    kde=True  # overlay a kernel density estimate
)

gr.set(xlabel="Price of Electricity (€/MWh)" , ylabel="Frequency")  # axis labels
plt.legend()  # show the legend
plt.show()  # show the figure

# Load forecast vs. actual load
plt.figure(figsize=(14,6))  # figure size

gr=sns.histplot(    # histogram of the actual load
    df_weather_energy["total load actual"],   # actual load
    bins=100,  # number of bins
    label="Load actual",
    element="step",  # step-style histogram
    color="dimgrey",
    kde=True  # overlay a kernel density estimate
)

gr=sns.histplot(    # histogram of the forecast load
    df_weather_energy["total load forecast"],   # forecast load
    bins=100,  # number of bins
    label="Load forecast",
    element="step",  # step-style histogram
    color="lightyellow",
    kde=True  # overlay a kernel density estimate
)

gr.set(xlabel="Load (MWh)" , ylabel="Frequency")  # axis labels
plt.legend()  # show the legend
plt.show()  # show the figure

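Compute the mean absolute error (MAE) of the TSO's price forecast
- first normalize the actual price and the forecast price
- then compute the MAE on the normalized values
Why is the scaler fitted only on the training data but then applied to all of the data?
- The validation and test data are treated as unknown, so only the known training set can be used to determine the scaling parameters (the fit).
- The model's predictions are also on the normalized scale, with the same parameters as the training set, so all of the data must be transformed with the parameters fitted on the training data.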
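
The fit-on-train, transform-everything pattern described above can be sketched on a few made-up prices. The values and the cutoff are hypothetical; the point is only that data outside the training range lands outside [0, 1], and that `inverse_transform` maps values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

prices = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])  # hypothetical prices
train_cutoff = 3  # the first three rows play the role of the training set

scaler = MinMaxScaler()
scaler.fit(prices[:train_cutoff])      # learns min=10 and max=30 from the training rows only
scaled = scaler.transform(prices)      # applies those same parameters to every row

# Predictions made on the normalized scale can be mapped back the same way
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```

The last two rows scale to values above 1, which is expected: the scaler never saw them, just as it would not see future prices.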

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error

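y_scaler_actual=MinMaxScaler()  # MinMaxScaler for the actual price; maps the fitted range to [0,1]
y_scaler_dayahead=MinMaxScaler()  # MinMaxScaler for the day-ahead (forecast) price; maps the fitted range to [0,1]

# Split points for the dataset
train_cutoff=int(0.8*df_weather_energy.shape[0])  # the first 80% of the rows is training data
val_cutoff=int(0.9*df_weather_energy.shape[0])   # end of the validation data (80% to 90%)

# Prepare the data
y_price_actual=df_weather_energy[["price actual"]]  # all actual prices
y_price_dayahead=df_weather_energy[["price day ahead"]]  # all day-ahead prices

# Normalize the actual price
y_scaler_actual.fit(y_price_actual[:train_cutoff])   # fit the scaler on the training rows only
actual_norm=y_scaler_actual.transform(y_price_actual)   # transform all actual prices

# Normalize the forecast price
y_scaler_dayahead.fit(y_price_dayahead[:train_cutoff])   # fit the scaler on the training rows only
dayahead_norm=y_scaler_dayahead.transform(y_price_dayahead)   # transform all forecast prices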

# Print the mean absolute error, rounded to three decimal places
print(f"mean absolute error for normalized actual price and TSO prediction is : {round(mean_absolute_error(actual_norm,dayahead_norm),3)}")

mean absolute error for normalized actual price and TSO prediction is : 0.071
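
A caveat on this number: the actual and day-ahead prices are scaled by two separately fitted scalers, so the normalized MAE depends on both fitted ranges. The sketch below, on made-up prices, compares three variants: MAE in original units, MAE on one shared scale, and MAE with separate scalers. The helper `minmax` is hypothetical and stands in for a `MinMaxScaler` fitted on the training rows.

```python
import pandas as pd

actual = pd.Series([60.0, 65.0, 70.0, 80.0])     # hypothetical actual prices
day_ahead = pd.Series([50.0, 55.0, 60.0, 70.0])  # hypothetical forecast, always 10 too low
train_cutoff = 3

# MAE in the original units (EUR/MWh)
mae_eur = (actual - day_ahead).abs().mean()

# MAE on one shared scale: divide by the training range of the actual price
train_range = actual[:train_cutoff].max() - actual[:train_cutoff].min()
mae_shared = mae_eur / train_range

def minmax(series, cutoff):
    """Min-max scale a series with parameters taken from its first `cutoff` rows."""
    lo, hi = series[:cutoff].min(), series[:cutoff].max()
    return (series - lo) / (hi - lo)

# MAE when each series gets its own scaler
mae_separate = (minmax(actual, train_cutoff) - minmax(day_ahead, train_cutoff)).abs().mean()

print(mae_eur, mae_shared, mae_separate)
```

In this toy case the constant 10 EUR/MWh offset disappears entirely under separate scaling. That does not by itself invalidate the 0.071 above, but reporting the MAE in original units, or with one shared scaler, may be easier to interpret.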

# Drop the total load forecast feature; the author treats the total load as unknown at prediction time
df_weather_energy.drop("total load forecast",axis=1,inplace=True)

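Decompose the price series into trend, seasonal, and residual components and plot them

How are these components obtained?

observed = trend + seasonality + residual
The trend shows the long-term behavior of the data
The seasonal component shows the periodic patterns
The residual shows the random fluctuations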
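
As for where the components come from: a classical additive decomposition can be approximated by hand in three steps, and `seasonal_decompose` follows a broadly similar moving-average procedure. The sketch below is a simplified version for an odd period on synthetic data; the series, the pattern, and all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

period = 5
n = 40
t = np.arange(n)
pattern = np.array([2.0, -1.0, 0.0, 1.0, -2.0])    # repeating cycle, sums to zero
observed = pd.Series(0.5 * t + pattern[t % period])  # linear trend + seasonality, no noise

# 1) Trend: centered moving average over one full period
trend = observed.rolling(window=period, center=True).mean()

# 2) Seasonality: average detrended value at each position in the cycle,
#    shifted so that the seasonal effects sum to zero over one period
detrended = observed - trend
position = pd.Series(t % period, index=observed.index)
period_means = detrended.groupby(position).mean()
period_means -= period_means.mean()
seasonal = position.map(period_means)

# 3) Residual: whatever the trend and the seasonality do not explain
residual = observed - trend - seasonal

print(seasonal.head(period).tolist())
print(residual.dropna().abs().max())
```

Because the synthetic series contains no noise, the recovered seasonal component matches `pattern` and the residual is essentially zero. The trend is `NaN` for the first and last two points, where the centered window is incomplete.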

from statsmodels.tsa.seasonal import seasonal_decompose

fig,axes=plt.subplots(5,1,figsize=(20,12))  # five plots stacked in one column

decom_data = df_weather_energy[['price actual']].copy()

decompose_result=seasonal_decompose(   # decompose the time series
    decom_data,
    period=5,   # seasonal period of 5 observations (5 hours for this hourly data)
    model='additive'  # additive model
)

# Get the decomposition results
observed=decompose_result.observed  # original observations
trend=decompose_result.trend  # trend component
seasonal=decompose_result.seasonal # seasonal component
residual=decompose_result.resid   # residual component

# Plot the components
axes[0].plot(observed,color='dimgrey')  # original data
axes[0].set_title('Observed')

axes[1].plot(trend,color='royalblue')  # trend
axes[1].set_title('Trend')

axes[2].plot(seasonal,color='lightcoral')  # seasonality
axes[2].set_title('Seasonality')

axes[3].plot(seasonal[:100],color='lightcoral')  # zoom in on the seasonality: first 100 points only
axes[3].set_title('Zoomed Seasonality')

axes[4].plot(residual,color='lightcoral')  # residual
axes[4].set_title('Residual')

fig.tight_layout()
plt.show()

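Three important concepts for time series

Augmented Dickey-Fuller test (ADF test): tests whether a time series is stationary
Stationarity: the statistical properties of the series (such as its mean and variance) do not change over time
ARMA model: a statistical model for stationary time series
(My personal understanding: the more negative the ADF statistic, the stronger the evidence that the data is stationary, in which case a simple model such as ARMA is a reasonable fit)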

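# Run the ADF test to check whether the time series is stationary
from statsmodels.tsa.stattools import adfuller

result=adfuller(df_weather_energy['price actual'])  # pass a 1-D Series
print('ADF Statistic',result[0])    # the more negative this value, the stronger the evidence of stationarity
print('P-value',result[1])   # p < 0.05 rejects the unit-root null hypothesis, suggesting the series is stationary
print('Critical Values',result[4])  # an ADF statistic below these critical values suggests the series is stationary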

ADF Statistic -9.147016232851161
P-value 2.7504934849347068e-15
Critical Values {'1%': -3.4305367814665044, '5%': -2.8616225527935106, '10%': -2.566813940257257}
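
Reading the output: the statistic (about -9.15) is far below all three critical values and the p-value is far below 0.05, so the unit-root null hypothesis is rejected and the price series can be treated as stationary. The small helper below, `interpret_adf`, is a hypothetical convenience function (not part of statsmodels) that applies those two rules to the numbers printed above.

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Summarize an ADF result: reject the unit root if p < alpha, and list
    the significance levels whose critical value the statistic falls below."""
    below = [level for level, crit in critical_values.items() if statistic < crit]
    return {"reject_unit_root": p_value < alpha, "below_critical_values": below}

# Values copied from the output above
summary = interpret_adf(
    statistic=-9.147016232851161,
    p_value=2.7504934849347068e-15,
    critical_values={"1%": -3.4305367814665044,
                     "5%": -2.8616225527935106,
                     "10%": -2.566813940257257},
)
print(summary)
```

Rejecting the unit root says nothing about seasonality, so the daily and weekly cycles seen earlier still need to be modeled separately.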