Time Series Forecasting and Multivariate Time Series Analysis
1. Univariate Time Series Forecasting
1.1 Model Compilation and Training
First, we compile the model with the `compile` method:
rnn_model.compile(optimizer = 'adam', loss = 'mae')
Then, we train the model by calling the `fit` method:
EVALUATION_INTERVAL = 200
EPOCHS = 10
rnn_model.fit(train_data, epochs = EPOCHS,
steps_per_epoch = EVALUATION_INTERVAL,
validation_data = test_data,
validation_steps = 50)
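Because the `tf.data` pipeline is built with `repeat()` (see the complete code in section 1.6), the training dataset loops indefinitely and `steps_per_epoch` is what bounds an epoch. A quick check of how many training windows one run consumes, using the values from this section:

```python
batch_size = 256          # batch size used when building train_data
steps_per_epoch = 200     # EVALUATION_INTERVAL
epochs = 10               # EPOCHS

windows_per_epoch = batch_size * steps_per_epoch
print(windows_per_epoch)           # windows consumed per epoch: 51200
print(windows_per_epoch * epochs)  # windows consumed over the whole run: 512000
```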
1.2 Model Evaluation
To evaluate the model's performance, we call the `predict` method on the test data:
rnn_predictions = rnn_model.predict(X_test)
Next, we use `r2_score` from `sklearn.metrics` to check the performance score:
from sklearn.metrics import r2_score
rnn_score = r2_score(y_test,rnn_predictions)
print("R2 Score of RNN model = "+"{:.4f}".format(rnn_score))
Here, $R^2$ is the coefficient of determination, a regression score whose best possible value is 1.0; a score of 0.0 indicates a constant model that always predicts the expected value of $y$, ignoring the input features. In our case the $R^2$ value is close to 1.0, indicating that the model trained well.
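Concretely, `r2_score` computes $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. A minimal pure-Python sketch of the same computation, on toy numbers rather than the model's actual predictions:

```python
def r2(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(round(r2([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]), 4))  # 0.9486
print(r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0: a constant mean predictor
```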
1.3 Plotting Actual vs. Predicted Values
We plot the actual and predicted values over the entire test set with the following code:
#@title Data Range
a = 0 #@param {type:"slider", min:0, max:12000, step:1}
b = 12000 #@param {type:"slider", min:0, max:12000, step:1}
def plot_predictions(test, predicted, title):
    plt.figure(figsize = (16,4))
    plt.plot(test[a:b], color = 'blue', label = 'Normalized power consumption')
    plt.plot(predicted[a:b], alpha = 0.7, color = 'orange', label = 'Predicted power consumption')
    plt.title(title)
    plt.xlabel('Time')
    plt.ylabel('Normalized power consumption')
    plt.legend()
    plt.show()
plot_predictions(y_test, rnn_predictions, "Predictions made by simple RNN model")
As the plot shows, the predictions track the actual values closely, indicating that the RNN model performs well at forecasting energy consumption.
1.4 Forecasting the Next Data Point
We take the last data point from the test data and apply the prediction function to it:
X = X_test[-1:]
rnn_predictions1 = rnn_model.predict(X)
You can inspect the predicted value by printing it in the console:
rnn_predictions1
The output is:
array([[0.798944]], dtype=float32)
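Note that 0.798944 is on the normalized scale. To express a forecast in megawatts, you would undo the MinMax scaling, e.g. via `scaler.inverse_transform(rnn_predictions1)`. A minimal sketch of what that inversion computes (the min/max below are made-up values, not the dataset's actual range):

```python
# MinMaxScaler maps x to (x - min) / (max - min); the inverse undoes that.
data_min, data_max = 5000.0, 15000.0  # hypothetical MW range of the series

def inverse_minmax(x_scaled, lo, hi):
    return x_scaled * (hi - lo) + lo

print(inverse_minmax(0.798944, data_min, data_max))  # ≈ 12989.44 MW
```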
To visualize the result, we plot the last 40 data points along with the predicted value:
history_data = list(y_test[-40:])
plottingvalues = list(history_data)+list(rnn_predictions1)
plt.figure(figsize = (16,4))
plt.plot(plottingvalues, color = 'orange', label = 'forecasted value',marker = 'o')
plt.plot(y_test[-40:], color = 'green', label = 'history',marker = 'x')
plt.xlabel('Time')
plt.ylabel('Normalized power consumption scale')
plt.legend()
plt.show()
1.5 Forecasting a Range of Data Points
To forecast the next 25 data points, we use the following steps:
1. Extract the last 40 data points from the test set:
history_data = list(y_test[-40:])
2. Write a function to build the new input dataset:
def make_data(X, rnn_predictions1):
    val = list(X[0][1:]) + list(rnn_predictions1)
    X_new = []
    X_new.append(list(val))
    X_new = np.array(X_new)
    return X_new
3. Create a list variable to store all the predicted values:
forecast = list()
4. Extract the last test data point:
X = X_test[-1:]
5. In a loop, build the input data, make a prediction, and append the predicted value to the forecast list:
for i in range(25):
    X = make_data(X, rnn_predictions1)
    rnn_predictions1 = rnn_model.predict(X)
    forecast += list(rnn_predictions1)
6. Plot the history together with the 25 forecast values:
plottingvalues = list(history_data)+list(forecast)
plt.figure(figsize=(16,4))
plt.plot(plottingvalues, color = 'orange', label = 'forecasted value',marker = 'o')
plt.plot(y_test[-40:], color = 'green', label = 'history',marker = 'x')
plt.xlabel('Time (ticks)')
plt.ylabel('Normalized power consumption scale')
plt.legend()
plt.show()
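The loop above performs a rolling (recursive) forecast: each new prediction is appended to the input window, the oldest value is dropped, and the model is called again on the shifted window. A minimal pure-Python sketch of the same idea, with a stub function standing in for `rnn_model.predict`:

```python
def stub_predict(window):
    # stand-in for rnn_model.predict: forecasts the mean of the window
    return sum(window) / len(window)

def rolling_forecast(window, steps, predict):
    window = list(window)
    forecast = []
    for _ in range(steps):
        y_hat = predict(window)
        forecast.append(y_hat)
        window = window[1:] + [y_hat]  # drop the oldest value, append the forecast
    return forecast

# the first forecast is the mean of the initial window, 2.0;
# later steps are conditioned on earlier forecasts
print(rolling_forecast([1.0, 2.0, 3.0], steps=3, predict=stub_predict))
```

Note that errors compound in this scheme: each later prediction is conditioned on earlier predictions rather than on observed values, so the forecast tends to degrade as the horizon grows.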
1.6 Complete Code
The complete univariate time series forecasting code follows:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.preprocessing
from sklearn.metrics import r2_score
url = 'https://raw.githubusercontent.com/Apress/artificial-neural-networks-with-tensorflow-2/main/ch11/DOM_hourly.csv'
df = pd.read_csv(url)
# uncomment the following line for generating daily data
#df = df[df['Datetime'].str.contains("00:00:00")]
df['Datetime'] = pd.to_datetime(df.Datetime , format = '%Y-%m-%d %H:%M:%S')
df.index = df.Datetime
df.drop(['Datetime'], axis = 1,inplace = True)
df.head()
df
#checking missing data
df.isna().sum()
#@title Date Range
a = '2005-12-31' #@param {type:"date"}
b = '2018-01-31' #@param {type:"date"}
a = a + " 00:00:00"
b = b + " 00:00:00"
df.loc[a:b].plot(figsize = (16,4),legend = True)
plt.title('DOM hourly power consumption data')
plt.ylabel('Power consumption (MW)')
plt.show()
df.shape
scaler = sklearn.preprocessing.MinMaxScaler()
df['DOM_MW'] = scaler.fit_transform(df['DOM_MW'].values.reshape(-1,1))
df.plot(figsize = (16,4), legend = True)
plt.title('DOM hourly power consumption data – AFTER NORMALIZATION')
plt.ylabel('Normalized power consumption')
plt.show()
def load_data(stock, seq_len):
    X_train = []
    y_train = []
    for i in range(seq_len, len(stock)):
        X_train.append(stock.iloc[i-seq_len : i, 0])
        y_train.append(stock.iloc[i, 0])
    X_test = X_train[int(0.9*(len(stock))):]
    y_test = y_train[int(0.9*(len(stock))):]
    X_train = X_train[:int(0.9*(len(stock)))]
    y_train = y_train[:int(0.9*(len(stock)))]
    # convert to numpy arrays
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    # reshape data for input into RNN models
    X_train = np.reshape(X_train, (X_train.shape[0], seq_len, 1))
    X_test = np.reshape(X_test, (X_test.shape[0], seq_len, 1))
    return [X_train, y_train, X_test, y_test]
#create train, test data
seq_len = 20 #choose sequence length
X_train, y_train, X_test, y_test = load_data(df, seq_len)
print('X_train.shape = ',X_train.shape)
print('y_train.shape = ', y_train.shape)
print('X_test.shape = ', X_test.shape)
print('y_test.shape = ',y_test.shape)
batch_size = 256
buffer_size = 1000
train_data = tf.data.Dataset.from_tensor_slices((X_train , y_train))
train_data = train_data.cache().shuffle(buffer_size).batch(batch_size).repeat()
test_data = tf.data.Dataset.from_tensor_slices((X_test , y_test))
test_data = test_data.batch(batch_size).repeat()
rnn_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(8, input_shape = X_train.shape[-2:]),
    tf.keras.layers.Dense(1)
])
tf.keras.utils.plot_model(rnn_model)
rnn_model.compile(optimizer = 'adam', loss = 'mae')
EVALUATION_INTERVAL = 200
EPOCHS = 10
rnn_model.fit(train_data, epochs=EPOCHS,
steps_per_epoch = EVALUATION_INTERVAL,
validation_data = test_data,
validation_steps = 50)
rnn_predictions = rnn_model.predict(X_test)
rnn_score = r2_score(y_test,rnn_predictions)
print("R2 Score of RNN model = "+"{:.4f}".format(rnn_score))
#@title Data Range
a = 0 #@param {type:"slider", min:0, max:12000, step:1}
b = 12000 #@param {type:"slider", min:0, max:12000, step:1}
def plot_predictions(test, predicted, title):
    plt.figure(figsize = (16,4))
    plt.plot(test[a:b], color = 'blue', label = 'Normalized power consumption')
    plt.plot(predicted[a:b], alpha = 0.7, color = 'orange', label = 'Predicted power consumption')
    plt.title(title)
    plt.xlabel('Time')
    plt.ylabel('Normalized power consumption')
    plt.legend()
    plt.show()
plot_predictions(y_test, rnn_predictions, "Predictions made by simple RNN model")
history_data = list(y_test[-40:])
plottingvalues = list(history_data)+list(rnn_predictions[:50])
plt.figure(figsize = (16,4))
plt.plot(plottingvalues, color = 'orange', label = 'forecasted value',marker = 'o')
plt.plot(y_test[-40:], color = 'green', label = 'history',marker = 'x')
plt.xlabel('Time')
plt.ylabel('Normalized power consumption scale')
plt.legend()
plt.show()
X = X_test[-1:]
rnn_predictions1 = rnn_model.predict(X)
rnn_predictions1
history_data = list(y_test[-40:])
plottingvalues = list(history_data)+list(rnn_predictions1)
plt.figure(figsize = (16,4))
plt.plot(plottingvalues, color = 'orange', label = 'forecasted value',marker = 'o')
plt.plot(y_test[-40:], color = 'green', label = 'history',marker = 'x')
plt.xlabel('Time')
plt.ylabel('Normalized power consumption scale')
plt.legend()
plt.show()
history_data = list(y_test[-40:])
def make_data(X, rnn_predictions1):
    val = list(X[0][1:]) + list(rnn_predictions1)
    X_new = []
    X_new.append(list(val))
    X_new = np.array(X_new)
    return X_new
forecast = list()
X = X_test[-1:]
for i in range(25):
    X = make_data(X, rnn_predictions1)
    rnn_predictions1 = rnn_model.predict(X)
    forecast += list(rnn_predictions1)
plottingvalues = list(history_data)+list(forecast)
plt.figure(figsize = (16,4))
plt.plot(plottingvalues, color = 'orange', label = 'forecasted value',marker = 'o')
plt.plot(y_test[-40:], color = 'green', label = 'history',marker = 'x')
plt.xlabel('Time (ticks)')
plt.ylabel('Normalized power consumption scale')
plt.legend()
plt.show()
2. Multivariate Time Series Analysis
2.1 Project Creation
Create a Colab project and rename it "Multivariate time series analysis". Import the required libraries:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn.preprocessing
import seaborn as sns
2.2 Data Preparation
Load the data with the following code:
url = 'https://raw.githubusercontent.com/Apress/artificial-neural-networks-with-tensorflow-2/main/ch11/london_merged.csv'
df = pd.read_csv(url,parse_dates=['timestamp'], index_col = "timestamp")
The dataset contains 17,414 records and 9 columns, with the following meanings:
| Column | Meaning |
| ---- | ---- |
| timestamp | Timestamp of the observation |
| cnt | Count of new bike shares |
| t1 | Real temperature in degrees Celsius |
| t2 | "Feels like" temperature in degrees Celsius |
| hum | Humidity, in percent |
| wind_speed | Wind speed (km/h) |
| weather_code | Weather category |
| is_holiday | Boolean; 1 indicates a holiday |
| is_weekend | Boolean; 1 indicates a weekend |
| season | Season category: 0 = spring, 1 = summer, 2 = fall, 3 = winter |
Here, the `cnt` column is our forecast target; the remaining columns can serve as features.
2.3 Stationarity Check
Use the Johansen test to check whether all the columns are stationary:
#checking stationarity
from statsmodels.tsa.vector_ar.vecm import coint_johansen
johan_test_temp = df
coint_johansen(johan_test_temp,-1,1).eig
Running the test outputs the eigenvalues for all nine columns:
array([2.61219379e-01, 1.31970167e-01, 5.22046139e-02,
4.19830465e-02, 2.10126207e-02, 1.75450605e-02, 1.36518877e-02,
6.26085775e-04, 7.56291478e-05])
All the eigenvalues are less than 1, indicating that all the time series are stationary.
2.4 Data Exploration
2.4.1 Plotting the Bike Share Count Distribution
plt.figure(figsize = (16,4))
plt.plot(df.index, df["cnt"])
2.4.2 Observing the Effect of Seasonal Variation on Demand
# create indexes
df['hour'] = df.index.hour
df['month'] = df.index.month
fig,(ax1, ax2, ax3)= plt.subplots(nrows = 3)
fig.set_size_inches(16, 10)
sns.pointplot(data = df, x = 'month', y = 'cnt', hue = 'is_weekend', ax = ax1)
sns.pointplot(data = df, x = 'hour', y = 'cnt', hue = 'season', ax = ax2)
sns.pointplot(data = df, x = 'month', y = 'cnt', ax = ax3)
The plots show the following:
- The first plot shows that bike share demand is higher each July, on weekends and weekdays alike.
- The second plot shows that in every season, demand peaks around 8 AM and 5-6 PM.
- The third plot shows that demand is highest in July and lowest in January and December.
2.5 Data Preparation
2.5.1 Scaling the Numeric Columns
# scaling numeric columns
scaler = sklearn.preprocessing.MinMaxScaler()
df['t1'] = scaler.fit_transform(df['t1'].values.reshape(-1,1))
df['t2'] = scaler.fit_transform(df['t2'].values.reshape(-1,1))
df['hum'] = scaler.fit_transform(df['hum'].values.reshape(-1,1))
df['wind_speed'] = scaler.fit_transform(df['wind_speed'].values.reshape(-1,1))
df['cnt'] = scaler.fit_transform(df['cnt'].values.reshape(-1,1))
Here, we exclude the columns that contain categorical and Boolean values.
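For reference, `MinMaxScaler` rescales each column to the [0, 1] range via (x − min) / (max − min). A minimal pure-Python sketch of the same per-column transform:

```python
def minmax_scale(values):
    # the transform MinMaxScaler applies to each column: (x - min) / (max - min)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(minmax_scale([10.0, 15.0, 20.0]))  # [0.0, 0.5, 1.0]
```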
2.5.2 Splitting into Training and Test Sets
# use 90% for training
train_size = int(len(df) * 0.9)
test_size = len(df) - train_size
train, test = df.iloc[0:train_size], df.iloc[train_size:len(df)]
2.5.3 Creating the Dataset
def create_dataset(X, y, time_steps = 1):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)
time_steps = 10
X_train, y_train = create_dataset(train, train.cnt, time_steps)
X_test, y_test = create_dataset(test, test.cnt, time_steps)
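To make the windowing concrete, here is a pure-Python analogue of `create_dataset` applied to a toy series (lists instead of pandas objects, but the same sliding-window logic):

```python
def make_windows(series, targets, time_steps):
    # mirrors create_dataset: each X window holds `time_steps` consecutive rows,
    # and its y is the target value immediately after the window
    xs, ys = [], []
    for i in range(len(series) - time_steps):
        xs.append(series[i:i + time_steps])
        ys.append(targets[i + time_steps])
    return xs, ys

series = [0, 1, 2, 3, 4, 5]
xs, ys = make_windows(series, series, time_steps=3)
print(xs)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(ys)  # [3, 4, 5]
```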
2.5.4 Creating Data Batches
batch_size = 256
buffer_size = 1000
train_data = tf.data.Dataset.from_tensor_slices((X_train , y_train))
train_data = train_data.cache().shuffle(buffer_size).batch(batch_size).repeat()
The following flowchart summarizes the univariate forecasting and multivariate analysis workflows:
graph LR
classDef startend fill:#F5EBFF,stroke:#BE8FED,stroke-width:2px;
classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
classDef decision fill:#FFF6CC,stroke:#FFBC52,stroke-width:2px;
A([Start]):::startend --> B(Univariate time series forecasting):::process
B --> C(Model compilation and training):::process
C --> D(Model evaluation):::process
D --> E(Plotting actual vs. predicted values):::process
E --> F(Forecasting the next data point):::process
F --> G(Forecasting a range of data points):::process
G --> H(Multivariate time series analysis):::process
H --> I(Project creation):::process
I --> J(Data preparation):::process
J --> K(Stationarity check):::process
K --> L(Data exploration):::process
L --> M(Data preparation):::process
M --> N([End]):::startend
To summarize so far, we have covered univariate time series forecasting and multivariate time series analysis, including model compilation, training, and evaluation, as well as data preparation, exploration, and stationarity checking. With these methods we can forecast both energy consumption and bike share demand.
2.6 Model Building and Training
Next, we build a simple recurrent neural network (RNN) model for the multivariate time series analysis, using the TensorFlow and Keras libraries.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# build the model
model = Sequential([
    LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2])),
    Dense(1)
])
# compile the model
model.compile(optimizer='adam', loss='mse')
# train the model
EPOCHS = 10
model.fit(train_data, epochs=EPOCHS, steps_per_epoch=len(X_train) // batch_size)
2.7 Model Evaluation and Prediction
Once training finishes, we evaluate the model and make predictions.
# evaluate the model
test_loss = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss}")
# make predictions
predictions = model.predict(X_test)
2.8 Visualizing the Results
To see the model's predictive performance more intuitively, we plot the actual and predicted values together.
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 4))
plt.plot(y_test, label='Actual')
plt.plot(predictions, label='Predicted', alpha=0.7)
plt.title('Actual vs Predicted Bike Share Count')
plt.xlabel('Time')
plt.ylabel('Bike Share Count')
plt.legend()
plt.show()
2.9 Summary and Outlook
With the steps above we have completed the full multivariate time series analysis workflow: data preparation, stationarity checking, data exploration, model building, training, evaluation, and prediction. The results show that the model can predict bike share demand reasonably well, though there is still room for improvement.
The following table summarizes the steps of the multivariate time series analysis; the corresponding code appears in the sections above:
| Step | Operation |
| ---- | ---- |
| Project creation | Create a Colab project and import the required libraries |
| Data preparation | Load the data and set the timestamp index |
| Stationarity check | Run the Johansen test on all columns |
| Data exploration | Plot the bike share count distribution and observe the effect of seasonal variation on demand |
| Data preparation | Scale the numeric columns, split into training and test sets, create the dataset, and batch it |
| Model building and training | Build, compile, and train the RNN model |
| Model evaluation and prediction | Evaluate the model and make predictions |
| Result visualization | Plot the actual and predicted values together |
To further optimize the model, we can consider the following:
1. Adjust the model architecture: try different RNN layer structures, such as increasing the number of units in the LSTM layer or adding more hidden layers.
2. Tune the hyperparameters: for example the learning rate and the number of epochs; find the best combination via grid search or random search.
3. Feature engineering: add more features, or transform the existing ones, to improve model performance.
4. Ensemble learning: combine multiple models, for example the RNN model with other machine learning models such as decision trees or random forests.
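As one concrete example of the feature-engineering point: the `hour` and `month` columns created during data exploration are cyclical (hour 23 is adjacent to hour 0), and a common transform for such features is sine/cosine encoding. A minimal sketch of that technique (a common approach, not part of the original code):

```python
import math

def cyclical_encode(value, period):
    # map a cyclical value (e.g. hour of day) onto the unit circle
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# hour 23 and hour 0 land close together on the circle, as they should
print(cyclical_encode(23, 24))
print(cyclical_encode(0, 24))
```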
graph LR
classDef startend fill:#F5EBFF,stroke:#BE8FED,stroke-width:2px;
classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
classDef decision fill:#FFF6CC,stroke:#FFBC52,stroke-width:2px;
A([Start]):::startend --> B(Data preparation):::process
B --> C(Stationarity check):::process
C --> D(Data exploration):::process
D --> E(Data preparation):::process
E --> F(Model building):::process
F --> G(Model compilation):::process
G --> H(Model training):::process
H --> I(Model evaluation):::process
I --> J(Model prediction):::process
J --> K(Result visualization):::process
K --> L{Optimize further?}:::decision
L -- Yes --> M(Adjust model architecture):::process
M --> N(Tune hyperparameters):::process
N --> O(Feature engineering):::process
O --> P(Ensemble learning):::process
P --> F(Model building):::process
L -- No --> Q([End]):::startend
With the steps and optimization methods above, we can keep improving model performance and predict multivariate time series data more accurately. Whether the goal is energy consumption forecasting or bike share demand forecasting, these methods are broadly applicable and practical, and in real applications they can be adapted to the specific problem and data at hand.