TensorFlow 2.3 + Recurrent Neural Networks for Air Pollution Prediction (RNN + LSTM) (Part 3)

深海漫步鹅

Category: tensorflow2.3. Tags: neural networks, deep learning

Article link: https://blog.youkuaiyun.com/JerryZhang1111/article/details/116456324

Air pollution prediction with recurrent neural networks in TensorFlow 2.3

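Column list

Column  Meaning
No      row number
year    year
month   month
day     day
hour    hour
pm2.5   PM2.5 concentration
DEWP    dew point
TEMP    temperature
PRES    atmospheric pressure
cbwd    wind direction
Iws     wind speed
Is      accumulated snow
Ir      accumulated rain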

Import packages

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Check the TensorFlow version and whether a GPU is available:

tf.__version__
tf.test.is_gpu_available()

'2.3.0'
True

Read the data

data = pd.read_csv('./dataset/PRSA_data_2010.1.1-2014.12.31.csv')
data.head()

No year month day hour pm2.5 DEWP TEMP PRES cbwd Iws Is Ir
0 1 2010 1 1 0 NaN -21 -11.0 1021.0 NW 1.79 0 0
1 2 2010 1 1 1 NaN -21 -12.0 1020.0 NW 4.92 0 0
2 3 2010 1 1 2 NaN -21 -11.0 1019.0 NW 6.71 0 0
3 4 2010 1 1 3 NaN -21 -14.0 1019.0 NW 9.84 0 0
4 5 2010 1 1 4 NaN -20 -12.0 1018.0 NW 12.97 0 0

The data is recorded once per hour.
Inspect the column information:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43824 entries, 0 to 43823
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   No      43824 non-null  int64  
 1   year    43824 non-null  int64  
 2   month   43824 non-null  int64  
 3   day     43824 non-null  int64  
 4   hour    43824 non-null  int64  
 5   pm2.5   41757 non-null  float64
 6   DEWP    43824 non-null  int64  
 7   TEMP    43824 non-null  float64
 8   PRES    43824 non-null  float64
 9   cbwd    43824 non-null  object 
 10  Iws     43824 non-null  float64
 11  Is      43824 non-null  int64  
 12  Ir      43824 non-null  int64  
dtypes: float64(4), int64(8), object(1)
memory usage: 4.3+ MB

Our target is the pm2.5 column. Select the rows where pm2.5 is null:

data[data['pm2.5'].isna()]

No year month day hour pm2.5 DEWP TEMP PRES cbwd Iws Is Ir
0 1 2010 1 1 0 NaN -21 -11.0 1021.0 NW 1.79 0 0
1 2 2010 1 1 1 NaN -21 -12.0 1020.0 NW 4.92 0 0
2 3 2010 1 1 2 NaN -21 -11.0 1019.0 NW 6.71 0 0
3 4 2010 1 1 3 NaN -21 -14.0 1019.0 NW 9.84 0 0
4 5 2010 1 1 4 NaN -20 -12.0 1018.0 NW 12.97 0 0
… … … … … … … … … … … … … …
43548 43549 2014 12 20 12 NaN -18 0.0 1030.0 NW 244.97 0 0
43549 43550 2014 12 20 13 NaN -19 1.0 1029.0 NW 249.89 0 0
43550 43551 2014 12 20 14 NaN -20 1.0 1029.0 NW 257.04 0 0
43551 43552 2014 12 20 15 NaN -20 2.0 1028.0 NW 262.85 0 0
43552 43553 2014 12 20 16 NaN -21 1.0 1028.0 NW 270.00 0 0
2067 rows × 13 columns

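The first 24 rows have a null pm2.5, so drop them, then fill the remaining nulls. (The null rows could also be deleted, but that would break the regular hourly spacing of the series, and for sequence problems a disturbed ordering affects the result considerably.)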

data = data.iloc[24:].copy()
data.ffill(inplace=True)  # forward fill: copy the previous valid value into each null
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43800 entries, 24 to 43823
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   No      43800 non-null  int64  
 1   year    43800 non-null  int64  
 2   month   43800 non-null  int64  
 3   day     43800 non-null  int64  
 4   hour    43800 non-null  int64  
 5   pm2.5   43800 non-null  float64
 6   DEWP    43800 non-null  int64  
 7   TEMP    43800 non-null  float64
 8   PRES    43800 non-null  float64
 9   cbwd    43800 non-null  object 
 10  Iws     43800 non-null  float64
 11  Is      43800 non-null  int64  
 12  Ir      43800 non-null  int64  
dtypes: float64(4), int64(8), object(1)
memory usage: 4.3+ MB

All columns are now non-null.
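As a small illustration of why the leading rows are dropped before filling, here is a sketch of forward fill on a toy Series. The values are made up and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy series with gaps (illustrative values only).
toy = pd.Series([np.nan, 129.0, np.nan, np.nan, 181.0])

# Each NaN takes the most recent earlier non-null value.
filled = toy.ffill()
print(filled.tolist())

# A leading NaN has no earlier value to copy, so it stays NaN.
print(filled.isna().sum())
```

A null at the very start of the series survives the forward fill, which is consistent with removing the first 24 rows before calling it.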

Merge the year, month, day and hour columns into a single datetime column:

import datetime
datetime.datetime(year=2021, month=5, day=6, hour=1)

datetime.datetime(2021, 5, 6, 1, 0)


data['time'] = data.apply(lambda x: datetime.datetime(year=x['year'], 
                                                      month=x['month'], 
                                                      day=x['day'], 
                                                      hour=x['hour']), axis=1)
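A row-wise apply works, but it calls Python code once per row. As a possible alternative, pd.to_datetime can assemble timestamps directly from columns named year, month, day and hour. The sketch below uses a made-up two-row frame rather than the real dataset.

```python
import pandas as pd

# Toy frame with the same four time columns as the dataset (values made up).
toy = pd.DataFrame({"year": [2010, 2010], "month": [1, 1],
                    "day": [2, 2], "hour": [0, 1]})

# pd.to_datetime builds one timestamp per row from the named columns.
toy["time"] = pd.to_datetime(toy[["year", "month", "day", "hour"]])
print(toy["time"].tolist())
```

On the full table this should give the same time column as the apply version, in a single vectorized call.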

Set the time column as the index, then drop No, year, month, day and hour. inplace=True applies the change to the original data.

data.set_index('time', inplace=True)
data = data.drop(columns=['No','year', 'month', 'day', 'hour'])
data.head()

pm2.5 DEWP TEMP PRES cbwd Iws Is Ir
time
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0

Rename the columns

data.columns = ['pm2.5', 'dew', 'temp', 'press', 'cbwd', 'iws', 'snow', 'rain']
data

pm2.5 dew temp press cbwd iws snow rain
time
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
… … … … … … … … …
2014-12-31 19:00:00 8.0 -23 -2.0 1034.0 NW 231.97 0 0
2014-12-31 20:00:00 10.0 -22 -3.0 1034.0 NW 237.78 0 0
2014-12-31 21:00:00 10.0 -22 -3.0 1034.0 NW 242.70 0 0
2014-12-31 22:00:00 8.0 -22 -4.0 1034.0 NW 246.72 0 0
2014-12-31 23:00:00 12.0 -21 -3.0 1034.0 NW 249.85 0 0
43800 rows × 8 columns

cbwd: wind direction

data.cbwd.unique()

array(['SE', 'cv', 'NW', 'NE'], dtype=object)

Convert the wind direction to a one-hot encoding and join it to the data as new columns

data = data.join(pd.get_dummies(data.cbwd))
data

pm2.5 dew temp press cbwd iws snow rain NE NW SE cv
time
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0 0 0 1 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0 0 0 1 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0 0 0 1 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0 0 0 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0 0 0 1 0
… … … … … … … … … … … … …
2014-12-31 19:00:00 8.0 -23 -2.0 1034.0 NW 231.97 0 0 0 1 0 0
2014-12-31 20:00:00 10.0 -22 -3.0 1034.0 NW 237.78 0 0 0 1 0 0
2014-12-31 21:00:00 10.0 -22 -3.0 1034.0 NW 242.70 0 0 0 1 0 0
2014-12-31 22:00:00 8.0 -22 -4.0 1034.0 NW 246.72 0 0 0 1 0 0
2014-12-31 23:00:00 12.0 -21 -3.0 1034.0 NW 249.85 0 0 0 1 0 0
43800 rows × 12 columns
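To show what the one-hot step does in isolation, here is a minimal sketch on a toy column holding the four wind-direction codes. The row order is made up.

```python
import pandas as pd

# Toy column with the four wind-direction codes found in the dataset.
toy = pd.DataFrame({"cbwd": ["SE", "cv", "NW", "NE", "SE"]})

# One indicator column per distinct category.
dummies = pd.get_dummies(toy["cbwd"])
print(list(dummies.columns))

# Every row has exactly one active indicator.
print(dummies.sum(axis=1).tolist())
```

The indicator columns come out in sorted order, which matches the NE, NW, SE, cv order in the output above. Depending on the pandas version, the indicators may be stored as 0/1 integers or as booleans.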

Drop the original cbwd column

del data['cbwd']
data

pm2.5 dew temp press iws snow rain NE NW SE cv
time
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 1.79 0 0 0 0 1 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 2.68 0 0 0 0 1 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 3.57 0 0 0 0 1 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 5.36 1 0 0 0 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 6.25 2 0 0 0 1 0
… … … … … … … … … … … …
2014-12-31 19:00:00 8.0 -23 -2.0 1034.0 231.97 0 0 0 1 0 0
2014-12-31 20:00:00 10.0 -22 -3.0 1034.0 237.78 0 0 0 1 0 0
2014-12-31 21:00:00 10.0 -22 -3.0 1034.0 242.70 0 0 0 1 0 0
2014-12-31 22:00:00 8.0 -22 -4.0 1034.0 246.72 0 0 0 1 0 0
2014-12-31 23:00:00 12.0 -21 -3.0 1034.0 249.85 0 0 0 1 0 0
43800 rows × 11 columns

Final data layout

data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43800 entries, 2010-01-02 00:00:00 to 2014-12-31 23:00:00
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pm2.5   43800 non-null  float64
 1   dew     43800 non-null  int64  
 2   temp    43800 non-null  float64
 3   press   43800 non-null  float64
 4   iws     43800 non-null  float64
 5   snow    43800 non-null  int64  
 6   rain    43800 non-null  int64  
 7   NE      43800 non-null  uint8  
 8   NW      43800 non-null  uint8  
 9   SE      43800 non-null  uint8  
 10  cv      43800 non-null  uint8  
dtypes: float64(4), int64(3), uint8(4)
memory usage: 4.1 MB

data.head(3)

pm2.5 dew temp press iws snow rain NE NW SE cv
time
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 1.79 0 0 0 0 1 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 2.68 0 0 0 0 1 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 3.57 0 0 0 0 1 0

Plot the pm2.5 values of the last 1000 records

data['pm2.5'][-1000:].plot()

[Figure: pm2.5 values of the last 1000 records]
Plot the temp values of the last 1000 records

data['temp'][-1000:].plot()

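[Figure: temp values of the last 1000 records]
The samples are collected once per hour.
We can use the previous 5 days of data to predict the next day's pm2.5 value.
That is, the 5 days of data form the input x and the value one day later is the label y; the model learns the relationship between x and y.
For example, if today is January 6, 2020 and we hold the data for January 1 to 5, the value to predict is tomorrow's.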

sequence_length = 5 * 24
delay = 24

data_ = []
for i in range(len(data) -sequence_length - delay):
    data_.append(data.iloc[i:i + sequence_length + delay])

data_[0].shape

(144, 11)

data_ = np.array([df.values for df in data_])
data_

array([[[129., -16., -4., …, 0., 1., 0.],
[148., -15., -4., …, 0., 1., 0.],
[159., -11., -5., …, 0., 1., 0.],
…,
[159., -19., -14., …, 0., 0., 1.],
[198., -21., -14., …, 0., 0., 1.],
[190., -21., -16., …, 0., 0., 1.]],
…,
[ 8., -23., -2., …, 1., 0., 0.],
[ 10., -22., -3., …, 1., 0., 0.],
[ 10., -22., -3., …, 1., 0., 0.]],
[[ 69., -11., -2., …, 0., 1., 0.],
[ 93., -11., -3., …, 0., 1., 0.],
[ 94., -11., -3., …, 0., 0., 1.],
…,
[ 10., -22., -3., …, 1., 0., 0.],
[ 10., -22., -3., …, 1., 0., 0.],
[ 8., -22., -4., …, 1., 0., 0.]]])

data_.shape

(43656, 144, 11)
There are 43656 samples in total. Each sample is 144 rows long (6 days of data, 6 * 24 = 144) and each row has 11 features.
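The window arithmetic can be checked on a toy series. This sketch uses deliberately small sizes (3 input steps, a delay of 2) instead of the article's 5 * 24 and 24, and a made-up series whose values equal their time index so the slices are easy to read. The target is taken as feature 0 of the last row of each window, which is the same slicing the article applies to data_.

```python
import numpy as np

sequence_length, delay = 3, 2            # toy sizes; the article uses 5 * 24 and 24
series = np.arange(10).reshape(10, 1)    # 10 time steps, 1 feature

# One window per start index, each sequence_length + delay rows long.
windows = np.array([series[i:i + sequence_length + delay]
                    for i in range(len(series) - sequence_length - delay)])
print(windows.shape)    # (n_windows, sequence_length + delay, n_features)

x = windows[:, :sequence_length, :]      # inputs: the first sequence_length steps
y = windows[:, -1, 0]                    # target: feature 0 at the window's last step
print(x[0, :, 0], y[0])

# The same count formula applied to the article's sizes.
print(43800 - 5 * 24 - 24)
```

With this loop bound the final possible window (the one starting at index len(series) - sequence_length - delay) is not generated, so the count is one less than the maximum; for a dataset of this size that makes no practical difference.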

np.random.shuffle(data_)
x = data_[:, :5*24, :]   # features: the first 5 days of each window
y = data_[:, -1, 0]      # target: pm2.5 at the last hour of each window
split_boundary = int(data_.shape[0] * 0.8)
x.shape
y.shape

(43656, 120, 11)
(43656,)

Split into training and test sets

train_x = x[:split_boundary]
train_y = y[:split_boundary]
test_x = x[split_boundary:]
test_y = y[split_boundary:]

train_x.shape, train_y.shape, test_x.shape, test_y.shape

((34924, 120, 11), (34924,), (8732, 120, 11), (8732,))

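Standardize the data: subtract the mean and divide by the standard deviation, both computed on the training set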

mean = train_x.mean(axis=0)
std = train_x.std(axis=0)
mean.shape

(120, 11)

train_x = (train_x - mean) / std
test_x =(test_x - mean) / std
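To see what this standardization does, here is a sketch on random toy arrays with the same (samples, steps, features) layout. The shapes and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.normal(5.0, 2.0, size=(100, 4, 3))   # (samples, steps, features)
toy_test = rng.normal(5.0, 2.0, size=(20, 4, 3))

# axis=0 averages over samples: one statistic per (step, feature) position.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)
print(mean.shape)

toy_train_s = (toy_train - mean) / std   # broadcasts over the sample axis
toy_test_s = (toy_test - mean) / std     # the test set reuses the training statistics

# After scaling, each (step, feature) position of the training set has mean 0 and std 1.
print(np.allclose(toy_train_s.mean(axis=0), 0.0), np.allclose(toy_train_s.std(axis=0), 1.0))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing. Note that a position that is constant across all training samples would have a standard deviation of 0 and cause a division by zero.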

Set the batch size

batch_size = 128

Baseline model: prediction with a multilayer perceptron

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(train_x.shape[1:])))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1))
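As a sanity check on the size of this baseline, the parameter count can be worked out by hand. The sketch below assumes the input shape (120, 11) reported for train_x and the default Dense layers with a bias term.

```python
timesteps, features = 120, 11             # train_x.shape[1:] in the article

flat = timesteps * features               # Flatten turns each sample into one vector
hidden_params = flat * 32 + 32            # Dense(32): weights plus biases
output_params = 32 * 1 + 1                # Dense(1): weights plus bias
total = hidden_params + output_params
print(flat, hidden_params, output_params, total)
```

Flatten discards the time ordering: the 1320 inputs are treated as unrelated features, which is the limitation the LSTM models are meant to address.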

Compile and train the model

model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mae')
history = model.fit(train_x, train_y,
                    batch_size = 128,
                    epochs=50,
                    validation_data=(test_x, test_y))

Visualize the training

plt.plot(history.epoch, history.history.get('loss'), 'y', label='Training loss')
plt.plot(history.epoch, history.history.get('val_loss'), 'b', label='Test loss')
plt.legend()

[Figure: training and test loss for the multilayer perceptron]
Build the recurrent LSTM model

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(32, input_shape=(train_x.shape[1:])))
model.add(tf.keras.layers.Dense(1))

Compile and train the model

model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mae')
history = model.fit(train_x, train_y,
                    batch_size = 128,
                    epochs=50,
                    validation_data=(test_x, test_y))

plt.plot(history.epoch, history.history.get('loss'), 'y', label='Training loss')
plt.plot(history.epoch, history.history.get('val_loss'), 'b', label='Test loss')
plt.legend()

[Figure: training and test loss for the single-layer LSTM]
Improving the LSTM by stacking layers

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(32, input_shape=(train_x.shape[1:]), return_sequences=True))
model.add(tf.keras.layers.LSTM(32, return_sequences=True))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(1))
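In a stack, every LSTM layer except the last needs return_sequences=True so that it passes a full sequence of shape (batch, 120, 32) to the next layer; the last one returns only its final state of shape (batch, 32). The sketch below estimates the parameter counts, assuming the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias. The helper lstm_params is hypothetical and written only for this illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)

first = lstm_params(11, 32)     # fed by the 11 input features
second = lstm_params(32, 32)    # fed by the 32-dim outputs of the previous layer
third = lstm_params(32, 32)
dense = 32 * 1 + 1              # final Dense(1)
total = first + second + third + dense
print(first, second, third, dense, total)
```

Under these assumptions the stacked model is still smaller than the Flatten-based baseline, because the LSTM weights are shared across all 120 time steps.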

model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mae')

Reduce the learning rate when the validation loss stops improving

learning_rate_reduction = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=3, factor=0.5, min_lr=0.00001)
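With these settings the callback halves the learning rate (factor=0.5) once val_loss has gone 3 epochs without improving (patience=3), and never goes below min_lr. The following is a simplified pure-Python simulation of that rule, not the Keras source. The function name, the min_delta default and the validation losses are assumptions for illustration, and the starting rate of 0.001 is the Adam default.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.001, patience=3,
                               factor=0.5, min_lr=1e-5, min_delta=1e-4):
    """Simplified illustration of the plateau rule; not the Keras implementation."""
    best, wait, history = float("inf"), 0, []
    for loss in val_losses:
        if loss < best - min_delta:        # an improvement resets the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:           # plateau reached: shrink the rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)                 # rate in effect after this epoch
    return history

# Made-up validation losses: one improvement, then a long plateau.
schedule = simulate_reduce_on_plateau([50, 48, 48, 48, 48, 48, 48, 48])
print(schedule)
```

In this toy run the rate drops after the third stagnant epoch and again three epochs later, which is the behaviour to expect during the 200-epoch run whenever the validation curve flattens.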

history = model.fit(train_x, train_y,
                    batch_size = 128,
                    epochs=200,
                    validation_data=(test_x, test_y),
                    callbacks=[learning_rate_reduction])

plt.plot(history.epoch, history.history.get('loss'), 'y', label='Training loss')
plt.plot(history.epoch, history.history.get('val_loss'), 'b', label='Test loss')
plt.legend()

[Figure: training and test loss for the stacked LSTM with learning-rate reduction]