Preface
This post is not a formal introduction to LSTM variants. I simply stacked a few currently popular techniques, such as a bidirectional LSTM with attention and a self-attention mechanism, on top of a plain LSTM, and found that none of them beat the basic model. I am sharing my code and results here as a reference (or cautionary tale) for anyone planning to modify an LSTM.
Dataset
The C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) dataset, released by NASA, is used for predicting the remaining useful life (RUL) of turbofan engines. It is a standard benchmark in aero-engine prognostics and health management (PHM) and is widely used in research on data-driven predictive maintenance.
A good introduction to the dataset: C-MAPSS数据集详细介绍_cmapss数据集 (CSDN blog)
Dataset download: N-CMAPSS数据集下载链接 (CSDN blog)
The raw files are not self-explanatory and have no header row, so they are hard to read directly. If you have never worked with this dataset before, it is best to read the bundled readme file and the introduction linked above carefully first.
Data Preprocessing
The dataset comes as four subsets, FD001-FD004 (four sets of files, not four files): each subset has its own train, test, and RUL file, twelve data files in total, roughly as shown below:
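The directory layout assumed by the code in this post is roughly the following (the paths come from the read_csv calls below; the exact name of the documentation file depends on the download):

CMAPSS_Data/
    train_FD001.txt ... train_FD004.txt   (training trajectories, run to failure)
    test_FD001.txt  ... test_FD004.txt    (test trajectories, truncated before failure)
    RUL_FD001.txt   ... RUL_FD004.txt     (true RUL at the last recorded cycle of each test engine)
    readme.txt                            (column descriptions / documentation)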
First, import the required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import mean_squared_error
Below are two versions of the loading code: one reads a single subset (simple and fast), the other reads and concatenates all four subsets (complete but slower). Pick whichever one you need.
## 1.1 Load the data
# Data format
# The data is stored as plain text; each row describes one engine at one operating cycle and has 26 columns:
# 1. unit number (engine ID)
# 2. time, in cycles (current operating cycle)
# 3-5. operational settings 1-3 (operating-condition parameters that affect engine performance)
# 6-26. sensor measurements 1-21 (21 sensor readings monitoring the engine's condition)
# Use the regex '\s+' as separator to avoid spurious columns caused by repeated spaces
train_df = pd.read_csv('CMAPSS_Data/train_FD001.txt', sep=r'\s+', header=None, engine='python')
test_df = pd.read_csv('CMAPSS_Data/test_FD001.txt', sep=r'\s+', header=None, engine='python')
rul_df = pd.read_csv('CMAPSS_Data/RUL_FD001.txt', sep=r'\s+', header=None, engine='python')
# Rename the columns
columns = ['unit', 'time', 'op1', 'op2', 'op3'] + [f'sensor_{i}' for i in range(1, 22)]
train_df.columns = columns
test_df.columns = columns
train_df.head()  # inspect the data
## 1.1 Load the data (alternative: read and concatenate all four subsets, same format as above)
# Data file lists
train_files = ['CMAPSS_Data/train_FD001.txt', 'CMAPSS_Data/train_FD002.txt',
'CMAPSS_Data/train_FD003.txt', 'CMAPSS_Data/train_FD004.txt']
test_files = ['CMAPSS_Data/test_FD001.txt', 'CMAPSS_Data/test_FD002.txt',
'CMAPSS_Data/test_FD003.txt', 'CMAPSS_Data/test_FD004.txt']
rul_files = ['CMAPSS_Data/RUL_FD001.txt', 'CMAPSS_Data/RUL_FD002.txt',
'CMAPSS_Data/RUL_FD003.txt', 'CMAPSS_Data/RUL_FD004.txt']
# Unit numbers restart from 1 in every file, so offset them before concatenating
# to keep engine IDs globally unique (otherwise the per-unit RUL computation below mixes engines)
def read_and_offset(files):
    dfs, offset = [], 0
    for f in files:
        df = pd.read_csv(f, sep=r'\s+', header=None, engine='python')
        df[0] += offset          # column 0 is the unit number
        offset = df[0].max()
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

# Read and concatenate the training data
train_df = read_and_offset(train_files)
# Read and concatenate the test data
test_df = read_and_offset(test_files)
# Read and concatenate the RUL data (one row per test engine, in file order)
rul_df_list = [pd.read_csv(f, sep=r'\s+', header=None, engine='python') for f in rul_files]
rul_df = pd.concat(rul_df_list, ignore_index=True)
# Rename the columns
columns = ['unit', 'time', 'op1', 'op2', 'op3'] + [f'sensor_{i}' for i in range(1, 22)]
train_df.columns = columns
test_df.columns = columns
print(train_df.shape, test_df.shape, rul_df.shape)  # check the shapes after concatenation
train_df.head()  # inspect the concatenated data
Next comes the preprocessing step that puts the data into a standard form: the target value, i.e. the RUL, is appended as the last column. The training set and the test set are handled separately below.
## 1.2 Compute the RUL for the training set
# For each engine, find its maximum cycle count and count backwards to get the RUL
max_cycles = train_df.groupby('unit')['time'].max()
train_df = train_df.merge(max_cycles.to_frame(name='max_time'), on='unit')
train_df['RUL'] = train_df['max_time'] - train_df['time']
train_df.drop(columns=['max_time'], inplace=True)
train_df
## 1.3 Attach the true RUL to the test set
# Compute the RUL of the test set
max_cycles_test = test_df.groupby('unit')['time'].max().reset_index()
max_cycles_test.columns = ['unit', 'max_time']
# rul_df stores one RUL value per engine in unit order, so attach the unit IDs to it
rul_df['unit'] = max_cycles_test['unit']
rul_df.columns = ['rul_max', 'unit']
# Merge the test data with the RUL values
test_df = test_df.merge(max_cycles_test, on='unit')
test_df = test_df.merge(rul_df, on='unit')
# Compute the true RUL at every cycle
test_df['RUL'] = (test_df['max_time'] - test_df['time']) + test_df['rul_max']
test_df.drop(columns=['max_time', 'rul_max'], inplace=True)
test_df.head()
Next, min-max normalization, which speeds up and stabilizes LSTM training:
from sklearn.preprocessing import MinMaxScaler
# Select the columns to normalize (exclude the unit ID, the time step and the RUL label)
feature_cols = train_df.columns.difference(['unit', 'time', 'RUL'])
scaler = MinMaxScaler()
train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])
test_df[feature_cols] = scaler.transform(test_df[feature_cols])
Now generate the sliding-window sequences the LSTM needs. The 2-D table (samples × features) is stretched into a 3-D array of shape (windows, time steps, features), which is the input format the LSTM expects. Note that with this windowing, any engine whose trajectory is not longer than the window length produces no sequences and is silently dropped, which can affect shorter test trajectories.
sequence_length = 50
def create_sequences(df, sequence_length):
    """Build the sliding-window sequences required by the LSTM."""
    sequences = []
    labels = []
    for unit in df['unit'].unique():
        unit_df = df[df['unit'] == unit].sort_values(by='time')
        features = unit_df[feature_cols].values
        target = unit_df['RUL'].values
        for i in range(len(unit_df) - sequence_length):
            sequences.append(features[i:i+sequence_length])
            labels.append(target[i+sequence_length])
    return np.array(sequences), np.array(labels)
X_train, y_train = create_sequences(train_df, sequence_length)
X_test, y_test = create_sequences(test_df, sequence_length)
X_train
array([[[1.90400484e-04, 2.37360551e-04, 1.00000000e+00, ...,
         9.62152586e-01, 9.98776165e-01, 8.42549679e-01],
        [9.99926220e-01, 9.97626394e-01, 1.00000000e+00, ...,
         2.73789803e-03, 6.26983457e-01, 2.59356549e-01],
        [1.95160496e-04, 1.18680275e-03, 1.00000000e+00, ...,
         9.61255292e-01, 9.98565159e-01, 8.55231414e-01],
        ...,
        [2.42760617e-04, 1.89888441e-03, 1.00000000e+00, ...,
         9.61163262e-01, 9.98902768e-01, 8.36716716e-01],
        [4.76215410e-01, 8.31474009e-01, 1.00000000e+00, ...,
         4.58218296e-01, 8.63331364e-01, 5.76058663e-01],
        [5.95175252e-01, 7.36529789e-01, 0.00000000e+00, ...,
         8.82569483e-02, 1.18163403e-03, 1.23484223e-02]],
       ...])
Baseline LSTM
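As a quick refresher, the standard LSTM cell (which is what nn.LSTM implements) updates its gates and states as follows, with $\sigma$ the sigmoid and $\odot$ element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad
h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$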
## Model training ##
class LSTMRUL(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super(LSTMRUL, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)  # regression task: output a single value
    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        out = self.fc(lstm_out[:, -1, :])  # take the output of the last time step
        return out
# Hyperparameters
input_size = X_train.shape[2]  # number of input features (3 operational settings + 21 sensors)
hidden_size = 64
num_layers = 2
lr = 0.0015
epochs = 60
batch_size = 64
# Initialize the model
model = LSTMRUL(input_size, hidden_size, num_layers)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
from torch.utils.data import DataLoader, TensorDataset
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Record the loss history
first_loss_history = []
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        output = model(batch_X)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)  # average loss over the epoch
    first_loss_history.append(avg_loss)        # record it
    print(f'Epoch {epoch+1}/{epochs}, Loss: {avg_loss}')
print("Training finished!")
## Model evaluation ##
model.eval()
with torch.no_grad():
    predictions_baselstm = model(X_test_tensor.to(device)).cpu().numpy()
# Compute the RMSE
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, predictions_baselstm))
print(f'Test RMSE: {rmse}')
## Results ##
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(y_test[:], label="True RUL", linestyle='dashed')
plt.plot(predictions_baselstm[:], label="Predicted RUL")
plt.legend()
plt.xlabel("Sample")
plt.ylabel("RUL")
plt.title("RUL Prediction using LSTM")
plt.show()
BiLSTM + Attention
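The attention layer used below is a minimal additive-style attention (my own notation, not taken from a specific paper): each BiLSTM output $h_t$ is scored with a single linear layer, the scores are softmax-normalized over the time dimension, and the prediction is made from the weighted sum of all time steps instead of only the last one:

$$
e_t = w^\top h_t + b, \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_k \exp(e_k)}, \qquad
c = \sum_t \alpha_t h_t
$$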
## Model training ##
import torch
import torch.nn as nn
import torch.optim as optim
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super(Attention, self).__init__()
        self.attn = nn.Linear(hidden_size * 2, 1)  # score each time step (BiLSTM output width is hidden_size*2)
        self.softmax = nn.Softmax(dim=1)           # normalize the scores over the time dimension
    def forward(self, lstm_out):
        attn_weights = self.softmax(self.attn(lstm_out))           # attention weights, shape (batch, seq_len, 1)
        attn_applied = torch.sum(attn_weights * lstm_out, dim=1)   # weighted sum over time
        return attn_applied
class BiLSTMAttention(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super(BiLSTMAttention, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
        self.attention = Attention(hidden_size)
        self.fc = nn.Linear(hidden_size * 2, 1)  # final RUL output
    def forward(self, x):
        lstm_out, _ = self.lstm(x)           # bidirectional LSTM output
        attn_out = self.attention(lstm_out)  # apply attention
        out = self.fc(attn_out)              # fully connected output layer
        return out
# Hyperparameters
input_size = X_train.shape[2]  # number of input features
hidden_size = 64
num_layers = 2
lr = 0.0015
epochs = 60
batch_size = 64
# Initialize the model
model = BiLSTMAttention(input_size, hidden_size, num_layers)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
from torch.utils.data import DataLoader, TensorDataset
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Record the loss history
second_loss_history = []
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        output = model(batch_X)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)  # average loss over the epoch
    second_loss_history.append(avg_loss)       # record it
    print(f'Epoch {epoch+1}/{epochs}, Loss: {avg_loss}')
print("Training finished!")
## Model evaluation ##
model.eval()
with torch.no_grad():
    predictions_bilstm = model(torch.tensor(X_test, dtype=torch.float32).to(device)).cpu().numpy()
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, predictions_bilstm))
print(f'Test RMSE: {rmse}')
## Results ##
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(y_test[:], label="True RUL", linestyle='dashed')
plt.plot(predictions_bilstm[:], label="Predicted RUL")
plt.legend()
plt.xlabel("Sample")
plt.ylabel("RUL")
plt.title("RUL Prediction using BiLSTM + Attention")
plt.show()
BiLSTM + Self-Attention
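For context, standard (scaled dot-product) self-attention computes

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

where $Q$, $K$ and $V$ are linear projections of the same sequence. The implementation below follows this pattern but, as a simplification, omits the $1/\sqrt{d_k}$ scaling factor.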
## Model training ##
import torch
import torch.nn as nn
import torch.optim as optim
class SelfAttention(nn.Module):
    def __init__(self, hidden_size):
        super(SelfAttention, self).__init__()
        self.query = nn.Linear(hidden_size * 2, hidden_size * 2)  # BiLSTM output width is hidden_size*2
        self.key = nn.Linear(hidden_size * 2, hidden_size * 2)
        self.value = nn.Linear(hidden_size * 2, hidden_size * 2)
        self.softmax = nn.Softmax(dim=-1)  # normalize over the key dimension (the standard choice)
    def forward(self, lstm_out):
        # Compute Q, K, V
        query = self.query(lstm_out)
        key = self.key(lstm_out)
        value = self.value(lstm_out)
        # Compute the attention weights
        attention_scores = torch.bmm(query, key.transpose(1, 2))  # dot product of Q and K, (batch, seq_len, seq_len)
        attention_weights = self.softmax(attention_scores)        # softmax over the keys
        # Weighted sum of V
        attn_out = torch.bmm(attention_weights, value)
        return attn_out, attention_weights
class BiLSTMSelfAttention(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super(BiLSTMSelfAttention, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
        self.attention = SelfAttention(hidden_size)
        self.fc = nn.Linear(hidden_size * 2, 1)  # RUL output
    def forward(self, x):
        lstm_out, _ = self.lstm(x)                               # bidirectional LSTM output
        attn_out, attention_weights = self.attention(lstm_out)   # self-attention weighting
        attn_out = torch.sum(attn_out, dim=1)                    # sum over the time dimension
        out = self.fc(attn_out)                                  # fully connected output layer
        return out, attention_weights
# Hyperparameters
input_size = X_train.shape[2]  # number of input features
hidden_size = 64
num_layers = 2
lr = 0.0015
epochs = 60
batch_size = 64
# Initialize the model
model = BiLSTMSelfAttention(input_size, hidden_size, num_layers)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
from torch.utils.data import DataLoader, TensorDataset
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Record the loss history
third_loss_history = []
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        output, _ = model(batch_X)  # the model also returns the attention weights
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)  # average loss over the epoch
    third_loss_history.append(avg_loss)        # record it
    print(f'Epoch {epoch+1}/{epochs}, Loss: {avg_loss}')
print("Training finished!")
## Model evaluation ##
model.eval()
with torch.no_grad():
    predictions_selfattention, attention_weights = model(torch.tensor(X_test, dtype=torch.float32).to(device))
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, predictions_selfattention.cpu().numpy()))
print(f'Test RMSE: {rmse}')
## Visualize the attention weights ##
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(attention_weights.cpu().numpy().flatten(), label="Attention Weights", linestyle='dashed')
plt.legend()
plt.xlabel("Flattened index")
plt.ylabel("Attention Weight")
plt.title("Attention Weights Visualization")
plt.show()
## Results ##
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(y_test[:], label="True RUL", linestyle='dashed')
plt.plot(predictions_selfattention.cpu().numpy()[:], label="Predicted RUL")
plt.legend()
plt.xlabel("Sample")
plt.ylabel("RUL")
plt.title("RUL Prediction using BiLSTM + Self-Attention")
plt.show()
BiLSTM + Attention + Improved Loss Function (Huber)
The improvement replaces MSE (mean squared error) with the Huber loss. MSE is very sensitive to outliers, because squaring the error amplifies large deviations and can push the model to overfit them. The Huber loss combines the strengths of MSE and MAE (mean absolute error): it reduces the influence of outliers and makes the model more robust. A comparison and the exact formula are given below.
| Aspect | MSE (mean squared error) | Huber loss |
|---|---|---|
| Mathematical form | squared error | combines MSE and MAE |
| Small errors | squared, smooth gradient | squared, smooth gradient |
| Large errors (outliers) | amplified, strongly affected by outliers | switches to MAE, outlier influence reduced |
| Robustness | weaker, outliers dominate | stronger, less sensitive to outliers |
| Convergence | slower, outliers distort the gradient | faster, outlier influence reduced |
| Typical use | regression without outliers | regression with outliers |
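For reference, the Huber loss with threshold $\delta$ (the form implemented by the custom loss class below) is

$$
L_\delta(y, \hat{y}) =
\begin{cases}
\dfrac{1}{2}(y - \hat{y})^2, & |y - \hat{y}| \le \delta \\[4pt]
\delta\left(|y - \hat{y}| - \dfrac{1}{2}\delta\right), & \text{otherwise.}
\end{cases}
$$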
## Model training ##
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
class BiLSTMAttention(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super(BiLSTMAttention, self).__init__()
        self.bilstm = nn.LSTM(input_size, hidden_size, num_layers,
                              batch_first=True, bidirectional=True)
        # Attention layer
        self.attention = nn.Linear(hidden_size * 2, 1)  # BiLSTM output width is hidden_size * 2
        self.softmax = nn.Softmax(dim=1)
        # Fully connected layer for the regression output
        self.fc = nn.Linear(hidden_size * 2, 1)
    def forward(self, x):
        lstm_out, _ = self.bilstm(x)  # BiLSTM output, shape (batch, seq_len, hidden_size*2)
        # Compute the attention weights
        attention_weights = self.attention(lstm_out)          # (batch, seq_len, 1)
        attention_weights = self.softmax(attention_weights)   # normalize over time
        # Weighted sum -> context vector
        context_vector = torch.sum(attention_weights * lstm_out, dim=1)  # (batch, hidden_size*2)
        # Predict the RUL through the fully connected layer
        out = self.fc(context_vector)  # (batch, 1)
        return out
# **Custom improved loss function**
# Huber loss replaces ----> MSELoss (mean squared error)
class HuberLoss(nn.Module):
    """Huber loss (equal to Smooth L1 loss when delta = 1)."""
    def __init__(self, delta=1.0):
        super(HuberLoss, self).__init__()
        self.delta = delta
    def forward(self, y_pred, y_true):
        error = y_true - y_pred
        is_small_error = torch.abs(error) <= self.delta
        squared_loss = 0.5 * error ** 2
        linear_loss = self.delta * (torch.abs(error) - 0.5 * self.delta)
        return torch.mean(torch.where(is_small_error, squared_loss, linear_loss))
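# Note: recent PyTorch versions also provide an equivalent built-in criterion,
# e.g. criterion = nn.HuberLoss(delta=1.0), which implements the same formula as
# the hand-written class above; the custom class is kept here for clarity.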
# Hyperparameters
input_size = X_train.shape[2]  # number of input features
hidden_size = 64
num_layers = 2
lr = 0.0015
epochs = 60
batch_size = 64
# Initialize the model
model = BiLSTMAttention(input_size, hidden_size, num_layers)
criterion = HuberLoss(delta=1.0)  # use the Huber loss
optimizer = optim.Adam(model.parameters(), lr=lr)
# Convert the data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
from torch.utils.data import DataLoader, TensorDataset
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Record the loss history
fourth_loss_history = []
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        output = model(batch_X)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)  # average loss over the epoch
    fourth_loss_history.append(avg_loss)       # record it
    print(f'Epoch {epoch+1}/{epochs}, Loss: {avg_loss}')
print("Training finished!")
## Model evaluation ##
model.eval()
with torch.no_grad():
    predictions_loss = model(X_test_tensor.to(device)).cpu().numpy()
# Compute the RMSE
rmse = np.sqrt(mean_squared_error(y_test, predictions_loss))
print(f'Test RMSE: {rmse}')
## Results ##
# Plot
plt.figure(figsize=(10, 5))
plt.plot(y_test[:], label="True RUL", linestyle='dashed')
plt.plot(predictions_loss[:], label="Predicted RUL")  # no .cpu() needed here: predictions_loss is already a NumPy array
plt.legend()
plt.xlabel("Sample")
plt.ylabel("RUL")
plt.title("RUL Prediction using BiLSTM + Attention")
plt.show()
Loss Curve Visualization
import matplotlib.pyplot as plt
# Create the figure
fig, ax1 = plt.subplots(figsize=(8, 5))
# Primary y-axis: the three MSE-trained models
ax1.plot(range(1, epochs+1), first_loss_history, marker='.', linestyle='-', label="LSTM")
ax1.plot(range(1, epochs+1), second_loss_history, marker='.', linestyle='-', label="BiLSTM + attention")
ax1.plot(range(1, epochs+1), third_loss_history, marker='.', linestyle='-', label="BiLSTM + self-attention")
# Secondary y-axis: the Huber-loss model (its loss is on a different scale)
ax2 = ax1.twinx()
ax2.plot(range(1, epochs+1), fourth_loss_history, marker='x', linestyle='-', color='red', label="BiLSTM + attention (Huber loss)")
# Axis labels
ax1.set_xlabel('Epochs')
ax1.set_ylabel('MSE Loss')
ax2.set_ylabel('Huber Loss')
# Legends
ax1.legend(loc='upper right', bbox_to_anchor=(1, 1))
ax2.legend(loc='upper right', bbox_to_anchor=(1, 0.80))  # shifted down so the two legends do not overlap
# Title & grid
plt.title('Training Loss Curve')
plt.grid(True)
# Show the figure
plt.show()
The loss curves below are from the single subset FD001. Because the structure of this subset is fairly simple, all four models converge in the end; the baseline LSTM is only slower to converge in the early epochs.
Even so, the baseline LSTM achieves the lowest test RMSE. A more complex model is clearly not guaranteed to perform better; here every variant did worse than the plain LSTM, as the table below shows.
| Model | Test RMSE |
|---|---|
| LSTM | 33.9612795943933 |
| BiLSTM + attention | 38.667340147295754 |
| BiLSTM + self-attention | 35.95792461237371 |
| BiLSTM + attention + Huber loss | 40.4522131330351 |