Model Ensembling

First, two well-written articles on ensembling:

https://mlwave.com/kaggle-ensembling-guide/

https://mlwave.com/human-ensemble-learning/


Common ensemble methods:

1. Voting

Equal voting: with n models, each model casts exactly one vote; the less correlated the models are, the more effective the vote becomes.

Weighted voting: with n models, the better-performing ones are given extra votes; again, the less correlated the models, the more effective the vote.
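A minimal sketch of both voting schemes, assuming three hypothetical classifiers whose label predictions are already available (the prediction arrays and weights below are made up purely for illustration):

```python
import numpy as np
from collections import Counter

# Hypothetical label predictions from three weakly correlated models
preds = np.array([
    [1, 0, 1, 1],   # model 1
    [1, 1, 0, 1],   # model 2
    [0, 0, 1, 1],   # model 3
])

# Equal voting: each model casts one vote and the majority label wins
majority = [Counter(col).most_common(1)[0][0] for col in preds.T]

# Weighted voting: better models carry more weight (weights assumed here)
weights = [0.5, 0.3, 0.2]

def weighted_vote(column, weights):
    scores = {}
    for label, w in zip(column, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

weighted = [weighted_vote(col, weights) for col in preds.T]
print(majority, weighted)  # per-sample ensemble labels
```

scikit-learn's `VotingClassifier` (with `voting='hard'` and an optional `weights=` argument) implements the same idea end to end.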

 

2. Averaging

Take a weighted average of multiple models' outputs to obtain the final result; the weights can be equal or unequal.
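A minimal weighted-averaging sketch for regression outputs; `pred_a`, `pred_b`, `pred_c` are hypothetical predictions from three models on the same samples:

```python
import numpy as np

# Hypothetical predictions from three regression models
pred_a = np.array([0.10, 0.80, 0.30])
pred_b = np.array([0.12, 0.75, 0.35])
pred_c = np.array([0.08, 0.85, 0.25])

# Equal weights: a simple mean
simple_avg = (pred_a + pred_b + pred_c) / 3

# Unequal weights: they should sum to 1 so the result stays on the same scale
w = np.array([0.5, 0.3, 0.2])
weighted_avg = w[0] * pred_a + w[1] * pred_b + w[2] * pred_c

print(simple_avg, weighted_avg)
```

This is exactly the form the ensemble models further below take, with the weights found by PSO rather than fixed by hand.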

 

3. Bagging, Boosting, Stacking, etc. (a minimal stacking sketch follows below)
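These three are covered in depth in the MLWave guide linked above. As a quick illustration, scikit-learn ships ready-made implementations (`BaggingRegressor`, `GradientBoostingRegressor`, `StackingRegressor`); here is a minimal stacking sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic regression data purely for illustration
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)

# Level-0 learners; a linear meta-learner stacks their out-of-fold predictions
stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
                ('svr', SVR(kernel='rbf'))],
    final_estimator=Ridge(),
)

print(cross_val_score(stack, X, y, scoring='r2', cv=5).mean())
```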

 

Build a prediction model in Python to the following specification:

1. Read the sample data (seven columns in total) from an Excel file, with columns A-D as the inputs and columns E-G as the outputs, then normalize the samples to produce the training data.

2. Build the first-layer models: using three machine-learning methods (a lightweight LSTM long short-term memory network, Kriging, and support vector regression SVR), fit an independent model to each of the three normalized outputs E, F, G, giving nine models in total: LSTM.E.model, LSTM.F.model, LSTM.G.model, Kriging.E.model, Kriging.F.model, Kriging.G.model, SVR.E.model, SVR.F.model, SVR.G.model. Report each model's accuracy under different criteria such as mean squared error and R², and save the nine models as separate .pkl files. Then apply inverse normalization to the nine models and save the de-normalized versions separately as well.

3. For each of the outputs E, F, G, combine its three models into weighted ensemble models Ensemble.E.model, Ensemble.F.model, Ensemble.G.model, where:
   - Ensemble.E.model = (W.E1)(LSTM.E.model) + (W.E2)(Kriging.E.model) + (W.E3)(SVR.E.model)
   - Ensemble.F.model = (W.F1)(LSTM.F.model) + (W.F2)(Kriging.F.model) + (W.F3)(SVR.F.model)
   - Ensemble.G.model = (W.G1)(LSTM.G.model) + (W.G2)(Kriging.G.model) + (W.G3)(SVR.G.model)

   Each W.XX is a constant weight subject to 0 < W.XX < 1 and W.X1 + W.X2 + W.X3 = 1, and the coefficient of determination R² serves as the accuracy metric for the three ensemble models.

4. Use the PSO particle swarm optimization algorithm to search the weights W.X1, W.X2, W.X3 within their bounds, with R² approaching 1 as the optimization objective, to determine their final values. (Taking Ensemble.E.model as an example: compare the ensemble's outputs against the training data to obtain Ensemble.E.R², then let PSO repeatedly adjust the values of W.E1, W.E2, W.E3 within the constraints to find the assignment that brings Ensemble.E.R² closest to 1.) Determine all nine weights W.E1, W.E2, W.E3, W.F1, W.F2, W.F3, W.G1, W.G2, W.G3 in this way, assign them to Ensemble.E.model, Ensemble.F.model, Ensemble.G.model to obtain the three final ensemble models, and save these as separate .pkl files.

5. Form the residuals Res.E, Res.F, Res.G between the normalized training data of outputs E, F, G and the corresponding final ensemble predictions. Then use a GA-BP neural network (a BP network whose structural parameters are tuned by a GA genetic algorithm for maximum accuracy) to separately learn a predictor mapping the training inputs to each of the three residuals: P_Res.E, P_Res.F, P_Res.G.

6. The three final prediction models are defined as Final.E/F/G.model = Ensemble.E/F/G.model + P_Res.E/F/G. Save Final.E.model, Final.F.model, Final.G.model as separate .pkl files. Then inverse-normalize these three models against the corresponding E, F, G output columns of the sample data to obtain Original.E.model, Original.F.model, Original.G.model, also saved as separate .pkl files.

7. At the end of the pipeline, output the predictions of Original.E/F/G.model for every sample's A-D inputs.
The problem you describe is a complete and fairly complex **multi-model ensemble + optimization + residual-correction** modeling pipeline, involving:

- data loading and normalization
- several machine-learning models (LSTM, Kriging, SVR)
- model evaluation and saving
- ensemble construction + weight optimization (PSO)
- residual modeling + a GA-BP neural network
- final model composition and inverse normalization
- prediction output

We will **implement the whole pipeline step by step** using mainstream Python libraries.

---

## ✅ Required dependencies (install via pip or conda)

```bash
pip install pandas openpyxl scikit-learn tensorflow numpy scipy deap pyswarm joblib
```

---

## ✅ Full implementation

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import tensorflow as tf
from tensorflow.keras.models import Sequential, save_model, load_model
from tensorflow.keras.layers import LSTM, Dense
import pickle
import pyswarm as ps  # PSO optimizer
from deap import base, creator, tools, algorithms  # GA for the BP network
import os

# ----------------------------
# Step 1: Read the Excel data
# ----------------------------
file_path = 'data.xlsx'  # the file must be named data.xlsx and contain columns A..G
df = pd.read_excel(file_path)

input_cols = ['A', 'B', 'C', 'D']
output_cols = ['E', 'F', 'G']

X = df[input_cols].values   # inputs  (N, 4)
Y = df[output_cols].values  # outputs (N, 3)

# ----------------------------
# Step 2: Normalization
# ----------------------------
scaler_X = MinMaxScaler()
scaler_Y = MinMaxScaler()
X_scaled = scaler_X.fit_transform(X)  # (N, 4)
Y_scaled = scaler_Y.fit_transform(Y)  # (N, 3)

# All samples are used for training (no hold-out split)
X_train, Y_train = X_scaled, Y_scaled

# Reshape for the LSTM: (samples, timesteps, features)
X_train_lstm = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))  # (N, 1, 4)

# Create output directories
os.makedirs('models', exist_ok=True)
os.makedirs('results', exist_ok=True)

# Containers for all base models and their training-set predictions
models = {}
predictions = {}

# ----------------------------
# Step 3: Train the 9 base models (LSTM, Kriging, SVR) x 3 outputs
# ----------------------------
print("Training the 9 base models...")

for i, target in enumerate(output_cols):
    y_train = Y_train[:, i]  # current target column E/F/G

    # --- LSTM model ---
    print(f"Training LSTM.{target}.model")
    model_lstm = Sequential([
        LSTM(32, input_shape=(1, 4), return_sequences=False),
        Dense(16, activation='relu'),
        Dense(1, activation='linear')
    ])
    model_lstm.compile(optimizer='adam', loss='mse')
    model_lstm.fit(X_train_lstm, y_train, epochs=50, batch_size=16, verbose=0)

    # Save architecture + weights in Keras format as well (more robust than pickle)
    save_model(model_lstm, f'models/LSTM.{target}.model.h5')
    models[f'LSTM.{target}'] = model_lstm
    pred_lstm = model_lstm.predict(X_train_lstm, verbose=0).flatten()
    predictions[f'LSTM.{target}'] = pred_lstm

    # --- Kriging model (Gaussian process regression with an RBF kernel) ---
    print(f"Training Kriging.{target}.model")
    try:
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF as GPR_RBF, ConstantKernel as C

        kernel = C(1.0, (1e-4, 1e4)) * GPR_RBF(1.0, (1e-4, 1e4))
        model_kriging = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=1e-2)
        model_kriging.fit(X_train, y_train)
        models[f'Kriging.{target}'] = model_kriging
        pred_kriging = model_kriging.predict(X_train)
        predictions[f'Kriging.{target}'] = pred_kriging
    except Exception as e:
        print(f"Kriging.{target} training failed: {e}")
        pred_kriging = np.zeros_like(y_train)
        predictions[f'Kriging.{target}'] = pred_kriging

    # --- SVR model ---
    print(f"Training SVR.{target}.model")
    model_svr = SVR(kernel='rbf', C=1.0, gamma='scale')
    model_svr.fit(X_train, y_train)
    models[f'SVR.{target}'] = model_svr
    pred_svr = model_svr.predict(X_train)
    predictions[f'SVR.{target}'] = pred_svr

    # --- Save each model as .pkl (the .h5 copy above is the safer format for Keras) ---
    with open(f'models/LSTM.{target}.model.pkl', 'wb') as f:
        pickle.dump(model_lstm, f)
    with open(f'models/Kriging.{target}.model.pkl', 'wb') as f:
        pickle.dump(model_kriging, f)
    with open(f'models/SVR.{target}.model.pkl', 'wb') as f:
        pickle.dump(model_svr, f)

    # --- Model evaluation ---
    results = []
    for name in [f'LSTM.{target}', f'Kriging.{target}', f'SVR.{target}']:
        y_pred = predictions[name]
        mse = mean_squared_error(y_train, y_pred)
        r2 = r2_score(y_train, y_pred)
        results.append({'Model': name, 'MSE': mse, 'R²': r2})

    result_df = pd.DataFrame(results)
    result_df.to_csv(f'results/{target}_evaluation.csv', index=False)
    print(result_df)

# ----------------------------
# Step 4: "De-normalized" versions of the 9 models
# ----------------------------
# Note: the models themselves do not need inverse normalization, only their
# predictions do. The normalized models are kept as-is, and the inverse
# transform is applied uniformly in the wrapper built at Step 9.
print("The 9 base models and their evaluation results have been saved.")

# ----------------------------
# Step 5: Build Ensemble.E/F/G.model and optimize W.X1, W.X2, W.X3 with PSO
# ----------------------------
ensemble_weights = {}

def create_ensemble_prediction(weights, preds):
    w1, w2, w3 = weights
    return w1 * preds[0] + w2 * preds[1] + w3 * preds[2]

def objective_pso(weights, y_true, preds):
    pred = create_ensemble_prediction(weights, preds)
    r2 = r2_score(y_true, pred)
    return -r2  # minimizing -R² maximizes R²

bounds = ([0, 0, 0], [1, 1, 1])  # weight search range [0, 1]

print("Optimizing the ensemble weights with PSO...")
for target in output_cols:
    y_true = Y_train[:, output_cols.index(target)]
    preds = [
        predictions[f'LSTM.{target}'],
        predictions[f'Kriging.{target}'],
        predictions[f'SVR.{target}']
    ]

    def obj_wrapper(weights):
        # Penalize violations of the sum-to-one constraint
        if abs(sum(weights) - 1.0) > 1e-3:
            penalty = 1e6 * abs(sum(weights) - 1.0)
            return -r2_score(y_true, create_ensemble_prediction(weights, preds)) + penalty
        return objective_pso(weights, y_true, preds)

    print(f"Optimizing the weights of Ensemble.{target}.model...")
    best_weights, _ = ps.pso(obj_wrapper, lb=bounds[0], ub=bounds[1], swarmsize=30, maxiter=100)
    best_weights = best_weights / np.sum(best_weights)  # renormalize so the weights sum to 1
    ensemble_weights[f'Ensemble.{target}'] = best_weights

    # Final weighted prediction
    final_pred = create_ensemble_prediction(best_weights, preds)
    r2_final = r2_score(y_true, final_pred)
    print(f"Ensemble.{target}.model | weights: {best_weights}, R²: {r2_final:.4f}")

    # Save the ensemble (weights + member names)
    ensemble_model = {
        'members': [f'LSTM.{target}', f'Kriging.{target}', f'SVR.{target}'],
        'weights': best_weights,
        'scaler_Y_col': scaler_Y,  # for inverse normalization
        'target_index': output_cols.index(target)
    }
    with open(f'models/Ensemble.{target}.model.pkl', 'wb') as f:
        pickle.dump(ensemble_model, f)

# ----------------------------
# Step 6: Build the residuals Res.E, Res.F, Res.G
# ----------------------------
residuals = {}
for target in output_cols:
    y_true = Y_train[:, output_cols.index(target)]
    preds = [
        predictions[f'LSTM.{target}'],
        predictions[f'Kriging.{target}'],
        predictions[f'SVR.{target}']
    ]
    weights = ensemble_weights[f'Ensemble.{target}']
    ensemble_pred = create_ensemble_prediction(weights, preds)
    res = y_true - ensemble_pred
    residuals[f'Res.{target}'] = res
    np.save(f'results/Res.{target}.npy', res)

print("Residuals computed.")

# ----------------------------
# Step 7: GA-BP residual models P_Res.E, P_Res.F, P_Res.G
# ----------------------------
def create_bpnn(individual):
    """
    Build a BP network from a GA individual:
    individual = [n_neurons_layer1, n_neurons_layer2, log10(learning_rate)]
    """
    n1 = max(1, int(individual[0]))  # clamp so mutation cannot produce an empty layer
    n2 = int(individual[1])
    lr = 10 ** individual[2]  # learning rate searched in log space
    model = tf.keras.Sequential()
    model.add(Dense(n1, input_dim=4, activation='relu'))
    if n2 > 0:
        model.add(Dense(n2, activation='relu'))
    model.add(Dense(1, activation='linear'))
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
    return model

# GA setup
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

def evaluate_ga_bp(individual, X_train, residual):
    try:
        model = create_bpnn(individual)
        # Short training run to gauge performance
        model.fit(X_train, residual, epochs=50, batch_size=8, verbose=0)
        pred = model.predict(X_train, verbose=0).flatten()
        r2 = r2_score(residual, pred)
        # Simplification: fitness is training R² only (no overfitting penalty)
        return (r2,)
    except Exception:
        return (-1e6,)

print("Optimizing the BPNN structure with GA to fit the residuals...")
for target in output_cols:
    residual = residuals[f'Res.{target}']

    toolbox = base.Toolbox()
    toolbox.register("n1", np.random.randint, 5, 64)
    toolbox.register("n2", np.random.randint, 0, 32)
    toolbox.register("lr", np.random.uniform, -5, -1)  # log10(lr) in [-5, -1] -> lr in [1e-5, 0.1]
    toolbox.register("individual", tools.initCycle, creator.Individual,
                     (toolbox.n1, toolbox.n2, toolbox.lr), n=1)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("mate", tools.cxBlend, alpha=0.5)
    toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=1, indpb=0.2)
    toolbox.register("select", tools.selTournament, tournsize=3)
    toolbox.register("evaluate", evaluate_ga_bp, X_train=X_train, residual=residual)

    pop = toolbox.population(n=20)
    hof = tools.HallOfFame(1)
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", np.mean)
    stats.register("max", np.max)

    print(f"GA optimization for P_Res.{target} ...")
    pop, logbook = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=30,
                                       stats=stats, halloffame=hof, verbose=True)

    best_ind = hof[0]
    print(f"P_Res.{target} best parameters: neurons=[{int(best_ind[0])},{int(best_ind[1])}], "
          f"lr={10**best_ind[2]:.2e}")

    # Retrain the best model for longer
    best_model = create_bpnn(best_ind)
    best_model.fit(X_train, residual, epochs=100, batch_size=8, verbose=0)

    with open(f'models/P_Res.{target}.model.pkl', 'wb') as f:
        pickle.dump({'model': best_model, 'params': best_ind}, f)

    pred_res = best_model.predict(X_train, verbose=0).flatten()
    r2_res = r2_score(residual, pred_res)
    print(f"P_Res.{target}.model R² on residual: {r2_res:.4f}")

# ----------------------------
# Step 8: Final models Final.E/F/G.model = Ensemble + P_Res
# ----------------------------
print("Building the final models Final.E/F/G.model ...")
final_models = {}
for target in output_cols:
    with open(f'models/Ensemble.{target}.model.pkl', 'rb') as f:
        ens_model = pickle.load(f)
    with open(f'models/P_Res.{target}.model.pkl', 'rb') as f:
        pres_model = pickle.load(f)['model']

    final_model = {
        'ensemble': ens_model,
        'pres': pres_model,
        'scaler_X': scaler_X,  # stored so the wrapper can normalize raw inputs
        'scaler_Y': scaler_Y,
        'target_idx': output_cols.index(target),
        'target': target
    }
    final_models[f'Final.{target}'] = final_model

    with open(f'models/Final.{target}.model.pkl', 'wb') as f:
        pickle.dump(final_model, f)

# ----------------------------
# Step 9: Inverse normalization -> Original.E/F/G.model
# ----------------------------
print("Applying inverse normalization to produce Original.E/F/G.model ...")

# Defined at module level (not inside the loop) so pickled instances resolve
# correctly and the target name is not captured by a late-binding closure.
class OriginalModel:
    """Wrapper: predicts in normalized space, then inverts the scaling."""
    def __init__(self, final_model):
        self.final_model = final_model
        self.target = final_model['target']

    def predict(self, X_input):
        # X_input holds raw (un-normalized) inputs of shape (N, 4); normalize
        # them first, since all base models were trained in scaled space
        X_scaled_in = self.final_model['scaler_X'].transform(X_input)
        X_lstm = X_scaled_in.reshape((X_scaled_in.shape[0], 1, 4))
        target = self.target
        target_idx = self.final_model['target_idx']

        # Ensemble prediction (in normalized space)
        w = self.final_model['ensemble']['weights']
        lstm_m = load_model(f'models/LSTM.{target}.model.h5')
        svr_m = pickle.load(open(f'models/SVR.{target}.model.pkl', 'rb'))
        krg_m = pickle.load(open(f'models/Kriging.{target}.model.pkl', 'rb'))

        p1 = lstm_m.predict(X_lstm, verbose=0).flatten()
        p2 = krg_m.predict(X_scaled_in)
        p3 = svr_m.predict(X_scaled_in)
        ensemble_pred_scaled = w[0] * p1 + w[1] * p2 + w[2] * p3

        # Residual correction
        pres_pred_scaled = self.final_model['pres'].predict(X_scaled_in, verbose=0).flatten()

        # Sum in normalized space
        final_scaled = ensemble_pred_scaled + pres_pred_scaled

        # Inverse transform (pad into an (N, 3) array so the scaler applies)
        N = len(final_scaled)
        dummy = np.zeros((N, 3))
        dummy[:, target_idx] = final_scaled
        final_original = self.final_model['scaler_Y'].inverse_transform(dummy)
        return final_original[:, target_idx]

original_models = {}
for target in output_cols:
    with open(f'models/Final.{target}.model.pkl', 'rb') as f:
        final_model = pickle.load(f)

    wrapped_model = OriginalModel(final_model)
    original_models[f'Original.{target}'] = wrapped_model

    with open(f'models/Original.{target}.model.pkl', 'wb') as f:
        pickle.dump(wrapped_model, f)

# ----------------------------
# Step 10: Predict on the sample data with Original.E/F/G.model
# ----------------------------
print("Predicting on the raw ABCD inputs...")

# Raw (un-normalized) inputs
X_original = df[input_cols].values

results_final = {
    'A': X_original[:, 0],
    'B': X_original[:, 1],
    'C': X_original[:, 2],
    'D': X_original[:, 3],
}

for target in output_cols:
    with open(f'models/Original.{target}.model.pkl', 'rb') as f:
        model = pickle.load(f)
    pred = model.predict(X_original)
    results_final[f'Predicted_{target}'] = pred

result_df = pd.DataFrame(results_final)
print(result_df.head())
result_df.to_csv('results/final_predictions.csv', index=False)
print("✅ Pipeline complete! Results saved to results/final_predictions.csv")
```

---

## 🔍 Code walkthrough

| Step | Content |
|------|---------|
| **1. Data loading** | Read the seven columns A..G from the Excel file with `pandas.read_excel` |
| **2. Normalization** | Scale inputs and outputs to [0, 1] separately with `MinMaxScaler` |
| **3. Model training** | Train LSTM (Keras), Kriging (GPR), and SVR (sklearn), one per output, 9 models in total |
| **4. Evaluation and saving** | Evaluate with MSE and R²; save the models as `.pkl` |
| **5. Ensembles + PSO weights** | PSO via `pyswarm`, maximizing R² subject to `w1+w2+w3=1`, `wi>0` |
| **6. Residual modeling** | Treat the ensemble errors as a new regression target |
| **7. GA-BP residual predictors** | A DEAP-based GA tunes the BPNN structure (layers, neurons, learning rate) to fit the residuals |
| **8. Final composition** | `Final = Ensemble + P_Res` |
| **9. Inverse-normalization wrapper** | The `OriginalModel` class maps predictions back to the original scale automatically |
| **10. Prediction output** | Predict on the raw inputs and write a CSV table |

---

## ✅ Output summary

- ✅ 9 base-model `.pkl` files (LSTM/Kriging/SVR for each of E/F/G)
- ✅ 3 ensemble models `.pkl` (Ensemble.E/F/G)
- ✅ 3 residual models `.pkl` (P_Res.E/F/G)
- ✅ 3 final models `.pkl` (Final.E/F/G)
- ✅ 3 inverse-normalized models `.pkl` (Original.E/F/G)
- ✅ Per-model evaluation metrics (CSV)
- ✅ Final prediction table (raw inputs plus the three predicted outputs)

---

## ⚠️ Notes

1. **The Excel file must be named `data.xlsx`** and sit in the current directory.
2. **The column names must be A, B, C, D, E, F, G.**
3. With a small dataset, Kriging (GPR) can be unstable; a plain RBF interpolator (e.g. scipy's `Rbf`) can be substituted.
4. The LSTM uses a single time step; extend the reshape if you have true sequences.
5. The BPNN inside GA-BP is a shallow network, suited to small datasets.

---