ModuleNotFoundError: No module named 'torch_sparse.unique_cuda'

This post describes a problem where the CUDA environment variables are not picked up while installing the torch_sparse library, and walks through the fix: manually installing CUDA and cuDNN, then reinstalling torch_sparse correctly.

Cause: when running pip install torch_sparse, the compiler did not find the CUDA or cuDNN environment variables, so it built a CPU-only version of torch_sparse and left out the CUDA-related extension modules (such as torch_sparse.unique_cuda, hence the error above).
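
Before reinstalling anything, it helps to confirm whether the PyTorch build in the active environment can see CUDA at all. A minimal diagnostic sketch (run inside the same conda environment, e.g. the pytorch env used in the log below):

import os
import torch
from torch.utils.cpp_extension import CUDA_HOME

# CUDA version PyTorch itself was compiled against; None means a CPU-only torch
print("torch.version.cuda :", torch.version.cuda)
# Whether a GPU and driver are visible at runtime
print("cuda available     :", torch.cuda.is_available())
# CUDA toolkit location that C++/CUDA extension builds resolve to
print("CUDA_HOME          :", CUDA_HOME)
print("CUDA_PATH env var  :", os.environ.get("CUDA_PATH"))

If CUDA_HOME resolves to None here, a source build of torch_sparse will most likely fall back to a CPU-only compile, which reproduces the ModuleNotFoundError.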

Solution:

Manually install CUDA and cuDNN

Reference tutorial:

https://blog.youkuaiyun.com/Mind_programmonkey/article/details/99688839#commentBox

Note the version compatibility between CUDA, cuDNN, and Visual Studio. For example: CUDA 10.0, cuDNN 7.6, and VS 2017.
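Once CUDA and cuDNN are installed, you can double-check from Python which versions the installed PyTorch actually links against before picking the matching Visual Studio version (a small sanity check, not specific to torch_sparse):

import torch

print("CUDA version :", torch.version.cuda)               # e.g. '10.0'
print("cuDNN version:", torch.backends.cudnn.version())   # e.g. 7600 for cuDNN 7.6
print("cuDNN enabled:", torch.backends.cudnn.enabled)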

Reinstall torch_sparse

(pytorch) C:\Users\MONKEY\Desktop>pip uninstall torch-sparse
Uninstalling torch-sparse-0.4.0:
  Would remove:
    d:\miniconda3\envs\pytorch\lib\site-packages\test\*
    d:\miniconda3\envs\pytorch\lib\site-packages\torch_sparse-0.4.0.dist-info\*
    d:\miniconda3\envs\pytorch\lib\site-packages\torch_sparse\*
  Would not remove (might be manually added):
    d:\miniconda3\envs\pytorch\lib\site-packages\test\test_backward.py
    d:\miniconda3\envs\pytorch\lib\site-packages\test\test_forward.py
    d:\miniconda3\envs\pytorch\lib\site-packages\test\test_max_min.py
    d:\miniconda3\envs\pytorch\lib\site-packages\test\test_multi_gpu.py
    d:\miniconda3\envs\pytorch\lib\site-packages\test\test_std.py
Proceed (y/n)? y
  Successfully uninstalled torch-sparse-0.4.0
(pytorch) C:\Users\MONKEY\Desktop>pip install torch-sparse
Collecting torch-sparse
  Using cached https://files.pythonhosted.org/packages/b0/0a/2ff678e0d04e524dd2cf990a6202ced8c0ffe3fe6b08e02f25cc9fd27da0/torch_sparse-0.4.0.tar.gz
Requirement already satisfied: scipy in d:\miniconda3\envs\pytorch\lib\site-packages (from torch-sparse) (1.3.1)
Requirement already satisfied: numpy>=1.13.3 in d:\miniconda3\envs\pytorch\lib\site-packages (from scipy->torch-sparse) (1.16.5)
Building wheels for collected packages: torch-sparse
  Building wheel for torch-sparse (setup.py) ... done
  Created wheel for torch-sparse: filename=torch_sparse-0.4.0-cp36-cp36m-win_amd64.whl size=285224 sha256=7a88deb0de81c8b0095ac5362aaabe291737e8d3aa587b5b0e6309a706be092b
  Stored in directory: C:\Users\MONKEY\AppData\Local\pip\Cache\wheels\9d\83\0a\38ea460df5586a075b877fe089619e5238487712a0645940bd
Successfully built torch-sparse
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.4.0
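
After reinstalling, a quick way to confirm that the CUDA extension was actually compiled is to import the module named in the original error (a sketch; the module layout assumed here is that of torch_sparse 0.4.0, the version in this log):

import torch_sparse

print(torch_sparse.__version__)
# On a CUDA-enabled build this import succeeds; a CPU-only build raises
# ModuleNotFoundError: No module named 'torch_sparse.unique_cuda'
import torch_sparse.unique_cuda
print("torch_sparse CUDA extension loaded")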