Day 37: Early Stopping and Saving Model Weights

Today's tasks:
  1. Detecting overfitting: print metrics on the training set and test set side by side
  2. Saving and loading models: weights only; weights plus model structure; a full checkpoint that also includes the training state
  3. Early stopping

Homework: train on the credit dataset and save the weights, then load the weights, continue training for 50 epochs, and apply early stopping.

Saving and Loading Models

In deep learning, saving and loading a model mainly involves storing the parameters (weights) and, optionally, the full model structure, while also keeping the training state (optimizer parameters, current epoch, etc.) to support resuming from a checkpoint.

In PyTorch, torch.save() performs the saving; it supports the following mechanisms:

(1) Saving only the model parameters

state_dict stores all of the model's learnable parameters (weights and biases) in a dictionary; this is the right choice for inference.

  • Pros: small file (parameters only, no model structure code) and safe (no dependency on the class definition)
  • Cons: the model structure must be defined before loading (instantiate the same architecture used in training)
# Save the parameters
torch.save(model.state_dict(), 'model_weights.pth')
# To load, first define the model structure
model = MLP()  # same architecture as in training
model.load_state_dict(torch.load('model_weights.pth'))  # load the parameters
# model.eval()  # switch to inference mode
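
To see what a state_dict actually contains, you can iterate over it; a quick sketch (assuming the two-layer MLP used later in this post, with layers fc1 and fc2):

# Each state_dict entry maps a parameter name to a tensor
for name, tensor in model.state_dict().items():
    print(f'{name}: {tuple(tensor.shape)}')
# For the MLP below this prints fc1.weight, fc1.bias, fc2.weight, fc2.bias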

(2) Saving the entire model

This saves the whole model, structure and parameters together; it suits quick, throwaway debugging.

  • Pros: no need to define the model class before loading
  • Cons: depends on the exact Python class definition; larger file; potential safety issues (the code environment must match)
# Save the model structure and parameters together
torch.save(model, 'full_model.pth')
# Load without defining the class first, but the environment must match
model = torch.load('full_model.pth')  # load the saved model
# model.eval()  # switch to inference mode
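
One version caveat (check against your own PyTorch release): newer versions of PyTorch (2.6+) changed the default of torch.load to weights_only=True, which refuses to unpickle whole model objects. If loading a full model raises that error, the flag must be passed explicitly:

# Allow full unpickling; only do this for files you trust
model = torch.load('full_model.pth', weights_only=False)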

(3) Saving the training state

This saves the model parameters, optimizer state, current epoch, loss value, and so on; it suits resuming interrupted training (long-running jobs).

Train → epoch 99 finishes → save checkpoint (with epoch=99)
↓
Program interrupted / restarted
↓
Load checkpoint
  ├─ Define the model structure first, restore the model parameters
  ├─ Create the optimizer, restore the optimizer state
  ├─ Read epoch=99, best_loss=0.123
↓
Resume training from epoch 100
# Save a checkpoint
checking_point = {
    'model_state': model.state_dict(),          # current learnable parameters
    'optimizer_state': optimizer.state_dict(),  # optimizer state
    'epoch': epoch,     # number of epochs completed (so we know where to resume)
    'loss': best_loss,  # best loss so far, for early stopping or model selection
}
torch.save(checking_point, 'checking_point.pth')  # save

# Load
model = MLP()  # define the model structure first: only parameters were saved
optimizer = optim.SGD(model.parameters(), lr=0.01)  # create the optimizer

checking_point = torch.load('checking_point.pth')  # load the checkpoint
model.load_state_dict(checking_point['model_state'])          # restore the parameters
optimizer.load_state_dict(checking_point['optimizer_state'])  # restore the optimizer state
start_epoch = checking_point['epoch'] + 1  # resume from the next epoch
best_loss = checking_point['loss']         # restore the best loss

# Continue training
for epoch in range(start_epoch, num_epochs):
    # training loop continues here
    pass
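
If the checkpoint was saved on a GPU but is being restored on a CPU-only machine (or vice versa), torch.load can remap tensors at load time via its map_location argument; a minimal sketch:

# Remap the checkpoint's tensors to whatever device is available now
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
checking_point = torch.load('checking_point.pth', map_location=device)
model.load_state_dict(checking_point['model_state'])
model.to(device)  # keep the model on the same device as its parameters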

To summarize when each mechanism applies:

  • Parameters only (state_dict): deployment and inference, or sharing weights
  • Entire model: quick debugging in a fixed, trusted environment
  • Checkpoint with training state: resuming long training runs after an interruption

Overfitting

Overfitting: the model performs very well on the training set (fits it almost perfectly) but noticeably worse on new data (the validation or test set), i.e. it generalizes poorly. Concretely, the training error keeps falling while the validation error turns upward after some point.

Common causes of overfitting:

  • The model is too complex: a network with many parameters and layers
  • Too little or unrepresentative training data: not enough samples to learn generalizable patterns, so the model latches onto details
  • Training for too long: too many epochs (seeing the data too many times) lets the model memorize it

Since the symptom of overfitting is a gap between training-set and test-set performance, we can print both loss values in step during training and judge overfitting from how synchronously the two curves move.

Note: some layers behave differently during training and evaluation. If you manually switch to inference mode mid-training (to evaluate), remember to switch back to training mode afterwards.
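
For instance, Dropout is only active in training mode; a standalone snippet (not part of the training script) to see the difference:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 6)
drop.train()   # training mode: roughly half the entries zeroed, the rest scaled by 1/(1-p)=2
print(drop(x))
drop.eval()    # inference mode: identity, all ones pass through unchanged
print(drop(x))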

Only the training part changes; everything else stays the same:

# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

epoch_num = 20000
train_losses = []
test_losses = []
epochs = []

start_time = time.time()
with tqdm(total=epoch_num) as pbar:
    for epoch in range(epoch_num):
        # Forward pass on the training set
        train_output = model(X_train)
        train_loss = criterion(train_output, y_train)
        # Backward pass and optimization
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        # Logging
        if (epoch + 1) % 200 == 0:
            # Test-set loss, for later visualization
            model.eval()  # switch to inference mode
            with torch.no_grad():
                test_output = model(X_test)
                test_loss = criterion(test_output, y_test)
            model.train()  # switch back to training mode
            # Record the losses
            train_losses.append(train_loss.item())
            test_losses.append(test_loss.item())
            epochs.append(epoch + 1)

            pbar.set_postfix({'Train Loss': f'{train_loss.item():.4f}', 'Test Loss': f'{test_loss.item():.4f}'})
        # Update the progress bar
        if (epoch + 1) % 1000 == 0:
            pbar.update(1000)
    # Top up the progress bar
    if pbar.n < epoch_num:
        pbar.update(epoch_num - pbar.n)
end_time = time.time()
print(f'Training time: {end_time - start_time:.2f} seconds')

# Plot the loss curves
plt.figure(figsize=(10, 6))
plt.plot(epochs, train_losses, label='Train Loss')
plt.plot(epochs, test_losses, label='Test Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Test Loss over Epochs')
plt.legend()  # show the legend
plt.grid(True)
plt.show()

Early Stopping

There are many ways to fight overfitting, such as regularization, dropout (randomly disabling some neurons), and early stopping. This section covers early stopping. Its core idea: stop training early once the validation error stops improving.

The procedure (a reusable helper class is sketched after this list):

  1. Split the data into training, validation, and test sets.
  2. Train: at the end of each epoch, evaluate the model on both the training and validation sets (e.g. loss or accuracy), and keep monitoring the validation metric (usually the validation loss).
  3. Set a patience value (a positive integer): the number of consecutive epochs the validation metric is allowed to go without improving.
  4. Save the best model: whenever a new best validation score is reached, save the current model weights.
  5. Decide to stop: if the validation metric fails to beat the saved best for patience consecutive epochs, trigger early stopping and end training.
  6. Restore: after training stops, discard the final weights and load the best weights saved earlier.
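
Wrapped as a small helper, these steps look like the following minimal sketch (the class name and interface are my own, assuming torch is imported):

class EarlyStopping:
    """Stops training once the validation loss has not improved for `patience` checks."""
    def __init__(self, patience=50, path='best_model.pth'):
        self.patience = patience
        self.path = path            # where the best weights are saved
        self.best_loss = float('inf')
        self.count = 0
        self.should_stop = False

    def step(self, val_loss, model):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.count = 0
            torch.save(model.state_dict(), self.path)  # save the new best weights
        else:
            self.count += 1
            if self.count >= self.patience:
                self.should_stop = True
        return self.should_stop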

The inline implementation used in this post:

# 1 - add a validation set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import torch

# Load the data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=0.8, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, train_size=0.5, random_state=42)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Using device: {}'.format(device))

# Preprocessing
# Scale to [0, 1]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
# Convert to tensors
X_train = torch.FloatTensor(X_train).to(device)
y_train = torch.LongTensor(y_train).to(device)
X_val = torch.FloatTensor(X_val).to(device)
y_val = torch.LongTensor(y_val).to(device)
X_test = torch.FloatTensor(X_test).to(device)
y_test = torch.LongTensor(y_test).to(device)
# 2 - add early stopping to the training loop
train_losses = []
val_losses = []
epochs = []
#-------Early stopping parameters-----------
patience = 50  # counted in evaluation checks (one check every 200 epochs), not raw epochs
best_loss = float('inf')  # start at infinity so the first check always improves
best_epoch = 0            # epoch at which best_loss occurred
count = 0                 # consecutive checks without improvement
early_stopping = False    # early-stop flag
#---------------------------------

start_time = time.time()
with tqdm(total=epoch_num) as pbar:
    for epoch in range(epoch_num):
        # Forward pass on the training set
        train_output = model(X_train)
        train_loss = criterion(train_output, y_train)
        # Backward pass and optimization
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        # Logging
        if (epoch + 1) % 200 == 0:
            # Validation loss, for early stopping and later visualization
            model.eval()  # switch to inference mode
            with torch.no_grad():
                val_output = model(X_val)
                val_loss = criterion(val_output, y_val)
            model.train()  # switch back to training mode
            # Record the losses
            train_losses.append(train_loss.item())
            val_losses.append(val_loss.item())
            epochs.append(epoch + 1)

            pbar.set_postfix({'Train Loss': f'{train_loss.item():.4f}', 'Val Loss': f'{val_loss.item():.4f}'})

            #-------Early stopping logic-----------
            if val_loss.item() < best_loss:
                best_loss = val_loss.item()
                best_epoch = epoch + 1
                count = 0  # reset: any improvement within the patience window restarts the count
                # Save the weights so the best model can be reloaded later
                torch.save(model.state_dict(), 'best_model.pth')
            else:
                count += 1
                if count >= patience:
                    print(f"Early stopping triggered at epoch {epoch+1}: validation loss has not improved for {patience} checks.")
                    print(f"Best validation loss {best_loss:.4f} was reached at epoch {best_epoch}.")
                    early_stopping = True
                    break  # leave the training loop
            #---------------------------------

        # Update the progress bar
        if (epoch + 1) % 1000 == 0:
            pbar.update(1000)
    # Top up the progress bar
    if pbar.n < epoch_num:
        pbar.update(epoch_num - pbar.n)
end_time = time.time()
print(f'Training time: {end_time - start_time:.2f} seconds')

#-------Load the best model for final evaluation-----------
if early_stopping:
    print(f"Loading the best model from epoch {best_epoch} for final evaluation...")
    model.load_state_dict(torch.load('best_model.pth'))
#---------------------------------

Homework

Homework: train on the credit dataset and save the weights, then load the weights, continue training for 50 epochs, and apply early stopping.

My first run used a fairly wide hidden layer and a large epoch count. Early stopping did trigger, but the curves show the validation loss sitting clearly above the training loss late in training, i.e. obvious overfitting; tuning the hyperparameters and adding regularization are the next things to try (a sketch follows below).

After reducing the hidden layer to 8 units, the final accuracy was 76.93% over 4500 + 50 epochs; early stopping did not trigger, and the training and validation loss curves stay close together, so overfitting is mild.
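
As a possible follow-up (my own sketch, not part of the original runs): enable the Dropout layer that is commented out in the model below, and add L2 regularization through Adam's weight_decay parameter.

class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(31, 8)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)  # randomly zero 30% of activations during training
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.dropout(out)  # only active in train() mode
        out = self.fc2(out)
        return out

# weight_decay adds an L2 penalty on the weights at each update
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)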

The complete code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import torch
import torch.nn as nn
import torch.optim as optim  # optimizer
from tqdm import tqdm  # progress bar
import time

# Load the data
data = pd.read_csv('data.csv')

# Preprocessing, part 1
# Encode the categorical features
mapping_dict = {
    'Home Ownership':{
        'Own Home':0,
        'Rent':1,
        'Home Mortgage':2,
        'Have Mortgage':3
    },
    'Years in current job':{
        '< 1 year': 0,
        '1 year': 1,
        '2 years': 2,
        '3 years': 3,
        '4 years': 4,
        '5 years': 5,
        '6 years': 6,
        '7 years': 7,
        '8 years': 8,
        '9 years': 9,
        '10+ years': 10
    },
    'Term':{
        'Short Term':0,
        'Long Term':1
    }
}  # mapping dictionary
data['Home Ownership'] = data['Home Ownership'].map(mapping_dict['Home Ownership'])
data['Years in current job'] = data['Years in current job'].map(mapping_dict['Years in current job'])
data['Term'] = data['Term'].map(mapping_dict['Term'])  # 0/1 mapping
data.rename(columns={'Term': 'Long Term'}, inplace=True)  # rename the column

df = pd.get_dummies(data, columns=['Purpose'])  # one-hot encoding
new_features = []  # columns created by the encoding
for i in df.columns:
    if i not in data.columns:
        new_features.append(i)
for j in new_features:
    df[j] = df[j].astype(int)  # convert bool to int

# Fill missing values with the column mode
for col in df.columns.tolist():
    mode_value = df[col].mode()[0]
    df[col] = df[col].fillna(mode_value)

# Select the device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Using device: {}'.format(device))
# Preprocessing, part 2
# Scale to [0, 1]
X = df.drop(columns=['Credit Default'], axis=1)
y = df['Credit Default']
X_train,X_temp,y_train,y_temp = train_test_split(X,y,train_size=0.8,random_state=42)
X_val,X_test,y_val,y_test = train_test_split(X_temp,y_temp,train_size=0.5,random_state=42)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Convert to tensors
X_train = torch.FloatTensor(X_train).to(device)
y_train = torch.LongTensor(y_train.to_numpy()).to(device)
X_val = torch.FloatTensor(X_val).to(device)
y_val = torch.LongTensor(y_val.to_numpy()).to(device)
X_test = torch.FloatTensor(X_test).to(device)
y_test = torch.LongTensor(y_test.to_numpy()).to(device)

# Model definition
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(31, 8)
        self.relu = nn.ReLU()
        # self.dropout = nn.Dropout(0.3)  # dropout to reduce overfitting (disabled here)
        self.fc2 = nn.Linear(8,2)

    def forward(self,x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

model = MLP().to(device)

# Model training
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

train_losses = []
val_losses = []
epochs = []

#-------Early stopping parameters-----------
patience = 50  # counted in evaluation checks, not raw epochs
best_loss = float('inf')  # start at infinity so the first check always improves
best_epoch = 0            # epoch at which best_loss occurred
count = 0                 # consecutive checks without improvement
early_stopping = False    # early-stop flag
#---------------------------------

# Two training stages
first_stage_epochs = 4500   # stage-1 epochs
second_stage_epochs = 50    # stage-2 epochs

# Time the training
start_time = time.time()
print('---------Stage 1 training----------')
with tqdm(total=first_stage_epochs, desc='Stage 1', unit='epoch') as pbar:
    for epoch in range(first_stage_epochs):
        # Forward pass on the training set
        train_output = model(X_train)
        train_loss = criterion(train_output, y_train)
        # Backward pass and optimization
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        # Logging
        if (epoch + 1) % 200 == 0:
            # Validation loss, for early stopping and later visualization
            model.eval()  # switch to inference mode
            with torch.no_grad():
                val_output = model(X_val)
                val_loss = criterion(val_output, y_val)
            model.train()  # switch back to training mode
            # Record the losses
            train_losses.append(train_loss.item())
            val_losses.append(val_loss.item())
            epochs.append(epoch + 1)

            pbar.set_postfix({'Train Loss': f'{train_loss.item():.4f}', 'Val Loss': f'{val_loss.item():.4f}'})

            #-------Early stopping logic-----------
            if val_loss.item() < best_loss:
                best_loss = val_loss.item()
                best_epoch = epoch + 1
                count = 0  # reset on any improvement within the patience window
                # Save the weights so the best model can be reloaded later
                torch.save(model.state_dict(), 'best_model.pth')
            else:
                count += 1
                if count >= patience:
                    print(f"Early stopping triggered at epoch {epoch+1}: validation loss has not improved for {patience} checks.")
                    print(f"Best validation loss {best_loss:.4f} was reached at epoch {best_epoch}.")
                    early_stopping = True
                    break  # leave the training loop
            #---------------------------------

        # Update the progress bar
        if (epoch + 1) % 1000 == 0:
            pbar.update(1000)
    # Top up the progress bar
    if pbar.n < first_stage_epochs:
        pbar.update(first_stage_epochs - pbar.n)

# Save the stage-1 weights
torch.save(model.state_dict(), 'first_stage_model.pth')
print("Stage 1 finished; model weights saved to 'first_stage_model.pth'")

# ========== Load the weights and continue training ==========
print("\n=== Loading weights and continuing training ===")
model.load_state_dict(torch.load('first_stage_model.pth'))
print("Stage 1 weights loaded")

# Stage 2 training
with tqdm(total=second_stage_epochs, desc="Stage 2") as pbar:
    for epoch in range(second_stage_epochs):
        # Forward pass on the training set
        train_output = model(X_train)
        train_loss = criterion(train_output, y_train)

        # Backward pass and optimization
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        # Overall epoch index across both stages, for logging and plots
        current_epoch_total = first_stage_epochs + epoch + 1

        if (epoch + 1) % 10 == 0:  # log more often in stage 2
            # Validation loss
            model.eval()
            with torch.no_grad():
                val_output = model(X_val)
                val_loss = criterion(val_output, y_val)
            model.train()

            # Record the losses
            train_losses.append(train_loss.item())
            val_losses.append(val_loss.item())
            epochs.append(current_epoch_total)

            pbar.set_postfix({'Train Loss': f'{train_loss.item():.4f}', 'Val Loss': f'{val_loss.item():.4f}'})

            #-------Early stopping (still active in stage 2)-----------
            if val_loss.item() < best_loss:
                best_loss = val_loss.item()
                best_epoch = current_epoch_total
                count = 0
                torch.save(model.state_dict(), 'best_model.pth')
            else:
                count += 1
                if count >= patience:
                    print(f"Early stopping triggered at epoch {current_epoch_total}: validation loss has not improved for {patience} checks.")
                    print(f"Best validation loss {best_loss:.4f} was reached at epoch {best_epoch}.")
                    early_stopping = True
                    break
            #---------------------------------

        # Update the progress bar
        pbar.update(1)

end_time = time.time()
print(f'Training time: {end_time - start_time:.2f} seconds')

#-------Load the best model for final evaluation-----------
if early_stopping:
    print(f"Loading the best model from epoch {best_epoch} for final evaluation...")
    model.load_state_dict(torch.load('best_model.pth'))
#---------------------------------

# Plot the loss curves
plt.figure(figsize=(10, 6))
plt.plot(epochs, train_losses, label='Train Loss')
plt.plot(epochs, val_losses, label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Val Loss over Epochs')
plt.legend()  # show the legend
plt.grid(True)
plt.show()

# Model evaluation
model.eval()  # switch to evaluation mode
with torch.no_grad():
    out = model(X_test)  # run the test set through the model
    _, predictions = torch.max(out, dim=1)  # index of the largest logit = predicted label

    # Accuracy
    correct_num = (predictions == y_test).sum().item()  # convert to a Python int
    accuracy = correct_num / y_test.size(0)
    print('Accuracy: {:.2f}%'.format(accuracy * 100))
