2017.07.27 Review: ET vs RF, advanced slicing, argsort, barplot(yerr=), xgb.plot_importance

This post reviews the differences between ExtraRandomizedTrees (ET) and RandomForest (RF), focusing on sampling strategy and how the trees' outputs are combined. It also covers some advanced Python slicing (reversal, step selection), error bars on bar charts via the yerr parameter of plt.bar (and seaborn's automatic ones), matplotlib's xlim, and XGBoost's built-in feature-importance plot.

1. Wrote up the previous day's summary as soon as I got to the office.

2. Yesterday I helped a colleague get added to the whitelist, and while I was at it updated the channel rolling variable; it's running fine.

3. Read through the docs for ExtraRandomizedTrees and RandomForest. Random forest draws bootstrap samples with replacement, each the same size as the original training set, and considers a random subset of the features at each split; the trees are combined by averaging for regression and voting for classification (sklearn's implementation actually averages the predicted class probabilities). ExtraRandomizedTrees uses the full sample by default, with no resampling, and also works on feature subsets; its "extreme" randomness shows up in the splits: for each candidate feature a split threshold is generated at random, and the best of those candidate splits is then chosen. GBDT has two parameters for subsampling rows and columns. In every case the idea is to accept a bit more bias in exchange for lower variance, i.e. less overfitting. GBDT's base learners are CART trees (in sklearn they are regression trees, so the split criterion is a squared-error variant rather than Gini); I still need to double-check the split criteria used by random forest and the ET trees!
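For reference, this is how the sampling defaults look in sklearn (a minimal sketch I'm adding for illustration, not code from the post):

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# RandomForest: bootstrap=True by default, so each tree is trained on a
# with-replacement sample of the same size as the training set; max_features
# controls the random feature subset considered at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", bootstrap=True)

# ExtraTrees: bootstrap=False by default, so each tree sees the full training
# set; for every candidate feature a split threshold is drawn at random and
# the best of those random splits is kept
et = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", bootstrap=False)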

4. More meetings in the afternoon; nothing much came out of them anyway.

5. There are a few Python slicing idioms I never really mastered; a quick summary (small examples after the list):

  • a[::-1] reverses the list
  • a[:10:2] takes the first 10 elements, keeping every 2nd one
  • a[::5] takes every 5th element
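A few concrete examples (my own toy list):

a = list(range(20))
a[::-1]    # reversed copy: [19, 18, ..., 1, 0]
a[:10:2]   # first 10 elements, every 2nd: [0, 2, 4, 6, 8]
a[::5]     # every 5th element: [0, 5, 10, 15]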
6. numpy's argsort(): I've run into it plenty of times but keep forgetting what it does. Given an array-like, it returns the indices that would sort it; the thing to burn into memory is that it returns sorted indices, not sorted values.
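For example:

import numpy as np

scores = np.array([3.1, 0.5, 2.7])
order = np.argsort(scores)   # indices that would sort the array: [1, 2, 0]
scores[order]                # the sorted values: [0.5, 2.7, 3.1]
order[::-1]                  # descending order, handy for ranking feature importances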

7. A task the new colleague was working on needed to decide whether a number is an integer, since values like 2.0 are really integers; it turns out Python floats have a built-in method for this: (2.0).is_integer().
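A quick check:

(2.0).is_integer()     # True, the float has no fractional part
(2.5).is_integer()     # False
float(7).is_integer()  # True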

8. Finished copying that flashy EDA kernel. Its last two steps output feature importances from an ExtraTrees model and from XGBoost; XGBoost provides a method that plots them directly, which is convenient, whereas for the ET model you have to build the plot and set things up yourself.
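A minimal sketch of the manual ET version on the iris toy data (my own illustration, not the kernel's code):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names
et_model = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

order = np.argsort(et_model.feature_importances_)[::-1]   # descending, see item 6
plt.figure(figsize=(8, 6))
plt.barh(range(len(order)), et_model.feature_importances_[order][::-1])  # barh fills bottom-up
plt.yticks(range(len(order)), np.array(feature_names)[order][::-1])
plt.xlabel("feature importance")
plt.show()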

9. sns.barplot vs plt.bar: plt.bar takes a yerr argument, short for y error, which draws error bars on top of the bars; sns.barplot is genuinely slick, it estimates the error bars from the raw data and picks the colors automatically.
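A minimal illustration with made-up numbers:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# plt.bar: the error magnitudes are passed explicitly via yerr
plt.bar(range(3), [3.0, 5.0, 4.0], yerr=[0.4, 0.8, 0.3], capsize=4, tick_label=["a", "b", "c"])
plt.show()

# sns.barplot: pass the raw observations and it computes bar heights and error bars itself
df = pd.DataFrame({"group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
                   "value": np.random.default_rng(0).normal(4, 1, 15)})
sns.barplot(x="group", y="value", data=df)
plt.show()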

10. plt.xlim, short for x limit, restricts the range of the x axis. Honestly it's rarely necessary; the result is about the same as the default autoscaling.
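For example:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100)
plt.plot(x, np.sqrt(x))
plt.xlim(20, 80)   # show only x in [20, 80]; drop this line and autoscaling covers the full range
plt.show()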

11. XGBoost's built-in feature-importance plot:

fig, ax = plt.subplots(figsize=(12, 12))
# xgboost's built-in importance plot takes an Axes object;
# note the correct keyword is max_num_features
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
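By default plot_importance uses importance_type='weight' (how many times each feature is used in a split); passing importance_type='gain' or 'cover' often gives a noticeably different ranking.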
