Complete Example of AUC-ROC Evaluation and Tuning with scikit-learn

奋斗者1号

Article link: https://blog.youkuaiyun.com/xinjichenlibing/article/details/146466518

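AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is a key metric for evaluating binary classifiers. Because it measures how well a model ranks positives above negatives, independent of any single decision threshold, it is widely used on imbalanced datasets, although under extreme imbalance it can look optimistic and is best read alongside precision and recall. Below is a complete example of AUC-ROC evaluation and model tuning.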

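1. What Is the AUC-ROC Curve

The ROC curve plots the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)) as the decision threshold varies. AUC is the area under that curve and ranges from 0 to 1: 0.5 corresponds to random guessing, 1.0 to a perfect ranking, and higher values mean better performance. Equivalently, AUC is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.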
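To make the definition concrete, here is a minimal sketch that computes TPR and FPR by hand for a few thresholds and then checks the area with scikit-learn. The four labels and scores are made-up toy values, and the helper `tpr_fpr_at` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data (invented for illustration): two negatives, two positives
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def tpr_fpr_at(threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return tpr, fpr


# Each threshold yields one point on the ROC curve
for t in (0.8, 0.4, 0.35, 0.1):
    print(t, tpr_fpr_at(t))

# scikit-learn sweeps all useful thresholds and integrates the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)  # 0.75
```

The value 0.75 matches the ranking interpretation: of the four positive/negative pairs, three are ordered correctly (only the positive scored 0.35 falls below the negative scored 0.4).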

2. Complete Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc, roc_auc_score, classification_report, confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
import seaborn as sns

# Fix the random seed so results are reproducible
np.random.seed(42)

# 1. Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, 
                          n_informative=10, n_redundant=5, n_clusters_per_class=3,
                          weights=[0.9, 0.1], random_state=42)  # 90% class 0, 10% class 1

# 2. Split the data (stratified, so both sets keep the 90/10 class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, 
                                                   random_state=42, stratify=y)

# 3. Build a pipeline that combines preprocessing and the model
def create_pipeline(model):
    return Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model)
    ])

# 4. Define the models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(random_state=42, probability=True)
}

# 5. Train and evaluate the baseline models
results = {}
plt.figure(figsize=(12, 8))

for name, model in models.items():
    # Build and fit the pipeline
    pipeline = create_pipeline(model)
    pipeline.fit(X_train, y_train)

    # Predicted probability of the positive class on the test set
    y_prob = pipeline.predict_proba(X_test)[:, 1]

    # Compute the ROC curve and the area under it
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    results[name] = roc_auc

    # Plot the ROC curve
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')

# Plot the random-guess baseline
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess (AUC = 0.500)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves of Different Models')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.savefig('roc_curves_comparison.png')
plt.close()

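print("Baseline AUC-ROC scores:")
for name, score in results.items():
    print(f"{name}: {score:.4f}")

# 6. Pick the best-performing model for hyperparameter tuning
# (selecting on test-set AUC is kept here for simplicity; it makes the later
# test scores slightly optimistic, and a cross-validated AUC on the training
# set would be a cleaner selection criterion)
best_model_name = max(results, key=results.get)
print(f"\nBest baseline model: {best_model_name}, AUC = {results[best_model_name]:.4f}")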

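# 7. Hyperparameter grid for the selected model
if best_model_name == 'Logistic Regression':
    # A list of grids keeps only valid penalty/solver combinations:
    # 'elasticnet' is supported by the 'saga' solver alone and needs l1_ratio.
    # The l1_ratio values below are illustrative assumptions.
    param_grid = [
        {
            'classifier__C': [0.01, 0.1, 1, 10, 100],
            'classifier__penalty': ['l1', 'l2'],
            'classifier__solver': ['liblinear', 'saga'],
            'classifier__class_weight': [None, 'balanced']
        },
        {
            'classifier__C': [0.01, 0.1, 1, 10, 100],
            'classifier__penalty': ['elasticnet'],
            'classifier__solver': ['saga'],
            'classifier__l1_ratio': [0.25, 0.5, 0.75],
            'classifier__class_weight': [None, 'balanced']
        }
    ]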
elif best_model_name == 'Random Forest':
    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [None, 10, 20, 30],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4],
        'classifier__class_weight': [None, 'balanced', 'balanced_subsample']
    }
elif best_model_name == 'Gradient Boosting':
    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__learning_rate': [0.01, 0.1, 0.2],
        'classifier__max_depth': [3, 5, 7],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__subsample': [0.8, 0.9, 1.0]
    }
elif best_model_name == 'SVM':
    param_grid = {
        'classifier__C': [0.1, 1, 10, 100],
        'classifier__gamma': ['scale', 'auto', 0.1, 0.01],
        'classifier__kernel': ['rbf', 'linear', 'poly'],
        'classifier__class_weight': [None, 'balanced']
    }

# 8. Tune hyperparameters with grid search, using AUC as the scoring metric
pipeline = create_pipeline(models[best_model_name])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    pipeline, 
    param_grid=param_grid,
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("\nBest parameter combination:")
for param, value in grid_search.best_params_.items():
    print(f"{param}: {value}")

# 9. Evaluate the tuned model on the test set
best_model = grid_search.best_estimator_
y_prob_tuned = best_model.predict_proba(X_test)[:, 1]
y_pred_tuned = best_model.predict(X_test)

# AUC after tuning (the change can be negative if tuning did not help)
tuned_auc = roc_auc_score(y_test, y_prob_tuned)
print(f"\nAUC after tuning: {tuned_auc:.4f}")
print(f"AUC before tuning: {results[best_model_name]:.4f}")
print(f"Change: {tuned_auc - results[best_model_name]:+.4f}")

# 10. Compare ROC curves before and after tuning
plt.figure(figsize=(10, 8))

# Original (untuned) model
original_model = create_pipeline(models[best_model_name])
original_model.fit(X_train, y_train)
y_prob_original = original_model.predict_proba(X_test)[:, 1]
fpr_original, tpr_original, _ = roc_curve(y_test, y_prob_original)
auc_original = auc(fpr_original, tpr_original)

# Tuned model
fpr_tuned, tpr_tuned, _ = roc_curve(y_test, y_prob_tuned)
auc_tuned = auc(fpr_tuned, tpr_tuned)

# English labels also avoid missing-glyph boxes with matplotlib's default fonts
plt.plot(fpr_original, tpr_original, label=f'Original {best_model_name} (AUC = {auc_original:.3f})')
plt.plot(fpr_tuned, tpr_tuned, label=f'Tuned {best_model_name} (AUC = {auc_tuned:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess (AUC = 0.500)')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curves Before and After Tuning')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.savefig('roc_curve_tuned_vs_original.png')
plt.close()

# 11. Threshold analysis: find the best decision threshold
# Caveat: the threshold is chosen on the test set here to keep the example
# short, so the metrics reported for it are optimistic. In practice, choose it
# on a separate validation set or on cross-validated training predictions.
thresholds = np.linspace(0, 1, 100)
f1_scores = []
precision_scores = []
recall_scores = []
specificity_scores = []

for threshold in thresholds:
    y_pred_threshold = (y_prob_tuned >= threshold).astype(int)
    # labels=[0, 1] guarantees a 2x2 matrix even if only one class is predicted
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold, labels=[0, 1]).ravel()

    # Compute the metrics, guarding against division by zero
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    f1_scores.append(f1)
    precision_scores.append(precision)
    recall_scores.append(recall)
    specificity_scores.append(specificity)

# Plot the metrics across thresholds
plt.figure(figsize=(12, 8))
plt.plot(thresholds, precision_scores, label='Precision')
plt.plot(thresholds, recall_scores, label='Recall')
plt.plot(thresholds, specificity_scores, label='Specificity')
plt.plot(thresholds, f1_scores, label='F1 score')
plt.axvline(x=0.5, color='k', linestyle='--', label='Default threshold = 0.5')

# Find the threshold with the highest F1 score
best_f1_threshold = thresholds[np.argmax(f1_scores)]
plt.axvline(x=best_f1_threshold, color='r', linestyle='--', 
            label=f'Best F1 threshold = {best_f1_threshold:.2f}')

plt.xlabel('Classification threshold')
plt.ylabel('Metric value')
plt.title('Performance Metrics at Different Classification Thresholds')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.savefig('threshold_analysis.png')
plt.close()

print(f"\nThreshold with the best F1 score: {best_f1_threshold:.4f}")

# 12. Final evaluation using the best threshold
y_pred_best_threshold = (y_prob_tuned >= best_f1_threshold).astype(int)
final_report = classification_report(y_test, y_pred_best_threshold)
print("\nFinal classification report (using the best threshold):")
print(final_report)

# 13. Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred_best_threshold)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title(f'Confusion Matrix at Threshold {best_f1_threshold:.2f}')
plt.savefig('confusion_matrix.png')
plt.close()

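# 14. Probability calibration analysis (optional)
from sklearn.calibration import calibration_curve

plt.figure(figsize=(10, 8))

# Calibration curve of the tuned model on the test set
prob_true, prob_pred = calibration_curve(y_test, y_prob_tuned, n_bins=10)
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.plot(prob_pred, prob_true, 's-', label=f'Tuned {best_model_name}')

# Calibrate with Platt scaling (method='sigmoid'). The calibrator is fitted
# with 5-fold cross-validation on the training data only: fitting it on the
# test set and then scoring on that same test set would leak test labels.
calibrated_model = CalibratedClassifierCV(best_model, method='sigmoid', cv=5)
calibrated_model.fit(X_train, y_train)
y_prob_calibrated = calibrated_model.predict_proba(X_test)[:, 1]

# Add the calibrated model's curve
prob_true_cal, prob_pred_cal = calibration_curve(y_test, y_prob_calibrated, n_bins=10)
plt.plot(prob_pred_cal, prob_true_cal, 's-', label='Calibrated model')

plt.xlabel('Predicted probability')
plt.ylabel('Observed fraction of positives')
plt.title('Probability Calibration Curve')
plt.legend(loc='best')
plt.grid(alpha=0.3)
plt.savefig('calibration_curve.png')
plt.close()

# AUC before and after calibration (calibration mainly rescales probabilities,
# so the ranking, and hence the AUC, usually changes little)
cal_auc = roc_auc_score(y_test, y_prob_calibrated)
print(f"\nAUC before calibration: {tuned_auc:.4f}")
print(f"AUC after calibration: {cal_auc:.4f}")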

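3. Tuning Methods and When to Use Them

3.1 Model Selection

Different models behave differently with respect to AUC-ROC:

Model	Use case	Parameters to tune
Logistic regression	Roughly linear relationships; settings that need interpretability	C (inverse of regularization strength: smaller C means stronger regularization), penalty (regularization type), class_weight (class weights)
Random forest	Non-linear relationships, feature interactions, a robust model	n_estimators (number of trees), max_depth (tree depth), min_samples_split, class_weight
Gradient boosting	Complex non-linear relationships, when top performance is the goal	learning_rate, n_estimators, max_depth, subsample (fraction of samples drawn per tree)
SVM	High-dimensional features with a relatively small sample size	C, gamma, kernel (kernel function), class_weight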

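3.2 Tuning for Class-Imbalanced Data

Class weight adjustment:
- Parameter: class_weight='balanced' or custom weights
- Use case: when recall on the minority class matters
Threshold adjustment:
- Each threshold gives a different precision-recall trade-off
- Use case: when the default threshold of 0.5 is not the best choice
Resampling:
- Oversampling (e.g. SMOTE) or undersampling, applied to the training data only
- Use case: when the classes are extremely imbalanced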
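The first two options can be sketched in a few lines. This is a minimal illustration, assuming a synthetic 90/10 dataset like the one in section 2 and a plain logistic regression; the names `summary` and `recall_by_threshold` are introduced only for this example. Resampling with SMOTE is not shown because it lives in a separate library rather than in scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 data (an assumption made for this sketch)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# 1) Class weights: compare AUC and minority-class recall at threshold 0.5
summary = {}
for class_weight in (None, 'balanced'):
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)
    y_prob = clf.predict_proba(X_test)[:, 1]
    summary[str(class_weight)] = {
        'auc': roc_auc_score(y_test, y_prob),
        'recall': recall_score(y_test, clf.predict(X_test)),
    }
print(summary)

# 2) Threshold adjustment: lowering the threshold can only keep or raise recall
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]
recall_by_threshold = {
    t: recall_score(y_test, (y_prob >= t).astype(int)) for t in (0.5, 0.3, 0.1)
}
print(recall_by_threshold)
```

Both knobs mostly move the operating point along the same ROC curve, which is why recall can change a lot while AUC stays close.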

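3.3 Hyperparameter Tuning Strategies

Grid search (GridSearchCV):
- Pros: exhaustive; it evaluates every combination in the grid, so it finds the best cross-validated score among the listed values
- Cons: computationally expensive
- Use case: small parameter spaces
Random search (RandomizedSearchCV):
- Pros: efficient; finds good parameters within a fixed budget
- Cons: no guarantee of finding the global optimum
- Use case: large parameter spaces
Bayesian optimization (BayesianOptimization):
- Pros: explores the parameter space adaptively and efficiently
- Use case: limited compute, but good tuning results are still needed
- Note: not part of scikit-learn; it requires a third-party library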
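As a counterpart to the grid search in section 2, here is a hedged sketch of random search with AUC as the scoring metric. The dataset, the parameter ranges, and the budget of 10 iterations are assumptions chosen to keep the run short, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small synthetic imbalanced dataset (an assumption made for this sketch)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)

# Distributions are sampled; lists are drawn from uniformly
param_distributions = {
    'n_estimators': randint(50, 201),      # integers in [50, 200]
    'max_depth': [None, 10, 20, 30],
    'min_samples_leaf': randint(1, 5),     # integers in [1, 4]
    'class_weight': [None, 'balanced'],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,                 # fixed budget: only 10 sampled combinations
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated AUC: {search.best_score_:.4f}")
```

The cost is set by `n_iter` rather than by the size of the grid, which is the main reason to prefer this approach when many parameters are tuned at once.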

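3.4 Tuning by Evaluation Metric

Maximizing AUC-ROC:
- Use case: when the model's overall ranking ability matters; insensitive to the threshold
- Parameter: scoring='roc_auc'
Metrics at a specific threshold:
- Maximizing the F1 score: balances precision and recall
- Use case: when decisions at a specific threshold matter
Probability calibration:
- Calibrate probability outputs with CalibratedClassifierCV
- Use case: when reliable probability estimates are needed, not just a ranking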
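For the threshold-specific case, one possible approach is to read the F1-maximizing threshold off `precision_recall_curve`, computed on a validation split rather than on the final test set. The sketch below assumes a synthetic dataset and a logistic regression; `best_threshold`, `f1_default`, and `f1_best` are names made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data and a held-out validation split (assumptions)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = clf.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; the final
# (precision=1, recall=0) point has no threshold, so it is dropped
precision, recall, thresholds = precision_recall_curve(y_val, y_prob)
p, r = precision[:-1], recall[:-1]
denom = p + r
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best_threshold = thresholds[np.argmax(f1)]
f1_default = f1_score(y_val, (y_prob >= 0.5).astype(int))
f1_best = f1_score(y_val, (y_prob >= best_threshold).astype(int))

print(f"Best threshold: {best_threshold:.3f}")
print(f"F1 at 0.5: {f1_default:.3f}, F1 at best threshold: {f1_best:.3f}")
```

Compared with scanning a fixed grid of thresholds, this evaluates exactly the thresholds at which the predictions change, so no candidate is missed.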

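4. Common Tuning Problems and Solutions

Overfitting:
- Symptom: high AUC on the training set, low AUC on the test set
- Solution: add regularization, reduce model complexity, use cross-validation
Class imbalance:
- Symptom: poor predictions on the minority class
- Solution: adjust class weights, rebalance by resampling, adjust the decision threshold
Feature scaling:
- Symptom: some models (such as SVM and logistic regression) perform poorly
- Solution: make sure to use StandardScaler or MinMaxScaler
Probability calibration:
- Symptom: unreliable predicted probabilities (the calibration curve deviates from the diagonal)
- Solution: calibrate the model with CalibratedClassifierCV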
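The overfitting symptom can be checked directly by comparing training and validation AUC inside cross-validation. The following is a rough sketch under assumed settings (synthetic data, a random forest, and `max_depth=3` as the "simpler" model); the `gaps` dictionary is a name introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (an assumption made for this sketch)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gaps = {}
for max_depth in (None, 3):  # fully grown trees vs. shallow trees
    model = RandomForestClassifier(n_estimators=100, max_depth=max_depth,
                                   random_state=42)
    res = cross_validate(model, X, y, cv=cv, scoring='roc_auc',
                         return_train_score=True)
    train_auc = res['train_score'].mean()
    val_auc = res['test_score'].mean()
    # A large train/validation gap is the overfitting symptom described above
    gaps[max_depth] = (train_auc, val_auc, train_auc - val_auc)

for depth, (train_auc, val_auc, gap) in gaps.items():
    print(f"max_depth={depth}: train AUC={train_auc:.3f}, "
          f"validation AUC={val_auc:.3f}, gap={gap:.3f}")
```

A smaller gap alone does not make a model better; the validation AUC is still the number to compare when choosing between the two settings.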