Kaggle Competition: Santander Customer Satisfaction Prediction (Part 4)

Original article

Last modified 2023-02-10 15:58:16 · 1.6k views

CC 4.0 BY-SA license

Tags: #python #deep-learning

First published 2023-01-18 01:10:58

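This article walks through predicting Santander Bank customer satisfaction with logistic regression, decision tree, random forest, XGBoost, and LightGBM models, covering data loading, the AUC function definition, hyperparameter tuning, probability calibration, and model evaluation. It emphasizes the importance of probability calibration and shows the ROC curves and performance metrics of the different models; LightGBM achieves the best AUC.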

Santander Customer Satisfaction Prediction (Part 4)

Model Training and Evaluation

Loading the Data

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the preprocessed dataset
dataset = 'Normal'
train = pd.read_pickle('./data/santander-customer-satisfaction/output/train_normal.pkl')
test = pd.read_pickle('./data/santander-customer-satisfaction/output/test_normal.pkl')
X_train = train.drop(['ID', 'TARGET'], axis=1)
y_train = train['TARGET'].values
X_test = test.drop('ID', axis=1)
test_id = test['ID']

# Free the original frames once features and labels are extracted
del train, test

# Hold out 15% of the training data as a validation set, stratified on TARGET
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.15)
X_train.shape, X_val.shape, X_test.shape

This prints the shapes of the training, validation, and test sets, in that order:

((64617, 336), (11403, 336), (75818, 336))
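To see what `stratify=y_train` contributes, here is a minimal sketch on made-up data. The 10,000-row array and its 4% positive rate are assumptions chosen only for illustration, not figures from the competition data, and `random_state=0` is added solely to make the sketch repeatable (the call above does not set it).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 10,000 rows with an assumed 4% positive rate
y = np.array([1] * 400 + [0] * 9600)
X = np.arange(len(y)).reshape(-1, 1)

# Same call pattern as above; random_state is an addition for repeatability
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.15, random_state=0
)

# With stratification, both parts keep (almost exactly) the same positive rate
print(X_tr.shape, X_va.shape)
print("positive rate (train): %.4f" % y_tr.mean())
print("positive rate (val):   %.4f" % y_va.mean())
```

On an imbalanced target, an unstratified split can leave the smaller validation set with noticeably more or fewer positives than the training set, which makes validation AUC noisier.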

Defining the AUC Function and the Tuning Function

ROC: receiver operating characteristic curve
AUC (Area Under the Curve): the area under the ROC curve
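One way to read AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. `pairwise_auc` is a hypothetical helper written for illustration; it is not the function the article uses (`roc_auc_score` from scikit-learn), and its quadratic cost makes it suitable only for small inputs.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Four samples: 3 of the 4 positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

A model that scores every positive above every negative gets 1.0, and one that assigns the same score to every sample gets 0.5, which is the diagonal reference line on a ROC plot.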

Next, we define the function that plots the ROC curve:

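import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, log_loss

# Row index for the one-row result DataFrame returned by plot_auc
# (a module-level `global` statement has no effect, so it is dropped)
i = 0

def plot_auc(y_true, y_pred, label, dataset=dataset):
    '''
    Plot the ROC curve for the given y_true and y_pred.
    dataset: name of the dataset that was used
    label: name of the model that was used; if label is a list,
           the ROC curves for all labels are plotted
    '''
    # np.array is a function, not a type, so the check uses np.ndarray
    if not isinstance(label, (list, np.ndarray)):
        print("\t\t %s on %s dataset \t\t \n" % (label, dataset))
        auc = roc_auc_score(y_true, y_pred)
        logloss = log_loss(y_true, y_pred)  # mean of -(y*log(p) + (1-y)*log(1-p))
        label_1 = label + ' AUC=%.3f' % (auc)

        # Plot the ROC curve (recent seaborn versions require x= and y= keywords)
        fpr, tpr, threshold = roc_curve(y_true, y_pred)
        sns.lineplot(x=fpr, y=tpr, label=label_1)
        x = np.arange(0, 1.1, 0.1)  # diagonal reference line, AUC=0.5
        sns.lineplot(x=x, y=x, label="AUC=0.5")
        plt.title("ROC on %s dataset" % (dataset))
        plt.xlabel('False Positive Rate')
        plt.ylabel("True Positive Rate")
        plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)  # place the legend outside the axes
        plt.show()
        print("%s model on the %s dataset: logloss = %.3f  AUC = %.3f" % (label, dataset, logloss, auc))

        # Build the one-row result DataFrame
        result_dict = {
            "Model": label,
            'Dataset': dataset,
            'log_loss': logloss,
            'AUC': auc
        }
        return pd.DataFrame(result_dict, index=[i])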

    else:
        # Plot one ROC curve per model
        plt.figure(figsize=(12, 8))
        for k, y in enumerate(y_pred):
            fpr, tpr, threshold = roc_curve(y_true, y)
            auc = roc_auc_score(y_true, y)
            label_ = label[k] +