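Building a Heterogeneous Ensemble Classifier with H2O to Predict Credit Card Defaulters
Accurately predicting credit card defaulters is an important machine learning task. This article shows how to use H2O, an open-source, distributed, in-memory machine learning platform, to build a heterogeneous ensemble classifier that predicts credit card defaulters.
1. Introduction
H2O provides a large set of supervised and unsupervised algorithms, including neural networks, random forest (RF), generalized linear models, gradient boosting machines, naive Bayes classifiers, and XGBoost. H2O also provides a stacked ensemble method, which uses stacking to find the best combination of a set of prediction algorithms and supports both regression and classification tasks.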
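2. Data Preparation
We use the Taiwan credit card payment default data as the example. The dataset contains information on credit card customers, such as default status, demographic factors, credit data, and payment history. It is available from GitHub or the UCI ML Repository: https://bit.ly/2EZX6IC .
The data preparation steps are as follows:
- Install H2O: in Google Colab, run the following command: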
! pip install h2o
- Import the required libraries:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn import tree
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
- Initialize H2O:
h2o.init()
- Mount Google Drive and read the dataset:
from google.colab import drive
drive.mount('/content/drive')
df_creditcarddata = h2o.import_file("/content/drive/My Drive/Colab Notebooks/UCI_Credit_Card.csv")
h2o.import_file returns an h2o.frame.H2OFrame. It resembles a pandas DataFrame, but the data is stored in the H2O cluster.
- Explore the data:
- Inspect the basic structure of the dataset:
df_creditcarddata.head()
df_creditcarddata.shape
df_creditcarddata.columns
df_creditcarddata.types
- Check the distribution of the target variable `default.payment.next.month`:
df_creditcarddata['default.payment.next.month'].table()
- Drop the unneeded `ID` column:
df_creditcarddata = df_creditcarddata.drop(["ID"], axis = 1)
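- Examine the distributions of the numeric variables:
# Convert the selected columns to pandas and plot histograms (plt is matplotlib.pyplot, imported above)
df_creditcarddata[['AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6', 'LIMIT_BAL']].as_data_frame().hist(figsize=(20,20))
plt.show()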
- Compare the counts of defaulters and non-defaulters across categories:
# Defaulters by Gender
columns = ["default.payment.next.month","SEX"]
default_by_gender = df_creditcarddata.group_by(by=columns).count(na ="all")
print(default_by_gender.get_frame())
# Defaulters by education
columns = ["default.payment.next.month","EDUCATION"]
default_by_education = df_creditcarddata.group_by(by=columns).count(na ="all")
print(default_by_education.get_frame())
# Defaulters by MARRIAGE
columns = ["default.payment.next.month","MARRIAGE"]
default_by_marriage = df_creditcarddata.group_by(by=columns).count(na ="all")
print(default_by_marriage.get_frame())
- Preprocess the data:
- Convert the categorical variables to factors:
df_creditcarddata['SEX'] = df_creditcarddata['SEX'].asfactor()
df_creditcarddata['EDUCATION'] = df_creditcarddata['EDUCATION'].asfactor()
df_creditcarddata['MARRIAGE'] = df_creditcarddata['MARRIAGE'].asfactor()
df_creditcarddata['PAY_0'] = df_creditcarddata['PAY_0'].asfactor()
df_creditcarddata['PAY_2'] = df_creditcarddata['PAY_2'].asfactor()
df_creditcarddata['PAY_3'] = df_creditcarddata['PAY_3'].asfactor()
df_creditcarddata['PAY_4'] = df_creditcarddata['PAY_4'].asfactor()
df_creditcarddata['PAY_5'] = df_creditcarddata['PAY_5'].asfactor()
df_creditcarddata['PAY_6'] = df_creditcarddata['PAY_6'].asfactor()
- Encode the binary target variable as a factor:
df_creditcarddata['default.payment.next.month'] = df_creditcarddata['default.payment.next.month'].asfactor()
df_creditcarddata['default.payment.next.month'].levels()
- Define the predictor and target variables:
predictors = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0','PAY_2','PAY_3', 'PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4', 'BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
target = 'default.payment.next.month'
- Split the dataset into roughly 70% training data and 30% test data:
splits = df_creditcarddata.split_frame(ratios=[0.7], seed=1)
train = splits[0]
test = splits[1]
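As far as the H2O documentation describes it, split_frame uses a probabilistic split, so the training frame holds approximately, not exactly, 70% of the rows. The sketch below illustrates that idea in plain Python. It is not H2O's implementation; the helper probabilistic_split and the row count of 30000 are assumptions made for illustration.

```python
import random

def probabilistic_split(n_rows, ratio, seed):
    """Assign each row index to train or test by an independent random draw.

    Hypothetical helper for illustration only; it mimics the idea of a
    probabilistic split, not H2O's actual algorithm.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

# 30000 is an assumed, illustrative row count
train_idx, test_idx = probabilistic_split(n_rows=30000, ratio=0.7, seed=1)
print(len(train_idx), len(test_idx))  # close to, but usually not exactly, 21000 / 9000
```

In practice, checking train.shape and test.shape after the split shows the actual sizes.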
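3. Model Training
We train models with the following algorithms:
- Generalized linear model (GLM)
- Distributed random forest
- Gradient boosting machine
- Stacked ensemble
3.1 Generalized Linear Model (GLM)
We build three GLM models:
- GLM with default parameters: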
GLM_default_settings = H2OGeneralizedLinearEstimator(family='binomial', model_id='GLM_default',nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
GLM_default_settings.train(x = predictors, y = target, training_frame = train)
- GLM with lambda search (regularization):
GLM_regularized = H2OGeneralizedLinearEstimator(family='binomial', model_id='GLM', lambda_search=True, nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
GLM_regularized.train(x = predictors, y = target, training_frame = train)
The lambda_search parameter lets GLM search for the best value of the regularization parameter λ.
- GLM with grid search:
hyper_parameters = { 'alpha': [0.001, 0.01, 0.05, 0.1, 1.0], 'lambda': [0.001, 0.01, 0.1, 1] }
search_criteria = { 'strategy': "RandomDiscrete", 'seed': 1, 'stopping_metric': "AUTO", 'stopping_rounds': 5 }
GLM_grid_search = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial', nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True), hyper_parameters, grid_id="GLM_grid", search_criteria=search_criteria)
GLM_grid_search.train(x= predictors,y= target, training_frame=train)
# Get the grid results, sorted by AUC (cross-validation AUC here, since nfolds is set and no validation frame is passed)
GLM_grid_sorted = GLM_grid_search.get_grid(sort_by='auc', decreasing=True)
GLM_grid_sorted
# Extract the best model from random grid search
Best_GLM_model_from_Grid = GLM_grid_sorted.model_ids[0]
Best_GLM_model_from_Grid = h2o.get_model(Best_GLM_model_from_Grid)
print(Best_GLM_model_from_Grid)
Grid search lets us find the best combination of parameters among the candidates tried.
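Every model above is trained with nfolds = 10 and fold_assignment = "Modulo". Modulo assignment puts a row into a fold according to its row index modulo the number of folds, so the folds are deterministic and identical across models, which the stacked ensemble later relies on. A minimal sketch of the idea follows; modulo_folds is a hypothetical helper, not an H2O function.

```python
from collections import Counter

def modulo_folds(n_rows, nfolds):
    """Return the fold id of each row: row index modulo the number of folds."""
    return [i % nfolds for i in range(n_rows)]

folds = modulo_folds(n_rows=25, nfolds=10)
print(folds[:12])      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1]
print(Counter(folds))  # fold sizes differ by at most one row
```

Because the assignment depends only on row order, any two models trained on the same frame with the same nfolds hold out exactly the same rows in each fold.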
3.2 Random Forest Models
- Random forest with default parameters:
RF_default_settings = H2ORandomForestEstimator(model_id = 'RF_D', nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
RF_default_settings.train(x = predictors, y = target, training_frame = train)
RF_default_settings.summary()
- Random forest with grid search:
hyper_params = {'sample_rate':[0.7, 0.9], 'col_sample_rate_per_tree': [0.8, 0.9], 'max_depth': [3, 5, 9], 'ntrees': [200, 300, 400] }
RF_grid_search = H2OGridSearch(H2ORandomForestEstimator(nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True, stopping_metric = 'AUC',stopping_rounds = 5), hyper_params = hyper_params, grid_id= 'RF_gridsearch')
RF_grid_search.train(x = predictors, y = target, training_frame = train)
# Sort the grid models
RF_grid_sorted = RF_grid_search.get_grid(sort_by='auc', decreasing=True)
print(RF_grid_sorted)
# Extract the best model from the grid search result
Best_RF_model_from_Grid = RF_grid_sorted.model_ids[0]
Best_RF_model_from_Grid = h2o.get_model(Best_RF_model_from_Grid)
print(Best_RF_model_from_Grid)
3.3 Gradient Boosting Machine (GBM) Models
- GBM with default parameters:
GBM_default_settings = H2OGradientBoostingEstimator(model_id = 'GBM_default', nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
GBM_default_settings.train(x = predictors, y = target, training_frame = train)
- GBM with grid search:
hyper_params = {'learn_rate': [0.001,0.01, 0.1], 'sample_rate': [0.8, 0.9], 'col_sample_rate': [0.2, 0.5, 1], 'max_depth': [3, 5, 9]}
GBM_grid_search = H2OGridSearch(H2OGradientBoostingEstimator(nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True, stopping_metric = 'AUC', stopping_rounds = 5), hyper_params = hyper_params, grid_id= 'GBM_Grid')
GBM_grid_search.train(x = predictors, y = target, training_frame = train)
# Sort and show the grid search results
GBM_grid_sorted = GBM_grid_search.get_grid(sort_by='auc', decreasing=True)
print(GBM_grid_sorted)
# Extract the best model from the grid search
Best_GBM_model_from_Grid = GBM_grid_sorted.model_ids[0]
Best_GBM_model_from_Grid = h2o.get_model(Best_GBM_model_from_Grid)
print(Best_GBM_model_from_Grid)
4. Stacked Ensemble Model
We build the stacked ensemble from the best models found by the grid searches above:
# list the best models from each grid
all_models = [Best_GLM_model_from_Grid, Best_RF_model_from_Grid, Best_GBM_model_from_Grid]
# Set up Stacked Ensemble
ensemble = H2OStackedEnsembleEstimator(model_id = "ensemble", base_models = all_models, metalearner_algorithm = "deeplearning")
ensemble.train(y = target, training_frame = train)
# Eval ensemble performance on the test data
Ens_model = ensemble.model_performance(test)
Ens_AUC = Ens_model.auc()
5. Model Evaluation
We compare the performance of each base model and of the stacked ensemble on the test data:
# Checking the model performance for all GLM models built
model_perf_GLM_default = GLM_default_settings.model_performance(test)
model_perf_GLM_regularized = GLM_regularized.model_performance(test)
model_perf_Best_GLM_model_from_Grid = Best_GLM_model_from_Grid.model_performance(test)
# Checking the model performance for all RF models built
model_perf_RF_default_settings = RF_default_settings.model_performance(test)
model_perf_Best_RF_model_from_Grid = Best_RF_model_from_Grid.model_performance(test)
# Checking the model performance for all GBM models built
model_perf_GBM_default_settings = GBM_default_settings.model_performance(test)
model_perf_Best_GBM_model_from_Grid = Best_GBM_model_from_Grid.model_performance(test)
# Best AUC from the base learner models
best_auc = max(model_perf_GLM_default.auc(), model_perf_GLM_regularized.auc(), model_perf_Best_GLM_model_from_Grid.auc(), model_perf_RF_default_settings.auc(), model_perf_Best_RF_model_from_Grid.auc(), model_perf_GBM_default_settings.auc(), model_perf_Best_GBM_model_from_Grid.auc())
print("Best AUC out of all the models performed: ", format(best_auc))
# Eval ensemble performance on the test data
Ensemble_model = ensemble.model_performance(test)
Ensemble_AUC = Ensemble_model.auc()
print("Ensemble AUC on the test data: ", format(Ensemble_AUC))
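With the steps above, we used H2O to build a heterogeneous ensemble classifier and predicted credit card defaulters. A stacked ensemble often improves predictive performance by combining the strengths of several base models, although the gain is not guaranteed and should be checked against the best base model on the test data.
The whole workflow is shown in the following mermaid flowchart: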
graph LR
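A[Data preparation] --> B[Install H2O]
B --> C[Import required libraries]
C --> D[Initialize H2O]
D --> E[Mount Google Drive and read dataset]
E --> F[Data exploration]
F --> G[Data preprocessing]
G --> H[Define predictors and target]
H --> I[Split dataset]
I --> J[Model training]
J --> K[GLM training]
J --> L[Random forest training]
J --> M[GBM training]
K --> K1[GLM with default parameters]
K --> K2[GLM with lambda search]
K --> K3[GLM with grid search]
L --> L1[Random forest with default parameters]
L --> L2[Random forest with grid search]
M --> M1[GBM with default parameters]
M --> M2[GBM with grid search]
K3 --> N[Extract best GLM model]
L2 --> O[Extract best random forest model]
M2 --> P[Extract best GBM model]
N & O & P --> Q[Build stacked ensemble]
Q --> R[Model evaluation]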
The flowchart shows the whole process of building the heterogeneous ensemble classifier, from data preparation through model training to evaluation, with each step feeding the next.
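6. Detailed Walk-through of Model Training and Evaluation
6.1 Generalized Linear Model (GLM)
We built the GLM in three different ways, each with its own characteristics and purpose.
- Default-parameter model: H2OGeneralizedLinearEstimator builds a GLM with default parameters. family='binomial' selects binary classification, nfolds = 10 runs 10-fold cross-validation, fold_assignment = "Modulo" sets how rows are assigned to folds, and keep_cross_validation_predictions = True keeps the cross-validation predictions. The train method takes the predictors, the target, and the training frame train.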
GLM_default_settings = H2OGeneralizedLinearEstimator(family='binomial', model_id='GLM_default',nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
GLM_default_settings.train(x = predictors, y = target, training_frame = train)
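- Model with lambda search: setting lambda_search=True lets the model search automatically for the best regularization parameter λ. Regularization helps prevent overfitting and improves generalization. This model also uses 10-fold cross-validation and is trained on the same variables and dataset.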
GLM_regularized = H2OGeneralizedLinearEstimator(family='binomial', model_id='GLM', lambda_search=True, nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
GLM_regularized.train(x = predictors, y = target, training_frame = train)
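- Grid search model: H2OGridSearch runs a grid search driven by the hyperparameters in hyper_parameters and the search criteria in search_criteria. hyper_parameters lists the candidate alpha and lambda values. search_criteria sets the search strategy to RandomDiscrete and specifies the stopping metric and the number of stopping rounds. After training, get_grid sorts the results by AUC and the best model is extracted.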
hyper_parameters = { 'alpha': [0.001, 0.01, 0.05, 0.1, 1.0], 'lambda': [0.001, 0.01, 0.1, 1] }
search_criteria = { 'strategy': "RandomDiscrete", 'seed': 1, 'stopping_metric': "AUTO", 'stopping_rounds': 5 }
GLM_grid_search = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial', nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True), hyper_parameters, grid_id="GLM_grid", search_criteria=search_criteria)
GLM_grid_search.train(x= predictors,y= target, training_frame=train)
# Get the grid results, sorted by AUC (cross-validation AUC here, since nfolds is set and no validation frame is passed)
GLM_grid_sorted = GLM_grid_search.get_grid(sort_by='auc', decreasing=True)
GLM_grid_sorted
# Extract the best model from random grid search
Best_GLM_model_from_Grid = GLM_grid_sorted.model_ids[0]
Best_GLM_model_from_Grid = h2o.get_model(Best_GLM_model_from_Grid)
print(Best_GLM_model_from_Grid)
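To make the roles of alpha and lambda concrete, the elastic net objective that GLM regularization is based on can be written as below. This is the standard elastic net form, reproduced from memory of the H2O GLM documentation, so the exact scaling of the log-likelihood term \ell(\beta) should be treated as an assumption.

```latex
\max_{\beta}\; \ell(\beta) \;-\; \lambda\left(\alpha\,\lVert\beta\rVert_1 \;+\; \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

Here \lambda controls the overall strength of the penalty, while \alpha mixes the two penalty types: \alpha = 1 gives a pure L1 (lasso) penalty and \alpha = 0 a pure L2 (ridge) penalty. The grid above therefore ranges from almost pure ridge (alpha = 0.001) to pure lasso (alpha = 1.0).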
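6.2 Random Forest Models
Random forest is a powerful ensemble learning algorithm. We again built a default-parameter model and a grid search model.
- Default-parameter model: H2ORandomForestEstimator builds a random forest with default parameters and 10-fold cross-validation, trained through the train method on the same variables and dataset. After training, the summary method shows a summary of the model.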
RF_default_settings = H2ORandomForestEstimator(model_id = 'RF_D', nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
RF_default_settings.train(x = predictors, y = target, training_frame = train)
RF_default_settings.summary()
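- Grid search model: hyper_params lists candidate values for sample_rate, col_sample_rate_per_tree, max_depth, and ntrees. H2OGridSearch runs the grid search. After training, the models are sorted by AUC and the best one is extracted.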
hyper_params = {'sample_rate':[0.7, 0.9], 'col_sample_rate_per_tree': [0.8, 0.9], 'max_depth': [3, 5, 9], 'ntrees': [200, 300, 400] }
RF_grid_search = H2OGridSearch(H2ORandomForestEstimator(nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True, stopping_metric = 'AUC',stopping_rounds = 5), hyper_params = hyper_params, grid_id= 'RF_gridsearch')
RF_grid_search.train(x = predictors, y = target, training_frame = train)
# Sort the grid models
RF_grid_sorted = RF_grid_search.get_grid(sort_by='auc', decreasing=True)
print(RF_grid_sorted)
# Extract the best model from the grid search result
Best_RF_model_from_Grid = RF_grid_sorted.model_ids[0]
Best_RF_model_from_Grid = h2o.get_model(Best_RF_model_from_Grid)
print(Best_RF_model_from_Grid)
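6.3 Gradient Boosting Machine (GBM) Models
GBM is another widely used ensemble learning algorithm. We again built a default-parameter model and a grid search model.
- Default-parameter model: H2OGradientBoostingEstimator builds a GBM with default parameters and 10-fold cross-validation, trained through the train method on the same variables and dataset.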
GBM_default_settings = H2OGradientBoostingEstimator(model_id = 'GBM_default', nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True)
GBM_default_settings.train(x = predictors, y = target, training_frame = train)
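- Grid search model: hyper_params lists candidate values for learn_rate, sample_rate, col_sample_rate, and max_depth. H2OGridSearch runs the grid search. After training, the models are sorted by AUC and the best one is extracted.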
hyper_params = {'learn_rate': [0.001,0.01, 0.1], 'sample_rate': [0.8, 0.9], 'col_sample_rate': [0.2, 0.5, 1], 'max_depth': [3, 5, 9]}
GBM_grid_search = H2OGridSearch(H2OGradientBoostingEstimator(nfolds = 10, fold_assignment = "Modulo", keep_cross_validation_predictions = True, stopping_metric = 'AUC', stopping_rounds = 5), hyper_params = hyper_params, grid_id= 'GBM_Grid')
GBM_grid_search.train(x = predictors, y = target, training_frame = train)
# Sort and show the grid search results
GBM_grid_sorted = GBM_grid_search.get_grid(sort_by='auc', decreasing=True)
print(GBM_grid_sorted)
# Extract the best model from the grid search
Best_GBM_model_from_Grid = GBM_grid_sorted.model_ids[0]
Best_GBM_model_from_Grid = h2o.get_model(Best_GBM_model_from_Grid)
print(Best_GBM_model_from_Grid)
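Grid searches combined with 10-fold cross-validation can be expensive, so it helps to count the models before launching them. The sketch below counts the hyperparameter combinations in the three grids used in this article. It assumes that each combination trains nfolds cross-validation models plus one final model; grid_size is a hypothetical helper, not an H2O function.

```python
from itertools import product

# Hyperparameter grids copied from the GLM, random forest, and GBM sections
glm_grid = {'alpha': [0.001, 0.01, 0.05, 0.1, 1.0], 'lambda': [0.001, 0.01, 0.1, 1]}
rf_grid = {'sample_rate': [0.7, 0.9], 'col_sample_rate_per_tree': [0.8, 0.9],
           'max_depth': [3, 5, 9], 'ntrees': [200, 300, 400]}
gbm_grid = {'learn_rate': [0.001, 0.01, 0.1], 'sample_rate': [0.8, 0.9],
            'col_sample_rate': [0.2, 0.5, 1], 'max_depth': [3, 5, 9]}

def grid_size(hyper_params):
    """Number of combinations in the Cartesian product of all candidate values."""
    return len(list(product(*hyper_params.values())))

nfolds = 10
for name, grid in [("GLM", glm_grid), ("RF", rf_grid), ("GBM", gbm_grid)]:
    combos = grid_size(grid)
    # Assumption: nfolds cross-validation models plus one final model per combination
    print(name, combos, combos * (nfolds + 1))
# GLM 20 220
# RF 36 396
# GBM 54 594
```

For the GLM grid these numbers are an upper bound, because the RandomDiscrete strategy with a stopping criterion may stop before trying every combination. The random forest and GBM grids pass no search_criteria, so they use the default exhaustive search.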
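7. How Stacked Ensembles Work and Why They Help
A stacked ensemble combines the predictions of several base models. A metalearner learns how to weigh the outputs of the base models in order to produce a more accurate prediction. In our example, H2OStackedEnsembleEstimator builds the stacked ensemble, with the best GLM, random forest, and GBM models from the grid searches as base models, and metalearner_algorithm = "deeplearning" selects deep learning as the metalearner.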
# list the best models from each grid
all_models = [Best_GLM_model_from_Grid, Best_RF_model_from_Grid, Best_GBM_model_from_Grid]
# Set up Stacked Ensemble
ensemble = H2OStackedEnsembleEstimator(model_id = "ensemble", base_models = all_models, metalearner_algorithm = "deeplearning")
ensemble.train(y = target, training_frame = train)
# Eval ensemble performance on the test data
Ens_model = ensemble.model_performance(test)
Ens_AUC = Ens_model.auc()
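The advantage of a stacked ensemble is that it draws on the strengths of different base models, which can reduce the bias and variance of any single model and improve generalization and predictive accuracy. The metalearner learns a good way to combine the base models, which often, though not always, yields better performance on the test data.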
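The mechanics behind this can be sketched in plain Python. Each base model predicts every training row from a fold in which that row was held out, which is why the base models are trained with keep_cross_validation_predictions = True and the same fold assignment. The metalearner is then fitted on those out-of-fold predictions. Everything below is a toy illustration under stated assumptions: the data is made up, the base learners are simple threshold rules, and the metalearner is a single blend weight rather than the deep learning metalearner used in the article.

```python
def fit_threshold_model(rows, feature):
    """Toy base learner: predicts the training positive rate on each side of the feature mean."""
    cut = sum(r[feature] for r in rows) / len(rows)

    def rate(group):
        return sum(r["y"] for r in group) / len(group) if group else 0.5

    low = rate([r for r in rows if r[feature] <= cut])
    high = rate([r for r in rows if r[feature] > cut])
    return lambda r: high if r[feature] > cut else low

def out_of_fold_predictions(rows, feature, nfolds):
    """Predict each row with a model trained without that row's fold (modulo folds)."""
    preds = [None] * len(rows)
    for fold in range(nfolds):
        train_rows = [r for i, r in enumerate(rows) if i % nfolds != fold]
        model = fit_threshold_model(train_rows, feature)
        for i, r in enumerate(rows):
            if i % nfolds == fold:
                preds[i] = model(r)
    return preds

def fit_blend_weight(z1, z2, y, steps=100):
    """Toy metalearner: weight w in [0, 1] minimising squared error of w*z1 + (1-w)*z2."""
    def sse(w):
        return sum((w * a + (1 - w) * b - t) ** 2 for a, b, t in zip(z1, z2, y))
    return min((k / steps for k in range(steps + 1)), key=sse)

# Made-up data: x1 is informative about y, x2 is not
rows = [{"x1": i, "x2": (i * 7) % 5, "y": 1 if i >= 10 else 0} for i in range(20)]
y = [r["y"] for r in rows]

# Level-one data: one column of out-of-fold predictions per base model
z1 = out_of_fold_predictions(rows, "x1", nfolds=5)
z2 = out_of_fold_predictions(rows, "x2", nfolds=5)

w = fit_blend_weight(z1, z2, y)
print("weight on the x1 model:", w)  # the informative base model receives most of the weight
```

Fitting the metalearner on out-of-fold predictions rather than on in-sample predictions keeps it from rewarding base models that merely memorised the training data.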
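8. Why the Evaluation Metric Matters
In the evaluation stage we used AUC (Area Under the Curve) as the main metric. AUC is the area under the ROC curve and measures the model's classification performance across all decision thresholds. It ranges from 0 to 1: a value of 0.5 corresponds to random ranking, and the closer the value is to 1, the better the model separates the two classes.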
# Checking the model performance for all GLM models built
model_perf_GLM_default = GLM_default_settings.model_performance(test)
model_perf_GLM_regularized = GLM_regularized.model_performance(test)
model_perf_Best_GLM_model_from_Grid = Best_GLM_model_from_Grid.model_performance(test)
# Checking the model performance for all RF models built
model_perf_RF_default_settings = RF_default_settings.model_performance(test)
model_perf_Best_RF_model_from_Grid = Best_RF_model_from_Grid.model_performance(test)
# Checking the model performance for all GBM models built
model_perf_GBM_default_settings = GBM_default_settings.model_performance(test)
model_perf_Best_GBM_model_from_Grid = Best_GBM_model_from_Grid.model_performance(test)
# Best AUC from the base learner models
best_auc = max(model_perf_GLM_default.auc(), model_perf_GLM_regularized.auc(), model_perf_Best_GLM_model_from_Grid.auc(), model_perf_RF_default_settings.auc(), model_perf_Best_RF_model_from_Grid.auc(), model_perf_GBM_default_settings.auc(), model_perf_Best_GBM_model_from_Grid.auc())
print("Best AUC out of all the models performed: ", format(best_auc))
# Eval ensemble performance on the test data
Ensemble_model = ensemble.model_performance(test)
Ensemble_AUC = Ensemble_model.auc()
print("Ensemble AUC on the test data: ", format(Ensemble_AUC))
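To make the AUC definition concrete, AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four made-up examples. auc_from_scores is a hypothetical helper, and H2O's own computation may differ in detail.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of default
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(labels, scores))  # 0.75
```

Three of the four positive-negative pairs are ranked correctly here, hence 0.75. No threshold is involved, which is why AUC is described as threshold-independent.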
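Besides AUC, other metrics such as accuracy, recall, and F1 score can be used, chosen according to the business requirements and the characteristics of the data. Each metric focuses on a different aspect of the model, so using several together gives a fuller picture of its performance.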
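Unlike AUC, these metrics depend on a chosen decision threshold, because they are computed from hard 0/1 predictions. A minimal sketch on made-up labels follows; precision_recall_f1 is a hypothetical helper written for illustration.

```python
def precision_recall_f1(labels, predictions):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: 3 true defaulters, 3 predicted defaulters, 2 of them correct
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(precision, recall, f1)
```

For credit default data, where defaulters are usually the minority class, recall on the defaulter class is often more informative than plain accuracy.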
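9. Summary
This article showed in detail how to use H2O to build a heterogeneous ensemble classifier that predicts credit card defaulters. The process has three main stages: data preparation, model training, and model evaluation.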
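| Stage | Main steps |
|---|---|
| Data preparation | Install H2O, import the required libraries, initialize H2O, mount Google Drive and read the dataset, explore the data, preprocess the data, define the predictors and target, split the dataset |
| Model training | Train GLM, random forest, and GBM models with default parameters and with grid search, then extract the best models |
| Model evaluation | Compare the base models and the stacked ensemble on the test data, using AUC as the main metric |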
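By building a stacked ensemble, we can combine the strengths of several base models to improve predictive accuracy and generalization. We hope this article helps you apply H2O to build heterogeneous ensemble classifiers in your own work.
The key steps of the whole process are summarized in the following mermaid flowchart: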
graph LR
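A[Start] --> B[Data preparation]
B --> C[Model training]
C --> D[GLM training]
C --> E[Random forest training]
C --> F[GBM training]
D --> D1[Default GLM]
D --> D2[GLM with lambda search]
D --> D3[Grid search GLM]
E --> E1[Default random forest]
E --> E2[Grid search random forest]
F --> F1[Default GBM]
F --> F2[Grid search GBM]
D3 --> G[Best GLM]
E2 --> H[Best random forest]
F2 --> I[Best GBM]
G & H & I --> J[Stacked ensemble]
J --> K[Model evaluation]
K --> L[Output results]
L --> M[End]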
This flowchart again shows the whole process from data preparation to the final results, with each step feeding the next.