(6) Cross-Validation with XGBoost

This article shows how to run cross-validation with XGBoost and compares the results under different settings, including disabling the standard-deviation display, using a preprocessing function, and using a custom loss function.


import numpy as np
import xgboost as xgb

### load data and do training
# basePath is assumed to point at the directory containing the demo data
dtrain = xgb.DMatrix(basePath+'data/agaricus.txt.train')
# max_depth: maximum tree depth; eta: learning rate; silent: suppress messages;
# objective: binary logistic regression
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}
num_round = 2

print('running cross validation')
running cross validation
# do cross validation, this will print result out as
# [iteration]  metric_name:mean_value+std_value
# std_value is standard deviation of the metric
# metrics: evaluation metric(s) on the validation data; the defaults are rmse for regression and error for classification
xgb.cv(param, dtrain, num_round, nfold=5,
       metrics={'error'}, seed=0,
       callbacks=[xgb.callback.print_evaluation(show_stdv=True)])
[0] train-error:0.0506682+0.009201  test-error:0.0557316+0.0158887
[1] train-error:0.0213034+0.00205561  test-error:0.0211884+0.00365323
   test-error-mean  test-error-std  train-error-mean  train-error-std
0         0.055732        0.015889          0.050668         0.009201
1         0.021188        0.003653          0.021303         0.002056
print('running cross validation, disable standard deviation display')
running cross validation, disable standard deviation display
# do cross validation, this will print result out as
# [iteration]  metric_name:mean_value 
# num_boost_round=10: number of boosting rounds (iterations)
res = xgb.cv(param, dtrain, num_boost_round=10, nfold=5,
             metrics={'error'}, seed=0,
             callbacks=[xgb.callback.print_evaluation(show_stdv=False),
                        xgb.callback.early_stop(3)])  # early stopping: train until test-error has not improved in 3 rounds
[0] train-error:0.0506682 test-error:0.0557316
Multiple eval metrics have been passed: 'test-error' will be used for early stopping.
Will train until test-error hasn't improved in 3 rounds.
[1] train-error:0.0213034 test-error:0.0211884
[2] train-error:0.0099418 test-error:0.0099786
[3] train-error:0.0141256 test-error:0.0144336
[4] train-error:0.0059878 test-error:0.0062948
[5] train-error:0.0020344 test-error:0.0016886
[6] train-error:0.0012284 test-error:0.001228
[7] train-error:0.0012284 test-error:0.001228
[8] train-error:0.0009212 test-error:0.001228
[9] train-error:0.0006142 test-error:0.001228
Stopping. Best iteration:
[6] train-error:0.0012284+0.000260265 test-error:0.001228+0.00104094
print(res)
   test-error-mean  test-error-std  train-error-mean  train-error-std
0         0.055732        0.015889          0.050668         0.009201
1         0.021188        0.003653          0.021303         0.002056
2         0.009979        0.004828          0.009942         0.006076
3         0.014434        0.003517          0.014126         0.001706
4         0.006295        0.003123          0.005988         0.001878
5         0.001689        0.000574          0.002034         0.001470
6         0.001228        0.001041          0.001228         0.000260
print('running cross validation, with preprocessing function')
running cross validation, with preprocessing function
# define the preprocessing function
# used to return the preprocessed training, test data, and parameter
# we can use this to do weight rescale, etc.
# as an example, we try to set scale_pos_weight
# the preprocessing function takes (dtrain, dtest, param) and returns transformed versions of them
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label == 1)
    param['scale_pos_weight'] = ratio   # balances positive and negative weights; useful for imbalanced classes. A typical value: sum(negative instances) / sum(positive instances)
    return (dtrain, dtest, param)
# do cross validation, for each fold
# the dtrain, dtest, param will be passed into fpreproc
# then the return value of fpreproc will be used to generate
# results of that fold
xgb.cv(param, dtrain, num_round, nfold=5,
       metrics={'auc'}, seed=0, fpreproc=fpreproc)  # auc: area under the ROC curve
   test-auc-mean  test-auc-std  train-auc-mean  train-auc-std
0       0.958232      0.005778        0.958228       0.001442
1       0.981431      0.002595        0.981414       0.000647
### cross validation with a customized loss function
# you can also do cross validation with customized loss function
# See custom_objective.py
##
print('running cross validation, with customized loss function')
running cross validation, with customized loss function
def logregobj(preds, dtrain):
    # custom objective for binary logistic loss:
    # returns the gradient and hessian of the loss w.r.t. the raw predictions
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))   # sigmoid: convert raw margins to probabilities
    grad = preds - labels                  # first-order gradient
    hess = preds * (1.0 - preds)           # second-order gradient (hessian)
    return grad, hess

def evalerror(preds, dtrain):
    # custom evaluation metric: classification error rate
    # preds here are raw margins, so the decision threshold is 0.0
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
param = {'max_depth':2, 'eta':1, 'silent':1}
# run cross validation with the customized objective and evaluation function
xgb.cv(param, dtrain, num_round, nfold=5, seed=0,
       obj=logregobj, feval=evalerror)
   test-error-mean  test-error-std  test-rmse-mean  test-rmse-std  train-error-mean  train-error-std  train-rmse-mean  train-rmse-std
0         0.055732        0.015889        1.598043       0.012826          0.050668         0.009201         1.595072        0.003868
1         0.021188        0.003653        2.449282       0.080900          0.021303         0.002056         2.442600        0.076834
#rmse: root mean square error
#mae: mean absolute error
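Both metrics are simple to compute directly with NumPy; the arrays below are made-up values used only for illustration:

import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # hypothetical labels
y_pred = np.array([0.9, 0.2, 0.7, 0.4])   # hypothetical predictions

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean square error
mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
print(rmse, mae)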
XGBoost (eXtreme Gradient Boosting) is a powerful machine learning algorithm that builds its model with gradient boosting. A common way to assess the performance of an XGBoost model is five-fold cross validation.

Five-fold cross validation is a widely used evaluation method, especially when the dataset is small or the classes are imbalanced. The original dataset is split into five subsets; four are used to train the model and the remaining one to test it. The process is repeated five times, each time with a different subset as the test set, and the five results are averaged to obtain the final evaluation metric.

The steps for five-fold cross validation with XGBoost are as follows (a code sketch follows the list):

1. Split the original dataset into five subsets (usually at random); the subsets should be of roughly equal size.
2. For each round of cross validation, pick four of the five subsets as the training set and keep the remaining one as the test set.
3. Train the XGBoost model on the training set. Hyperparameters such as column subsampling, the number of trees, and tree depth can be tuned to improve the model.
4. Use the trained model to predict on the test set and compute evaluation metrics such as accuracy, precision, and recall.
5. Repeat steps 2-4 until every subset has served as the test set once.
6. Average the metrics from the five rounds to obtain the final performance estimate.

Five-fold cross validation evaluates the XGBoost model on different train/test splits and helps guard against overfitting or underfitting to one particular split. Averaging the results of several rounds gives a more reliable estimate of model performance and supports choosing the best hyperparameter configuration, which improves the model's generalization ability and stability on unseen data.
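A minimal hand-rolled version of this procedure, assuming scikit-learn is available and using made-up data for X and y (both are hypothetical placeholders), might look like:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# toy feature matrix X and label vector y, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
params = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
scores = []
for train_idx, test_idx in kf.split(X):
    # steps 2-3: train on four folds
    dtr = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dte = xgb.DMatrix(X[test_idx], label=y[test_idx])
    bst = xgb.train(params, dtr, num_boost_round=10)
    # step 4: evaluate on the held-out fold
    preds = (bst.predict(dte) > 0.5).astype(int)
    scores.append(accuracy_score(y[test_idx], preds))

# step 6: average the five fold scores
print('mean accuracy over 5 folds:', np.mean(scores))

In practice the built-in xgb.cv call shown earlier does the same fold handling internally; the explicit loop is only meant to make the six steps visible.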