Machine Learning: eXtreme Gradient Boosting (XGBoost) in Practice

Alphoseven

Tags: python, machine learning

Article link: https://blog.youkuaiyun.com/Alphoseven/article/details/113812048

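XGBoost Hands-On Notes

In a recent project I used an XGBoost model as a baseline for comparison. This post records some learning resources, along with problems I ran into while writing the code and how I solved them.

I. Learning resources

The original XGBoost paper
An explanation of XGBoost and its parameters
An XGBoost tuning method (for more detail see this paper, though the link may not open)(｡ì _ í｡)

II. Hands-on

This is my first time working with XGBoost, so please forgive any mistakes.

a. Import the necessary packages

Some package interfaces may have changed since this was written, so please verify them yourself (◐‿◑)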

import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score
from xgboost import XGBClassifier
from sklearn import metrics
import matplotlib.pyplot as plt

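b. Split the dataset

When the 0/1 classes in the dataset are imbalanced, we can pass stratify = y to train_test_split(). This keeps the ratio of 0s to 1s similar in the training set and the test set.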

df = pd.read_csv('new_feature_snp_data.csv')
X = df.iloc[:, 2:74]
y = df.iloc[:, -1]
y = pd.DataFrame(y)
# stratify=y keeps the 0/1 ratio similar in the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 5, stratify = y)
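To see what stratify does, here is a minimal sketch on made-up data (the arrays `X_demo` and `y_demo` are assumptions for illustration, not the SNP dataset used in this post). It compares the share of positives in each split with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10% positives (made-up data for illustration).
rng = np.random.default_rng(0)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = rng.normal(size=(1000, 3))

# Without stratify, the positive rate in each split depends on the random draw.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5)

# With stratify, both splits keep the positive rate of the full data.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo)

print("plain:      train %.3f  test %.3f" % (y_tr_plain.mean(), y_te_plain.mean()))
print("stratified: train %.3f  test %.3f" % (y_tr_strat.mean(), y_te_strat.mean()))
```

The stratified split should report a positive rate close to 0.1 in both parts, while the plain split can drift away from it.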

The splits can be exported to CSV as follows:

train_set = pd.concat([X_train, y_train], axis = 1)
test_set = pd.concat([X_test, y_test], axis = 1)
train_set.to_csv('train_set.csv', index = True, header = True)
test_set.to_csv('test_set.csv', index = True, header = True)
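Because the files are written with index = True, the first CSV column holds the original row index. A small sketch of reading such a file back, using a made-up frame and an in-memory buffer in place of a real file (both are assumptions for illustration):

```python
import io
import pandas as pd

# Small made-up frame with a non-default index, like the rows kept after a split.
demo = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "label": [0, 1, 0]}, index=[7, 2, 5])

# Write with index=True, as above; an in-memory buffer stands in for a file.
buf = io.StringIO()
demo.to_csv(buf, index=True, header=True)
buf.seek(0)

# index_col=0 turns the first CSV column back into the index.
restored = pd.read_csv(buf, index_col=0)
print(restored)
```

Without index_col=0 the index would come back as an extra unnamed feature column, which is easy to feed into a model by mistake.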

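c. Hyperparameter tuning

I'll skip the tuning process here; I followed the tuning method listed in the learning resources. Grouping the parameters and tuning them one group at a time saves a lot of time, but the model may end up at a local optimum instead of the global one. To make our life easier, that is the method I use here. (Getting good at tuning takes practice. Still, we should pay more attention to the algorithm itself: the algorithm sets the upper bound, and tuning only helps us approach it.)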
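The stage-by-stage idea can be sketched roughly as below. This is not the tuning run used for this post: the data is synthetic, the grids are hypothetical, and scikit-learn's GradientBoostingClassifier stands in for XGBClassifier so the sketch depends on scikit-learn only. With three groups of 3, 3 and 2 candidate values, the staged search tries 3 + 3 + 2 = 8 settings instead of the 3 × 3 × 2 = 18 of a full grid, which is where the time saving (and the risk of a local optimum) comes from.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a stand-in model (assumptions for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Parameter groups tuned one after another; the values are hypothetical.
stages = [
    {"max_depth": [2, 3, 4]},
    {"subsample": [0.6, 0.8, 1.0]},
    {"learning_rate": [0.05, 0.1]},
]

best_params = {"n_estimators": 30, "random_state": 0}
for grid in stages:
    search = GridSearchCV(
        GradientBoostingClassifier(**best_params),
        param_grid=grid, scoring="roc_auc", cv=3)
    search.fit(X_demo, y_demo)
    # Freeze the winners of this stage before moving on to the next group.
    best_params.update(search.best_params_)
    print(grid, "->", search.best_params_, "AUC %.3f" % search.best_score_)

print(best_params)
```

Each stage only sees the values fixed by earlier stages, so a combination that would have been best overall can be missed.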

d. Compute the AUC on the training set with 5-fold cross-validation

Here is the tuned XGBoost model:

xgb_f = XGBClassifier(learning_rate=0.01, n_estimators=277, max_depth=5, min_child_weight=1, gamma=0.5,
                      subsample=0.65, colsample_bytree=0.9, objective='binary:logistic', scale_pos_weight=1,
                      random_state=27)  # random_state replaces the deprecated seed argument

Compute the AUC:

# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5)
results = cross_val_score(xgb_f, X_train, y_train.values.ravel(), cv=kfold, scoring='roc_auc')
print(results.mean())  # 0.7241887534252218

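I was worried that the AUC computed this way might not be based on XGBoost's predicted probability of class 1, so I checked it by hand with the following code: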

aucs = []
cv = KFold(n_splits=5)  # no shuffle, so random_state is not needed
# cv.split() yields the row positions used for training and for validation in each fold
for train, test in cv.split(X_train, y_train):
    probas_ = xgb_f.fit(X_train.iloc[train], y_train.iloc[train].values.ravel()).predict_proba(X_train.iloc[test])
    fpr, tpr, threshold = metrics.roc_curve(y_train.iloc[test].values.ravel(), probas_[:, 1])
    roc_auc = metrics.auc(fpr, tpr)
    aucs.append(roc_auc)
print(sum(aucs) / len(aucs))
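The same check can be run fold by fold. As far as I know, scikit-learn's 'roc_auc' scorer uses continuous scores: decision_function when the estimator has one, otherwise the class-1 column of predict_proba. The sketch below compares the two routes on synthetic data, with GaussianNB as a stand-in classifier (an assumption, chosen because, like XGBClassifier, it offers predict_proba but no decision_function).

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data and a stand-in classifier (assumptions for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
model = GaussianNB()
cv = KFold(n_splits=5)

# Route 1: let cross_val_score compute the AUC of each fold.
auto = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")

# Route 2: fit each fold by hand and score the class-1 probabilities.
manual = []
for train_idx, test_idx in cv.split(X_demo, y_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    p1 = model.predict_proba(X_demo[test_idx])[:, 1]
    manual.append(metrics.roc_auc_score(y_demo[test_idx], p1))

# True when both routes give the same per-fold AUC values.
print(np.allclose(auto, manual))
```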

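predict_proba() returns the predicted probabilities of class 0 and class 1 for each sample, as shown in the figure below:
[Figure: predict_proba() output]

Following how the ROC curve is computed, we only need the probability of class 1 for the calculation.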
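Why the class-1 column is enough: the AUC equals the fraction of (positive, negative) sample pairs in which the positive sample gets the higher class-1 score, with ties counted as one half. A small sketch on made-up labels and probabilities (the arrays `y_true` and `p_pos` are assumptions for illustration), checked against roc_curve and auc:

```python
import numpy as np
from sklearn import metrics

# Made-up labels and class-1 probabilities for six samples.
y_true = np.array([0, 0, 1, 0, 1, 1])
p_pos = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

# AUC = fraction of (positive, negative) pairs where the positive sample
# gets the higher score; ties count as half.
pos = p_pos[y_true == 1]
neg = p_pos[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# The same value from the ROC curve, as computed elsewhere in this post.
fpr, tpr, _ = metrics.roc_curve(y_true, p_pos)
auc_curve = metrics.auc(fpr, tpr)

print(auc_pairs, auc_curve)
```

Here 6 of the 9 positive/negative pairs are ranked correctly, so both computations give 6/9.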

e. Compute the AUC on the test set and plot the ROC curve

e.1. Compute the AUC

xgb_f.fit(X_train, y_train.values.ravel())
pred = xgb_f.predict_proba(X_test)[:, 1]
# Compute the false positive rate and the true positive rate
fpr, tpr, threshold = metrics.roc_curve(y_test.values.ravel(), pred)
# Compute the AUC
roc_auc = metrics.auc(fpr, tpr)  # 0.6847573479152426

e.2. Plot the ROC curve

lw = 2
# Create a single figure; an extra plt.figure() call would open an empty window
plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc) 
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()