I. XGBoost Theory
1. The objective function
Here $\hat y_i$ denotes the predicted value of the $i$-th sample $x_i$.
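The objective itself appeared as an image in the original post; in the standard XGBoost formulation it is the training loss summed over all $n$ samples plus a regularization term summed over all $K$ trees:

$$Obj = \sum_{i=1}^{n} l\left(y_i,\ \hat y_i\right) + \sum_{k=1}^{K} \Omega(f_k)$$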
2. The t-th tree
XGBoost is an additive model. Suppose the tree to be trained at the $t$-th iteration is $f_t(\cdot)$; then

$$\hat y_i^{(t)} = \hat y_i^{(t-1)} + f_t(x_i)$$

Substituting this into the objective function gives:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat y_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant}$$
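As a toy illustration of the additive update, here is a minimal sketch that substitutes constant predictors for real trees, with squared loss, so each round simply fits the mean of the current residuals:

```python
import numpy as np

# Toy additive boosting: each "tree" f_t is a single constant leaf fitted
# to the current residuals y - y_hat (squared loss), scaled by a learning rate.
rng = np.random.default_rng(0)
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.zeros_like(y)            # \hat y^{(0)} = 0
lr = 0.5                            # learning rate (shrinkage)

for t in range(20):
    f_t = lr * np.mean(y - y_hat)   # the t-th "tree": one constant value
    y_hat = y_hat + f_t             # \hat y^{(t)} = \hat y^{(t-1)} + f_t(x)

print(y_hat)  # a constant predictor can only learn the mean of y (2.5)
```

Real boosted trees fit a different value per region of the feature space, but the accumulation of predictions round by round works exactly as above.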
3. Taylor expansion
(1) First, recall the second-order Taylor formula:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$

(2) Define the first- and second-order derivatives of the loss function with respect to $\hat y^{(t-1)}$:

$$g_i = \partial_{\hat y^{(t-1)}}\, l\left(y_i,\ \hat y^{(t-1)}\right), \qquad h_i = \partial^2_{\hat y^{(t-1)}}\, l\left(y_i,\ \hat y^{(t-1)}\right)$$

The objective then becomes:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[\, l\left(y_i,\ \hat y^{(t-1)}\right) + g_i\, f_t(x_i) + \frac{1}{2}\, h_i\, f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{constant}$$
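For squared loss the second-order Taylor expansion is not just an approximation but exact, which makes it easy to sanity-check $g_i$ and $h_i$ numerically:

```python
# Squared loss: l(y, y_hat) = (y - y_hat)^2, so g = 2*(y_hat - y) and h = 2.
def loss(y, y_hat):
    return (y - y_hat) ** 2

y, y_hat_prev, f_t = 3.0, 2.0, 0.7   # label, previous prediction, new tree output

g = 2.0 * (y_hat_prev - y)           # first derivative w.r.t. y_hat_prev
h = 2.0                              # second derivative

exact = loss(y, y_hat_prev + f_t)
taylor = loss(y, y_hat_prev) + g * f_t + 0.5 * h * f_t ** 2
print(exact, taylor)  # the two agree (the expansion is exact for squared loss)
```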
4. Defining a tree
5. Defining tree complexity
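The defining equations for these two sections were images in the original. In the standard formulation, a tree with $T$ leaves maps each sample to one of $T$ leaf weights, and its complexity penalizes both the number of leaves and the magnitude of the weights:

$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1, 2, \dots, T\}$$

$$\Omega(f_t) = \gamma\, T + \frac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2$$

where $q$ maps a sample to its leaf index and $w$ holds the leaf weights.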
6. Grouping samples by leaf node
We group all samples $x_i$ that fall into the $j$-th leaf node into one index set:

$$I_j = \{\, i \mid q(x_i) = j \,\}$$

Substituting this into the Taylor-expanded objective obj gives:

$$Obj^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma\, T$$
7. Scoring a tree structure
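Writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, minimizing the objective above over each leaf weight gives the standard closed-form solution $w_j^* = -G_j/(H_j + \lambda)$ and the structure score $Obj^* = -\frac{1}{2}\sum_{j=1}^{T} G_j^2/(H_j + \lambda) + \gamma T$. A small numeric sketch (the $G_j$, $H_j$, $\lambda$, $\gamma$ values are made up for illustration):

```python
# Optimal leaf weight and structure score for a fixed tree shape,
# using the standard XGBoost closed-form formulas.
def leaf_weight(G, H, lambda_):
    return -G / (H + lambda_)

def structure_score(G_list, H_list, lambda_, gamma_):
    T = len(G_list)
    return -0.5 * sum(G * G / (H + lambda_)
                      for G, H in zip(G_list, H_list)) + gamma_ * T

# Two leaves, with per-leaf gradient/hessian sums G_j, H_j:
print(leaf_weight(-4.0, 2.0, 1.0))                       # 4/3
print(structure_score([-4.0, 6.0], [2.0, 4.0], 1.0, 0.0))
```

The lower (more negative) the structure score, the better the tree shape, which is exactly what split finding optimizes next.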
8. Splitting a node
9. Finding the best split point
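The gain of a candidate split is the improvement in structure score, $\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$. A sketch of the exact greedy algorithm on a single feature (the `lam`, `gamma` values and toy data are illustrative):

```python
import numpy as np

# Exact greedy split finding: sort samples by feature value, sweep all
# candidate thresholds, and keep the split with the highest gain.
def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for i in range(len(x) - 1):
        GL += g[i]; HL += h[i]          # running left-side sums
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_thr

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
g = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])  # e.g. squared-loss gradients
h = np.ones(6)
gain, thr = best_split(x, g, h)
print(gain, thr)  # best split falls between 3.0 and 10.0 -> threshold 6.5
```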
10. When to stop splitting
II. XGBoost in Practice
1. Basic workflow
2. Parameter reference
3. Binary classification with XGBoost
Problem description:
The goal is to predict whether a patient will develop diabetes within 5 years. The data comes as a CSV file with 9 columns: the first 8 columns are input variables, and the last column is the label, 0 or 1.
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the data; read it again with pandas just to inspect the column dtypes
dataset = loadtxt(r'C:\Users\dahuo\Desktop\dataset_001.csv', delimiter=",")
df = pd.read_csv(r'C:\Users\dahuo\Desktop\dataset_001.csv')
print(df.dtypes)

# The first 8 columns are features, the 9th is the 0/1 label
X = dataset[:, 0:8]
Y = dataset[:, 8]
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Plain training, without monitoring the test-set loss:
# model = XGBClassifier()
# model.fit(X_train, y_train)

# Training while logging the log-loss on the test set
model = XGBClassifier(base_score=0.5,
                      booster='gbtree',
                      colsample_bylevel=1,
                      colsample_bytree=1,
                      gamma=0,
                      learning_rate=0.1,
                      max_delta_step=0,
                      max_depth=3,
                      min_child_weight=1,
                      n_estimators=100,
                      n_jobs=1,
                      objective='binary:logistic',
                      random_state=0,
                      reg_alpha=0,
                      reg_lambda=1,
                      scale_pos_weight=1,
                      subsample=1,
                      use_label_encoder=False)
eval_set = [(X_test, y_test)]
# Note: in xgboost >= 2.0, pass early_stopping_rounds and eval_metric to the
# constructor instead of fit()
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss",
          eval_set=eval_set, verbose=True)

# predict() already returns 0/1 class labels, so no rounding is needed
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

"""
Analyze the importance of each feature
"""
from xgboost import plot_importance
from matplotlib import pyplot

model.fit(X, Y)
plot_importance(model)
pyplot.show()
4. Regression with XGBoost
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
import numpy as np
from xgboost import plot_importance
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def loadDataset(filePath):
    df = pd.read_csv(filepath_or_buffer=filePath)
    return df

# Build the feature matrix: a few base features (including age),
# plus extra columns kept through feature selection
def featureSet(data):
    # Fill missing values in the rating columns with the column mean
    X = data.loc[:, 'rw':'gk']
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    x_new = imputer.fit_transform(X)  # result is a numpy array
    data_num = len(data)
    XList = []
    for row in range(0, data_num):
        tmp_list = []
        tmp_list.append(data.iloc[row]['club'])
        tmp_list.append(data.iloc[row]['league'])
        tmp_list.append(data.iloc[row]['potential'])
        tmp_list.append(data.iloc[row]['international_reputation'])
        # Extract the birth year as an age feature
        d = data.iloc[row]['birth_date']
        s = d.split('/')
        tmp_list.append(int(s[-1]))
        # Features added through feature selection
        tmp_list.append(data.iloc[row]['pac'])
        tmp_list.append(data.iloc[row]['sho'])
        tmp_list.append(data.iloc[row]['pas'])
        tmp_list.append(data.iloc[row]['dri'])
        tmp_list.append(data.iloc[row]['def'])
        tmp_list.append(data.iloc[row]['phy'])
        tmp_list.append(data.iloc[row]['skill_moves'])
        # Seven of the imputed rating columns
        tmp_list.append(x_new[row][0])
        tmp_list.append(x_new[row][2])
        tmp_list.append(x_new[row][3])
        tmp_list.append(x_new[row][4])
        tmp_list.append(x_new[row][5])
        tmp_list.append(x_new[row][6])
        tmp_list.append(x_new[row][7])
        XList.append(tmp_list)
    yList = data.y.values
    return np.array(XList), yList

def trainandTest_mae(X_train, y_train, X_test, y_test):
    # Train the XGBoost regressor
    model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160,
                             objective='reg:gamma')
    model.fit(X_train, y_train)
    # Predict on the test split
    ans = model.predict(X_test)
    # Mean absolute error
    mae = mean_absolute_error(y_test, ans)
    print("mae:", mae)
    # Plot feature importances
    plot_importance(model)
    plt.show()

trainFilePath = 'train.csv'
data = loadDataset(trainFilePath)
# Build the dataset
X, y = featureSet(data)
# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Train and test the model, reporting the MAE metric
trainandTest_mae(X_train, y_train, X_test, y_test)