The notes below are taken from '跟着迪哥学python数据分析与机器学习实战', with my own additions and reorganization; for personal review only.
XGBoost is a Boosting ensemble algorithm; this note walks through XGBClassifier as the example.
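As a quick reminder of the model form: XGBoost predicts with an additive ensemble of $K$ trees and minimizes a regularized training objective:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad \mathcal{L} = \sum_{i} l\big(y_i, \hat{y}_i\big) + \sum_{k=1}^{K} \Omega(f_k)$$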
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
1. Load the data
# Read the Pima Indians diabetes dataset
dataset = pd.read_csv(r'PimaIndiansdiabetes.csv')
# Peek at the first rows and check for missing values
dataset.head(2)
dataset.isnull().sum()
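If your copy of the CSV has no header row (many mirrors of this dataset ship headerless), assign column names yourself. A minimal sketch, assuming the usual 8-features-plus-Outcome layout; the names below are the conventional ones for this dataset, not taken from the book:

# Only needed when the file lacks a header row (assumption about the file)
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
        'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
dataset = pd.read_csv(r'PimaIndiansdiabetes.csv', header=None, names=cols)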
2. Build a baseline classification model
# The first 8 columns are features, the last column is the label
X = dataset.iloc[:, 0:8]
Y = dataset.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=7)
# Fit the model with default parameters
model=XGBClassifier()
model.fit(X_train,y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# predict() returns class labels here, so the rounding is defensive, kept from the source
predictions = [round(value) for value in y_pred]
# Accuracy on the held-out test set
accuracy=accuracy_score(y_test,predictions)
print('accu:%.2f%%' % (accuracy *100))
accu:74.02%
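Accuracy alone can hide class imbalance. A quick sketch using standard sklearn.metrics calls (not part of the book's code) gives a fuller picture:

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))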
3. Baseline model, showing the training process (with an evaluation set and early stopping)
model = XGBClassifier()
# Evaluation set used to monitor logloss after each boosting round
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train,
          early_stopping_rounds=10,   # stop if logloss has not improved for 10 rounds
          eval_metric='logloss',
          eval_set=eval_set,
          verbose=True)
y_pred=model.predict(X_test)
predictions=[round(value) for value in y_pred]
accuracy=accuracy_score(y_test,predictions)
print('accu: %.2f%%' % (accuracy*100))
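With an eval_set, the per-round logloss is stored on the model. A sketch of plotting it; evals_result() is part of xgboost's sklearn API and 'validation_0' is its default naming for the first eval set, though details can vary across versions:

import matplotlib.pyplot as plt
results = model.evals_result()            # {'validation_0': {'logloss': [...]}}
plt.plot(results['validation_0']['logloss'])
plt.xlabel('boosting round')
plt.ylabel('logloss')
plt.show()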
Feature importance:
from xgboost import plot_importance
import matplotlib.pyplot as plt
plot_importance(model)
plt.show()
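By default plot_importance ranks features by 'weight' (how often a feature is used to split); 'gain' often reflects usefulness better. Both are supported via the importance_type argument:

plot_importance(model, importance_type='gain')
plt.show()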
4. Hyperparameter tuning example
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
# Candidate values, packed into a dict for GridSearchCV
# Tune the learning rate first
learning_rate=[0.0001,0.001,0.01,0.1,0.2,0.3]
param_grid=dict(learning_rate=learning_rate)
# Stratified 10-fold cross-validation
kfold=StratifiedKFold(n_splits=10,shuffle=True,
random_state=7)
model=XGBClassifier()
grid_search=GridSearchCV(model,param_grid,
scoring='neg_log_loss',n_jobs=-1,
cv=kfold)
grid_result=grid_search.fit(X,Y)
print('best:%f using %s' % (grid_result.best_score_,
grid_result.best_params_))
means=grid_result.cv_results_['mean_test_score']
params=grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print('neg_log_loss %f with %r' % (mean, param))
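The same pattern extends to several parameters at once. A sketch tuning learning_rate together with n_estimators (the candidate values are illustrative, not from the book):

param_grid = dict(learning_rate=[0.01, 0.1, 0.2],
                  n_estimators=[100, 200, 300])
grid_search = GridSearchCV(XGBClassifier(), param_grid,
                           scoring='neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print('best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))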
Parameter notes:
- learning_rate: usually kept small
- tree parameters
- - max_depth: maximum depth of each tree
- - min_child_weight: minimum sum of instance weight needed in a leaf
- - subsample: whether rows are sampled randomly for each tree (e.g. 0.8), as in a random forest; 1.0 means no row sampling
- - colsample_bytree: whether features (columns) are sampled randomly for each tree
- - gamma: the coefficient of the leaf count T in the regularization term; larger values penalize complex trees (see the formula after this list)
- regularization parameters
- - lambda: L2 penalty on leaf weights
- - alpha: L1 penalty on leaf weights
- objective: which loss function to optimize; it must be specified (many options are available)
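For reference, gamma, lambda, and alpha all appear in XGBoost's per-tree regularization term, where $T$ is the number of leaves and $w_j$ the leaf weights:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} \lvert w_j \rvert$$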
Example:
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,            # deliberately large; pair with early stopping
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
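A minimal usage sketch for this configuration, reusing the earlier split. Passing early_stopping_rounds to fit() matches the book-era xgboost API used above; newer releases expect it in the constructor instead:

eval_set = [(X_test, y_test)]
xgb1.fit(X_train, y_train,
         early_stopping_rounds=50,
         eval_metric='logloss',
         eval_set=eval_set,
         verbose=False)
y_pred = xgb1.predict(X_test)
print('accu: %.2f%%' % (accuracy_score(y_test, y_pred) * 100))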