Task description:
Take the highest-scoring model from the previous post as the base model, perform stacking with the other models, and report the final model and its scores.
Model fusion
The best-performing models were GBDT and XGBoost (judged by accuracy and AUC).
The model evaluation code is as follows:
#GBDT
#Predicted labels and probabilities on the training set
train_gbdt_predict = clf_gbdt.predict(X_train)
train_gbdt_predict_pro = clf_gbdt.predict_proba(X_train)[:, 1]
#Predicted labels and probabilities on the test set
test_gbdt_predict = clf_gbdt.predict(X_test)
test_gbdt_predict_pro = clf_gbdt.predict_proba(X_test)[:, 1]
#Training-set scores
model_evaluation(y_train, train_gbdt_predict, train_gbdt_predict_pro)
#Test-set scores
model_evaluation(y_test, test_gbdt_predict, test_gbdt_predict_pro)
Results:
==================== Training set
accuracy: 0.856026450255  precision: 0.865979381443  recall: 0.503597122302  f1_score: 0.636846095527  roc_auc_score: 0.909330778458
==================== Test set
accuracy: 0.771548703574  precision: 0.577464788732  recall: 0.342618384401  f1_score: 0.43006993007  roc_auc_score: 0.762693916727
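The model_evaluation helper comes from the previous post and is not defined here. A minimal sketch that reproduces the five printed metrics (assuming it simply wraps sklearn.metrics) would be:

```python
# Hypothetical sketch of model_evaluation: the real helper is defined in the
# previous post; this version just prints the five scores seen in the output.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def model_evaluation(y_true, y_pred, y_pred_pro):
    """Print the five metrics reported throughout this post."""
    print('accuracy:', accuracy_score(y_true, y_pred))
    print('precision:', precision_score(y_true, y_pred))
    print('recall:', recall_score(y_true, y_pred))
    print('f1_score:', f1_score(y_true, y_pred))
    # AUC is computed from the positive-class probability, not the hard label
    print('roc_auc_score:', roc_auc_score(y_true, y_pred_pro))
```

Note that the first two arguments are hard labels while the third is the positive-class probability, which is why every call above passes both `predict` and `predict_proba[:, 1]` outputs.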
#XGBoost
#Predicted labels and probabilities on the training set
train_xgb_predict = clf_xgb.predict(X_train)
train_xgb_predict_pro = clf_xgb.predict_proba(X_train)[:, 1]
#Predicted labels and probabilities on the test set
test_xgb_predict = clf_xgb.predict(X_test)
test_xgb_predict_pro = clf_xgb.predict_proba(X_test)[:, 1]
#Training-set scores
model_evaluation(y_train, train_xgb_predict, train_xgb_predict_pro)
#Test-set scores
model_evaluation(y_test, test_xgb_predict, test_xgb_predict_pro)
Results:
==================== Training set
accuracy: 0.848512173129  precision: 0.846638655462  recall: 0.483213429257  f1_score: 0.615267175573  roc_auc_score: 0.905284436711
==================== Test set
accuracy: 0.784162578837  precision: 0.624390243902  recall: 0.356545961003  f1_score: 0.45390070922  roc_auc_score: 0.769253440164
#lightGBM
#Predicted labels and probabilities on the training set
train_lgb_predict = clf_lgb.predict(X_train)
train_lgb_predict_pro = clf_lgb.predict_proba(X_train)[:, 1]
#Predicted labels and probabilities on the test set
test_lgb_predict = clf_lgb.predict(X_test)
test_lgb_predict_pro = clf_lgb.predict_proba(X_test)[:, 1]
#Training-set scores
model_evaluation(y_train, train_lgb_predict, train_lgb_predict_pro)
#Test-set scores
model_evaluation(y_test, test_lgb_predict, test_lgb_predict_pro)
Results:
==================== Training set
accuracy: 0.994289149384  precision: 1.0  recall: 0.97721822542  f1_score: 0.988477865373  roc_auc_score: 0.999994709407
==================== Test set
accuracy: 0.768745620182  precision: 0.564444444444  recall: 0.353760445682  f1_score: 0.434931506849  roc_auc_score: 0.749950444952
LightGBM is clearly overfitting here: near-perfect training scores against a test AUC of only 0.75.
Stacking
# Stacking with XGBoost as the first-level model
import numpy as np
import xgboost as xgb
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import BayesianRidge

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)
oof_stack = np.zeros(X_train.shape[0])
predictions = np.zeros(X_test.shape[0])
#XGBoost
clf_xgb = xgb.XGBClassifier()
for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(X_train, y_train)):
    print("fold {}".format(fold_))
    trn_data, trn_y = X_train.iloc[trn_idx], y_train.iloc[trn_idx].values
    val_data, val_y = X_train.iloc[val_idx], y_train.iloc[val_idx].values
    clf_xgb.fit(trn_data, trn_y)
    # Second-level model inputs for train, validation and test; the predictions
    # are reshaped to single-feature columns because BayesianRidge expects a 2-D X
    meta_train_x = clf_xgb.predict(trn_data).reshape(-1, 1)
    meta_val_x = clf_xgb.predict(val_data).reshape(-1, 1)
    meta_test_x = clf_xgb.predict(X_test).reshape(-1, 1)
    clf_3 = BayesianRidge()
    clf_3.fit(meta_train_x, trn_y)
    oof_stack[val_idx] = clf_3.predict(meta_val_x)
    # average over the 10 folds (5 splits x 2 repeats)
    predictions += clf_3.predict(meta_test_x) / 10
# Evaluate the fused model (threshold the continuous outputs at 0.5)
model_evaluation(y_train, np.int64(oof_stack > 0.5), oof_stack)
model_evaluation(y_test, np.int64(predictions > 0.5), predictions)
Results:
==================== Training set (cross-validation)
accuracy: 0.793207093478  precision: 0.661504424779  recall: 0.358513189448  f1_score: 0.46500777605  roc_auc_score: 0.64548842274
==================== Test set
accuracy: 0.781359495445  precision: 0.614634146341  recall: 0.350974930362  f1_score: 0.446808510638  roc_auc_score: 0.682931154998
Problem:
Stacking did not improve on the single models.
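One likely cause is the meta-features: the loop above feeds the second-level model hard 0/1 labels predicted on the same folds the first-level model was trained on, which both discards probability information and leaks training data into the meta-learner. A common remedy is to build out-of-fold probability meta-features from several base models. A minimal sketch of that idea follows; it uses synthetic data and sklearn base models (GradientBoosting and LogisticRegression stand in for the XGBoost/LightGBM models of this post) so it runs without the competition data:

```python
# Sketch: stacking on out-of-fold P(y=1) meta-features instead of hard labels.
# Synthetic data stands in for the competition data used in this post.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=4590)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4590)

def oof_probabilities(model, X_tr, y_tr, X_te, n_splits=5, seed=4590):
    """Out-of-fold P(y=1) on the train set; fold-averaged P(y=1) on test."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(X_tr))
    test_pred = np.zeros(len(X_te))
    for trn_idx, val_idx in kf.split(X_tr):
        model.fit(X_tr[trn_idx], y_tr[trn_idx])
        # meta-feature: a probability, predicted on data the model never saw
        oof[val_idx] = model.predict_proba(X_tr[val_idx])[:, 1]
        test_pred += model.predict_proba(X_te)[:, 1] / n_splits
    return oof, test_pred

base_models = [GradientBoostingClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]
feats = [oof_probabilities(m, X_train, y_train, X_test) for m in base_models]
meta_train = np.column_stack([f[0] for f in feats])
meta_test = np.column_stack([f[1] for f in feats])

# Second-level learner fits on leakage-free out-of-fold features
meta_clf = LogisticRegression()
meta_clf.fit(meta_train, y_train)
auc = roc_auc_score(y_test, meta_clf.predict_proba(meta_test)[:, 1])
print('stacked test AUC:', round(auc, 3))
```

With this scheme the second-level model only ever sees predictions made on held-out data, so its cross-validated score is an honest estimate, and combining several diverse base models gives it more than one feature to work with.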