Use grid search to tune the 5 models (with 5-fold cross validation during tuning), evaluate each model, and remember to show the output of the code.
I. K-fold cross validation and grid search
1. K-fold cross validation
K-fold cross validation splits the initial sample (the sample set X, Y) into K parts: one part is held out as validation data (the test set) while the remaining K-1 parts are used for training (the train set). The procedure is repeated K times so that each part serves as the validation set exactly once; the K results are then averaged (or combined in some other way) to produce a single estimate.
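As a concrete illustration (not part of the original post), scikit-learn's cross_val_score performs exactly this procedure; the toy data and classifier below are placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# toy data purely for illustration
X, y = make_classification(n_samples=200, random_state=0)
# cv=5: split into 5 folds, train on 4, score on the held-out fold, repeat 5 times
scores = cross_val_score(LogisticRegression(solver='liblinear'), X, y, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # the single averaged estimate described above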
2. Grid search
Grid Search is a hyper-parameter tuning technique based on exhaustive search: loop over every candidate combination of parameter values, try each one, and the combination that performs best is the final answer. The principle is the same as finding the maximum of an array. (Why is it called grid search? Take a model with two parameters: if parameter a has 3 possible values and parameter b has 4, all combinations can be laid out as a 3*4 table in which every cell is one grid point, and the loop walks through each cell in turn, hence the name.)
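To make the 3*4 grid concrete, here is a minimal sketch of that exhaustive loop in pure Python; the parameter names and scoring function are made up for illustration only:

from itertools import product

param_a = [0.01, 0.1, 1]    # 3 candidate values
param_b = [1, 2, 3, 4]      # 4 candidate values

def score(a, b):
    # stand-in for "train the model with (a, b) and cross-validate it"
    return -(a - 0.1) ** 2 - (b - 2) ** 2

# visit every cell of the 3*4 grid and keep the best-scoring combination
best = max(product(param_a, param_b), key=lambda ab: score(*ab))
print(best)  # (0.1, 2)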
II. Code implementation
1. Tuning with GridSearchCV
1.1 Parameters
# parameters to tune for the 5 models
parameters_log = {'C': [0.001, 0.01, 0.1, 1, 10]}
parameters_svc = {'C': [0.001, 0.01, 0.1, 1, 10]}  # these two models score poorly to begin with, so search over fewer parameters
parameters_tree = {'max_depth': [5, 8, 15, 25, 30, None], 'min_samples_leaf': [1, 2, 5, 10],
                   'min_samples_split': [2, 5, 10, 15]}
parameters_forest = {'max_depth': [5, 8, 15, 25, 30, None], 'min_samples_leaf': [1, 2, 5, 10],
                     'min_samples_split': [2, 5, 10, 15], 'n_estimators': [7, 8, 9, 10]}  # these two models overfit badly, so give them more parameters
parameters_xgb = {'gamma': [0, 0.05, 0.1, 0.3, 0.5], 'learning_rate': [0.01, 0.015, 0.025, 0.05, 0.1],
                  'max_depth': [3, 5, 7, 9], 'reg_alpha': [0, 0.1, 0.5, 1.0]}  # this model performs well, so tune it a bit more
parameters_total = {'log_clf': parameters_log, 'svc_clf': parameters_svc, 'tree_clf': parameters_tree,
                    'forest_clf': parameters_forest, 'xgb_clf': parameters_xgb}
1.2 Split the dataset
# use the first 1000 scaled training samples as the search set
X_val = X_train_scaled[:1000]
y_val = y_train[:1000]
1.3 Collect the models in a dictionary
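The models dictionary itself is not reproduced in this section; a plausible sketch, assuming the five untuned classifiers from earlier in the post, with keys matching parameters_total above (the exact constructor arguments are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# keys must match the keys of parameters_total
models = {'log_clf': LogisticRegression(solver='liblinear'),
          'svc_clf': SVC(gamma='auto'),
          'tree_clf': DecisionTreeClassifier(),
          'forest_clf': RandomForestClassifier(),
          'xgb_clf': XGBClassifier()}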
from sklearn.model_selection import GridSearchCV

def gridsearch(X_val, y_val, models, parameters_total):
    # run a grid search for each model and keep the best estimator found
    models_grid = {}
    for model in models:
        # cv=5 means 5-fold cross validation
        grid_search = GridSearchCV(models[model], param_grid=parameters_total[model],
                                   n_jobs=-1, cv=5, verbose=10)
        grid_search.fit(X_val, y_val)
        models_grid[model] = grid_search.best_estimator_
    return models_grid
1.4 Inspect the tuned parameters
models_grid
Output:
{'forest_clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=8, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=15,
min_weight_fraction_leaf=0.0, n_estimators=8, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
'log_clf': LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
'svc_clf': SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
'tree_clf': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=10, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'),
'xgb_clf': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.05,
max_delta_step=0, max_depth=7, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=1.0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)}
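If only the winning hyper-parameter values are needed rather than the refitted estimators, GridSearchCV also exposes best_params_ and best_score_; a small variant of the loop above (a sketch, not from the original post) could collect those instead:

def gridsearch_params(X_val, y_val, models, parameters_total):
    best_params = {}
    for name in models:
        grid_search = GridSearchCV(models[name], param_grid=parameters_total[name],
                                   n_jobs=-1, cv=5)
        grid_search.fit(X_val, y_val)
        # best_params_ is the winning combination, best_score_ its mean CV score
        best_params[name] = (grid_search.best_params_, grid_search.best_score_)
    return best_params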
2. Comparison before and after parameter tuning
models_grid = gridsearch(X_val,y_val,models,parameters_total)
results_test_grid,results_train_grid = metrics(models_grid,X_train_scaled,X_test_scaled,y_train,y_test)
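metrics is the evaluation helper defined earlier in the post and is not reproduced in this section; as a rough idea of its shape, a minimal sketch that reports only accuracy (the real helper may compute more metrics and may not refit) might look like:

from sklearn.metrics import accuracy_score

def metrics(models, X_train, X_test, y_train, y_test):
    # fit each model on the full training set and report accuracy on both sets
    results_test, results_train = {}, {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results_train[name] = accuracy_score(y_train, model.predict(X_train))
        results_test[name] = accuracy_score(y_test, model.predict(X_test))
    return results_test, results_train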
On the training set:
On the test set: