Model Tuning with GridSearchCV (Grid Search)
Reference: https://blog.youkuaiyun.com/weixin_41988628/article/details/83098130
Because grid search with cross-validation is such a common tuning method, scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator. To use GridSearchCV, you first specify the parameters to search over in a dictionary (param_grid): its keys are the names of the parameters to tune, and its values are the settings to try. GridSearchCV then performs all the necessary model fits. The grid_search object we create behaves like a classifier: we can call the standard fit, predict, and score methods on it. When we call fit, however, it runs cross-validation for every parameter combination specified in param_grid. Fitting a GridSearchCV object not only searches for the best parameters, it also automatically refits a new model on the whole training set using the parameters that gave the best cross-validation performance. GridSearchCV thus offers a very convenient interface: predict and score operate on this retrained model.
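As a minimal, self-contained sketch of the interface described above (the iris dataset and an SVC are illustrative assumptions, not taken from the snippets below):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# param_grid: keys are parameter names, values are the settings to try
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # runs CV for all 9 combinations, then refits

print(grid_search.best_params_)           # best parameter combination
print(grid_search.best_score_)            # best mean cross-validation accuracy
print(grid_search.score(X_test, y_test))  # the refit model scores the test set
```

Note that score on the fitted grid_search evaluates the refit best model, so no manual retraining is needed.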
Grid-search LR and SVC to obtain the best parameter set and best accuracy:
from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'C': [100, 1000, 100000]}]
for model in models[:2]:
    clf = GridSearchCV(model, tuned_parameters, cv=10, scoring='accuracy')
    clf.fit(x_train, y_train)
    # best_estimator_ is the estimator refit on the full training set with the
    # best-found parameters; cv_results_ (the replacement for the removed
    # grid_scores_) holds, for each parameter setting, the mean score over the
    # cross-validation folds and the list of per-fold scores.
    print("Best parameters: {}".format(clf.best_estimator_))
    print("Best cv score: {}".format(clf.best_score_))
Grid-search the decision tree, random forest, and XGBoost to obtain the best parameter set and best accuracy:
tuned_parameters = [{'max_depth': range(1, 10)}]
for model in models[2:]:
    clf = GridSearchCV(model, tuned_parameters, cv=10, scoring='accuracy')
    clf.fit(x_train, y_train)
    # As above: best_estimator_ is the refit model, and the per-setting
    # mean and per-fold scores live in cv_results_.
    print("Best parameters: {}".format(clf.best_estimator_))
    print("Best cv score: {}".format(clf.best_score_))
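The comments above originally referred to grid_scores_, which was removed in scikit-learn 0.20; its information now lives in the cv_results_ dictionary. A small sketch of inspecting it (the iris dataset and a decision tree are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = GridSearchCV(DecisionTreeClassifier(random_state=0),
                   {'max_depth': range(1, 10)}, cv=10, scoring='accuracy')
clf.fit(X, y)

# cv_results_ is a dict of arrays with one entry per parameter setting:
for params, mean in zip(clf.cv_results_['params'],
                        clf.cv_results_['mean_test_score']):
    print(params, round(mean, 3))
# per-fold scores are in split0_test_score ... split9_test_score
print(clf.best_params_)
```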
Plot the cross-validated learning curve for each model:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shade one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
from sklearn.model_selection import ShuffleSplit

titles = ["Learning Curves LR", "Learning Curves SVC", "Learning Curves DT",
          "Learning Curves RFC", "Learning Curves Xgb"]
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for estimator, title in zip(models, titles):
    plot_learning_curve(estimator, title, X, y, ylim=(0.0, 1.01), cv=cv, n_jobs=4)
    plt.show()