Model Tuning with GridSearchCV (Grid Search)
Reference: https://blog.youkuaiyun.com/weixin_41988628/article/details/83098130
Because grid search with cross-validation is such a common tuning method, scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator. To use GridSearchCV, you first specify the parameters to search over in a dictionary (param_grid): its keys are the names of the parameters to tune, and its values are the settings to try. GridSearchCV then performs all the necessary model fits. The grid_search object we create behaves like a classifier: we can call the standard fit, predict, and score methods on it. When we call fit, however, it runs cross-validation for every parameter combination specified in param_grid. Fitting a GridSearchCV object not only searches for the best parameters, it also automatically refits a new model on the whole training set using the parameters that gave the best cross-validation performance. GridSearchCV thus offers a very convenient interface: predict and score operate on this retrained model.
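As a minimal, self-contained sketch of the interface described above (the iris dataset and an SVC are illustrative assumptions, not taken from the snippets below):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# param_grid: keys are parameter names, values are the settings to try
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # runs CV for all 9 combinations, then refits

print(grid_search.best_params_)           # best parameter combination
print(grid_search.best_score_)            # best mean cross-validation accuracy
print(grid_search.score(X_test, y_test))  # the refit model scores the test set
```

Note that score on the fitted grid_search evaluates the refit best model, so no manual retraining is needed.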
Grid-search LR and SVC to obtain the best parameter set and best accuracy:
from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'C': [100, 1000, 100000]}]
for model in models[:2]:
    clf = GridSearchCV(model, tuned_parameters, cv=10, scoring='accuracy')
    clf.fit(x_train, y_train)
    # best_estimator_ is the estimator refit on the full training set with the
    # best-found parameters; cv_results_ (the replacement for the removed
    # grid_scores_) holds, for each parameter setting, the mean score over the
    # cross-validation folds and the list of per-fold scores.
    print("Best parameters: {}".format(clf.best_estimator_))
    print("Best cv score: {}".format(clf.best_score_))
Grid-search the decision tree, random forest, and XGBoost to obtain the best parameter set and best accuracy:
tuned_parameters = [{'max_depth': range(1, 10)}]
for model in models[2:]:
    clf = GridSearchCV(model, tuned_parameters, cv=10, scoring='accuracy')
    clf.fit(x_train, y_train)
    # As above: best_estimator_ is the refit model, and the per-setting
    # mean and per-fold scores live in cv_results_.
    print("Best parameters: {}".format(clf.best_estimator_))
    print("Best cv score: {}".format(clf.best_score_))
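The comments above originally referred to grid_scores_, which was removed in scikit-learn 0.20; its information now lives in the cv_results_ dictionary. A small sketch of inspecting it (the iris dataset and a decision tree are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = GridSearchCV(DecisionTreeClassifier(random_state=0),
                   {'max_depth': range(1, 10)}, cv=10, scoring='accuracy')
clf.fit(X, y)

# cv_results_ is a dict of arrays with one entry per parameter setting:
for params, mean in zip(clf.cv_results_['params'],
                        clf.cv_results_['mean_test_score']):
    print(params, round(mean, 3))
# per-fold scores are in split0_test_score ... split9_test_score
print(clf.best_params_)
```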
Plot the cross-validated learning curve for each model:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shade one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
from sklearn.model_selection import ShuffleSplit

titles = ["Learning Curves LR", "Learning Curves SVC", "Learning Curves DT",
          "Learning Curves RFC", "Learning Curves Xgb"]
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for estimator, title in zip(models, titles):
    plot_learning_curve(estimator, title, X, y, ylim=(0.0, 1.01), cv=cv, n_jobs=4)
    plt.show()