scikit-learn: 3.1. Cross-validation: evaluating estimator performance

Latest recommended article published 2023-07-15 07:12:24

mmc2015


Category: scikit-learn. Tags: scikit-learn, cross-validation, model evaluation

Article link: https://blog.youkuaiyun.com/mmc2015/article/details/47099275


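This article introduces cross-validation (CV), a technique for evaluating model performance and guarding against overfitting. scikit-learn provides the cross_val_score function for running cross-validation, with support for custom scoring functions and CV strategies. The article also covers several CV strategies, including K-fold, stratified K-fold, and leave-one-out, and stresses that in some cases data preprocessing and whether to shuffle first matter. It closes with the use of cross-validation in model selection, such as grid search.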


Reference: http://scikit-learn.org/stable/modules/cross_validation.html

Overfitting is common, so part of the data is held out as a test set to check the model's performance on samples it has not seen. An intuitive example:

>>> import numpy as np
>>> # sklearn.cross_validation was removed in scikit-learn 0.20;
>>> # model_selection provides the same helpers, aliased here to keep the old name
>>> from sklearn import model_selection as cross_validation
>>> from sklearn import datasets
>>> from sklearn import svm
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))

>>> # hold out 40% of the data for testing
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...

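A second problem is that hyperparameters such as C=1 are set by hand. If they are tuned until the estimator scores well on the test set, knowledge of the test set leaks into the model and the reported score overstates generalization, which is itself a form of overfitting. This leads to a three-way split into training, validation, and test sets: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.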
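The three-way split can be sketched as follows. This is a minimal illustration rather than a prescribed recipe: the split sizes, the candidate values of C, and the variable names are assumptions chosen for the example, and the imports assume scikit-learn 0.18 or later, where the splitting helpers live in `sklearn.model_selection`.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out a final test set first (30 samples, an illustrative size).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=30, random_state=0, stratify=y)
# Split the remainder into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=30, random_state=0, stratify=y_rest)

# Choose C by looking at the validation set only.
candidate_C = [0.01, 0.1, 1, 10]  # hypothetical grid
val_scores = {}
for C in candidate_C:
    clf = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    val_scores[C] = clf.score(X_val, y_val)
best_C = max(val_scores, key=val_scores.get)

# The test set is used once, after the hyperparameter has been fixed.
final_clf = svm.SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_score = final_clf.score(X_test, y_test)
print(best_C, val_scores[best_C], test_score)
```

Only 90 of the 150 samples are left for training here, which is the cost of the three-way split on a small dataset.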

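The three-way split has a drawback of its own: when data is scarce, it further shrinks the amount available for training. This motivates cross-validation (CV for short). In the basic form, k-fold CV, the training set is split into k smaller sets (folds), and for each fold: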

A model is trained using $k-1$ of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

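The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This is computationally expensive, but it has clear advantages, chiefly that no data is set aside permanently for validation.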
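The loop described above can be written out by hand to make the averaging explicit. The sketch below assumes scikit-learn 0.18 or later (`sklearn.model_selection.KFold`); the choice of k=5, the shuffling, and the variable names are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)

k = 5
# Shuffle before splitting because the iris samples are stored sorted by class.
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds ...
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X[train_idx], y[train_idx])
    # ... and validate on the remaining fold.
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

# The reported CV score is the mean over the k folds.
cv_score = np.mean(fold_scores)
print(fold_scores, cv_score)
```

Each sample is used for validation exactly once and for training k-1 times, which is why the procedure costs k model fits.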

1. Computing cross-validated metrics

The simplest way to use CV is to call the cross_val_score helper function on the estimator and the dataset:

 
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...     clf, iris.data, iris.target, cv=5)
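As a self-contained sketch of how the returned scores might be inspected, the example below uses the `sklearn.model_selection` import path, which assumes scikit-learn 0.18 or later. The `scoring='f1_macro'` call is an illustrative addition showing one of the built-in scorer names, and the two-standard-deviation summary is just one common way to report the spread.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# One score per fold; with an integer cv and a classifier,
# the folds are stratified by class.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# The scoring parameter selects a metric other than the estimator's default.
f1_scores = cross_val_score(
    clf, iris.data, iris.target, cv=5, scoring='f1_macro')

# Summarise the fold scores as mean and spread.
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("F1 macro: %0.2f (+/- %0.2f)" % (f1_scores.mean(), f1_scores.std() * 2))
```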
 
