Cross-validation: evaluating estimator performance

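Cross-validation is an important way to evaluate model performance. This post covers train_test_split, k-fold cross-validation, stratified k-fold cross-validation, leave-one-out, and shuffle-split cross-validation. By splitting the data into training and test sets several times, cross-validation reduces the influence of any single lucky or unlucky split and gives a more reliable estimate of how well a model generalizes. Stratified k-fold keeps the class proportions in every fold the same as in the original data, which suits imbalanced classes. Leave-one-out suits small datasets. Shuffle-split gives more flexible control over the splits.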


1、train_test_split

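In a classification problem, we usually split the available data with train_test_split into a train part and a test part. The model learns from the train set through the fit method, and the score method then evaluates it on the test set. The score tells us how well the model has been trained so far.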

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm

>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)

>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))

>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)                           
0.96...

The code above uses train_test_split to take 60% of the iris data as the training set and 40% as the test set, fits an SVM model, and gets an accuracy of 0.96 on the test set.

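However, this approach has a drawback: there is only one split, so the result depends on chance. If that split happens to put all the easy samples in the training set and all the hard ones in the test set, the final score will be disappointing; the opposite split gives an overly optimistic score.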
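To see how much a single split can vary, here is a minimal sketch. It assumes the same iris data and linear SVC as above; the seed range 0 to 9 is an arbitrary choice made only for illustration. It repeats the split with different `random_state` values and compares the test scores:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

split_scores = []
for seed in range(10):  # arbitrary seeds, chosen only for illustration
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    # Train a fresh model on each split and score it on that split's test set
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    split_scores.append(clf.score(X_test, y_test))

split_scores = np.array(split_scores)
print("min={:.3f} max={:.3f} std={:.3f}".format(
    split_scores.min(), split_scores.max(), split_scores.std()))
```

The spread between the minimum and maximum gives a rough sense of how far a single reported score may depend on the split.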

2、Standard k-fold Cross Validation

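Cross-validation was proposed to address the drawback of evaluating a model on a single train_test_split.
In k-fold cross-validation, the original sample is partitioned into K subsamples. One subsample is held out to validate the model and the other K-1 subsamples are used for training. This is repeated K times so that each subsample is used for validation exactly once, and the K results are averaged (or combined in some other way) into a single estimate. The advantage is that every sample is used for both training and validation, and each sample is validated exactly once. 10-fold cross-validation is the most common choice.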

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

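cross_val_score runs k-fold cross-validation, with cv giving k. The default was 3 in older scikit-learn releases and has been 5 since version 0.22. The code above fits an SVM model on the iris dataset with 5-fold cross-validation and returns 5 accuracy scores, one per fold. When cv is an integer and the estimator is a classifier, scikit-learn uses stratified folds.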
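The k-fold procedure described above can also be written out by hand. The following is only a sketch of the idea, not a reproduction of what `cross_val_score` does internally: it assumes a plain `KFold` splitter with shuffling (the iris samples are ordered by class), so the folds and scores need not match the output shown earlier.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Shuffle before splitting because the iris samples are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    clf = svm.SVC(kernel='linear', C=1)       # fresh model for every fold
    clf.fit(X[train_idx], y[train_idx])       # train on the other K-1 folds
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))  # validate on the held-out fold

fold_scores = np.array(fold_scores)
print("per-fold accuracy:", fold_scores)
print("mean accuracy: {:.3f}".format(fold_scores.mean()))
```

Each sample lands in the held-out fold exactly once, and the mean of the per-fold scores is the single estimate.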

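Advantages of cross-validation:

  • With a plain train_test_split, the result depends on one random split. Cross-validation splits the data several times, which greatly reduces that dependence; because the model is trained and tested on many different subsets, the averaged score is a more trustworthy estimate of how well it generalizes.
  • It uses the data more efficiently than a plain train_test_split. By default train_test_split uses a train-to-test ratio of 3:1, whereas 5-fold cross-validation uses 4:1 and 10-fold cross-validation uses 9:1. More training data generally gives a more accurate model.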

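Drawbacks:

  • Consider what can go wrong with this simple form of cross-validation: the dataset has 5 classes, the samples are ordered by class, and the 5 folds happen to line up with the 5 classes, so the first fold is all class 0, the second all class 1, and so on. The model then never sees the test fold's class during training, and its score is very low, possibly 0. Other cross-validation schemes were introduced to avoid this.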
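The failure case above is easy to reproduce. The sketch below uses iris (3 classes rather than the 5 in the text, stored in class order) with an unshuffled 3-fold `KFold`, so each test fold contains exactly one class that the model never saw during training. `max_iter=1000` is an assumption added only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# Unshuffled folds: samples 0-49, 50-99 and 100-149, i.e. one class per fold
plain_kfold = KFold(n_splits=3)
plain_scores = cross_val_score(logreg, iris.data, iris.target, cv=plain_kfold)
print("unshuffled KFold scores:", plain_scores)

# Shuffling first mixes the classes across folds
shuffled_kfold = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffled_kfold)
print("shuffled KFold scores:", shuffled_scores)
```

Without shuffling, every fold scores 0 because the held-out class is absent from the training data; shuffling or stratifying avoids this.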

3、Stratified k-fold cross validation

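Stratified k-fold cross-validation is a form of cross-validation in which every fold keeps the class proportions of the original data. For example, if the data has 3 classes in the ratio 1:2:1, then each of the 3 folds in 3-fold stratified cross-validation also keeps the 1:2:1 ratio, which makes the validation results more trustworthy.
Usually the cv parameter is enough to set the number of folds. When we want finer control over the split, we can create a splitter object such as KFold or StratifiedKFold, which controls the number of folds, whether the data is shuffled, and so on, and pass it as cv.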

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import StratifiedKFold, cross_val_score
>>> from sklearn.linear_model import LogisticRegression

>>> iris = load_iris()
>>> logreg = LogisticRegression()
>>> # random_state has no effect when shuffle=False, so it is omitted
>>> strKFold = StratifiedKFold(n_splits=3, shuffle=False)
>>> scores = cross_val_score(logreg, iris.data, iris.target, cv=strKFold)
>>> print("stratified cross validation scores:{}".format(scores))
stratified cross validation scores:[0.96078431 0.92156863 0.95833333]
>>> print("Mean score of stratified cross validation:{:.2f}".format(scores.mean()))
Mean score of stratified cross validation:0.95
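To check that stratification really preserves the class proportions, a small sketch can count the classes in each test fold. It assumes the same iris data and a 3-fold `StratifiedKFold`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
X, y = iris.data, iris.target

skf = StratifiedKFold(n_splits=3)

fold_class_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Number of test samples per class in this fold
    counts = np.bincount(y[test_idx], minlength=3)
    fold_class_counts.append(counts)
    print("fold {}: test class counts = {}".format(fold, counts))

fold_class_counts = np.array(fold_class_counts)
```

Iris has 50 samples per class, so every fold should hold a nearly equal number of samples from each class.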

4、Leave-one-out Cross-validation

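Leave-one-out cross-validation is a special case of cross-validation. As the name suggests, for a sample of size n we set k = n and run n-fold cross-validation, holding out a single sample for validation each time. It is mainly used for small datasets.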

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import LeaveOneOut, cross_val_score
>>> from sklearn.linear_model import LogisticRegression

>>> iris = load_iris()
>>> logreg = LogisticRegression()
>>> loout = LeaveOneOut()
>>> scores = cross_val_score(logreg, iris.data, iris.target, cv=loout)
>>> print("leave-one-out cross validation scores:{}".format(scores))
leave-one-out cross validation scores:[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
>>> print("Mean score of leave-one-out cross validation:{:.2f}".format(scores.mean()))
Mean score of leave-one-out cross validation:0.95
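The k = n relationship can be checked directly on a toy array. In this sketch `X_small` is hypothetical data with 5 samples, used only to make the splits easy to read:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_small = np.arange(10).reshape(5, 2)  # hypothetical toy data: 5 samples, 2 features

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X_small)   # one split per sample
print("number of splits:", n_splits)

loo_splits = []
for train_idx, test_idx in loo.split(X_small):
    loo_splits.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx, "test:", test_idx)
```

Because the number of fits equals the number of samples, the cost grows quickly with dataset size, which is one reason the method is usually reserved for small datasets.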

 

5、Shuffle-split cross-validation

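Shuffle-split gives more flexible control: you can set the number of split iterations and the proportions of the test and training sets in each split (which means some samples may end up in neither the training set nor the test set).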

>>> from sklearn.model_selection import ShuffleSplit
>>> # clf is the linear SVC defined in section 2
>>> cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)  # 3 iterations
>>> cross_val_score(clf, iris.data, iris.target, cv=cv)
array([ 0.97...,  0.97...,  1.        ])
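The remark that some samples can fall in neither set applies when both `train_size` and `test_size` are given and sum to less than the whole dataset. A minimal sketch, assuming proportions of 0.5 and 0.3 chosen only for illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import ShuffleSplit

iris = datasets.load_iris()
n_samples = iris.data.shape[0]

# 50% train + 30% test leaves roughly 20% of the samples unused in each iteration
cv = ShuffleSplit(n_splits=3, train_size=0.5, test_size=0.3, random_state=0)

unused_counts = []
overlap_counts = []
for train_idx, test_idx in cv.split(iris.data):
    used = np.union1d(train_idx, test_idx)
    unused_counts.append(n_samples - used.size)
    overlap_counts.append(np.intersect1d(train_idx, test_idx).size)
    print("train={} test={} unused={}".format(
        len(train_idx), len(test_idx), n_samples - used.size))
```

Unlike k-fold, the iterations are drawn independently, so a sample may appear in the test set of several iterations or of none.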

For more on cross-validation in scikit-learn, see: http://scikit-learn.org/stable/modules/cross_validation.html
