Learning the parameters of a prediction function and testing it on the same data may yield a perfect score on that data, yet the model may fail to predict anything useful on unseen data. This situation is called overfitting, and cross-validation is the standard way to guard against it.
Titanic is clearly a supervised learning problem, so three models were tried: logistic regression (0.8249810358016404), a decision tree (0.8103866355863051), and a support vector machine (0.8159543256281413). The numbers in parentheses are the mean scores from one run of cross-validation. Logistic regression has the highest accuracy, so the learning below defaults to logistic regression.
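As a rough sketch of how such a comparison might be run (X and y here are hypothetical names for the preprocessed Titanic features and labels, which are not shown in this section):
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# X, y: assumed preprocessed Titanic features and labels (hypothetical)
models = {'logistic regression': LogisticRegression(),
          'decision tree': DecisionTreeClassifier(),
          'SVM': SVC()}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # mean of 5 cross-validation scores
    print(name, scores.mean())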
train_test_split
Use train_test_split to quickly split the data into random training and test sets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
iris.data.shape, iris.target.shape
((150, 4), (150,))
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
X_train.shape, y_train.shape
((90, 4), (90,))
X_test.shape, y_test.shape
((60, 4), (60,))
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
cross_val_score
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset. The following example shows how to estimate the accuracy of a linear-kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with a different split each time):
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
#scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')  # scoring='f1_macro' selects the scoring metric
scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
When the cv argument is an integer, cross_val_score uses KFold by default, or StratifiedKFold if the estimator derives from ClassifierMixin (i.e. it is a classifier).
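A minimal sketch of what that default means: for a classifier such as the SVC above, passing cv=5 should give the same splits (and hence the same scores) as passing a StratifiedKFold with 5 splits explicitly.
from sklearn.model_selection import StratifiedKFold
# For a classifier, cv=5 is shorthand for StratifiedKFold(n_splits=5)
scores_int = cross_val_score(clf, iris.data, iris.target, cv=5)
scores_skf = cross_val_score(clf, iris.data, iris.target, cv=StratifiedKFold(n_splits=5))
print((scores_int == scores_skf).all())  # expected: True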
Using other cross-validation strategies via cross-validation iterators
from sklearn.model_selection import ShuffleSplit
n_samples = iris.data.shape[0]
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0) #shuffle the data and build the cv iterator
cross_val_score(clf, iris.data, iris.target, cv=cv)
array([ 0.97..., 0.97..., 1. ])
StandardScaler()
Standardization, and similar data transformations, should likewise be learned from the training set only
from sklearn import preprocessing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
scaler = preprocessing.StandardScaler().fit(X_train) #fit StandardScaler on the training data X_train only
#calling fit builds a standardization transformer from the training data: scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train) #use the transformer to standardize X_train
clf = svm.SVC(C=1).fit(X_train_transformed, y_train) #fit the SVM on the standardized X_train_transformed and the labels y_train
X_test_transformed = scaler.transform(X_test) #X_test must be standardized with the same transformer
clf.score(X_test_transformed, y_test)
0.9333...
Pipeline (a Pipeline combines standardization and SVM fitting into a single step, which also keeps the scaler from ever seeing the test fold during cross-validation)
from sklearn.pipeline import make_pipeline
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cross_val_score(clf, iris.data, iris.target, cv=cv)
array([ 0.97..., 0.93..., 0.95...])
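make_pipeline is just shorthand that names the steps automatically; a sketch of the equivalent explicit Pipeline form, in case named steps are wanted later:
from sklearn.pipeline import Pipeline
# Explicit, named-steps equivalent of the make_pipeline call above
clf = Pipeline([('scaler', preprocessing.StandardScaler()),
                ('svc', svm.SVC(C=1))])
cross_val_score(clf, iris.data, iris.target, cv=cv)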
cross_validate
(the cross_validate function differs from cross_val_score in that, besides the test scores, it also returns fit times, score times and, optionally, training scores)
Single-metric evaluation: ['test_score', 'fit_time', 'score_time']
Multi-metric evaluation: ['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,cv=5, return_train_score=False)
#return_train_score is on by default here; with return_train_score=False the training scores are not returned
sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
scores['test_recall_macro']
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
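For the single-metric case listed above, a minimal sketch of the keys returned (assuming training scores are not requested):
scores = cross_validate(clf, iris.data, iris.target, cv=5, return_train_score=False)
sorted(scores.keys())
['fit_time', 'score_time', 'test_score']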
cross_val_predict (returns, for each sample, the prediction obtained when that sample was in the test fold; those predictions can then be scored directly)
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted)
0.973...
If you know the samples were generated by a time-dependent process, it is safer to use a time-series-aware cross-validation scheme.
Likewise, if you know the generating process has a group structure, it is safer to use group-wise cross-validation.
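As a sketch of the time-series case, scikit-learn's TimeSeriesSplit always trains on past samples and tests on the ones that follow:
from sklearn.model_selection import TimeSeriesSplit
X = np.arange(6)
tscv = TimeSeriesSplit(n_splits=3) #each test fold comes strictly after its training fold
for train, test in tscv.split(X):
    print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]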
KFold
(KFold divides all the samples into k groups of samples, called folds, of equal size where possible.)
An example of 2-fold cross-validation on four samples:
import numpy as np
from sklearn.model_selection import KFold
X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
Repeated K-Fold
Repeats KFold n times.
import numpy as np
from sklearn.model_selection import RepeatedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
for train, test in rkf.split(X):
    print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]
LeaveOneOut (or LOO) is a simple cross-validation scheme
from sklearn.model_selection import LeaveOneOut
X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
Each learning set is created from all the samples except one, and the test set is the single sample that was left out.
Thus, for n samples we get n different training sets and n different test sets.
This cross-validation procedure does not waste much data, since only one sample is removed from each training set.
Leave P Out (LPO)
LeavePOut creates all possible training/test sets by removing p samples from the complete set; unlike LeaveOneOut and KFold, the test sets overlap when p > 1. For n samples this yields C(n, p) pairs, so with n=4 and p=2 there are 6 splits:
from sklearn.model_selection import LeavePOut
X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
ShuffleSplit (literally: shuffle and split)
The samples are first shuffled and then split into a pair of training and test sets.
from sklearn.model_selection import ShuffleSplit
X = np.arange(5)
ss = ShuffleSplit(n_splits=3, test_size=0.25,random_state=0)
for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]
ShuffleSplit is a good alternative to KFold cross-validation: it allows finer control over the number of iterations and the proportion of samples on each side of the train/test split.
Stratified k-fold
When negative samples may outnumber positive samples several times over, the stratified sampling implemented in StratifiedKFold and StratifiedShuffleSplit is recommended:
it preserves the relative class percentages in each split, so no fold ends up with only positive or only negative samples.
from sklearn.model_selection import StratifiedKFold
X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3) #n_splits=3 is the number of splits
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
StratifiedShuffleSplit works similarly.
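A minimal sketch of StratifiedShuffleSplit on the same X and y as above (shuffled splits that keep the class ratio):
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train, test in sss.split(X, y):
    print("%s %s" % (train, test)) #each test set keeps roughly the 4:6 class ratio of y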
Finally there are the group-based splitters, such as GroupShuffleSplit, Leave P Groups Out, Leave One Group Out and Group k-fold; for more, see http://scikit-learn.org/stable/modules/cross_validation.html
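As a sketch of one of those group splitters, GroupKFold guarantees that the same group never appears in both the training and the test set (the data here mirrors the example in the scikit-learn documentation):
from sklearn.model_selection import GroupKFold
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]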