Model Evaluation, Selection, and Validation: Splitting Datasets

License: CC 4.0 BY-SA

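This article shows how to split a dataset with train_test_split and KFold from the sklearn library. Worked examples cover the test_size and train_size parameters and show how the stratify parameter keeps class proportions consistent between the training and test sets. The KFold cross-validation splitter and the effect of its shuffle parameter are covered as well.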

train_test_split

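Prototype

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Parameters

*arrays : one or more datasets of the same length to split
test_size : size of the test set
- float: fraction of the original dataset that goes to the test set
- int: absolute number of test samples
- None: test set size = original dataset size - training set size (0.25 of the data if train_size is also None)
train_size : size of the training set
- float: fraction of the original dataset that goes to the training set
- int: absolute number of training samples
- None: training set size = original dataset size - test set size
random_state : seed for the shuffling applied before the split; pass an int to get the same split on every run
shuffle : whether to shuffle the data before splitting (default True); if False, stratify must be None
stratify : array of class labels; if given, the split is stratified so that each class keeps roughly the same proportion in the training and test sets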

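Return value

A list holding the split of each input dataset in order. Every dataset is split into two parts, the training part first and the test part second, so the list has 2 * len(arrays) entries.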
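
To make the size rules above concrete, here is a minimal sketch that only counts how many samples land on each side of the split. The toy data and the helper name `split_sizes` are made up for illustration. The counts assume that a float `test_size` is rounded up to a whole number of samples, which is how current scikit-learn releases behave as far as I know.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples; the values themselves do not matter here.
X = list(range(8))


def split_sizes(**kwargs):
    """Return (n_train, n_test) for the given size arguments."""
    train, test = train_test_split(X, random_state=0, **kwargs)
    return len(train), len(test)


sizes_float = split_sizes(test_size=0.4)    # float: fraction of the data
sizes_int = split_sizes(test_size=2)        # int: absolute number of samples
sizes_train = split_sizes(train_size=0.5)   # test set is the complement
sizes_default = split_sizes()               # both None: test_size falls back to 0.25

print(sizes_float, sizes_int, sizes_train, sizes_default)
```

Note that 0.4 of 8 samples is 3.2, so the float case does not split 3/5 as one might guess; checking the resulting lengths is a cheap sanity test on small datasets.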

Example

from sklearn.model_selection import train_test_split

# 8 samples with 4 features each
X=[
    [1,2,3,4],
    [11,12,13,14],
    [21,22,23,24],
    [31,32,33,34],
    [41,42,43,44],
    [51,52,53,54],
    [61,62,63,64],
    [71,72,73,74]
]
# Binary labels: 4 samples of each class
y=[1,1,0,0,1,1,0,0]

# Plain random split: 40% of the samples go to the test set
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=0)
print('X_train:%s\nX_test:%s\ny_train:%s\ny_test:%s'%(X_train,X_test,y_train,y_test))

# Stratified split: class proportions in y are preserved on both sides
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=0,stratify=y)
print('\nStratify:\nX_train:%s\nX_test:%s\ny_train:%s\ny_test:%s'%(X_train,X_test,y_train,y_test))
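
A quick way to see what `stratify` does is to count the labels on each side of the split. The sketch below reuses the labels from the example; the feature values are placeholders, and `collections.Counter` is just one convenient way to do the counting.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

X = [[i] for i in range(8)]      # placeholder features, one per sample
y = [1, 1, 0, 0, 1, 1, 0, 0]     # same labels as the example: 4 ones, 4 zeros

# Only the label halves of the result are needed here.
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print('train:', dict(train_counts))
print('test: ', dict(test_counts))
```

With stratification, both sides should hold two samples of each class, matching the 50/50 balance of `y`. Without `stratify`, the balance depends on the random seed.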

KFold

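Prototype

class sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)
Parameters

n_splits : number of folds, at least 2 (the default is 5 since scikit-learn 0.22; it was 3 before)
shuffle : whether to shuffle the samples before splitting them into folds (default False)
random_state : seed for the shuffling; only used when shuffle=True and should be left as None otherwise

Methods

get_n_splits(X=None, y=None, groups=None) : returns the number of splitting iterations, i.e. n_splits
split(X, y=None, groups=None) : yields one pair of index arrays (training indices, test indices) per fold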
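
To illustrate what `split` yields when `shuffle=False`, here is a small pure-Python sketch of the partitioning rule. It is not scikit-learn's implementation; the function name `contiguous_folds` is hypothetical, and the rule it encodes (the first `n_samples % n_splits` folds get one extra sample) is taken from the scikit-learn documentation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) like an unshuffled k-fold split."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + 1 if fold < extra else base
        stop = start + size
        test = list(range(start, stop))
        # Everything outside the test block is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(contiguous_folds(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample ends up in exactly one test fold, which is the property that makes k-fold estimates use all of the data for evaluation.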

Example

from sklearn.model_selection import KFold
import numpy as np

X=np.array([
    [1,2,3,4],
    [11,12,13,14],
    [21,22,23,24],
    [31,32,33,34],
    [41,42,43,44],
    [51,52,53,54],
    [61,62,63,64],
    [71,72,73,74],
    [81,82,83,84]
])
y=np.array([1,1,0,0,1,1,0,0,1])

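# Without shuffling, each test fold is a consecutive block of samples.
# random_state stays at None here: recent scikit-learn releases raise a
# ValueError if it is set while shuffle=False.
folder=KFold(n_splits=3,shuffle=False)
for train_index,test_index in folder.split(X,y):
    print('Train Index:%s\nTest Index:%s\nX_train:\n%s\nX_test:\n%s\n'%
        (train_index,test_index,X[train_index],X[test_index]))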

# With shuffling, samples are assigned to folds at random;
# random_state=0 makes the assignment repeatable.
shuffle_folder=KFold(n_splits=3,random_state=0,shuffle=True)
for train_index,test_index in shuffle_folder.split(X,y):
    print('Shuffled\nTrain Index:%s\nTest Index:%s\nX_train:\n%s\nX_test:\n%s\n'%
        (train_index,test_index,X[train_index],X[test_index]))
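
As a follow-up check, the sketch below compares the test folds produced with and without shuffling. The data is a placeholder array of 9 samples and the variable names are illustrative; the point is only that shuffling changes which samples share a fold, not how many folds there are or how often each sample is tested.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(18).reshape(9, 2)   # placeholder data: 9 samples, 2 features

plain = KFold(n_splits=3)
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)

# Collect only the test indices of each fold as plain Python lists.
plain_tests = [test.tolist() for _, test in plain.split(X)]
shuffled_tests = [test.tolist() for _, test in shuffled.split(X)]

# Without shuffling, the test folds are consecutive blocks of indices.
print(plain_tests)
# With shuffling, fold membership changes, but every index is still tested once.
print(shuffled_tests)
tested_once = sorted(i for fold in shuffled_tests for i in fold)
print(tested_once)
print(plain.get_n_splits(X))
```

Because `random_state` is a fixed integer, calling `shuffled.split(X)` again should reproduce the same folds, which helps when comparing models on identical splits.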