Common train/test splitting utilities in scikit-learn


npupengsir

Article link: https://blog.youkuaiyun.com/u012897374/article/details/108864687


1. train_test_split

Performs a single split of the data into a training set and a test set.

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
"""X: array([[0, 1],
             [2, 3],
             [4, 5],
             [6, 7],
             [8, 9]])
list(y): [0, 1, 2, 3, 4]
"""

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

"""
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]
"""
train_test_split(y, shuffle=False)
# returns [[0, 1, 2], [3, 4]]

X, y: may be lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames.
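
As a minimal sketch of the point above, the snippet below passes a NumPy array, a pandas DataFrame and a plain list to a single train_test_split call. The variable names and the DataFrame column names are made up for illustration. Each input should come back in its own container type, split along the same shuffled indices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the same 5 samples held in three container types.
X_array = np.arange(10).reshape((5, 2))
X_frame = pd.DataFrame(X_array, columns=["a", "b"])  # column names are made up
y_list = [0, 1, 2, 3, 4]

# Several inputs can be passed at once; all of them are split with the same indices.
Xa_train, Xa_test, Xf_train, Xf_test, y_train, y_test = train_test_split(
    X_array, X_frame, y_list, test_size=0.33, random_state=42
)

# Each output keeps the container type of its input.
print(type(Xa_train).__name__, type(Xf_train).__name__, type(y_train).__name__)
# Number of training and test samples.
print(len(y_train), len(y_test))
```

With 5 samples and test_size=0.33, the test set size is rounded up to 2 samples, which leaves 3 for training.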

2. ShuffleSplit

sklearn.model_selection.ShuffleSplit randomly splits a dataset into a training set and a test set, and can repeat the split any number of times.

from sklearn.model_selection import ShuffleSplit
import numpy as np

X, y = np.arange(20).reshape((10, 2)), range(10)

ss = ShuffleSplit(n_splits=10, test_size=0.2, train_size=None, random_state=None)

# split() only needs the number of samples, so passing X is enough
for train_indices, test_indices in ss.split(X):
    print(f"train_indices: {train_indices}, test_indices: {test_indices}")

Output (the indices change from run to run because random_state=None):

train_indices: [4 3 0 6 8 1 9 2], test_indices: [7 5]
train_indices: [0 5 3 4 2 6 9 8], test_indices: [1 7]
train_indices: [2 0 4 1 7 6 3 9], test_indices: [5 8]
train_indices: [2 6 9 8 5 3 4 1], test_indices: [0 7]
train_indices: [0 8 7 9 4 5 2 1], test_indices: [6 3]
train_indices: [6 5 2 8 1 0 3 4], test_indices: [9 7]
train_indices: [8 4 9 5 0 3 2 6], test_indices: [1 7]
train_indices: [6 5 2 1 4 3 0 7], test_indices: [8 9]
train_indices: [8 9 1 7 4 6 5 3], test_indices: [0 2]
train_indices: [1 3 9 5 0 2 7 6], test_indices: [4 8]

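n_splits: int, number of times to re-shuffle and split into training and test sets; default 10.
test_size: float, int or None. A float must lie between 0.0 and 1.0 and gives the proportion of all samples placed in the test set; an int gives the absolute number of test samples. The effective default is 0.1, used when both test_size and train_size are left unset.
Methods:
get_n_splits: returns the number of splits
split(X, y=None, groups=None): performs the splits and yields (train, test) index arrays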
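
The parameters and methods listed above can be sketched as follows. This is an illustrative example on 10 dummy samples, not taken from the original post; the names ss_fraction and ss_count are hypothetical. It contrasts a float test_size with an int test_size and reads the number of splits back through get_n_splits.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape((10, 2))  # 10 dummy samples

# test_size as a float: a fraction of the 10 samples goes to the test set.
ss_fraction = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
# test_size as an int: an absolute number of test samples.
ss_count = ShuffleSplit(n_splits=3, test_size=3, random_state=0)

# get_n_splits reports how many (train, test) pairs split() will yield.
print(ss_fraction.get_n_splits())

# split() yields index arrays; collect their lengths for every split.
fraction_sizes = [(len(train), len(test)) for train, test in ss_fraction.split(X)]
count_sizes = [(len(train), len(test)) for train, test in ss_count.split(X)]
print(fraction_sizes)
print(count_sizes)
```

Because train_size is left unset, the training set is simply the complement of the test set in both cases.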

3. GroupShuffleSplit

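Largely the same as ShuffleSplit. The difference is that samples are first grouped, and whole groups are then assigned to either the training set or the test set, so no group ends up on both sides. Here test_size refers to the proportion (or number) of groups, not of individual samples.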

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
sample = pd.DataFrame({
	'subject':['p012', 'p012', 'p014', 'p014', 'p014', 'p024', 'p024', 'p024', 'p024', 'p081'],
	'classname':['c5','c0','c1','c5','c0','c0','c1','c1','c2','c6'],
	'img':['img_41179.jpg','img_50749.jpg','img_53609.jpg','img_52213.jpg','img_72495.jpg', 'img_66836.jpg','img_32639.jpg','img_31777.jpg','img_97535.jpg','img_1399.jpg']})

gss = GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=0)

tmp_groups = sample.loc[:, 'subject'].values

# Perform a single split
train_idxs, test_idxs = next(
    gss.split(X=sample['img'], y=sample['classname'], groups=tmp_groups))

# Perform multiple splits
for fold, (train_indices, test_indices) in enumerate(
        gss.split(sample.loc[:, "img"], sample.loc[:, "classname"], groups=tmp_groups)):
    print(f"\nfold===={fold}=====")
    print(f"train_indices: {train_indices}, test_indices: {test_indices}")
    print(f"train subjects: {sample.loc[train_indices, 'subject']}, test subjects: {sample.loc[test_indices, 'subject']}")

Output:

fold====0=====
train_indices: [0 1 2 3 4 9], test_indices: [5 6 7 8]
train subjects: 0    p012
1    p012
2    p014
3    p014
4    p014
9    p081
Name: subject, dtype: object, test subjects: 5    p024
6    p024
7    p024
8    p024
Name: subject, dtype: object

fold====1=====
train_indices: [2 3 4 5 6 7 8 9], test_indices: [0 1]
train subjects: 2    p014
3    p014
4    p014
5    p024
6    p024
7    p024
8    p024
9    p081
Name: subject, dtype: object, test subjects: 0    p012
1    p012
Name: subject, dtype: object

fold====2=====
train_indices: [0 1 2 3 4 5 6 7 8], test_indices: [9]
train subjects: 0    p012
1    p012
2    p014
3    p014
4    p014
5    p024
6    p024
7    p024
8    p024
Name: subject, dtype: object, test subjects: 9    p081
Name: subject, dtype: object

fold====3=====
train_indices: [0 1 5 6 7 8 9], test_indices: [2 3 4]
train subjects: 0    p012
1    p012
5    p024
6    p024
7    p024
8    p024
9    p081
Name: subject, dtype: object, test subjects: 2    p014
3    p014
4    p014
Name: subject, dtype: object

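As the output shows, the samples are grouped first and then split: every subject lands entirely in either the training set or the test set.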
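
The same observation can be checked programmatically instead of by eye. The sketch below reuses the subject labels from the example above with placeholder features (the X array of zeros is an assumption made only to satisfy the API) and collects, for each split, the subjects that appear on both sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Subject labels copied from the example above.
groups = np.array(['p012', 'p012', 'p014', 'p014', 'p014',
                   'p024', 'p024', 'p024', 'p024', 'p081'])
X = np.zeros((len(groups), 1))  # placeholder features, values are irrelevant here

gss = GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=0)

overlaps = []
for train_idx, test_idx in gss.split(X, groups=groups):
    # Subjects present in both the training and the test side of this split.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print(sorted(set(groups[test_idx])), "shared subjects:", shared)
```

Every entry of overlaps should be an empty set, which is exactly the guarantee that group-aware splitting is meant to provide.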

4. GroupKFold

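GroupKFold is largely the same as GroupShuffleSplit. The difference is that GroupShuffleSplit draws every split independently, so the test sets of different splits may overlap. With GroupKFold the test folds never overlap: each group appears in exactly one test fold. For that reason it has no test_size parameter, and in the scikit-learn version used here it has no random_state parameter either.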

import pandas as pd
from sklearn.model_selection import GroupKFold
sample = pd.DataFrame({
	'subject':['p012', 'p012', 'p014', 'p014', 'p014', 'p024', 'p024', 'p024', 'p024', 'p081'],
	'classname':['c5','c0','c1','c5','c0','c0','c1','c1','c2','c6'],
	'img':['img_41179.jpg','img_50749.jpg','img_53609.jpg','img_52213.jpg','img_72495.jpg', 'img_66836.jpg','img_32639.jpg','img_31777.jpg','img_97535.jpg','img_1399.jpg']})

gkf = GroupKFold(n_splits=4)

tmp_groups = sample.loc[:, 'subject'].values

# Perform a single split
train_idxs, test_idxs = next(
    gkf.split(X=sample['img'], y=sample['classname'], groups=tmp_groups))

# Perform multiple splits
for train_indices, test_indices in gkf.split(sample.loc[:, "img"], sample.loc[:, "classname"], groups=tmp_groups):
    print(f"\ntrain_indices: {train_indices}, test_indices: {test_indices}")
    print(f"train subjects: \n{sample.loc[train_indices, 'subject']}, \ntest subjects: \n{sample.loc[test_indices, 'subject']}")

Output:

train_indices: [0 1 2 3 4 9], test_indices: [5 6 7 8]
train subjects:
0    p012
1    p012
2    p014
3    p014
4    p014
9    p081
Name: subject, dtype: object,
test subjects:
5    p024
6    p024
7    p024
8    p024
Name: subject, dtype: object

train_indices: [0 1 5 6 7 8 9], test_indices: [2 3 4]
train subjects:
0    p012
1    p012
5    p024
6    p024
7    p024
8    p024
9    p081
Name: subject, dtype: object,
test subjects:
2    p014
3    p014
4    p014
Name: subject, dtype: object

train_indices: [2 3 4 5 6 7 8 9], test_indices: [0 1]
train subjects:
2    p014
3    p014
4    p014
5    p024
6    p024
7    p024
8    p024
9    p081
Name: subject, dtype: object,
test subjects:
0    p012
1    p012
Name: subject, dtype: object

train_indices: [0 1 2 3 4 5 6 7 8], test_indices: [9]
train subjects:
0    p012
1    p012
2    p014
3    p014
4    p014
5    p024
6    p024
7    p024
8    p024
Name: subject, dtype: object,
test subjects:
9    p081
Name: subject, dtype: object

The result is split by group, and no group is repeated across the test folds.
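
A small sketch to confirm the no-overlap property, again using the subject labels from the example with placeholder features (the X array of zeros is an assumption for illustration). It gathers the test indices of all folds and checks that together they cover every sample exactly once.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Subject labels copied from the example above.
groups = np.array(['p012', 'p012', 'p014', 'p014', 'p014',
                   'p024', 'p024', 'p024', 'p024', 'p081'])
X = np.zeros((len(groups), 1))  # placeholder features, values are irrelevant here

gkf = GroupKFold(n_splits=4)

# Keep only the test indices of each fold.
test_folds = [test_idx for _, test_idx in gkf.split(X, groups=groups)]

# If the folds do not overlap, concatenating them gives each index exactly once.
all_test = np.sort(np.concatenate(test_folds))
print(all_test)
```

With four subjects and n_splits=4, each test fold holds exactly one subject, and the union of the folds is the whole dataset.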