sklearn.model_selection.StratifiedShuffleSplit

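This article covers how to use sklearn's StratifiedShuffleSplit and when to apply it. The class splits a dataset so that each class makes up roughly the same proportion of the training set and the test set as it does of the full dataset.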
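As a minimal sketch of that guarantee, the snippet below splits a small synthetic label array and compares class proportions before and after the split. The labels, the placeholder features, and the helper `class_proportions` are assumptions made up for illustration; they are not part of the housing example.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 80% class 0, 15% class 1, 5% class 2.
y = np.array([0] * 800 + [1] * 150 + [2] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, unused by the splitter

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# split() yields (train_indices, test_indices) pairs; n_splits=1 gives one pair.
train_idx, test_idx = next(splitter.split(X, y))


def class_proportions(labels):
    """Return the fraction of samples in each class, ordered by class label."""
    counts = np.bincount(labels, minlength=3)
    return counts / counts.sum()


overall = class_proportions(y)
train_props = class_proportions(y[train_idx])
test_props = class_proportions(y[test_idx])

print("overall:", overall)
print("train:  ", train_props)
print("test:   ", test_props)
```

The three printed rows should agree closely, which is the property a plain random split does not promise for small or rare classes.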


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load the data (raw string so the backslash is not read as an escape)
housing = pd.read_csv(r"housing\housing.csv")

# Look at the first five rows
housing.head()

# Print a summary of the columns: dtypes and non-null counts
housing.info()

# ocean_proximity is the only non-numeric attribute; list its categories
# and how many districts fall into each
housing["ocean_proximity"].value_counts()

# Summary statistics of the numeric attributes
housing.describe()

# Histogram of every numeric attribute
housing.hist(bins=50, figsize=(20, 15))
plt.show()

# Purely random sampling
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
test_set.head()

# Histogram of the median income
housing["median_income"].hist()
plt.show()

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].head()

# Label those above 5 as 5 (assign the result instead of using inplace=True,
# which is unreliable on a selected column)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
housing["income_cat"].hist()
plt.show()

# Stratified sampling on the income category with sklearn's StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices; .loc works here because housing
# still has the default RangeIndex from read_csv
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Income category proportions in the stratified test set
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
len(strat_test_set)

# Income category proportions in the full dataset
# (value_counts() returns how often each value occurs)
housing["income_cat"].value_counts() / len(housing)
len(housing)

# Compare stratified sampling with random sampling
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportion
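The stratified-versus-random comparison can be tried end to end without the CSV file. The following is only a sketch under stated assumptions: the `median_income` values are synthetic (drawn from a log-normal distribution chosen for illustration, not taken from the real housing data), and the two error columns and the name `random_test_set` are additions for this example rather than something the listing above defines.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the housing data (assumption: not the real dataset).
rng = np.random.default_rng(42)
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)}
)

# Same binning as in the walkthrough: divide by 1.5, round up, cap at 5.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5).clip(upper=5.0)


def income_cat_proportions(data):
    """Share of rows falling in each income category."""
    return data["income_cat"].value_counts() / len(data)


# Stratified split on income_cat; iloc because split() yields positions.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))
strat_test_set = housing.iloc[test_index]

# Purely random split for comparison.
_, random_test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame(
    {
        "Overall": income_cat_proportions(housing),
        "Stratified": income_cat_proportions(strat_test_set),
        "Random": income_cat_proportions(random_test_set),
    }
).sort_index()

# Absolute deviation of each test set from the overall proportions
# (hypothetical column names, added for this sketch).
compare_props["Strat. abs error"] = (
    compare_props["Stratified"] - compare_props["Overall"]
).abs()
compare_props["Rand. abs error"] = (
    compare_props["Random"] - compare_props["Overall"]
).abs()

print(compare_props)
```

On data like this, the stratified column usually tracks the overall proportions more closely than the random column, though the size of the gap depends on the sample size and the random seed.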