Handling Non-Numeric Features in sklearn (Worked Example)

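This post covers how sklearn handles non-numeric features. There are two approaches, OrdinalEncoder and OneHotEncoder: the former converts each categorical feature into a new integer feature, while the latter encodes every possible value of a feature as 1 when present and 0 when absent. A parameter can also specify the categories to encode, and OneHotEncoder can be configured with handle_unknown='ignore'.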

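Handling non-numeric features in sklearn

sklearn offers two ways to handle non-numeric features.

1. The first is OrdinalEncoder. This estimator converts each categorical feature into a new integer feature with values from 0 to n_categories - 1.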

from sklearn import preprocessing
encoder1=preprocessing.OrdinalEncoder()
X= [[23,'male', 'from US', 'uses Safari'], [26,'female', 'from Europe', 'uses Firefox'],[27,'female', 'from Asia', 'uses Google']]# the dataset has three samples
encoder1.fit(X)# first fit the encoder
encoder1.transform(X)# then use the fitted encoder to transform the samples
array([[0., 1., 2., 2.],
       [1., 0., 1., 0.],
       [2., 0., 0., 1.]])
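
The mapping learned by the encoder can be inspected through its `categories_` attribute, and `inverse_transform` maps the integer codes back to the original values. The following is a minimal sketch on a reduced, hypothetical two-column dataset (the names `X_demo` and `demo_encoder` are illustrative and not part of the example above):

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: two categorical columns (gender, location)
X_demo = [['male', 'from US'], ['female', 'from Europe'], ['female', 'from Asia']]

demo_encoder = OrdinalEncoder()
codes = demo_encoder.fit_transform(X_demo)

# categories_ holds one sorted array of category values per column;
# the integer code of a value is its index in that array
gender_categories = list(demo_encoder.categories_[0])
location_categories = list(demo_encoder.categories_[1])

# inverse_transform recovers the original values from the integer codes
restored = demo_encoder.inverse_transform(codes).tolist()

print(gender_categories, location_categories)
print(codes.tolist())
print(restored)
```

Because the codes follow the sorted order of the category values, they imply an ordering that may not be meaningful for features such as location.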

2. The other is OneHotEncoder. For every possible value of every feature, it outputs 1 if the sample has that value and 0 otherwise, so one-hot encoded feature vectors can become very long.

encoder2=preprocessing.OneHotEncoder()
encoder2.fit(X)
encoder2.transform(X).toarray()
array([[1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0.]])

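As shown, the length of the encoded feature vector equals the sum of the numbers of possible values over all features (here 3 + 2 + 3 + 3 = 11).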
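
The width rule can be checked directly from the fitted encoder. This is a small sketch on hypothetical toy data (`X_demo` and `onehot` are illustrative names), comparing the number of output columns with the category counts stored in `categories_`:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 2 distinct genders, 3 distinct locations
X_demo = [['male', 'from US'], ['female', 'from Europe'], ['female', 'from Asia']]

onehot = OneHotEncoder()
encoded = onehot.fit_transform(X_demo).toarray()

# One output column per distinct value of each input feature
n_values_per_feature = [len(c) for c in onehot.categories_]
expected_width = sum(n_values_per_feature)

print(n_values_per_feature, expected_width, encoded.shape)
```

Each encoded row contains exactly one 1 per original feature, so every row here sums to 2.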

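The categories parameter can also be used to specify the categories to encode. For feature values that may not have appeared in the training data, we can set handle_unknown='ignore'. In the sklearn version used here, only OneHotEncoder accepts this parameter (newer releases also give OrdinalEncoder a handle_unknown option, with the value 'use_encoded_value').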

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers],handle_unknown='ignore')
# Note that the 2nd feature of the second sample holds a value ('from Suzhou')
# that is not in the categories list
X = [['male', 'from US', 'uses Safari'], ['female', 'from Suzhou', 'uses Firefox']]# 'from Suzhou' is the value that gets ignored
enc.fit(X) 
enc.transform([['female', 'from Suzhou', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])

As shown, the encoded feature vector still has length 10: "from Suzhou" is ignored, so all four location columns are 0.
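
To make the effect of handle_unknown concrete, the sketch below contrasts the default setting ('error') with 'ignore' on a hypothetical single-column dataset (`X_train`, `X_new`, `strict`, and `lenient` are illustrative names):

```python
from sklearn.preprocessing import OneHotEncoder

X_train = [['from US'], ['from Europe']]
X_new = [['from Suzhou']]  # a value never seen during fit

# Default behaviour (handle_unknown='error'): transform raises ValueError
strict = OneHotEncoder()
strict.fit(X_train)
try:
    strict.transform(X_new)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore': the unknown value becomes an all-zero block
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(X_train)
ignored = lenient.transform(X_new).toarray()

print(raised, ignored.tolist())
```

An all-zero block keeps the vector length fixed, but the model can no longer tell an unknown value apart from "none of the known values".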

Next, we demonstrate both encoders on a real example.

import pandas as pd
from sklearn import preprocessing
from sklearn import svm
data=pd.read_csv("data/breast-cancer/breast-cancer.data",names=["class","age","menopause","tumor-size","inv-nodes","node-caps","deg-malig","breast","breast-quad","irradiat"])
data.head()# breast cancer data
                  class    age  menopause  tumor-size  inv-nodes  node-caps  deg-malig  breast  breast-quad  irradiat
0  no-recurrence-events  30-39    premeno       30-34        0-2         no          3    left     left_low        no
1  no-recurrence-events  40-49    premeno       20-24        0-2         no          2   right     right_up        no
2  no-recurrence-events  40-49    premeno       20-24        0-2         no          2    left     left_low        no
3  no-recurrence-events  60-69       ge40       15-19        0-2         no          2   right      left_up        no
4  no-recurrence-events  40-49    premeno         0-4        0-2         no          2   right    right_low        no

The first encoding (OrdinalEncoder)

train_num=int(0.75*len(data))
print(train_num)
encoder=preprocessing.OrdinalEncoder()
encoder.fit(data.iloc[:,:-1])
num_data=encoder.transform(data.iloc[:,:-1])
num_data
214
array([[0., 1., 2., ..., 2., 0., 2.],
       [0., 2., 2., ..., 1., 1., 5.],
       [0., 2., 2., ..., 1., 0., 2.],
       ...,
       [1., 4., 0., ..., 0., 1., 3.],
       [1., 2., 0., ..., 2., 0., 2.],
       [1., 3., 0., ..., 2., 0., 2.]])
## Data preprocessing
#min_max_scaler=preprocessing.MinMaxScaler()
#num_data=min_max_scaler.fit_transform(num_data)
num_data=preprocessing.scale(num_data)
print(num_data.mean(axis=0))
print(num_data.std(axis=0))
[ 4.96883032e-17  1.92542175e-16  1.52170429e-16 -1.73909061e-16
  7.45324548e-17 -2.23597364e-16  0.00000000e+00 -7.45324548e-17
 -1.52170429e-16]
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
## Label values: the last column (irradiat) serves as the target here
encoder.fit(data.iloc[:,-1:])
num_target=encoder.transform(data.iloc[:,-1:])
model=svm.SVC(gamma='scale',C=1,kernel='rbf')# kernel options include sigmoid, rbf, poly, precomputed
# ravel() flattens the (n_samples, 1) label column to 1-D, which avoids sklearn's DataConversionWarning
model.fit(num_data[:train_num,:],num_target[:train_num,:].ravel())

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
model.score(num_data[:train_num,:],num_target[:train_num,:])
0.8925233644859814

The second encoding (OneHotEncoder)


encoder=preprocessing.OneHotEncoder(handle_unknown="ignore")
encoder.fit(data.iloc[:,:-1])
num_data=encoder.transform(data.iloc[:,:-1]).toarray()
num_data
array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
num_data=preprocessing.scale(num_data)
print(num_data.mean(axis=0))
print(num_data.std(axis=0))
[-1.43940803e-15  1.43940803e-15 -1.51927810e-16 -1.50229479e-16
  6.67686574e-17  8.38490117e-17  1.26549897e-16 -2.18162706e-16
  4.96883032e-17 -1.31596366e-16 -7.76379738e-18  3.72856369e-16
 -2.74062047e-16 -1.86331137e-17  5.04646829e-18 -1.25773517e-16
 -2.85707743e-16  2.36019440e-16 -3.84307970e-17  5.91989550e-18
  2.86290028e-17  1.55275948e-16  1.41301112e-15 -1.58284419e-16
 -3.33843287e-17 -1.33052078e-16  9.72027431e-16 -3.65286666e-16
  5.23862228e-16  2.14863092e-16 -8.48583053e-16 -3.28796819e-16
  4.28561615e-16 -1.27326277e-16  6.35078625e-16  2.40677719e-17
 -2.40677719e-17 -8.35093455e-17 -4.93001133e-17  1.64592504e-16
  3.10551895e-17  1.46735770e-16 -5.16292525e-17]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
## Reuse the label values computed earlier
model=svm.SVC(gamma="scale",C=1,kernel="rbf")
# ravel() flattens the (n_samples, 1) label column to 1-D, which avoids sklearn's DataConversionWarning
model.fit(num_data[:train_num,:],num_target[:train_num,:].ravel())

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
model.score(num_data[:train_num,:],num_target[:train_num,:])
0.9018691588785047
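
Both scores above are measured on the same rows that were used for training. As a hedged sketch of how the remaining 25% of rows could be used for evaluation, the code below uses a small synthetic dataset (a hypothetical stand-in, not the breast-cancer file) and wraps the encoder and the SVC in a pipeline so that the encoder is fit on the training rows only. The feature names and the labeling rule are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 200

# Synthetic categorical features (hypothetical stand-in data)
colors = rng.choice(['red', 'green', 'blue'], size=n)
sizes = rng.choice(['S', 'M', 'L'], size=n)
X_demo = np.column_stack([colors, sizes]).astype(object)

# Illustrative labeling rule: the label depends only on the first feature
y_demo = (colors == 'red').astype(int)

# Same 75% / 25% positional split as in the example above
train_num = int(0.75 * n)

# The pipeline fits the encoder on training rows only;
# handle_unknown='ignore' protects against values first seen at test time
pipe = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    SVC(gamma='scale', C=1, kernel='rbf'),
)
pipe.fit(X_demo[:train_num], y_demo[:train_num])

train_acc = pipe.score(X_demo[:train_num], y_demo[:train_num])
test_acc = pipe.score(X_demo[train_num:], y_demo[train_num:])
print(train_acc, test_acc)
```

Comparing the two numbers gives a rough indication of overfitting, which the training-set score alone cannot show.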