Handling non-numeric features in sklearn (with a worked example)


License: CC 4.0 BY-SA


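This post describes how sklearn handles non-numeric features. There are two encoders, OrdinalEncoder and OneHotEncoder. The former turns each categorical feature into a new integer feature; the latter creates one column per possible value of each feature, set to 1 when the sample has that value and 0 otherwise. The categories to encode can be specified through a parameter, and OneHotEncoder can be configured with handle_unknown='ignore'.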

Handling non-numeric features in sklearn

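sklearn offers two ways to handle non-numeric features.

1. The first is OrdinalEncoder. This estimator converts each categorical feature into a new integer feature (0 to n_categories - 1).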

from sklearn import preprocessing
encoder1=preprocessing.OrdinalEncoder()
X= [[23,'male', 'from US', 'uses Safari'], [26,'female', 'from Europe', 'uses Firefox'],[27,'female', 'from Asia', 'uses Google']]# the dataset has only three samples
encoder1.fit(X)# fit the encoder first
encoder1.transform(X)# use the fitted encoder to transform the samples

array([[0., 1., 2., 2.],
       [1., 0., 1., 0.],
       [2., 0., 0., 1.]])
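To make the mapping concrete, here is a minimal pure-Python sketch of the idea behind ordinal encoding: each column's distinct values are sorted, and every value is replaced by its position in that sorted list. The helper name `ordinal_encode` is made up for this illustration and this is not scikit-learn's actual implementation, but on the data above it reproduces the same numbers.

```python
# Same three samples as in the example above.
X = [
    [23, "male", "from US", "uses Safari"],
    [26, "female", "from Europe", "uses Firefox"],
    [27, "female", "from Asia", "uses Google"],
]

def ordinal_encode(rows):
    """Hypothetical helper: replace each value by its index among the
    sorted distinct values of its column."""
    columns = list(zip(*rows))                          # column-wise view
    categories = [sorted(set(col)) for col in columns]  # one sorted list per column
    lookup = [{value: i for i, value in enumerate(cats)} for cats in categories]
    encoded = [
        [float(lookup[j][value]) for j, value in enumerate(row)]
        for row in rows
    ]
    return encoded, categories

encoded, categories = ordinal_encode(X)
for row in encoded:
    print(row)
print(categories[2])  # sorted categories of the location column
```

Because the integers come from sort order, they imply an ordering ("from Asia" < "from Europe" < "from US") that has no real meaning for a feature like location; this is the usual reason to consider one-hot encoding instead.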

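2. The other is OneHotEncoder. For each feature it creates one column per possible value; a sample gets 1 in the column of the value it has and 0 in the others, so one-hot encoded feature vectors can become very long.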

encoder2=preprocessing.OneHotEncoder()
encoder2.fit(X)
encoder2.transform(X).toarray()

array([[1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0.]])

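As you can see, the length of the encoded feature vector is the sum of the number of distinct values of every feature (here 3 + 2 + 3 + 3 = 11).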
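The width calculation can be checked with a small pure-Python sketch of one-hot encoding. The helper `one_hot_encode` is hypothetical and only mirrors the idea (one 0/1 column per sorted distinct value of each feature); it is not how scikit-learn implements it.

```python
# Same three samples as in the OrdinalEncoder example.
X = [
    [23, "male", "from US", "uses Safari"],
    [26, "female", "from Europe", "uses Firefox"],
    [27, "female", "from Asia", "uses Google"],
]

def one_hot_encode(rows):
    """Hypothetical helper: one 0/1 column per distinct value of each feature."""
    columns = list(zip(*rows))
    categories = [sorted(set(col)) for col in columns]
    encoded = []
    for row in rows:
        vector = []
        for value, cats in zip(row, categories):
            # Exactly one entry per feature is 1.0: the one matching this value.
            vector.extend(1.0 if value == cat else 0.0 for cat in cats)
        encoded.append(vector)
    return encoded, categories

encoded, categories = one_hot_encode(X)
widths = [len(cats) for cats in categories]
print(widths, sum(widths))  # distinct values per feature and the total width
print(encoded[0])
```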

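You can also use the categories parameter to state the possible values of each feature explicitly. For feature values that may not appear in the training data, we can set handle_unknown='ignore'. In the sklearn version used here this parameter exists only on OneHotEncoder; newer releases (0.24 and later) also give OrdinalEncoder a handle_unknown parameter, with the different option 'use_encoded_value'.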

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers],handle_unknown='ignore')
# Note that there are missing categorical values for the 2nd and 3rd
# feature
X = [['male', 'from US', 'uses Safari'], ['female', 'from Suzhou', 'uses Firefox']]# 'from Suzhou' is not in locations, so it is the value that gets ignored
enc.fit(X) 
enc.transform([['female', 'from Suzhou', 'uses Chrome']]).toarray()

array([[1., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])

As you can see, the encoded feature vector still has length 10 (2 + 4 + 4); "from Suzhou" is ignored, so all four location columns are 0.
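For contrast, the sketch below runs the same input through two encoders: one left at the default handle_unknown='error' and one set to 'ignore'. It assumes a scikit-learn installation; the variable names `strict` and `lenient` and the training rows are chosen only for this illustration.

```python
from sklearn.preprocessing import OneHotEncoder

genders = ["female", "male"]
locations = ["from Africa", "from Asia", "from Europe", "from US"]
browsers = ["uses Chrome", "uses Firefox", "uses IE", "uses Safari"]
categories = [genders, locations, browsers]

X_train = [["male", "from US", "uses Safari"],
           ["female", "from Europe", "uses Firefox"]]
X_new = [["female", "from Suzhou", "uses Chrome"]]  # 'from Suzhou' is not in locations

# Default behaviour: an unknown category at transform time raises ValueError.
strict = OneHotEncoder(categories=categories)
strict.fit(X_train)
try:
    strict.transform(X_new)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict encoder rejected the row:", exc)

# With handle_unknown='ignore' the unknown value becomes an all-zero block.
lenient = OneHotEncoder(categories=categories, handle_unknown="ignore")
lenient.fit(X_train)
row = lenient.transform(X_new).toarray()[0]
print(row)
```

Which behaviour is preferable depends on the application: raising an error surfaces unexpected data early, while 'ignore' keeps a prediction service running when a new category shows up.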

Below we walk through a worked example.

import pandas as pd
from sklearn import preprocessing
from sklearn import svm
data=pd.read_csv("data/breast-cancer/breast-cancer.data",names=["class","age","menopause","tumor-size","inv-nodes","node-caps","deg-malig","breast","breast-quad","irradiat"])
data.head()# breast cancer data

	class	age	menopause	tumor-size	inv-nodes	node-caps	deg-malig	breast	breast-quad	irradiat
0	no-recurrence-events	30-39	premeno	30-34	0-2	no	3	left	left_low	no
1	no-recurrence-events	40-49	premeno	20-24	0-2	no	2	right	right_up	no
2	no-recurrence-events	40-49	premeno	20-24	0-2	no	2	left	left_low	no
3	no-recurrence-events	60-69	ge40	15-19	0-2	no	2	right	left_up	no
4	no-recurrence-events	40-49	premeno	0-4	0-2	no	2	right	right_low	no

First encoding (OrdinalEncoder)

train_num=int(0.75*len(data))
print(train_num)
encoder=preprocessing.OrdinalEncoder()
encoder.fit(data.iloc[:,:-1])
num_data=encoder.transform(data.iloc[:,:-1])
num_data

214
array([[0., 1., 2., ..., 2., 0., 2.],
       [0., 2., 2., ..., 1., 1., 5.],
       [0., 2., 2., ..., 1., 0., 2.],
       ...,
       [1., 4., 0., ..., 0., 1., 3.],
       [1., 2., 0., ..., 2., 0., 2.],
       [1., 3., 0., ..., 2., 0., 2.]])

## Data preprocessing: standardize every column to zero mean and unit variance
#min_max_scaler=preprocessing.MinMaxScaler()
#num_data=min_max_scaler.fit_transform(num_data)
num_data=preprocessing.scale(num_data)
print(num_data.mean(axis=0))
print(num_data.std(axis=0))

[ 4.96883032e-17  1.92542175e-16  1.52170429e-16 -1.73909061e-16
  7.45324548e-17 -2.23597364e-16  0.00000000e+00 -7.45324548e-17
 -1.52170429e-16]
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
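The printed means (all around 1e-16, i.e. zero up to floating-point error) and standard deviations (all 1) are what standardization produces. A minimal NumPy sketch of the same computation on a small made-up array, assuming the population standard deviation (ddof=0) is used:

```python
import numpy as np

# Small made-up matrix standing in for the ordinal-encoded data.
num_data = np.array([[0.0, 1.0, 2.0],
                     [1.0, 0.0, 1.0],
                     [2.0, 0.0, 0.0]])

mean = num_data.mean(axis=0)      # per-column mean
std = num_data.std(axis=0)        # per-column population std (ddof=0)
scaled = (num_data - mean) / std  # subtract the mean, divide by the std

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # 1 for every column
```

This plain version would divide by zero on a constant column, so it is only meant to show the arithmetic, not to replace the library function.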

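## Labels: the last column (irradiat) is used as the target, so the class column stays among the features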
encoder.fit(data.iloc[:,-1:])
num_target=encoder.transform(data.iloc[:,-1:])

model=svm.SVC(gamma='scale',C=1,kernel='rbf')# other kernel options: linear, poly, sigmoid, precomputed

# ravel() turns the (n_samples, 1) label column into the 1-D array SVC expects, which avoids a DataConversionWarning
model.fit(num_data[:train_num,:],num_target[:train_num,:].ravel())

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

# Accuracy on the training rows themselves; the remaining 25% of the data is not evaluated here
model.score(num_data[:train_num,:],num_target[:train_num,:])

0.8925233644859814

Second encoding (OneHotEncoder)


encoder=preprocessing.OneHotEncoder(handle_unknown="ignore")
encoder.fit(data.iloc[:,:-1])
num_data=encoder.transform(data.iloc[:,:-1]).toarray()
num_data

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

num_data=preprocessing.scale(num_data)
print(num_data.mean(axis=0))
print(num_data.std(axis=0))

[-1.43940803e-15  1.43940803e-15 -1.51927810e-16 -1.50229479e-16
  6.67686574e-17  8.38490117e-17  1.26549897e-16 -2.18162706e-16
  4.96883032e-17 -1.31596366e-16 -7.76379738e-18  3.72856369e-16
 -2.74062047e-16 -1.86331137e-17  5.04646829e-18 -1.25773517e-16
 -2.85707743e-16  2.36019440e-16 -3.84307970e-17  5.91989550e-18
  2.86290028e-17  1.55275948e-16  1.41301112e-15 -1.58284419e-16
 -3.33843287e-17 -1.33052078e-16  9.72027431e-16 -3.65286666e-16
  5.23862228e-16  2.14863092e-16 -8.48583053e-16 -3.28796819e-16
  4.28561615e-16 -1.27326277e-16  6.35078625e-16  2.40677719e-17
 -2.40677719e-17 -8.35093455e-17 -4.93001133e-17  1.64592504e-16
  3.10551895e-17  1.46735770e-16 -5.16292525e-17]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

## Reuse the label values computed earlier

model=svm.SVC(gamma="scale",C=1,kernel="rbf")
# ravel() turns the (n_samples, 1) label column into the 1-D array SVC expects, which avoids a DataConversionWarning
model.fit(num_data[:train_num,:],num_target[:train_num,:].ravel())

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

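# Again the accuracy on the training rows, so it is comparable with the first encoding's score above
model.score(num_data[:train_num,:],num_target[:train_num,:])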

0.9018691588785047
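Both scores above are measured on the rows the model was trained on, so they say little about how it generalizes. One possible way to get a held-out estimate is sketched below. It is not the original author's method: it assumes a scikit-learn installation, uses made-up toy data in place of the breast-cancer file, shuffles and stratifies the split, and puts the encoder in a pipeline so that it is fitted on the training rows only. The scaling step is left out for brevity.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data standing in for the breast-cancer table.
random.seed(0)
ages = ["30-39", "40-49", "50-59"]
sides = ["left", "right"]
X = [[random.choice(ages), random.choice(sides)] for _ in range(200)]
# Toy rule with noise: the label mostly follows the second column.
y = [int(side == "left") if random.random() < 0.9 else int(side != "left")
     for _, side in X]

# Shuffled, stratified 75/25 split instead of taking the first 75% of rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline fits the encoder on the training rows only.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    SVC(gamma="scale", C=1, kernel="rbf"),
)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

Swapping OneHotEncoder for OrdinalEncoder in the pipeline would give a like-for-like comparison of the two encodings on held-out data.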