Data Normalization and Common Encodings: Label Encoding and One-Hot Encoding

License: CC 4.0 BY-SA


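Machine learning data contains different types of features: continuous features and discrete features.
Once you have the raw features, each one needs to be normalized separately. For example, suppose feature A ranges over [-1000, 1000] and feature B ranges over [-1, 1]. Logistic regression computes w1x1 + w2x2; with weights of comparable size, x1 is so much larger that x2 contributes almost nothing to the sum. The features therefore have to be normalized, and each feature is normalized on its own.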
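The scale problem described above can be checked numerically. The following is a minimal sketch, not part of the original article: the uniform sampling, the seed, and the equal weights `w1 = w2 = 1.0` are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical data: feature A in [-1000, 1000], feature B in [-1, 1].
rng = np.random.default_rng(0)
x1 = rng.uniform(-1000, 1000, size=1000)
x2 = rng.uniform(-1, 1, size=1000)

# Assumed equal weights, so any imbalance comes only from the feature scales.
w1, w2 = 1.0, 1.0
score = w1 * x1 + w2 * x2

# Share of the variation in the linear score that is attributable to x1.
var_x1 = np.var(w1 * x1)
var_x2 = np.var(w2 * x2)
share_x1 = var_x1 / (var_x1 + var_x2)
print(f"share of score variance from x1: {share_x1:.6f}")
```

Under these assumptions nearly all of the variation in the score comes from x1, which is the effect normalization is meant to remove.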

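For continuous features:
Rescale bounded continuous features: x' = (2x - max - min)/(max - min), where max and min are the largest and smallest values of that feature.
This linearly rescales the feature to [-1, 1].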
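The rescaling formula can be sketched in NumPy as follows. The function name `rescale_to_unit_range` and the handling of a constant feature (returning zeros) are assumptions of this sketch, not something the article specifies.

```python
import numpy as np

def rescale_to_unit_range(x):
    """Linearly map a 1-D feature onto [-1, 1] using its own min and max."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # A constant feature would divide by zero; returning zeros is an
        # assumption made for this sketch.
        return np.zeros_like(x)
    return (2 * x - hi - lo) / (hi - lo)

# Example values inside the [-1000, 1000] range used in the text.
feature_a = np.array([-1000.0, -500.0, 0.0, 250.0, 1000.0])
scaled_a = rescale_to_unit_range(feature_a)
print(scaled_a)
```

The minimum maps to -1 and the maximum to 1. scikit-learn's `MinMaxScaler(feature_range=(-1, 1))` applies the same kind of mapping column by column.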

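Standardize all continuous features: x' = (x - u)/s, where u is the feature's mean and s is its standard deviation.
This rescales the feature to zero mean and unit variance.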
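Standardization can be sketched the same way. The function name `standardize` and the sample values are assumptions chosen for illustration; the sketch uses the population standard deviation (`ddof=0`).

```python
import numpy as np

def standardize(x):
    """Shift a 1-D feature to zero mean and scale it to unit variance."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    std = x.std()  # population standard deviation (ddof=0)
    if std == 0:
        # Constant feature: nothing to scale; returning zeros is an assumption.
        return np.zeros_like(x)
    return (x - mean) / std

# Sample values with mean 5 and standard deviation 2.
feature_b = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = standardize(feature_b)
print(z)
```

For a whole feature matrix, scikit-learn's `StandardScaler` performs this per column, which matches the requirement that each feature be normalized separately.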

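For discrete features:
Binarize categorical/discrete features: discrete features are generally encoded with one-hot encoding. A feature that takes k distinct values is represented with k dimensions, exactly one of which is 1 for any given sample.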
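To make the "one dimension per value" idea concrete, here is a hand-rolled sketch in NumPy. The `colors` data is hypothetical, and this is only an illustration of the idea, not how any particular library implements it.

```python
import numpy as np

# Hypothetical categorical feature with 3 distinct values.
colors = np.array(["red", "green", "blue", "green"])

# np.unique returns the sorted distinct values and, for each sample,
# the index of its value in that sorted list.
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]
print(categories)
print(one_hot)
```

Three distinct values give three columns, and every row contains a single 1.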

1. Label encoding
LabelEncoder():
Maps each distinct value to an integer index, e.g. [1, 11, 111, 1111] -> [0, 1, 2, 3]

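from sklearn.preprocessing import LabelEncoder

s1 = [1, 11, 111, 1111]
le = LabelEncoder()
# fit() records the sorted distinct values; each value's index becomes its code
le.fit(s1)
# transform() replaces every value with its code
print(le.transform([111, 1111, 1, 1]))

Output:
[2 3 0 0]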
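As a small extension of the example above, the fitted encoder also exposes the learned classes and can map codes back to the original values. The string labels here are hypothetical sample data.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform() fits the encoder and encodes the same data in one call.
codes = le.fit_transform(["cat", "dog", "bird", "dog"])

# classes_ holds the distinct labels in sorted order; a label's position
# in this array is its integer code.
print(le.classes_)
print(codes)

# inverse_transform() recovers the original labels from the codes.
original = le.inverse_transform(codes)
print(original)
```

Because the codes follow sorted order rather than order of appearance, "bird" gets code 0 even though it appears third.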

2. One-hot encoding
OneHotEncoder():

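from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# Input is 2-D: one row per sample, one column per feature
ohe.fit([[1], [2], [3], [4]])
# transform() returns a sparse matrix; toarray() converts it to a dense array
ohe_r = ohe.transform([[4], [3], [1], [4]]).toarray()
ohe_r  # in a notebook or REPL, evaluating the name displays the array

Output:
array([[0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.]])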
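One practical detail not covered above is what happens when a value appears at transform time that was never seen during fit. By default OneHotEncoder raises an error; the sketch below assumes you would rather encode such values as an all-zero row, using the `handle_unknown="ignore"` option.

```python
from sklearn.preprocessing import OneHotEncoder

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error (the default behaviour).
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit([[1], [2], [3], [4]])

# categories_ lists the values learned for each input column.
print(ohe.categories_)

# 2 was seen during fit; 5 was not.
encoded = ohe.transform([[2], [5]]).toarray()
print(encoded)
```

Whether silently zeroing unknown values is acceptable depends on the application; failing loudly with the default setting is often the safer choice during development.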