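Preface
I am new to machine learning. This post is a set of study notes based on the official scikit-learn documentation, mixed with my own understanding. Corrections are welcome.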
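Ordinal variable encoding - OrdinalEncoder
OrdinalEncoder converts categorical features to ordered integers: each categorical feature becomes one new integer-valued feature.
Values range from 0 to n_categories - 1.
It suits ordinal variables. Applied to an unordered (nominal) variable, it imposes an ordering that the feature does not actually have.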
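To make the point about ordering concrete, here is a minimal sketch. The feature values and their order (low < medium < high) are hypothetical. It compares the default behaviour, where integers follow the sorted category names, with passing the `categories` parameter so that the integers follow the intended order.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; the order low < medium < high is an assumption
sizes = [["medium"], ["low"], ["high"], ["low"]]

# Default: categories are sorted, so high=0, low=1, medium=2
default_enc = OrdinalEncoder().fit(sizes)
default_codes = default_enc.transform(sizes).ravel()

# Explicit categories: the integers follow the order given here
ordered_enc = OrdinalEncoder(categories=[["low", "medium", "high"]]).fit(sizes)
ordered_codes = ordered_enc.transform(sizes).ravel()

print(default_codes)  # [2. 1. 0. 1.]
print(ordered_codes)  # [1. 0. 2. 0.]
```

Only the second encoding gives integers whose numeric order matches the meaning of the feature.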
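- By default, missing values represented by np.nan are passed through unchanged (they stay np.nan in the output).
- Alternatively, set encoded_missing_value to encode missing values as a specific integer:
enc = preprocessing.OrdinalEncoder(encoded_missing_value=-1)
X = [['male'], ['female'], [np.nan], ['female']]
enc.fit_transform(X)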
- This avoids the trouble of building a pipeline with SimpleImputer:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
enc = Pipeline(steps=[
    ("encoder", preprocessing.OrdinalEncoder()),
    ("imputer", SimpleImputer(strategy="constant", fill_value=-1)),
])
enc.fit_transform(X)
Ordinal encoding example
ordinal_encoder = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-2, min_frequency=10, max_categories=7)
ordinal_encoder.fit(train_df[['Title']])  # fit on the training set only, never on the test set
train_df['Title'] = ordinal_encoder.transform(train_df[['Title']])
test_df['Title'] = ordinal_encoder.transform(test_df[['Title']])  # transformed here only for later evaluation; not part of training
print(ordinal_encoder.categories_)
print(ordinal_encoder.infrequent_categories_)
# Print the categories that are not in infrequent_categories_ (i.e. the frequent ones)
print([item for item in ordinal_encoder.categories_[0] if item not in ordinal_encoder.infrequent_categories_[0]])
[array(['Billiard', 'Capt', 'Carlo', 'Col', 'Cruyssen', 'Don', 'Dr',
       'Gordon', 'Impe', 'Jonkheer', 'Major', 'Master', 'Melkebeke',
       'Messemaeker', 'Miss', 'Mr', 'Mrs', 'Mulder', 'Pelsmaeker',
       'Planke', 'Rev', 'Shawah', 'Steen', 'Velde', 'Walle', 'der', 'the',
       'y'], dtype=object)]
[array(['Billiard', 'Capt', 'Carlo', 'Col', 'Cruyssen', 'Don', 'Dr',
       'Gordon', 'Impe', 'Jonkheer', 'Major', 'Melkebeke', 'Messemaeker',
       'Mulder', 'Pelsmaeker', 'Planke', 'Rev', 'Shawah', 'Steen',
       'Velde', 'Walle', 'der', 'the', 'y'], dtype=object)]
['Master', 'Miss', 'Mr', 'Mrs']
train_title_nums = len(train_df['Title'].unique())
print(ordinal_encoder.inverse_transform([[i] for i in range(train_title_nums)]))  # the original string for each integer code
sns.displot(train_df, x="Title", hue="Survived", height=3, bins=train_title_nums)
Output: [figure: distribution of the encoded Title values, colored by Survived]
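One-hot encoding - OneHotEncoder
OneHotEncoder one-hot encodes categorical variables: for each feature, the binary column for the observed category is 1 and all others are 0.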
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
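The categories parameter can be used to specify the categories of each feature explicitly, which also fixes the order of the binary columns.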
>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categories=[['female', 'male'],
['from Africa', 'from Asia', 'from Europe',
'from US'],
['uses Chrome', 'uses Firefox', 'uses IE',
'uses Safari']])
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
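Note the difference between unknown and missing values; they are different concepts.
- Unknown value: a concrete value that appears in the test set but was never seen in the training set.
- Missing value: no value at all; the entry is blank.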
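If unknown values may occur, you can set handle_unknown='infrequent_if_exist' instead of listing categories manually as above. Unknown values are then encoded as all zeros, or mapped to the infrequent category if infrequent categories are configured.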
>>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])
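Here 'from Asia' and 'uses Chrome' are both unknown values.
The drop parameter specifies one category per feature to drop (called the dropped category below). Each feature then uses n_categories - 1 dummy variables instead of n_categories, which avoids multicollinearity.
A dummy variable corresponds to one of the binary columns described above, so dummy variables in regression and one-hot encoding are essentially the same idea.
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)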
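- drop='if_binary' drops a category only for features with exactly two categories. In my view drop='first' is usually enough and 'if_binary' is somewhat redundant, since multicollinearity is a concern for more than just binary features.
>>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary').fit(X)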
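The multicollinearity argument can be checked numerically. The following is a rough sketch on made-up data: it appends an intercept column, as a linear regression would, and compares the rank of the design matrix with and without drop='first'.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one binary feature and one three-category feature
X = [["male", "US"], ["female", "Europe"], ["female", "Asia"], ["male", "Asia"]]

full = OneHotEncoder().fit_transform(X).toarray()
dropped = OneHotEncoder(drop="first").fit_transform(X).toarray()

# Add an intercept column of ones
ones = np.ones((len(X), 1))
full_design = np.hstack([ones, full])
dropped_design = np.hstack([ones, dropped])

full_rank = np.linalg.matrix_rank(full_design)
dropped_rank = np.linalg.matrix_rank(dropped_design)

print(full_design.shape[1], full_rank)        # 6 columns, rank 4
print(dropped_design.shape[1], dropped_rank)  # 4 columns, rank 4
```

With all categories kept, the columns of each feature sum to the intercept column, so the design matrix is rank deficient. After dropping one category per feature, the matrix has full column rank.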
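When handle_unknown='ignore' and drop is not None, unknown values are encoded as all zeros.
(The encoder is normally fitted on the training set only.) Categories that appear only in the test set are therefore encoded as all zeros, which means an unknown value gets the same encoding as the dropped category.
OneHotEncoder.inverse_transform reflects this: it maps an all-zero block back to the dropped category if the feature has one, and to None otherwise (meaning the value was unknown rather than dropped).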
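>>> # The output below assumes a three-row X in which the 2nd and 3rd features have three categories each:
>>> X = [['male', 'US', 'Safari'],
...      ['female', 'Europe', 'Firefox'],
...      ['female', 'Asia', 'Chrome']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary', sparse_output=False,
...                                        handle_unknown='ignore').fit(X)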
>>> X_test = [['unknown', 'America', 'IE']]
>>> X_trans = drop_enc.transform(X_test)
>>> X_trans
array([[0., 0., 0., 0., 0., 0., 0.]])
>>> drop_enc.inverse_transform(X_trans)
array([['female', None, None]], dtype=object)
OneHotEncoder also supports treating missing values as an additional category. Two markers are recognized:
- np.nan
- None
Note that None here is a missing-value marker, not an unknown value in the sense defined above.
>>> X = [['male', 'Safari'],
...      ['female', None],
...      [np.nan, 'Firefox']]
>>> enc = preprocessing.OneHotEncoder(handle_unknown='error').fit(X)
>>> enc.categories_
[array(['female', 'male', nan], dtype=object), array(['Firefox', 'Safari', None], dtype=object)]
>>> enc.transform(X).toarray()
array([[0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0., 0.]])
If a feature contains both np.nan and None, they are treated as two separate categories, so two extra categories are added:
>>> X = [['Safari'], [None], [np.nan], ['Firefox']]
>>> enc = preprocessing.OneHotEncoder(handle_unknown='error').fit(X)
>>> enc.categories_
[array(['Firefox', 'Safari', None, nan], dtype=object)]
>>> enc.transform(X).toarray()
array([[0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.]])
One-hot encoding example
columns_to_encode = ["Sex", "Embarked"]
onehot_encoder = preprocessing.OneHotEncoder(handle_unknown='error', drop='first', sparse_output=False).fit(train_df[columns_to_encode])
result_df = pd.DataFrame(onehot_encoder.transform(train_df[columns_to_encode]), columns=onehot_encoder.get_feature_names_out(columns_to_encode))
train_df = pd.concat([train_df, result_df], axis=1)
# .drop(columns_to_encode, axis=1)  # not dropped here, because these two columns are used later
result_df = pd.DataFrame(onehot_encoder.transform(test_df[columns_to_encode]), columns=onehot_encoder.get_feature_names_out(columns_to_encode))
# .drop(columns_to_encode, axis=1)  # not dropped here, because these two columns are used later
test_df = pd.concat([test_df, result_df], axis=1)
train_df:
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 male 22.0 1 0 7.2500 S 2.0
1 1 1 female 38.0 1 0 71.2833 C 3.0
2 1 3 female 26.0 0 0 7.9250 S 1.0
3 1 1 female 35.0 1 0 53.1000 S 3.0
4 0 3 male 35.0 0 0 8.0500 S 2.0
Sex_male Embarked_Q Embarked_S Embarked_nan
0 1.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0
3 0.0 0.0 1.0 0.0
4 1.0 0.0 1.0 0.0
test_df:
PassengerId Pclass Sex Age SibSp Parch Fare Embarked Title
0 892 3 male 34.5 0 0 7.8292 Q 2.0
1 893 3 female 47.0 1 0 7.0000 S 3.0
2 894 2 male 62.0 0 0 9.6875 Q 2.0
3 895 3 male 27.0 0 0 8.6625 S 2.0
4 896 3 female 22.0 1 1 12.2875 S 3.0
Sex_male Embarked_Q Embarked_S Embarked_nan
0 1.0 1.0 0.0 0.0
1 0.0 0.0 1.0 0.0
2 1.0 1.0 0.0 0.0
3 1.0 0.0 1.0 0.0
4 0.0 0.0 1.0 0.0
(891, 13) (418, 13)
Configuring infrequent categories
OneHotEncoder and OrdinalEncoder both support grouping low-frequency values into a single infrequent category, which acts as one extra category.
- min_frequency
  - An integer ≥ 1: categories whose count is below this number are treated as infrequent.
  - Or a float in (0.0, 1.0): categories whose relative frequency is below this fraction are treated as infrequent.
  The default is None (an integer value of 1 has the same effect): no category is treated as infrequent.
>>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
...               ['snake'] * 3], dtype=object).T
>>> enc = preprocessing.OrdinalEncoder(min_frequency=6).fit(X)
>>> enc.infrequent_categories_
[array(['dog', 'snake'], dtype=object)]
>>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
array([[2.],
       [0.],
       [1.],
       [2.]])
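- max_categories
  An integer (unlike min_frequency, it does not accept a float). It sets an upper limit on the number of output categories per feature after encoding. Categories are kept in order of decreasing frequency; once the limit would be exceeded, all remaining categories are grouped into a single infrequent category. Equivalently: if there are too many categories, the least frequent ones are moved into the infrequent category until the number of output categories is ≤ max_categories. This count does not include missing or unknown values: if both occur and each is given its own code, there can be up to max_categories + 2 distinct codes.
>>> X_train = np.array(
...     [["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3 + [np.nan]],
...     dtype=object).T
>>> enc = preprocessing.OrdinalEncoder(
...     handle_unknown="use_encoded_value", unknown_value=3,
...     max_categories=3, encoded_missing_value=4)
>>> _ = enc.fit(X_train)
>>> X_test = np.array([["a"], ["b"], ["c"], ["d"], ["e"], [np.nan]], dtype=object)
>>> enc.transform(X_test)
array([[2.],
       [0.],
       [1.],
       [2.],
       [3.],
       [4.]])
Here there are 5 distinct codes, 2 more than max_categories=3, because both a missing value and an unknown value are present, encoded as 4 and 3 respectively.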
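OneHotEncoder.get_feature_names_out marks the infrequent category by including "infrequent" in its feature name.
As mentioned above, handle_unknown='infrequent_if_exist' makes unknown values be treated as low-frequency values, i.e. unknown categories are mapped to the infrequent category.
With handle_unknown='infrequent_if_exist':
- If infrequent categories are not configured, or no infrequent category was found during fitting, unknown values are encoded as all zeros, and inverse_transform maps them to None.
- If an infrequent category was found during fitting, unknown values are mapped to it, and inverse_transform returns 'infrequent_sklearn' to denote that category.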
If both max_categories and min_frequency are set, min_frequency is applied first and max_categories second:
>>> enc = preprocessing.OneHotEncoder(min_frequency=4, max_categories=3, sparse_output=False)
>>> enc = enc.fit(X)
>>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
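In this example, min_frequency=4 alone would mark only snake as infrequent, but max_categories=3 forces dog into the infrequent category as well.
When categories are tied in frequency at the max_categories cutoff, the tie is broken by lexicographic order: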
>>> X = np.asarray([["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10], dtype=object).T
>>> enc = preprocessing.OneHotEncoder(max_categories=3).fit(X)
>>> enc.infrequent_categories_
[array(['b', 'c'], dtype=object)]
In this example, "b", "c" and "d" have the same frequency. With max_categories=3, "b" and "c" are grouped into the infrequent category; the tie is resolved by their lexicographic order.
Target encoding
When a feature has too many categories, avoid one-hot encoding and use TargetEncoder instead.