Data Preprocessing Tips: One-Hot Encoding and Train/Test Splitting

Article link: https://blog.youkuaiyun.com/qq_33344148/article/details/121533452

Tricks for one-hot data preprocessing

1. Converting labels to one-hot

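To integer-encode a categorical feature, you can number its text values with pandas' factorize function or with sklearn's LabelEncoder class; the two give almost the same result. Both map the values to 0 through n-1, where n is the number of distinct categories. The main difference is the ordering: factorize numbers values in order of first appearance, while LabelEncoder numbers them in sorted order.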

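import pandas as pd
from sklearn.preprocessing import LabelEncoder

# pathUtils is not defined in this post; presumably the author's own module of file paths
data = pd.read_csv(pathUtils.train_path, engine='python')
# fit must be called first, then transform
encoder = LabelEncoder().fit(data["job"])
data["job"] = encoder.transform(data["job"])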

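There are two main steps:

First, construct the encoder and pass the data to be encoded to fit, which builds the value-to-code mapping internally.
Then apply the encoder to the data you want to convert by calling transform.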
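
The two steps, and the ordering difference between LabelEncoder and factorize, can be sketched on a toy column. The `jobs` series below is a made-up stand-in for `data["job"]`, not the author's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for data["job"].
jobs = pd.Series(["teacher", "admin", "teacher", "nurse"])

# Step 1: fit builds the mapping (unique values are sorted, then numbered).
encoder = LabelEncoder().fit(jobs)
# Step 2: transform applies the mapping.
sk_codes = encoder.transform(jobs)
print(sk_codes.tolist())                  # [2, 0, 2, 1]
print(encoder.classes_.tolist())          # ['admin', 'nurse', 'teacher']

# factorize numbers the values in order of first appearance instead.
pd_codes, uniques = pd.factorize(jobs)
print(pd_codes.tolist())                  # [0, 1, 0, 2]
print(uniques.tolist())                   # ['teacher', 'admin', 'nurse']

# The fitted encoder can also map codes back to the original strings.
print(encoder.inverse_transform([2, 0]).tolist())   # ['teacher', 'admin']
```

Keeping the fitted encoder around matters when the same mapping has to be reused on a test set; factorize returns the codes and the unique values but no reusable transformer object.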

Another option is sklearn's OneHotEncoder:

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

(In scikit-learn 1.2 and later the sparse parameter is named sparse_output.)

Encode categorical features as a one-hot numeric array.

categories: 'auto' or a list of array-like, default='auto'

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)
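
The documentation quoted above mentions that categories can also be specified manually. A minimal sketch of that option follows; the feature values and the `levels` list are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature; the category list is an assumption for this example.
levels = [["low", "mid", "high"]]          # one list per input column

enc = OneHotEncoder(categories=levels)
enc.fit([["low"], ["high"]])               # "mid" never appears in the fit data

# The output still has three columns, in the order given in `levels`.
encoded = enc.transform([["mid"], ["high"]]).toarray()
print(encoded.tolist())                    # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Fixing the category list up front keeps the column layout stable even when some category is missing from the data passed to fit.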

For simple 0/1 labels, or integer labels running from 0 to n, you can use keras's to_categorical.

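It converts integer class labels to one-hot encoding. y is an integer array, and num_classes is the total number of classes, which must be greater than max(y) because labels start at 0.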

The source of to_categorical is in keras's utils/np_utils.py and reads as follows (with comments added):

import numpy as np

def to_categorical(y, num_classes=None, dtype='float32'):
    """Converts a class vector (integers) to binary class matrix.
    E.g. for use with categorical_crossentropy.
    # Arguments
        y: class vector to be converted into a matrix
            (integers from 0 to num_classes).
        num_classes: total number of classes.
        dtype: The data type expected by the input, as a string
            (`float32`, `float64`, `int32`...)
    # Returns
        A binary matrix representation of the input. The classes axis
        is placed last.
    # Example
    ```python
    # Consider an array of 5 labels out of a set of 3 classes {0, 1, 2}:
    > labels
    array([0, 2, 1, 2, 0])
    # `to_categorical` converts this into a matrix with as many
    # columns as there are classes. The number of rows
    # stays the same.
    > to_categorical(labels)
    array([[ 1.,  0.,  0.],
           [ 0.,  0.,  1.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.],
           [ 1.,  0.,  0.]], dtype=float32)
    ```
    """
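    # Convert the input y to an integer array
    y = np.array(y, dtype='int')
    # Record the shape of the array
    input_shape = y.shape
    if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
        input_shape = tuple(input_shape[:-1])
    # Flatten y to a 1-D array
    y = y.ravel()
    # If num_classes is not given, it is inferred as max(y) + 1. This is a pitfall:
    # if the test set has no samples of some label (in particular the largest one),
    # converting the test labels this way yields too few columns, so pass num_classes explicitly.
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    # Create an n-by-num_classes matrix of zeros
    categorical = np.zeros((n, num_classes), dtype=dtype)  # y indexes the columns, so labels must start at 0
    # np.arange(n) gives the row index of each sample; y gives its column index
    categorical[np.arange(n), y] = 1
    # Reshape back to the input shape plus a trailing class axis
    output_shape = input_shape + (num_classes,)
    categorical = np.reshape(categorical, output_shape)
    return categorical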
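
The missing-label pitfall noted in the comments can be reproduced with a small numpy-only sketch. `one_hot` here is a hypothetical helper that mirrors only the core lines of the source above for 1-D input; it is not the keras function itself, and the label lists are made up.

```python
import numpy as np

def one_hot(y, num_classes=None):
    """Minimal stand-in for the core of to_categorical (1-D input only)."""
    y = np.asarray(y, dtype=int).ravel()
    if num_classes is None:
        num_classes = int(y.max()) + 1      # inferred from the data
    out = np.zeros((y.shape[0], num_classes), dtype="float32")
    out[np.arange(y.shape[0]), y] = 1
    return out

train_labels = [0, 1, 2, 1]
test_labels = [0, 1, 1]                     # class 2 is absent here

print(one_hot(train_labels).shape)                  # (4, 3)
print(one_hot(test_labels).shape)                   # (3, 2): one column short
print(one_hot(test_labels, num_classes=3).shape)    # (3, 3)
```

Passing the same num_classes for both the training and the test labels keeps the two matrices aligned column for column.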

2. Splitting into training and test sets

This is usually done with sklearn's train_test_split.
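
A minimal sketch of a typical call is shown below. The data, the 20% test size, and the seed are arbitrary choices for illustration; stratify is optional and is included here because it keeps every class represented in both parts, which ties in with the missing-label issue discussed for to_categorical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the rows
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y,         # keep the class ratio the same in both parts
)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
print(sorted(y_test.tolist()))       # [0, 1]
```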

Additional notes:

1. numpy functions

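numpy's ravel(), flatten() and squeeze() can all be used to reduce a multi-dimensional array toward one dimension. They differ as follows:
ravel(): returns a view of the source data where possible, and copies only when necessary
flatten(): always returns a copy of the source data
squeeze(): only removes axes of length 1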
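
The differences can be checked directly; the arrays below are arbitrary examples.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

r = a.ravel()      # a view of a's data here, because a is contiguous
f = a.flatten()    # always an independent copy
print(np.shares_memory(a, r))   # True
print(np.shares_memory(a, f))   # False

r[0] = 99          # writing through the view changes a as well
print(a[0, 0])     # 99
print(f[0])        # 0, the copy is unaffected

# squeeze only drops axes of length 1; other axes are left alone.
col = np.zeros((4, 1))
print(col.squeeze().shape)                      # (4,)
print(a.squeeze().shape)                        # (2, 3), nothing to squeeze
print(np.zeros((1, 2, 1, 3)).squeeze().shape)   # (2, 3)
```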

np.argmax() returns the index of the largest value along a given axis, which makes it useful for turning softmax output into predicted class labels.
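
For instance, with a made-up matrix of softmax probabilities (one row per sample):

```python
import numpy as np

# Hypothetical softmax output for 3 samples over 3 classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# axis=1 picks the highest-probability column within each row.
pred = np.argmax(probs, axis=1)
print(pred.tolist())   # [0, 2, 1]
```

The same call also turns a one-hot matrix back into integer labels.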

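categorical[np.arange(n), y] = 1 is a neat trick: assigning through a pair of index arrays sets one cell per row in a single statement and removes the need for a for loop.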
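
A short sketch comparing the explicit loop with the indexed assignment, using the label vector from the to_categorical docstring:

```python
import numpy as np

y = np.array([0, 2, 1, 2, 0])
n, num_classes = y.shape[0], 3

# Loop version: set one cell per row.
looped = np.zeros((n, num_classes))
for row, col in enumerate(y):
    looped[row, col] = 1

# Vectorized version: the two index arrays are paired element-wise,
# so cell (i, y[i]) is set for every i at once.
vectorized = np.zeros((n, num_classes))
vectorized[np.arange(n), y] = 1

print(np.array_equal(looped, vectorized))   # True
```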

2. pandas functions and operations

One operation:

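import pandas as pd

# map_text, malware_url_samples and random_state are not shown in this post;
# they are assumed to be defined elsewhere in the author's code.
with open('../data/MalwareURLExport.csv', 'r') as f:  # close the file after reading
    malware_urls = ['http://' + map_text(line) for line in f]

data_sets = [
    pd.read_csv('../data/mixed.csv', index_col=False),
    # The scalar 'type': 1 is broadcast to every row of the new frame
    pd.DataFrame({'url': malware_urls, 'type': 1}).reset_index(drop=True).sample(
        n=malware_url_samples,
        replace=False,
        random_state=random_state
    ),
    pd.read_csv('../data/kaggle_data_clean.csv', index_col=0).reset_index(drop=True)
]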