Common Data Preprocessing Methods in sklearn

License: CC 4.0 BY-SA

Tags:

#machine-learning #data-preprocessing #sklearn

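This article covers the data preprocessing methods commonly used in the sklearn library: standardization, min-max scaling, normalization, feature binarization, label binarization, categorical feature encoding, label encoding, outlier handling, polynomial feature generation, and missing-value imputation. These methods play a key role in machine learning and can improve model performance and prediction accuracy.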

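1. Standardization (Mean Removal and Variance Scaling)

Standardization transforms the data so that each feature has, as far as possible, zero mean and unit variance. In practice the shape of the distribution is often ignored: the data is simply centered by subtracting each feature's mean, and then scaled by dividing non-constant features by their standard deviation. In sklearn, the scale function provides a quick and easy way to perform this operation on a single array-like dataset.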

from sklearn import preprocessing
import numpy as np
x = np.array([[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
x_scaled = preprocessing.scale(x)
print(x_scaled)

output:
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
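
To make the arithmetic behind scale concrete, the following minimal sketch recomputes the same result by hand with NumPy. The variable names (mean, std, x_manual, matches) are illustrative only, and the sketch assumes the population standard deviation (ddof=0), which is NumPy's default for std.

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Column-wise mean and population standard deviation (ddof=0)
mean = x.mean(axis=0)
std = x.std(axis=0)

# Subtract the mean of each column, then divide by its standard deviation
x_manual = (x - mean) / std

# Compare against sklearn's scale function
x_scaled = preprocessing.scale(x)
matches = np.allclose(x_manual, x_scaled)
print(matches)
```

If the two arrays agree, matches is True, which shows that scale is a column-wise "subtract the mean, divide by the standard deviation" operation.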

After scale has been applied, each column has zero mean and unit variance:

print(x_scaled.mean(axis=0))
print(x_scaled.std(axis=0))

output:
[0. 0. 0.]
[1. 1. 1.]

StandardScaler computes the mean and standard deviation on a training set, so that the same transformation can later be reapplied to the test set.

x = np.array([[ 1., -1.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
# Fit on the training data; copy, with_mean and with_std are shown at their default values
scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True).fit(x)
print(scaler.transform(x))

output:
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]

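In the same way, the fitted scaler can apply the identical transformation to a test set (named y here):

y = np.array([[1.,2.,3.],[-1.,-2.,0]])
# Reuses the mean and standard deviation learned from x; nothing is refitted
print(scaler.transform(y))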

output:
[[ 0. 2.44948974 2.13808994]
[-2.44948974 -2.44948974 -0.26726124]]
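
The transformed test set above does not have zero mean, because the statistics come from the training data rather than from the test data itself. The sketch below illustrates this by inspecting the fitted mean_ and scale_ attributes of StandardScaler and reproducing the test-set transform by hand. The names x_train, x_test, manual, and x_test_restored are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
x_test = np.array([[1., 2., 3.], [-1., -2., 0.]])

# Fit only on the training data
scaler = StandardScaler().fit(x_train)

# Per-feature statistics learned from x_train
print(scaler.mean_)
print(scaler.scale_)

# transform applies the training statistics to new data
x_test_scaled = scaler.transform(x_test)

# The same result computed by hand from the stored statistics
manual = (x_test - scaler.mean_) / scaler.scale_

# inverse_transform maps scaled values back to the original units
x_test_restored = scaler.inverse_transform(x_test_scaled)
```

Fitting once and reusing the scaler keeps the training and test sets on the same scale, which is the usual reason to prefer StandardScaler over calling scale separately on each set.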

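2. Min-Max Scaling

An alternative to standardization is to scale each feature so that it lies between a given minimum and maximum value, or so that its maximum absolute value becomes 1. sklearn provides two classes for this: MinMaxScaler and MaxAbsScaler, respectively.

MinMaxScaler: min-max scaling applies a linear transformation to the original data, mapping it to the interval [0, 1] (or to another fixed range defined by a chosen minimum and maximum).
For example, to scale to [0, 1]: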
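
A minimal sketch of min-max scaling follows, reusing the training array from the standardization section. The variable names (x_train, x_minmax, x_manual, x_wide, x_maxabs) are assumptions made for illustration. The hand-written formula assumes no column is constant, since a constant column would make the denominator zero.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

x_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Default feature_range is (0, 1)
min_max_scaler = MinMaxScaler()
x_minmax = min_max_scaler.fit_transform(x_train)
print(x_minmax)

# The same mapping by hand: (x - min) / (max - min), column by column
col_min = x_train.min(axis=0)
col_max = x_train.max(axis=0)
x_manual = (x_train - col_min) / (col_max - col_min)

# A different target interval can be requested through feature_range
x_wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x_train)

# MaxAbsScaler divides each column by its maximum absolute value,
# so the results fall within [-1, 1] and zeros stay zero
x_maxabs = MaxAbsScaler().fit_transform(x_train)
print(x_maxabs)
```

As with StandardScaler, a fitted MinMaxScaler can be reused on test data through transform. Test values that fall outside the training minimum or maximum will then map outside the target interval.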
