Learning the sklearn Data Preprocessing Library

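This article introduces several commonly used data preprocessing methods: standardization and feature scaling with scale, StandardScaler, MinMaxScaler, and MaxAbsScaler from the sklearn.preprocessing module. Concrete examples show where each method applies and how it affects the distribution of the data.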

1. Standard-deviation standardization (z-score standardization): (X - X.mean(axis=0)) / X.std(axis=0)

① Standardizing with scale: sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

In [2]: from sklearn import preprocessing
   ...: import numpy as np
   ...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
   ...: X_scaled = preprocessing.scale(X)
   ...: X_scaled
   ...:
Out[2]:
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
Mean, variance, and standard deviation of each column:

In [3]: X.mean(axis=0)
Out[3]: array([ 1.        ,  0.        ,  0.33333333])

In [4]: X.var(axis=0)
Out[4]: array([ 0.66666667,  0.66666667,  1.55555556])

In [5]: import numpy as np

In [6]: np.sqrt(X.var(axis=0))
Out[6]: array([ 0.81649658,  0.81649658,  1.24721913])

In [9]: X.std(axis=0)
Out[9]: array([ 0.81649658,  0.81649658,  1.24721913])
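
The result of scale can be reproduced by hand from the column statistics shown above. The following is a minimal sketch, assuming the same 3×3 array; the names X_manual and X_sample are introduced here for illustration only. It also shows that the population standard deviation (ddof=0, NumPy's default) is the one that matches.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Library result for comparison.
X_scaled = preprocessing.scale(X)
print(np.allclose(X_manual, X_scaled))

# Dividing by the sample standard deviation (ddof=1) gives a different result.
X_sample = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(X_sample, X_scaled))
```

The first comparison should agree and the second should not, since the sample standard deviation is larger than the population one for the same data.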
Setting with_mean=False and with_std=False disables both centering and scaling, so the data is returned unchanged:

In [11]: from sklearn import preprocessing
    ...: import numpy as np
    ...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
    ...: X_scaled = preprocessing.scale(X,with_mean=False,with_std=False)
    ...: X_scaled
    ...:
Out[11]:
array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])
With with_mean=False only, the data is not centered and the formula becomes X / X.std(axis=0):

In [12]: from sklearn import preprocessing
    ...: import numpy as np
    ...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
    ...: X_scaled = preprocessing.scale(X,with_mean=False)
    ...: X_scaled
    ...:
Out[12]:
array([[ 1.22474487, -1.22474487,  1.60356745],
       [ 2.44948974,  0.        ,  0.        ],
       [ 0.        ,  1.22474487, -0.80178373]])
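② Standardizing with StandardScaler. The StandardScaler class provides a transform method, so the mean and standard deviation computed on the training set can be applied to the test set to scale it in exactly the same way.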

sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

In [13]: from sklearn import preprocessing
    ...: import numpy as np
    ...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
    ...: scaler = preprocessing.StandardScaler().fit(X)
    ...:

In [14]: scaler
Out[14]: StandardScaler(copy=True, with_mean=True, with_std=True)

In [16]: scaler.mean_
Out[16]: array([ 1.        ,  0.        ,  0.33333333])

In [17]: scaler.scale_  # per-feature standard deviation
Out[17]: array([ 0.81649658,  0.81649658,  1.24721913])

In [18]: scaler.var_
Out[18]: array([ 0.66666667,  0.66666667,  1.55555556])
Applying the scaler to test data reuses the mean and standard deviation learned from the training set:

In [20]: scaler.transform([[-1., 1., 0.]])
Out[20]: array([[-2.44948974,  1.22474487, -0.26726124]])

In [21]: print((-1-1)/0.81649658,1/0.81649658 ,- 0.33333333/1.24721913)
-2.449489745566356 1.224744872783178 -0.2672612390093792
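
The manual check above generalizes to any test array. The following is a small sketch under the assumption that the training data is the same 3×3 array; the variable names X_train, X_test, X_test_manual, and X_test_back are illustrative. It reproduces transform from the fitted mean_ and scale_ attributes and then undoes it with inverse_transform.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

# Fit on the training data only; the test data never influences mean_ or scale_.
scaler = StandardScaler().fit(X_train)

# transform applies (X - mean_) / scale_ with the training-set statistics.
X_test_scaled = scaler.transform(X_test)
X_test_manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, X_test_manual))

# inverse_transform maps scaled values back to the original units.
X_test_back = scaler.inverse_transform(X_test_scaled)
print(np.allclose(X_test_back, X_test))
```

Fitting only on the training set keeps information from the test set out of the preprocessing step.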
Setting with_mean=False and with_std=False disables both centering and scaling, so transform returns its input unchanged:

In [22]: from sklearn import preprocessing
    ...: import numpy as np
    ...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
    ...: scaler = preprocessing.StandardScaler(with_mean=False,with_std=False).fit(X)
    ...: scaler.transform([[-1., 1., 0.]])
    ...:
Out[22]: array([[-1.,  1.,  0.]])

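After scaling, every feature has zero mean and unit standard deviation. Standardization only shifts and rescales each column; it does not by itself make the data normally distributed.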

In [24]: from sklearn import preprocessing
    ...: import numpy as np
    ...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
    ...: X_scaled = preprocessing.scale(X)
    ...:

In [25]: X_scaled.mean(axis=0)
Out[25]: array([ 0.,  0.,  0.])

In [26]: X_scaled.std(axis=0)
Out[26]: array([ 1.,  1.,  1.])
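
To see what scaling does and does not change, it can help to look at a clearly non-normal sample. The sketch below is an illustration only: the exponential sample, the seed, and the skewness helper are assumptions introduced here, not part of the original example. It checks the mean and standard deviation after scaling and compares the skewness before and after.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical skewed data: 10,000 draws from an exponential distribution.
rng = np.random.default_rng(0)
X_skewed = rng.exponential(scale=1.0, size=(10_000, 1))

X_scaled = preprocessing.scale(X_skewed)

def skewness(a):
    """Third standardized moment; close to 0 for symmetric data."""
    a = a.ravel()
    z = (a - a.mean()) / a.std()
    return float(np.mean(z ** 3))

# Location and spread are normalized to 0 and 1 ...
print(X_scaled.mean(), X_scaled.std())
# ... but the shape of the distribution (here its skewness) is unchanged.
print(skewness(X_skewed), skewness(X_scaled))
```

If a roughly Gaussian shape is required, a separate transformation would be needed; scaling alone only changes location and spread.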
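2. Scaling features to a specified range

① Scaling with MinMaxScaler: sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

The feature_range parameter is a tuple (min, max) and defaults to (0, 1). With the default range, the min-max formula is: (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))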

In [27]: from sklearn import preprocessing
    ...: import numpy as np
    ...: X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.],[ 0., 1., -1.]])
    ...: min_max_scaler =  preprocessing.MinMaxScaler()
    ...: X_train_minmax = min_max_scaler.fit_transform(X_train)
    ...:

In [28]: X_train_minmax
Out[28]:
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
When feature_range is specified, the transformed data X_scaled lies in [min, max] and is computed as follows:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

In [29]: from sklearn import preprocessing
    ...: import numpy as np
    ...: X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.],[ 0., 1., -1.]])
    ...: min_max_scaler =  preprocessing.MinMaxScaler(feature_range=(1, 3))
    ...: X_train_minmax = min_max_scaler.fit_transform(X_train)
    ...: X_train_minmax
    ...:
Out[29]:
array([[ 2.        ,  1.        ,  3.        ],
       [ 3.        ,  2.        ,  1.66666667],
       [ 1.        ,  3.        ,  1.        ]])
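
The two-step formula can be checked directly against the fitted scaler. The following is a minimal sketch assuming the same training array and feature_range=(1, 3); the names lo, hi, X_manual, and X_new are illustrative. It also shows that data outside the training range is mapped outside the target range, because the minimum and maximum come from the training set only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
lo, hi = 1, 3  # target range, matching feature_range=(1, 3)

scaler = MinMaxScaler(feature_range=(lo, hi)).fit(X_train)

# Step 1: scale each column to [0, 1]; step 2: stretch and shift to [lo, hi].
X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
X_manual = X_std * (hi - lo) + lo
print(np.allclose(X_manual, scaler.transform(X_train)))

# New data beyond the training min/max lands outside [lo, hi].
X_new = np.array([[-3., -1., 4.]])
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```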

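② Scaling with MaxAbsScaler: sklearn.preprocessing.MaxAbsScaler(copy=True)

Scaling formula: X / np.abs(X).max(axis=0), that is, every element of a feature is divided by the largest absolute value found in that feature.

Parameters: copy=False modifies the original array in place; otherwise the array is copied.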

In [1]: from sklearn import preprocessing
   ...: import numpy as np
   ...: X = np.array([[ 1., -3.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
   ...: max_abs_scaler = preprocessing.MaxAbsScaler()
   ...: X_train_maxabs = max_abs_scaler.fit_transform(X)
   ...: print(X_train_maxabs)
   ...: print(X/np.abs(X).max(axis=0))  # same result with plain NumPy broadcasting
   ...:
[[ 0.5        -1.          1.        ]
 [ 1.          0.          0.        ]
 [ 0.          0.33333333 -0.5       ]]
[[ 0.5        -1.          1.        ]
 [ 1.          0.          0.        ]
 [ 0.          0.33333333 -0.5       ]]
With copy=False, the original array X is overwritten with the scaled values:

In [2]: from sklearn import preprocessing
   ...: import numpy as np
   ...: X = np.array([[ 1., -3.,  2.],[ 2.,  0.,  0.],[ 0.,  1., -1.]])
   ...: max_abs_scaler = preprocessing.MaxAbsScaler(copy=False)
   ...: X_train_maxabs = max_abs_scaler.fit_transform(X)
   ...: print(X_train_maxabs)
   ...: print(X)
   ...:
[[ 0.5        -1.          1.        ]
 [ 1.          0.          0.        ]
 [ 0.          0.33333333 -0.5       ]]
[[ 0.5        -1.          1.        ]
 [ 1.          0.          0.        ]
 [ 0.          0.33333333 -0.5       ]]
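
In-place scaling depends on the input already being a floating-point NumPy array. The sketch below is an illustration under that assumption; X_float, X_int, and X_int_before are names introduced here. It contrasts a float array, which is overwritten as in the session above, with an integer array, which has to be converted to floats first and is therefore left untouched.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_float = np.array([[1., -3., 2.], [2., 0., 0.], [0., 1., -1.]])
X_int = np.array([[1, -3, 2], [2, 0, 0], [0, 1, -1]])
X_int_before = X_int.copy()

# Float input: with copy=False the scaled values are written back into X_float.
out_float = MaxAbsScaler(copy=False).fit_transform(X_float)
print(np.allclose(X_float, out_float))

# Integer input: scaling needs floats, so a converted copy is scaled instead
# and the original integer array keeps its values.
out_int = MaxAbsScaler(copy=False).fit_transform(X_int)
print(np.array_equal(X_int, X_int_before), out_int.dtype)
```

When the original data is still needed afterwards, leaving copy at its default of True is the safer choice.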
Attributes:

In [3]: max_abs_scaler.scale_  # per-feature scaling factor learned from the training samples
Out[3]: array([ 2.,  3.,  2.])

In [4]: max_abs_scaler.max_abs_  # per-feature maximum absolute value in the training samples
Out[4]: array([ 2.,  3.,  2.])

In [5]: max_abs_scaler.n_samples_seen_  # number of samples the scaler has been fitted on so far
Out[5]: 3

In [6]: X2 = np.array([[ 1., -4.,  2.],[ 2.,  1.,  0.],[ 0.,  1., -1.]])
   ...: max_abs_scaler.partial_fit(X2)
   ...:
Out[6]: MaxAbsScaler(copy=False)

In [7]: max_abs_scaler.scale_
Out[7]: array([ 2.,  4.,  2.])

In [8]: max_abs_scaler.max_abs_
Out[8]: array([ 2.,  4.,  2.])

In [9]: max_abs_scaler.n_samples_seen_
Out[9]: 6
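
The partial_fit calls above update the running maximum batch by batch. As a sketch of what that implies, assuming the two batches from this session (called X1 and X2 here for illustration), fitting incrementally should give the same statistics as fitting once on the stacked data.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X1 = np.array([[1., -3., 2.], [2., 0., 0.], [0., 1., -1.]])
X2 = np.array([[1., -4., 2.], [2., 1., 0.], [0., 1., -1.]])

# Incremental fit: each batch can only raise the per-feature maximum.
incremental = MaxAbsScaler()
for batch in (X1, X2):
    incremental.partial_fit(batch)

# One-shot fit on all six rows for comparison.
full = MaxAbsScaler().fit(np.vstack([X1, X2]))

print(incremental.max_abs_, incremental.n_samples_seen_)
print(np.array_equal(incremental.max_abs_, full.max_abs_))
```

This is useful when the data arrives in chunks or does not fit in memory at once.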
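Applying the scaler to test data reuses the scale_ values learned from the training data, so test values larger than the training maximum fall outside [-1, 1]: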

In [10]: X_test = np.array([[ -3., -1.,  4.]])
    ...: X_test_maxabs = max_abs_scaler.transform(X_test)
    ...: print(X_test_maxabs)
    ...: max_abs_scaler.scale_
    ...:
[[-1.5  -0.25  2.  ]]
Out[10]: array([ 2.,  4.,  2.])
















