1. Standardization (z-score scaling): (X - X.mean(axis=0)) / X.std(axis=0)
① Standardizing with scale: sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
In [2]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
...: X_scaled = preprocessing.scale(X)
...: X_scaled
...:
Out[2]:
array([[ 0. , -1.22474487, 1.33630621],
[ 1.22474487, 0. , -0.26726124],
[-1.22474487, 1.22474487, -1.06904497]])
Mean, variance, and standard deviation of each column
In [3]: X.mean(axis=0)
Out[3]: array([ 1. , 0. , 0.33333333])
In [4]: X.var(axis=0)
Out[4]: array([ 0.66666667, 0.66666667, 1.55555556])
In [5]: import numpy as np
In [6]: np.sqrt(X.var(axis=0))
Out[6]: array([ 0.81649658, 0.81649658, 1.24721913])
In [9]: X.std(axis=0)
Out[9]: array([ 0.81649658, 0.81649658, 1.24721913])
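To tie these statistics back to the formula in item 1, the sketch below recomputes the z-score by hand and compares it with preprocessing.scale. The variable names X_manual and X_sample are illustrative only. It assumes the population standard deviation (ddof=0), which is what X.std uses by default and what the outputs above are consistent with.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Column-wise z-score by hand; X.std defaults to ddof=0 (population std).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_scaled = preprocessing.scale(X)
print(np.allclose(X_manual, X_scaled))   # True

# The sample standard deviation (ddof=1) gives different numbers.
X_sample = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(X_sample, X_scaled))   # False
```

This matters when comparing against tools that default to the sample standard deviation (pandas DataFrame.std, for example, uses ddof=1).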
Setting with_mean=False and with_std=False disables both centering and scaling, so the data comes back unchanged
In [11]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
...: X_scaled = preprocessing.scale(X,with_mean=False,with_std=False)
...: X_scaled
...:
Out[11]:
array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
With with_mean=False the data is not centered, and the formula becomes X / X.std(axis=0)
In [12]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
...: X_scaled = preprocessing.scale(X,with_mean=False)
...: X_scaled
...:
Out[12]:
array([[ 1.22474487, -1.22474487, 1.60356745],
[ 2.44948974, 0. , 0. ],
[ 0. , 1.22474487, -0.80178373]])
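A small check of the X / X.std(axis=0) formula, as a sketch. It also shows one practical consequence: without centering, entries that are zero stay zero, which is why this setting is commonly used for sparse data. The names X_no_center and manual are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Scale only: divide each column by its standard deviation, no centering.
X_no_center = preprocessing.scale(X, with_mean=False)
manual = X / X.std(axis=0)
print(np.allclose(X_no_center, manual))            # True

# Zeros are preserved because 0 / std is still 0.
print(np.array_equal(X == 0, X_no_center == 0))    # True
```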
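② Standardizing with StandardScaler: the class is fitted on the training set, and its transform method then applies the training-set mean and standard deviation to the test set, so both sets are scaled identically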
sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
In [13]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
...: scaler = preprocessing.StandardScaler().fit(X)
...:
In [14]: scaler
Out[14]: StandardScaler(copy=True, with_mean=True, with_std=True)
In [16]: scaler.mean_
Out[16]: array([ 1. , 0. , 0.33333333])
In [17]: scaler.scale_  # per-feature standard deviation
Out[17]: array([ 0.81649658, 0.81649658, 1.24721913])
In [18]: scaler.var_
Out[18]: array([ 0.66666667, 0.66666667, 1.55555556])
Apply the scaler to the test set, reusing the mean and standard deviation learned from the training set
In [20]: scaler.transform([[-1., 1., 0.]])
Out[20]: array([[-2.44948974, 1.22474487, -0.26726124]])
In [21]: print((-1-1)/0.81649658,1/0.81649658 ,- 0.33333333/1.24721913)
-2.449489745566356 1.224744872783178 -0.2672612390093792
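The manual check above can be written against the fitted attributes instead of hard-coded numbers. The following is a minimal sketch of the fit-on-train, transform-on-test pattern. The names X_train and X_test are illustrative (the text above calls the training data X).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

# Statistics are learned from the training data only.
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# transform is (x - mean_) / scale_, using the training-set statistics.
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, manual))                             # True

# inverse_transform maps the scaled values back to the original units.
print(np.allclose(scaler.inverse_transform(X_test_scaled), X_test))   # True
```

Fitting on the training set alone keeps information about the test set from leaking into the preprocessing step.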
Setting with_mean=False and with_std=False disables both centering and scaling, so transform returns its input unchanged
In [22]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
...: scaler = preprocessing.StandardScaler(with_mean=False, with_std=False).fit(X)
...: scaler.transform([[-1., 1., 0.]])
...:
Out[22]: array([[-1., 1., 0.]])
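After scaling, each feature has zero mean and unit variance (this alone does not make the data normally distributed)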
In [24]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
...: X_scaled = preprocessing.scale(X)
...:
In [25]: X_scaled.mean(axis=0)
Out[25]: array([ 0., 0., 0.])
In [26]: X_scaled.std(axis=0)
Out[26]: array([ 1., 1., 1.])
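Zero mean and unit standard deviation do not change the shape of a distribution. The sketch below standardizes a deliberately skewed sample and shows that its skewness is unchanged. The exponential data, the seed, and the skewness helper are assumptions made up for this illustration and are not part of scikit-learn.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
X_skewed = rng.exponential(scale=2.0, size=(1000, 1))  # right-skewed sample

def skewness(a):
    """Population skewness (third standardized moment) of a 1-D array."""
    centered = a - a.mean()
    return (centered ** 3).mean() / a.std() ** 3

Z = preprocessing.scale(X_skewed)

# Mean is ~0 and std is ~1 after scaling ...
print(Z.mean(), Z.std())
# ... but the skewness is the same before and after, so the data is no
# closer to a normal distribution than it was.
print(skewness(X_skewed[:, 0]), skewness(Z[:, 0]))
```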
2. Scaling features to a given range
① Scaling with MinMaxScaler: sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
The feature_range parameter is a tuple (min, max), default (0, 1). Min-max scaling: (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
In [27]: from sklearn import preprocessing
...: import numpy as np
...: X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.],[ 0., 1., -1.]])
...: min_max_scaler = preprocessing.MinMaxScaler()
...: X_train_minmax = min_max_scaler.fit_transform(X_train)
...:
In [28]: X_train_minmax
Out[28]:
array([[ 0.5 , 0. , 1. ],
[ 1. , 0.5 , 0.33333333],
[ 0. , 1. , 0. ]])
When feature_range is given, the transformed data X_scaled is computed as follows, where min and max are the two values of feature_range:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
In [29]: from sklearn import preprocessing
...: import numpy as np
...: X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.],[ 0., 1., -1.]])
...: min_max_scaler = preprocessing.MinMaxScaler(feature_range=(1, 3))
...: X_train_minmax = min_max_scaler.fit_transform(X_train)
...: X_train_minmax
...:
Out[29]:
array([[ 2. , 1. , 3. ],
[ 3. , 2. , 1.66666667],
[ 1. , 3. , 1. ]])
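The two-step formula can be checked directly against MinMaxScaler, as in this sketch. The names lo, hi, manual, and X_test are illustrative. The last part shows that, with default settings, unseen data lying outside the training minimum and maximum is mapped outside feature_range rather than clipped.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
lo, hi = 1, 3  # the (min, max) of feature_range

# Step 1: map each column to [0, 1]; step 2: stretch and shift to [lo, hi].
X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
manual = X_std * (hi - lo) + lo

scaler = MinMaxScaler(feature_range=(lo, hi))
X_scaled = scaler.fit_transform(X_train)
print(np.allclose(manual, X_scaled))   # True

# Values outside the training range land outside [lo, hi].
X_test = np.array([[-3., -1., 4.]])
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```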
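② Scaling with MaxAbsScaler: sklearn.preprocessing.MaxAbsScaler(copy=True)
Formula: X / np.abs(X).max(axis=0), that is, every element of a feature is divided by the largest absolute value in that feature
Parameter: copy=False modifies the original array in place; otherwise the array is copied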
In [1]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[ 1., -3., 2.],[ 2., 0., 0.],[ 0., 1., -1.]])
...: max_abs_scaler = preprocessing.MaxAbsScaler()
...: X_train_maxabs = max_abs_scaler.fit_transform(X)
...: print(X_train_maxabs)
...: print(X/np.abs(X).max(axis=0))  # same result via plain NumPy broadcasting
...:
[[ 0.5 -1. 1. ]
[ 1. 0. 0. ]
[ 0. 0.33333333 -0.5 ]]
[[ 0.5 -1. 1. ]
[ 1. 0. 0. ]
[ 0. 0.33333333 -0.5 ]]
With copy=False, the elements of the original array X are overwritten with the scaled data
In [2]: from sklearn import preprocessing
...: import numpy as np
...: X = np.array([[ 1., -3., 2.],[ 2., 0., 0.],[ 0., 1., -1.]])
...: max_abs_scaler = preprocessing.MaxAbsScaler(copy=False)
...: X_train_maxabs = max_abs_scaler.fit_transform(X)
...: print(X_train_maxabs)
...: print(X)
...:
[[ 0.5 -1. 1. ]
[ 1. 0. 0. ]
[ 0. 0.33333333 -0.5 ]]
[[ 0.5 -1. 1. ]
[ 1. 0. 0. ]
[ 0. 0.33333333 -0.5 ]]
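Because copy=False can overwrite the input, it is worth keeping a backup when the raw values are still needed. This is a sketch under the assumption that X is already a floating-point NumPy array, as it is above. Other inputs (integer arrays, lists) may be copied during conversion anyway. X_backup is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1., -3., 2.], [2., 0., 0.], [0., 1., -1.]])
X_backup = X.copy()  # keep the raw values; X itself may be overwritten

scaler = MaxAbsScaler(copy=False)
X_scaled = scaler.fit_transform(X)

# The result matches the formula applied to the untouched backup.
print(np.allclose(X_scaled, X_backup / np.abs(X_backup).max(axis=0)))  # True

# X no longer holds the raw data: it was scaled in place.
print(np.allclose(X, X_backup))  # False
```

Passing copy by keyword also keeps the call valid on recent scikit-learn releases, where estimator parameters are keyword-only.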
Attributes:
In [3]: max_abs_scaler.scale_  # per-feature scaling factor applied by transform
Out[3]: array([ 2., 3., 2.])
In [4]: max_abs_scaler.max_abs_  # per-feature maximum absolute value seen during fitting
Out[4]: array([ 2., 3., 2.])
In [5]: max_abs_scaler.n_samples_seen_  # number of samples seen so far during fitting
Out[5]: 3
In [6]: X2 = np.array([[ 1., -4., 2.],[ 2., 1., 0.],[ 0., 1., -1.]])
...: max_abs_scaler.partial_fit(X2)
...:
Out[6]: MaxAbsScaler(copy=False)
In [7]: max_abs_scaler.scale_
Out[7]: array([ 2., 4., 2.])
In [8]: max_abs_scaler.max_abs_
Out[8]: array([ 2., 4., 2.])
In [9]: max_abs_scaler.n_samples_seen_
Out[9]: 6
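The session above shows partial_fit updating the running maximum and the sample count. As a sketch of why that is useful, the loop below fits the scaler batch by batch and compares it with a single fit on all the data stacked together. The names X1, incremental, and full are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X1 = np.array([[1., -3., 2.], [2., 0., 0.], [0., 1., -1.]])
X2 = np.array([[1., -4., 2.], [2., 1., 0.], [0., 1., -1.]])

# Feed the data in batches, e.g. when it does not fit in memory at once.
incremental = MaxAbsScaler()
for batch in (X1, X2):
    incremental.partial_fit(batch)

# Fitting once on the stacked data yields the same statistics.
full = MaxAbsScaler().fit(np.vstack([X1, X2]))

print(incremental.max_abs_, incremental.n_samples_seen_)
print(np.array_equal(incremental.max_abs_, full.max_abs_))  # True
```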
Apply the scaler to the test set, using the scale_ learned from the training data
In [10]: X_test = np.array([[ -3., -1., 4.]])
...: X_test_maxabs = max_abs_scaler.transform(X_test)
...: print(X_test_maxabs)
...: max_abs_scaler.scale_
...:
[[-1.5 -0.25 2. ]]
Out[10]: array([ 2., 4., 2.])