Data Preprocessing
Dataset Splitting
Parameters of sklearn.model_selection.train_test_split:

Parameter | Accepted values | Description
---|---|---
*arrays | lists / np.arrays / matrices / pandas DataFrames | the sample sets to be split
**options | |
test_size | float between 0.0 and 1.0 | proportion of the samples to place in the test set; defaults to 0.25
train_size | float between 0.0 and 1.0 | proportion of the samples to place in the training set; defaults to the complement of test_size
random_state | int or RandomState | seed / random number generator controlling the shuffling
shuffle | bool | whether to shuffle the samples before splitting; defaults to True
Examples
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
[8, 9]])
>>> y_test
[1, 4]
import numpy as np

def train_test_split(X, Y, test_ratio=0.2, seed=None):
    """Split samples X and labels Y into random train and test subsets."""
    assert X.shape[0] == Y.shape[0], "the size of X must equal the size of Y"
    assert 0.0 < test_ratio < 1.0, "test_ratio must be valid"
    if seed is not None:
        np.random.seed(seed)
    # permutation(n) returns a shuffled copy of arange(n); the input is untouched
    shuffle_index = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    train_index = shuffle_index[test_size:]
    test_index = shuffle_index[:test_size]
    x_train = X[train_index]
    y_train = Y[train_index]
    x_test = X[test_index]
    y_test = Y[test_index]
    return x_train, x_test, y_train, y_test
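A quick sanity check of this hand-rolled splitter on the same toy arrays as above (the seed and ratio here are arbitrary; Y is passed as an np.array because the function indexes it with a NumPy index array):

import numpy as np

X, Y = np.arange(10).reshape((5, 2)), np.arange(5)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_ratio=0.2, seed=666)
# with 5 samples and test_ratio=0.2, int(5 * 0.2) = 1 sample goes to the test set
print(x_train.shape, x_test.shape)  # (4, 2) (1, 2)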
Data Normalization
1. When feature values differ widely in scale, model training is easily dominated by a few large-valued features.
2. Solutions (a short numeric sketch of both follows this list):
   - Min-max normalization (Normalization): map all values into the range 0 to 1.
     $$X_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
     Assessment: works well when the data has clear natural bounds (e.g. exam scores), but is heavily affected by outliers.
   - Mean-variance normalization (Standardization): rescale the data to a distribution with mean 0 and variance 1.
     $$X_{scale} = \frac{x - x_{mean}}{s}$$
     where s is the feature's standard deviation.
     Assessment: a safe choice for essentially any data, bounded or not.
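To make the two formulas concrete, here is a minimal NumPy sketch applying both to the same toy vector (the sample values are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 acts as an outlier

# min-max normalization: (x - min) / (max - min), squeezed into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # the outlier pushes the four normal values toward 0

# mean-variance normalization: (x - mean) / std
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())  # ~0.0 and 1.0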
3. One practical question:
   - Normalization needs the statistics of a sample set: the mean and standard deviation, or the minimum and maximum. During training these are computed from the training set, which raises a question: once the trained model is deployed, how should incoming test samples be normalized? The answer is to use the training set's mean and standard deviation (or min and max). Hence a normalization model (a Scaler) is first fit on the training set.
'MinMaxScaler'
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit the normalization model (Scaler) on the training set
scaler.fit(X_train)
# per-feature maxima of the training set
scaler.data_max_
# per-feature minima of the training set
scaler.data_min_
# normalize the training set
X_train = scaler.transform(X_train)
# normalize the test set with the same scaler (training-set statistics)
X_test = scaler.transform(X_test)
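For completeness, a self-contained run of the snippet above on made-up data (the X_train/X_test values are illustrative only):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[1.5, 25.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)
print(scaler.data_min_, scaler.data_max_)  # [ 1. 10.] [ 3. 30.]
# test samples are scaled with the training set's min/max, so values
# outside the training range can fall outside [0, 1]
print(scaler.transform(X_test))            # [[0.25 0.75]]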
'StandardScaler'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
# per-feature means of the training set
scaler.mean_
# per-feature standard deviations of the training set
scaler.scale_
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
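As a quick check (again with made-up numbers), the transformed training set ends up with per-feature mean 0 and standard deviation 1, while the test set merely reuses the training statistics:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_, scaler.scale_)             # [ 2. 20.] and the per-feature stds
print(scaler.transform(X_train).mean(axis=0))  # ~[0. 0.]
print(scaler.transform(X_train).std(axis=0))   # ~[1. 1.]
print(scaler.transform(X_test))                # not mean 0: uses training stats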
import numpy as np

class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Learn the per-feature mean and standard deviation from the training set X."""
        assert X.ndim == 2, "The dimension of X must be 2"
        # compute the mean and standard deviation of every feature (column)
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])
        return self

    def transform(self, X):
        """Apply mean-variance normalization to X using the fitted statistics."""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
            "the feature number of X must equal that of mean_ and scale_"
        scaledX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            scaledX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return scaledX
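A minimal usage run of the hand-rolled class, mirroring the sklearn workflow above (toy numbers, for illustration only):

import numpy as np

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuses the training-set mean_/scale_
print(X_train_scaled.mean(axis=0))          # ~[0. 0.]
print(X_test_scaled)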