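Such a noob, such a noob, such a noob, such a noob, such a noob
Today is May 25. Nice weather, good mood, and I stood up my dentist to learn Python. Go for it, Lu Xiaokui!
Data preprocessing: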
import numpy as np
np.random.seed(0)
X_train_fpath = './data/X_train'
Y_train_fpath = './data/Y_train'
X_test_fpath = './data/X_test'
output_fpath = './output_{}.csv'
# Convert the csv files into numpy arrays
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
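- next(f) reads one line from the file per call; here it is called once to skip the header row
- the with statement calls close() automatically when the block ends, and I hear opening files with "with ... as" is the elegant way
- split() is a string method (not an array method); it splits each line on ','
- the [1:] after split() keeps every column after the first, i.e. it drops the id column
- strip() removes characters from the two ends of a string only. Its argument is treated as a set of characters, so the order inside the () does not matter: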
a = 'u r stupid'
b = a.strip('r u')
c = 'sorry ' + b
print(b)
Whether you strip 'r u' or 'u r', the result is the same: b is 'stupid'
Normalization:
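To check the notes above without touching the real data files, here is a small sketch that runs the same next / strip / split / [1:] steps on an in-memory CSV. The name fake_csv and its values are made up for illustration; they only stand in for ./data/X_train.

```python
import io
import numpy as np

# Hypothetical in-memory CSV standing in for ./data/X_train (made-up values)
fake_csv = io.StringIO("id,age,income\n0,25,3.5\n1,40,7.0\n")

next(fake_csv)  # consume one line: the header row

# Each remaining line: drop the newline, split on ',', drop the id column
rows = [line.strip('\n').split(',')[1:] for line in fake_csv]
print(rows)  # still lists of strings at this point

X_demo = np.array(rows, dtype=float)  # dtype=float converts the strings
print(X_demo.shape)

# strip() only trims the two ends and treats its argument as a set of characters
print('u r stupid'.strip('r u'))  # stupid
print('u r stupid'.strip('u r'))  # stupid
print('xaxbx'.strip('x'))         # axb (the x in the middle stays)
```

The last line is the part I would have gotten wrong: strip() never touches characters in the middle of the string.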
def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specific columns of X.
    # The mean and standard deviation of training data will be reused when processing testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None', all columns
    #         will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    # print(specified_column)
    # Prints the column indexes:
    # [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
    #   18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
    #  ...
    #  504 505 506 507 508 509]
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)
    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)  # 1e-8 guards against division by zero
    return X, X_mean, X_std
# Normalize the training data and the test data
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _ = _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)
# The variable _ is used to hold return values we don't need
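- arange generates an evenly spaced sequence (an arithmetic progression)
- shape[1] returns the number of columns of a 2-D array
- reshape(1, -1) turns the array into a single row; -1 is a placeholder that numpy works out by itself
- the first argument of mean must be an array (or something array-like); the second argument axis=0 collapses the rows and averages each column, giving a 1-D array of length n, which reshape(1, -1) then turns into a 1*n array
- True, False and None start with a capital letter
- note that no mean or standard deviation is computed for the test set; it reuses the ones from the training set
- note that to average over rows or columns, mean() needs the axis argument!! Without it you get the mean of all elements
- add a small constant to the divisor (1e-8 in the code above) to avoid dividing by zero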
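To convince myself of the points above, here is a minimal sketch on made-up toy data (the names X_tr and X_te and their values are invented for illustration, not the real dataset). It shows what the axis argument changes, what reshape(1, -1) does to the shape, and how the test data reuses the training statistics.

```python
import numpy as np

# Made-up toy data: 3 training samples and 1 test sample, 2 features each
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

print(np.mean(X_tr))         # no axis: mean of all six elements, 11.0
col_mean = np.mean(X_tr, 0)  # axis=0: one mean per column
print(col_mean.shape)        # (2,)

mean_row = col_mean.reshape(1, -1)        # shape (1, 2); numpy infers the -1
std_row = np.std(X_tr, 0).reshape(1, -1)  # shape (1, 2)

# Broadcasting subtracts the (1, 2) row from every sample
X_tr_norm = (X_tr - mean_row) / (std_row + 1e-8)
# The test data reuses the training mean and std; nothing is recomputed
X_te_norm = (X_te - mean_row) / (std_row + 1e-8)

print(X_tr_norm.mean(0))  # close to 0 for each column
print(X_tr_norm.std(0))   # close to 1 for each column
print(X_te_norm)
```

After normalization each training column has mean close to 0 and standard deviation close to 1, while the test row is simply shifted and scaled by the same numbers.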
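I never quite understood the expression X[:, specified_column]. specified_column is an array, and the only way I knew to take several columns was a slice like X[:, m:n]; I just hadn't seen enough code. It turns out you can also pass an array (or list) of indexes to say which rows and columns to take. BTW, two kind gentlemen pointed out that output printed without commas does not mean you can leave the commas out of the input. When I tried to build a small example by analogy, the printed specified_column had no commas, so I left them out of my own index list and ended up with the following broken code (it raises a SyntaxError):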
`X = np.array([[1,2],[3,4]])
Y = X[:,[0 1]]
print(Y)`
The correct version should be
X = np.array([[1,2],[3,4]])
Y = X[:,[0,1]]
print(Y)
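A slightly bigger sketch of the same idea, on a made-up 3x4 array, comparing index-array selection with the slice form I already knew. The variable names here are only for illustration.

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # rows: [0..3], [4..7], [8..11]
cols = np.arange(X.shape[1])     # array([0, 1, 2, 3])

print(cols)  # printed WITHOUT commas: [0 1 2 3]

print(X[:, [0, 2]])  # a list of indexes picks columns 0 and 2
print(X[:, [3, 0]])  # any order works, unlike a slice
print(X[:, 1:3])     # the slice form: columns 1 and 2

# Indexing with every column index gives back the same values as X
print(np.array_equal(X[:, cols], X))  # True
```

So the printed form [0 1 2 3] is just how numpy displays an array; when writing a list by hand, the commas are still required.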
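BTW, passing specified_column = None to _normalize means all columns are used, but only because the function itself replaces None with np.arange(X.shape[1]). Indexing an array with None directly is a different thing: it inserts a new axis instead of taking all rows or columns.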
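A quick check of what None does as an index, since it is easy to mix up with ":". In numpy, indexing with None inserts a new axis of length 1 rather than selecting everything; the "all columns" behaviour in _normalize comes from the function's own None check. This is only a small sketch on a made-up 2x2 array.

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])

print(X[:, :].shape)       # (2, 2): ':' really takes everything
print(X[:, None].shape)    # (2, 1, 2): None adds a new axis instead
print(np.newaxis is None)  # True: np.newaxis is just an alias for None
```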