Hung-yi Lee (李宏毅) Machine Learning, Homework 2 (What I Learned from the Reference Solution)

Author: 「已注销」 (account deactivated)

License: CC 4.0 BY-SA

Category: 李宏毅作业 (Hung-yi Lee homework) · Tag: machine learning

Article link: https://blog.youkuaiyun.com/weixin_41595257/article/details/117254402

This article describes data preprocessing in Python: reading data from CSV files into NumPy arrays and standardizing the data, with concrete examples of how to implement the standardization.

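I'm such a noob, such a noob, such a noob.

Today is May 25. The weather is nice, I'm in a good mood, and I stood up my dentist to study Python instead. Lu Xiaokui, you can do it!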

Data preprocessing:

import numpy as np

np.random.seed(0)
X_train_fpath = './data/X_train'
Y_train_fpath = './data/Y_train'
X_test_fpath = './data/X_test'
output_fpath = './output_{}.csv'

# Convert the CSV files into NumPy arrays
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
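The loading code above can be tried without the data files. The following is a minimal sketch, assuming a made-up three-line CSV (`sample`) and a hypothetical helper `load_features`; neither is part of the homework. It parses the text the same way and, for comparison, reads it again with `np.loadtxt`.

```python
import io
import numpy as np

# Hypothetical stand-in for ./data/X_train: one header row plus two data rows.
sample = "id,age,hours\n0,39,40\n1,50,13\n"

def load_features(f):
    """Parse an open CSV file object the same way as the loading code above."""
    next(f)  # consume the header row so it is not parsed as data
    return np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)

# io.StringIO behaves like an open text file, so no file on disk is needed.
X_manual = load_features(io.StringIO(sample))

# Same result with NumPy's reader: skip one header row, then drop the id column.
X_loadtxt = np.loadtxt(io.StringIO(sample), delimiter=',', skiprows=1)[:, 1:]

print(X_manual)
print(X_manual.shape)
```

Both approaches should give the same 2 x 2 float array here; the list comprehension just makes each step (skip header, strip newline, split, drop id) explicit.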

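next(f) reads one line from the file per call; here it is called once to skip the header row
with closes the file automatically when the block exits, so no explicit close() is needed; I hear that opening files with with ... as is the elegant way
split(',') is a str method (not an array method); it splits each line on ',' into a list of fields
The [1:] after split() keeps every field after the first one, i.e. it drops the id column
strip() removes characters from the two ends of a string only; its argument is treated as a set of characters, so any leading or trailing character in that set is removed no matter in which order the set is written: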

a = 'u r stupid'
b = a.strip('r u')
c = 'sorry ' + b
print(b)

Whether the argument is 'r u' or 'u r', the result is the same: b is 'stupid'
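To make the "set of characters, ends only" behaviour concrete, here is a small sketch. The strings `s` and `line` are made up for illustration.

```python
a = 'u r stupid'

# The argument is a set of characters, so its order does not matter.
same_result = a.strip('r u') == a.strip('u r')
print(same_result)

# Only the two ends are trimmed; matching characters in the middle survive.
s = 'xyhello xy worldyx'
trimmed = s.strip('xy')
print(trimmed)

# This is why line.strip('\n') in the loading code only drops the trailing newline.
line = '0,39,40\n'
fields = line.strip('\n').split(',')
print(fields)
```

Stripping stops at the first character (from each end) that is not in the set, which is why the inner `xy` in `s` is left alone.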

Standardization:

def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specific columns of X.
    # The mean and standard deviation of training data will be reused when processing testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None', all columns
    #         will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data

    if specified_column is None:
        specified_column = np.arange(X.shape[1])
        # print(specified_column)
        # Prints the column indices:
        #[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
        #  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
        # ...
        # 504 505 506 507 508 509]
    if train:
        X_mean = np.mean(X[:, specified_column] ,0).reshape(1, -1)
        X_std  = np.std(X[:, specified_column], 0).reshape(1, -1)

    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)  # 1e-8 guards against division by zero
     
    return X, X_mean, X_std


# Standardize the training and test data
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _ = _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)
# The variable _ holds return values that are not needed
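A quick way to check what the standardization does is to run it on a tiny array. The sketch below restates the function under the hypothetical name `normalize` so it runs on its own; `toy_train` and `toy_test` are made-up data, not the homework set.

```python
import numpy as np

def normalize(X, train=True, specified_column=None, X_mean=None, X_std=None):
    # Compact restatement of _normalize above so this sketch is self-contained.
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)
    # Note: this assignment writes into X itself, so X must be a float array.
    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)
    return X, X_mean, X_std

toy_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
toy_test = np.array([[3.0, 10.0]])

# Statistics are computed on the training data only ...
toy_train, mean, std = normalize(toy_train, train=True)
# ... and reused unchanged for the test data.
toy_test, _, _ = normalize(toy_test, train=False, X_mean=mean, X_std=std)

print(toy_train.mean(axis=0))  # each column should now be close to 0
print(toy_train.std(axis=0))   # each column should now be close to 1
print(toy_test)
```

After the call, each training column has mean about 0 and standard deviation about 1, while the test row is shifted and scaled by the training statistics rather than its own.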

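arange produces an evenly spaced array of numbers
shape[1] returns the number of columns of a 2-D array
reshape(1, -1) turns the array into a single row; -1 is a placeholder that tells NumPy (not pandas) to work out that dimension itself
The first argument of mean must be an array; the second argument, axis=0, collapses the rows and averages each column, returning a 1-D array of length n, which reshape(1, -1) then turns into a 1*n array
True, False and None must start with a capital letter
Note that no mean or standard deviation is computed for the test set; it reuses those of the training set
Note that to average over rows or columns, mean() needs the axis argument!!
Add a small constant to the divisor (1e-8 in the code above) to avoid division by zero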
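The notes above can be checked on a small array. This is only an illustrative sketch; `X` here is a made-up 2 x 3 matrix, not the homework data.

```python
import numpy as np

# arange yields 0..5; reshape(2, 3) arranges them as [[0, 1, 2], [3, 4, 5]].
X = np.arange(6, dtype=float).reshape(2, 3)
n_cols = X.shape[1]

col_mean = np.mean(X, 0)   # axis=0: collapse rows, one mean per column
row_mean = np.mean(X, 1)   # axis=1: collapse columns, one mean per row
all_mean = np.mean(X)      # no axis: a single mean over every element

# np.mean(X, 0) is 1-D; reshape(1, -1) makes it a 1 x n row, with -1 inferred.
col_mean_row = col_mean.reshape(1, -1)

print(n_cols)
print(col_mean, col_mean.shape)
print(row_mean, row_mean.shape)
print(all_mean)
print(col_mean_row.shape)
```

Forgetting the axis argument silently returns the single overall mean, which is why the note about adding it matters.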

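The one expression I could not understand for a long time was X[:, specified_column]. specified_column is an array, and as far as I remembered, the only way to take several columns was a slice such as X[:, m:n]. I had simply seen too little; it turns out an array can also say which rows and columns to take. BTW, two kind gentlemen pointed out that the absence of commas in the printed output does not mean the commas can be left out of the input. When I tried to build an analogous test example, I left the commas out of my own specified_column because the printed specified_column had none, and ended up with this broken code: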

X = np.array([[1,2],[3,4]])
Y = X[:,[0 1]]  # SyntaxError: the comma between 0 and 1 is missing
print(Y)

The correct analogue should be

X = np.array([[1,2],[3,4]])
Y = X[:,[0,1]]
print(Y)
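Indexing with a list or array of column numbers is more flexible than a slice. The sketch below uses a made-up 2 x 3 matrix to show a few variations; the variable names are illustrative only.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])

# A list of column indices can pick non-adjacent columns, which m:n cannot.
picked = X[:, [0, 2]]

# The order of the indices sets the order of the result.
swapped = X[:, [2, 0]]

# An index array covering every column gives back all columns,
# which is what specified_column = np.arange(X.shape[1]) does in _normalize.
everything = X[:, np.arange(X.shape[1])]

# The same works for rows.
rows_reversed = X[[1, 0], :]

print(picked)
print(swapped)
print(everything)
print(rows_reversed)
```

Unlike a slice, this kind of indexing returns a copy, so modifying `picked` does not change `X`; assigning through `X[:, [0, 2]] = ...` does write into `X`.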

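BTW, leaving specified_column as None in _normalize means all columns are used, because the function replaces None with np.arange(X.shape[1]). This is a convention of that function, not of NumPy indexing: the slice : is what selects every row or column, whereas indexing with None, as in X[:, None], inserts a new axis of length 1.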
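For reference, this is how NumPy itself treats `None` inside an index, next to the slice `:` that selects everything. A small sketch on a made-up 2 x 2 array:

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])

# The slice ':' selects every row / every column.
all_cols = X[:, :]

# None inside an index is the same object as np.newaxis: it adds an axis.
with_none = X[:, None]

print(all_cols.shape)
print(with_none.shape)
print(np.newaxis is None)
```

So a function that wants "None means all columns" has to translate `None` itself, for example into `np.arange(X.shape[1])`, before indexing.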