Machine Learning Overview (Basic Preprocessing)

Original article

License: CC 4.0 BY-SA

Article tags:

#MachineLearning #DataPreprocessing

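This article introduces the basic concepts of machine learning, including its self-improving character, why it is needed, and its main types. It then explains why data preprocessing matters and describes methods such as mean removal, range scaling, normalization, binarization, one-hot encoding, and label encoding, all of which are used to improve model performance.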

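Machine Learning

I. Overview

1. What is machine learning?

Artificial intelligence: any approach that uses artificial means to solve, or approximately solve, problems that would otherwise require human intelligence can be called artificial intelligence.
Machine learning: a computer program carries out a task T and gains experience E from doing so, and the effect of that experience is measured by a performance measure P. If the performance measured by P improves as the program accumulates experience E from carrying out T, the program is called a machine learning system.
Characteristics: self-improving, self-correcting, self-reinforcing.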

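2. Why do we need machine learning?

It simplifies or replaces manual pattern recognition, which makes systems easier to develop, maintain, and upgrade.
For problems whose algorithms are too complex, or that have no explicit solution, machine learning systems have a distinct advantage.
By studying what a machine learning process has learned, one can infer the rules hidden behind business data; this is data mining.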

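3. Types of machine learning

Supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning
Batch learning and incremental learning
Instance-based learning and model-based learning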
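
To make the first distinction concrete, the following is a minimal sketch contrasting a supervised and an unsupervised model on the same points. The toy data, and the choice of `LogisticRegression` and `KMeans` as representatives, are assumptions made for illustration and do not come from the original article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): two well-separated groups of 2-D points.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised learning: the model is fitted on (input, label) pairs.
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(X)

# Unsupervised learning: the model sees X only and finds the grouping itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print('supervised predictions:', supervised_pred)
print('cluster assignments:   ', cluster_ids)
```

The cluster ids are arbitrary: the unsupervised model may call the first group 0 or 1, because it was never told what the labels mean.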

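4. The machine learning workflow

Data
- Data collection
- Data cleaning
Machine learning
- Data preprocessing
- Model selection
- Model training
- Model validation
Business
- Using the model
- Maintenance and upgrades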
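
The machine learning stages of this workflow can be sketched in a few lines of scikit-learn. This is only an illustration under stated assumptions: the bundled iris dataset stands in for collected and cleaned data, and `StandardScaler` plus `LogisticRegression` are arbitrary choices for the preprocessing and model steps, not ones prescribed by the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: a small bundled dataset stands in for collected, cleaned data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Data preprocessing + model selection: scale the features, then classify.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model training: fit on the training split only.
model.fit(X_train, y_train)

# Model validation: accuracy on samples the model has not seen.
accuracy = model.score(X_test, y_test)
print('held-out accuracy:', accuracy)

# Using the model: predict the class of a new sample.
first_prediction = model.predict(X_test[:1])
print('prediction for the first held-out sample:', first_prediction)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only and then reused for validation and prediction.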

II. Data Preprocessing

import sklearn.preprocessing as sp

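1. Mean removal (standardization), also known as Z-score normalization

Each column (feature) of the sample matrix is shifted and rescaled so that its mean is 0 and its standard deviation is 1. All features then enter the model on a comparable scale, so no feature dominates the prediction merely because of its units or numeric range.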

$z = \frac{x - \mu}{\sigma}$

Standardization (or Z-score normalization) rescales each feature so that it has mean μ = 0 and standard deviation σ = 1, as a standard normal distribution does; it does not change the shape of the feature's distribution. Here μ is the feature's mean and σ is its standard deviation. The standard score (also called the z-score) of a sample is its value minus μ, divided by σ.
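
As a worked example of the formula, take a feature column with the three values 3, 0 and 1 (the first column of the sample matrix in this section's std.py script). The calculation assumes the population standard deviation, which divides by the number of samples and is what NumPy's `std()` computes by default:

$$\mu = \frac{3 + 0 + 1}{3} = \frac{4}{3} \approx 1.3333$$

$$\sigma = \sqrt{\frac{\left(3 - \frac{4}{3}\right)^2 + \left(0 - \frac{4}{3}\right)^2 + \left(1 - \frac{4}{3}\right)^2}{3}} = \sqrt{\frac{14}{9}} \approx 1.2472$$

$$\frac{3 - \mu}{\sigma} = \frac{5}{\sqrt{14}} \approx 1.3363, \qquad \frac{0 - \mu}{\sigma} = -\frac{4}{\sqrt{14}} \approx -1.0690, \qquad \frac{1 - \mu}{\sigma} = -\frac{1}{\sqrt{14}} \approx -0.2673$$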

sp.scale(raw sample matrix) -> sample matrix after mean removal
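
`sp.scale` standardizes a matrix using that matrix's own statistics. When the same transformation has to be applied later to data the model has not seen, scikit-learn's `StandardScaler` class can learn the column means and standard deviations once and reuse them. The sketch below assumes this use case; the `new_sample` row is a hypothetical value invented for the example.

```python
import numpy as np
import sklearn.preprocessing as sp

train = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
new_sample = np.array([[2, 0.5, 1.0, -1.0]])  # hypothetical unseen sample

# Learn each column's mean and standard deviation from the training matrix.
scaler = sp.StandardScaler().fit(train)
print('learned means:', scaler.mean_)
print('learned standard deviations:', scaler.scale_)

# On the training matrix the result matches sp.scale(train).
train_std = scaler.transform(train)

# The new sample is scaled with the training statistics, not its own.
new_std = scaler.transform(new_sample)
print('standardized new sample:', new_std)
```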

# std.py
import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)
print('Mean:', raw_samples.mean(axis=0))  # axis=0 averages down the rows, giving the mean of each column
print('Std:', raw_samples.std(axis=0))  # population standard deviation of each column

# Manual standardization: each col is a view into std_samples, so the in-place updates change the matrix itself
std_samples = raw_samples.copy()
for col in std_samples.T:
    col_mean = col.mean()
    col_std = col.std()
    col -= col_mean
    col /= col_std
print('After manual standardization:', std_samples)
print('Mean of each column after processing:', std_samples.mean(axis=0))
print('Standard deviation of each column after processing:', std_samples.std(axis=0))

std_samples = sp.scale(raw_samples)  # sample matrix after mean removal
print('After automatic standardization:', std_samples)
print('Mean of each column after processing:', std_samples.mean(axis=0))
print('Standard deviation of each column after processing:', std_samples.std(axis=0))

[[ 3.  -1.5  2.  -5.4]
 [ 0.   4.  -0.3  2.1]
 [ 1.   3.3 -1.9 -4.3]]
Mean: [ 1.33333333  1.93333333 -0.06666667 -2.53333333]
Std: [1.24721913 2.44449495 1.60069429 3.30689515]
After manual standardization: [[ 1.33630621 -1.40451644  1.29110641 -0.86687558]
 [-1.06904497  0.84543708 -0.14577008  1.40111286]
 [-0.26726124  0.55907936 -1.14533633 -0.53423728]]
Mean of each column after processing: [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Standard deviation of each column after processing: [1. 1. 1. 1.]
After automatic standardization: [[ 1.33630621 -1.40451644  1.29110641 -0.86687558]
 [-1.06904497  0.84543708 -0.14577008  1.40111286]
 [-0.26726124  0.55907936 -1.14533633 -0.53423728]]
Mean of each column after processing: [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Standard deviation of each column after processing: [1. 1. 1. 1.]
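
The column means above are on the order of 1e-17 rather than exactly 0 because of floating-point rounding. The standard deviations come out as exactly 1 only under the population definition (dividing by n), which is the default for NumPy's `std()` and what `sp.scale` uses. The following sketch, an added illustration rather than part of the original script, shows how the figure changes under the sample definition (dividing by n - 1, selected with `ddof=1`):

```python
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
z = sp.scale(raw_samples)

# Population standard deviation (ddof=0): divides by n.
population_std = z.std(axis=0)
# Sample standard deviation (ddof=1): divides by n - 1.
sample_std = z.std(axis=0, ddof=1)

print('ddof=0:', population_std)
print('ddof=1:', sample_std)
```

With three rows, the `ddof=1` value of every column is sqrt(3/2), about 1.2247. This matters when checking standardized data with a tool that defaults to the sample definition.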