Learning machine learning by following Avik-Jain on GitHub:
https://github.com/Avik-Jain/100-Days-Of-ML-Code
Day1: A first look at how a machine learning program is implemented
Learning tasks:
Load the libraries commonly used in machine learning (numpy, pandas and sklearn), learn to read csv files and perform basic operations on them
Basic data preprocessing operations and dataset splitting
1. Load the machine learning libraries, read a csv file and perform basic operations
The pandas, numpy and scikit-learn libraries can be installed with either conda or pip:
conda install pandas / pip install pandas
conda install numpy / pip install numpy
conda install scikit-learn / pip install scikit-learn
import numpy as np
import pandas as pd
'''Step1 and Step2. Import the libraries (numpy, pandas) and read the data from csv'''
#read the csv data file
data = pd.read_csv('../100-Days-Of-ML-Code-master/datasets/Data.csv')
#get the numpy values of data (i.e. the cell values without the column labels and the row index)
val = data.values
#get the shape (dimensions) of data
size = data.shape
#get the first n rows of data
head = data.head(n=3)
#get the column labels of data
column = data.columns
#select rows of data by label with loc
_loc4 = data.loc[:4]#loc slicing is label-based and inclusive, so this returns rows 0-4, i.e. the first 5 rows
_loc1 = data.loc[1]#get the row with label 1
#select values by row and column position with iloc, [rows, columns]
x = data.iloc[:, :-1].values#all rows, every column except the last (the features)
y = data.iloc[:, 3].values#all rows, the 4th column (the labels)
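A minimal pandas sketch of the loc/iloc difference on a made-up toy frame (illustrative values only): loc selects by label and its slice end is inclusive, while iloc selects by integer position and its slice end is exclusive.
import pandas as pd
toy = pd.DataFrame({'a': [10, 20, 30, 40, 50]})#toy frame with the default integer index 0..4
print(toy.loc[:2])#label-based and inclusive: rows with labels 0, 1, 2 (three rows)
print(toy.iloc[:2])#position-based and exclusive: only the first two rows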
2. Data preprocessing: fixing missing data
Use Imputer from sklearn.preprocessing
"""Step3. Uses Imputer in sklearn.preprocessing to replace the missing data"""
from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder
#replace the missing NaN entries with the mean of their column
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(x[ : , 1:3])
x[ : , 1:3] = imputer.transform(x[ : , 1:3])
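Note that Imputer was removed from scikit-learn in release 0.22; a minimal equivalent sketch for newer versions, assuming the same x array as above, uses SimpleImputer from sklearn.impute:
import numpy as np
from sklearn.impute import SimpleImputer
simple_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')#column-wise mean imputation
x[:, 1:3] = simple_imputer.fit_transform(x[:, 1:3])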
3. Data preprocessing: encoding categorical labels
Use LabelEncoder and OneHotEncoder from sklearn.preprocessing
"""Step4. Uses LabelEncoder and OneHotEncoder to encode the labels"""
labelencoder_X = LabelEncoder()
#first encode the category labels as 0, 1, 2, ... (e.g. for the country column, LabelEncoder sorts the classes alphabetically: France->0, Germany->1, Spain->2)
x[:, 0] = labelencoder_X.fit_transform(x[:, 0])
#then convert the integer codes to one-hot vectors, i.e. each code becomes a vector with a single 1, such as [1,0,0], [0,1,0], [0,0,1]
onehotencoder = OneHotEncoder(categorical_features=[0])
x = onehotencoder.fit_transform(x).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(y)
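The categorical_features argument of OneHotEncoder was likewise removed in newer scikit-learn releases; a minimal sketch of the current approach, assuming the original x still holds the country strings in column 0, wraps OneHotEncoder in a ColumnTransformer:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')#one-hot encode column 0, pass the other columns through unchanged
x = ct.fit_transform(x)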
4. Dataset splitting: split into a training set and a test set by ratio
"""Step5. Splitting the datasets into training sets and Test sets"""
#train_test_split is provided by sklearn.model_selection in recent versions
#in some older versions it is provided by sklearn.cross_validation
#test_size is the fraction of samples held out for the test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, Y, test_size=0.2, random_state=0)
5. Data preprocessing: feature scaling (standardization)
"""Step6. Feature Scaling, (x-x_mean)/x_std, in column(attributes)"""
#subtract the sample mean of each feature and then divide by its standard deviation, so that every feature has zero mean and unit variance
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)#fit the scaler on the training set only, then transform it
x_test = sc_x.transform(x_test)#apply the training-set mean and standard deviation to the test set
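To check what the scaler learned, the per-feature statistics estimated from the training set can be inspected (mean_ and scale_ are documented StandardScaler attributes):
print(sc_x.mean_)#per-feature mean computed from x_train
print(sc_x.scale_)#per-feature standard deviation computed from x_train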
Day2: simple linear regression with a single input variable
Learning tasks:
Implement linear regression prediction for a single input variable and visualize the results as a scatter plot and a line plot
Linear regression predicts y = f(x; w, b) = w*x + b
Optimizing the model:
w and b are the model parameters, x is the input, yp is the predicted value and yi is the ground-truth label (usually just called the label, also the ground truth); training chooses w and b so that the sum of squared errors between yp and yi over the training samples is minimized (ordinary least squares).
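A minimal NumPy sketch of this least-squares fit for the single-feature case, using the closed-form solution (the xi and yi values here are made up purely for illustration):
import numpy as np
xi = np.array([1.0, 2.0, 3.0, 4.0])#toy inputs (illustrative only)
yi = np.array([2.1, 3.9, 6.2, 7.8])#toy labels (illustrative only)
w = np.sum((xi - xi.mean()) * (yi - yi.mean())) / np.sum((xi - xi.mean()) ** 2)#slope minimizing the squared error
b = yi.mean() - w * xi.mean()#intercept, so the line passes through the mean point
yp = w * xi + b#predictions of the fitted line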
1. Read the data, split the dataset and preprocess it
import numpy as np
import pandas as pd
#load the visualization tool matplotlib.pyplot
import matplotlib.pyplot as plt
'''Get the dataset and preprocess it.'''
#load the dataset
dataset = pd.read_csv('../100-Days-Of-ML-Code-master/datasets/studentscores.csv')
#get the sample values (the inputs)
x = dataset.iloc[:, :1].values
#get the ground-truth values of the prediction target (the labels)
y = dataset.iloc[:, 1].values
#split the dataset
"""divide the dataset into a training set and a test set"""
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/4, random_state=0)
2. Load the linear regression model
#load sklearn's linear regression module
"""Build the linear regression model"""
from sklearn.linear_model import LinearRegression
#instantiate the LinearRegression class and fit it with the training samples and labels
regressor = LinearRegression()
regressor.fit(x_train, y_train)
#predict on the test-set samples
"""predict"""
pred = regressor.predict(x_test)
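The fitted parameters correspond to the w and b of the formula above and can be read off the model (coef_ and intercept_ are documented LinearRegression attributes):
print(regressor.coef_)#the learned slope w (one entry per input feature)
print(regressor.intercept_)#the learned intercept b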
3. Visualize the results
"""display: plt.scatter() for the scatter plot, plt.plot() for the line plot"""
#training set: ground-truth points and the fitted line
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.show()
plt.close()
#test set: ground-truth points and the fitted line
plt.scatter(x_test, y_test, color='red')
plt.plot(x_test, regressor.predict(x_test), color='blue')
plt.show()
plt.close()