Learning machine learning by following Avik-Jain on GitHub:
https://github.com/Avik-Jain/100-Days-Of-ML-Code
Day1: A first look at how a machine learning program is implemented
Learning tasks:
Load the libraries commonly used in machine learning (numpy, pandas and sklearn), learn to read csv files and perform basic operations on them
Basic data preprocessing operations and dataset splitting
1. Load the machine learning libraries, read a csv file and perform basic operations
The pandas, numpy and scikit-learn libraries can be installed with either conda or pip:
conda install pandas / pip install pandas
conda install numpy / pip install numpy
conda install scikit-learn / pip install scikit-learn
import numpy as np
import pandas as pd
'''Step1 and Step2. Import the libraries (numpy, pandas) and read the data from csv'''
#read the csv data file
data = pd.read_csv('../100-Days-Of-ML-Code-master/datasets/Data.csv')
#get the numpy values of data (i.e. the cell values without the column labels and the row index)
val = data.values
#get the shape (dimensions) of data
size = data.shape
#get the first n rows of data
head = data.head(n=3)
#get the column labels of data
column = data.columns
#select rows of data by label with loc
_loc4 = data.loc[:4]#loc slicing is label-based and inclusive, so this returns rows 0-4, i.e. the first 5 rows
_loc1 = data.loc[1]#get the row with label 1
#select values by row and column position with iloc, [rows, columns]
x = data.iloc[:, :-1].values#all rows, every column except the last (the features)
y = data.iloc[:, 3].values#all rows, the 4th column (the labels)
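A minimal pandas sketch of the loc/iloc difference on a made-up toy frame (illustrative values only): loc selects by label and its slice end is inclusive, while iloc selects by integer position and its slice end is exclusive.
import pandas as pd
toy = pd.DataFrame({'a': [10, 20, 30, 40, 50]})#toy frame with the default integer index 0..4
print(toy.loc[:2])#label-based and inclusive: rows with labels 0, 1, 2 (three rows)
print(toy.iloc[:2])#position-based and exclusive: only the first two rows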
2. Data preprocessing: fixing missing data
Use Imputer from sklearn.preprocessing
"""Step3. Uses Imputer in sklearn.preprocessing to replace the missing data"""
from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder
#replace the missing NaN entries with the mean of their column
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(x[ : , 1:3])
x[ : , 1:3] = imputer.transform(x[ : , 1:3])
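Note that Imputer was removed from scikit-learn in release 0.22; a minimal equivalent sketch for newer versions, assuming the same x array as above, uses SimpleImputer from sklearn.impute:
import numpy as np
from sklearn.impute import SimpleImputer
simple_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')#column-wise mean imputation
x[:, 1:3] = simple_imputer.fit_transform(x[:, 1:3])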
3. Data preprocessing: encoding categorical labels
Use LabelEncoder and OneHotEncoder from sklearn.preprocessing
"""Step4. Uses LabelEncoder and OneHotEncoder to encode the labels"""
labelencoder_X = LabelEncoder()
#first encode the category labels as 0, 1, 2, ... (e.g. for the country column, LabelEncoder sorts the classes alphabetically: France->0, Germany->1, Spain->2)
x[:, 0] = labelencoder_X.fit_transform(x[:, 0])
#then convert the integer codes to one-hot vectors, i.e. each code becomes a vector with a single 1, such as [1,0,0], [0,1,0], [0,0,1]
onehotencoder = OneHotEncoder(categorical_features=[0])
x = onehotencoder.fit_transform(x).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(y)
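The categorical_features argument of OneHotEncoder was likewise removed in newer scikit-learn releases; a minimal sketch of the current approach, assuming the original x still holds the country strings in column 0, wraps OneHotEncoder in a ColumnTransformer:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')#one-hot encode column 0, pass the other columns through unchanged
x = ct.fit_transform(x)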
4. Dataset splitting: split into a training set and a test set by ratio
"""Step5. Splitting the datasets into training sets and Test sets"""
#train_test_split is provided by sklearn.model_selection in recent versions
#in some older versions it is provided by sklearn.cross_validation
#test_size is the fraction of samples held out for the test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, Y, test_size=0.2, random_state=0)
5. Data preprocessing: feature scaling (standardization)
"""Step6. Feature Scaling, (x-x_mean)/x_std, in column(attributes)"""
#subtract the sample mean of each feature and then divide by its standard deviation, so that every feature has zero mean and unit variance
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)#fit the scaler on the training set only, then transform it
x_test = sc_x.transform(x_test)#apply the training-set mean and standard deviation to the test set
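To check what the scaler learned, the per-feature statistics estimated from the training set can be inspected (mean_ and scale_ are documented StandardScaler attributes):
print(sc_x.mean_)#per-feature mean computed from x_train
print(sc_x.scale_)#per-feature standard deviation computed from x_train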
Day2: simple linear regression with a single input variable
Learning tasks:
Implement linear regression prediction for a single input variable and visualize the results as a scatter plot and a line plot
Linear regression predicts y = f(x; w, b) = w*x + b
Optimizing the model:
w and b are the model parameters, x is the input, yp is the predicted value and yi is the ground-truth label (usually just called the label, also the ground truth); training chooses w and b so that the sum of squared errors between yp and yi over the training samples is minimized (ordinary least squares).
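A minimal NumPy sketch of this least-squares fit for the single-feature case, using the closed-form solution (the xi and yi values here are made up purely for illustration):
import numpy as np
xi = np.array([1.0, 2.0, 3.0, 4.0])#toy inputs (illustrative only)
yi = np.array([2.1, 3.9, 6.2, 7.8])#toy labels (illustrative only)
w = np.sum((xi - xi.mean()) * (yi - yi.mean())) / np.sum((xi - xi.mean()) ** 2)#slope minimizing the squared error
b = yi.mean() - w * xi.mean()#intercept, so the line passes through the mean point
yp = w * xi + b#predictions of the fitted line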
1. Read the data, split the dataset and preprocess it
import numpy as np
import pandas as pd
#load the visualization tool matplotlib.pyplot
import matplotlib.pyplot as plt
'''Get the dataset and preprocess it.'''
#load the dataset
dataset = pd.read_csv('../100-Days-Of-ML-Code-master/datasets/studentscores.csv')
#get the sample values (the inputs)
x = dataset.iloc[:, :1].values
#get the ground-truth values of the prediction target (the labels)
y = dataset.iloc[:, 1].values
#split the dataset
"""divide the dataset into a training set and a test set"""
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/4, random_state=0)
2. Load the linear regression model
#load sklearn's linear regression module
"""Build the linear regression model"""
from sklearn.linear_model import LinearRegression
#instantiate the LinearRegression class and fit it with the training samples and labels
regressor = LinearRegression()
regressor.fit(x_train, y_train)
#predict on the test-set samples
"""predict"""
pred = regressor.predict(x_test)
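The fitted parameters correspond to the w and b of the formula above and can be read off the model (coef_ and intercept_ are documented LinearRegression attributes):
print(regressor.coef_)#the learned slope w (one entry per input feature)
print(regressor.intercept_)#the learned intercept b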
3. Visualize the results
"""display: plt.scatter() for the scatter plot, plt.plot() for the line plot"""
#training set: ground-truth points and the fitted line
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.show()
plt.close()
#test set: ground-truth points and the fitted line
plt.scatter(x_test, y_test, color='red')
plt.plot(x_test, regressor.predict(x_test), color='blue')
plt.show()
plt.close()