This exercise applies an SVM with a Gaussian kernel to fit a non-linear decision boundary.
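As a quick reference, the Gaussian kernel scores the similarity of two points as exp(-||x1 - x2||^2 / (2σ^2)). The sketch below is a minimal NumPy version; the names `gaussian_kernel` and `sigma_to_gamma` and the sample vectors are made up here for illustration. scikit-learn's `SVC(kernel='rbf')` writes the same kernel as exp(-gamma * ||x1 - x2||^2), so gamma corresponds to 1 / (2σ^2).

```python
import numpy as np


def gaussian_kernel(x1, x2, sigma):
    """Gaussian (RBF) similarity between two vectors with bandwidth sigma."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    sq_dist = np.sum((x1 - x2) ** 2)  # squared Euclidean distance
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def sigma_to_gamma(sigma):
    """Convert a bandwidth sigma into the gamma used by scikit-learn's RBF kernel."""
    return 1.0 / (2.0 * sigma ** 2)


# Illustrative inputs only (not taken from the data set)
k = gaussian_kernel([1, 2, 1], [0, 4, -1], sigma=2.0)
print(k, sigma_to_gamma(0.1))
```

A small σ (large gamma) makes the kernel very local, so the boundary can bend sharply; a large σ (small gamma) smooths it out.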
First, load the dataset and plot it:
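#----------------------------part4---------------------------#
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
# Read the dataset and arrange it into a DataFrame: two feature columns plus the label
path = r'C:\Users\Huanuo\PycharmProjects\ml\ex6_svm\ex6\ex6data2.mat'
m = loadmat(path)
df1 = pd.DataFrame(m['X'])
df2 = pd.DataFrame(m['y'])
df3 = pd.concat([df1, df2], axis=1)
df3.columns = [1, 2, 3]
# Visualize the data: positive examples as red stars, negative examples as yellow plus signs
X1 = df3.loc[df3[3] == 1, 1]
Y1 = df3.loc[df3[3] == 1, 2]
plt.scatter(X1, Y1, marker='*', color='r')
X2 = df3.loc[df3[3] == 0, 1]
Y2 = df3.loc[df3[3] == 0, 2]
plt.scatter(X2, Y2, marker='+', color='y')
plt.show()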
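Next, build the model with an SVM. I fell into a lot of pitfalls along the way, some of them big ones. Here is the code that finally worked: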
import numpy as np
from sklearn.svm import SVC
from scipy.io import loadmat
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
def load_data():
    path = r'C:\Users\Huanuo\PycharmProjects\ml\ex6_svm\ex6\ex6data2.mat'
    m = loadmat(path)
    x, y = m['X'], m['y']
    y = np.ravel(y)  # flatten the (m, 1) label column to shape (m,)
    scaler = StandardScaler()
    x_std = scaler.fit_transform(x)  # standardize each feature
    x_train, y_train = x_std, y
    return x_train, y_train
def svm_c(x_train, y_train):
    # RBF kernel; class_weight='balanced' weights classes by inverse frequency
    svc = SVC(kernel='rbf', class_weight='balanced')
    c_range = np.logspace(-5, 15, 11, base=2)
    gamma_range = np.logspace(-9, 3, 13, base=2)
    # Parameter ranges for the cross-validated grid search; cv=3 means 3-fold
    param_grid = [{'kernel': ['rbf'], 'C': c_range, 'gamma': gamma_range}]
    grid = GridSearchCV(svc, param_grid, cv=3, n_jobs=-1)
    # Train the model
    clf = grid.fit(x_train, y_train)
    # Accuracy on a test set (disabled: no separate test set is used here)
    # score = grid.score(x_test, y_test)
    plotsvm(x_train, clf, y_train)
    # print('accuracy: %s' % score)
def plotsvm(x_train, clf, y_train):
    # Visualization
    # step size in the mesh
    h = .02
    # create a mesh to plot in
    x_min, x_max = x_train[:, 0].min() - 0.02, x_train[:, 0].max() + 0.02
    y_min, y_max = x_train[:, 1].min() - 0.02, x_train[:, 1].max() + 0.02
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
    plt.axis('off')
    # Also plot the training points
    color_map = {0: (0, 0, .9), 1: (1, 0, 0)}
    colors = [color_map[y] for y in y_train]
    plt.scatter(x_train[:, 0], x_train[:, 1], c=colors, edgecolors='black')
    plt.show()
if __name__ == '__main__':
    svm_c(*load_data())
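This code definitely works. More importantly, it introduced me to an important workflow for building a model:
1. Convert the raw data into a format the SVM software or package can read;
2. Standardize the data (so that features with very different numeric ranges do not hurt classifier performance);
3. If you do not know which kernel to use, consider RBF;
4. Use cross-validated grid search to find the best parameters (C, γ) (cross-validation guards against picking parameters that overfit; grid search looks for the best parameters within a specified range);
5. Train the model with the best parameters;
6. Test.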
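In other words, the parameters are tuned by grid search, which amounts to an exhaustive search: every (C, γ) combination in the grid is tried and scored by cross-validation, and the best-scoring one is kept.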
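The six steps can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses synthetic `make_moons` data as a stand-in for the exercise data, a deliberately small grid, and a held-out test split so that step 6 actually runs. Putting the scaler and the SVM into one pipeline is a design choice made here, so that the scaler is refit on the training part of each cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step 1: data in array form (assumption: synthetic stand-in, NOT ex6data2)
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Steps 2-3: standardization followed by an RBF-kernel SVM
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Step 4: small illustrative grid over C and gamma, scored by 3-fold CV
param_grid = {
    'svc__C': np.logspace(-3, 7, 6, base=2),
    'svc__gamma': np.logspace(-5, 3, 5, base=2),
}
grid = GridSearchCV(pipe, param_grid, cv=3)

# Step 5: fit; with the default refit=True the best combination is
# retrained on the whole training split
grid.fit(X_train, y_train)

# Step 6: evaluate on data the search never saw
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_, test_accuracy)
```

`grid.best_params_` reports which combination won, which is handy when deciding whether the grid needs to be widened.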
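One more thing worth noting: when plotting the decision boundary, I do not yet know how to draw the boundary line directly, so for now I predict every point of a full mesh grid and paint the predictions as filled regions.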
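One possible way to draw the boundary itself, offered as a sketch rather than the definitive answer: evaluate the classifier's `decision_function` on the mesh and ask matplotlib for the single contour at level 0, which is where the predicted class flips. The helper name `decision_grid`, the synthetic data, and the hand-picked `C` and `gamma` are all assumptions made for this illustration.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere; remove to open a window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC


def decision_grid(clf, X, step=0.02, pad=0.2):
    """Evaluate clf.decision_function on a mesh covering the points in X."""
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    return xx, yy, zz.reshape(xx.shape)


# Synthetic stand-in data and hand-picked parameters (both assumptions)
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel='rbf', C=10.0, gamma=2.0).fit(X, y)

xx, yy, zz = decision_grid(clf, X)
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='black')
# The boundary is the set of points where the decision function equals 0
ax.contour(xx, yy, zz, levels=[0], colors='k', linewidths=2)
```

To view the result, call `plt.show()` with an interactive backend or `fig.savefig(...)`. A fitted `GridSearchCV` object can be passed in place of `clf`, since it forwards `decision_function` to its best estimator.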
Another problem is still unsolved: why does a basic SVM (without grid-search tuning) only produce a very poor fit? The figure below shows the result.