1. Overview
- KNN, the K-Nearest Neighbors algorithm, infers a sample's class from its "neighbors."
- KNN can be applied to both classification and regression problems.
- Core idea: if the majority of the k most similar samples (the nearest neighbors in feature space) belong to one class, the unknown sample is assigned to that class.
- Classification workflow:
  - Compute the distance from the unknown sample to every training sample
  - Sort the training samples by distance in ascending order
  - Take the K nearest training samples
  - Hold a majority vote: count which class appears most often among the K samples
  - Assign the unknown sample to the most frequent class
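The classification steps above can be sketched from scratch with NumPy; the function name `knn_classify` and the toy 2-D dataset are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    # 1. Compute the distance from the unknown sample to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2-3. Sort by distance (ascending) and keep the indices of the k nearest
    nearest = np.argsort(dists)[:k]
    # 4. Majority vote: count each class among the k nearest labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    # 5. Assign the unknown sample to the most frequent class
    return labels[np.argmax(counts)]

# Toy data: two points per class in 2-D feature space
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array([1, 1, 0, 0])
print(knn_classify(np.array([0.1, 0.1]), X_train, y_train, k=3))  # → 0
```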
- Regression workflow:
  - Compute the distance from the unknown sample to every training sample
  - Sort the training samples by distance in ascending order
  - Take the K nearest training samples
  - Predict the average of the K samples' target values as the value for the unknown sample
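The regression workflow differs only in the last step, averaging targets instead of voting; here is a minimal sketch (the name `knn_regress` and the 1-D toy data are illustrative assumptions):

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    # Distance from the unknown sample to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest training samples (ascending distance)
    nearest = np.argsort(dists)[:k]
    # Predict the mean of their target values
    return y_train[nearest].mean()

# Toy 1-D data where the target equals the feature
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_regress(np.array([2.0]), X_train, y_train, k=3))  # → 2.0
```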
- Choosing k:
  - A small k makes the model sensitive to noise and prone to overfitting; a large k over-smooths the decision and can underfit.
  - In practice, k is usually chosen by cross-validation (e.g. grid search).
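The effect of k can be explored with cross-validation; this is a minimal sketch on the iris dataset, with the candidate list of k values chosen arbitrarily for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Compare mean 5-fold cross-validated accuracy for several k values
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```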
2. Iris Classification with KNN
```python
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import joblib
from sklearn.metrics import accuracy_score

# 1. Load the iris dataset and inspect it
data = load_iris()
print(data.feature_names)
data_df = pd.DataFrame(data.data, columns=data.feature_names)
data_df['target'] = data.target
print(data_df)

# 2. Split into training and test sets (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=22)

# 3. Standardize features: fit the scaler on the training set only,
#    then apply the same transform to the test set
transformer = StandardScaler()
x_train = transformer.fit_transform(x_train)
x_test = transformer.transform(x_test)

# 4. Tune n_neighbors with a 4-fold cross-validated grid search
estimator = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': [2, 4, 5, 7, 8]},
    cv=4)
estimator.fit(x_train, y_train)
print(estimator.best_params_)

# 5. Use the model refit with the best parameters rather than
#    hard-coding k (GridSearchCV refits on the full training set by default)
best_model = estimator.best_estimator_

# 6. Save the model, reload it, and evaluate on the test set
joblib.dump(best_model, 'iris.pth')
model = joblib.load('iris.pth')
print(model.score(x_test, y_test))
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
```