KNN Algorithm Implementation:
The KNN algorithm essentially classifies a test point by the labels of its k nearest neighbors. The basic setup of the algorithm is as follows:
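Stated more precisely (notation added here for clarity; it does not appear in the original write-up): for a query point $x$, let $N_k(x)$ be the indices of the $k$ training samples closest to $x$ under the chosen distance metric. The prediction is the majority label among those neighbors:

$$\hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbb{1}(y_i = c)$$

With Euclidean distance, $d(x, x_i) = \sqrt{\sum_{j}(x_j - x_{ij})^2}$, which is exactly what the code below computes.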
The dataset used in this article is the iris dataset from the scikit-learn library. The data printed after loading the dataset is shown below:
The 4 feature columns represent the sepal length, sepal width, petal length, and petal width of each iris.
The label indicates which class (0/1/2) each iris sample belongs to.
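The printed output itself is not reproduced here, but a minimal sketch of how the dataset can be loaded and inspected (the print statements are illustrative, not from the original script) looks like this:

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)        # (150, 4): 150 samples, 4 feature columns
print(iris.feature_names)     # sepal length/width, petal length/width (cm)
print(iris.target[:5])        # class labels, each one of 0, 1, 2
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']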
Experiment 1: The algorithm uses Euclidean distance as the distance metric, with k=3. The dataset is split into a training set and a test set according to test_size, and random_state controls the randomness of the split so the experiment can be reproduced. The predictions are then compared against the true labels to obtain the algorithm's accuracy. Here test_size is the fraction of the whole dataset assigned to the test set.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

def euclidean_distance(x1, x2):
    """Compute the Euclidean distance between two samples."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def knn(X_train, y_train, X_test, k):
    """KNN implementation.
    Parameters:
        X_train : training-set features
        y_train : training-set labels
        X_test  : test-set features
        k       : the K value in KNN
    Returns:
        y_pred : list of predicted labels
    """
    y_pred = []
    for test_sample in X_test:
        # Compute the Euclidean distance between the test sample and every training sample
        distances = [euclidean_distance(test_sample, train_sample) for train_sample in X_train]
        # Pair each distance with its label and index
        distances_with_labels = list(zip(distances, y_train, range(len(X_train))))
        # Sort by distance and keep the K nearest neighbors
        nearest_neighbors = sorted(distances_with_labels, key=lambda x: x[0])[:k]
        # Collect the labels of the K neighbors
        neighbor_labels = [neighbor[1] for neighbor in nearest_neighbors]
        # Majority vote decides the predicted class of the test sample
        predicted_label = Counter(neighbor_labels).most_common(1)[0][0]
        y_pred.append(predicted_label)
    return y_pred

if __name__ == '__main__':
    # Load the dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    accuracys = []
    # Split the dataset, letting test_size run from 0.1 to 0.8
    test_sizes = [i / 10 for i in range(1, 9)]
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
        # Choose the K value, here k=3
        k = 3
        y_pred = knn(X_train, y_train, X_test, k)
        accuracy = accuracy_score(y_test, y_pred)
        accuracys.append(accuracy)
        print(f"Accuracy: {accuracy * 100:.2f}%")
    # Plot accuracy against the actual test_size values so the x-axis matches its label
    plt.plot(test_sizes, accuracys, 'gx--')
    plt.title('Accuracy versus dataset split')
    plt.xlabel('test_size')
    plt.ylabel('accuracy')
    plt.show()
Setting the random seed to 11, 36, and 42 respectively gives the following results:
[Figures: accuracy versus test_size for random_state = 11, 36, and 42]
Before the experiment, I expected the model's accuracy to rise as the training set grows. However, the results show that, across different random seeds, there is no clear relationship between accuracy and test_size. This is because the iris dataset in the scikit-learn library is quite small, so a change in a single data point can noticeably affect training, and the splits produced by different random_state values lead to significant differences in accuracy.
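A quick way to quantify this split-induced variance (a sketch that reuses the knn() function defined above; the list of seeds is arbitrary) is to fix test_size and k and look at the spread of accuracies across several random_state values:

# Sketch: measure how much accuracy varies across random splits.
# Assumes the knn() function defined above; the seeds here are arbitrary.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
scores = []
for seed in [11, 36, 42, 7, 123]:
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    y_pred = knn(X_train, y_train, X_test, 3)
    scores.append(accuracy_score(y_test, y_pred))
print(f"mean accuracy = {np.mean(scores):.3f}, std = {np.std(scores):.3f}")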
Since individual data points have such a large influence on the model, increasing k reduces the sensitivity to single outliers or noisy samples: the vote of more surrounding neighbors dampens the effect of any one point, improving the model's generalization and reducing its dependence on local data. At the same time, a larger k effectively lowers the model's complexity, making the decision boundary smoother and reducing the risk of overfitting.
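As a toy illustration of that voting argument (the labels below are made up for demonstration): if the single closest neighbor of a point happens to be mislabeled or noisy, k=1 copies that wrong label, while a larger k lets the majority vote recover the correct class.

from collections import Counter

# Hypothetical neighbor labels sorted by distance; the closest one is noisy.
sorted_neighbor_labels = [1, 0, 0, 0, 2]

for k in (1, 3, 5):
    votes = Counter(sorted_neighbor_labels[:k])
    print(k, votes.most_common(1)[0][0])
# k=1 -> 1 (follows the noisy neighbor); k=3 and k=5 -> 0 (majority vote smooths it out)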
Next, we run an experiment with random_state=11 and increase k from 3 to 10, observing how the accuracy changes. The code is as follows:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

def euclidean_distance(x1, x2):
    """Compute the Euclidean distance between two samples."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def knn(X_train, y_train, X_test, k):
    """KNN implementation.
    Parameters:
        X_train : training-set features
        y_train : training-set labels
        X_test  : test-set features
        k       : the K value in KNN
    Returns:
        y_pred : list of predicted labels
    """
    y_pred = []
    for test_sample in X_test:
        # Compute the Euclidean distance between the test sample and every training sample
        distances = [euclidean_distance(test_sample, train_sample) for train_sample in X_train]
        # Pair each distance with its label and index
        distances_with_labels = list(zip(distances, y_train, range(len(X_train))))
        # Sort by distance and keep the K nearest neighbors
        nearest_neighbors = sorted(distances_with_labels, key=lambda x: x[0])[:k]
        # Collect the labels of the K neighbors
        neighbor_labels = [neighbor[1] for neighbor in nearest_neighbors]
        # Majority vote decides the predicted class of the test sample
        predicted_label = Counter(neighbor_labels).most_common(1)[0][0]
        y_pred.append(predicted_label)
    return y_pred

if __name__ == '__main__':
    # Load the dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    accuracys = []
    # Split the dataset once, with test_size=0.8 and random_state=11
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=11)
    # Sweep the K value from 3 to 10
    for k in range(3, 11):
        y_pred = knn(X_train, y_train, X_test, k)
        accuracy = accuracy_score(y_test, y_pred)
        accuracys.append(accuracy)
        print(f"Accuracy: {accuracy * 100:.2f}%")
    k_position = list(range(3, 11))
    plt.plot(k_position, accuracys, 'gx--')
    plt.title('random_state=11')
    plt.xlabel('k')
    plt.ylabel('accuracy')
    plt.show()
We can see that the model achieves its best accuracy when k is 8.