Machine Learning: kNN

This article walks through validating a kNN classifier on the Iris dataset: loading the data, computing distances, and voting among neighbors to identify iris species, then evaluating the algorithm's accuracy.


Validating the kNN algorithm on the Iris dataset.

```python
# coding: utf-8
import csv
import random
import numpy as np


def loadData(filename):
    """Read the Iris CSV file and map each species name to an integer label."""
    key_value = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    all_data_label = []
    with open(filename, 'r', newline='') as csvfile:
        lines = csv.reader(csvfile)
        for i in lines:
            if not i:  # skip blank lines
                continue
            temp = [float(i[0]), float(i[1]), float(i[2]), float(i[3])]
            temp.append(key_value.get(i[4]))
            all_data_label.append(temp)
    return all_data_label


def cal_distances(train_vec, test_vec):
    """Euclidean distance between two 4-dimensional feature vectors."""
    train_vec = np.array(train_vec)
    test_vec = np.array(test_vec)
    # Square the per-feature differences *before* summing; the original
    # summed first, which does not give the Euclidean distance.
    return np.sqrt(np.sum((train_vec - test_vec) ** 2))


def vote(result):
    """Return the label that occurs most often among the k nearest neighbors."""
    result = list(result)
    label = set(result)
    index = 0
    count = 0
    for i in label:
        temp = result.count(i)
        if temp > count:
            index = i
            count = temp
    return index


def find_label(train_set, test_point, k):
    """Predict the label of test_point from its k nearest training points."""
    get_label_list = []
    for each in train_set:
        train_vec, train_label = each[:4], each[-1]
        vec_distance = cal_distances(train_vec, test_point)
        get_label_list.append([train_label, vec_distance])
    result_k = np.array(get_label_list)
    order_distance = (result_k.T)[1].argsort()         # indices sorted by distance
    order = np.array((result_k[order_distance].T)[0])  # labels in that order
    top_k = np.array(order[:k], dtype=int)             # labels of the k nearest
    if k == 1:
        return top_k[0]
    else:
        return vote(top_k)


def get_accuracy(test_set, train_set, k):
    """Classify every test point and report the percentage predicted correctly."""
    count = 0
    for each in test_set:
        test_vec, test_label = each[:4], each[-1]
        result = find_label(train_set=train_set, test_point=test_vec, k=k)
        if result == test_label:
            count += 1
    accuracy = count / float(len(test_set)) * 100
    print("accuracy", accuracy)
    return accuracy


if __name__ == '__main__':
    data = loadData('./Iris.data')
    random.shuffle(data)                # shuffle before splitting
    split_point = int(len(data) * 0.8)  # 80% train / 20% test
    train_data = data[:split_point]
    test_data = data[split_point:]

    get_accuracy(test_set=test_data, train_set=train_data, k=1)
```
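
To see how sensitive the result is to the choice of k, the same helpers can be reused to sweep several values on one fixed split. This is a minimal sketch under the assumption that it is appended to the script above (so `loadData` and `get_accuracy` are in scope) and that `./Iris.data` exists:

```python
# Hypothetical k-sweep: reuses loadData() and get_accuracy() from the script above.
data = loadData('./Iris.data')
random.shuffle(data)
split_point = int(len(data) * 0.8)
train_data, test_data = data[:split_point], data[split_point:]

for k in (1, 3, 5, 7, 9):
    # get_accuracy() prints and returns the accuracy percentage for this k.
    get_accuracy(test_set=test_data, train_set=train_data, k=k)
```

Odd values of k are the usual convention, since they reduce (though with three classes do not eliminate) the chance of tied votes.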


Iris data (save the listing below as `./Iris.data` so the script above can load it):

```
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
```

### kNN Algorithm Overview

kNN (K-Nearest Neighbors), also known as the K-nearest-neighbor algorithm, is a basic machine learning algorithm suited to both classification and regression. Its core idea is to use similarity in feature space to decide which class an unknown sample belongs to. Concretely, given a training set with known class labels, for each new sample to be predicted we compute its distance to every training sample and select the K nearest neighbors as reference points; the target sample is then assigned a class based on those neighbors' labels, by majority vote or a distance-weighted average[^2].

### The Importance of Data Preprocessing

In practice, preparing high-quality data is critical. Especially for complex tasks such as handwritten digit recognition, it is not enough to choose a suitable data source; you also need effective functions that convert raw images into a form the model can consume, which raises the odds that the subsequent kNN pattern matching succeeds[^4].

### Python Implementation Example

The following short Python snippet demonstrates how to build and evaluate a basic kNN classifier:

```python
import numpy as np

def create_dataset():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify_knn(inX, dataset, labels, k):
    dataSetSize = dataset.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataset  # difference to every point
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                      # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                  # tally the k nearest labels
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=lambda item: item[1], reverse=True)
    return sortedClassCount[0][0]

group, labels = create_dataset()
test_data_point = [0, 0]
predicted_label = classify_knn(test_data_point, group, labels, 3)
print(f"The predicted label of {test_data_point} is: {predicted_label}")
```

The program above defines a small set of two-dimensional points with their class labels and implements `classify_knn()`, which takes a new observation `inX`, the training data `dataset` with its associated true labels `labels`, and the parameter `k`. It also shows how to call this custom method to carry out a concrete prediction[^3].

### Result Analysis

To gauge how well the model performs, the algorithm is usually run on a set of test examples held out from training, and the proportion of misclassifications among all test cases, the error rate, is used to measure the system's accuracy.
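
As a concrete illustration of the error-rate measurement, the toy classifier above can be scored on a few held-out points. This is a minimal sketch; the test points and their "true" labels here are made up for illustration:

```python
# Hypothetical held-out points: (feature vector, true label). Illustrative only.
test_points = [([0.1, 0.2], 'B'), ([1.1, 1.0], 'A'), ([0.9, 0.8], 'A')]

errors = 0
for point, true_label in test_points:
    # Reuse classify_knn(), group, and labels from the snippet above.
    if classify_knn(point, group, labels, 3) != true_label:
        errors += 1

error_rate = errors / len(test_points)  # misclassifications / total tests
print(f"error rate: {error_rate:.2f}")
```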