KNN002

License: CC 4.0 BY-SA

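This post covers problems encountered while implementing the KNN algorithm and how they were fixed, including use of the numpy module, drawing scatter plots with Matplotlib, and the effect of changing k on the error rate. The AttributeError and NameError exceptions hit along the way, caused by module imports, indentation, and function definitions, are resolved one by one.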


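numpy arrays behave differently from ordinary Python lists (for example, arithmetic on an array is applied element by element).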
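As a small illustration of that difference (the values are made up for the example), the same operators mean different things for a list and for an array:

```python
import numpy as np

py_list = [1.0, 2.0, 3.0]
np_arr = np.array(py_list)

# "+" concatenates lists but adds arrays element by element
list_sum = py_list + py_list      # 6 elements
arr_sum = np_arr + np_arr         # 3 elements, each doubled

# "*" with an int repeats a list but scales an array
list_twice = py_list * 2
arr_twice = np_arr * 2

# arrays also carry a shape and accept multi-axis indexing
grid = np.array([[1.0, 1.1], [0.0, 0.1]])
first_column = grid[:, 0]

print(list_sum, arr_sum)
print(list_twice, arr_twice)
print(grid.shape, first_column)
```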
AttributeError: module 'kNN' has no attribute 'file2matrix'
The likely cause is bad indentation: the function is not picked up as a module-level definition.
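The effect is easy to reproduce. In the sketch below, two hypothetical source strings stand in for kNN.py; the only difference between them is the indentation of file2matrix. The helper name load is made up for the example.

```python
import types

# Hypothetical stand-in for kNN.py: file2matrix is indented one level
# too deep, so it sits inside classify0 after the return statement.
BAD_SOURCE = '''
def classify0():
    return "A"

    def file2matrix(filename):
        return filename
'''

# Same code with file2matrix at module level.
GOOD_SOURCE = '''
def classify0():
    return "A"

def file2matrix(filename):
    return filename
'''


def load(name, source):
    """Build a throwaway module object from a source string."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


bad = load("kNN_bad", BAD_SOURCE)
good = load("kNN_good", GOOD_SOURCE)

print(hasattr(bad, "file2matrix"))   # False: the def is nested and unreachable
print(hasattr(good, "file2matrix"))  # True
```

Because the nested def comes after a return, it never even runs; Python raises no syntax error, which is why the problem only shows up as an AttributeError at the call site.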
Create a scatter plot with Matplotlib, using a different color for the points of each class.
NameError: name 'array' is not defined
  Adding these two lines makes the error go away (array comes from the numpy import):
    import os
    from numpy import *
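Once plt.show() brings up the figure, the code after it stops executing, and after closing the figure by hand, calling show() again does nothing either. From what I found, the cause is that wxPython does not start again once it has been closed, which conflicts with the Python command line.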
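One way to sidestep a blocking show() is to write the figure to a file instead. The sketch below is an assumption about what would suit this workflow, not what the post did: it uses the non-interactive Agg backend, made-up data in place of datingTestSet2.txt, and a hypothetical output name scatter_demo.png. It also colors the points by class label.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np

# made-up stand-in data: two features and a class label in {1, 2, 3}
rng = np.random.default_rng(0)
features = rng.random((60, 2))
labels = rng.integers(1, 4, size=60)

fig = plt.figure()
ax = fig.add_subplot(111)  # 1 row, 1 column, first slot
# s sets the marker size and c the marker color, both driven by the label
ax.scatter(features[:, 0], features[:, 1], s=15.0 * labels, c=15.0 * labels)
fig.savefig("scatter_demo.png")  # hypothetical output file name
plt.close(fig)
```

When a window is still wanted in an interactive session, plt.show(block=False) or plt.ion() are the usual alternatives to try.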
If names that belong to numpy cannot be found, numpy was most likely not imported correctly:
NameError: name 'zeros' is not defined
NameError: name 'tile' is not defined
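Both errors go away once the names are reachable. A minimal sketch, assuming the common `import numpy as np` convention rather than the star import, with made-up values:

```python
import numpy as np

point = np.array([0.5, 0.5])

# np.zeros builds a zero-filled array of the given shape
blank = np.zeros((4, 2))

# np.tile repeats `point` 4 times down the rows and once across the columns
stacked = np.tile(point, (4, 1))

print(blank.shape, stacked.shape)
print(stacked)
```

Qualifying every call with `np.` makes it obvious where a name comes from, and a missing import then fails on `np` itself rather than on one function at a time.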
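Trying different values of k:
Test set 10%, k=2: error rate 5%
Test set 10%, k=3: error rate 5%
Test set 10%, k=4: error rate 3%
Test set 10%, k=5: error rate 4%
Test set 10%, k=6: error rate 6%
Test set 20%, k=4: error rate 6.7%
Test set 20%, k=3: error rate 8%
Test set 20%, k=5: error rate 7%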
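A sweep like this is easy to automate. The sketch below is not the code that produced the numbers above: it uses synthetic three-class data in place of datingTestSet2.txt, and the helper names knn_predict and error_rate are made up for illustration, so its error rates will differ.

```python
from collections import Counter

import numpy as np


def knn_predict(x, train_x, train_y, k):
    """Majority vote among the k training points nearest to x."""
    distances = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


def error_rate(data_x, data_y, ho_ratio, k):
    """Hold out the first ho_ratio of the rows as the test set."""
    num_test = int(len(data_x) * ho_ratio)
    train_x, train_y = data_x[num_test:], data_y[num_test:]
    errors = sum(
        knn_predict(data_x[i], train_x, train_y, k) != data_y[i]
        for i in range(num_test)
    )
    return errors / num_test


# synthetic data: three Gaussian blobs (an assumption, not the dating set)
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [2.0, 2.0], [0.0, 3.0]])
data_y = rng.integers(0, 3, size=300)
data_x = centers[data_y] + rng.normal(scale=0.8, size=(300, 2))

results = {}
for ho_ratio in (0.10, 0.20):
    for k in range(2, 7):
        rate = error_rate(data_x, data_y, ho_ratio, k)
        results[(ho_ratio, k)] = rate
        print(f"test set {ho_ratio:.0%}  k={k}  error rate {rate:.1%}")
```

With only a few hundred rows, differences of a percentage point or two between neighboring k values can come down to a handful of test points, so a sweep like this is better read as a rough guide than as a ranking.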

#!/usr/bin/python

import numpy as np
import operator
from numpy import *

def createDataSet():
    group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]]) 
    labels = np.array(['A','A','B','B'])
    return group,labels
    

def classify0(inX, dataSet, labels, k):
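    # inX: the unknown point to classify
    # dataSet: the training set, one known point per row
    # labels: the label for each row of dataSet

    # Step 1: compute the distances
    # shape returns a (rows, columns) tuple
    dataSetSize = dataSet.shape[0]
    # tile builds an array of dataSetSize rows, each a copy of inX,
    # so it has the same number of columns as dataSet;
    # subtracting gives the per-feature difference between inX and every point
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet

    # square every element
    sqDiffMat = diffMat**2
    # sum each row, giving a 1-D array of squared distances
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root
    distances = sqDistances**0.5
    # sort the distances from inX to each point in ascending order
    # and return the indices of that ordering
    sortedDisIndicies = distances.argsort()
    # print(sortedDisIndicies)
    # Step 2: pick the k nearest points
    # classCount is a dict: key = label, value = number of votes
    classCount = {}
    # walk the k nearest points and tally their labels
    for i in range(k):
        # labels and dataSet are aligned, so the sorted index finds the label
        voteIlabel = labels[sortedDisIndicies[i]]
        # dict.get(key, default) returns 0 for a label not seen yet,
        # otherwise its current count; either way add 1
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sorted: the first argument is an iterable, key= picks the sort field
    # (the count), reverse=True sorts from largest to smallest
    # items() returns an iterable of (key, value) pairs
    sortedClassCount = sorted(classCount.items(),
        key=operator.itemgetter(1), reverse=True)
    # print(sortedClassCount)
    return sortedClassCount[0][0]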

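def file2matrix(filename):
    # read all lines; the with-block closes the file afterwards
    with open(filename) as fr:
        arrayOlines = fr.readlines()
    numberOfLines = len(arrayOlines)
    # zero-filled matrix: numberOfLines rows, first 3 columns of the file
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOlines:
        # strip() removes leading/trailing whitespace, including the newline
        line = line.strip()
        # split the line into fields on tabs
        listFromLine = line.split('\t')
        # fill row `index` with the first three fields (0, 1, 2)
        # [index, :] selects every column of one row of a 2-D numpy array;
        # [:, 0] would select every row of column 0
        returnMat[index, :] = [float(field) for field in listFromLine[0:3]]
        # the label is the last field of each line
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector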


# Normalization (min-max scaling of each feature to [0, 1])
def autoNorm(dataSet):
    # min(0)/max(0) work down each column, giving one value per feature
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # element-wise division of each feature value, not matrix division
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

# Evaluation
def datingClassTest():
    # fraction of the data held out as the test set (0.20 = 20%)
    hoRatio = 0.20
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    print("number of rows:", m)
    print("number of test vectors:", numTestVecs)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify test row i against the remaining rows, with k = 5
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
            datingLabels[numTestVecs:m], 5)
        print("the classifier came back with: %d, the real answer is: %d"
            % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    # print("error count is: %d" % (errorCount))

# import kNN
# from importlib import reload
# group, labels = kNN.createDataSet()
# datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
# import os
# from numpy import *
# import matplotlib
# import matplotlib.pyplot as plt
# fig = plt.figure()
## lay the figure out as 1 row by 1 column and put ax in the first slot
# ax = fig.add_subplot(111)
# ax.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*array(datingLabels), 15.0*array(datingLabels))
# plt.show()
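
The tile calls in autoNorm are not strictly required: numpy broadcasting stretches a per-column row across every row of the matrix. A minimal sketch with made-up numbers (the function name auto_norm_broadcast is hypothetical, and like autoNorm it assumes no column is constant, otherwise the range is zero):

```python
import numpy as np


def auto_norm_broadcast(data_set):
    """Min-max scale each column to [0, 1] using broadcasting."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # (m, n) array minus (n,) array: the row is applied to every row
    norm = (data_set - min_vals) / ranges
    return norm, ranges, min_vals


# made-up rows whose features sit on very different scales
sample = np.array([
    [40000.0, 8.0, 0.9],
    [14000.0, 7.0, 1.6],
    [26000.0, 1.0, 0.8],
])
norm, ranges, min_vals = auto_norm_broadcast(sample)
print(norm)
print(ranges, min_vals)
```

Without this scaling, the first column would dominate the Euclidean distance in classify0 simply because its values are thousands of times larger.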