Study notes on the k-nearest neighbors (kNN) algorithm in machine learning

Latest recommended article published 2024-04-20 19:30:30


License: CC 4.0 BY-SA

Tags: #machine-learning #algorithms #data

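This article presents a machine learning project based on the k-nearest neighbors algorithm: how to read data from a text file and convert it into a usable data set, how to normalize the data automatically, and how to validate the algorithm with a classification test.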


This is a project from a machine learning book. First we need a data source, and then we can operate on it.

The data source is a plain text file:

datingTestSet2.txt

For the data, see http://download.youkuaiyun.com/download/u012005313/9190017

Next we need to write the program that reads the text file, along with the matching display and test functions.

from numpy import array, shape, tile, zeros
from kalgorithm import classify0  # classify0 is assumed to live in kalgorithm.py

def file2matrix(filename):
    """Read a tab-separated txt file into a feature matrix and a label list."""
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberofLines = len(arrayOLines)
    returnMat = zeros((numberofLines, 3))  # one row per line, three features
    classLabelVector = []
    for index, line in enumerate(arrayOLines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))  # last column is the class label
    return returnMat, classLabelVector

def displayfile(datingDataMat, datingLabels):
    """Scatter-plot columns 1 and 2, sizing and coloring points by class label."""
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111)
    sizes = 15.0 * array(datingLabels)
    ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], s=sizes, c=sizes)
    plt.show()
    return 0

def autoNorm(dataSet):
    """Min-max normalize each column: (value - min) / (max - min)."""
    minVals = dataSet.min(0)  # per-column minimum
    maxVals = dataSet.max(0)  # per-column maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise division
    return normDataSet, ranges, minVals

def datingClassTest():
    """Hold out the first 10% of the samples as a test set and print the error rate."""
    hoRatio = 0.10  # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d , the real answer is : %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is %f" % (errorCount / float(numTestVecs)))

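The listing above is the file-reading program. Note that the numpy module must be imported first. NumPy is a library commonly used in machine learning; look it up online for details.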
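To make the expected file layout and the normalization step concrete, here is a minimal sketch. It assumes each line holds three numeric features and an integer class label separated by tabs. The sample values and the helper names `parse_lines` and `min_max_normalize` are made up for this illustration and do not come from the data file. NumPy broadcasting is used here in place of `tile`, which should give the same result.

```python
import numpy as np

# Hypothetical sample in the assumed layout: three features, a tab between
# fields, and an integer class label in the last column.
SAMPLE = "10\t2.0\t0.5\t1\n30\t4.0\t1.5\t2\n20\t3.0\t1.0\t3\n"

def parse_lines(text):
    """Split tab-separated lines into a feature matrix and a label list."""
    rows = [line.strip().split("\t") for line in text.splitlines() if line.strip()]
    features = np.array([[float(v) for v in row[:3]] for row in rows])
    labels = [int(row[-1]) for row in rows]
    return features, labels

def min_max_normalize(data):
    """Scale each column to [0, 1] via (x - min) / (max - min)."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting applies the per-column values to every row;
    # this assumes no column is constant (range of zero).
    return (data - min_vals) / ranges, ranges, min_vals

features, labels = parse_lines(SAMPLE)
norm, ranges, min_vals = min_max_normalize(features)
print(norm)
print(labels)
```

Without normalization, the first column (values in the tens) would dominate the Euclidean distance over the other two, which is why the features are rescaled before classification.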

Next comes the data analysis: the k-nearest neighbors algorithm classifies each sample and decides which class it belongs to.
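The idea can be sketched with the standard library alone: measure the distance from the query to every labeled point, keep the k closest, and take a majority vote. The function name `knn_classify` and the sample points are assumptions made for this illustration; it is not the implementation from these notes, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Return the majority label among the k points closest to query."""
    # Pair each Euclidean distance with its label, then sort by distance.
    by_distance = sorted(
        (math.dist(query, p), label) for p, label in zip(points, labels)
    )
    # Count the labels of the k nearest neighbors and pick the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training points: two near (1, 1) labeled A, two near (0, 0) labeled B.
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]

print(knn_classify((0.1, 0.2), points, labels, k=3))
print(knn_classify((0.9, 1.2), points, labels, k=3))
```

With k=3, the first query has two B neighbors and one A neighbor, and the second has two A neighbors and one B neighbor, so the votes go to B and A respectively.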

I tested the code myself and found a few problems caused by Python version differences.
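One such difference, noted in a comment in the classifier code, is that Python 3 no longer provides `dict.iteritems()`, so `dict.items()` is used when ranking the votes. A minimal sketch with made-up vote counts (the name `class_count` is only a stand-in for the dictionary built by the classifier):

```python
from operator import itemgetter

# Made-up vote counts, standing in for the classCount dictionary.
class_count = {"A": 1, "B": 2}

# Python 3: dict.items() replaces Python 2's dict.iteritems().
ranked = sorted(class_count.items(), key=itemgetter(1), reverse=True)
print(ranked[0][0])  # label with the most votes
```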

from numpy import array, tile
from operator import itemgetter

def createDataSet():
    """Return a tiny labeled data set for trying out the classifier."""
    group = array([[1.0, 2.2], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # indices from nearest to farthest
    classCount = {}
    for i in range(k):
        # count one vote for the label of each of the k nearest neighbors
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # The book uses classCount.iteritems(), but Python 3 removed that method; use items() instead.
    sortedClassCount = sorted(classCount.items(), key=itemgetter(1), reverse=True)