K-Nearest Neighbors (KNN): How It Works
For a point whose class is unknown:
1. Compute the distance between every point in the labeled dataset and the current point;
2. Sort the points by increasing distance;
3. Take the k points closest to the current point;
4. Count the frequency of each class among these k points;
5. Return the most frequent class among the k points as the predicted class of the current point.
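As a compact illustration, the five steps above can be sketched in a few lines of NumPy (`knn_predict` is an illustrative name; the toy data mirrors the `createDataSet` example further below):

```python
import numpy as np

def knn_predict(x, data, labels, k):
    # Step 1: Euclidean distance from x to every labeled point
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    # Steps 2-3: sort by distance and keep the k nearest indices
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote among the labels of the k nearest points
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_predict(np.array([0.2, 0.1]), data, labels, 3))  # → B
```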
Overview:
KNN is simple and effective; it is a lazy-learning algorithm.
The classifier needs no training phase: the training data is simply stored, so the training time complexity is effectively zero.
The cost of classification, however, grows linearly with the size of the training set: if the training set contains n documents, classifying one query takes O(n) time.
The algorithm has three basic elements:
1. the choice of k;
2. the distance metric;
3. the classification decision rule.
Problem:
A major weakness of KNN appears when the classes are imbalanced: if one class has far more samples than the others, the k nearest neighbors of a new sample tend to be dominated by the large class simply because it is large.
Solution:
Give different samples different weights, e.g. weight each neighbor's vote by its distance to the query.
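One common weighting scheme is inverse-distance voting: each neighbor's vote counts 1/d instead of 1, so close neighbors outweigh a distant majority. A minimal sketch (`weighted_knn` and the toy data are hypothetical, not from the original code):

```python
import numpy as np

def weighted_knn(x, data, labels, k, eps=1e-8):
    # Each of the k nearest neighbors votes with weight 1 / distance,
    # so nearby points count more than a distant majority class.
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    scores = {}
    for i in nearest:
        scores[labels[i]] = scores.get(labels[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(scores, key=scores.get)

# One close 'A' sample vs. a cluster of 'B' samples far away:
data = np.array([[0.0, 0.0], [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]])
labels = ['A', 'B', 'B', 'B']
# A plain majority vote with k=3 would return 'B' (two of the three
# nearest points are 'B'); the weighted vote returns 'A'.
print(weighted_knn(np.array([0.5, 0.5]), data, labels, 3))  # → A
```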
The code is as follows:
import numpy as np
import operator

def createDataSet():
    """Return a tiny toy dataset: four 2-D points and their labels."""
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
def classify0(inX, dataSet, labels, k):
    """Classify inX by majority vote among its k nearest neighbors."""
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every point in dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices of the training points sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Python 3: dict.iteritems() was removed; use items() instead
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def file2matrix(filename):
    """Parse a tab-separated file into a 3-column feature matrix and a label vector."""
    with open(filename) as fr:
        arrayOfLines = fr.readlines()
    numOfLines = len(arrayOfLines)
    returnMat = np.zeros((numOfLines, 3))
    classLabelVector = []
    for index, line in enumerate(arrayOfLines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector
def autoNorm(dataSet):
    """Scale every feature to [0, 1]: (x - min) / (max - min)."""
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
def datingClassTest():
    """Hold out the first 10% of the data for testing and report the error rate."""
    hoRatio = 0.1
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 4)
        print('the classifier came back with: %d, the real answer is: %d'
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is: %f' % (errorCount / float(numTestVecs)))
def classifyperson():
    """Classify a single hand-entered sample against the dating dataset."""
    resultList = ['not at all', 'in small doses', 'in large doses']
    input_man = np.array([20000, 10, 5])
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Normalize the query with the same ranges as the training data
    result = classify0((input_man - minVals) / ranges, normMat, datingLabels, 3)
    print('you will probably like this person:', resultList[result - 1])
if __name__ == '__main__':
    # group, labels = createDataSet()
    # test = classify0([3, 3], group, labels, 3)
    # print(test)
    classifyperson()
Output of the commented-out classify0([3, 3], ...) test: A