Having written one post on the kNN algorithm, here is part two.
Part one is here: http://blog.youkuaiyun.com/liuqiao18434391822/article/details/78079018
This test runs on irisdata.txt. You can download the data set directly; it is an openly available machine learning data set.
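The algorithm goes roughly as follows:
1. Load the data and split it into a training set and a test set.
2. Compute and return the distance between a test instance and a training instance. Euclidean distance is used here, i.e. sqrt((a1-b1)^2+(a2-b2)^2) in two dimensions, which generalizes to vectors of any dimension.
3. Using the function from step 2, find and return the k training instances closest to the test instance.
4. Among the k instances returned by step 3, find the label that occurs most often and return it. That label is the predicted class.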
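As a quick check of the distance formula in step 2, here is a minimal sketch of how it extends from two dimensions to four. The function name `euclidean` and the sample vectors are made up for illustration and are not part of the listing in this post.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Sum the squared per-coordinate differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two-dimensional case: the classic 3-4-5 right triangle.
d2 = euclidean([0.0, 0.0], [3.0, 4.0])

# Four-dimensional case (illustrative values): differences are 1, 2, 2, 4,
# so the squared sum is 1 + 4 + 4 + 16 = 25.
d4 = euclidean([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])

print(d2, d4)  # 5.0 5.0
```

The same function works for any number of dimensions, because it only pairs up coordinates and sums their squared differences.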
#-*- coding:utf-8 -*-
#
import math
import random
import operator
import csv
import numpy as np
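def loadDataset(filename, split, trainingSet=None, testSet=None):
    # Use None defaults: a mutable default list would be shared across calls.
    if trainingSet is None:
        trainingSet = []
    if testSet is None:
        testSet = []
    with open(filename, 'r') as csvfile:  # text mode, as csv.reader expects on Python 3
        lines = csv.reader(csvfile)  # read all rows of the CSV file
        dataset = list(lines)  # convert to a list of rows
    for row in dataset:
        if not row:  # skip empty rows, such as the one left by a trailing blank line
            continue
        for i in range(4):
            row[i] = float(row[i])  # the four measurements; the last column is the label
        if random.random() < split:  # randomly assign the row to the training or test set
            trainingSet.append(row)
        else:
            testSet.append(row)
    return trainingSet, testSet
# loadDataset('irisdata.txt', 0.5, trainingSet=[], testSet=[])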
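# Return the Euclidean distance between two instances
def euclideanDistance(instance1, instance2):
    # instance2 is a labeled training row, so its last element is the label.
    # instance1 may be a labeled row or a bare feature vector, so the number
    # of features is taken from instance2 to avoid dropping a feature.
    n = len(instance2) - 1
    instance1 = np.array(instance1[:n])
    instance2 = np.array(instance2[:n])
    squaredDiffs = (instance2 - instance1) ** 2
    return math.sqrt(squaredDiffs.sum(axis=0))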
# Return the k nearest training instances
def getNeaborhod(testInstance, trainset, k):
    distances = []
    for i in range(len(trainset)):
        distance = euclideanDistance(testInstance, trainset[i])
        distances.append(distance)
    distances = np.array(distances)  # np.array so that argsort can return the indices in ascending order of distance
    distancesSort = distances.argsort()
    neighbors = []
    for x in range(min(k, len(trainset))):  # never ask for more neighbors than exist
        neighbors.append(trainset[distancesSort[x]])
    return neighbors  # the k nearest instances
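# Return the most common label among the neighbors
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        label = neighbors[x][-1]  # the label is the last element of each instance
        classVotes[label] = classVotes.get(label, 0) + 1
    # sort (label, count) pairs by count, highest first
    result = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return result[0][0]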
def main():
    trainSet, testSet = loadDataset('irisdata.txt', 0.8, trainingSet=[], testSet=[])  # about 80% goes to the training set
    print('trainSet1 %d' % len(trainSet))
    print('testSet1 %d' % len(testSet))
    n = 0.0  # number of correct predictions
    for i in range(len(testSet)):
        test = testSet[i]
        neighbor = getNeaborhod(test, trainSet, 3)
        predicted = getResponse(neighbor)
        if predicted == test[-1]:
            n = n + 1
    allSet = float(len(testSet))
    result = n / allSet  # accuracy on the test set
    print(result)
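# A second, independent random split. It runs at module level, before main() is called.
trainSet, testSet = loadDataset('irisdata.txt', 0.8, trainingSet=[], testSet=[])
print('trainSet2 %d' % len(trainSet))
print('testSet2 %d' % len(testSet))
a = [6.1, 2.8, 4.7, 1.2]  # a new, unlabeled sample with four measurements
def predict(new, train):
    neighbor = getNeaborhod(new, train, 3)
    label = getResponse(neighbor)
    print(label)
if __name__ == '__main__':
    main()
    predict(a, trainSet)  # classify the new sample against the training split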
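Steps 3 and 4 can also be written with only the standard library. The sketch below is an illustrative variant, not the listing above: `knn_predict` and the toy rows are made-up names and data. It assumes each training row has the form [feature_1, ..., feature_n, label] and that the query holds only the n features. `heapq.nsmallest` picks the k closest rows without sorting every distance, and `collections.Counter` does the vote.

```python
import heapq
import math
from collections import Counter

def knn_predict(query, train, k=3):
    """Majority-vote label among the k training rows closest to `query`."""
    def dist(row):
        # row[:-1] drops the label so that only features are compared.
        return math.sqrt(sum((q - x) ** 2 for q, x in zip(query, row[:-1])))

    nearest = heapq.nsmallest(k, train, key=dist)
    votes = Counter(row[-1] for row in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration (not taken from irisdata.txt).
toy_train = [
    [1.0, 1.0, "small"], [1.2, 0.9, "small"], [0.9, 1.1, "small"],
    [5.0, 5.0, "large"], [5.1, 4.8, "large"], [4.9, 5.2, "large"],
]

print(knn_predict([1.1, 1.0], toy_train))  # small
print(knn_predict([5.0, 5.1], toy_train))  # large
```

With an odd k and two classes there is always a clear winner. With more classes a tie is possible, and which tied label is returned then depends on the order of the neighbors.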
The results are fairly satisfying: accuracy on the test set reached about 95%.
The output of the run:
trainSet2 112
testSet2 38
trainSet1 109
testSet1 41
0.951219512195
Iris-versicolor
[Finished in 1.5s]
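The two splits in the output have different sizes (112/38 and 109/41) because each row is assigned to the training set at random with probability 0.8, so the split and the accuracy change from run to run. If repeatable numbers are wanted, one option is to seed the random generator. The sketch below assumes a hypothetical helper `split_rows` and uses a list of integers as a stand-in for the 150 data rows; neither is part of the code above.

```python
import random

def split_rows(rows, ratio, seed=None):
    """Randomly assign each row to train (probability `ratio`) or test."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test

rows = list(range(150))  # stand-in for the 150 rows of the data set

train_a, test_a = split_rows(rows, 0.8, seed=42)
train_b, test_b = split_rows(rows, 0.8, seed=42)

print(len(train_a) + len(test_a))  # 150: every row lands in exactly one set
print(train_a == train_b)          # True: the same seed gives the same split
```

Seeding fixes which rows land in each set, but the set sizes are still only approximately 80/20. Shuffling the rows and cutting at a fixed index would give an exact ratio instead.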