Example: Using the k-Nearest-Neighbors Algorithm on a Dating Site
Workflow:
(1) Collect the data
(2) Prepare the data
(3) Analyze the data
(4) Test the algorithm
(5) Use the algorithm
1. Preparing the data: parsing data from a text file
Helen's samples contain the following 3 features (a sample record is shown right after this list):
Number of frequent flyer miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week
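For example, the first record of datingTestSet2.txt, which corresponds to the first row of the matrix printed below, is a tab-separated line of the form (reconstructed from the program's output, so treat the exact layout as an assumption):
40920	8.326976	0.953952	3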
A parser that converts the text records into a NumPy matrix:
import numpy as np
import operator   # used later by classify0

def file2matrix(filename):
    # Get the number of lines in the file
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # Create the NumPy matrix to return: a zero-filled 2-D array
    # whose second dimension is fixed at 3 (one column per feature)
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Parse the file contents, processing one line at a time
    for line in arrayOLines:
        # line.strip() removes the trailing newline characters
        line = line.strip()
        # Split the line on the tab character \t into a list of elements
        listFromLine = line.split('\t')
        # Take the first 3 elements and store them in the feature matrix
        returnMat[index, :] = listFromLine[0:3]
        # Python lets you use index -1 to address the last element of a list;
        # this negative index stores the last column in the vector classLabelVector.
        # The stored value must be converted explicitly to an integer.
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
Displaying the data:
datingDataMat,datingLabels = KNN.file2matrix('datingTestSet2.txt')
print(datingDataMat)
print(datingLabels[0:20])
Result:
D:\exercise\机器学习实战\Scripts\python.exe D:\exercise\pythonProject\mian.py
[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
[1.4488000e+04 7.1534690e+00 1.6739040e+00]
[2.6052000e+04 1.4418710e+00 8.0512400e-01]
...
[2.6575000e+04 1.0650102e+01 8.6662700e-01]
[4.8111000e+04 9.1345280e+00 7.2804500e-01]
[4.3757000e+04 7.8826010e+00 1.3324460e+00]]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
2. Analyzing the data: creating a scatter plot with Matplotlib
import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
# Plot the second column against the third column; the third and fourth
# arguments scale the marker size and color by the class label
ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
           15.0*np.array(datingLabels), 15.0*np.array(datingLabels))
plt.show()
Result: a scatter plot of the "time playing video games" feature (column 2) against the "ice cream" feature (column 3), with marker size and color indicating the class label.
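The call above encodes the class only through marker size and color. As an optional variant (not part of the original program; the helper name plotWithLegend and the choice of the first two columns are my own), here is a minimal sketch that plots each class separately so Matplotlib can draw a legend:

import numpy as np
import matplotlib.pyplot as plt

def plotWithLegend(datingDataMat, datingLabels):
    # Hypothetical helper: one scatter call per class, so each gets a legend entry
    labels = np.array(datingLabels)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    for classValue, name in [(1, 'didntLike'), (2, 'smallDoses'), (3, 'largeDoses')]:
        mask = labels == classValue
        # Columns 0 and 1: flyer miles vs. video-game time
        ax.scatter(datingDataMat[mask, 0], datingDataMat[mask, 1], s=15, label=name)
    ax.set_xlabel('frequent flyer miles per year')
    ax.set_ylabel('% time playing video games')
    ax.legend()
    plt.show()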
3. Preparing the data: normalizing numeric values
When features have very different value ranges like these, the usual approach is to normalize them, for example rescaling each feature to the range 0 to 1 or -1 to 1:
newValue = (oldValue - min) / (max - min)
where min and max are the smallest and largest values of that feature in the dataset.
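For example, the output below gives min = 0 and max - min = 91273 for the frequent flyer miles, so the first value 40920 normalizes to (40920 - 0) / 91273 ≈ 0.4483, which matches the first entry of normMat.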
def autoNorm(dataSet):
    # Passing 0 to dataSet.min() takes the minimum of each column
    # rather than of each row
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    # NumPy's tile() function repeats the 1x3 row vector to build a matrix
    # the same size as the input matrix.
    # (In NumPy, true matrix division would require linalg.solve(matA, matB);
    # here / performs element-wise division.)
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
Displaying the result:
normMat, ranges, minVals = autoNorm(datingDataMat)
print(normMat)
print(ranges)
print(minVals)
Result:
D:\exercise\机器学习实战\Scripts\python.exe D:\exercise\pythonProject\KNN.py
[[0.44832535 0.39805139 0.56233353]
[0.15873259 0.34195467 0.98724416]
[0.28542943 0.06892523 0.47449629]
...
[0.29115949 0.50910294 0.51079493]
[0.52711097 0.43665451 0.4290048 ]
[0.47940793 0.3768091 0.78571804]]
[9.1273000e+04 2.0919349e+01 1.6943610e+00]
[0. 0. 0.001156]
Process finished with exit code 0
4. Testing the algorithm: verifying the classifier as a complete program
Hold out 10% of the dataset as a test set, predict its labels, and compute the error rate. (Helen's data is not sorted in any particular order, so simply taking the first 10% of the rows serves as a random sample.)
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Compute the Euclidean distance from inX to every training sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices of the training samples, sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Tally the class labels of the k nearest neighbors
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Return the label with the most votes
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
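Before running the full test, classify0 can be sanity-checked on a tiny synthetic dataset. This is a minimal sketch with made-up points, not part of the dating-site program:

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
# The three nearest neighbors of [0, 0] are B, B, A, so 'B' should win the vote
print(classify0(np.array([0.0, 0.0]), group, labels, 3))  # expected: B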
def datingClassTest():
    hoRatio = 0.10
    # Read the data from the file with file2matrix and convert it
    # to normalized feature values with autoNorm()
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Compute the number of test vectors; this decides which rows of normMat
    # are used for testing and which serve as the classifier's training samples
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Feed each test vector into the original kNN classifier classify0
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %s,the real answer is :%s"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
Displaying the result:
KNN.datingClassTest()
Result:
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :1
the total error rate is: 0.050000
Process finished with exit code 0
Error rate: 5%. The error rate can be changed by adjusting the values of hoRatio and k.
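As a quick illustration of that tuning, here is a minimal sketch (the helper name evaluateK is hypothetical) that reuses datingClassTest's logic to print the hold-out error rate for several values of k:

def evaluateK(hoRatio=0.10, kValues=(1, 3, 5, 7, 9)):
    # Hypothetical helper: measure the hold-out error rate for several k values
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    for k in kValues:
        errorCount = 0.0
        for i in range(numTestVecs):
            result = classify0(normMat[i,:], normMat[numTestVecs:m,:],
                               datingLabels[numTestVecs:m], k)
            if result != datingLabels[i]:
                errorCount += 1.0
        print("k=%d, error rate: %f" % (k, errorCount / float(numTestVecs)))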
5. Using the algorithm: building a complete usable system
The input() function reads a line of text typed by the user and returns it.
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])
    # Normalize the new sample with the same ranges and minVals as the training
    # data, then map the predicted label (1-3) back to a human-readable answer
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:%s" % resultList[classifierResult - 1])
Displaying the result:
KNN.classifyPerson()
Result:
percentage of time spent playing video games?20
frequent flier miles earned per year?20000
liters of ice cream consumed per year?0.8
You will probably like this person:in large doses
Process finished with exit code 0
That is, a person Helen would find very charming.
Dataset: datingTestSet.txt (note: the code above reads datingTestSet2.txt, which contains the same records with the string labels below encoded as the integers 1/2/3 for didntLike/smallDoses/largeDoses)
40920 8.326976 0.953952 largeDoses
14488 7.153469 1.673904 smallDoses
26052 1.441871 0.805124 didntLike
75136 13.147394 0.428964 didntLike
38344 1.669788 0.134296 didntLike
72993 10.141740 1.032955 didntLike
35948 6.830792 1.213192 largeDoses
42666 13.276369 0.543880 largeDoses
67497 8.631577 0.749278 didntLike
35483 12.273169 1.508053 largeDoses
50242 3.723498 0.831917 didntLike
63275 8.385879 1.669485 didntLike
5569 4.875435 0.728658 smallDoses
51052 4.680098 0.625224 didntLike
77372 15.299570 0.331351 didntLike
43673 1.889461 0.191283 didntLike
61364 7.516754 1.269164 didntLike
69673 14.239195 0.261333 didntLike
15669 0.000000 1.250185 smallDoses
28488 10.528555 1.304844 largeDoses
6487 3.540265 0.822483 smallDoses
37708 2.991551 0.833920 didntLike
22620 5.297865 0.638306 smallDoses
28782 6.593803 0.187108 largeDoses
19739 2.816760 1.686209 smallDoses
36788 12.458258 0.649617 largeDoses
5741 0.000000 1.656418 smallDoses
28567 9.968648 0.731232 largeDoses
6808 1.364838 0.640103 smallDoses
41611 0.230453 1.151996 didntLike
36661 11.865402 0.882810 largeDoses
43605 0.120460 1.352013 didntLike
15360 8.545204 1.340429 largeDoses
63796 5.856649 0.160006 didntLike
10743 9.665618 0.778626 smallDoses
70808 9.778763 1.084103 didntLike
72011 4.932976 0.632026 didntLike
5914 2.216246 0.587095 smallDoses
14851 14.305636 0.632317 largeDoses
33553 12.591889 0.686581 largeDoses
44952 3.424649 1.004504 didntLike
17934 0.000000 0.147573 smallDoses
27738 8.533823 0.205324 largeDoses
29290 9.829528 0.238620 largeDoses
42330 11.492186 0.263499 largeDoses
36429 3.570968 0.832254 didntLike
39623 1.771228 0.207612 didntLike
32404 3.513921 0.991854 didntLike
27268 4.398172 0.975024 didntLike
5477 4.276823 1.174874 smallDoses
14254 5.946014 1.614244 smallDoses
68613 13.798970 0.724375 didntLike
41539 10.393591 1.663724 largeDoses
7917 3.007577 0.297302 smallDoses
21331 1.031938 0.486174 smallDoses
8338 4.751212 0.064693 smallDoses
5176 3.692269 1.655113 smallDoses
18983 10.448091 0.267652 largeDoses
68837 10.585786 0.329557 didntLike
13438 1.604501 0.069064 smallDoses
48849 3.679497 0.961466 didntLike
12285 3.795146 0.696694 smallDoses
7826 2.531885 1.659173 smallDoses
5565 9.733340 0.977746 smallDoses
10346 6.093067 1.413798 smallDoses
1823 7.712960 1.054927 smallDoses
9744 11.470364 0.760461 largeDoses
16857 2.886529 0.934416 smallDoses
39336 10.054373 1.138351 largeDoses
65230 9.972470 0.881876 didntLike
2463 2.335785 1.366145 smallDoses
27353 11.375155 1.528626 largeDoses
16191 0.000000 0.605619 smallDoses
12258 4.126787 0.357501 smallDoses
42377 6.319522 1.058602 didntLike
25607 8.680527 0.086955 largeDoses
77450 14.856391 1.129823 didntLike
58732 2.454285 0.222380 didntLike
46426 7.292202 0.548607 largeDoses
32688 8.745137 0.857348 largeDoses
64890 8.579001 0.683048 didntLike
8554 2.507302 0.869177 smallDoses
28861 11.415476 1.505466 largeDoses
42050 4.838540 1.680892 didntLike
32193 10.339507 0.583646 largeDoses
64895 6.573742 1.151433 didntLike
2355 6.539397 0.462065 smallDoses
0 2.209159 0.723567 smallDoses
70406 11.196378 0.836326 didntLike
57399 4.229595 0.128253 didntLike
41732 9.505944 0.005273 largeDoses
11429 8.652725 1.348934 largeDoses
75270 17.101108 0.490712 didntLike
5459 7.871839 0.717662 smallDoses
73520 8.262131 1.361646 didntLike
40279 9.015635 1.658555 largeDoses
21540 9.215351 0.806762 largeDoses
17694 6.375007 0.033678 smallDoses
22329 2.262014 1.022169 didntLike
46570 5.677110 0.709469 didntLike
42403 11.293017 0.207976 largeDoses
33654 6.590043 1.353117 didntLike
9171 4.711960 0.194167 smallDoses
28122 8.768099 1.108041 largeDoses
34095 11.502519 0.545097 largeDoses
1774 4.682812 0.578112 smallDoses
40131 12.446578 0.300754 largeDoses
13994 12.908384 1.657722 largeDoses
77064 12.601108 0.974527 didntLike
11210 3.929456 0.025466 smallDoses
6122 9.751503 1.182050 largeDoses
15341 3.043767 0.888168 smallDoses
44373 4.391522 0.807100 didntLike
28454