Machine Learning in Action: Improving Matches on a Dating Site with the k-Nearest Neighbors Algorithm

Example: using the k-nearest neighbors algorithm on a dating site
Workflow:

(1) Collect the data
(2) Prepare the data
(3) Analyze the data
(4) Test the algorithm
(5) Use the algorithm

1. Prepare the data: parsing data from a text file
Helen's samples each contain the following 3 features:

Frequent flyer miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week

Parser that converts the text records into a NumPy matrix:

import numpy as np

def file2matrix(filename):
    # Get the number of lines in the file
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # Create the NumPy matrix to return: a zero-filled matrix with
    # numberOfLines rows and a fixed second dimension of 3
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Parse the file contents, processing one line at a time
    for line in arrayOLines:
        # line.strip() removes the trailing newline characters
        line = line.strip()
        # Split the line on the tab character '\t' into a list of fields
        listFromLine = line.split('\t')
        # Take the first 3 fields and store them in the feature matrix
        returnMat[index, :] = listFromLine[0:3]
        # Index -1 refers to the last element of the list; with this negative
        # index the last column is stored in the label vector classLabelVector.
        # The values must be converted explicitly to integers.
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

Display the data:

datingDataMat,datingLabels = KNN.file2matrix('datingTestSet2.txt')
print(datingDataMat)
print(datingLabels[0:20])

Result:

D:\exercise\机器学习实战\Scripts\python.exe D:\exercise\pythonProject\mian.py 
[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 [2.6052000e+04 1.4418710e+00 8.0512400e-01]
 ...
 [2.6575000e+04 1.0650102e+01 8.6662700e-01]
 [4.8111000e+04 9.1345280e+00 7.2804500e-01]
 [4.3757000e+04 7.8826010e+00 1.3324460e+00]]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]

2. Analyze the data: creating a scatter plot with Matplotlib

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
# Plot the second column against the third column, using
# 15 * label as both the marker size and the color
ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
           15.0*np.array(datingLabels), 15.0*np.array(datingLabels))
plt.show()

Result:
(scatter plot of the second and third features; marker size and color are scaled by the class label)
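
The single scatter() call above encodes the class only through marker size and color, which is hard to read. A minimal sketch of an alternative, assuming the same datingDataMat/datingLabels variables from above, that draws each class separately and adds a legend:

import numpy as np
import matplotlib.pyplot as plt

labels = np.array(datingLabels)
names = {1: 'didntLike', 2: 'smallDoses', 3: 'largeDoses'}

fig, ax = plt.subplots()
for cls, name in names.items():
    mask = (labels == cls)                      # rows belonging to this class
    ax.scatter(datingDataMat[mask, 1],          # % of time playing video games
               datingDataMat[mask, 2],          # liters of ice cream per week
               s=15, label=name)
ax.set_xlabel('% of time playing video games')
ax.set_ylabel('liters of ice cream per week')
ax.legend()
plt.show()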

3. Prepare the data: normalizing numeric values
When features have very different value ranges, the usual approach is to normalize them, for example by rescaling every value into the range 0 to 1 or -1 to 1:

newValue = (oldValue - min) / (max - min)

where min and max are the smallest and largest values of that feature in the dataset. For example, a mileage value of 40920 with min 0 and max 91273 normalizes to 40920 / 91273 ≈ 0.448, which matches the first entry of normMat in the output below.

def autoNorm(dataSet):
    # Passing 0 to dataSet.min() takes the minimum of each column
    # rather than the minimum of each row
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    # NumPy's tile() repeats the 1x3 vector into an m x 3 matrix the same size
    # as the input, so the subtraction and division here are element-wise;
    # true matrix division in NumPy would instead use linalg.solve(matA, matB)
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
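
The tile() calls are not strictly necessary: NumPy broadcasting applies a 1-D vector to every row of a 2-D array automatically, so a shorter, equivalent sketch (same inputs and return values assumed) is:

import numpy as np

def autoNormBroadcast(dataSet):
    # Column-wise minima and ranges; broadcasting stretches these 1-D vectors
    # across all m rows, so no tile() is needed
    minVals = dataSet.min(axis=0)
    ranges = dataSet.max(axis=0) - minVals
    normDataSet = (dataSet - minVals) / ranges
    return normDataSet, ranges, minVals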

Display the result:

normMat,ranges,minVals =autoNorm(datingDataMat)
print(normMat)
print(ranges)
print(minVals)

Result:

D:\exercise\机器学习实战\Scripts\python.exe D:\exercise\pythonProject\KNN.py 
[[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 ...
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
[9.1273000e+04 2.0919349e+01 1.6943610e+00]
[0.       0.       0.001156]

Process finished with exit code 0

4. Test the algorithm: verifying the classifier as a complete program
Hold out 10% of the records as a test set, classify each one, and compute the error rate.
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Distance calculation: subtract inX from every training sample,
    # square, sum across features, and take the square root
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Vote with the labels of the k nearest neighbors
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the vote counts in descending order and return the majority label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def datingClassTest():
    hoRatio = 0.10
    # Read the data from the file with file2matrix() and normalize it with autoNorm()
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Compute the number of test vectors; this determines which rows of normMat
    # are used for testing and which are used as training samples
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Feed each test vector to the original kNN classifier classify0()
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %s,the real answer is :%s"%(classifierResult,datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f"%(errorCount/float(numTestVecs)))

Display the result:

KNN.datingClassTest()

Result:

the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :1
the total error rate is: 0.050000

Process finished with exit code 0

The error rate is 5%. Changing hoRatio and k changes the error rate; see the sketch below.
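
For example, a small sketch (reusing the functions above; the function name is just illustrative) that sweeps a few values of k and prints the resulting error rates:

def sweepK(hoRatio=0.10, ks=(1, 3, 5, 7, 9)):
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    for k in ks:
        errorCount = 0.0
        for i in range(numTestVecs):
            # Same split as datingClassTest(): first 10% for testing, rest for training
            result = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                               datingLabels[numTestVecs:m], k)
            if result != datingLabels[i]:
                errorCount += 1.0
        print("k = %d, error rate = %f" % (k, errorCount / float(numTestVecs)))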
5. Use the algorithm: building a complete usable system
The input() function lets the user type a line of text and returns what was entered.

def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    # Read the three feature values from the user
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Normalize the new sample with the same ranges and minimums as the training data
    inArr = np.array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    # Labels are 1, 2, 3, so subtract 1 to index into resultList
    print("You will probably like this person:%s"%resultList[classifierResult-1])

Display the result:

KNN.classifyPerson()

Result:

percentage of time spent playing video games?20
frequent flier miles earned per year?20000
liters of ice cream consumed per year?0.8
You will probably like this person:in large doses

Process finished with exit code 0

So this is someone Helen would find very attractive.
Dataset: datingTestSet.txt (note that the code above reads datingTestSet2.txt, which stores these class labels as the integers 1-3 instead of text):

40920	8.326976	0.953952	largeDoses
14488	7.153469	1.673904	smallDoses
26052	1.441871	0.805124	didntLike
75136	13.147394	0.428964	didntLike
38344	1.669788	0.134296	didntLike
72993	10.141740	1.032955	didntLike
35948	6.830792	1.213192	largeDoses
42666	13.276369	0.543880	largeDoses
67497	8.631577	0.749278	didntLike
35483	12.273169	1.508053	largeDoses
50242	3.723498	0.831917	didntLike
63275	8.385879	1.669485	didntLike
5569	4.875435	0.728658	smallDoses
51052	4.680098	0.625224	didntLike
77372	15.299570	0.331351	didntLike
43673	1.889461	0.191283	didntLike
61364	7.516754	1.269164	didntLike
69673	14.239195	0.261333	didntLike
15669	0.000000	1.250185	smallDoses
28488	10.528555	1.304844	largeDoses
6487	3.540265	0.822483	smallDoses
37708	2.991551	0.833920	didntLike
22620	5.297865	0.638306	smallDoses
28782	6.593803	0.187108	largeDoses
19739	2.816760	1.686209	smallDoses
36788	12.458258	0.649617	largeDoses
5741	0.000000	1.656418	smallDoses
28567	9.968648	0.731232	largeDoses
6808	1.364838	0.640103	smallDoses
41611	0.230453	1.151996	didntLike
36661	11.865402	0.882810	largeDoses
43605	0.120460	1.352013	didntLike
15360	8.545204	1.340429	largeDoses
63796	5.856649	0.160006	didntLike
10743	9.665618	0.778626	smallDoses
70808	9.778763	1.084103	didntLike
72011	4.932976	0.632026	didntLike
5914	2.216246	0.587095	smallDoses
14851	14.305636	0.632317	largeDoses
33553	12.591889	0.686581	largeDoses
44952	3.424649	1.004504	didntLike
17934	0.000000	0.147573	smallDoses
27738	8.533823	0.205324	largeDoses
29290	9.829528	0.238620	largeDoses
42330	11.492186	0.263499	largeDoses
36429	3.570968	0.832254	didntLike
39623	1.771228	0.207612	didntLike
32404	3.513921	0.991854	didntLike
27268	4.398172	0.975024	didntLike
5477	4.276823	1.174874	smallDoses
14254	5.946014	1.614244	smallDoses
68613	13.798970	0.724375	didntLike
41539	10.393591	1.663724	largeDoses
7917	3.007577	0.297302	smallDoses
21331	1.031938	0.486174	smallDoses
8338	4.751212	0.064693	smallDoses
5176	3.692269	1.655113	smallDoses
18983	10.448091	0.267652	largeDoses
68837	10.585786	0.329557	didntLike
13438	1.604501	0.069064	smallDoses
48849	3.679497	0.961466	didntLike
12285	3.795146	0.696694	smallDoses
7826	2.531885	1.659173	smallDoses
5565	9.733340	0.977746	smallDoses
10346	6.093067	1.413798	smallDoses
1823	7.712960	1.054927	smallDoses
9744	11.470364	0.760461	largeDoses
16857	2.886529	0.934416	smallDoses
39336	10.054373	1.138351	largeDoses
65230	9.972470	0.881876	didntLike
2463	2.335785	1.366145	smallDoses
27353	11.375155	1.528626	largeDoses
16191	0.000000	0.605619	smallDoses
12258	4.126787	0.357501	smallDoses
42377	6.319522	1.058602	didntLike
25607	8.680527	0.086955	largeDoses
77450	14.856391	1.129823	didntLike
58732	2.454285	0.222380	didntLike
46426	7.292202	0.548607	largeDoses
32688	8.745137	0.857348	largeDoses
64890	8.579001	0.683048	didntLike
8554	2.507302	0.869177	smallDoses
28861	11.415476	1.505466	largeDoses
42050	4.838540	1.680892	didntLike
32193	10.339507	0.583646	largeDoses
64895	6.573742	1.151433	didntLike
2355	6.539397	0.462065	smallDoses
0	2.209159	0.723567	smallDoses
70406	11.196378	0.836326	didntLike
57399	4.229595	0.128253	didntLike
41732	9.505944	0.005273	largeDoses
11429	8.652725	1.348934	largeDoses
75270	17.101108	0.490712	didntLike
5459	7.871839	0.717662	smallDoses
73520	8.262131	1.361646	didntLike
40279	9.015635	1.658555	largeDoses
21540	9.215351	0.806762	largeDoses
17694	6.375007	0.033678	smallDoses
22329	2.262014	1.022169	didntLike
46570	5.677110	0.709469	didntLike
42403	11.293017	0.207976	largeDoses
33654	6.590043	1.353117	didntLike
9171	4.711960	0.194167	smallDoses
28122	8.768099	1.108041	largeDoses
34095	11.502519	0.545097	largeDoses
1774	4.682812	0.578112	smallDoses
40131	12.446578	0.300754	largeDoses
13994	12.908384	1.657722	largeDoses
77064	12.601108	0.974527	didntLike
11210	3.929456	0.025466	smallDoses
6122	9.751503	1.182050	largeDoses
15341	3.043767	0.888168	smallDoses
44373	4.391522	0.807100	didntLike
28454
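
Because this version of the file stores the labels as text (didntLike / smallDoses / largeDoses), file2matrix() above would fail on it at the int() conversion. A hedged sketch of a variant that maps the text labels to the integers 1-3 (the mapping follows the correspondence visible in the outputs above, e.g. the first record is largeDoses and its numeric label is 3):

import numpy as np

def file2matrixText(filename):
    # Text labels mapped to the integer codes used by datingTestSet2.txt
    labelMap = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}
    returnMat, classLabelVector = [], []
    with open(filename) as fr:
        for line in fr:
            fields = line.strip().split('\t')
            if len(fields) < 4:
                continue                    # skip blank or truncated lines
            returnMat.append([float(x) for x in fields[0:3]])
            classLabelVector.append(labelMap[fields[-1]])
    return np.array(returnMat), classLabelVector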