I. The K-Nearest Neighbors Algorithm (K Nearest Neighbors Classification)
1. KNN Overview
1.1 How the KNN Algorithm Works
Neighbor-based classification is a form of instance-based or non-generalizing learning: it does not build a general model, but simply stores instances of the training data. Whenever a new sample arrives, its nearest neighbors in the training set are found, and the new sample's class is decided by a majority vote among the classes of those neighbors.
Algorithm principle: we have a training set in which every sample carries a class label. When an unlabeled test sample arrives, we compare each of its features with the corresponding features of every training sample, obtaining a similarity measure (a distance value) between the test sample and each training sample. We then keep only the k most similar training samples (this is where the k in k-nearest neighbors comes from; usually k is no larger than 20). Finally, the class that appears most often among those k samples is assigned to the new sample. The procedure is:
- Compute the distance between the test sample and every sample in the training set
- Sort the training samples by distance
- Take the first k samples (generally k <= 20)
- Count how often each class appears among the k samples
- Assign the most frequent class to the new sample
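These five steps map almost directly onto NumPy operations. Below is a minimal sketch of the voting procedure (the function name `knn_classify` and variable names are illustrative; section II gives the full implementation used in this article):

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=5):
    # Step 1: Euclidean distance from the new sample to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 2-3: sort by distance and keep the indices of the k nearest samples
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the neighbors' labels and return the majority class
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```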
1.2 Distance Computation
KNN needs a formula for the distance between the new sample and each sample in the training set.
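The implementation in section II uses the Euclidean distance: for two samples $x$ and $y$ with $n$ features,

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Other metrics are possible; for example, scikit-learn's p=1 setting (section III) gives the Manhattan distance $\sum_{i=1}^{n} |x_i - y_i|$.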
1.3 Choosing the Value of K
The K-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of the value K is highly data-dependent: in general a larger K suppresses the effects of noise, but makes the classification boundaries less distinct.
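Since the best K is data-dependent, it is commonly chosen by cross-validation. A minimal sketch using scikit-learn's GridSearchCV (the grid range and fold count are illustrative, and `x_train`/`y_train` are assumed to be an already-loaded training set):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try K = 1..20 with 5-fold cross-validation and keep the best model
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": range(1, 21)},
                      cv=5)
search.fit(x_train, y_train)
print(search.best_params_)  # the K with the highest cross-validated accuracy
```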
1.4 Characteristics of KNN
The k-nearest neighbors algorithm is among the simplest and most effective classification algorithms. As instance-based learning, it trains no model, but it must keep the entire dataset: with a large training set this consumes a lot of storage, and because each prediction compares against every stored sample, it is also computationally expensive and slow.
- Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
- Disadvantages: high computational complexity, high space complexity
Applicable data types: numeric and nominal values
II. Implementing KNN in Python
```python
'''
Dating data
The data contains three features:
1. frequent flyer miles earned per year
2. percentage of time spent playing video games
3. liters of ice cream consumed per week
People are divided into three classes:
1. didntLike   (people she did not like)
2. smallDoses  (people of average charm)
3. largeDoses  (people of great charm)
'''
import operator
import numpy as np
import matplotlib.pyplot as plt
def KNN():
    # Read the text file and convert it into the format the classifier needs;
    # string labels are mapped to integers inside load_data()
    filename = r"E:\courseware\machine leaning\code\MLiA_SourceCode\machinelearninginaction\Ch02\datingTestSet.txt"
    data, labels = load_data(filename)
    # Scatter plot of the first two features to visualize and analyze the data
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], c=labels, s=20)
    plt.xlabel("frequent flyer miles per year")
    plt.ylabel("video game time (%)")
    plt.show()
    # Normalize the features
    data = normalization(data)
    # Hold out the first 10% of the samples as a test set, train on the rest
    m = data.shape[0]
    ratio = 0.10
    numTest = int(m * ratio)
    errorCount = 0.0
    for i in range(numTest):
        result = predict(data[i, :], data[numTest:m, :], labels[numTest:m], 5)
        if result != labels[i]:
            errorCount += 1.0
    print("The total error rate is: %f" % (errorCount / float(numTest)))
def predict(X_test, X_train, y_train, k):
    size = X_train.shape[0]
    # Euclidean distance between the test sample and every training sample
    difference = np.tile(X_test, (size, 1)) - X_train
    sqDifference = difference**2
    sumDifference = sqDifference.sum(axis=1)
    distances = sumDifference**0.5
    # argsort returns the indices that would sort the distances in ascending order
    sortDistIndex = distances.argsort()
    countlabel = {}
    for i in range(k):
        label = y_train[sortDistIndex[i]]
        if label in countlabel:
            countlabel[label] += 1
        else:
            countlabel[label] = 1
    # Sort the neighbor labels by vote count, descending, and return the winner
    sortlabel = sorted(countlabel.items(), key=operator.itemgetter(1), reverse=True)
    return sortlabel[0][0]
# Min-max normalization: newValue = (oldValue - min) / (max - min)
def normalization(data):
    minVals = data.min(0)
    maxVals = data.max(0)
    ranges = maxVals - minVals
    m = data.shape[0]
    normData = data - np.tile(minVals, (m, 1))
    normData = normData / np.tile(ranges, (m, 1))
    return normData
# load_data reads the text file and converts it into a feature matrix and numeric labels
def load_data(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()      # read the file line by line into a list
    length = len(lines)            # number of samples
    data = np.zeros((length, 3))
    labels = []
    label_dict = {}                # maps each string label to an integer code
    index = 0
    labelindex = 0
    for line in lines:
        line = line.strip()        # strip() removes leading/trailing whitespace
        atrlist = line.split('\t') # split on tabs: three features plus the label
        data[index, :] = atrlist[0:3]
        key = atrlist[-1]
        if key in label_dict:
            labels.append(label_dict.get(key))
        else:
            label_dict[key] = labelindex
            labelindex += 1
            labels.append(label_dict.get(key))
        index += 1
    return data, labels
if __name__ == '__main__':
    KNN()
```
Code adapted from *Machine Learning in Action*.
III. The scikit-learn Implementation of KNN
scikit-learn implements the k-nearest neighbors algorithm as KNeighborsClassifier:
Class: sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
Parameters:
- n_neighbors: number of neighbors to use, default 5
- weights: weight function
  - uniform: all neighbors get the same weight
  - distance: weight each neighbor by the inverse of its distance to the test sample, so closer neighbors have greater influence
- algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional; algorithm used to compute the nearest neighbors
  - 'ball_tree': ball tree
  - 'kd_tree': K-D tree
  - 'brute': brute-force search
  - 'auto': will attempt to decide the most appropriate algorithm based on the values passed to the fit method
- p: distance metric parameter
  - p=1: Manhattan distance
  - p=2: Euclidean distance (default)
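For example, a classifier that uses 7 neighbors, distance weighting, and the Manhattan metric could be constructed as follows (the parameter values are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# 7 nearest neighbors, votes weighted by inverse distance, p=1 (Manhattan)
model = KNeighborsClassifier(n_neighbors=7, weights="distance", p=1)
```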
Methods:
- fit(X, y)
- predict(X)
- score(X, y, sample_weight=None)
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = model.score(x_test, y_test)
```
> Code example
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# An alternative loader using np.loadtxt with this converter is kept
# commented out inside KNN() below.
'''
def datatype(s):
    it = {"didntLike": 0, "smallDoses": 1, "largeDoses": 2}
    return it[s]
'''
def KNN():
    # Load the data
    filename = r"E:\courseware\machine leaning\code\MLiA_SourceCode\machinelearninginaction\Ch02\datingTestSet.txt"
    #data = np.loadtxt(filename, dtype=float, delimiter='\t', converters={3: datatype})
    data, labels = load_data(filename)
    # Split into training and test sets, with 10% held out for testing
    x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.1)
    # Normalize: fit the scaler on the training set only, then apply it to both
    scaler = MinMaxScaler()
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.transform(x_test)
    # Build a model for each k from 1 to 9 and record its test accuracy
    accuracies = []
    for i in range(1, 10):
        model = KNeighborsClassifier(n_neighbors=i)
        model.fit(x_train, y_train)
        #accuracy = model.score(x_test, y_test)
        predict = model.predict(x_test)
        right = sum(predict == y_test)
        accuracy = right * 1.0 / predict.shape[0]
        accuracies.append(accuracy)
        #predict = np.hstack((np.reshape(predict, (-1, 1)), np.reshape(y_test, (-1, 1))))
        #print(predict)
        #print("Test set accuracy: %f%%" % (right * 100.0 / predict.shape[0]))
    mean_accuracy = np.mean(accuracies)
    print("Mean accuracy: %f" % mean_accuracy)
    plt.figure()
    plt.plot(range(1, 10), accuracies)
    plt.xlabel("k")
    plt.ylabel("accuracy")
    plt.show()
def load_data(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()      # read the file line by line into a list
    length = len(lines)            # number of samples
    data = np.zeros((length, 3))
    labels = []
    label_dict = {}                # maps each string label to an integer code
    index = 0
    labelindex = 0
    for line in lines:
        line = line.strip()        # strip() removes leading/trailing whitespace
        atrlist = line.split('\t') # split on tabs: three features plus the label
        data[index, :] = atrlist[0:3]
        key = atrlist[-1]
        if key in label_dict:
            labels.append(label_dict.get(key))
        else:
            label_dict[key] = labelindex
            labelindex += 1
            labels.append(label_dict.get(key))
        index += 1
    return data, labels
if __name__ == '__main__':
    KNN()
```
References: Zhou Zhihua, *Machine Learning*; Li Hang, *Statistical Learning Methods*