A Python Implementation of K-Nearest Neighbors (KNN)

This post introduces K-nearest neighbors (KNN), a simple and easy-to-understand classification algorithm. KNN needs almost no training process and can classify test samples right from the start. A suitable value of K is chosen by cross-validation, and a Python implementation of the algorithm is given. The advantages and disadvantages of KNN are also discussed.


KNN

KNN has almost no training process: test samples can be classified right from the start.
K is usually kept below sqrt(#samples) and is chosen by cross-validation (CV).
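
For comparison, the same idea (trying k up to roughly sqrt(#samples) and keeping the value with the best cross-validated accuracy) can also be sketched with scikit-learn's KNeighborsClassifier. The blob data below is made up for illustration only; it is not the implementation given later in this post:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# toy data, made up for illustration only
X, y = make_blobs(n_samples=400, centers=3, random_state=0)

# try k = 1 .. sqrt(#samples) and keep the k with the best 5-fold CV accuracy
best_k, best_acc = 1, 0.0
for k in range(1, int(np.sqrt(X.shape[0])) + 1):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print('best k:', best_k, 'cv accuracy:', round(best_acc, 3))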

The code is also available on GitHub.

Code:

# -*- coding: utf-8 -*-

import numpy as np 
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import random
import math

def get_multi_data():
    n = 2000
    centers = [[-5,3],[1,3],[-3,-3]]
    X,y = make_blobs(centers = centers,n_samples = n,random_state=42)
    trans = [[0.4,0.2],[-0.4,1.2]]
    X = np.dot(X,trans) + np.random.rand(n,2)*2.5
    return X,y

class KNN:
    def __init__(self,max_iter=10):
        # max_iter caps how many candidate values of k are tried during fit
        self.max_iter = max_iter+0.0

    def fit(self,X,y,test_size=0.2):
        self.k = None
        self.vote_res = np.empty(np.unique(y).shape[0])
        # candidate values of k run from 1 up to sqrt(#samples); the step size is
        # chosen so that at most max_iter values are tried
        upper = int(math.sqrt(X.shape[0]))
        step = upper / self.max_iter
        if abs(step - int(step)) > 0.01:
            step = int(upper / self.max_iter) + 1
        else:
            step = int(step)
        max_corr = 0.0; max_k = 0
        for k in range(1, upper, step):
            print('------------- k =', k, '------------')
            corr = self._fit_one(X, y, k, test_size)
            if max_corr < corr:
                max_corr = corr; max_k = k
        print('best k is', max_k)
        self.k = max_k

    def _fit_one(self,X,y,k,test_size):
        self.kk = k
        p = 0.0
        n_splits = int(1.0 / test_size)
        for i in range(n_splits):
            # use a different (but reproducible) split on every repetition; passing a
            # fresh RandomState(42) each time would make all splits identical
            X_train, X_test, y_train, y_test = \
                train_test_split(X, y, test_size=test_size, random_state=i)
            pred = self.predict(X_test, X_train, y_train)
            p += (1 - np.mean(y_test != pred))
        acc = p / n_splits
        print(acc)
        return acc

    # predict the labels of X using the reference data (X_, y_)
    def predict(self,X,X_,y_):
        k = self.k if self.k is not None else self.kk
        # after fit (self.k is set) the query points may come from the reference
        # set itself, so skip the nearest neighbour at distance 0 (the point itself)
        offset = 1 if self.k is not None else 0
        dis = np.empty(X_.shape[0])
        pred = np.empty(X.shape[0])
        for i in range(X.shape[0]):
            x = X[i,:]
            for j in range(X_.shape[0]):
                dis[j] = self.cal_dis(x,X_[j,:])
            index = np.argsort(dis)[offset:k+offset]
            pred[i] = self.vote(y_[index])
        return pred


    def vote(self,res):
        # majority vote over the labels of the k nearest neighbours
        self.vote_res.fill(0)
        for i in res:
            self.vote_res[i] += 1
        max_cnt = 0; max_index = -1
        for i in range(self.vote_res.shape[0]):
            if max_cnt < self.vote_res[i]:
                max_cnt = self.vote_res[i]
                max_index = i 
        return max_index


    def cal_dis(self,x1,x2):
        # squared Euclidean distance (gives the same neighbour ranking as the true distance)
        x = x1-x2
        return np.dot(x,x)

############### plot
def plot_samples(X,y):
    # skip the leading bias column of ones added in __main__, if present
    x_start = 0
    if np.mean(X[:,0])==1:
        x_start += 1
    c_lst = ['r','g','b','y']
    y_set = np.unique(y)
    color = np.array([c_lst[0]]*y.shape[0])
    for i in range(1,y_set.shape[0]):
        color[y==y_set[i]] = c_lst[i]
    plt.figure(figsize=(8,6))
    plt.scatter(X[:,x_start],X[:,x_start+1],c = color)


if __name__ == '__main__':
    X,y = get_multi_data()
    # prepend a constant bias column of ones (it does not change Euclidean distances)
    X = np.column_stack([[1]*X.shape[0],X])
    X_train,X_test,y_train,y_test = \
                train_test_split(X,y,test_size=0.2,random_state=42)
    clf = KNN(10)
    clf.fit(X_train,y_train,0.25)
    y_pred = clf.predict(X_train,X_train,y_train)
    correct_rate = 1-np.mean(y_train!=y_pred)
    print('train correct_rate:', correct_rate)

    y_pred1 = clf.predict(X_test,X_train,y_train)
    correct_rate = 1-np.mean(y_test!=y_pred1)
    print('test correct_rate:', correct_rate)

    plot_samples(X_train,y_train)
    plot_samples(X_train,y_pred)
    plot_samples(X_test,y_test)
    plot_samples(X_test,y_pred1)
    plt.show()
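
As a quick sanity check, the classifier above can be compared with scikit-learn's KNeighborsClassifier on the same split; the snippet assumes the variables from the __main__ block (X_train, X_test, y_train, y_test and the fitted clf) are still in scope:

from sklearn.neighbors import KNeighborsClassifier

# reuse the k chosen by the hand-rolled cross-validation above
sk_clf = KNeighborsClassifier(n_neighbors=clf.k)
sk_clf.fit(X_train, y_train)
print('sklearn test correct_rate:', sk_clf.score(X_test, y_test))

The constant bias column added in __main__ does not change the Euclidean distances, so both classifiers see essentially the same geometry.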

Disadvantages

  1. provides no statistical estimate of the error;
  2. lazy learning: the whole training set must be kept in memory at prediction time;
  3. prediction is computationally expensive (distances to every training sample).

Therefore, plain KNN is best suited to fairly simple classification tasks on low-dimensional datasets.

Advantages

  1. simple and easy to understand
  2. easy to implement
  3. no parameters to estimate
  4. no training process
  5. suitable for rare (sparse) events
  6. suitable for multi-label classification

KNN can be used for recommendation.
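
A minimal sketch of that idea, on a hypothetical user-item rating matrix made up for illustration: find the users most similar to a target user and recommend the items they rated highly.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# hypothetical user-item ratings (rows = users, columns = items), 0 = unrated
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [1, 0, 5, 4, 4],
    [0, 1, 4, 5, 5],
], dtype=float)

# the 2 users most similar to user 0 under cosine distance (index 0 is user 0 itself)
nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(ratings)
_, idx = nn.kneighbors(ratings[0:1])
neighbors = idx[0][1:]

# recommend the unrated item that the similar users rated highest on average
unrated = np.where(ratings[0] == 0)[0]
scores = ratings[neighbors][:, unrated].mean(axis=0)
print('recommend item', unrated[np.argmax(scores)], 'to user 0')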
