A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm

This article gives a detailed introduction to the K-Nearest Neighbor (KNN) algorithm: its basic principles and typical application scenarios, including its non-parametric nature and lazy-learning behavior, its use in density estimation and classification, and practical applications such as handwriting recognition and gene expression analysis.

from: https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/

K-Nearest Neighbor (KNN from now on) is one of those algorithms that are very simple to understand but work incredibly well in practice. It is also surprisingly versatile, and its applications range from vision to proteins to computational geometry to graphs and so on. Most people learn the algorithm and then do not use it much, which is a pity, as a clever use of KNN can make things very simple. It might also surprise many to know that KNN is one of the top 10 data mining algorithms. Let's see why this is the case!

In this post, I will talk about KNN and how to apply it in various scenarios. I will focus primarily on classification, even though KNN can also be used for regression. I also will not discuss the Voronoi diagram or tessellation much.

KNN Introduction

KNN is a non-parametric, lazy learning algorithm. That is a pretty concise statement. When you say a technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution. This is pretty useful, as in the real world most practical data does not obey the typical theoretical assumptions (e.g. Gaussian mixtures, linear separability, etc.). Non-parametric algorithms like KNN come to the rescue here.

It is also a lazy algorithm. What this means is that it does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is very minimal. This means the training phase is pretty fast. Lack of generalization means that KNN keeps all the training data; more exactly, all the training data is needed during the testing phase. (Well, this is an exaggeration, but not far from the truth.) This is in contrast to other techniques like SVM, where you can discard all non-support vectors without any problem. Most lazy algorithms – especially KNN – make decisions based on the entire training data set (in the best case, a subset of it).

The dichotomy is pretty obvious here – there is a non-existent or minimal training phase but a costly testing phase. The cost is in terms of both time and memory. More time might be needed because, in the worst case, all data points take part in the decision. More memory is needed because we need to store all the training data.

Assumptions in KNN

Before using KNN, let us revisit some of its assumptions.

KNN assumes that the data is in a feature space. More exactly, the data points are in a metric space. The data can be scalars or possibly even multidimensional vectors. Since the points are in a feature space, they have a notion of distance – this need not necessarily be Euclidean distance, although that is the one commonly used.

The training data consists of a set of vectors and a class label associated with each vector. In the simplest case, the label will be either + or – (for the positive or negative class). But KNN can work equally well with an arbitrary number of classes.

We are also given a single number "k". This number decides how many neighbors (where a neighbor is defined by the distance metric) influence the classification. This is usually an odd number if the number of classes is 2. If k = 1, the algorithm is simply called the nearest neighbor algorithm.
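To make the distance notion concrete, here is a minimal Python sketch (my own illustration, not from the original post) of three metrics that are commonly plugged into KNN; the function names are just illustrative.

```python
import numpy as np

def euclidean(a, b):
    """L2 / straight-line distance -- the most common choice."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """L1 / city-block distance -- sum of absolute coordinate differences."""
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    """L-infinity distance -- the largest single coordinate difference."""
    return np.max(np.abs(a - b))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(chebyshev(x, y))   # 4.0
```

Any of these (or a domain-specific metric) can serve as the "distance" that defines a neighbor.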

KNN for Density Estimation

Although classification remains the primary application of KNN, we can use it to do density estimation as well. Since KNN is non-parametric, it can estimate the density of an arbitrary distribution. The idea is very similar to the use of a Parzen window. Instead of using a fixed hypercube and a kernel function, here we do the estimation as follows – to estimate the density at a point x, place a hypercube centered at x and keep increasing its size until k neighbors are captured. Now estimate the density using the formula

    p(x) = \frac{k/n}{V}

where n is the total number of data points and V is the volume of the hypercube. Notice that the numerator is essentially a constant, so the density is driven by the volume. The intuition is this: let's say the density at x is very high. Then we can find k points near x very quickly. These points are also very close to x (by definition of high density). This means the volume of the hypercube is small and the resulting density estimate is high. Now let's say the density around x is very low. Then the volume of the hypercube needed to encompass the k nearest neighbors is large and, consequently, the ratio is low.
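As a rough sketch of this procedure (assuming the hypercube is measured with the max-coordinate distance, so that the smallest enclosing hypercube is easy to compute; the function name and toy data are my own):

```python
import numpy as np

def knn_density(x, data, k):
    """Estimate p(x) = (k/n) / V, where V is the volume of the smallest
    hypercube centered at x that captures the k nearest training points."""
    n, d = data.shape
    # Chebyshev (max-coordinate) distance decides membership in a hypercube
    dists = np.max(np.abs(data - x), axis=1)
    half_side = np.sort(dists)[k - 1]      # half the side length of that hypercube
    volume = (2.0 * half_side) ** d
    return (k / n) / volume

# toy 1-D data: dense cluster near 0, sparse tail near 5
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 0.5, 500), rng.normal(5, 2.0, 100)]).reshape(-1, 1)
print(knn_density(np.array([0.0]), data, k=10))  # large value: a small hypercube suffices
print(knn_density(np.array([5.0]), data, k=10))  # smaller value: the hypercube must grow
```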

The volume performs a job similar to the bandwidth parameter in kernel density estimation. In fact, KNN is one of the common methods for estimating the bandwidth (e.g. adaptive mean shift).

KNN for Classification

Let's see how to use KNN for classification. In this case, we are given some labeled data points for training and a new unlabelled data point for testing. Our aim is to find the class label for the new point. The algorithm behaves differently based on k.

Case 1: k = 1, or the Nearest Neighbor Rule

This is the simplest scenario. Let x be the point to be labeled. Find the point closest to x; let it be y. The nearest neighbor rule then assigns the label of y to x. This seems too simplistic and sometimes even counter-intuitive. If you feel that this procedure will result in a huge error, you are right – but there is a catch. This reasoning holds only when the number of data points is not very large.

If the number of data points is very large, then there is a very high chance that the labels of x and y are the same. An example might help – let's say you have a (potentially) biased coin. You toss it 1 million times and get heads 900,000 times. Then most likely your next call will be heads. We can use a similar argument here.

Let me try an informal argument here – assume all points lie in a D-dimensional space and that the number of points is reasonably large. This means the density at any point is fairly high; in other words, within any small subspace there is an adequate number of points. Consider a point x in such a subspace, which therefore has a lot of neighbors, and let y be its nearest neighbor. If x and y are sufficiently close, then we can assume that the class probabilities at x and at y are approximately the same – and then, by decision theory, x and y get the same class.

The book "Pattern Classification" by Duda and Hart has an excellent discussion of this nearest neighbor rule. One of their striking results is a fairly tight error bound for the nearest neighbor rule. The bound is

P^* \leq P \leq P^* ( 2 - \frac{c}{c-1} P^*)

where P^* is the Bayes error rate, c is the number of classes and P is the error rate of the nearest neighbor rule. The result is indeed very striking (at least to me) because it says that if the number of points is fairly large, then the error rate of the nearest neighbor rule is less than twice the Bayes error rate. Pretty cool for a simple algorithm like KNN. Do read the book for all the juicy details.
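To make the bound concrete with a made-up number: for c = 2 classes and a Bayes error rate of P^* = 0.1, it gives

P \leq 0.1 \left( 2 - \frac{2}{2-1} \times 0.1 \right) = 0.1 \times 1.8 = 0.18

which is indeed below twice the Bayes error rate of 0.2.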

Case 2: k = K, or the k-Nearest Neighbor Rule

This is a straightforward extension of 1NN. Basically, what we do is find the k nearest neighbors and take a majority vote. Typically k is odd when the number of classes is 2. Let's say k = 5 and among the neighbors there are 3 instances of C1 and 2 instances of C2. In this case, KNN says that the new point has to be labeled as C1, since it forms the majority. We follow a similar argument when there are multiple classes.
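As a bare-bones sketch of this rule in Python (my own toy example, not code from the post), a majority-vote classifier can be written in a few lines:

```python
from collections import Counter
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Label x with the majority class among its k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distances to all points
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)         # count the labels of those neighbors
    return votes.most_common(1)[0][0]

# toy example: two 2-D clusters labeled C1 and C2
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["C1", "C1", "C1", "C2", "C2", "C2"])
print(knn_classify(np.array([0.5, 0.5]), X_train, y_train, k=3))  # -> C1
print(knn_classify(np.array([5.5, 5.5]), X_train, y_train, k=3))  # -> C2
```

Note that ties are possible when k is even or there are more than two classes; most_common simply returns one of the tied labels, which is another reason an odd k is preferred for two classes.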

One straightforward extension is to not give every neighbor an equal vote. A very common approach is weighted kNN, where each point has a weight that is typically calculated from its distance. For example, under inverse distance weighting, each point has a weight equal to the inverse of its distance to the point to be classified. This means that nearby points get a higher vote than farther points.
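Continuing the sketch above, a hypothetical inverse-distance-weighted vote might look like this (the eps guard is my own addition, to avoid dividing by zero when the query coincides with a training point):

```python
import numpy as np

def weighted_knn_classify(x, X_train, y_train, k=5, eps=1e-9):
    """Like knn_classify above, but each neighbor votes with weight 1/distance."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    scores = {}
    for i in nearest:
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(scores, key=scores.get)  # the class with the largest total weight wins
```

With this scheme a lone very close neighbor can outvote several distant ones, which often helps when the classes have very different densities.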

It is quite obvious that the accuracy *might* increase when you increase k, but the computation cost also increases.

Some Basic Observations

1. If we assume that the points are d-dimensional, then a straightforward implementation of finding the k nearest neighbors takes O(dn) time.
2. We can think of KNN in two ways – one way is that KNN tries to estimate the posterior probability of the point to be labeled (and apply Bayesian decision theory based on that posterior). An alternate way is that KNN computes the decision surface (either implicitly or explicitly) and then uses it to decide the class of new points.
3. There are many possible ways to apply weights in KNN – one popular example is Shepard's method.
4. Even though the naive method takes O(dn) time, it is very hard to do better unless we make some other assumptions. There are efficient data structures like the KD-tree which can reduce the time complexity, but they do so at the cost of increased training time and complexity (see the sketch after this list).
5. In KNN, k is usually chosen as an odd number if the number of classes is 2.
6. The choice of k is very critical – a small value of k means that noise will have a higher influence on the result. A large value makes it computationally expensive and kinda defeats the basic philosophy behind KNN (that points that are near are likely to have similar densities or classes). A simple approach is to set k = \sqrt{n}.
7. There are some interesting data structures and algorithms when you apply KNN on graphs – see Euclidean minimum spanning tree and nearest neighbor graph.
8. There are also some nice techniques like condensing, search trees and partial distance that try to reduce the time taken to find the k nearest neighbors. Duda et al. have a discussion of all these techniques.
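For real work you would rarely hand-roll the neighbor search. As a sketch (assuming scikit-learn is available; the dataset and parameters are just for illustration), a KD-tree backed classifier using the k ≈ \sqrt{n} rule of thumb from point 6 might look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# point 6: k roughly sqrt(n), nudged to an odd number to reduce ties
k = int(np.sqrt(len(X_train)))
if k % 2 == 0:
    k += 1

# point 4: a KD-tree trades extra build time for faster neighbor queries
clf = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out split
```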

Applications of KNN

KNN is a versatile algorithm and is used in a huge number of fields. Let us take a look at a few uncommon and non-trivial applications.

1. Nearest Neighbor based Content Retrieval
This is one of the fascinating applications of KNN – basically, we can use it in computer vision for many tasks. You can consider handwriting recognition as a rudimentary nearest neighbor problem. The problem becomes more fascinating if the content is a video – given a video, find the video in the database closest to the query. Although this sounds abstract, it has a lot of practical applications – e.g., consider ASL (American Sign Language), where communication is done using hand gestures.

So let's say we want to prepare a dictionary for ASL so that a user can query it by performing a gesture. The problem then reduces to finding the (possibly k) closest gesture(s) stored in the database and showing them to the user. At its heart, it is nothing but a KNN problem. One of the professors from my department, Vassilis Athitsos, does research on this interesting topic – see Nearest Neighbor Retrieval and Classification for more details.

2. Gene Expression
This is another cool area where, many a time, KNN performs better than other state-of-the-art techniques. In fact, a combination of KNN and SVM is one of the most popular techniques there. This is a huge topic on its own, and hence I will refrain from talking much more about it.

3. Protein-Protein interaction and 3D structure prediction
Graph-based KNN is used in protein-protein interaction prediction. Similarly, KNN is used in 3D structure prediction.

References

There are a lot of excellent references for KNN. Probably the finest is the book "Pattern Classification" by Duda and Hart. Most of the other references are application specific. Computational geometry has some elegant algorithms for KNN range searching. Bioinformatics and proteomics also have a lot of topical references.

After this post, I hope you have a better appreciation of KNN!

 


 

If you liked this post, please subscribe to the RSS feed.


