[Updating] sklearn (16): Nearest Neighbors

Sarah ฅʕ•̫͡•ʔฅ

Last modified on 2022-05-24 18:06:54

License: CC 4.0 BY-SA

Category: Sklearn. Tags: sklearn, machine learning, python

First published on 2018-10-04 21:36:53

Article link: https://blog.youkuaiyun.com/u014765410/article/details/82936200

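This article walks through the nearest neighbor algorithms, including NearestNeighbors, KNeighborsClassifier, and RadiusNeighborsClassifier, and discusses their characteristics, use cases, and parameter tuning. It also compares the performance of three search algorithms (brute force, KDTree, and BallTree) and offers practical guidance.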

Finding the Nearest Neighbors

1. NearestNeighbors

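# Unsupervised learner for implementing neighbor searches.
sklearn.neighbors.NearestNeighbors(n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None, **kwargs)
# n_neighbors: number of neighbors to find for a point (the default for kneighbors queries)
# radius: search radius (the default for radius_neighbors queries); all points within this distance of a point are returned
# algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}
# leaf_size: passed to ball_tree or kd_tree; the number of points at which the tree switches to brute force. It affects the speed of tree construction and of the k-nearest-neighbor query
# metric: distance metric
# p: power parameter of the Minkowski distance (p=1 is Manhattan distance, p=2 is Euclidean distance)
# metric_params: additional keyword arguments for the metric function
# n_jobs: number of parallel jobs used for the neighbor search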

# methods
fit(X[, y])  # Note: in (X[, y]) the square brackets mark y as an optional argument that may be omitted
# X: training data {array-like, sparse matrix, BallTree, KDTree}. If metric='precomputed', the training data has shape [n_samples, n_samples] and its entries are distances

get_params(deep=True)
# deep=True: also return the parameters of contained sub-objects that are estimators
# return: dict of {param: value}

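kneighbors(X=None, n_neighbors=None, return_distance=True)
# X: the query point or points; if None, the neighbors of each training point are returned and a point is not counted as its own neighbor
# n_neighbors: number of neighbors to find (defaults to the value passed to the constructor)
# return_distance=True: also return the distances between each query point and its neighbors
# return: (array of distances, array of indices of the nearest points); only the indices if return_distance=False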

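kneighbors_graph(X=None, n_neighbors=None, mode='connectivity')
# X: the query point or points
# n_neighbors: number of neighbors to find
# mode: which kind of matrix to return. 'connectivity': entries are 1 or 0, marking whether two points are neighbors; 'distance': entries are the distances between neighboring points
# return: sparse matrix of shape [n_queries, n_samples_fit], i.e. [n_samples, n_samples] when the query points are the training data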

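radius_neighbors(X=None, radius=None, return_distance=True)
# X: the query points
# radius: search radius; all points within this distance of a query point are returned
# return_distance=True: also return the distances
# return: (array of distances, array of indices of the neighbors); each query point has its own array because the number of neighbors can differ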

radius_neighbors_graph(X=None, radius=None, mode='connectivity')
# return: sparse matrix of shape [n_queries, n_samples_fit], i.e. [n_samples, n_samples] when the query points are the training data

set_params(**params)  # keyword arguments {key: value}
# Set the parameters of this estimator.
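
As a rough illustration of the methods listed above, here is a minimal sketch on a made-up four-point dataset. The data and variable names are assumptions chosen for the example; only the scikit-learn calls come from the API described above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy training data (hypothetical): three points near the origin, one far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
nn = NearestNeighbors(n_neighbors=2, algorithm="auto").fit(X)

# k-nearest-neighbor query for one new point.
distances, indices = nn.kneighbors(np.array([[0.9, 0.1]]))

# Radius query: every training point within distance 1.5 of the origin.
rad_dist, rad_ind = nn.radius_neighbors(np.array([[0.0, 0.0]]), radius=1.5)

# Sparse connectivity graph between the training points.
graph = nn.kneighbors_graph(X, mode="connectivity")

print(indices)        # indices of the 2 nearest training points
print(rad_ind[0])     # indices of the points inside the radius
print(graph.toarray())
```

The radius query returns one index array per query point, so `rad_ind[0]` is the result for the single query used here.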

2. KDTree

sklearn.neighbors.KDTree(X, leaf_size=40, metric='minkowski', **kwargs)
# X: training data
# leaf_size: the number of points at which the KDTree switches to brute force
# metric: distance metric

# attributes
.data    # memory view: the training data

# methods
kernel_density(self, X, h[, kernel, atol, ...])  # compute the kernel density estimate at the given points X
# X: the query points
# h: bandwidth of the kernel
# kernel: which kernel function to use
# atol, rtol: stopping tolerances: abs(K_true - K_ret) < atol + rtol * K_ret, where K_true is the exact result and K_ret is the returned (approximate) result
# breadth_first=True: if True use a breadth-first search, otherwise use a depth-first search
# return_log: return the logarithm of the result
# return: the array of (log-)density evaluations, shape = X.shape[:-1]

query(X, k=1, return_distance=True, dualtree=False, breadth_first=False)
# X: the query points
# k: number of nearest neighbors to return
# dualtree=True: use the dual tree formalism for the query
# return: (array of distances, array of indices of the nearest points); only the indices if return_distance=False

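query_radius(self, X, r, count_only=False)
# r: radius
# count_only: if True, return only the number of points within the radius
# return: the count of points if count_only=True; otherwise the array of indices of the points within r (plus the distances when return_distance=True is passed)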

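two_point_correlation(X, r, dualtree)
# compute the two-point correlation function
# r: radius, or an array of radii
# return: counts; for each r[i], the number of pairs of points (one from X, one from the tree) whose distance is less than or equal to r[i]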
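
To make the KDTree methods above concrete, the following is a minimal sketch on random data. The dataset, bandwidth, and radii are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.random_sample((50, 3))      # 50 hypothetical points in 3 dimensions
tree = KDTree(X, leaf_size=10)

# 3 nearest neighbors of the first point (the point itself is included).
dist, ind = tree.query(X[:1], k=3)

# Neighbors within radius 0.3: either just the count or the indices.
counts = tree.query_radius(X[:1], r=0.3, count_only=True)
ind_r = tree.query_radius(X[:1], r=0.3)

# Gaussian kernel density estimate at the first three points.
density = tree.kernel_density(X[:3], h=0.2, kernel="gaussian")

# Number of point pairs within each radius.
pairs = tree.two_point_correlation(X, r=np.array([0.1, 0.5]))

print(ind, counts, density.shape, pairs)
```

BallTree exposes the same methods, so replacing `KDTree` with `BallTree` in the import and constructor should give an equivalent sketch.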

3. BallTree

BallTree(X, leaf_size=40, metric='minkowski', **kwargs)
# attributes and methods are the same as for KDTree

Nearest Neighbors Classification

1. KNeighborsClassifier

The algorithm finds the k nearest neighbors of a new point and then assigns the new point the most representative label among those neighbors.

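sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
# n_neighbors: number of neighbors
# weights: weight given to each neighbor at prediction time: {'uniform', 'distance', callable}. With weights='distance', each neighbor's weight is inversely proportional to its distance
# algorithm: algorithm used for the neighbor search {'auto', 'kd_tree', 'ball_tree', 'brute'}
# leaf_size: only relevant for KDTree and BallTree
# p: power parameter of the Minkowski distance
# metric: distance metric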
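
The effect of the weights parameter can be seen in a small sketch. The three training points below are a made-up example, picked so that uniform and distance weighting disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D data: one point of class 0, two points of class 1.
X = np.array([[0.0], [3.0], [4.0]])
y = np.array([0, 1, 1])
query = np.array([[1.0]])

# Uniform weights: the 3 neighbors vote equally, class 1 wins 2 to 1.
uniform_pred = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y).predict(query)[0]

# Distance weights: class 0 gets 1/1, class 1 gets 1/2 + 1/3, so class 0 wins.
distance_pred = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y).predict(query)[0]

print(uniform_pred, distance_pred)
```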

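2. RadiusNeighborsClassifier
For data that is not uniformly sampled, RadiusNeighborsClassifier can be a better choice than k-nearest neighbors: for a given radius, a new point in a sparse neighborhood is classified from fewer neighbors, and a new point in a dense neighborhood is classified from more neighbors.
The algorithm is not well suited to high-dimensional data, because in high-dimensional spaces the curse of dimensionality can reduce its effectiveness.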

sklearn.neighbors.RadiusNeighborsClassifier(radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', outlier_label=None, metric_params=None, n_jobs=None, **kwargs)
# outlier_label: a new point with no neighbors within the radius is treated as an outlier and given this label. If outlier_label=None, an error is raised when such a point is encountered.
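
A minimal sketch of the outlier_label behaviour described above, using a made-up 1-D dataset; the label -1 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Hypothetical 1-D data with a dense group (class 0) and a sparse group (class 1).
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5]])
y = np.array([0, 0, 0, 1, 1])

# Points with no training neighbor within radius 1.0 get the label -1.
clf = RadiusNeighborsClassifier(radius=1.0, outlier_label=-1).fit(X, y)
pred = clf.predict(np.array([[0.4], [5.2], [100.0]]))

print(pred)
```

The first query is classified from three neighbors, the second from two, and the third has none, so it receives the outlier label.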

Nearest Neighbors Regression

With nearest neighbors regression, the predicted y value is the mean of the target values of the k nearest neighbors.
1. KNeighborsRegressor

sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)

2. RadiusNeighborsRegressor

sklearn.neighbors.RadiusNeighborsRegressor(radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
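
The averaging behaviour of both regressors can be checked by hand on a tiny example. The data below (y = x squared at four points) is an assumption made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])

# k = 2: the neighbors of x = 1.4 are x = 1 and x = 2, so the prediction is (1 + 4) / 2.
knn_pred = KNeighborsRegressor(n_neighbors=2).fit(X, y).predict(np.array([[1.4]]))[0]

# radius = 1.2: the neighbors of x = 1.0 are x = 0, 1, 2, so the prediction is (0 + 1 + 4) / 3.
rnn_pred = RadiusNeighborsRegressor(radius=1.2).fit(X, y).predict(np.array([[1.0]]))[0]

print(knn_pred, rnn_pred)
```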

Nearest Neighbor Algorithms

brute force search

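Brute force search is the naive approach: it computes the distances between all pairs of points and picks the k nearest neighbors of the new point from them. The complexity is O[DN^2], where D is the number of dimensions and N is the number of samples.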
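
A minimal NumPy sketch of the brute force idea, assuming Euclidean distance. The function name `brute_force_knn` and the toy data are hypothetical and not part of scikit-learn.

```python
import numpy as np

def brute_force_knn(X_train, X_query, k):
    """Return (distances, indices) of the k nearest training points per query."""
    # All pairwise differences: shape (n_query, n_train, n_features).
    diff = X_query[:, None, :] - X_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort every row by distance and keep the first k columns.
    idx = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
bf_dist, bf_idx = brute_force_knn(X_train, np.array([[0.9, 0.1]]), k=2)
print(bf_idx)
```

Every query is compared with every training point, which is where the factor N per query (and N^2 for all pairs) comes from.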

KDTree

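A KDTree is a binary tree.
The core idea of building a KDTree: take the features of the training data in turn as the splitting criterion, put the points whose value on that feature is below the split value (feature.mean here) into the left child and the points above it into the right child. Repeat this on each child until the child nodes contain no more data to split.
KDTree suits low-dimensional data (D < 20); if the dimensionality is too high, its efficiency drops sharply because of the curse of dimensionality.
Building a KDTree takes O[DNlog(N)] time. For a single new point, searching for its k nearest neighbors takes O[Dlog(N)] time.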
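
The construction described above can be sketched in a few lines of Python. This is an illustrative toy, not scikit-learn's implementation: it cycles through the features by depth and, as an assumption that differs from the mean-based description above, splits at the median so that both children are never empty. The names `build_kdtree` and `tree_depth` are hypothetical.

```python
import numpy as np

def build_kdtree(points, depth=0, leaf_size=1):
    """Recursively split `points` on one feature per level."""
    n = len(points)
    if n <= leaf_size:
        return {"leaf": True, "points": points}
    axis = depth % points.shape[1]              # cycle through the features
    points = points[points[:, axis].argsort()]  # sort along the chosen feature
    mid = n // 2                                # median position
    return {
        "leaf": False,
        "axis": axis,
        "split": points[mid, axis],
        "left": build_kdtree(points[:mid], depth + 1, leaf_size),
        "right": build_kdtree(points[mid:], depth + 1, leaf_size),
    }

def tree_depth(node):
    """Number of split levels below `node`."""
    if node["leaf"]:
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))

rng = np.random.RandomState(0)
kd_root = build_kdtree(rng.random_sample((8, 2)))
print(tree_depth(kd_root))
```

With 8 points and one point per leaf the tree has log2(8) = 3 levels, which is the log(N) factor in the complexities above.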

Ball tree

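A ball tree is a binary tree.
Although constructing a ball tree takes more time, neighbor queries on it are very efficient. Ball tree makes up for KDTree's weakness on high-dimensional data and can still perform well there.
KDTree partitions the space into sub-regions along individual features (hyper-rectangles), whereas ball tree partitions it into hyperspheres using all features.
Ball tree reduces the number of candidate points to search by using the triangle inequality |x+y| <= |x| + |y|. Let c be the distance from the target point to the center of a sibling node's ball, b the radius of that ball, and a the distance from the target point to the current nearest neighbor. Every point in the sibling ball is at least c - b away from the target point, so if a + b <= c the sibling node cannot contain a point closer than the current nearest neighbor and can be skipped; if a + b > c it may contain one and has to be searched.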
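
The pruning test based on the triangle inequality can be sketched as follows. The helper names `ball_lower_bound` and `can_skip_ball` are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def ball_lower_bound(query, center, radius):
    """Smallest possible distance from `query` to any point inside the ball."""
    return max(0.0, float(np.linalg.norm(query - center)) - radius)

def can_skip_ball(query, center, radius, best_dist):
    """True if no point in the ball can be closer than the current best distance."""
    return ball_lower_bound(query, center, radius) >= best_dist

query = np.array([0.0, 0.0])
# Far ball: center distance 10, radius 2, so every point inside is at least 8 away.
skip_far = can_skip_ball(query, np.array([10.0, 0.0]), 2.0, best_dist=3.0)
# Near ball: center distance 3, radius 2, so a point could be as close as 1.
skip_near = can_skip_ball(query, np.array([3.0, 0.0]), 2.0, best_dist=3.0)
print(skip_far, skip_near)
```

A single distance computation to the ball's center is enough to decide whether the whole node can be discarded.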

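Comparison of the three algorithms above

When n_samples < 30, brute force is the most efficient.
The structure of the data can significantly affect the query time of ball tree and KDTree. In general, sparser data with a small intrinsic dimensionality (roughly, the number of features left after dimensionality reduction) is faster to query.
Effect of the number of neighbors k on the three algorithms: brute force is largely unaffected; for ball tree and KDTree, query time grows as k increases.
If there are only a few query points, brute force is recommended, because building a BallTree or KDTree takes considerable time and is not worth it just to find the k nearest neighbors of a handful of new points. If there are many query points, use BallTree or KDTree.
KDTree suits lower-dimensional data (d < 10), and BallTree suits higher-dimensional data.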
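
The three algorithms differ in speed, not in the neighbors they return. A small sketch on random data (sizes chosen arbitrarily for illustration) shows how to switch between them with the algorithm parameter and compare the results.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.random_sample((200, 3))        # hypothetical training set
queries = rng.random_sample((5, 3))    # hypothetical query points

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=4, algorithm=algorithm).fit(X)
    _, ind = nn.kneighbors(queries)
    results[algorithm] = ind

same = (np.array_equal(results["brute"], results["kd_tree"])
        and np.array_equal(results["brute"], results["ball_tree"]))
print(same)
```

Wrapping the `fit` and `kneighbors` calls in a timer is a simple way to test the rules of thumb above on your own data.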

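I do not fully understand why brute force is used when k > N/2. A likely reason is that when k is large relative to N, a tree query can prune very few branches, so it visits most points anyway while still paying the tree overhead. Corrections from passing experts are welcome.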

Nearest Centroid Classifier (similar to the label-assignment step of k-means)

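Linear discriminant analysis assumes that all classes have identical variance;
Quadratic discriminant analysis does not require the classes to share the same variance;
Nearest Centroid Classifier assumes equal variance in all dimensions, so it can perform poorly when the class variances differ or when the classes are non-convex. To reduce this problem, the features can be rescaled to a common variance (for example, dividing each sample value by the standard deviation of its feature gives every feature unit variance).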

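sklearn.neighbors.NearestCentroid(metric='euclidean', shrink_threshold=None)
# shrink_threshold: threshold for shrinking the centroids. Each centroid's feature value is divided by the within-class variance of that feature and then reduced by shrink_threshold; a value that crosses zero is set to zero, which removes that feature's influence on the classification. This puts the features on a common scale and also helps suppress noisy features.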
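
A minimal sketch of NearestCentroid with default parameters, on a made-up two-class dataset. Each class is represented by the mean of its members, and a new point gets the label of the closest centroid.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

# Hypothetical data: class 1 in the lower-left, class 2 in the upper-right.
X = np.array([[-2.0, -1.0], [-1.0, -1.0], [-1.0, -2.0],
              [1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = NearestCentroid().fit(X, y)
centroids = clf.centroids_                       # one centroid per class
nc_pred = clf.predict(np.array([[-0.8, -1.0]]))  # closest to the class 1 centroid

print(centroids)
print(nc_pred)
```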

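Breadth-first search and depth-first search

Depth-first search

Depth-first search is similar to a preorder traversal of a tree; it generalizes preorder traversal to graphs.
Algorithm steps:

1. Start from some vertex v in the graph and visit v.
2. Find the first unvisited adjacent vertex of the vertex just visited and visit it. Take that vertex as the new current vertex and repeat this step until the vertex just visited has no unvisited adjacent vertices.
3. Go back to the most recently visited vertex that still has unvisited adjacent vertices, find its next unvisited adjacent vertex, and visit it.
4. Repeat steps 2 and 3 until all vertices in the graph have been visited; the search then ends.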

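# Pseudocode
bool visited[MVNum];      // visited flags, all initialized to false
void DFS(Graph G, int v){
		cout<<v; visited[v]=true; // visit vertex v and set its flag to true
		for(int w=FirstAdjVex(G,v); w>=0; w=NextAdjVex(G,v,w))  // FirstAdjVex: the first adjacent vertex of v; NextAdjVex: the next adjacent vertex of v after w
				if(!visited[w]) DFS(G,w); // recurse into each unvisited adjacent vertex
}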
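
For readers who prefer runnable code, the depth-first search above can be sketched in Python. The adjacency-list dictionary and the function name `dfs` are assumptions made for this example; they stand in for the Graph type and the FirstAdjVex/NextAdjVex helpers of the pseudocode.

```python
def dfs(graph, start):
    """Return the vertices reachable from `start` in depth-first order."""
    visited = set()
    order = []

    def visit(v):
        visited.add(v)          # mark v as visited
        order.append(v)         # "visit" v by recording it
        for w in graph[v]:      # adjacent vertices of v, in list order
            if w not in visited:
                visit(w)        # recurse before trying the next neighbor

    visit(start)
    return order

# Hypothetical undirected graph: a square 0-1-3-2-0.
example_graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
dfs_order = dfs(example_graph, 0)
print(dfs_order)  # [0, 1, 3, 2]
```

Starting from 0, the search goes as deep as possible (0, 1, 3, 2) before backtracking.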

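Breadth-first search

Breadth-first search is similar to a level-order traversal of a tree.
Algorithm steps:

1. Start from some vertex v in the graph and visit v.
2. Visit each unvisited adjacent vertex of v in turn.
3. Starting from each of these adjacent vertices, visit their adjacent vertices in turn, so that the neighbors of a vertex visited earlier are visited before the neighbors of a vertex visited later. Repeat step 3 until the adjacent vertices of all visited vertices have been visited.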
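
The steps above can be sketched in Python with a queue. As with the depth-first example, the adjacency-list dictionary and the function name `bfs` are assumptions made for illustration.

```python
from collections import deque

def bfs(graph, start):
    """Return the vertices reachable from `start` in breadth-first order."""
    visited = {start}
    order = [start]
    queue = deque([start])
    while queue:
        v = queue.popleft()         # take the oldest vertex from the queue
        for w in graph[v]:          # adjacent vertices of v, in list order
            if w not in visited:
                visited.add(w)      # mark w as visited
                order.append(w)     # "visit" w by recording it
                queue.append(w)     # its neighbors are handled later
    return order

# Hypothetical undirected graph: a square 0-1-3-2-0.
bfs_graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
bfs_order = bfs(bfs_graph, 0)
print(bfs_order)  # [0, 1, 2, 3]
```

The queue guarantees that vertices visited earlier have their neighbors explored first, which is what produces the level-by-level order.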

void BFS(Graph G, int v){
		cout<<v; visited[v]=true; // visit vertex v and set its flag to true
		InitQueue(Q);  // create queue Q
		EnQueue(Q,v);  // enqueue vertex v
		while(!QueueEmpty(Q)){
				DeQueue(Q,u);  // dequeue the head element and call it u
				for(int w=FirstAdjVex(G,u); w>=0; w=NextAdjVex(G,u,w)) // go through the adjacent vertices of u in turn
						if(!visited[w]){
								cout<<w; // visit the adjacent vertex
								EnQueue(Q,w);  // enqueue the adjacent vertex
								visited[w]=true;  // set its flag to true
						}