Curse of Dimensionality

Curse of Dimensionality refers to the non-intuitive properties of data observed when working in high-dimensional space*, specifically related to the usability and interpretation of distances and volumes. This is one of my favourite topics in Machine Learning and Statistics since it has broad applications (it is not specific to any machine learning method), it is very counter-intuitive and hence awe-inspiring, it has profound implications for virtually all analytics techniques, and it has a 'cool', scary name like some Egyptian curse!
For a quick grasp, consider this example: say you dropped a coin somewhere along a 100 meter line. How do you find it? Simple, just walk along the line and search. But what if it's a 100 x 100 sq. m field? It's already getting tough, trying to search a (roughly) football ground for a single coin. And what if it's a 100 x 100 x 100 cu. m space?! The football ground now has a thirty-storey height. Good luck finding a coin there! That, in essence, is the "curse of dimensionality".

Many ML methods use Distance Measures

Most segmentation and clustering methods rely on computing distances between observations. The well-known k-Means segmentation assigns points to the nearest center. DBSCAN and Hierarchical clustering also require distance metrics. Distribution- and density-based outlier detection algorithms also make use of distance relative to other distances to mark outliers.

Supervised classification solutions like the k-Nearest Neighbours method also use the distance between observations to assign a class to an unknown observation. The Support Vector Machine method involves transforming observations around selected kernels, based on the distance between the observation and the kernel.

Common forms of recommendation systems involve distance-based similarity among user and item attribute vectors. Even when other forms of distance are used, the number of dimensions plays a role in analytic design.

One of the most common distance metrics is the Euclidean Distance, which is simply the linear distance between two points in multi-dimensional hyper-space. The Euclidean Distance between point i and point j in n-dimensional space can be computed as:

$d(i, j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$
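For concreteness, here is a minimal sketch of the same computation in NumPy (the function name and the sample points are mine, not from the original post):

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    """Straight-line distance between two points in n-dimensional space."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return float(np.sqrt(np.sum((x_i - x_j) ** 2)))

# Two points in 3-dimensional space: sqrt(1 + 4 + 4) = 3.0
print(euclidean_distance([0, 0, 0], [1, 2, 2]))
```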

Distance plays havoc in high dimensions

Consider the simple process of data sampling. Suppose the black outer box in Fig. 1 is the data universe, with a uniform distribution of data points across the whole volume, and that we want to sample the 1% of observations enclosed by the red inner box. The black box is a hyper-cube in multi-dimensional space, with each side representing the range of values in that dimension. For the simple 3-dimensional example in Fig. 1, we may have the following ranges:

[Figure 1: Sampling — example of data sampling]

What proportion of each range should we sample to obtain that 1% sample? In 2 dimensions, 10% of each range will achieve overall 1% sampling, so we may select x ∈ (0, 10) and y ∈ (0, 50) and expect to capture 1% of all observations. This is because (10%)² = 1%. Do you expect this proportion to be higher or lower in 3 dimensions?

Even though our search now extends in an additional direction, the proportion actually increases to 21.5% (since 0.215³ ≈ 0.01). It does not merely increase; for just one additional dimension, it roughly doubles! And you can see that we have to cover almost one-fifth of each dimension just to get one-hundredth of the overall data. In 10 dimensions this proportion is 63%, and in 100 dimensions, which is not an uncommon number of dimensions in real-life machine learning, one has to sample 95% of the range along each dimension to sample 1% of the observations! This mind-bending result happens because in high dimensions the spread of data points becomes larger, even if they are uniformly distributed.
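These figures follow from a one-line calculation: to capture a fraction p of a uniform hyper-cube in d dimensions, you need p^(1/d) of each dimension's range. A minimal sketch to verify the numbers above (written for this post, not taken from it):

```python
# Per-dimension share of the range needed to capture a fraction p of a
# uniformly distributed hyper-cube in d dimensions: p ** (1 / d).
p = 0.01  # target: 1% of all observations
for d in (2, 3, 10, 100):
    print(f"{d} dimensions: sample {p ** (1 / d):.1%} of each range")

# Output:
# 2 dimensions: sample 10.0% of each range
# 3 dimensions: sample 21.5% of each range
# 10 dimensions: sample 63.1% of each range
# 100 dimensions: sample 95.5% of each range
```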

This has consequences for the design of experiments and sampling. The process becomes very computationally expensive, even to the extent that the sample asymptotically approaches the population in its coverage of each dimension, despite the sample size remaining much smaller than the population.

Consider another huge consequence of high dimensionality. Many algorithms measure the distance between two data points to define some sort of nearness (DBSCAN, kernels, k-Nearest Neighbours) with reference to some pre-defined distance threshold. In 2 dimensions, we can imagine that two points are near if one falls within a certain radius of the other. Consider the left image in Fig. 2. What share of uniformly spaced points within the black square falls inside the red circle? That is about π/4 ≈ 78%.

[Figure 2: Near-ness]

So if you fit the biggest circle possible inside the square, you cover about 78% of the square. Yet the biggest sphere possible inside the cube covers only about π/6 ≈ 52% of the volume. This volume shrinks exponentially, down to 0.24% for just 10 dimensions! What it essentially means is that in the high-dimensional world every single data point sits near a corner and nothing is really at the center of the volume; in other words, the central volume shrinks to nothing because there is (almost) no center! This has huge consequences for distance-based clustering algorithms. All the distances start looking the same, and any distance being more or less than another is more a random fluctuation in the data than a measure of dissimilarity!
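The 52% and 0.24% figures come from the standard formula for the volume of a d-dimensional ball. Here is a minimal sketch to reproduce them (the function name is mine; the 10-D share prints as about 0.25%, consistent with the roughly 0.24% figure above):

```python
from math import gamma, pi

def inscribed_ball_share(d):
    """Share of a unit hyper-cube covered by its largest inscribed ball.

    The ball has radius 0.5, so its volume is pi**(d/2) * 0.5**d / Gamma(d/2 + 1);
    the cube has volume 1, so this is also the fraction of the cube covered.
    """
    return pi ** (d / 2) * 0.5 ** d / gamma(d / 2 + 1)

for d in (2, 3, 10):
    print(f"{d}-D: inscribed ball covers {inscribed_ball_share(d):.2%} of the cube")

# Output:
# 2-D: inscribed ball covers 78.54% of the cube
# 3-D: inscribed ball covers 52.36% of the cube
# 10-D: inscribed ball covers 0.25% of the cube
```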

Fig. 3 shows randomly generated 2-D data and the corresponding all-to-all distances. The Coefficient of Variation of the distances, computed as the Standard Deviation divided by the Mean, is 45.9%. The corresponding number for similarly generated 5-D data is 26.5%, and for 10-D data it is 19.1%. Admittedly this is one sample, but the trend supports the conclusion that in high dimensions every distance is about the same, and nothing is near or far!

 

[Figure 3: Distance Clustering]
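The trend in Fig. 3 is easy to reproduce. Below is a minimal sketch that draws fresh uniform random points, so the exact percentages will differ slightly from the 45.9% / 26.5% / 19.1% quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_cv(n_points, n_dims):
    """Coefficient of variation (std / mean) of all pairwise Euclidean distances."""
    points = rng.uniform(size=(n_points, n_dims))
    diffs = points[:, None, :] - points[None, :, :]      # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))           # full distance matrix
    pair_dists = dists[np.triu_indices(n_points, k=1)]   # unique pairs, no self-distances
    return pair_dists.std() / pair_dists.mean()

for d in (2, 5, 10):
    print(f"{d}-D: coefficient of variation of distances = {distance_cv(500, d):.1%}")
```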

 

High dimensionality affects other things too

Apart from distances and volumes, the number of dimensions creates other practical problems. Solution run-time and system-memory requirements often escalate non-linearly with an increase in the number of dimensions. Due to the exponential increase in feasible solutions, many optimization methods cannot reach global optima and have to make do with local optima. Further, instead of a closed-form solution, optimization must use search-based algorithms like gradient descent, genetic algorithms, and simulated annealing. More dimensions also introduce the possibility of correlation among variables, and parameter estimation can become difficult in regression approaches.

Dealing with High Dimensionality

This deserves a separate blog post in itself, but correlation analysis, clustering, information value, variance inflation factor, and principal component analysis are some of the ways in which the number of dimensions can be reduced.
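As a small taste of the last of these, here is a minimal sketch using scikit-learn's PCA on synthetic data (the column counts and the 95% variance threshold are illustrative assumptions, not recommendations from this post):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 100 nominal dimensions driven by only 5 latent sources of variation.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 100))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # typically (1000, 100) -> (1000, 5)
```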

* The number of variables, observations, or features that a data point is made up of is called the dimension of the data. For instance, any point in space can be represented using 3 coordinates (length, breadth, and height), and so has 3 dimensions.

 
