K-nearest neighbors and Euclidean Distance

These notes cover the K-nearest neighbors (KNN) algorithm for classification in machine learning, explain why the choice of K matters for prediction accuracy and confidence, and introduce the Euclidean distance calculation at the core of the algorithm.

These are my study notes on machine learning. I am writing them in English because I want to improve my writing skills. Thanks for reading, and if I have made any mistakes, please let me know.

What is the K-nearest neighbors algorithm?

It is a supervised learning algorithm for classification: we have pre-labeled training data that tells the machine which data point belongs to which group. Clustering also groups data, but it is an unsupervised method that works without any labels.
The algorithm is based on the distances between the point to be predicted and the training points, which are known in advance. Intuitively, distance can also be understood as proximity.

What do K and nearest mean?

K is a number we choose; it specifies how many of the data points nearest to the new point we take into account. Usually we want K to be an odd number, because the algorithm essentially takes a majority vote among those neighbors, and with an even K we can run into a 50/50 split. There are also many ways to apply weights to the votes so that points at greater distances are penalized; with such weighting, an even K can work too.
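As a minimal sketch of the voting step (the helper `majority_vote` and the toy labels are my own, purely for illustration), assuming we already have the labels of the K nearest neighbors:

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the K nearest neighbors."""
    votes = Counter(neighbor_labels)
    return votes.most_common(1)[0][0]

# Odd K: a two-class vote always has a clear winner.
print(majority_vote(['red', 'red', 'blue']))          # -> 'red'
# Even K: a 50/50 tie is possible; Counter then just keeps
# the label it counted first, which is why odd K is preferred.
print(majority_vote(['red', 'blue', 'red', 'blue']))  # -> 'red'
```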

Accuracy vs. Confidence

During prediction, the algorithm selects the K points closest to the new data point and then finds the largest category (class) among them, which means the new point probably belongs to that group. The ratio 'votes for the winning class / K' is the confidence: it tells us how much we can trust that this point belongs to that group. Accuracy, by contrast, is measured when testing the model after training, by comparing predictions against known labels. They are completely different things.
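To make the distinction concrete, here is a small sketch (the helper name is mine, not from any library) that returns both the predicted class and its vote confidence:

```python
from collections import Counter

def predict_with_confidence(neighbor_labels):
    """Predict a class from K neighbor labels and report the vote confidence."""
    k = len(neighbor_labels)
    winner, count = Counter(neighbor_labels).most_common(1)[0]
    return winner, count / k  # confidence = votes for winning class / K

label, confidence = predict_with_confidence(['spam', 'spam', 'ham'])
print(label, round(confidence, 2))  # -> spam 0.67 (2 of 3 neighbors agree)
```

Accuracy would instead be computed over a held-out test set, as the fraction of test points whose predicted label matches their true label.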

Euclidean Distance

coming soon
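Until that section is written, here is a minimal sketch of the standard formula these notes rely on: the Euclidean distance between two n-dimensional points p and q is the square root of the sum of squared coordinate differences.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([1, 2], [4, 6]))  # -> 5.0 (a 3-4-5 right triangle)
```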
