1. Least squares and nearest neighbors
1.1 Least squares in linear regression
Assume we have a data set $\{(X^{(i)}, y^{(i)})\}_{i=1}^{N}$, and we will fit a linear regression $y = X^T\beta + b$ on it. Notation to be used:
- $Y = (y^{(1)}, y^{(2)}, \cdots, y^{(N)})$, the vector of training outputs
- $X = (X^{(1)T}, X^{(2)T}, \cdots, X^{(N)T})^T$, the design matrix whose $i$-th row is $X^{(i)T}$
Firstly, we need to choose a loss function. Here we choose least squares, which minimizes the residual sum of squares, a quadratic function of the parameter $\beta$:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y^{(i)} - X^{(i)T}\beta\big)^2 = (Y - X\beta)^T (Y - X\beta),$$
where the intercept $b$ is absorbed into $\beta$ by appending a constant 1 to each input. Setting the derivative to zero leads to the solution $\hat{\beta} = (X^T X)^{-1} X^T Y$ if $X$ has full column rank. The prediction at any input $X$ is then $\hat{y} = X^T \hat{\beta}$.
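A minimal sketch of this closed-form fit in NumPy, assuming synthetic data; the variable names (X, Y, beta_hat, x_new) simply mirror the notation above and are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
X = np.hstack([np.ones((N, 1)), X])        # absorb the intercept b as a leading column of ones
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

# beta_hat = (X^T X)^{-1} X^T Y, valid when X has full column rank
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# prediction at a new input x: y_hat = x^T beta_hat
x_new = np.array([1.0, 0.2, -0.3, 0.1])    # leading 1 for the intercept
y_hat = x_new @ beta_hat
```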
1.2 Nearest neighbors
In regression, the nearest-neighbors method averages the outputs of the $k$ nearest training points of $X$ as the prediction at $X$:
$$\hat{y}(X) = \frac{1}{k} \sum_{X^{(i)} \in N_k(X)} y^{(i)},$$
where $N_k(X)$ denotes the $k$ training points closest to $X$.
In classification, the nearest-neighbors method takes a majority vote over the labels of the $k$ nearest points of $X$ as the predicted class at $X$.
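A toy sketch of both uses of nearest neighbors, assuming the training data are NumPy arrays; the helper name knn_predict and its arguments are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, task="regression"):
    """Toy k-nearest-neighbors prediction at a single query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    if task == "regression":
        return y_train[nearest].mean()            # average the k nearest outputs
    # classification: majority vote over the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```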
2. Rationale and differences of least squares and nearest neighbors
- least squares makes strong assumptions about structure, while nearest neighbors makes almost none
- least squares yields stable but possibly inaccurate predictions, while nearest-neighbor predictions are often accurate but can be unstable
Note that:
- stable means low variance
- accurate means low bias
2.1 Rationale of least squares and nearest neighbors
Suppose that we have random variables $(X, Y)$ with joint distribution $\Pr(X, Y)$. We want to find a function $f(X)$ to approximate $Y$. If we use the squared loss as the criterion for choosing $f$, the expected prediction error is
$$\mathrm{EPE}(f) = E\,(Y - f(X))^2 = E_X\, E_{Y|X}\big[(Y - f(X))^2 \mid X\big].$$
It suffices to minimize EPE pointwise:
$$f(x) = \operatorname*{argmin}_{c}\, E_{Y|X}\big[(Y - c)^2 \mid X = x\big].$$
The solution is the conditional expectation
$$f(x) = E(Y \mid X = x).$$
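One standard way to see why the conditional mean is the pointwise minimizer: for any constant $c$,
$$E_{Y|X}\big[(Y - c)^2 \mid X = x\big] = \operatorname{Var}(Y \mid X = x) + \big(E(Y \mid X = x) - c\big)^2 \;\ge\; \operatorname{Var}(Y \mid X = x),$$
with equality exactly when $c = E(Y \mid X = x)$.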
Both least squares and nearest neighbors aim to approximate this conditional expectation by averaging.
Least squares assumes a linear structure for $f$ and approximates the expectation in the squared loss by averaging over all training data.
Nearest neighbors approximates the conditional expectation directly by averaging the outputs of training points near the target point $x$.
So two approximations are happening in both least squares and nearest neighbors.
Least squares
- assumes a model structure (linearity)
- averages over all training data in place of the expectation in EPE
Nearest neighbors
- conditions on a small region around the target point $x$ instead of conditioning on $x$ exactly
- averages the outputs of the training points near $x$
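A small numerical sketch of this point, assuming a one-dimensional synthetic example with known conditional mean $E(Y \mid X = x) = 2x$; all names and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
x_train = rng.uniform(-3, 3, size=N)
y_train = 2.0 * x_train + rng.normal(scale=0.5, size=N)   # true E(Y|X=x) = 2x

x0 = 1.0   # target point; the true conditional mean there is 2.0

# least squares: global linear fit (with intercept), evaluated at x0
X = np.column_stack([np.ones(N), x_train])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_train)
ls_estimate = np.array([1.0, x0]) @ beta_hat

# nearest neighbors: average the outputs of the k points closest to x0
k = 20
nearest = np.argsort(np.abs(x_train - x0))[:k]
knn_estimate = y_train[nearest].mean()

print(ls_estimate, knn_estimate)   # both should land near 2.0
```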
2.2 Extensions of these simple procedures
Many more complex algorithms are developed from these two simple procedures:
- Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors (a minimal sketch follows this list).
- In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
- Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
- Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
- Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
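A minimal sketch of the first extension (smooth kernel weights instead of k-NN's 0/1 weights), assuming an Epanechnikov kernel and one-dimensional inputs; the function names here are illustrative.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) for |t| <= 1, zero elsewhere."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_smoother(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson estimate at x0: a weighted average whose weights decay
    smoothly to zero with distance, instead of k-NN's effective 0/1 weights."""
    w = epanechnikov((x_train - x0) / bandwidth)
    return np.sum(w * y_train) / np.sum(w)   # assumes at least one point within the bandwidth
```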
In summary, this post discussed the use of least squares and nearest neighbors in regression and classification tasks, compared their rationale and differences, and introduced some extensions of these simple procedures, such as kernel methods and local regression.