Least Squares and Nearest Neighbors

This post discusses least squares and nearest neighbors for regression and classification, compares the rationale behind and the differences between the two methods, and introduces some extensions of these simple procedures, such as kernel methods and local regression.

1. Least squares and nearest neighbors

1.1 Least squares in linear regression

Assume we have a data set $\{(X^{(i)}, y^{(i)})\}_{i=1}^N$, and we will fit a linear regression $y = X^T\beta + b$ on this data set. Notations to be used:

  • $Y = (y^{(1)}, y^{(2)}, \ldots, y^{(N)})$
  • $X = (X^{(1)T}, X^{(2)T}, \ldots, X^{(N)T})^T$

Firstly, we need to choose a loss function. Here we choose least squares, which aims to minimize a quadratic function of the parameter $\beta$,

RSSresidual sum of square(β)=YXβ2F

which leads to the solution $\hat\beta = (X^TX)^{-1}X^TY$ if $X$ has full column rank. The prediction at any $X$ is then given by $\hat y = X^T\hat\beta$.
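
As a minimal sketch of this closed-form solution, assuming NumPy and synthetic data (the data-generating process and all names below are illustrative):

```python
import numpy as np

# Synthetic data: N samples, p features (illustrative only)
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T Y
# (np.linalg.lstsq is numerically preferable to forming the inverse explicitly)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Prediction at a new point x: y_hat = x^T beta_hat
x_new = rng.normal(size=p)
y_hat = x_new @ beta_hat
```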

1.2 Nearest neighbors

In regression, the nearest neighbors method averages the outputs of the $k$ nearest points to $X$ as the prediction at $X$, which can be formulated as

$$\hat y(X) = \frac{1}{k}\sum_{X^{(i)} \in N_k(X)} y^{(i)}$$

In classification, the nearest neighbors method assigns to $X$ the class with the most votes among the labels of the $k$ nearest points to $X$, which can be formulated as

$$\hat g(X) = \arg\max_{g} \sum_{X^{(i)} \in N_k(X)} \mathbf{1}\{y^{(i)} = g\}$$
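
A minimal sketch of both rules, assuming NumPy arrays and Euclidean distance (the function names and tie-breaking behavior are illustrative choices, not part of the original text):

```python
import numpy as np
from collections import Counter

def knn_regress(X_train, y_train, x, k):
    """Average the outputs of the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

def knn_classify(X_train, g_train, x, k):
    """Majority vote over the labels of the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(g_train[nearest])
    return votes.most_common(1)[0][0]
```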

2. Rationality and differences of least squares and nearest neighbors

  • Least squares makes strong assumptions about structure, while nearest neighbors makes none.
  • Least squares yields stable but possibly inaccurate predictions, while predictions of nearest neighbors are often accurate but can be unstable.

Note that:

  • stable means low variance
  • accurate means low bias

Rationality of least squares and nearest neighbors

Suppose that we have random variables $(X, Y)$ with joint distribution $\Pr(X, Y)$. We want to find a function $f(X)$ to approximate $Y$. If we use the squared loss function as the criterion for choosing $f(X)$,

EPEepxect prediction error(f)=EYf(X)2F=EXEY|X[Yf(X)2F|X]

It suffices to minimize the EPE pointwise:

$$f(x) = \arg\min_{c} \mathbb{E}_{Y|X}\big[\|Y - c\|_F^2 \mid X = x\big]$$

The solution is

$$f(x) = \mathbb{E}[Y \mid X = x]$$
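
Why the conditional mean solves the pointwise problem can be seen from a short, standard computation (added here for completeness, written with the scalar squared loss for simplicity):

$$\mathbb{E}\big[(Y - c)^2 \mid X = x\big] = \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X = x])^2 \mid X = x\big] + \big(\mathbb{E}[Y \mid X = x] - c\big)^2,$$

because the cross term $2\big(\mathbb{E}[Y \mid X = x] - c\big)\,\mathbb{E}\big[Y - \mathbb{E}[Y \mid X = x] \mid X = x\big]$ vanishes. The first term does not depend on $c$, so the minimum over $c$ is attained at $c = \mathbb{E}[Y \mid X = x]$.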

Both least squares and nearest neighbors aim to approximate this conditional expectation by averaging.

Least squares assumes a linear structure and approximates the expectation under the squared loss by averaging over all training data:

$$\widehat{\mathrm{EPE}}(\beta) = \frac{1}{N}\sum_{i=1}^N \big\|y^{(i)} - X^{(i)T}\beta\big\|_F^2$$
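
Note that $\widehat{\mathrm{EPE}}(\beta)$ is just $\mathrm{RSS}(\beta)/N$, so minimizing it over $\beta$ recovers the same closed-form solution as before:

$$\nabla_\beta \widehat{\mathrm{EPE}}(\beta) = -\frac{2}{N}\,X^T(Y - X\beta) = 0 \;\Longrightarrow\; \hat\beta = (X^TX)^{-1}X^TY.$$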

Nearest neighbors approximates the conditional expectation directly by averaging the outputs near the target point $x$:

$$\hat Y = \mathrm{ave}\big(y^{(i)} \mid X^{(i)} \in N_k(x)\big)$$

So, two approximations are happening in both least squares and nearest neighbors.

Least squares

  1. model structure assumption
  2. averaging over all training data in EPE

Nearest neighbors

  1. conditioning on a small region around the target point $x$ instead of conditioning on $x$ itself
  2. averaging the outputs that are near $x$

Extensions of these simple procedures

Many more complex algorithms are developed from these two, for example:

  • Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors (a small sketch follows this list).
  • In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
  • Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
  • Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
  • Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
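
As a rough sketch of the first extension, a Nadaraya-Watson style estimator replaces the 0/1 neighbor weights with a Gaussian kernel that decays smoothly with distance (the bandwidth and function name below are illustrative assumptions, not taken from the original text):

```python
import numpy as np

def kernel_regress(X_train, y_train, x, bandwidth=1.0):
    """Weighted average of all training outputs, with Gaussian weights
    that decay smoothly with distance from the query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)
```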