1. Least squares and nearest neighbors
1.1 Least squares in linear regression
Assume we have a data set $\{(X^{(i)}, y^{(i)})\}_{i=1}^{N}$, and we will fit a linear regression $y = X^T\beta + b$ on it. Notation to be used:
- $Y = (y^{(1)}, y^{(2)}, \cdots, y^{(N)})$, the vector of training outputs
- $X = (X^{(1)T}, X^{(2)T}, \cdots, X^{(N)T})^T$, the design matrix whose $i$-th row is $X^{(i)T}$
Firstly, we need to choose a loss function. Here we choose least squares, which minimizes the residual sum of squares, a quadratic function of the parameter $\beta$:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y^{(i)} - X^{(i)T}\beta\big)^2 = (Y - X\beta)^T (Y - X\beta),$$
where the intercept $b$ is absorbed into $\beta$ by appending a constant 1 to each input. Setting the derivative to zero leads to the solution $\hat{\beta} = (X^T X)^{-1} X^T Y$ if $X$ has full column rank. The prediction at any input $X$ is then $\hat{y} = X^T \hat{\beta}$.
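A minimal sketch of this closed-form fit in NumPy, assuming synthetic data; the variable names (X, Y, beta_hat, x_new) simply mirror the notation above and are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
X = np.hstack([np.ones((N, 1)), X])        # absorb the intercept b as a leading column of ones
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

# beta_hat = (X^T X)^{-1} X^T Y, valid when X has full column rank
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# prediction at a new input x: y_hat = x^T beta_hat
x_new = np.array([1.0, 0.2, -0.3, 0.1])    # leading 1 for the intercept
y_hat = x_new @ beta_hat
```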
1.2 Nearest neighbors
In regression, the nearest-neighbors method averages the outputs of the $k$ nearest training points of $X$ as the prediction at $X$:
$$\hat{y}(X) = \frac{1}{k} \sum_{X^{(i)} \in N_k(X)} y^{(i)},$$
where $N_k(X)$ denotes the $k$ training points closest to $X$.
In classification, the nearest-neighbors method takes a majority vote over the labels of the $k$ nearest points of $X$ as the predicted class at $X$.
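A toy sketch of both uses of nearest neighbors, assuming the training data are NumPy arrays; the helper name knn_predict and its arguments are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, task="regression"):
    """Toy k-nearest-neighbors prediction at a single query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    if task == "regression":
        return y_train[nearest].mean()            # average the k nearest outputs
    # classification: majority vote over the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```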
2. Rationale and differences of least squares and nearest neighbors
- least squares makes strong assumptions about structure, while nearest neighbors makes almost none
- least squares yields stable but possibly inaccurate predictions, while nearest-neighbor predictions are often accurate but can be unstable
Note that:
- stable means low variance
- accurate means low bias
2.1 Rationale of least squares and nearest neighbors
Suppose that we have random variables $(X, Y)$ with joint distribution $\Pr(X, Y)$. We want to find a function $f(X)$ to approximate $Y$. If we use the squared loss as the criterion for choosing $f$, the expected prediction error is
$$\mathrm{EPE}(f) = E\,(Y - f(X))^2 = E_X\, E_{Y|X}\big[(Y - f(X))^2 \mid X\big].$$
It suffices to minimize EPE pointwise:
$$f(x) = \operatorname*{argmin}_{c}\, E_{Y|X}\big[(Y - c)^2 \mid X = x\big].$$
The solution is the conditional expectation
$$f(x) = E(Y \mid X = x).$$
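One standard way to see why the conditional mean is the pointwise minimizer: for any constant $c$,
$$E_{Y|X}\big[(Y - c)^2 \mid X = x\big] = \operatorname{Var}(Y \mid X = x) + \big(E(Y \mid X = x) - c\big)^2 \;\ge\; \operatorname{Var}(Y \mid X = x),$$
with equality exactly when $c = E(Y \mid X = x)$.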
Both least squares and nearest neighbors aim to approximate this conditional expectation by averaging.
Least squares assumes a linear structure for $f$ and approximates the expectation in the squared loss by averaging over all training data.
Nearest neighbors approximates the conditional expectation directly by averaging the outputs of training points near the target point $x$.
So two approximations are happening in both least squares and nearest neighbors.
Least squares
- assumes a model structure (linearity)
- averages over all training data in place of the expectation in EPE
Nearest neighbors
- conditions on a small region around the target point $x$ instead of conditioning on $x$ exactly
- averages the outputs of the training points near $x$
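A small numerical sketch of this point, assuming a one-dimensional synthetic example with known conditional mean $E(Y \mid X = x) = 2x$; all names and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
x_train = rng.uniform(-3, 3, size=N)
y_train = 2.0 * x_train + rng.normal(scale=0.5, size=N)   # true E(Y|X=x) = 2x

x0 = 1.0   # target point; the true conditional mean there is 2.0

# least squares: global linear fit (with intercept), evaluated at x0
X = np.column_stack([np.ones(N), x_train])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_train)
ls_estimate = np.array([1.0, x0]) @ beta_hat

# nearest neighbors: average the outputs of the k points closest to x0
k = 20
nearest = np.argsort(np.abs(x_train - x0))[:k]
knn_estimate = y_train[nearest].mean()

print(ls_estimate, knn_estimate)   # both should land near 2.0
```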
2.2 Extensions of these simple procedures
Many more complex algorithms are developed from these two simple procedures:
- Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors (a minimal sketch follows this list).
- In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
- Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
- Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
- Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
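A minimal sketch of the first extension (smooth kernel weights instead of k-NN's 0/1 weights), assuming an Epanechnikov kernel and one-dimensional inputs; the function names here are illustrative.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) for |t| <= 1, zero elsewhere."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_smoother(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson estimate at x0: a weighted average whose weights decay
    smoothly to zero with distance, instead of k-NN's effective 0/1 weights."""
    w = epanechnikov((x_train - x0) / bandwidth)
    return np.sum(w * y_train) / np.sum(w)   # assumes at least one point within the bandwidth
```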
In summary, this post discussed the use of least squares and nearest neighbors in regression and classification tasks, compared their rationale and differences, and introduced some extensions of these simple procedures, such as kernel methods and local regression.