Part Ⅲ
3.1 Word list
- overfitting 过拟合
- underfitting 欠拟合
- locally weighted linear regression 局部加权线性回归
- parametric learning algorithm 参数学习算法
- exponential decay function 指数衰减函数
- maximum likelihood 似然最大化
- digression 插话
- perceptron learning algorithm 感知器学习算法
3.2 Overfitting and underfitting
Overfitting happens when we make the hypothesis overly complex in order to fit the training data exactly (a consistent hypothesis): it performs well on the training set but generalizes poorly to new data. Underfitting happens when the hypothesis is too simple, so it performs equally badly on both the training set and the test set.
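To make this concrete, here is a small illustrative sketch (not from the original notes; the data, the target function, and the polynomial degrees are arbitrary choices) that fits a too-simple and a too-flexible polynomial to the same noisy data and compares training and test error with numpy:

```python
# Illustrative sketch: a degree-1 polynomial underfits, a degree-9 polynomial
# (which can pass exactly through the 10 training points) overfits.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)
true_fn = lambda x: np.sin(2 * np.pi * x)                # underlying target (assumed)
y_train = true_fn(x_train) + 0.1 * rng.standard_normal(x_train.shape)
y_test = true_fn(x_test)

for degree in (1, 9):                                     # 1: underfits, 9: overfits
    coeffs = np.polyfit(x_train, y_train, degree)         # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

The degree-9 fit drives the training error to nearly zero while the test error blows up, whereas the degree-1 fit is poor on both sets.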
3.3 Locally weighted regression
In the original linear regression algorithm, to make a prediction at a query point $x$ (i.e., to evaluate $h(x)$), we would:
1. Fit $\theta$ to minimize $\sum_i \big(y^{(i)} - \theta^T x^{(i)}\big)^2$.
2. Output $\theta^T x$.
In contrast, the locally weighted linear regression algorithm does the following:
1. Fit $\theta$ to minimize $\sum_i w^{(i)} \big(y^{(i)} - \theta^T x^{(i)}\big)^2$.
2. Output $\theta^T x$.
Here, the $w^{(i)}$'s are non-negative valued weights. Note that the weights depend on the particular point $x$ at which we're trying to evaluate $h(x)$; in other words, the query point $x$ is chosen by you. The closer an observed point $x^{(i)}$ is to $x$, the larger $w^{(i)}$ is.
A fairly standard choice for the weights is
$$w^{(i)} = \exp\!\left(-\frac{\big(x^{(i)} - x\big)^2}{2\tau^2}\right).$$
$\tau$ is called the bandwidth parameter; it controls how quickly the weight of a training example falls off with its distance from the query point $x$ (the smaller $\tau$ is, the faster points far from $x$ lose influence). Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm.
The difference between a parametric and a non-parametric algorithm:
The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows linearly with the size of the training set.
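As a rough illustration of why the whole training set must be kept around, here is a minimal sketch of locally weighted linear regression (the function name `lwr_predict` and the toy data are my own; it solves the weighted normal equations at each query point, which is one standard way to fit the weighted objective above):

```python
# Minimal sketch of locally weighted linear regression: every prediction
# re-fits theta using weights centered on the query point, so X and y are
# needed at prediction time.
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict y at x_query; X is (m, n) with an intercept column already added."""
    # Gaussian weights: w(i) = exp(-||x(i) - x||^2 / (2 * tau^2))
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Toy usage: 1-D inputs with an intercept term.
m = 50
x = np.linspace(0, 10, m)
X = np.column_stack([np.ones(m), x])
y = np.sin(x) + 0.1 * np.random.default_rng(1).standard_normal(m)
print(lwr_predict(X, y, np.array([1.0, 5.0]), tau=0.8))
```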
3.4 Classification and logistic regression
$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$, where $g(z) = \frac{1}{1 + e^{-z}}$
is called the logistic function or the sigmoid function; its values lie strictly between 0 and 1.
The derivative of the sigmoid function satisfies $g'(z) = g(z)\big(1 - g(z)\big)$.
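For reference, this identity follows from the chain rule; a short standard derivation (not spelled out in the notes above):
$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{\big(1+e^{-z}\big)^{2}} = \frac{1}{1+e^{-z}}\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)\big(1 - g(z)\big).$$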
Let us assume that
$P(y = 1 \mid x; \theta) = h_\theta(x)$,
$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$.
Note that this can be written more compactly as
$p(y \mid x; \theta) = h_\theta(x)^{\,y}\,\big(1 - h_\theta(x)\big)^{1-y}$.
We can solve for the parameters of this model by maximizing the log likelihood; stochastic gradient ascent on it gives the update
$\theta_j := \theta_j + \alpha\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}$.
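For context, this update is the stochastic gradient ascent step on the log likelihood; the standard intermediate quantities (omitted above) are
$$\ell(\theta) = \sum_{i=1}^{m} \Big[\, y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big) \Big], \qquad \frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \big(y^{(i)} - h_\theta(x^{(i)})\big)\, x_j^{(i)},$$
so taking a step on one training example at a time with learning rate $\alpha$ gives the rule above.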
As Andrew Ng said, “If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$. Nonetheless, it’s a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind this? We’ll answer this when we get to GLM models.”
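A minimal sketch of this training procedure in numpy (illustrative function names; `alpha`, the epoch count, and the toy 1-D dataset with an intercept column are assumptions of mine):

```python
# Minimal sketch of logistic regression trained by stochastic gradient ascent
# on the log likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = sigmoid(theta @ x_i)
            theta += alpha * (y_i - h) * x_i      # theta_j := theta_j + alpha (y - h) x_j
    return theta

# Toy usage: points with x > 1 labelled 1 (intercept column included).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones_like(x), x])
y = (x > 1).astype(float)
theta = fit_logistic(X, y)
print(theta, sigmoid(theta @ np.array([1.0, 1.5])))   # predicted probability at x = 1.5
```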
3.5 The perceptron learning algorithm
Let $g$ be the threshold function defined by
$$g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
Let $h_\theta(x) = g(\theta^T x)$ and use the update rule
$$\theta_j := \theta_j + \alpha\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)};$$
then we have the perceptron learning algorithm.
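A minimal sketch of the perceptron, reusing the same-looking update but with the threshold $g$ as the hypothesis (the toy linearly separable data and names are illustrative):

```python
# Minimal sketch of the perceptron learning rule with a hard threshold.
import numpy as np

def g(z):
    return 1.0 if z >= 0 else 0.0

def fit_perceptron(X, y, alpha=1.0, epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = g(theta @ x_i)
            theta += alpha * (y_i - h) * x_i      # same form of update as logistic regression
    return theta

# Toy usage: label is 1 when x1 + x2 > 1 (intercept column included).
rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(100, 2))
X = np.column_stack([np.ones(len(pts)), pts])
y = (pts.sum(axis=1) > 1).astype(float)
print(fit_perceptron(X, y))
```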