Part Ⅱ
2.1 Word list
- linear regression 线性回归
- gradient descent 梯度下降
- learning rate 学习率
- stochastic gradient descent 随机梯度下降法
- matrix derivatives 矩阵求导
2.2 Gradient descent concept
Gradient descent is an iterative method that can solve least squares problems (linear or nonlinear). It is one of the most commonly used methods for solving for model parameters in machine learning, i.e. for unconstrained optimization problems; the other common method is the least squares (normal equation) method. When we want the minimum of a loss function, gradient descent finds the minimizing model parameters step by step. In contrast, we use gradient ascent to iterate if we need the maximum of a function. In machine learning, two methods are developed on top of basic gradient descent: batch gradient descent and stochastic gradient descent.
The fundamental formula:
$\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$.
Here $J(\theta)$ is the cost function, defined for a single training example as:
$J(\theta) = \frac{1}{2}(h_{\theta}(x) - y)^2$.
Because $h_{\theta}(x) = \sum^{n}_{i=1}\theta_{i}x_{i}$,
after simplification we get $\theta_{j} := \theta_{j} + \alpha(y^{(i)} - h_{\theta}(x^{(i)}))x^{(i)}_{j}$.
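The simplification can be made explicit by differentiating the single-example cost with the chain rule (a short derivation added here for completeness; it uses only the definitions above):

```latex
\begin{aligned}
\frac{\partial}{\partial\theta_{j}} J(\theta)
  &= \frac{\partial}{\partial\theta_{j}}\,\frac{1}{2}\bigl(h_{\theta}(x) - y\bigr)^{2} \\
  &= \bigl(h_{\theta}(x) - y\bigr)\cdot\frac{\partial}{\partial\theta_{j}}\Bigl(\textstyle\sum_{i}\theta_{i}x_{i} - y\Bigr) \\
  &= \bigl(h_{\theta}(x) - y\bigr)\,x_{j}
\end{aligned}
```

Substituting into the update rule gives $\theta_{j} := \theta_{j} - \alpha(h_{\theta}(x) - y)x_{j} = \theta_{j} + \alpha(y - h_{\theta}(x))x_{j}$.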
Repeat until convergence {
$\theta_{j} := \theta_{j} + \alpha\sum^{m}_{i=1}(y^{(i)} - h_{\theta}(x^{(i)}))x^{(i)}_{j}$ (for every $j$)
}
This algorithm is called batch gradient descent.
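Batch gradient descent can be sketched in NumPy as follows. This is an illustrative implementation, not from the notes: the function name and iteration count are my choices, and the gradient is divided by $m$ (averaging), which simply rescales $\alpha$ relative to the formula above.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Fit linear-regression parameters theta by batch gradient descent.

    X: (m, n) design matrix, y: (m,) target vector.
    Every update uses all m training examples, as in the formula above.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = y - X @ theta               # y^(i) - h_theta(x^(i)) for all i
        theta += alpha * (X.T @ errors) / m  # averaging over m rescales alpha
    return theta
```

Because the whole training set is scanned for every single update, each step is expensive when $m$ is large, which is exactly what motivates the stochastic variant below.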
Loop {
for i = 1 to m {
$\theta_{j} := \theta_{j} + \alpha(y^{(i)} - h_{\theta}(x^{(i)}))x^{(i)}_{j}$ (for every $j$)
}
}
This algorithm is called the stochastic gradient descent method.
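A minimal NumPy sketch of the loop above, assuming the same design-matrix conventions; the function name, the per-epoch shuffling, and the random seed are my additions (shuffling is a common refinement, while the notes iterate $i = 1$ to $m$ in order):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=200, seed=0):
    """Fit theta by stochastic gradient descent: one example per update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):      # visit examples in shuffled order
            error = y[i] - X[i] @ theta   # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]  # update uses only example i
    return theta
```

Each update touches a single example, so the parameters start improving immediately instead of after a full pass over the data, at the cost of noisier steps.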
2.3 The normal equation of the least squares method
Define $X$ to be the design matrix that contains the training examples' input values in its rows:
$X = \left[\begin{matrix} -(x^{(1)})^{T}- \\ -(x^{(2)})^{T}- \\ \vdots \\ -(x^{(m)})^{T}- \end{matrix}\right]$.
Also, let $\vec{y}$ be the $m$-dimensional vector containing all the target values from the training set:
$\vec{y} = \left[\begin{matrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{matrix}\right]$.
Now, since $h_{\theta}(x^{(i)}) = (x^{(i)})^{T}\theta$, we can easily verify that
$X\theta - \vec{y} = \left[\begin{matrix} (x^{(1)})^{T}\theta \\ \vdots \\ (x^{(m)})^{T}\theta \end{matrix}\right] - \left[\begin{matrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{matrix}\right] = \left[\begin{matrix} h_{\theta}(x^{(1)}) - y^{(1)} \\ \vdots \\ h_{\theta}(x^{(m)}) - y^{(m)} \end{matrix}\right]$.
Thus, using the fact that for a vector $z$ we have $z^{T}z = \sum_{i} z^{2}_{i}$:
$\frac{1}{2}(X\theta - \vec{y})^{T}(X\theta - \vec{y}) = \frac{1}{2}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^{2} = J(\theta)$.
Setting the gradient of $J(\theta)$ to zero and simplifying, we obtain the normal equation:
$X^{T}X\theta = X^{T}\vec{y}$,
so $\theta = (X^{T}X)^{-1}X^{T}\vec{y}$.
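The closed-form solution can be sketched in NumPy; the function name is mine, and `np.linalg.solve` is used on $X^{T}X\theta = X^{T}\vec{y}$ rather than forming $(X^{T}X)^{-1}$ explicitly, which is the numerically preferred way to apply the normal equation:

```python
import numpy as np

def normal_equation(X, y):
    """Solve the normal equation X^T X theta = X^T y for theta.

    X: (m, n) design matrix, y: (m,) target vector.
    Assumes X^T X is invertible (X has full column rank).
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Unlike gradient descent, no learning rate or iteration count is needed; the answer comes out in one linear solve.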
PS: The least squares method differs essentially from the gradient descent method. The normal equation solves for the optimum in closed form (for linear least squares this is the global optimum), while gradient descent approaches a solution iteratively and, on non-convex problems, may only find a local optimum.
2.4 Summary
The main content of this section is the difference between the gradient descent method and the least squares method. For gradient descent, the main variants are batch gradient descent and stochastic gradient descent; in different settings these two methods perform differently. Meanwhile, we should note the essential difference between the least squares method and gradient descent: the former is a closed-form solution, the latter an iterative one.