Part Ⅱ
2.1 Word list
- linear regression 线性回归
- gradient descent 梯度下降
- learning rate 学习率
- stochastic gradient descent 随机梯度下降法
- matrix derivatives 矩阵求导
2.2 Gradient descent concept
Gradient descent is an iterative method that can solve least squares problems (linear or nonlinear). It is one of the most commonly used methods for solving for model parameters in machine learning, i.e. for unconstrained optimization problems; the other common method is the least squares (normal equation) method. When we want the minimum of a loss function, gradient descent finds the minimizing model parameters step by step. In contrast, we use gradient ascent to iterate if we need the maximum of a function. In machine learning, two methods are developed on top of basic gradient descent: batch gradient descent and stochastic gradient descent.
The fundamental formula:
$\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$.
Here $J(\theta)$ is the cost function, defined for a single training example as:
$J(\theta) = \frac{1}{2}(h_{\theta}(x) - y)^2$.
Because $h_{\theta}(x) = \sum^{n}_{i=1}\theta_{i}x_{i}$,
after simplification we get $\theta_{j} := \theta_{j} + \alpha(y^{(i)} - h_{\theta}(x^{(i)}))x^{(i)}_{j}$.
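The simplification can be made explicit by differentiating the single-example cost with the chain rule (a short derivation added here for completeness; it uses only the definitions above):

```latex
\begin{aligned}
\frac{\partial}{\partial\theta_{j}} J(\theta)
  &= \frac{\partial}{\partial\theta_{j}}\,\frac{1}{2}\bigl(h_{\theta}(x) - y\bigr)^{2} \\
  &= \bigl(h_{\theta}(x) - y\bigr)\cdot\frac{\partial}{\partial\theta_{j}}\Bigl(\textstyle\sum_{i}\theta_{i}x_{i} - y\Bigr) \\
  &= \bigl(h_{\theta}(x) - y\bigr)\,x_{j}
\end{aligned}
```

Substituting into the update rule gives $\theta_{j} := \theta_{j} - \alpha(h_{\theta}(x) - y)x_{j} = \theta_{j} + \alpha(y - h_{\theta}(x))x_{j}$.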
Repeat until convergence {
$\theta_{j} := \theta_{j} + \alpha\sum^{m}_{i=1}(y^{(i)} - h_{\theta}(x^{(i)}))x^{(i)}_{j}$ (for every $j$)
}
This algorithm is called batch gradient descent.
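Batch gradient descent can be sketched in NumPy as follows. This is an illustrative implementation, not from the notes: the function name and iteration count are my choices, and the gradient is divided by $m$ (averaging), which simply rescales $\alpha$ relative to the formula above.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Fit linear-regression parameters theta by batch gradient descent.

    X: (m, n) design matrix, y: (m,) target vector.
    Every update uses all m training examples, as in the formula above.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = y - X @ theta               # y^(i) - h_theta(x^(i)) for all i
        theta += alpha * (X.T @ errors) / m  # averaging over m rescales alpha
    return theta
```

Because the whole training set is scanned for every single update, each step is expensive when $m$ is large, which is exactly what motivates the stochastic variant below.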
Loop {
for i = 1 to m {
$\theta_{j} := \theta_{j} + \alpha(y^{(i)} - h_{\theta}(x^{(i)}))x^{(i)}_{j}$ (for every $j$)
}
}
This algorithm is called the stochastic gradient descent method.
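A minimal NumPy sketch of the loop above, assuming the same design-matrix conventions; the function name, the per-epoch shuffling, and the random seed are my additions (shuffling is a common refinement, while the notes iterate $i = 1$ to $m$ in order):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=200, seed=0):
    """Fit theta by stochastic gradient descent: one example per update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):      # visit examples in shuffled order
            error = y[i] - X[i] @ theta   # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]  # update uses only example i
    return theta
```

Each update touches a single example, so the parameters start improving immediately instead of after a full pass over the data, at the cost of noisier steps.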
2.3 The normal equation of the least squares method
Define $X$ to be the design matrix that contains the training examples' input values in its rows:
$X = \left[\begin{matrix} -(x^{(1)})^{T}- \\ -(x^{(2)})^{T}- \\ \vdots \\ -(x^{(m)})^{T}- \end{matrix}\right]$.
Also, let $\vec{y}$ be the $m$-dimensional vector containing all the target values from the training set:
$\vec{y} = \left[\begin{matrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{matrix}\right]$.
Now, since $h_{\theta}(x^{(i)}) = (x^{(i)})^{T}\theta$, we can easily verify that
$X\theta - \vec{y} = \left[\begin{matrix} (x^{(1)})^{T}\theta \\ \vdots \\ (x^{(m)})^{T}\theta \end{matrix}\right] - \left[\begin{matrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{matrix}\right] = \left[\begin{matrix} h_{\theta}(x^{(1)}) - y^{(1)} \\ \vdots \\ h_{\theta}(x^{(m)}) - y^{(m)} \end{matrix}\right]$.
Thus, using the fact that for a vector $z$ we have $z^{T}z = \sum_{i} z^{2}_{i}$:
$\frac{1}{2}(X\theta - \vec{y})^{T}(X\theta - \vec{y}) = \frac{1}{2}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^{2} = J(\theta)$.
Setting the gradient of $J(\theta)$ to zero and simplifying, we obtain the normal equation:
$X^{T}X\theta = X^{T}\vec{y}$,
so $\theta = (X^{T}X)^{-1}X^{T}\vec{y}$.
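The closed-form solution can be sketched in NumPy; the function name is mine, and `np.linalg.solve` is used on $X^{T}X\theta = X^{T}\vec{y}$ rather than forming $(X^{T}X)^{-1}$ explicitly, which is the numerically preferred way to apply the normal equation:

```python
import numpy as np

def normal_equation(X, y):
    """Solve the normal equation X^T X theta = X^T y for theta.

    X: (m, n) design matrix, y: (m,) target vector.
    Assumes X^T X is invertible (X has full column rank).
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Unlike gradient descent, no learning rate or iteration count is needed; the answer comes out in one linear solve.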
PS: The least squares method differs essentially from the gradient descent method. The normal equation solves for the optimum in closed form (for linear least squares this is the global optimum), while gradient descent approaches a solution iteratively and, on non-convex problems, may only find a local optimum.
2.4 Summary
The main content of this section is the difference between the gradient descent method and the least squares method. For gradient descent, the main variants are batch gradient descent and stochastic gradient descent; in different settings these two methods perform differently. Meanwhile, we should note the essential difference between the least squares method and gradient descent: the former is a closed-form solution, the latter an iterative one.