Let's talk about how you can use gradient descent to train the parameters w and b on your training set.

Figure-1 shows the logistic regression algorithm and the cost function J. We want to find w and b that minimize J(w,b). The following is an illustration of gradient descent.

In Figure-2, the horizontal axes represent the parameters w and b. In practice, w can be much higher dimensional, but for the purpose of plotting, let's illustrate w as a single real number. The cost function J(w,b) is then a surface above these horizontal axes, and the height of the surface represents the value of J(w,b) at a given point. We want to find the values of w and b that correspond to the minimum of the cost function J. It turns out that this cost function is convex. So, to find a good value for w and b, we initialize w and b to some initial value, usually 0. What gradient descent does is start at that initial point and then take a step in the steepest downhill direction, over and over, until it converges to the global optimum or gets close to the global minimum.
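As a minimal sketch of that initialization step (NumPy and the feature count n_x are my own assumptions here, not from the figure):

```python
import numpy as np

n_x = 4                   # number of input features (hypothetical)
w = np.zeros((n_x, 1))    # initialize the weights w to zeros
b = 0.0                   # initialize the bias b to zero
```

Because the cost function is convex, this zero initialization is enough; gradient descent would reach the same minimum from other starting points as well.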

Now let's look at gradient descent in more detail. Say there is a function J(w) of a single parameter w that we want to minimize, as shown in Figure-3. We're going to repeatedly carry out the following update until the algorithm converges:

w := w - α dJ(w)/dw
Note that:
- α is the learning rate, which controls how big a step we take on each iteration of gradient descent.
- dJ(w)/dw is the derivative; it is the update we want to make to the parameter w. The convention in coding is to use the variable name dw to represent this quantity.
- The definition of the derivative is the slope of the function at a point, which is the height divided by the width of a small triangle drawn at that point. So dJ(w)/dw is just the slope of J(w) at the current value of w.
- If the initial value of w is w1 (to the right of the minimum), the derivative is positive, so the update w := w - α dw ends up taking a step to the left.
- If the initial value of w is w2 (to the left of the minimum), the derivative is negative, so the update ends up taking a step to the right.
- So whether w is initialized on the left or the right, gradient descent will eventually move it toward the global minimum, as the sketch after this list illustrates.
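As a small sketch of this update rule (the convex example J(w) = (w - 3)**2, the learning rate value, and the starting points are my own illustrative choices, not taken from Figure-3):

```python
# Gradient descent on a simple convex function J(w) = (w - 3)**2,
# whose derivative is dJ/dw = 2 * (w - 3).

def dJ_dw(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w, alpha=0.1, num_iterations=100):
    for _ in range(num_iterations):
        dw = dJ_dw(w)        # derivative at the current w (positive right of 3, negative left of 3)
        w = w - alpha * dw   # the update: w := w - alpha * dJ(w)/dw
    return w

print(gradient_descent(w=10.0))   # starts to the right of the minimum, steps left toward 3
print(gradient_descent(w=-4.0))   # starts to the left of the minimum, steps right toward 3
```

Both calls converge to roughly w = 3, the global minimum, regardless of which side they start on.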
For logistic regression, gradient descent repeats the following updates:

w := w - α dJ(w,b)/dw
b := b - α dJ(w,b)/db
In calculus, dJ(w,b)/dw is actually written as ∂J(w,b)/∂w, using the partial derivative symbol ∂ (a "curly d"), and it's called a partial derivative. Similarly, dJ(w,b)/db is written as ∂J(w,b)/∂b. All it means is that when J is a function of two or more variables, this symbol denotes the derivative with respect to one of those variables. In coding, ∂J(w,b)/∂w will be denoted as dw, and ∂J(w,b)/∂b will be denoted as db.
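Putting the update rule and the coding convention together, here is a minimal sketch of gradient descent for logistic regression (the vectorized gradient formulas dw = (1/m) X (A - Y)^T and db = (1/m) Σ(A - Y) are the standard ones for this cost function; the hyperparameter values and data shapes below are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, Y, alpha=0.1, num_iterations=1000):
    """X has shape (n_x, m); Y has shape (1, m) with labels in {0, 1}."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))    # initialize w to zeros
    b = 0.0                   # initialize b to zero
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)    # predictions for all m examples, shape (1, m)
        dw = np.dot(X, (A - Y).T) / m      # dw stands for dJ(w,b)/dw
        db = np.sum(A - Y) / m             # db stands for dJ(w,b)/db
        w = w - alpha * dw                 # w := w - alpha * dJ(w,b)/dw
        b = b - alpha * db                 # b := b - alpha * dJ(w,b)/db
    return w, b
```

For example, w, b = logistic_regression_gd(X_train, Y_train) would return the learned parameters for a hypothetical training set X_train, Y_train.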