Basics of Neural Network Programming - Gradient Descent

Let's talk about how you can use gradient descent to train the parameters w and b on your training set.

Figure-1

Figure-1 recaps the logistic regression algorithm and the cost function J. We want to find the values of w and b that minimize J(w,b).
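Written out, the model and cost function that Figure-1 recaps are (reconstructed here from the earlier sections, for reference):

\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b), \quad \sigma(z) = \frac{1}{1+e^{-z}}

J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right]

Figure-2 below illustrates how gradient descent searches for the minimizing w and b.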

Figure-2

In Figure-2, the horizontal axes represent the parameters w and b. In practice, w can be much higher dimensional, but for the purpose of plotting, let's illustrate w as a single real number. The cost function J(w,b) is then a surface above these horizontal axes, so the height of the surface at a point represents the value of J(w,b) there. We want to find the values of w and b that correspond to the minimum of the cost function J. It turns out that this cost function is convex. To find a good value for w and b, we initialize them to some initial value, usually 0. Gradient descent starts at that initial point and then repeatedly takes a step in the steepest downhill direction, until it converges to the global optimum or gets to something close to the global minimum.
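To make this concrete, here is a minimal sketch of gradient descent on a simple convex surface. The quadratic bowl J(w,b) = (w-1)^2 + (b-2)^2 is an illustrative stand-in for the real cost (not the logistic regression cost itself); starting from w = b = 0, the iterates walk downhill to the global minimum at (1, 2):

```python
# Gradient descent on an illustrative convex bowl J(w, b) = (w - 1)**2 + (b - 2)**2,
# whose derivatives are dJ/dw = 2 * (w - 1) and dJ/db = 2 * (b - 2).
def gradient_descent_demo(alpha=0.1, num_iters=100):
    w, b = 0.0, 0.0          # initialize both parameters to zero
    for _ in range(num_iters):
        dw = 2 * (w - 1)     # derivative of J with respect to w
        db = 2 * (b - 2)     # derivative of J with respect to b
        w -= alpha * dw      # take a step downhill in w
        b -= alpha * db      # take a step downhill in b
    return w, b

print(gradient_descent_demo())   # approaches (1.0, 2.0), the global minimum
```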

Figure-3

Let's look at gradient descent in more detail. Say there is a function J(w) of a single parameter w that we want to minimize, as shown in Figure-3. We will repeatedly carry out the following update until the algorithm converges:

Figure-4
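
Written out, the update in Figure-4 is

w := w - \alpha \frac{dJ(w)}{dw}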

Note that:

  • \alpha is the learning rate, which controls how big a step we take on each iteration of gradient descent
  • \frac{dJ(w)}{dw} is the derivative term. It determines the update we make to the parameter w. The convention in code is to use the variable name dw to represent it
  • The derivative of a function at a point is its slope there, i.e., the height divided by the width of a small triangle along the curve. So,
    • If the initial value of w is w1, the derivative is positive, so the update to w ends up taking steps to the left.
    • If the initial value of w is w2, the derivative is negative, so the update to w ends up taking steps to the right.
    • So hopefully, whether w is initialized on the left or on the right, gradient descent will eventually move it toward the global minimum, as the numerical sketch after this list illustrates
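
As a numerical illustration of the two cases above, the following minimal sketch runs gradient descent on J(w) = w^2 (an illustrative convex function, not the logistic regression cost), whose derivative is 2w. Starting to the right of the minimum the derivative is positive and w moves left; starting to the left it is negative and w moves right; both runs end up near the minimum at w = 0:

```python
# 1-D gradient descent on the illustrative cost J(w) = w**2, with derivative dJ/dw = 2 * w.
def descend(w, alpha=0.1, num_iters=50):
    for _ in range(num_iters):
        dw = 2 * w           # positive when w > 0, negative when w < 0
        w -= alpha * dw      # a positive dw moves w left, a negative dw moves w right
    return w

print(descend(w=2.0))    # starts right of the minimum: derivative > 0, steps left toward 0
print(descend(w=-2.0))   # starts left of the minimum: derivative < 0, steps right toward 0
```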

For logistic regression, the gradient descent updates look like this:

Figure-5
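
Written out, the updates shown in Figure-5, repeated until convergence, should be (using the informal derivative notation discussed below):

w := w - \alpha \frac{dJ(w,b)}{dw}

b := b - \alpha \frac{dJ(w,b)}{db}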

In calculus, \frac{dJ(w,b)}{dw} is actually written as \frac{\partial J(w,b)}{\partial w}; it's called a partial derivative. Similarly, \frac{dJ(w,b)}{db} is written as \frac{\partial J(w,b)}{\partial b}. All this means is that when J is a function of two or more variables, we take the derivative with respect to just one of the variables. In code, \frac{\partial J(w,b)}{\partial w} will be denoted dw and \frac{\partial J(w,b)}{\partial b} will be denoted db.
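
Putting it all together, here is a minimal vectorized NumPy sketch of gradient descent for logistic regression, using the dw/db naming convention above. It assumes X has shape (n_features, m) and y has shape (1, m), and uses the standard cross-entropy gradients (dz = a - y); treat it as an illustration rather than the exact implementation used later in the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """X: (n_features, m) training inputs, y: (1, m) labels in {0, 1}."""
    n, m = X.shape
    w = np.zeros((n, 1))                 # initialize parameters to zero
    b = 0.0
    for _ in range(num_iters):
        a = sigmoid(np.dot(w.T, X) + b)  # predictions y_hat, shape (1, m)
        dz = a - y                       # dL/dz for the cross-entropy loss
        dw = np.dot(X, dz.T) / m         # dw stands for dJ(w, b)/dw, shape (n, 1)
        db = np.sum(dz) / m              # db stands for dJ(w, b)/db, a scalar
        w -= alpha * dw                  # gradient descent update for w
        b -= alpha * db                  # gradient descent update for b
    return w, b
```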

<end>

 
