1. Binary Classification
1.1 Logistic Regression
Given a picture, you want to know whether it is a cat picture. If yes, it is labeled y = 1; otherwise y = 0.
A single training example is a pair (x, y), where x ∈ R^(n_x) is the feature vector and y ∈ {0, 1} is the label.
An entire training set consists of m such pairs: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}.
- m: the number of samples;
- m_train: the number of training samples;
- m_test: the number of test samples.
For implementation it is convenient to stack the inputs column-wise into a matrix X of shape (n_x, m) and the labels into a row vector Y of shape (1, m), i.e. Y belongs to a 1*m space. (This is different from what I learned; isn't the data usually an m*n matrix? Here each column, not each row, is one example.)
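A minimal sketch of these shapes in numpy (the feature count and sample count below are just made-up numbers):

import numpy as np

n_x, m = 4, 10                            # hypothetical feature and sample counts
X = np.random.randn(n_x, m)               # each column is one example x^(i)
Y = np.random.randint(0, 2, size=(1, m))  # labels stacked as a 1 x m row vector

print(X.shape)  # (4, 10)
print(Y.shape)  # (1, 10)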
So, let's briefly discuss logistic regression below:
Given an input x ∈ R^(n_x) (the input data), we need a label y for each x, forming pairs {(x, y)}.
But what if you want y-hat to be a probability label, i.e. the probability that the result is 1? For example, you want to know the probability that tomorrow is sunny (not rainy). How to do that?
(First, y-hat marks the probability of 1, not the probability of 0. Second, I want to output a probability, not a plain number: a probability lies in (0, 1), while a plain number has no fixed range. So how do we shrink a number with unbounded range into a probability in (0, 1)?)
We know that the range of w^T x + b is -∞ to +∞. The problem becomes how to map that result into the (0, 1) interval. Notice the sigmoid function from the picture above. The sigmoid function sig(z) = 1 / (1 + e^(-z)) has exactly this ability. Taking z = w^T x + b: when z is close to +∞ (a large positive number), e^(-z) -> 0, so sig -> 1; by contrast, sig -> 0 when z is close to -∞ (a large negative number); and when z = 0, sig = 0.5. So we can set y = 1 when sig > 0.5 and y = 0 when sig < 0.5.
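Below is a minimal numpy sketch of the sigmoid function (the values fed to it are just illustrative):

import numpy as np

def sigmoid(z):
    # sig(z) = 1 / (1 + e^(-z)): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5, the decision boundary
print(sigmoid(10))   # close to 1 -> predict y = 1
print(sigmoid(-10))  # close to 0 -> predict y = 0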
Cost function
Loss function definition:
One thing you could do is define a loss between the value y-hat that your algorithm outputs and the true label y. (That is, look at the gap between the predicted value and the real value; it should certainly be as small as possible.)
Some important info above:
- Given a set of training examples {(x^(i), y^(i))}, you want your prediction y-hat to be close to y. (Only then does the prediction count as accurate.)
- Find a loss function L(y-hat, y) which measures the difference between y-hat and the real y; here it is the cross-entropy loss L(y-hat, y) = -(y log(y-hat) + (1 - y) log(1 - y-hat)).
- When y = 1, y-hat should be predicted as 1. In that case L = -log(y-hat); for the loss to be as small as possible, log(y-hat) should be as large as possible, so y-hat should be as large as possible, i.e. close to 1.
- Symmetrically, when y = 0, L = -log(1 - y-hat), so y-hat should be close to 0.
- The loss function deals with a single training example; the cost function J(w, b) = (1/m) Σ_i L(y-hat^(i), y^(i)) averages the loss over all m samples (see the sketch below).
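A small sketch of the loss and cost in numpy, assuming A holds the predictions y-hat^(i) and Y the true labels (the variable and function names are mine, not from the course):

import numpy as np

def loss(a, y):
    # cross-entropy loss for a single example
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def cost(A, Y):
    # cost = average loss over all m examples; A and Y have shape (1, m)
    m = Y.shape[1]
    return np.sum(loss(A, Y)) / m

A = np.array([[0.9, 0.2, 0.7]])  # made-up predictions
Y = np.array([[1, 0, 1]])        # made-up labels
print(cost(A, Y))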
Gradient Descent
- When w and b change, the cost function changes. So the question becomes: find good w and b that make the cost function as small as possible.
- α is the learning rate; it controls how big a step we take on each iteration of gradient descent.
- In Python code, dw represents the derivative dJ(w)/dw (and likewise db represents dJ(w, b)/db).
So, the gradient-descent updates for w and b are: w := w - α·dJ/dw and b := b - α·dJ/db. (In other words, take the partial derivative with respect to w and b separately, then subtract the scaled partial derivative step by step.)
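As code, one update step looks roughly like this (all the numbers below are placeholder values, just to make the snippet runnable):

alpha = 0.01          # learning rate (hypothetical value)
w, b = 0.5, 0.0       # current parameters (hypothetical values)
dw, db = 0.2, -0.1    # pretend these are dJ/dw and dJ/db from the current iteration
w = w - alpha * dw    # w := w - alpha * dJ/dw
b = b - alpha * db    # b := b - alpha * dJ/db

In practice this step is repeated many times, recomputing dw and db on each iteration.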
Computation Graph
When you want to compute a function J(a, b, c) = 3(a + bc), you can define intermediate variables u = bc and v = a + u, so that J = 3v. Then you can draw a computation graph: the blue arrows represent forward propagation, and the red arrows represent backward propagation. Through backward propagation we can compute the derivative of J with respect to every input and hidden node. Why do we want these derivatives? A derivative tells us how much the function J changes when one factor changes. Below, a worked calculation introduces the method.
OK, let's compute from right to left. When v changes by 1, J changes by 3, so dJ/dv = 3. When a changes by 1, v also changes by 1 and J changes by 3, so dJ/da is 3 as well; this is because dJ/da = (dJ/dv)·(dv/da) = 3·1 = 3. This is the chain rule. In the same way, dJ/du = 3, dJ/db = (dJ/du)·(du/db) = 3c, and dJ/dc = (dJ/du)·(du/dc) = 3b. So, multiplying the derivative of each factor from right to left gives us how a change in each input relates to a change in the function.
In Python naming, since the numerator of every derivative is dJ, we drop it: dJ/dv is simply written dv, and the rest follow the same pattern (da, du, db, dc).
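A tiny sketch of the forward and backward passes for J(a, b, c) = 3(a + bc), using the dv/da/... naming convention above (the input values 5, 3, 2 are just made up for the example):

a, b, c = 5.0, 3.0, 2.0   # hypothetical inputs

# forward propagation
u = b * c        # u = bc
v = a + u        # v = a + u
J = 3 * v        # J = 3v

# backward propagation (chain rule); names drop the dJ numerator
dv = 3.0         # dJ/dv
da = dv * 1.0    # dJ/da = dJ/dv * dv/da
du = dv * 1.0    # dJ/du = dJ/dv * dv/du
db = du * c      # dJ/db = dJ/du * du/db = 3c
dc = du * b      # dJ/dc = dJ/du * du/dc = 3b

print(da, db, dc)  # 3.0 6.0 9.0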
Let's take logistic regression as an example.
So, first we list all the formulas needed to calculate the loss function of logistic regression: z = w1·x1 + w2·x2 + b, a = y-hat = sig(z), and L(a, y) = -(y log a + (1 - y) log(1 - a)). Then we can draw the computation graph. w1, w2, and b are the parameters we need to refine, and they determine the model quality.
So, in order to calculate the derivatives of L(a, y) with respect to w1, w2, and b, i.e. dL/dw1, dL/dw2, and dL/db, we can use backward propagation:
- Calculate da = dL(a, y)/da = -y/a + (1 - y)/(1 - a).
- Calculate dz = dL/dz = da · (da/dz), where da/dz = a(1 - a) is the derivative of the sigmoid function. So dz = a - y.
- Calculate dw1 = x1·dz, dw2 = x2·dz, and db = dz.
So, we can see how to calculate, and then optimize, the parameters using the computation graph. (The chain-rule treatment of the composite derivatives is explained very well here and makes the formulas easy to understand.)
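A single-example sketch of these backward-propagation formulas, with two features x1 and x2 (all the numeric values are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y = 1.0, 2.0, 1.0   # one training example (hypothetical values)
w1, w2, b = 0.1, -0.2, 0.0  # current parameters (hypothetical values)

# forward propagation
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward propagation
dz = a - y       # dL/dz
dw1 = x1 * dz    # dL/dw1
dw2 = x2 * dz    # dL/dw2
db = dz          # dL/db

print(dw1, dw2, db)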
Gradient Descent over m samples
When the training set has m samples indexed 1, 2, 3, ..., m, we use (x^(i), y^(i)) to represent sample i, and dw^(i), db^(i) to represent the derivatives computed on sample i. For the whole training set, the gradients should be the average over every sample: dw = (1/m) Σ_i dw^(i) and db = (1/m) Σ_i db^(i). (Average the derivatives of the parameters over every sample.)
Python computation outline:
This has two drawbacks: you have to write two for loops (one over the m samples and one over the features), and for loops degrade performance. What can be done about it? Vectorization (see the sketch below).
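A vectorized sketch of one pass over all m samples, assuming X has shape (n_x, m) and Y has shape (1, m) as above (the function name propagate and the toy data are my own, not from the course):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    # w: (n_x, 1), b: scalar, X: (n_x, m), Y: (1, m)
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # forward pass for all samples at once, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                        # backward pass, shape (1, m)
    dw = np.dot(X, dZ.T) / m          # (n_x, 1): per-sample gradients averaged over m
    db = np.sum(dZ) / m
    return dw, db, cost

# toy usage
n_x, m = 4, 10
w, b = np.zeros((n_x, 1)), 0.0
X = np.random.randn(n_x, m)
Y = np.random.randint(0, 2, size=(1, m))
dw, db, cost = propagate(w, b, X, Y)
print(dw.shape, db, cost)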
Broadcasting in Python
import numpy as np
A = np.array([[56.0,0.0,4.4,68.0],
[1.2,104.0,52.0,8.0],
[1.8,135.0,99.0,0.9]])
print(A)
[[ 56. 0. 4.4 68. ]
[ 1.2 104. 52. 8. ]
[ 1.8 135. 99. 0.9]]
cal = A.sum(axis=0) # sum vertically (down the columns)
# this reads as a row, because it is a 1-D array; its shape is (4,), not a 2-D (1, n) matrix
cal
array([ 59. , 239. , 155.4, 76.9])
# reshape into a 2-D row layout: each sub-array represents one row
cal = cal.reshape(1,4)
cal
array([[ 59. , 239. , 155.4, 76.9]])
# for the vertical (column) layout of a 2-D array
cal_verti = cal.reshape(4,1)
cal_verti
array([[ 59. ],
[239. ],
[155.4],
[ 76.9]])
percentage = 100 * A/cal
percentage
array([[94.91525424, 0. , 2.83140283, 88.42652796],
[ 2.03389831, 43.51464435, 33.46203346, 10.40312094],
[ 3.05084746, 56.48535565, 63.70656371, 1.17035111]])
# adding a constant
B = np.array([1,2,3,4])
B + 100
array([101, 102, 103, 104])
c = np.array([[1,2,3],[4,5,6]])
c1 = np.array([100,200,300])
c2 = np.array([[100],[200]])
print(c + c1)
print(c+c2)
[[101 202 303]
[104 205 306]]
[[101 102 103]
[204 205 206]]
# that is, when two arrays have different shapes, numpy's add/subtract/multiply automatically broadcasts them into matching shapes
a = np.random.randn(5)
print(a)
# this is a (5,) rank-1 vector with no column dimension; it's neither a row vector nor a column vector
a.shape
(5,)
print(a.T) # looks exactly the same as a
[ 0.61129989 -0.48008827 1.39754925 -0.90183129 0.13849732]
# so it's best not to use this form; add the column dimension explicitly so it becomes a 2-D matrix
b = np.random.randn(5,1)
print(b)
[[ 1.46909732]
[-0.70696235]
[ 0.50947828]
[-0.48711335]
[-1.61225188]]
# now the transpose works as expected, because it is a 2-D array
print(b.T)
[[ 1.46909732 -0.70696235 0.50947828 -0.48711335 -1.61225188]]
# so don't use rank-1 vectors.
a = a.reshape(5,1)        # convert the rank-1 array into an explicit column vector
assert(a.shape == (5,1))  # use assertions like this to check vector shapes