Logistic Regression
Description
Logistic regression is a learning algorithm used in supervised learning problems where the output labels $y$ are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training labels.
Example: Cat vs No-cat
Given an image represented by a feature vector $x$, the algorithm will estimate the probability that a cat is in the image.
Given $x$, $\widehat{y} = P(y=1 \mid x)$, where $\widehat{y} \in [0,1]$
The parameters used in Logistic regression are:
- The input feature vector: $x \in \mathbb{R}^{n_{x}}$, where $n_{x}$ is the number of features
- The training label: $y \in \{0,1\}$
- The weights: $w \in \mathbb{R}^{n_{x}}$, where $n_{x}$ is the number of features
- The threshold (bias): $b \in \mathbb{R}$
- The output: $\widehat{y} = \sigma(w^{T}x + b)$
- Sigmoid function: $s = \sigma(w^{T}x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ (a minimal NumPy sketch of this prediction follows below)
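As a small illustration of the output formula in the last two bullets, here is a minimal NumPy sketch; the feature values, weights, and bias below are made-up numbers, not from the original.

import numpy as np

x = np.array([0.5, -1.2, 2.0])   # one input feature vector, shape (n_x,) with n_x = 3
w = np.array([0.1, 0.4, -0.3])   # weights, same shape as x
b = 0.2                          # bias / threshold

z = np.dot(w, x) + b             # linear part: w^T x + b
y_hat = 1 / (1 + np.exp(-z))     # sigmoid(z): the predicted P(y = 1 | x)
print(y_hat)                     # a value in (0, 1)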
Logistic Function
import numpy as np
import time
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-10, 10, 0.001)
y = 1 / (1 + np.exp(-x))
plt.plot(x,y)
plt.suptitle(r'$y=\frac{1}{1+e^{-x}}$', fontsize=20)
plt.grid(color='gray', linewidth=1, linestyle='--')
plt.show()

$w^{T}x + b$ is a linear function (like $ax + b$), but since we need a probability constrained to $[0,1]$, the sigmoid function is applied. The sigmoid is bounded between 0 and 1, as shown in the graph above.
Some observations from the graph (verified numerically in the short check below):
- If $z$ is a large positive number, then $\sigma(z) \approx 1$
- If $z$ is a large negative number, then $\sigma(z) \approx 0$
- If $z = 0$, then $\sigma(z) = 0.5$
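A quick numeric check of these three observations, using the same formula as in the plot above (the sample values of $z$ are arbitrary):

import numpy as np

for z in (10.0, -10.0, 0.0):
    s = 1 / (1 + np.exp(-z))
    print(f"sigma({z}) = {s:.6f}")
# sigma(10.0)  is approximately 0.999955
# sigma(-10.0) is approximately 0.000045
# sigma(0.0)   is exactly 0.5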
Logistic Cost Function
To train the parameters $w$ and $b$, we need to define a cost function.
Recap:
$\widehat{y}^{(i)} = \sigma(w^{T}x^{(i)} + b)$, where $\sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$

Given $\{(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})\}$, we want $\widehat{y}^{(i)} \approx y^{(i)}$
- $x^{(i)}$: the i-th training example
Loss (Error) Function
The loss function measures the discrepancy between the prediction $\widehat{y}^{(i)}$ and the desired output $y^{(i)}$. In other words, the loss function computes the error for a single training example.
A natural first choice would be the squared error,

$L(\widehat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\widehat{y}^{(i)} - y^{(i)})^{2}$

but combined with the sigmoid this makes the optimization problem non-convex (many local optima), so logistic regression uses the cross-entropy loss instead:

$L(\widehat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1-\widehat{y}^{(i)})\right)$
- If $y^{(i)} = 1$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(\widehat{y}^{(i)})$, so $\widehat{y}^{(i)}$ should be close to 1.
- If $y^{(i)} = 0$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(1 - \widehat{y}^{(i)})$, so $\widehat{y}^{(i)}$ should be close to 0 (both cases are checked numerically below).
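A small numeric check of the two cases (the prediction values 0.9 and 0.1 are illustrative): the loss is small only when the prediction agrees with the label, and large when it confidently disagrees.

import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.9, 1))   # y = 1, prediction close to 1 -> small loss (~0.105)
print(loss(0.1, 1))   # y = 1, prediction close to 0 -> large loss (~2.303)
print(loss(0.1, 0))   # y = 0, prediction close to 0 -> small loss (~0.105)
print(loss(0.9, 0))   # y = 0, prediction close to 1 -> large loss (~2.303)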
Cost Function
The cost function is the average of the loss function over the entire training set. We then look for the parameters $w$ and $b$ that minimize this overall cost.
$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$
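A vectorized sketch of this cost over a whole training set; the label and prediction arrays below are made-up values for illustration.

import numpy as np

def cost(Y_hat, Y):
    # J(w, b): average cross-entropy loss over the m training examples
    m = Y.shape[0]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

Y     = np.array([1, 0, 1, 1])          # labels y^(i)
Y_hat = np.array([0.8, 0.2, 0.6, 0.9])  # predictions yhat^(i)
print(cost(Y_hat, Y))                   # approximately 0.266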
Gradient Descent
One Example:
For a single training example we have:
$z = w^{T}x + b$

$\widehat{y} = a = \sigma(z)$

$L(a,y) = -(y\log(a) + (1 - y)\log(1 - a))$
Forward: compute $z = w^{T}x + b$, then $a = \sigma(z)$, and finally the loss $L(a, y)$.

Backward:
$\frac{\partial L(a,y)}{\partial a} = \frac{\partial}{\partial a}\left[-(y\log(a)+(1-y)\log(1-a))\right] = -\frac{y}{a} + \frac{1-y}{1-a}$

$\frac{\partial L(a,y)}{\partial z} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) = a - y$

$\frac{\partial L(a,y)}{\partial w_{1}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{1}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{1} = x_{1}(a-y)$

$\frac{\partial L(a,y)}{\partial w_{2}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{2}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{2} = x_{2}(a-y)$

$\frac{\partial L(a,y)}{\partial b} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial b} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times 1 = a-y$
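These formulas can be sanity-checked numerically. The short sketch below (all values are illustrative) compares finite-difference estimates against the analytic gradients $x_{1}(a-y)$ and $a-y$.

import numpy as np

def loss_wb(w, b, x, y):
    a = 1 / (1 + np.exp(-(np.dot(w, x) + b)))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w = np.array([0.3, -0.7]); b = 0.1
x = np.array([1.5, 2.0]);  y = 1.0
a = 1 / (1 + np.exp(-(np.dot(w, x) + b)))

eps = 1e-6
num_dw1 = (loss_wb(w + np.array([eps, 0.0]), b, x, y)
           - loss_wb(w - np.array([eps, 0.0]), b, x, y)) / (2 * eps)
num_db = (loss_wb(w, b + eps, x, y) - loss_wb(w, b - eps, x, y)) / (2 * eps)

print(num_dw1, x[0] * (a - y))   # both approximately x1 * (a - y)
print(num_db, a - y)             # both approximately (a - y)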
M Training Examples:
Recap:
$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial L(\widehat{y}^{(i)}, y^{(i)})}{\partial w}$
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assume we have two features in this prediction
def logisticRegressionGradientDescent(m, x, y, w, b, alpha):
    """FP & BP: one gradient descent step with explicit loops.

    Parameters:
    m (int) -- the number of training examples
    x (matrix: 2 * m) -- the features
    y (vector of length m) -- the labels
    w (vector of length 2) -- the weights of the two features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    J = 0; dw1 = 0; dw2 = 0; db = 0
    for i in range(m):
        # FP: linear score, activation, and loss for example i
        z_i = np.dot(w, x[:, i]) + b
        a_i = sigmoid(z_i)
        J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))
        # BP: accumulate the gradients contributed by example i
        dz_i = a_i - y[i]
        dw1 += x[0, i] * dz_i
        dw2 += x[1, i] * dz_i
        db += dz_i
    # Average the cost and the gradients over the m examples
    J /= m
    dw1 /= m; dw2 /= m; db /= m
    # One gradient descent update
    w[0] -= alpha * dw1
    w[1] -= alpha * dw2
    b -= alpha * db
    return w, b, J
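A minimal, hypothetical call of the function above on a few synthetic examples (one gradient step; all numbers are made up):

m = 4
x = np.array([[0.5, -1.0,  2.0, 0.0],    # feature 1 of the 4 examples
              [1.5,  0.5, -0.5, 1.0]])   # feature 2 of the 4 examples
y = np.array([1, 0, 1, 0])
w = np.zeros(2)
b = 0.0
w, b, J = logisticRegressionGradientDescent(m, x, y, w, b, alpha=0.1)
print(w, b, J)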
Vectorization
Look at the Power of Vectorization:
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(c)
print("Vectorized version:", str((toc - tic) * 1000) + "ms")
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print(c)
print("For loop version:", str((toc - tic) * 1000) + "ms")
249706.30302638497
Vectorized version: 1.2466907501220703ms
249706.30302638162
For loop version: 395.0514793395996ms
Vectorizing Logistic Regression
def logisticRegressionGradientDescent(m, X, Y, w, b, alpha):
    """Vectorized FP & BP: one gradient descent step.

    Parameters:
    m (int) -- the number of training examples
    X (matrix: n * m) -- the features
    Y (vector: 1 * m) -- the labels
    w (vector: 1 * n) -- the weights of the different features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    # FP
    Z = np.dot(w, X) + b               # shape (1, m)
    A = sigmoid(Z)                     # sigmoid as defined above
    J = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    # BP
    dZ = A - Y
    dw = 1 / m * np.dot(dZ, X.T)       # shape (1, n), same as w
    db = 1 / m * np.sum(dZ)
    # Gradient descent update
    w -= alpha * dw
    b -= alpha * db
    return w, b, J
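A minimal training-loop sketch using the vectorized function; the synthetic data, sizes, and hyperparameters below are illustrative assumptions, not from the original.

np.random.seed(0)
n, m = 2, 1000
X = np.random.randn(n, m)
# Noisy, roughly linearly separable toy labels (made up for illustration)
Y = ((X[0, :] - X[1, :] + 0.5 * np.random.randn(m)) > 0).astype(float).reshape(1, m)

w = np.zeros((1, n))
b = 0.0
for step in range(500):
    w, b, J = logisticRegressionGradientDescent(m, X, Y, w, b, alpha=0.1)

predictions = (sigmoid(np.dot(w, X) + b) > 0.5).astype(float)
print("final cost:", J, "training accuracy:", np.mean(predictions == Y))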
Broadcasting in Python
a = np.array([[1,2], [3,4]])
print(a)
b = 1
# b = [1,1]
# b = [[1], [1]]
# broadcast b to [[1,1], [1,1]]
# b = np.array([[1,1], [1,1]])
print(a + b)
print(a - b)
[[1 2]
[3 4]]
[[2 3]
[4 5]]
[[0 1]
[2 3]]
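Broadcasting also works between arrays of different shapes when their trailing dimensions are compatible; a small made-up example dividing a matrix by its column sums:

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
col_sums = A.sum(axis=0)       # shape (3,): [5., 7., 9.]
print(100 * A / col_sums)      # (2, 3) / (3,) broadcasts across the rows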
A Note on Python/Numpy vectors
a = np.random.randn(5) # Rank 1 array
print(a.shape, a)
print(a.T)
print(a * a.T)
(5,) [ 1.48029134 0.65203054 -0.08540782 1.08574068 -1.72274456]
[ 1.48029134 0.65203054 -0.08540782 1.08574068 -1.72274456]
[2.19126245 0.42514382 0.0072945 1.17883282 2.96784882]
a = np.random.randn(5, 1)
print(a.shape)
print(a)
print(a.T)
print(a.T.shape)
print(a * a.T)
(5, 1)
[[ 1.80545889]
[ 2.31719407]
[ 0.89081914]
[-1.08760266]
[ 0.24755189]]
[[ 1.80545889 2.31719407 0.89081914 -1.08760266 0.24755189]]
(1, 5)
[[ 3.25968179 4.18359863 1.60833733 -1.96362189 0.44694476]
[ 4.18359863 5.36938835 2.06420082 -2.52018643 0.57362578]
[ 1.60833733 2.06420082 0.79355874 -0.96885726 0.22052396]
[-1.96362189 -2.52018643 -0.96885726 1.18287954 -0.2692381 ]
[ 0.44694476 0.57362578 0.22052396 -0.2692381 0.06128194]]
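The usual remedy for the ambiguity above, sketched here as a common practice rather than taken from the outputs themselves: avoid rank-1 arrays by committing to explicit column (or row) vectors and asserting the intended shape.

a = np.random.randn(5)      # rank-1 array: shape (5,); a.T does nothing
a = a.reshape(5, 1)         # commit to an explicit column vector
assert a.shape == (5, 1)    # cheap sanity check on the intended shape
print((a * a.T).shape)      # now a proper outer product: (5, 5)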
Explanation of Logistic Regression
Recap:
We know that $\widehat{y} = p(y=1 \mid x)$, so:
- If $y = 1$: $p(y \mid x) = \widehat{y}$
- If $y = 0$: $p(y \mid x) = 1 - \widehat{y}$
These two cases can be combined into a single expression:
$p(y \mid x) = \widehat{y}^{\,y}(1-\widehat{y})^{1-y}$
$\log(p(y \mid x)) = y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})$
We want to maximize $p(y \mid x)$; since $\log$ is monotonically increasing, this is equivalent to minimizing $-\log(p(y \mid x))$:
$\begin{aligned} -\log(p(y \mid x)) &= -(y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})) \\ &= L(\widehat{y}, y) \end{aligned}$
Note that this loss function is convex in $w$ and $b$, so gradient descent can reach its global optimum.
Cost on m examples:
Assuming the training examples are independent:

$p(Y \mid X) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)})$
$\begin{aligned} \log(p(Y \mid X)) &= \sum_{i=1}^{m}\log\left(p(y^{(i)} \mid x^{(i)})\right) \\ &= -\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) \\ &= -m\,J(w,b) \end{aligned}$

(the $\frac{1}{m}$ factor in $J(w,b)$ is only a scaling convention and does not change which $w$ and $b$ minimize it)
To summarize, by minimizing the cost function $J(w,b)$ we are really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that the training examples are IID (independent and identically distributed).