Class1-Week2-Neural Networks Basics

Logistic Regression

Description

Logistic regression is a learning algorithm used in supervised learning problems where the output labels y are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.

Example: Cat vs No-cat

Given an image represented by a feature vector x, the algorithm evaluates the probability of a cat being in that image:

Given $x$: $\widehat{y} = P(y=1 \mid x)$, where $\widehat{y} \in [0,1]$

The parameters used in Logistic regression are:

  • The input feature vector: $x \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features
  • The training label: $y \in \{0, 1\}$
  • The weights: $w \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features
  • The threshold (bias): $b \in \mathbb{R}$
  • The output: $\widehat{y} = \sigma(w^{T}x + b)$
  • The sigmoid function: $s = \sigma(w^{T}x + b) = \sigma(z)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$

Logistic Function

import numpy as np
import time
import matplotlib.pyplot as plt

%matplotlib inline

# Plot the sigmoid (logistic) function over [-10, 10]
x = np.arange(-10, 10, 0.001)
y = 1 / (1 + np.exp(-x))

plt.plot(x, y)
plt.suptitle(r'$y=\frac{1}{1+e^{-x}}$', fontsize=20)
plt.grid(color='gray', linewidth=1, linestyle='--')

plt.show()

(Figure: plot of the sigmoid function $y = \frac{1}{1+e^{-x}}$ produced by the code above.)

$w^{T}x + b$ is a linear function (like $ax + b$), but since we are looking for a probability constrained to $[0,1]$, the sigmoid function is applied to it. The sigmoid is bounded between 0 and 1, as shown in the graph above.

Some observations from the graph:

  • If $z$ is a large positive number, then $\sigma(z) \approx 1$
  • If $z$ is a large negative number, then $\sigma(z) \approx 0$
  • If $z = 0$, then $\sigma(z) = 0.5$
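
A quick numerical check of these observations (a minimal sketch; the `sigmoid` helper below is defined here for illustration and reused later):

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(10))    # ~0.99995, close to 1
print(sigmoid(-10))   # ~0.00005, close to 0
print(sigmoid(0))     # exactly 0.5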

Logistic Cost Function

To train the parameters $w$ and $b$, we need to define a cost function.

Recap:

$$\widehat{y}^{(i)} = \sigma(w^{T}x^{(i)} + b), \quad where\ \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$$

$$Given\ \{(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})\},\ we\ want\ \widehat{y}^{(i)} \approx y^{(i)}$$

  • $x^{(i)}$: the $i$-th training example

Loss(error) Function

The loss function measures the discrepancy between the prediction $\widehat{y}^{(i)}$ and the desired output $y^{(i)}$. In other words, the loss function computes the error for a single training example.

One natural choice would be the squared error:

$$L(\widehat{y}^{(i)}, y^{(i)}) = \frac{1}{2}(\widehat{y}^{(i)} - y^{(i)})^{2}$$

However, this makes the optimization problem non-convex for logistic regression, so gradient descent may get stuck in local optima. Instead, the cross-entropy loss is used:

$$L(\widehat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \widehat{y}^{(i)})\right)$$

  • If $y^{(i)} = 1$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(\widehat{y}^{(i)})$, so minimizing the loss pushes $\widehat{y}^{(i)}$ close to 1.
  • If $y^{(i)} = 0$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(1 - \widehat{y}^{(i)})$, so minimizing the loss pushes $\widehat{y}^{(i)}$ close to 0.
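
A small numeric illustration (the prediction value below is chosen arbitrarily): a confident, correct prediction incurs a small loss, while a confident, wrong one is penalized heavily.

y_hat = 0.9
print(-np.log(y_hat))      # y = 1: loss ~ 0.105 (good prediction, small loss)
print(-np.log(1 - y_hat))  # y = 0: loss ~ 2.303 (bad prediction, large loss)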

Cost Function

The cost function is the average of the loss function over the entire training set. We want to find the parameters $w$ and $b$ that minimize this overall cost.

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$$
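
As a minimal sketch (the array names `A` for the predictions $\widehat{y}^{(i)}$ and `Y` for the labels are assumptions, both NumPy arrays of length m), the cost can be computed as:

def compute_cost(A, Y):
    m = Y.shape[0]
    # cross-entropy loss averaged over the m training examples
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m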


Gradient Descent

One Example:

Recall that:

$$z = w^{T}x + b$$

$$\widehat{y} = a = \sigma(z)$$

$$L(a, y) = -(y\log(a) + (1 - y)\log(1 - a))$$

Forward:
(Figure: forward-propagation computation graph, $x_1, x_2, w_1, w_2, b \rightarrow z = w_1x_1 + w_2x_2 + b \rightarrow a = \sigma(z) \rightarrow L(a, y)$.)
Backward:

$$\frac{\partial L(a,y)}{\partial a} = \frac{\partial\left[-(y\log(a) + (1-y)\log(1-a))\right]}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$$

Using $\frac{\partial a}{\partial z} = \sigma'(z) = a(1-a)$ and the chain rule:

$$\frac{\partial L(a,y)}{\partial z} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) = a - y$$

$$\frac{\partial L(a,y)}{\partial w_{1}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{1}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{1} = x_{1}(a-y)$$

$$\frac{\partial L(a,y)}{\partial w_{2}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{2}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{2} = x_{2}(a-y)$$

$$\frac{\partial L(a,y)}{\partial b} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial b} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times 1 = a - y$$
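
These analytic gradients can be sanity-checked numerically (a minimal sketch; the example values `x1`, `x2`, `w1`, `w2`, `b`, `y` are chosen arbitrarily and `sigmoid` is the helper defined earlier):

def single_example_loss(w1, w2, b, x1, x2, y):
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

x1, x2, y = 0.5, -1.2, 1
w1, w2, b = 0.3, -0.7, 0.1
a = sigmoid(w1 * x1 + w2 * x2 + b)

# central-difference estimate of dL/dw1 vs. the analytic formula x1 * (a - y)
eps = 1e-6
numeric_dw1 = (single_example_loss(w1 + eps, w2, b, x1, x2, y)
               - single_example_loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
print(numeric_dw1, x1 * (a - y))  # the two values should agree closely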

M Training Examples:

Recap:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$$

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial L(\widehat{y}^{(i)}, y^{(i)})}{\partial w}$$

# Assume we have two features in this prediction
# (uses the numpy import and the sigmoid helper defined above)
def logisticRegressionGradientDescent(m, x, y, w, b, alpha):
    """One iteration of forward propagation (FP) and backward propagation (BP).
    
    Parameters:
    m (int) -- the number of examples in the training set
    x (matrix: 2 * m) -- the features
    y (vector: m) -- the labels
    w (vector: 2) -- the weights of the two features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    J = 0; dw1 = 0; dw2 = 0; db = 0
    for i in range(m):
        # FP: prediction for example i and its contribution to the cost
        z = np.dot(w, x[:, i]) + b
        a = sigmoid(z)
        J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
        
        # BP: accumulate the gradients dw1, dw2, db
        dz = a - y[i]
        dw1 += x[0, i] * dz
        dw2 += x[1, i] * dz
        db += dz
    
    # average the cost and the gradients over the m examples
    J /= m
    dw1 /= m
    dw2 /= m
    db /= m
    
    # gradient-descent update of the parameters
    w[0] -= alpha * dw1
    w[1] -= alpha * dw2
    b -= alpha * db
    
    return w, b, J

Vectorization

Look at the Power of Vectorization:

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)
toc = time.time()

print(c)
print("Vectorized version:", str((toc - tic) * 1000) + "ms")

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print(c)
print("For loop version:", str((toc - tic) * 1000) + "ms")
249706.30302638497
Vectorized version: 1.2466907501220703ms
249706.30302638162
For loop version: 395.0514793395996ms

Vectorizing Logistic Regression

def logisticRegressionGradientDescent(m, X, Y, w, b, alpha):
    """One vectorized iteration of FP & BP.
    
    Parameters:
    m (int) -- the number of examples in the training set
    X (matrix: n * m) -- the features
    Y (vector: 1 * m) -- the labels
    w (vector: 1 * n) -- the weights of the n features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    
    # FP: compute all m activations at once
    Z = np.dot(w, X) + b                              # shape (1, m)
    A = sigmoid(Z)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    
    # BP: vectorized gradients
    dZ = A - Y                                        # shape (1, m)
    dw = np.dot(dZ, X.T) / m                          # shape (1, n)
    db = np.sum(dZ) / m
    
    # gradient-descent update
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J
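
A minimal usage sketch of the vectorized function (the synthetic data, shapes, and iteration count here are illustrative assumptions; `sigmoid` is the helper defined earlier):

np.random.seed(1)
n, m = 2, 100
X = np.random.randn(n, m)                       # n features, m examples
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # synthetic labels, shape (1, m)
w = np.zeros((1, n)); b = 0.0

for _ in range(1000):
    w, b, J = logisticRegressionGradientDescent(m, X, Y, w, b, alpha=0.1)
print(J)  # the cost should decrease as training proceeds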

Broadcasting in Python

a = np.array([[1,2], [3,4]])
print(a)

b = 1
# b = [1,1]
# b = [[1], [1]]

# broadcast b to [[1,1], [1,1]]
# b = np.array([[1,1], [1,1]])

print(a + b)
print(a - b)
[[1 2]
 [3 4]]
[[2 3]
 [4 5]]
[[0 1]
 [2 3]]
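
Another broadcasting example (a sketch, not from the original notes): normalize each column of a matrix by its column sum.

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
col_sums = A.sum(axis=0)            # shape (2,)
print(A / col_sums.reshape(1, 2))   # (2, 2) / (1, 2) broadcasts across rows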

A Note on Python/Numpy vectors

a = np.random.randn(5)  # Rank 1 array: shape (5,), neither a row nor a column vector
print(a.shape, a)

print(a.T)      # transposing a rank 1 array has no effect
print(a * a.T)  # element-wise product, not an inner or outer product
(5,) [ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[2.19126245 0.42514382 0.0072945  1.17883282 2.96784882]
a = np.random.randn(5, 1)  # column vector: shape (5, 1)
print(a.shape)
print(a)
print(a.T)                 # a proper (1, 5) row vector
print(a.T.shape)
print(a * a.T)             # broadcasts to a (5, 5) outer product
(5, 1)
[[ 1.80545889]
 [ 2.31719407]
 [ 0.89081914]
 [-1.08760266]
 [ 0.24755189]]
[[ 1.80545889  2.31719407  0.89081914 -1.08760266  0.24755189]]
(1, 5)
[[ 3.25968179  4.18359863  1.60833733 -1.96362189  0.44694476]
 [ 4.18359863  5.36938835  2.06420082 -2.52018643  0.57362578]
 [ 1.60833733  2.06420082  0.79355874 -0.96885726  0.22052396]
 [-1.96362189 -2.52018643 -0.96885726  1.18287954 -0.2692381 ]
 [ 0.44694476  0.57362578  0.22052396 -0.2692381   0.06128194]]
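
To avoid the ambiguity of rank 1 arrays, a common precaution (a sketch, not from the original notes) is to reshape to an explicit column vector and assert the intended shape:

a = np.random.randn(5)
a = a.reshape(5, 1)          # force an explicit column vector
assert a.shape == (5, 1)
print(np.dot(a.T, a))        # a well-defined (1, 1) inner product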

Explanation of Logistic Regression

Recap:

We know that $\widehat{y} = p(y=1|x)$, so we can write:

  • If $y = 1$: $p(y|x) = \widehat{y}$
  • If $y = 0$: $p(y|x) = 1 - \widehat{y}$

The two cases above can be combined into a single expression:
$$p(y|x) = \widehat{y}^{y}(1-\widehat{y})^{1-y}$$

$$\log(p(y|x)) = y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})$$

We want to maximize $p(y|x)$, which is equivalent to minimizing $-\log(p(y|x))$:
$$\begin{aligned} -\log(p(y|x)) &= -(y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})) \\ &= L(\widehat{y}, y) \end{aligned}$$

Note that this loss function is convex in $w$ and $b$, so gradient descent can find its global optimum.

Cost on m examples:
$$p(Y|X) = \prod_{i=1}^{m}p(y^{(i)}|x^{(i)})$$

$$\begin{aligned} \log(p(Y|X)) &= \sum_{i=1}^{m}\log(p(y^{(i)}|x^{(i)})) \\ &= -\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)}) \\ &= -m \cdot J(w, b) \end{aligned}$$
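
Since scaling by the positive constant $m$ does not change which parameters maximize or minimize the expression, maximizing the log-likelihood is the same as minimizing the cost:

$$\arg\max_{w,b}\ \log(p(Y|X)) = \arg\max_{w,b}\ \left(-m \cdot J(w,b)\right) = \arg\min_{w,b}\ J(w,b)$$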

To summarize, by minimizing the cost function $J(w,b)$ we are really carrying out maximum likelihood estimation for the logistic regression model, under the assumption that the training examples are independent and identically distributed (IID).
