Class1-Week2-Neural Networks Basics

This article takes a close look at the logistic regression algorithm and its use in binary classification. Using a cat vs. non-cat image recognition example, it introduces core concepts such as the logistic (sigmoid) function, the cost function, and gradient descent.

Logistic Regression

Description

Logistic regression is a learning algorithm used in a supervised learning problem where the output labels y are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.

Example: Cat vs No-cat

Given an image represented by a feature vector x, the algorithm evaluates the probability of a cat being in that image.

Given $x$, $\widehat{y} = P(y = 1 \mid x)$, where $\widehat{y} \in [0, 1]$

The parameters used in Logistic regression are:

  • The input feature vector: $x \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features
  • The training label: $y \in \{0, 1\}$
  • The weights: $w \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features
  • The threshold (bias): $b \in \mathbb{R}$
  • The output: $\widehat{y} = \sigma(w^{T}x + b)$
  • The sigmoid function: $s = \sigma(w^{T}x + b) = \sigma(z)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ (see the short sketch below)
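To make the definitions concrete, here is a minimal sketch of computing $\widehat{y}$ for a single example (the feature values and parameters below are made up for illustration):

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1 / (1 + np.exp(-z))

# Hypothetical example with n_x = 3 features
x = np.array([0.5, -1.2, 3.0])   # input feature vector
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # threshold / bias

y_hat = sigmoid(np.dot(w, x) + b)   # estimated P(y = 1 | x)
print(y_hat)                        # a probability in (0, 1)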

Logistic Function

import numpy as np
import time
import matplotlib.pyplot as plt

%matplotlib inline
x = np.arange(-10, 10, 0.001)
y = 1 / (1 + np.exp(-x))
plt.plot(x, y)
plt.suptitle(r'$y=\frac{1}{1+e^{-x}}$', fontsize=20)
plt.grid(color='gray', linewidth=1, linestyle='--')

plt.show()

[Figure: plot of the sigmoid function $y = \frac{1}{1+e^{-x}}$ produced by the code above]

$(w^{T}x + b)$ is a linear function, but since we want an output that can be interpreted as a probability in $[0,1]$, the sigmoid function is applied to it. As the graph above shows, the sigmoid is bounded between $[0,1]$.

Some observations from the graph (checked numerically in the snippet below):

  • If $z$ is a large positive number, then $\sigma(z) \approx 1$
  • If $z$ is a large negative number, then $\sigma(z) \approx 0$
  • If $z = 0$, then $\sigma(z) = 0.5$
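A quick numerical check of these observations, using the same sigmoid definition as above (the z values are arbitrary):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(10))    # ~0.99995   -> close to 1 for a large positive z
print(sigmoid(-10))   # ~0.0000454 -> close to 0 for a large negative z
print(sigmoid(0))     # exactly 0.5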

Logistic Cost Function

To train the parameters w and b, we need to define a cost function.

Recap:

$\widehat{y}^{(i)} = \sigma(w^{T}x^{(i)} + b)$, where $\sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$

Given $\{(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})\}$, we want $\widehat{y}^{(i)} \approx y^{(i)}$

  • $x^{(i)}$: the $i$-th training example

Loss (Error) Function

The loss function measures the discrepancy between the prediction $\widehat{y}^{(i)}$ and the desired output $y^{(i)}$. In other words, the loss function computes the error for a single training example.

One natural candidate is the squared-error loss:

$L(\widehat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\widehat{y}^{(i)} - y^{(i)})^{2}$

However, combined with the sigmoid this leads to a non-convex optimization problem with many local optima, so logistic regression uses the cross-entropy loss instead:

$L(\widehat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \widehat{y}^{(i)})\right)$

  • If $y^{(i)} = 1$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(\widehat{y}^{(i)})$, so the loss is small when $\widehat{y}^{(i)}$ is close to 1.
  • If $y^{(i)} = 0$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(1 - \widehat{y}^{(i)})$, so the loss is small when $\widehat{y}^{(i)}$ is close to 0.

Cost Function

The cost function is the average of the loss function over the entire training set. We want to find the parameters w and b that minimize this overall cost.

$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \widehat{y}^{(i)})\right]$
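As a minimal numpy sketch of this cost, assuming the predictions and labels are given as 1-D arrays (the toy values below are made up):

import numpy as np

def cost(A, Y):
    """Cross-entropy cost J averaged over m examples.

    A -- predicted probabilities y_hat, shape (m,)
    Y -- true labels in {0, 1},        shape (m,)
    """
    m = Y.shape[0]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# Predictions close to the labels give a small cost
print(cost(np.array([0.9, 0.2, 0.8]), np.array([1, 0, 1])))   # ~0.18
print(cost(np.array([0.1, 0.9, 0.3]), np.array([1, 0, 1])))   # ~1.94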


Gradient Descent

One Example:

For a single training example we have:

$z = w^{T}x + b$

$\widehat{y} = a = \sigma(z)$

$L(a, y) = -(y\log(a) + (1 - y)\log(1 - a))$

Forward:
[Figure: forward-pass computation graph, $x_1, x_2, w_1, w_2, b \Rightarrow z = w_1x_1 + w_2x_2 + b \Rightarrow a = \sigma(z) \Rightarrow L(a, y)$]
Backward:

$\frac{\partial L(a,y)}{\partial a} = \frac{\partial}{\partial a}\left[-(y\log(a) + (1-y)\log(1-a))\right] = -\frac{y}{a} + \frac{1-y}{1-a}$

$\frac{\partial L(a,y)}{\partial z} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) = a - y$

$\frac{\partial L(a,y)}{\partial w_{1}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{1}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{1} = x_{1}(a - y)$

$\frac{\partial L(a,y)}{\partial w_{2}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{2}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{2} = x_{2}(a - y)$

$\frac{\partial L(a,y)}{\partial b} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial b} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times 1 = a - y$
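These formulas can be sanity-checked numerically. The sketch below compares the analytic gradients $dw_1 = x_1(a-y)$, $dw_2 = x_2(a-y)$, $db = a-y$ against central-difference approximations of the loss (all example values are made up):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# An arbitrary single example with two features
x = np.array([0.7, -1.3]); y = 1.0
w = np.array([0.2, -0.5]); b = 0.1

a = sigmoid(np.dot(w, x) + b)
analytic = np.array([x[0] * (a - y), x[1] * (a - y), a - y])   # dw1, dw2, db

eps = 1e-6
numeric = []
for i in range(2):                                   # dw1, dw2 by central differences
    wp, wm = w.copy(), w.copy()
    wp[i] += eps; wm[i] -= eps
    numeric.append((loss(wp, b, x, y) - loss(wm, b, x, y)) / (2 * eps))
numeric.append((loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps))   # db

print(analytic)
print(np.array(numeric))   # agrees with the analytic gradients to roundoff error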

M Training Examples:

Recap:

$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \widehat{y}^{(i)})\right]$

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(\widehat{y}^{(i)}, y^{(i)})}{\partial w}$

# Assume we have two features in this prediction
def logisticRegressionGradientDescent(m, x, y, w, b, alpha):
    """One iteration of forward and backward propagation (explicit-loop version).

    Parameters:
    m (int) -- the number of training examples
    x (matrix: 2 * m) -- the features
    y (vector of length m) -- the labels
    w (vector of length 2) -- the weights of the two features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    J = 0; dw1 = 0; dw2 = 0; db = 0
    for i in range(m):
        # FP: compute the prediction and accumulate the loss
        z = np.dot(w, x[:, i]) + b
        a = 1 / (1 + np.exp(-z))          # sigmoid(z)
        J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))

        # BP: accumulate the gradients
        dz = a - y[i]
        dw1 += x[0, i] * dz
        dw2 += x[1, i] * dz
        db += dz

    J /= m
    dw1 /= m; dw2 /= m; db /= m

    # Gradient descent update
    w[0] -= alpha * dw1
    w[1] -= alpha * dw2
    b -= alpha * db

    return w, b, J

Vectorization

Look at the Power of Vectorization:

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)
toc = time.time()

print(c)
print("Vectorized version:", str((toc - tic) * 1000) + "ms")

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print(c)
print("For loop version:", str((toc - tic) * 1000) + "ms")
249706.30302638497
Vectorized version: 1.2466907501220703ms
249706.30302638162
For loop version: 395.0514793395996ms

Vectorizing Logistic Regression

def logisticRegressionGradientDescent(m, X, Y, w, b, alpha):
    """One iteration of forward and backward propagation (vectorized version).

    Parameters:
    m (int) -- the number of training examples
    X (matrix: n * m) -- the features
    Y (vector: 1 * m) -- the labels
    w (vector: 1 * n) -- the weights of the features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    # FP: all m predictions and the cost in one pass
    Z = np.dot(w, X) + b
    A = 1 / (1 + np.exp(-Z))                                   # sigmoid(Z)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

    # BP: gradients of the cost with respect to w and b
    dZ = A - Y
    dw = np.dot(dZ, X.T) / m
    db = np.sum(dZ) / m

    # Gradient descent update
    w -= alpha * dw
    b -= alpha * db

    return w, b, J
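A minimal usage sketch of this vectorized step on synthetic, randomly generated data (the data, shapes, and hyperparameters below are made up; it assumes the function returns the updated w, b and the cost J as in the version above):

import numpy as np

np.random.seed(0)
n, m = 3, 1000
X = np.random.randn(n, m)
true_w = np.array([[1.0, -2.0, 0.5]])                            # hypothetical "true" weights
Y = (1 / (1 + np.exp(-np.dot(true_w, X))) > 0.5).astype(float)   # synthetic labels, shape (1, m)

w = np.zeros((1, n)); b = 0.0
for step in range(100):
    w, b, J = logisticRegressionGradientDescent(m, X, Y, w, b, alpha=0.1)
    if step % 25 == 0:
        print(step, J)   # the cost decreases as training proceeds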

Broadcasting in Python

a = np.array([[1,2], [3,4]])
print(a)

b = 1
# b = [1,1]
# b = [[1], [1]]

# broadcast b to [[1,1], [1,1]]
# b = np.array([[1,1], [1,1]])

print(a + b)
print(a - b)
[[1 2]
 [3 4]]
[[2 3]
 [4 5]]
[[0 1]
 [2 3]]
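Broadcasting also works between arrays of different shapes, which is what makes an expression like np.dot(w, X) + b valid in the vectorized code above. A small sketch (the matrix values are arbitrary):

import numpy as np

A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

col_sums = A.sum(axis=0)           # shape (4,)
percentages = 100 * A / col_sums   # (3, 4) / (4,): the row of sums is broadcast down the rows
print(percentages)

# A scalar is broadcast against every element
print(A + 100)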

A Note on Python/Numpy vectors

a = np.random.randn(5) # Rank 1 array
print(a.shape, a)

print(a.T)
print(a * a.T)
(5,) [ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[ 1.48029134  0.65203054 -0.08540782  1.08574068 -1.72274456]
[2.19126245 0.42514382 0.0072945  1.17883282 2.96784882]
a = np.random.randn(5, 1)
print(a.shape)
print(a)
print(a.T)
print(a.T.shape)
print(a * a.T)
(5, 1)
[[ 1.80545889]
 [ 2.31719407]
 [ 0.89081914]
 [-1.08760266]
 [ 0.24755189]]
[[ 1.80545889  2.31719407  0.89081914 -1.08760266  0.24755189]]
(1, 5)
[[ 3.25968179  4.18359863  1.60833733 -1.96362189  0.44694476]
 [ 4.18359863  5.36938835  2.06420082 -2.52018643  0.57362578]
 [ 1.60833733  2.06420082  0.79355874 -0.96885726  0.22052396]
 [-1.96362189 -2.52018643 -0.96885726  1.18287954 -0.2692381 ]
 [ 0.44694476  0.57362578  0.22052396 -0.2692381   0.06128194]]
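Because rank-1 arrays behave inconsistently (a looks the same as a.T, and a * a.T is an element-wise product rather than an outer product), one habit suggested in the course is to use explicit column or row vectors and assert their shapes; a minimal sketch:

import numpy as np

a = np.random.randn(5)        # rank-1 array: shape (5,)
a = a.reshape(5, 1)           # make it an explicit column vector
assert a.shape == (5, 1)      # fail loudly if a shape assumption is wrong

print(np.dot(a, a.T).shape)   # (5, 5): a proper outer product
print(np.dot(a.T, a).shape)   # (1, 1): a proper inner product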

Explanation of Logistic Regression

Recap:

We know that $\widehat{y} = p(y = 1 \mid x)$, so:

  • If $y = 1$: $p(y|x) = \widehat{y}$
  • If $y = 0$: $p(y|x) = 1 - \widehat{y}$

Combining the two cases into a single expression:
$p(y|x) = \widehat{y}^{y}(1 - \widehat{y})^{1-y}$

$\log(p(y|x)) = y\log(\widehat{y}) + (1 - y)\log(1 - \widehat{y})$

Since $\log$ is monotonically increasing, maximizing $p(y|x)$ is the same as maximizing $\log(p(y|x))$, i.e. minimizing $-\log(p(y|x))$:
$\begin{aligned} -\log(p(y|x)) &= -(y\log(\widehat{y}) + (1 - y)\log(1 - \widehat{y})) \\ &= L(\widehat{y}, y) \end{aligned}$
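A tiny numeric check that the single expression $\widehat{y}^{y}(1-\widehat{y})^{1-y}$ reduces to the two cases above, and that its negative log equals the loss (the value of $\widehat{y}$ is arbitrary):

import numpy as np

y_hat = 0.8   # an arbitrary predicted probability

for y in (0, 1):
    p = y_hat**y * (1 - y_hat)**(1 - y)                        # combined expression
    per_case = y_hat if y == 1 else 1 - y_hat                  # the case-by-case value
    L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))     # loss L(y_hat, y)
    print(y, p, per_case, -np.log(p), L)                       # p == per_case and -log(p) == L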

Recall that this loss function is convex in $w$ and $b$, so gradient descent can reach its global optimum.

Cost on m examples:
$p(Y|X) = \prod_{i=1}^{m} p(y^{(i)}|x^{(i)})$

$\begin{aligned} \log(p(Y|X)) &= \sum_{i=1}^{m}\log(p(y^{(i)}|x^{(i)})) \\ &= -\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) \\ &= -m\,J(w, b) \end{aligned}$

The factor $\frac{1}{m}$ in $J(w, b)$ is just a constant rescaling, so maximizing the log-likelihood of the training set is equivalent to minimizing $J(w, b)$.

To summarize, by minimizing the cost function $J(w, b)$ we are really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that our training examples are IID (independent and identically distributed).

"Track-Before-Detect with Neural Networks"是一种利用神经网络进行目标跟踪前检测的方法。在传统的跟踪算法中,通常先进行目标检测,然后再进行跟踪。然而,在某些场景下,目标可能非常小、模糊或者被部分遮挡,传统的目标检测方法往往无法准确地检测到目标,从而导致跟踪失败。 "Track-Before-Detect with Neural Networks"的核心思想是在跟踪之前先对目标进行检测。而与传统的目标检测方法不同的是,它使用神经网络来实现目标检测,而不是基于传统的图像处理技术。神经网络通常可以更好地处理图像的特征提取和模式识别任务。 这种方法首先使用神经网络对图像进行处理,提取其中的特征。然后,基于提取的特征,在图像中进行目标检测。如果检测到了目标,就可以在该帧中进行跟踪,随着目标在不同帧之间的位置变化,通过更新模型来实现目标的连续跟踪。 相对于传统方法,"Track-Before-Detect with Neural Networks"有以下优势:首先,神经网络可以自动学习图像中的特征,无需手动设计特征提取算法。其次,神经网络具有较强的泛化能力,可以适应不同目标的形状、尺寸和外观变化。此外,神经网络还可以通过训练进行优化,提高准确性和鲁棒性。因此,这种方法可以在复杂的环境中更准确地检测和跟踪目标。 总之,"Track-Before-Detect with Neural Networks"是一种利用神经网络实现目标跟踪和检测的方法,具有较好的准确性和鲁棒性,在实际应用中具有广泛的应用前景。