Logistic Regression
Description
Logistic regression is a learning algorithm used in supervised learning problems where the output labels $y$ are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training labels.
Example: Cat vs No-cat
Given an image represented by a feature vector $x$, the algorithm will estimate the probability that a cat is in the image.
Given $x$, $\widehat{y} = P(y=1 \mid x)$, where $\widehat{y} \in [0,1]$
The parameters used in Logistic regression are:
- The input feature vector: $x \in \mathbb{R}^{n_{x}}$, where $n_{x}$ is the number of features
- The training label: $y \in \{0,1\}$
- The weights: $w \in \mathbb{R}^{n_{x}}$, where $n_{x}$ is the number of features
- The threshold (bias): $b \in \mathbb{R}$
- The output: $\widehat{y} = \sigma(w^{T}x + b)$
- Sigmoid function: $s = \sigma(w^{T}x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ (a minimal NumPy sketch of this prediction follows below)
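As a small illustration of the output formula in the last two bullets, here is a minimal NumPy sketch; the feature values, weights, and bias below are made-up numbers, not from the original.

import numpy as np

x = np.array([0.5, -1.2, 2.0])   # one input feature vector, shape (n_x,) with n_x = 3
w = np.array([0.1, 0.4, -0.3])   # weights, same shape as x
b = 0.2                          # bias / threshold

z = np.dot(w, x) + b             # linear part: w^T x + b
y_hat = 1 / (1 + np.exp(-z))     # sigmoid(z): the predicted P(y = 1 | x)
print(y_hat)                     # a value in (0, 1)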
Logistic Function
import numpy as np
import time
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-10, 10, 0.001)
y = 1 / (1 + np.exp(-x))
plt.plot(x,y)
plt.suptitle(r'$y=\frac{1}{1+e^{-x}}$', fontsize=20)
plt.grid(color='gray', linewidth=1, linestyle='--')
plt.show()

$w^{T}x + b$ is a linear function (like $ax + b$), but since we need a probability constrained to $[0,1]$, the sigmoid function is applied. The sigmoid is bounded between 0 and 1, as shown in the graph above.
Some observations from the graph (verified numerically in the short check below):
- If $z$ is a large positive number, then $\sigma(z) \approx 1$
- If $z$ is a large negative number, then $\sigma(z) \approx 0$
- If $z = 0$, then $\sigma(z) = 0.5$
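A quick numeric check of these three observations, using the same formula as in the plot above (the sample values of $z$ are arbitrary):

import numpy as np

for z in (10.0, -10.0, 0.0):
    s = 1 / (1 + np.exp(-z))
    print(f"sigma({z}) = {s:.6f}")
# sigma(10.0)  is approximately 0.999955
# sigma(-10.0) is approximately 0.000045
# sigma(0.0)   is exactly 0.5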
Logistic Cost Function
To train the parameters $w$ and $b$, we need to define a cost function.
Recap:
$\widehat{y}^{(i)} = \sigma(w^{T}x^{(i)} + b)$, where $\sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$

Given $\{(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})\}$, we want $\widehat{y}^{(i)} \approx y^{(i)}$
- $x^{(i)}$: the i-th training example
Loss (Error) Function
The loss function measures the discrepancy between the prediction $\widehat{y}^{(i)}$ and the desired output $y^{(i)}$. In other words, the loss function computes the error for a single training example.
A natural first choice would be the squared error,

$L(\widehat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\widehat{y}^{(i)} - y^{(i)})^{2}$

but combined with the sigmoid this makes the optimization problem non-convex (many local optima), so logistic regression uses the cross-entropy loss instead:

$L(\widehat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)}\log(\widehat{y}^{(i)}) + (1 - y^{(i)})\log(1-\widehat{y}^{(i)})\right)$
- If $y^{(i)} = 1$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(\widehat{y}^{(i)})$, so $\widehat{y}^{(i)}$ should be close to 1.
- If $y^{(i)} = 0$: $L(\widehat{y}^{(i)}, y^{(i)}) = -\log(1 - \widehat{y}^{(i)})$, so $\widehat{y}^{(i)}$ should be close to 0 (both cases are checked numerically below).
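A small numeric check of the two cases (the prediction values 0.9 and 0.1 are illustrative): the loss is small only when the prediction agrees with the label, and large when it confidently disagrees.

import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.9, 1))   # y = 1, prediction close to 1 -> small loss (~0.105)
print(loss(0.1, 1))   # y = 1, prediction close to 0 -> large loss (~2.303)
print(loss(0.1, 0))   # y = 0, prediction close to 0 -> small loss (~0.105)
print(loss(0.9, 0))   # y = 0, prediction close to 1 -> large loss (~2.303)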
Cost Function
The cost function is the average of the loss function over the entire training set. We then look for the parameters $w$ and $b$ that minimize this overall cost.
$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$
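A vectorized sketch of this cost over a whole training set; the label and prediction arrays below are made-up values for illustration.

import numpy as np

def cost(Y_hat, Y):
    # J(w, b): average cross-entropy loss over the m training examples
    m = Y.shape[0]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

Y     = np.array([1, 0, 1, 1])          # labels y^(i)
Y_hat = np.array([0.8, 0.2, 0.6, 0.9])  # predictions yhat^(i)
print(cost(Y_hat, Y))                   # approximately 0.266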
Gradient Descent
One Example:
For a single training example we have:
$z = w^{T}x + b$

$\widehat{y} = a = \sigma(z)$

$L(a,y) = -(y\log(a) + (1 - y)\log(1 - a))$
Forward: compute $z = w^{T}x + b$, then $a = \sigma(z)$, and finally the loss $L(a, y)$.

Backward:
$\frac{\partial L(a,y)}{\partial a} = \frac{\partial}{\partial a}\left[-(y\log(a)+(1-y)\log(1-a))\right] = -\frac{y}{a} + \frac{1-y}{1-a}$

$\frac{\partial L(a,y)}{\partial z} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) = a - y$

$\frac{\partial L(a,y)}{\partial w_{1}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{1}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{1} = x_{1}(a-y)$

$\frac{\partial L(a,y)}{\partial w_{2}} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w_{2}} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times x_{2} = x_{2}(a-y)$

$\frac{\partial L(a,y)}{\partial b} = \frac{\partial L(a,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial b} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) \times 1 = a-y$
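These formulas can be sanity-checked numerically. The short sketch below (all values are illustrative) compares finite-difference estimates against the analytic gradients $x_{1}(a-y)$ and $a-y$.

import numpy as np

def loss_wb(w, b, x, y):
    a = 1 / (1 + np.exp(-(np.dot(w, x) + b)))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w = np.array([0.3, -0.7]); b = 0.1
x = np.array([1.5, 2.0]);  y = 1.0
a = 1 / (1 + np.exp(-(np.dot(w, x) + b)))

eps = 1e-6
num_dw1 = (loss_wb(w + np.array([eps, 0.0]), b, x, y)
           - loss_wb(w - np.array([eps, 0.0]), b, x, y)) / (2 * eps)
num_db = (loss_wb(w, b + eps, x, y) - loss_wb(w, b - eps, x, y)) / (2 * eps)

print(num_dw1, x[0] * (a - y))   # both approximately x1 * (a - y)
print(num_db, a - y)             # both approximately (a - y)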
M Training Examples:
Recap:
$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{y}^{(i)}) + (1-y^{(i)})\log(1-\widehat{y}^{(i)})\right]$

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial L(\widehat{y}^{(i)}, y^{(i)})}{\partial w}$
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assume we have two features in this prediction
def logisticRegressionGradientDescent(m, x, y, w, b, alpha):
    """FP & BP: one gradient descent step with explicit loops.

    Parameters:
    m (int) -- the number of training examples
    x (matrix: 2 * m) -- the features
    y (vector of length m) -- the labels
    w (vector of length 2) -- the weights of the two features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    J = 0; dw1 = 0; dw2 = 0; db = 0
    for i in range(m):
        # FP: linear score, activation, and loss for example i
        z_i = np.dot(w, x[:, i]) + b
        a_i = sigmoid(z_i)
        J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))
        # BP: accumulate the gradients contributed by example i
        dz_i = a_i - y[i]
        dw1 += x[0, i] * dz_i
        dw2 += x[1, i] * dz_i
        db += dz_i
    # Average the cost and the gradients over the m examples
    J /= m
    dw1 /= m; dw2 /= m; db /= m
    # One gradient descent update
    w[0] -= alpha * dw1
    w[1] -= alpha * dw2
    b -= alpha * db
    return w, b, J
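A minimal, hypothetical call of the function above on a few synthetic examples (one gradient step; all numbers are made up):

m = 4
x = np.array([[0.5, -1.0,  2.0, 0.0],    # feature 1 of the 4 examples
              [1.5,  0.5, -0.5, 1.0]])   # feature 2 of the 4 examples
y = np.array([1, 0, 1, 0])
w = np.zeros(2)
b = 0.0
w, b, J = logisticRegressionGradientDescent(m, x, y, w, b, alpha=0.1)
print(w, b, J)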
Vectorization
Look at the Power of Vectorization:
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(c)
print("Vectorized version:", str((toc - tic) * 1000) + "ms")
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print(c)
print("For loop version:", str((toc - tic) * 1000) + "ms")
249706.30302638497
Vectorized version: 1.2466907501220703ms
249706.30302638162
For loop version: 395.0514793395996ms
Vectorizing Logistic Regression
def logisticRegressionGradientDescent(m, X, Y, w, b, alpha):
    """Vectorized FP & BP: one gradient descent step.

    Parameters:
    m (int) -- the number of training examples
    X (matrix: n * m) -- the features
    Y (vector: 1 * m) -- the labels
    w (vector: 1 * n) -- the weights of the different features
    b (float) -- the bias
    alpha (float) -- the learning rate
    """
    # FP
    Z = np.dot(w, X) + b               # shape (1, m)
    A = sigmoid(Z)                     # sigmoid as defined above
    J = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    # BP
    dZ = A - Y
    dw = 1 / m * np.dot(dZ, X.T)       # shape (1, n), same as w
    db = 1 / m * np.sum(dZ)
    # Gradient descent update
    w -= alpha * dw
    b -= alpha * db
    return w, b, J
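A minimal training-loop sketch using the vectorized function; the synthetic data, sizes, and hyperparameters below are illustrative assumptions, not from the original.

np.random.seed(0)
n, m = 2, 1000
X = np.random.randn(n, m)
# Noisy, roughly linearly separable toy labels (made up for illustration)
Y = ((X[0, :] - X[1, :] + 0.5 * np.random.randn(m)) > 0).astype(float).reshape(1, m)

w = np.zeros((1, n))
b = 0.0
for step in range(500):
    w, b, J = logisticRegressionGradientDescent(m, X, Y, w, b, alpha=0.1)

predictions = (sigmoid(np.dot(w, X) + b) > 0.5).astype(float)
print("final cost:", J, "training accuracy:", np.mean(predictions == Y))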
Broadcasting in Python
a = np.array([[1,2], [3,4]])
print(a)
b = 1
# b = [1,1]
# b = [[1], [1]]
# broadcast b to [[1,1], [1,1]]
# b = np.array([[1,1], [1,1]])
print(a + b)
print(a - b)
[[1 2]
[3 4]]
[[2 3]
[4 5]]
[[0 1]
[2 3]]
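Broadcasting also works between arrays of different shapes when their trailing dimensions are compatible; a small made-up example dividing a matrix by its column sums:

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
col_sums = A.sum(axis=0)       # shape (3,): [5., 7., 9.]
print(100 * A / col_sums)      # (2, 3) / (3,) broadcasts across the rows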
A Note on Python/Numpy vectors
a = np.random.randn(5) # Rank 1 array
print(a.shape, a)
print(a.T)
print(a * a.T)
(5,) [ 1.48029134 0.65203054 -0.08540782 1.08574068 -1.72274456]
[ 1.48029134 0.65203054 -0.08540782 1.08574068 -1.72274456]
[2.19126245 0.42514382 0.0072945 1.17883282 2.96784882]
a = np.random.randn(5, 1)
print(a.shape)
print(a)
print(a.T)
print(a.T.shape)
print(a * a.T)
(5, 1)
[[ 1.80545889]
[ 2.31719407]
[ 0.89081914]
[-1.08760266]
[ 0.24755189]]
[[ 1.80545889 2.31719407 0.89081914 -1.08760266 0.24755189]]
(1, 5)
[[ 3.25968179 4.18359863 1.60833733 -1.96362189 0.44694476]
[ 4.18359863 5.36938835 2.06420082 -2.52018643 0.57362578]
[ 1.60833733 2.06420082 0.79355874 -0.96885726 0.22052396]
[-1.96362189 -2.52018643 -0.96885726 1.18287954 -0.2692381 ]
[ 0.44694476 0.57362578 0.22052396 -0.2692381 0.06128194]]
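The usual remedy for the ambiguity above, sketched here as a common practice rather than taken from the outputs themselves: avoid rank-1 arrays by committing to explicit column (or row) vectors and asserting the intended shape.

a = np.random.randn(5)      # rank-1 array: shape (5,); a.T does nothing
a = a.reshape(5, 1)         # commit to an explicit column vector
assert a.shape == (5, 1)    # cheap sanity check on the intended shape
print((a * a.T).shape)      # now a proper outer product: (5, 5)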
Explanation of Logistic Regression
Recap:
We know that $\widehat{y} = p(y=1 \mid x)$, so:
- If $y = 1$: $p(y \mid x) = \widehat{y}$
- If $y = 0$: $p(y \mid x) = 1 - \widehat{y}$
These two cases can be combined into a single expression:
$p(y \mid x) = \widehat{y}^{\,y}(1-\widehat{y})^{1-y}$
$\log(p(y \mid x)) = y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})$
We want to maximize $p(y \mid x)$; since $\log$ is monotonically increasing, this is equivalent to minimizing $-\log(p(y \mid x))$:
$\begin{aligned} -\log(p(y \mid x)) &= -(y\log(\widehat{y}) + (1-y)\log(1-\widehat{y})) \\ &= L(\widehat{y}, y) \end{aligned}$
Note that this loss function is convex in $w$ and $b$, so gradient descent can reach its global optimum.
Cost on m examples:
Assuming the training examples are independent:

$p(Y \mid X) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)})$
$\begin{aligned} \log(p(Y \mid X)) &= \sum_{i=1}^{m}\log\left(p(y^{(i)} \mid x^{(i)})\right) \\ &= -\sum_{i=1}^{m} L(\widehat{y}^{(i)}, y^{(i)}) \\ &= -m\,J(w,b) \end{aligned}$

(the $\frac{1}{m}$ factor in $J(w,b)$ is only a scaling convention and does not change which $w$ and $b$ minimize it)
To summarize, by minimizing the cost function $J(w,b)$ we are really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that the training examples are IID (independent and identically distributed).