Supervised learning
Applications:
- Standard NN
  - Real estate
  - Online advertising
- CNN
  - Photo tagging
- RNN
  - Speech recognition
  - Machine translation
- Custom/hybrid NN
  - Autonomous driving
Notation
A single example: $(x,y)$, $x \in R^{n_x}$, $y \in \{0,1\}$
$m$ training examples: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$
$$X=\left[\begin{matrix} | & | & & & | \\ x^{(1)} & x^{(2)} & \dots & \dots & x^{(m)} \\ | & | & & & | \end{matrix}\right]$$
$X$ has $n_x$ rows and $m$ columns; each column is one training example.
$X \in R^{n_x \times m}$, $Y=[y^{(1)},y^{(2)},\dots,y^{(m)}]$, $Y \in R^{1 \times m}$
X.shape = (n_x, m), Y.shape = (1, m)
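In NumPy these shapes look like the following (a small sketch; the sizes and random data are made up for illustration):

```python
import numpy as np

n_x, m = 3, 5                        # hypothetical: 3 features, 5 examples
X = np.random.rand(n_x, m)           # column X[:, i] is the i-th example x^(i)
Y = np.random.randint(0, 2, (1, m))  # labels in {0, 1}

print(X.shape)  # (3, 5) -> (n_x, m)
print(Y.shape)  # (1, 5) -> (1, m)
```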
Logistic Regression
Given $x \in R^{n_x}$, want $\hat{y}=P(y=1|x)$, so $0 \le \hat{y} \le 1$.
Parameters: $w \in R^{n_x}$, $b \in R$
Output: $\hat{y}=\sigma(w^Tx+b)$
$\sigma$ is the activation function: $\sigma(z)=\frac{1}{1+e^{-z}}$
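A minimal NumPy sketch of this forward computation (the sizes and parameter values are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

n_x = 4
x = np.random.randn(n_x, 1)   # one example, shape (n_x, 1)
w = np.zeros((n_x, 1))        # weights, shape (n_x, 1)
b = 0.0                       # bias, a scalar

y_hat = sigmoid(np.dot(w.T, x) + b)   # shape (1, 1), value strictly in (0, 1)
```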
Logistic Regression Cost Function
$$\hat{y}^{(i)}=\sigma(w^Tx^{(i)}+b)$$
Given $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$, want $\hat{y}^{(i)} \approx y^{(i)}$.
Loss on a single training example
One could use the squared error
$$L(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2$$
to measure the gap, but gradient descent then works poorly because the resulting cost function is non-convex (it has many local optima).
$$\hat{y}=\sigma(w^Tx+b),\quad \text{where } \sigma(z)=\frac{1}{1+e^{-z}},\quad \text{interpret } \hat{y}=P(y=1|x)$$
If $y=1$: $P(y|x)=\hat{y}$
If $y=0$: $P(y|x)=1-\hat{y}$
Combining the two cases above:
$$P(y|x)=\hat{y}^y(1-\hat{y})^{1-y}$$
Since $\log$ is a strictly monotonically increasing function, maximizing $P(y|x)$ is equivalent to maximizing
$$\log P(y|x)=y\log\hat{y}+(1-y)\log(1-\hat{y})$$
Add a negative sign because we want to minimize a cost, so
$$L(\hat{y},y)=-\log P(y|x)=-[y\log\hat{y}+(1-y)\log(1-\hat{y})]$$
This is the loss function for a single example.
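For intuition, the loss for one example can be evaluated directly (a sketch; the clipping constant `eps` is an assumption added to avoid log(0)):

```python
import numpy as np

def logistic_loss(y_hat, y, eps=1e-12):
    # L(y_hat, y) = -[y*log(y_hat) + (1-y)*log(1-y_hat)]
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(logistic_loss(0.9, 1))  # small loss: confident and correct
print(logistic_loss(0.9, 0))  # large loss: confident and wrong
```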
Cost function over the m training examples
Assuming the training examples are i.i.d. (independent and identically distributed):
$$P(\text{labels in training set})=\prod_{i=1}^m P(y^{(i)}|x^{(i)})$$
Maximizing this likelihood is the same as maximizing its $\log$:
$$\begin{aligned} \log P(\text{labels in training set}) &=\log \prod_{i=1}^m P(y^{(i)}|x^{(i)})\\ &=\sum_{i=1}^m \log P(y^{(i)}|x^{(i)})\\ &=-\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) \end{aligned}$$
Maximizing this is equivalent to minimizing $\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)})$ (drop the negative sign and minimize instead), and a $\frac{1}{m}$ scaling factor is added for better scaling (it does not change the optimum). So the overall cost function is
$$J(w,b)=\frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)})$$
$J(w,b)$ is a convex function, which is the particular reason this loss is chosen as the cost function.
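A sketch of computing $J(w,b)$ over a whole data matrix (assumes `X` of shape `(n_x, m)` and `Y` of shape `(1, m)`; the clipping is an added safeguard):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y, eps=1e-12):
    # J(w, b) = (1/m) * sum_i L(y_hat^(i), y^(i))
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)      # predictions, shape (1, m)
    A = np.clip(A, eps, 1 - eps)         # avoid log(0)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.sum(losses) / m)
```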
Gradient Descent
Repeat:{
$$w := w-\alpha \frac{dJ(w,b)}{dw}$$
$$b := b-\alpha \frac{dJ(w,b)}{db}$$
}
$\alpha$ is the learning rate.
Computation Graph
The forward pass goes from left to right to compute the output; derivatives are computed from right to left. With $z=w^Tx+b$, $a=\sigma(z)$, and loss $L(a,y)$:
$$\frac{dL}{da}=-\frac{y}{a}+\frac{1-y}{1-a}$$
$$\frac{dL}{dz}=\frac{dL}{da}\cdot \frac{da}{dz}=\left(-\frac{y}{a}+\frac{1-y}{1-a}\right)a(1-a)=a-y=dz$$
$$\frac{dL}{dw_1}=\frac{dL}{dz}\cdot \frac{dz}{dw_1}=dz \cdot x_1=dw_1$$
$$\frac{dL}{dw_2}=\frac{dL}{dz}\cdot \frac{dz}{dw_2}=dz \cdot x_2=dw_2$$
$$\frac{dL}{db}=\frac{dL}{dz}\cdot \frac{dz}{db}=dz=db$$
So for a single example, the updates are:
$$w_1 := w_1-\alpha \, dz \cdot x_1$$
$$w_2 := w_2-\alpha \, dz \cdot x_2$$
$$b := b-\alpha \, dz$$
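A sketch of this single-example forward/backward pass with two features (the numbers are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical single example with two features and its label
x1, x2, y = 1.5, -0.8, 1.0
w1, w2, b = 0.1, -0.2, 0.0
alpha = 0.01

# forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward pass: dz = a - y, then the chain rule for each parameter
dz = a - y
dw1, dw2, db = dz * x1, dz * x2, dz

# gradient descent update
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db
```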
Gradient descent over m examples
$$J(w,b)=\frac{1}{m}\sum_{i=1}^m L(a^{(i)},y^{(i)}),\quad a^{(i)}=\hat{y}^{(i)}=\sigma(z^{(i)})=\sigma(w^T x^{(i)}+b)$$
$$\begin{aligned} \frac{dJ(w,b)}{dw_1} &=\frac{1}{m}\sum_{i=1}^m \frac{dL(a^{(i)},y^{(i)})}{dw_1}\\ &=\frac{1}{m}\sum_{i=1}^m dw_1^{(i)} \quad \text{(computed using } (x_1^{(i)},y^{(i)})\text{)} \end{aligned}$$
J=0; dw1=0; dw2=0; db=0               # accumulators (note db, not b)
for i = 1 to m                        # first loop: over the m examples
    z[i] = w_T * x[i] + b
    a[i] = sigmoid(z[i])
    J += -(y[i]*log(a[i]) + (1-y[i])*log(1-a[i]))
    dz[i] = a[i] - y[i]
    # second loop: over the n_x features
    dw1 += dz[i]*x1[i]
    dw2 += dz[i]*x2[i]
    ...
    db += dz[i]
end
dw1 = dw1/m; dw2 = dw2/m; db = db/m; J = J/m
There are two explicit loops here (over the examples and over the features), which is inefficient $\longrightarrow$ Vectorization
Vectorization
$$z=w^Tx+b$$
- Non-vectorized:
  z = 0
  for i in range(n_x):
      z += w[i] * x[i]
  z = z + b
- Vectorized: z = np.dot(w.T, x) + b
$$w = \left[\begin{matrix} \vdots \\ w_i \\ \vdots \end{matrix}\right],\quad x = \left[\begin{matrix} \vdots \\ x_i \\ \vdots \end{matrix}\right],\quad w \in R^{n_x},\ x \in R^{n_x}$$
dw = np.zeros((n_x, 1)), with x.shape = (n_x, 1), so the per-feature gradient accumulation also becomes a single vector operation.
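A quick comparison of the two approaches (the timings depend on the machine, but the vectorized np.dot is typically orders of magnitude faster):

```python
import time
import numpy as np

n_x = 1_000_000
w = np.random.rand(n_x)
x = np.random.rand(n_x)
b = 0.0

# non-vectorized: explicit loop over the features
tic = time.time()
z = 0.0
for i in range(n_x):
    z += w[i] * x[i]
z += b
print("loop:       %.1f ms" % (1000 * (time.time() - tic)))

# vectorized: a single dot product
tic = time.time()
z_vec = np.dot(w, x) + b
print("vectorized: %.1f ms" % (1000 * (time.time() - tic)))
```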
So in code (only one explicit loop, over the examples, remains):
J = 0; db = 0
dw = np.zeros((n_x, 1))
for i = 1 to m                        # one loop over the examples
    z[i] = w_T * x[i] + b
    a[i] = sigmoid(z[i])
    J += -(y[i]*log(a[i]) + (1-y[i])*log(1-a[i]))
    dz[i] = a[i] - y[i]
    dw += x[i] * dz[i]                # vectorized over the n_x features
    db += dz[i]
end
dw = dw/m; db = db/m; J = J/m
Vectorizing Logistic Regression
$$z^{(1)}=w^Tx^{(1)}+b,\quad a^{(1)}=\sigma(z^{(1)})$$
$$z^{(2)}=w^Tx^{(2)}+b,\quad a^{(2)}=\sigma(z^{(2)})$$
$$\dots$$
$$X= \left[\begin{matrix} | & | & & & | \\ x^{(1)} & x^{(2)} & \dots & \dots & x^{(m)} \\ | & | & & & | \end{matrix}\right]$$
$$\begin{aligned} Z &=[z^{(1)},z^{(2)},z^{(3)},\dots,z^{(m)}]\\ &=w^{T}X+[b,b,\dots,b]\\ &=[w^{T}x^{(1)}+b,\ w^{T}x^{(2)}+b,\ \dots,\ w^{T}x^{(m)}+b] \end{aligned}$$
Z = np.dot(w.T, X) + b   # Python broadcasting expands the scalar b into a (1, m) row vector
Vectorizing Logistic Regression's Gradient Descent
$$dz^{(i)}=a^{(i)}-y^{(i)}$$
$$dz=[dz^{(1)},dz^{(2)},\dots,dz^{(m)}],\quad A=[a^{(1)},a^{(2)},\dots,a^{(m)}],\quad Y=[y^{(1)},y^{(2)},\dots,y^{(m)}]$$
$$dz=A-Y$$
$$db=\frac{1}{m}\sum^m_{i=1} dz^{(i)}$$
db=np.sum(dz)/m
$$dw=\frac{1}{m} \cdot X \cdot dz^T$$
$$dw=\frac{1}{m} \left[\begin{matrix} | & | & & & | \\ x^{(1)} & x^{(2)} & \dots & \dots & x^{(m)} \\ | & | & & & | \end{matrix}\right] \cdot \left[\begin{matrix} dz^{(1)} \\ dz^{(2)} \\ \vdots \\ dz^{(m)} \end{matrix}\right]=\frac{1}{m}\left[x^{(1)}dz^{(1)}+x^{(2)}dz^{(2)}+\dots+x^{(m)}dz^{(m)}\right]$$
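A small check (a sketch with made-up data) that this vectorized gradient matches the per-example loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 3, 5
X = rng.normal(size=(n_x, m))
dz = rng.normal(size=(1, m))

# vectorized: dw = (1/m) * X @ dz.T, shape (n_x, 1)
dw_vec = np.dot(X, dz.T) / m

# loop: average of x^(i) * dz^(i) over the examples
dw_loop = np.zeros((n_x, 1))
for i in range(m):
    dw_loop += X[:, [i]] * dz[0, i]
dw_loop /= m

print(np.allclose(dw_vec, dw_loop))  # True
```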
for iter in range(2000):              # gradient descent iterations
    Z = np.dot(W.T, X) + b            # forward pass, Z has shape (1, m)
    A = sigmoid(Z)
    dz = A - Y                        # backward pass
    dw = np.dot(X, dz.T) / m
    db = np.sum(dz) / m
    W = W - alpha * dw                # parameter update
    b = b - alpha * db
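Putting everything together, a self-contained sketch that trains this vectorized logistic regression on synthetic data (the data generation and hyperparameters here are illustrative assumptions, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic data: X has shape (n_x, m), Y has shape (1, m)
rng = np.random.default_rng(0)
n_x, m = 2, 1000
X = rng.normal(size=(n_x, m))
true_w = np.array([[2.0], [-3.0]])                  # hypothetical "true" rule
Y = (np.dot(true_w.T, X) + 0.5 > 0).astype(float)   # labels from that linear rule

# initialize parameters
W = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1

for it in range(2000):
    Z = np.dot(W.T, X) + b          # forward pass
    A = sigmoid(Z)
    dz = A - Y                      # backward pass
    dw = np.dot(X, dz.T) / m
    db = np.sum(dz) / m
    W -= alpha * dw                 # parameter update
    b -= alpha * db

preds = (sigmoid(np.dot(W.T, X) + b) > 0.5).astype(float)
print("training accuracy:", float(np.mean(preds == Y)))
```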