The toy example “A Neural Network in 11 lines of Python” is famous among machine learning beginners. I wonder how many people really look into it, though, because the derivation given by the author is a bit strange. Let’s go through it in detail together, especially the gradient back-propagation part.
First of all, the code:
import numpy as np

X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
syn0 = 2*np.random.random((3,4)) - 1   # input -> hidden weights, shape (3, 4)
syn1 = 2*np.random.random((4,1)) - 1   # hidden -> output weights, shape (4, 1)
for j in xrange(60000):
    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))             # hidden layer activations
    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))            # output layer activation
    l2_delta = (y - l2)*(l2*(1-l2))                  # error signal at the output
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))  # error signal at the hidden layer
    syn1 += l1.T.dot(l2_delta)
    syn0 += X.T.dot(l1_delta)
It’s a simple MLP with one hidden layer. syn0 and syn1 are the weights of the input-to-hidden and hidden-to-output connections. X is the input, l1 is the value of the hidden layer, and l2 is the value of the output layer (a single node).
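To make the shapes concrete, here is a quick check (my own annotation, not part of the original post):

import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])  # 4 samples x 3 features
y = np.array([[0,1,1,0]]).T                      # 4 samples x 1 target
syn0 = 2*np.random.random((3,4)) - 1             # 3 inputs -> 4 hidden nodes
syn1 = 2*np.random.random((4,1)) - 1             # 4 hidden nodes -> 1 output node
print(X.shape, y.shape, syn0.shape, syn1.shape)  # (4, 3) (4, 1) (3, 4) (4, 1)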
First, the forward propagation:
input -> hidden:
$$l_1 = \mathrm{sigmoid}(X \cdot \mathrm{syn0})$$
hidden -> output:
$$l_2 = \mathrm{sigmoid}(l_1 \cdot \mathrm{syn1})$$
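In code, the forward pass is just the two lines inside the training loop; a standalone sketch with an explicit sigmoid helper (my own refactoring, not the author’s) looks like this:

import numpy as np

def sigmoid(z):
    # element-wise logistic function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, syn0, syn1):
    l1 = sigmoid(np.dot(X, syn0))   # input -> hidden
    l2 = sigmoid(np.dot(l1, syn1))  # hidden -> output
    return l1, l2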
MSE loss:

$$L = \frac{1}{2}(y - l_2)^2$$
(We will talk about what loss this code uses later)
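For reference, this loss in NumPy (a sketch of my own; I sum over the four training samples, and the original post never actually computes the loss value) would be:

import numpy as np

def mse_loss(y, l2):
    # L = 1/2 * sum of squared errors over all samples
    return 0.5 * np.sum((y - l2) ** 2)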
Then, back propagation:
We want to know $\frac{\partial L}{\partial \mathrm{syn1}}$ and $\frac{\partial L}{\partial \mathrm{syn0}}$.
For $\frac{\partial L}{\partial \mathrm{syn1}}$:

$$\frac{\partial L}{\partial \mathrm{syn1}} = \frac{\partial L}{\partial l_2} \cdot \frac{\partial l_2}{\partial \mathrm{syn1}}$$

in which

$$\frac{\partial L}{\partial l_2} = l_2 - y$$

$$\frac{\partial l_2}{\partial \mathrm{syn1}} = \frac{\partial l_2}{\partial (l_1 \cdot \mathrm{syn1})} \cdot \frac{\partial (l_1 \cdot \mathrm{syn1})}{\partial \mathrm{syn1}} = l_2(1 - l_2) \cdot l_1$$

Multiplying these two parts together:

$$\frac{\partial L}{\partial \mathrm{syn1}} = (l_2 - y) \cdot l_2(1 - l_2) \cdot l_1$$
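Written directly in NumPy (grad_syn1 is my own helper name, not part of the original code), the formula reads:

import numpy as np

def grad_syn1(l1, l2, y):
    # dL/dsyn1 = l1^T . ((l2 - y) * l2 * (1 - l2)); result has the same shape as syn1
    return l1.T.dot((l2 - y) * l2 * (1 - l2))

Note the sign: the original code computes l2_delta = (y - l2)*(l2*(1-l2)) and then does syn1 += l1.T.dot(l2_delta), which is the same as syn1 -= grad_syn1(l1, l2, y).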
And for $\frac{\partial L}{\partial \mathrm{syn0}}$:

$$\frac{\partial L}{\partial \mathrm{syn0}} = \frac{\partial L}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1} \cdot \frac{\partial l_1}{\partial \mathrm{syn0}}$$

$$\frac{\partial L}{\partial l_2} = l_2 - y$$

as derived above,

$$\frac{\partial l_2}{\partial l_1} = \frac{\partial l_2}{\partial (l_1 \cdot \mathrm{syn1})} \cdot \frac{\partial (l_1 \cdot \mathrm{syn1})}{\partial l_1} = l_2(1 - l_2) \cdot \mathrm{syn1}$$

$$\frac{\partial l_1}{\partial \mathrm{syn0}} = \frac{\partial l_1}{\partial (X \cdot \mathrm{syn0})} \cdot \frac{\partial (X \cdot \mathrm{syn0})}{\partial \mathrm{syn0}} = l_1(1 - l_1) \cdot X$$

Multiplying these three parts together:

$$\frac{\partial L}{\partial \mathrm{syn0}} = (l_2 - y) \cdot l_2(1 - l_2) \cdot \mathrm{syn1} \cdot l_1(1 - l_1) \cdot X$$
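The same formula in NumPy (again, grad_syn0 is my own name; note how the syn1 factor appears as a dot product with syn1.T so that the shapes work out):

import numpy as np

def grad_syn0(X, l1, l2, y, syn1):
    # error propagated from the output back to the hidden layer, shape (4, 4)
    hidden_error = ((l2 - y) * l2 * (1 - l2)).dot(syn1.T)
    # dL/dsyn0 = X^T . (hidden_error * l1 * (1 - l1)); same shape as syn0, i.e. (3, 4)
    return X.T.dot(hidden_error * l1 * (1 - l1))

This is the negative of what the code accumulates: l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1)) and syn0 += X.T.dot(l1_delta) is the same as syn0 -= grad_syn0(X, l1, l2, y, syn1).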
These derivatives all match the code up to the sign: the code uses (y - l2) together with +=, i.e. it adds the negative gradient, which is plain gradient descent with a step size of 1. So I’d guess this piece of code is actually using MSE as the cost function, even though the author never makes that part clear.
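One way to convince yourself is a finite-difference check of the derived formula against the MSE loss. This is a sketch of my own (the names loss, grad and num are mine, not from the post), checking only syn1 for brevity:

import numpy as np

np.random.seed(1)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)
y = np.array([[0,1,1,0]], dtype=float).T
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

def loss(s0, s1):
    # MSE loss of the two-layer network, L = 1/2 * sum (y - l2)^2
    l1 = 1/(1+np.exp(-X.dot(s0)))
    l2 = 1/(1+np.exp(-l1.dot(s1)))
    return 0.5*np.sum((y - l2)**2)

# analytical gradient w.r.t. syn1, from the formula derived above
l1 = 1/(1+np.exp(-X.dot(syn0)))
l2 = 1/(1+np.exp(-l1.dot(syn1)))
grad = l1.T.dot((l2 - y)*l2*(1 - l2))

# numerical gradient via central differences
eps = 1e-6
num = np.zeros_like(syn1)
for i in range(syn1.shape[0]):
    for k in range(syn1.shape[1]):
        d = np.zeros_like(syn1)
        d[i, k] = eps
        num[i, k] = (loss(syn0, syn1 + d) - loss(syn0, syn1 - d)) / (2*eps)

print(np.max(np.abs(grad - num)))   # should be very small, on the order of 1e-9 or less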