The toy example “A Neural Network in 11 lines of Python” is famous among machine learning beginners. I wonder how many people really look into it, though, because the derivation given by the author is a bit strange. Let’s go through it in detail together, especially the gradient back-propagation part.
First of all, the code:
import numpy as np

X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
syn0 = 2*np.random.random((3,4)) - 1   # input -> hidden weights, shape (3, 4)
syn1 = 2*np.random.random((4,1)) - 1   # hidden -> output weights, shape (4, 1)
for j in xrange(60000):
    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))             # hidden layer activations
    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))            # output layer activation
    l2_delta = (y - l2)*(l2*(1-l2))                  # error signal at the output
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))  # error signal at the hidden layer
    syn1 += l1.T.dot(l2_delta)
    syn0 += X.T.dot(l1_delta)
It’s a simple MLP with one hidden layer. syn0 and syn1 are the weights of the input-to-hidden and hidden-to-output connections. X is the input, l1 is the value of the hidden layer, and l2 is the value of the output layer (a single node).
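To make the shapes concrete, here is a quick check (my own annotation, not part of the original post):

import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])  # 4 samples x 3 features
y = np.array([[0,1,1,0]]).T                      # 4 samples x 1 target
syn0 = 2*np.random.random((3,4)) - 1             # 3 inputs -> 4 hidden nodes
syn1 = 2*np.random.random((4,1)) - 1             # 4 hidden nodes -> 1 output node
print(X.shape, y.shape, syn0.shape, syn1.shape)  # (4, 3) (4, 1) (3, 4) (4, 1)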
First, the forward propagation:
input -> hidden:
$$l_1 = \mathrm{sigmoid}(X \cdot \mathrm{syn0})$$
hidden -> output:
$$l_2 = \mathrm{sigmoid}(l_1 \cdot \mathrm{syn1})$$
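In code, the forward pass is just the two lines inside the training loop; a standalone sketch with an explicit sigmoid helper (my own refactoring, not the author’s) looks like this:

import numpy as np

def sigmoid(z):
    # element-wise logistic function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, syn0, syn1):
    l1 = sigmoid(np.dot(X, syn0))   # input -> hidden
    l2 = sigmoid(np.dot(l1, syn1))  # hidden -> output
    return l1, l2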
MSE loss:

$$L = \frac{1}{2}(y - l_2)^2$$
(We will talk about what loss this code uses later)
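For reference, this loss in NumPy (a sketch of my own; I sum over the four training samples, and the original post never actually computes the loss value) would be:

import numpy as np

def mse_loss(y, l2):
    # L = 1/2 * sum of squared errors over all samples
    return 0.5 * np.sum((y - l2) ** 2)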
Then, back propagation:
We want to know $\frac{\partial L}{\partial \mathrm{syn1}}$ and $\frac{\partial L}{\partial \mathrm{syn0}}$.
For $\frac{\partial L}{\partial \mathrm{syn1}}$:

$$\frac{\partial L}{\partial \mathrm{syn1}} = \frac{\partial L}{\partial l_2} \cdot \frac{\partial l_2}{\partial \mathrm{syn1}}$$

in which

$$\frac{\partial L}{\partial l_2} = l_2 - y$$

$$\frac{\partial l_2}{\partial \mathrm{syn1}} = \frac{\partial l_2}{\partial (l_1 \cdot \mathrm{syn1})} \cdot \frac{\partial (l_1 \cdot \mathrm{syn1})}{\partial \mathrm{syn1}} = l_2(1 - l_2) \cdot l_1$$

Multiplying these two parts together:

$$\frac{\partial L}{\partial \mathrm{syn1}} = (l_2 - y) \cdot l_2(1 - l_2) \cdot l_1$$
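Written directly in NumPy (grad_syn1 is my own helper name, not part of the original code), the formula reads:

import numpy as np

def grad_syn1(l1, l2, y):
    # dL/dsyn1 = l1^T . ((l2 - y) * l2 * (1 - l2)); result has the same shape as syn1
    return l1.T.dot((l2 - y) * l2 * (1 - l2))

Note the sign: the original code computes l2_delta = (y - l2)*(l2*(1-l2)) and then does syn1 += l1.T.dot(l2_delta), which is the same as syn1 -= grad_syn1(l1, l2, y).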
And for $\frac{\partial L}{\partial \mathrm{syn0}}$:

$$\frac{\partial L}{\partial \mathrm{syn0}} = \frac{\partial L}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1} \cdot \frac{\partial l_1}{\partial \mathrm{syn0}}$$

$$\frac{\partial L}{\partial l_2} = l_2 - y$$

as derived above,

$$\frac{\partial l_2}{\partial l_1} = \frac{\partial l_2}{\partial (l_1 \cdot \mathrm{syn1})} \cdot \frac{\partial (l_1 \cdot \mathrm{syn1})}{\partial l_1} = l_2(1 - l_2) \cdot \mathrm{syn1}$$

$$\frac{\partial l_1}{\partial \mathrm{syn0}} = \frac{\partial l_1}{\partial (X \cdot \mathrm{syn0})} \cdot \frac{\partial (X \cdot \mathrm{syn0})}{\partial \mathrm{syn0}} = l_1(1 - l_1) \cdot X$$

Multiplying these three parts together:

$$\frac{\partial L}{\partial \mathrm{syn0}} = (l_2 - y) \cdot l_2(1 - l_2) \cdot \mathrm{syn1} \cdot l_1(1 - l_1) \cdot X$$
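The same formula in NumPy (again, grad_syn0 is my own name; note how the syn1 factor appears as a dot product with syn1.T so that the shapes work out):

import numpy as np

def grad_syn0(X, l1, l2, y, syn1):
    # error propagated from the output back to the hidden layer, shape (4, 4)
    hidden_error = ((l2 - y) * l2 * (1 - l2)).dot(syn1.T)
    # dL/dsyn0 = X^T . (hidden_error * l1 * (1 - l1)); same shape as syn0, i.e. (3, 4)
    return X.T.dot(hidden_error * l1 * (1 - l1))

This is the negative of what the code accumulates: l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1)) and syn0 += X.T.dot(l1_delta) is the same as syn0 -= grad_syn0(X, l1, l2, y, syn1).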
These derivatives all match the code up to the sign: the code uses (y - l2) together with +=, i.e. it adds the negative gradient, which is plain gradient descent with a step size of 1. So I’d guess this piece of code is actually using MSE as the cost function, even though the author never makes that part clear.
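One way to convince yourself is a finite-difference check of the derived formula against the MSE loss. This is a sketch of my own (the names loss, grad and num are mine, not from the post), checking only syn1 for brevity:

import numpy as np

np.random.seed(1)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)
y = np.array([[0,1,1,0]], dtype=float).T
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

def loss(s0, s1):
    # MSE loss of the two-layer network, L = 1/2 * sum (y - l2)^2
    l1 = 1/(1+np.exp(-X.dot(s0)))
    l2 = 1/(1+np.exp(-l1.dot(s1)))
    return 0.5*np.sum((y - l2)**2)

# analytical gradient w.r.t. syn1, from the formula derived above
l1 = 1/(1+np.exp(-X.dot(syn0)))
l2 = 1/(1+np.exp(-l1.dot(syn1)))
grad = l1.T.dot((l2 - y)*l2*(1 - l2))

# numerical gradient via central differences
eps = 1e-6
num = np.zeros_like(syn1)
for i in range(syn1.shape[0]):
    for k in range(syn1.shape[1]):
        d = np.zeros_like(syn1)
        d[i, k] = eps
        num[i, k] = (loss(syn0, syn1 + d) - loss(syn0, syn1 - d)) / (2*eps)

print(np.max(np.abs(grad - num)))   # should be very small, on the order of 1e-9 or less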