Machine Learning: Neural Networks

Originally published 2024-09-12 12:07:50


License: CC 4.0 BY-SA

Tags: #machine-learning #neural-networks #artificial-intelligence


Neural Networks

Computation

A neural network is simple, and one example is enough to understand it (the label on the last layer in the figure is a typo; it should be $a^{(3)}_1$):

[Figure: example neural network]

Notation: $a^{(i)}_j$ denotes the $j$-th unit of layer $i$. $w^{(j)}$ denotes the weight matrix that controls the mapping from layer $j$ to layer $j + 1$.

For the network in the figure:

$\begin{aligned} a^{(2)}_1 = & g\bigg( w^{(1)}_{10} x_0 + w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + w^{(1)}_{13} x_3 \bigg)\\ a^{(2)}_2 = & g\bigg( w^{(1)}_{20} x_0 + w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{23} x_3 \bigg)\\ a^{(2)}_3 = & g\bigg( w^{(1)}_{30} x_0 + w^{(1)}_{31} x_1 + w^{(1)}_{32} x_2 + w^{(1)}_{33} x_3 \bigg)\\ h(x) = a^{(3)}_1 = &g\bigg( w^{(2)}_{10}a^{(2)}_0 + w^{(2)}_{11}a^{(2)}_1 + w^{(2)}_{12}a^{(2)}_2 + w^{(2)}_{13}a^{(2)}_3 \bigg) \end{aligned}$

In vectorized form, this becomes:

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \;\;\;\; w^{(1)} = \begin{bmatrix} w^{(1)}_{10} & w^{(1)}_{11} & w^{(1)}_{12} & w^{(1)}_{13} \\ w^{(1)}_{20} & w^{(1)}_{21} & w^{(1)}_{22} & w^{(1)}_{23} \\ w^{(1)}_{30} & w^{(1)}_{31} & w^{(1)}_{32} & w^{(1)}_{33} \end{bmatrix}$

Then we have:

$z^{(2)} = w^{(1)}x = \begin{bmatrix} z^{(2)}_1 \\ z^{(2)}_2 \\ z^{(2)}_3 \end{bmatrix}, \;\;\;\;a^{(2)} = g(z^{(2)}) = \begin{bmatrix} a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \end{bmatrix}$

For the next layer, we prepend the unit $a^{(2)}_0$ to $a^{(2)}$:

$a^{(2)} = \begin{bmatrix} a^{(2)}_{0} \\ a^{(2)}_{1} \\ a^{(2)}_{2} \\ a^{(2)}_{3} \end{bmatrix}, \;\;\;\;w^{(2)} = \begin{bmatrix} w^{(2)}_{10} & w^{(2)}_{11} & w^{(2)}_{12} & w^{(2)}_{13} \end{bmatrix}$

$z^{(3)} = w^{(2)}a^{(2)} = \begin{bmatrix} z^{(3)}_1 \end{bmatrix}, \;\;\;\; a^{(3)} = g(z^{(3)}) = \begin{bmatrix} a^{(3)}_1 \end{bmatrix}$

That is how a neural network computes its output. It is easy to understand and easy to implement qwq
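The layer-by-layer computation above can be sketched in a few lines of plain Python. This is only an illustration under assumptions the article does not state: $g$ is taken to be the sigmoid, the units $x_0$ and $a^{(2)}_0$ are taken to be constant bias units equal to 1, and the weight values are made up.

```python
import math

def sigmoid(z):
    # Assumed activation g; the article leaves g unspecified.
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(w, a_prev):
    """Compute g(w * [1, a_prev]) for one layer.

    w has one row per unit of the next layer; column 0 multiplies the
    bias unit (assumed to be the constant 1).
    """
    a_with_bias = [1.0] + list(a_prev)
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_with_bias)) for row in w]
    return [sigmoid(z_j) for z_j in z]

def forward(weights, x):
    """Feed x through every layer and return the output activations h(x)."""
    a = list(x)
    for w in weights:
        a = layer_forward(w, a)
    return a

# Hypothetical weights for a 3-input, 3-hidden-unit, 1-output network.
w1 = [[0.1, 0.2, -0.3, 0.4],
      [-0.5, 0.6, 0.7, -0.8],
      [0.9, -1.0, 0.2, 0.3]]
w2 = [[0.5, -1.0, 1.0, 0.5]]

h = forward([w1, w2], [1.0, 2.0, 3.0])
print(h)  # a single value strictly between 0 and 1
```

Each row of a weight matrix produces one unit of the next layer, which is why `w1` is 3x4 and `w2` is 1x4.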

Backpropagation

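Now we need to work out how to compute all of these matrices $w^{(i)}$. (Notation: $L$ is the number of layers in the network, $S_l$ is the number of nodes in layer $l$, and $k$ indexes the nodes of the output layer.)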

We again use a method similar to gradient descent (GD), so we consider $\min\limits_w J(w)$, where:

$J(w) = \frac 1m \sum_{i = 1}^m\sum_{k = 1}^{S_L}\frac 12 \bigg[ (h(x_i))_k - y_{ik} \bigg]^2$

What we need to compute is $\frac{\partial J(w)}{\partial w^{(l)}_{ij}}$.

We handle the training examples one at a time. For a single training example $(x_i, y_i)$:

$J_i = \sum_{k = 1}^{S_L}\frac 12 \bigg[ (h(x_i))_k - y_{ik} \bigg]^2$
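As a small illustration of the two cost formulas, the per-example cost $J_i$ and the averaged cost $J(w)$ can be written as follows. The function names are hypothetical, and the network outputs $h(x_i)$ are passed in directly so the sketch stays independent of any particular network.

```python
def example_cost(h, y):
    """J_i: half the squared error, summed over the output units."""
    return sum(0.5 * (h_k - y_k) ** 2 for h_k, y_k in zip(h, y))

def total_cost(outputs, targets):
    """J(w): the mean of J_i over the m training examples."""
    m = len(outputs)
    return sum(example_cost(h, y) for h, y in zip(outputs, targets)) / m

# Two made-up examples with one output unit each.
outputs = [[1.0], [3.0]]
targets = [[0.0], [1.0]]
print(total_cost(outputs, targets))  # (0.5 + 2.0) / 2 = 1.25
```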

We define $\delta^{(l)}_j$ as the "error" of unit $a^{(l)}_j$, that is, how sensitive $J_i$ is to that unit's input $z^{(l)}_j$:

$\delta^{(l)}_j = \frac{\partial J_i}{\partial z^{(l)}_j}$

For the last layer:

$\begin{aligned} \delta^{(L)}_j = \frac{\partial J_i}{\partial z^{(L)}_j} = \frac{\partial J_i}{\partial a^{(L)}_j} \cdot \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} = &\frac{\partial \sum\limits_{k = 1}^{S_L}\frac 12 [(h(x_i))_k - y_{ik}]^2}{\partial a^{(L)}_j} \cdot \frac{\partial g(z^{(L)}_j)}{\partial z^{(L)}_j} \\ = & \frac{\partial \sum\limits_{k = 1}^{S_L}\frac 12 [a^{(L)}_k - y_{ik}]^2}{\partial a^{(L)}_j} \cdot g'(z^{(L)}_j) = (a^{(L)}_j - y_{ij}) \cdot g'(z^{(L)}_j) \end{aligned}$
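As a worked special case, suppose $g$ is the sigmoid $g(z) = \frac{1}{1 + e^{-z}}$ (an assumption; the derivation above holds for any differentiable $g$). Then $g'(z) = g(z)(1 - g(z))$, and the output-layer error can be computed from the activation alone:

$\delta^{(L)}_j = (a^{(L)}_j - y_{ij}) \cdot a^{(L)}_j \cdot (1 - a^{(L)}_j)$

For instance, with made-up values $a^{(L)}_j = 0.8$ and $y_{ij} = 1$, we get $\delta^{(L)}_j = (0.8 - 1) \cdot 0.8 \cdot 0.2 = -0.032$.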

What we actually want to compute is:

$\begin{aligned} \frac{\partial J_i}{\partial w^{(L-1)}_{jk}} = \frac{\partial J_i}{\partial a^{(L)}_j} \cdot \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} \cdot \frac{\partial z^{(L)}_j}{\partial w^{(L-1)}_{jk}} = \delta^{(L)}_j \cdot \frac{\partial z^{(L)}_j}{\partial w^{(L-1)}_{jk}} \end{aligned}$

So we only need to compute $\frac{\partial z^{(L)}_j}{\partial w^{(L-1)}_{jk}}$.

We also know that:

$z^{(L)}_j = \sum_{i = 1}^{S_{L - 1}}w^{(L - 1)}_{ji}a^{(L-1)}_i$

Therefore, since only the $i = k$ term of the sum depends on $w^{(L-1)}_{jk}$:

$\frac{\partial z^{(L)}_j}{\partial w^{(L-1)}_{jk}} = \frac{\partial \sum\limits_{i = 1}^{S_{L - 1}} w^{(L-1)}_{ji}a^{(L-1)}_i }{\partial w^{(L-1)}_{jk}} = a^{(L-1)}_k$

Hence:

$\frac{\partial J_i}{\partial w^{(L-1)}_{jk}} = \delta^{(L)}_j \cdot a^{(L-1)}_k$
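One way to convince yourself of this result is to compare it with a finite-difference estimate. The sketch below assumes a sigmoid $g$ and made-up numbers, and it omits bias units to match the sum over $i = 1, \dots, S_{L-1}$ above; all function names are hypothetical.

```python
import math

def sigmoid(z):
    # Assumed activation g.
    return 1.0 / (1.0 + math.exp(-z))

def output_activations(w, a_prev):
    """a^{(L)} = g(w^{(L-1)} a^{(L-1)}), without bias units."""
    return [sigmoid(sum(w_ji * a_i for w_ji, a_i in zip(row, a_prev)))
            for row in w]

def cost(w, a_prev, y):
    """J_i for one example, as a function of the last weight matrix."""
    a = output_activations(w, a_prev)
    return sum(0.5 * (a_j - y_j) ** 2 for a_j, y_j in zip(a, y))

def last_layer_gradient(w, a_prev, y):
    """dJ_i/dw_{jk} = delta_j * a_k, with delta_j = (a_j - y_j) * g'(z_j)."""
    a = output_activations(w, a_prev)
    delta = [(a_j - y_j) * a_j * (1.0 - a_j) for a_j, y_j in zip(a, y)]
    return [[d_j * a_k for a_k in a_prev] for d_j in delta]

# Made-up activations of layer L-1, weights and targets.
w = [[0.3, -0.2], [0.5, 0.1]]
a_prev = [0.6, 0.9]
y = [1.0, 0.0]

grad = last_layer_gradient(w, a_prev, y)

# Central finite difference for the single weight w[0][1].
eps = 1e-6
w_plus = [row[:] for row in w]
w_minus = [row[:] for row in w]
w_plus[0][1] += eps
w_minus[0][1] -= eps
numeric = (cost(w_plus, a_prev, y) - cost(w_minus, a_prev, y)) / (2 * eps)

print(grad[0][1], numeric)  # the two values should agree closely
```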

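Now that we have the last layer, we ask whether we can work backwards from it. We use a simple example to make the computation more concrete (when drawing the figure I wrote $w$ as $\varphi$ qwq):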

[Figure: example network used for the backpropagation derivation, with weights labelled $\varphi$]

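Suppose we want the partial derivative of $J_i$ with respect to $w^{(3)}_{11}$, where $J_{i1}$ and $J_{i2}$ are the terms of $J_i$ contributed by the two output units: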

$\frac{\partial J_i}{\partial w^{(3)}_{11}} = \frac{\partial (J_{i1} + J_{i2})}{\partial w^{(3)}_{11}} = \frac{\partial J_{i1}}{\partial w^{(3)}_{11}} + \frac{\partial J_{i2}}{\partial w^{(3)}_{11}}$

We compute $\frac{\partial J_{i1}}{\partial w^{(3)}_{11}}$ and $\frac{\partial J_{i2}}{\partial w^{(3)}_{11}}$ separately.

Start with the first term and apply the chain rule step by step along the network:

$\begin{aligned} \frac{\partial J_{i1}}{\partial w^{(3)}_{11}} = & \frac{\partial J_{i1}}{\partial a^{(5)}_1} \cdot \frac{\partial a^{(5)}_1}{\partial z^{(5)}_1} \cdot \frac{\partial z^{(5)}_1}{\partial a^{(4)}_1} \cdot \frac{\partial a^{(4)}_1}{\partial z^{(4)}_1} \cdot \frac{\partial z^{(4)}_1}{\partial w^{(3)}_{11}} \\ = & \delta^{(5)}_1 \cdot \frac{\partial z^{(5)}_1}{\partial a^{(4)}_1} \cdot \frac{\partial a^{(4)}_1}{\partial z^{(4)}_1} \cdot \frac{\partial z^{(4)}_1}{\partial w^{(3)}_{11}} \end{aligned}$

We also have:

$\begin{aligned} z^{(5)}_1 = w^{(4)}_{11}a^{(4)}_1 + w^{(4)}_{12}a^{(4)}_2 \rightarrow & \frac{\partial z^{(5)}_1}{\partial a^{(4)}_1} = w^{(4)}_{11} \\ a^{(4)}_1 = g(z^{(4)}_1) \rightarrow & \frac{\partial a^{(4)}_1}{\partial z^{(4)}_1} = g'(z^{(4)}_1) \\ z^{(4)}_1 = w^{(3)}_{11}a^{(3)}_1 + w^{(3)}_{12}a^{(3)}_2 \rightarrow & \frac{\partial z^{(4)}_1}{\partial w^{(3)}_{11}} = a^{(3)}_1 \end{aligned}$

So:

$\frac{\partial J_{i1}}{\partial w^{(3)}_{11}} = \delta^{(5)}_1 \cdot w^{(4)}_{11} \cdot g'(z^{(4)}_1) \cdot a^{(3)}_1$

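Similarly, we can derive the second term (the steps are almost identical to the ones above, so they are omitted ~~(definitely not because the formulas are a pain to type qwq)~~):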

$\frac{\partial J_{i2}}{\partial w^{(3)}_{11}} = \delta^{(5)}_2 \cdot w^{(4)}_{21} \cdot g'(z^{(4)}_1) \cdot a^{(3)}_1$

Adding the two together gives:

$\begin{aligned} \frac{\partial J_i}{\partial w^{(3)}_{11}} = & \delta^{(5)}_1 \cdot w^{(4)}_{11} \cdot g'(z^{(4)}_1) \cdot a^{(3)}_1 + \delta^{(5)}_2 \cdot w^{(4)}_{21} \cdot g'(z^{(4)}_1) \cdot a^{(3)}_1\\ = & (\delta^{(5)}_1 \cdot w^{(4)}_{11} + \delta^{(5)}_2 \cdot w^{(4)}_{21})\cdot g'(z^{(4)}_1) \cdot a^{(3)}_1 \end{aligned}$

Now let:

$\delta^{(4)}_1 = (\delta^{(5)}_1 \cdot w^{(4)}_{11} + \delta^{(5)}_2 \cdot w^{(4)}_{21}) \cdot g'(z^{(4)}_1)$

Then we have:

$\frac{\partial J_i}{\partial w^{(3)}_{11}} = \delta^{(4)}_1 \cdot a^{(3)}_1$

Notice that this expression has exactly the same structure as the result we obtained earlier for the last layer:

$\frac{\partial J_i}{\partial w^{(L-1)}_{jk}} = \delta^{(L)}_j \cdot a^{(L-1)}_k$

So we obtain a recurrence:

$\delta^{(4)}_1 = (\delta^{(5)}_1 \cdot w^{(4)}_{11} + \delta^{(5)}_2 \cdot w^{(4)}_{21}) \cdot g'(z^{(4)}_1)$

In the same way, we also get:

$\delta^{(4)}_2 = (\delta^{(5)}_1 \cdot w^{(4)}_{12} + \delta^{(5)}_2 \cdot w^{(4)}_{22}) \cdot g'(z^{(4)}_2)$

This can also be written in vector form, where $\odot$ denotes element-wise multiplication. Note that the matrix is the transpose of $w^{(4)}$:

$\begin{bmatrix} \delta^{(4)}_1 \\ \delta^{(4)}_2 \end{bmatrix} = \left(\begin{bmatrix} w^{(4)}_{11} & w^{(4)}_{21} \\ w^{(4)}_{12} & w^{(4)}_{22} \end{bmatrix} \begin{bmatrix} \delta^{(5)}_1 \\ \delta^{(5)}_2 \end{bmatrix}\right) \odot \begin{bmatrix} g'(z^{(4)}_1) \\ g'(z^{(4)}_2) \end{bmatrix}$

That is:

$\delta^{(4)} = \bigg[(w^{(4)})^T\delta^{(5)}\bigg] \odot g'(z^{(4)})$

The same formula generalizes to the other layers:

$\delta^{(l)} = \bigg[ (w^{(l)})^T\delta^{(l+1)} \bigg] \odot g'(z^{(l)})$

This formula is the key to backpropagation.
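A minimal sketch of one step of this recurrence, assuming a sigmoid $g$ and made-up numbers for the two-unit example: the helper name `backprop_delta` is hypothetical, and `w[j][k]` stores the weight from unit $k$ of layer $l$ to unit $j$ of layer $l + 1$, matching the index convention of the article.

```python
import math

def sigmoid(z):
    # Assumed activation g.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_delta(w, delta_next, z):
    """Return delta^{(l)} given w^{(l)}, delta^{(l+1)} and z^{(l)}.

    Multiplies delta_next by the transpose of w (summing over the units j
    of the next layer), then scales element-wise by g'(z).
    """
    back = [sum(w[j][k] * delta_next[j] for j in range(len(delta_next)))
            for k in range(len(z))]
    return [b_k * sigmoid_prime(z_k) for b_k, z_k in zip(back, z)]

# Made-up values for the two-unit layers 4 and 5.
w4 = [[0.2, -0.4],
      [0.7, 0.1]]
delta5 = [0.05, -0.03]
z4 = [0.3, -1.2]

delta4 = backprop_delta(w4, delta5, z4)
print(delta4)
```

Reading the sum inside `backprop_delta` for `k = 0` gives `w4[0][0] * delta5[0] + w4[1][0] * delta5[1]`, which is the scalar formula for $\delta^{(4)}_1$ before the factor $g'(z^{(4)}_1)$ is applied.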

Then, for each training example $i$, we run BP once to compute $\frac{\partial J_i}{\partial w^{(l)}_{jk}}$ for every layer $l$, and we use $\Delta^{(l)}_{jk}$ to accumulate the sum of these derivatives. After all $m$ training examples have been processed, we set $D^{(l)}_{jk} = \frac 1m\Delta^{(l)}_{jk}$, which gives:

$\frac{\partial}{\partial w^{(l)}_{jk}}J(w) = D^{(l)}_{jk}$

After that, we simply run gradient descent (GD).
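Putting the pieces together, a complete training step might look like the sketch below. It is an illustration under several assumptions rather than a reference implementation: $g$ is the sigmoid, bias units are omitted to match the sums in the derivation, and the data set, initial weights and learning rate are made up.

```python
import math

def sigmoid(z):
    # Assumed activation g.
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, x):
    """Return the activations of every layer, starting with the input."""
    activations = [list(x)]
    for w in weights:
        a_prev = activations[-1]
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) for row in w]
        activations.append([sigmoid(z_j) for z_j in z])
    return activations

def example_gradients(weights, x, y):
    """Backpropagate one example and return dJ_i/dw for every layer."""
    activations = forward(weights, x)
    a_out = activations[-1]
    # Output layer: delta = (a - y) * g'(z), where g'(z) = a * (1 - a).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(a_out, y)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        # dJ_i/dw_{jk} = delta_j * a_k
        grads[l] = [[d_j * a_k for a_k in a_prev] for d_j in delta]
        if l > 0:
            # Recurrence: delta of the previous layer = (w^T delta) * g'(z).
            w = weights[l]
            back = [sum(w[j][k] * delta[j] for j in range(len(delta)))
                    for k in range(len(a_prev))]
            delta = [b * a * (1.0 - a) for b, a in zip(back, a_prev)]
    return grads

def cost(weights, data):
    """J(w): mean over the examples of half the squared output error."""
    total = 0.0
    for x, y in data:
        h = forward(weights, x)[-1]
        total += sum(0.5 * (h_k - y_k) ** 2 for h_k, y_k in zip(h, y))
    return total / len(data)

def gd_step(weights, data, lr):
    """Accumulate Delta over all examples, form D = Delta / m, update w."""
    m = len(data)
    acc = [[[0.0] * len(row) for row in w] for w in weights]
    for x, y in data:
        grads = example_gradients(weights, x, y)
        for l, g in enumerate(grads):
            for j, row in enumerate(g):
                for k, value in enumerate(row):
                    acc[l][j][k] += value
    return [[[w_jk - lr * d_jk / m for w_jk, d_jk in zip(w_row, d_row)]
             for w_row, d_row in zip(w, d)]
            for w, d in zip(weights, acc)]

# Made-up data (logical OR) and initial weights for a 2-2-1 network.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
weights = [[[0.1, -0.2], [0.3, 0.4]], [[0.2, -0.1]]]

before = cost(weights, data)
trained = weights
for _ in range(200):
    trained = gd_step(trained, data, lr=0.5)
after = cost(trained, data)
print(before, after)  # the cost should go down
```

The accumulator `acc` plays the role of $\Delta^{(l)}_{jk}$, and dividing by `m` inside `gd_step` gives $D^{(l)}_{jk}$. In practice the weights are usually initialized randomly and bias terms are included; both are left out here to keep the sketch close to the derivation.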

Comments

2401_84079994 2024.09.21
I guess the math part is used in certain functions in the whole process flow. I did not try to memorize all the related functions. At least not doing so before I can catch the supporting logic and/or math prove behind. In fact I am trying to run the simple network on a very simple demo example provided in a book, line by line, for about 2 months on and off. Kind of frustrated by the line by line explanations on the book. Only understand what it is doing but lack why it is doing so. Well, it is college experience. Professor can not tell everything and student can not learn everything.

2401_84079994 2024.09.20
thanks a lot for reply. any input fills my curious mind. for deep network, my thought is that it may run much less times in loop while simple network may run much more time in loop, given a timeframe. but, since the designs are different with unknown effect on back and forth process between any two layers, the deeper network may still get better result than simpler network. by the way let me ask a dummy question about the math part. is it true that, given a designed network, you do have a fixed set of parameters (or variables, like x1,x2, etc) and therefore those partial derivatives comes out in play? if true, those pre-fixed parameters are part of the neural network design for a problem? if it is still true, I would guess that the deep learning process may have certain capacity to increase/decrease parameters and change parameter values at the same time in order to generate better result. to this point, it suddenly tastes like partial random evolution process. LOL. thanks again. If I have more energy I may get into the math part a little bit and ask for more understanding.
- aWty_ replying to 2401_84079994, 2024.09.20
  In a particular training, those parameters are randomly given in the initialization part, and the SGD algorithm would help you adjust those parameters automatically to reach a lower loss, which means reaching higher accuracy for prediction. So the initialization for parameters is not a part of network design, but a part of training process. That is my understanding for your question.

2401_84079994 2024.09.17
Before spending any time to recall or catch the math part, let me ask some fundamental questions (having kept in mind for a few months). 1. Given unlimited resource and time, is the neural network model theoretically proved to provide correct answer for any problems which have been already solved by other methods? 2. How to determine how many members (or cells) are used in given middle layers? Based on input layer setup? Or the more the better? 3. Does deep network with more middle layers always provide better answer given a fixed timeframe?
- aWty_ replying to 2401_84079994, 2024.09.19
  Firstly, sorry for being unable to answer your questions, I'm also a green hand learner in the machine learning field. However, I would still like to share my viewpoint on your questions. 1. The network cannot guarantee 100% accuracy for predictions, yet you can increase it by adjusting parameters while training your network with SGD. 2. If you are curious about how to design a network, I recommend you learn something about LeNet AlexNet VGG GoogLeNet ResNet NiN and so on. These are both the nets that are proven to be efficient in solving problems like computer vision. 3. As for the third question, the answer is absolutely no. As we all know, the more middle layers a net has, the more time it requires to train. So in a fixed timeframe, an over-complex net may even unable to finish training, let alone provide a better prediction. Additionally, if a network has too many layers, it may encounter a problem called overfitting. So the answer is NO.