Coursera | Andrew Ng (01-week-3-3.7): Why do you need non-linear activation functions?

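This series only adds personal study notes and supplementary derivations to selected topics from the original course. If you find mistakes, corrections are welcome. After working through Andrew Ng's course, I wrote it up as text to make it easier to look things up and review. Because I have been studying English, the series is mainly in English, and I suggest readers also rely mainly on the English, with the Chinese as an aid, as groundwork for reading academic papers in the field later on. - ZJ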

Coursera course | deeplearning.ai | NetEase Cloud Classroom

When reposting, please credit the author and source: ZJ, WeChat official account "SelfImprovementLab"

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.youkuaiyun.com/junjun_zhao/article/details/79000242

3.7 Why do you need non-linear activation functions?

(Subtitles source: NetEase Cloud Classroom)


Why does your neural network need a nonlinear activation function? It turns out that for your neural network to compute interesting functions, you do need to use a nonlinear activation function. Here are the forward propagation equations for the neural network. Why don't we just get rid of the function $g$ and set $a^{[1]}$ equal to $z^{[1]}$? Alternatively, you could say that $g(z) = z$. Sometimes this is called the linear activation function; maybe a better name for it would be the identity activation function, because it just outputs whatever was input. For the purpose of this, what if $a^{[2]}$ was just equal to $z^{[2]}$ as well? It turns out that if you do this, then the model is just computing $y$, or $\hat{y}$, as a linear function of your input features $x$. Take the first two equations: $a^{[1]} = z^{[1]} = W^{[1]}x + b^{[1]}$ and $a^{[2]} = z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$. If you take the definition of $a^{[1]}$ and plug it in, you find that $a^{[2]} = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]}$.


So $a^{[2]} = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]}$, where the term in parentheses is $a^{[1]}$, and this simplifies to $W^{[2]}W^{[1]}x + W^{[2]}b^{[1]} + b^{[2]}$. Let's call $W^{[2]}W^{[1]}$ just $W'$ and $W^{[2]}b^{[1]} + b^{[2]}$ just $b'$, so the output is equal to $W'x + b'$. If you were to use linear activation functions, or we could call them identity activation functions, then the neural network is just outputting a linear function of the input. We'll talk about deep networks later, neural networks with many, many hidden layers. It turns out that if you use a linear activation function, or alternatively if you don't have an activation function, then no matter how many layers your neural network has, all it is ever doing is computing a linear function of the input, so you might as well not have any hidden layers. As one case to mention briefly: if you have a linear activation function here (in the hidden layer) and a sigmoid function here (in the output layer), then this model is no more expressive than standard logistic regression without any hidden layer.
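The collapse described above can be checked numerically. The following is a minimal sketch in plain Python, not code from the course: the layer sizes, the random weights, and the helper names (`matvec`, `matmul`, `vadd`, `two_layer_identity`, `collapsed`) are all assumptions made for illustration. It builds a two-layer network with the identity activation in the hidden layer, forms $W' = W^{[2]}W^{[1]}$ and $b' = W^{[2]}b^{[1]} + b^{[2]}$, and compares the two outputs, with and without a sigmoid on top.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n), both lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vadd(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical sizes and random weights, chosen only for this illustration.
rng = random.Random(0)
n_x, n_h = 3, 4
W1 = [[rng.uniform(-1, 1) for _ in range(n_x)] for _ in range(n_h)]
b1 = [rng.uniform(-1, 1) for _ in range(n_h)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_h)]]  # one output unit
b2 = [rng.uniform(-1, 1)]

def two_layer_identity(x):
    """Forward pass with g(z) = z in the hidden layer; returns z2."""
    a1 = vadd(matvec(W1, x), b1)     # a1 = z1 = W1 x + b1
    return vadd(matvec(W2, a1), b2)  # z2 = W2 a1 + b2

# Collapse the two layers into a single linear map W' x + b'.
W_prime = matmul(W2, W1)
b_prime = vadd(matvec(W2, b1), b2)

def collapsed(x):
    return vadd(matvec(W_prime, x), b_prime)

x = [0.5, -1.2, 2.0]
z_net = two_layer_identity(x)[0]
z_lin = collapsed(x)[0]

# With a sigmoid output, the network is logistic regression on (W', b').
y_hat_net = sigmoid(z_net)
y_hat_logreg = sigmoid(z_lin)

print("two-layer linear network:", z_net, "collapsed:", z_lin)
print("with sigmoid output:", y_hat_net, "logistic regression:", y_hat_logreg)
```

The two pairs of numbers should agree up to floating-point rounding, which is the sense in which the hidden layer adds no expressive power here.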


I won't bother to prove that, but you could try to do so if you want. The take-home is that a linear hidden layer is more or less useless, because the composition of two linear functions is itself a linear function. So unless you throw a non-linearity in there, you're not computing more interesting functions, even as you go deeper in the network. There is just one place where you might use a linear activation function, $g(z) = z$, and that's if you are doing machine learning on a regression problem, so $y$ is a real number. For example, if you're trying to predict housing prices, $y$ is not 0 or 1 but a real number, anywhere from zero dollars up to however expensive a house can get, potentially millions of dollars, or however much houses cost in your data set. If $y$ takes on these real values, then it might be OK to have a linear activation function here (in the output layer),
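Returning to the point about depth: the two-layer argument extends to any number of layers. As a sketch, assume an $L$-layer network in which every layer uses $g(z) = z$, so that $a^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$ with $a^{[0]} = x$ (the symbol $L$ and this indexing are notation assumed here, not taken from the lecture slide). Unrolling the recursion gives

$$
a^{[L]} = W'x + b', \qquad
W' = W^{[L]}W^{[L-1]}\cdots W^{[1]}, \qquad
b' = \sum_{l=1}^{L}\left(W^{[L]}\cdots W^{[l+1]}\right)b^{[l]},
$$

where the product in parentheses is taken to be the identity when $l = L$. For $L = 2$ this reduces to $W' = W^{[2]}W^{[1]}$ and $b' = W^{[2]}b^{[1]} + b^{[2]}$.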


so that your output $\hat{y}$ is also a real number, going anywhere from minus infinity to plus infinity. But then the hidden units should not use linear activation functions; they could use ReLU, or tanh, or leaky ReLU, or maybe something else. So the one place you might use a linear activation function is usually the output layer. Other than that, using a linear activation function in a hidden layer is extremely rare, except for some very special circumstances relating to compression that we won't talk about. And of course, when actually predicting housing prices, as you saw in the week 1 video, because housing prices are all non-negative, perhaps even then you can use a ReLU activation function, so that your outputs $\hat{y}$ are all greater than or equal to 0.
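The choice of output activation for regression can be illustrated with a small forward pass. This is a hedged sketch in plain Python: the weights, the inputs, and the function names (`forward`, `identity`, `relu`) are hypothetical values made up for the example, not anything from the course. The hidden layer uses tanh, and only the output activation changes.

```python
import math

def identity(z):
    """Linear (identity) activation: g(z) = z."""
    return z

def relu(z):
    """ReLU activation: g(z) = max(0, z)."""
    return max(0.0, z)

def forward(x, W1, b1, w2, b2, g_out):
    """One hidden tanh layer followed by a single output unit with activation g_out."""
    a1 = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    z2 = sum(w * a for w, a in zip(w2, a1)) + b2
    return g_out(z2)

# Hypothetical hand-picked weights and inputs, for illustration only.
W1 = [[0.5, -0.3], [-0.8, 0.2]]
b1 = [0.1, -0.1]
w2 = [1.5, -2.0]
b2 = -0.5
inputs = [[1.0, 2.0], [-3.0, 0.5], [0.0, 0.0]]

# Identity output: predictions can be any real number, including negatives.
linear_preds = [forward(x, W1, b1, w2, b2, identity) for x in inputs]
# ReLU output: predictions are clipped at zero, as for non-negative prices.
relu_preds = [forward(x, W1, b1, w2, b2, relu) for x in inputs]

for x, lin, rel in zip(inputs, linear_preds, relu_preds):
    print(x, "identity output:", lin, "ReLU output:", rel)
```

With the identity output the same network can return negative values, while the ReLU output keeps every prediction at or above zero.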


So I hope that gives you a sense of why having a nonlinear activation function is a critical part of neural networks. Next we're going to start to talk about gradient descent. To set up for that discussion, in the next video I want to show you how to compute the slope, or the derivative, of individual activation functions. So let's go on to the next video.
