Andrew Ng Deep Learning, Course 2
Techniques for Training, Optimization, and Checking
Train / Dev / Test Sets
Training data is usually split into three parts: a training set, a dev (validation) set, and a test set.
- Run the training algorithm on the training set.
- Compare different models on the dev set and pick the final model (what earlier machine-learning material loosely called the "test set" was really a dev set).
- Finally evaluate the chosen model on the test set (an unbiased evaluation of the final model).
In the big-data era the split no longer has to follow the traditional 70/30 rule; when the dataset is very large, it is enough that the dev and test sets contain sufficiently many examples.
Rule of thumb: to avoid misleading gaps between dev and test results, make sure the dev and test sets come from the same distribution.
Bias and Variance
- Variance: measures the gap between training-set and dev-set performance.
- Bias: measures performance on the training set itself.
High bias means underfitting.
High variance means overfitting.
Basic Recipe for Machine Learning
If the fitted model shows high bias, change the network itself (more layers, more hidden units, a different architecture, and so on).
If it shows high variance, use more data or add regularization.
Why modern machine learning worries less about trading off bias against variance: making the network bigger almost always helps.
Rule of thumb: if the network performs poorly, try a bigger network.
Regularization
Regularization is an effective way to reduce overfitting.
Its general form is:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L\!\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m}\left\|w\right\|_2^2$$
The mainstream choice is L2 regularization (using the squared L2 norm; for weight matrices this becomes the Frobenius norm), i.e.:
$$\left\|w\right\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$$
L1 regularization (using the L1 norm) is:
$$\left\|w\right\|_1 = \sum_{j=1}^{n_x} \left|w_j\right|$$
Because w can contain a huge number of parameters while b is a single scalar with little influence, the regularization term usually omits b.
Here λ is the regularization parameter, a hyperparameter tuned with the dev set (or cross-validation).
With L2 regularization, the gradient dw in backpropagation becomes:
$$dw^{[L]} = dw^{[L]} + \frac{\lambda}{m} w^{[L]}$$
This shrinks w a little more at every update, which is why L2 regularization is also called weight decay.
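As a minimal numpy sketch (the helper names `l2_regularized_cost` and `update_with_weight_decay` are illustrative, not from the course code), the regularized cost and the weight-decay update look like this:

```python
import numpy as np

def l2_regularized_cost(unreg_cost, weights, lam, m):
    # weights: list of weight matrices W[1..L]; lam: lambda; m: number of examples
    l2_term = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return unreg_cost + l2_term

def update_with_weight_decay(W, dW_backprop, lam, m, alpha):
    # add the regularization gradient, then take one gradient step
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW
```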
Why Regularization Reduces Overfitting
With regularization, large weights w incur a penalty and get pushed down; when some dimensions of w are close to 0 (intuitively speaking), the corresponding units effectively disappear, which simplifies the network and prevents overfitting. But if the regularization is too strong, the network becomes too simple and underfits. So the goal is to find a suitable hyperparameter λ.
When λ is large, the regularization is strong and w tends to be small. Taking tanh as an example: with small w, z is also small and stays in the middle region of tanh, which is nearly linear. If every layer is nearly linear, the whole network is greatly simplified.
Dropout Regularization
Dropout regularization randomly deactivates units: during training, each unit is kept only with some probability, which keeps the network from relying too heavily on any particular units or training examples.
Implementing (inverted) dropout, in four steps (a consolidated sketch follows the list):
1. Generate a random matrix:
d3 = np.random.rand(a3.shape[0], a3.shape[1])
2. Turn the random matrix into a 0/1 mask:
with keep_prob = 0.8, entries smaller than 0.8 become 1 (kept) and the rest become 0 (dropped).
3. Zero out the dropped units:
a3 = np.multiply(a3, d3)
4. Rescale so the expected value of a3 stays the same:
a3 /= keep_prob
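Putting the four steps together, a minimal inverted-dropout sketch for layer 3 (using the same names a3 and keep_prob as above; returning the mask d3 so the same units can be dropped in backprop) might look like:

```python
import numpy as np

def inverted_dropout(a3, keep_prob=0.8):
    # 1. random matrix with the same shape as the activations
    d3 = np.random.rand(a3.shape[0], a3.shape[1])
    # 2. turn it into a 0/1 mask: entries below keep_prob are kept
    d3 = (d3 < keep_prob).astype(float)
    # 3. zero out the dropped units
    a3 = np.multiply(a3, d3)
    # 4. scale up so the expected value of a3 is unchanged
    a3 /= keep_prob
    return a3, d3
```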
Understanding Dropout
Because any of a unit's inputs can be dropped, the network cannot put too much weight on any single input, so dropout has a weight-shrinking effect similar to the squared (L2) norm. It is widely used in computer vision and less often elsewhere, and is usually brought in only once overfitting is observed.
Drawback: training becomes harder to predict, the cost function is no longer well defined, and results are harder to debug.
Rule of thumb: the larger and more complex a layer's weight matrix w^[l], the smaller its keep_prob should be (i.e. the more units get dropped).
Rule of thumb: for the input layer, keep_prob is usually above 0.9 and most of the time simply 1.
Other Regularization Methods
1. Data augmentation: for image data, flip images horizontally or take random (but still reasonable) crops to create new training examples.
2. Early stopping: while training, plot the cost on the training set and on the dev set. Typically both curves fall at first and the dev-set cost then reaches a minimum and rises again; stop training at that minimum.
Drawback: stopping early means you stop driving the cost down, so avoiding overfitting and minimizing the cost become entangled, which makes the problem harder to reason about.
Normalizing Inputs
Normalizing the inputs is one way to speed up training.
1. Zero-center the data
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x := x - \mu$$
2. Normalize the variance
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2 \ \text{(element-wise)}, \qquad x := \frac{x}{\sigma}$$
These two steps spread the features fairly evenly around the origin.
Note: apply exactly the same normalization (the same μ and σ) to the training set and the test set.
Normalization makes the cost surface more symmetric and the gradients more stable, so the optimum is easier to find, optimization is easier, and a larger learning rate can be used.
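A small sketch of the two steps, fitting μ and σ on the training set and applying the same values to the dev/test sets (X has shape (n_x, m), one column per example; the function names are illustrative):

```python
import numpy as np

def fit_normalizer(X_train):
    mu = np.mean(X_train, axis=1, keepdims=True)    # per-feature mean
    sigma = np.std(X_train, axis=1, keepdims=True)  # per-feature standard deviation
    return mu, sigma

def apply_normalizer(X, mu, sigma, eps=1e-8):
    # the same mu and sigma must be used for train, dev and test sets
    return (X - mu) / (sigma + eps)
```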
Vanishing and Exploding Gradients
In deep networks with many layers, activations and gradients can shrink or grow exponentially with depth, so gradients end up near 0 or blow up toward infinity. The scale of the weights w therefore has to be chosen carefully.
Weight Initialization for Deep Networks
To keep z from growing too large, the weights should be smaller when a unit has many inputs n, and may be larger when n is small. A common initialization for ReLU activations is:
$$w^{[l]} = \text{np.random.randn(shape)} \times \sqrt{\frac{2}{n^{[l-1]}}}$$
This mitigates vanishing and exploding gradients.
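A sketch of this He-style initialization across layers (`layer_dims` is a hypothetical list of layer sizes, e.g. [n_x, n_1, ..., n_L]):

```python
import numpy as np

def init_parameters(layer_dims):
    params = {}
    for l in range(1, len(layer_dims)):
        # variance 2 / n^[l-1] keeps z on a similar scale across layers (ReLU)
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```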
Numerical Approximation of Gradients
A two-sided difference approximates the derivative much more accurately than a one-sided difference:
$$\frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$$
Gradient Checking
Gradient checking verifies that backpropagation is implemented correctly.
1. Reshape the W and b matrices into vectors and concatenate them into one big vector θ.
2. Reshape the dw and db matrices into vectors and concatenate them into dθ.
3. Loop over every component i to compute dθ_approx[i]:
$$d\theta_{\text{approx}}[i] = \frac{J(\theta_1,\theta_2,\dots,\theta_i+\varepsilon,\dots) - J(\theta_1,\theta_2,\dots,\theta_i-\varepsilon,\dots)}{2\varepsilon}$$
4. Check how close dθ and dθ_approx are:
$$\frac{\left\|d\theta_{\text{approx}} - d\theta\right\|_2}{\left\|d\theta_{\text{approx}}\right\|_2 + \left\|d\theta\right\|_2}$$
This ratio is normally tiny. If it is below ε (around 10⁻⁷), the analytic derivatives are almost certainly correct; if it is much larger, be very careful and go hunting for the bug.
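A minimal sketch of steps 3 and 4, assuming you already have a cost function `cost(theta)` over the flattened parameter vector and the analytic gradient `dtheta` (both names are assumptions for illustration):

```python
import numpy as np

def gradient_check(cost, theta, dtheta, eps=1e-7):
    dtheta_approx = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        theta_plus = theta.copy();  theta_plus[i] += eps
        theta_minus = theta.copy(); theta_minus[i] -= eps
        # two-sided difference for the i-th component
        dtheta_approx[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * eps)
    # relative difference between the numerical and analytic gradients
    diff = (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    return diff  # roughly 1e-7 or smaller is a pass
```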
Notes on Implementing Gradient Checking
- Do not run gradient checking during training; it is only for debugging.
- If gradient checking fails, inspect the individual components to locate the bug.
- When using regularization, remember that both the cost and the gradients include the regularization term.
- Do not use dropout and gradient checking at the same time.
- Optionally, run gradient checking right after random initialization and again after some training (less commonly done).
Choosing a Gradient Descent Variant
Mini-batch Gradient Descent
When the training set is extremely large, processing it all in one step is expensive and slow. Instead, split the data into small subsets called mini-batches, written X^{i} (Y is split the same way). The parameters are then updated continuously, each step using a different mini-batch.
Understanding Mini-batch Gradient Descent
(Figure: comparison of the cost-function curves of classic batch gradient descent and mini-batch gradient descent.)
The mini-batch size is a hyperparameter. Setting it to m gives batch gradient descent; setting it to 1 gives stochastic gradient descent (one example per update), which is noisier but takes steps faster and still heads toward the global optimum overall. In practice, pick a mini-batch size that is neither too large nor too small.
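A sketch of splitting a training set into shuffled mini-batches (assuming X has shape (n_x, m) and Y has shape (1, m); the last batch may be smaller):

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=64, seed=0):
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)       # shuffle examples
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, mini_batch_size):
        batches.append((X[:, k:k + mini_batch_size], Y[:, k:k + mini_batch_size]))
    return batches
```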
Exponentially Weighted Averages
$$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$$
Here β is a tunable parameter. The larger β is, the smoother the curve and the closer it tracks a long-run average; the smaller β is, the faster it adapts to changes.
Understanding Exponentially Weighted Averages
v_t is a running average. It is slightly biased, but it needs only one line of code and a tiny amount of memory: just keep updating v = βv + (1−β)θ_t. In short, it is a cheap way to compute and store an average.
Bias Correction in Exponentially Weighted Averages
Because v_0 = 0, the first few estimates are much smaller than the true values, so a bias correction is applied:
$$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t, \qquad v_t^{\text{corrected}} = \frac{v_t}{1-\beta^{\,t}}$$
Early on, dividing by (1−β^t) pushes the estimate up a little; later on the factor approaches 1 and has no effect. Since the bias only matters at the beginning, you can skip the correction if the early estimates don't matter to you.
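A short sketch of the exponentially weighted average with optional bias correction (`thetas` is a hypothetical sequence of observations):

```python
def ewa(thetas, beta=0.9, bias_correction=True):
    v = 0.0
    averages = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta          # running average
        v_hat = v / (1 - beta ** t) if bias_correction else v
        averages.append(v_hat)
    return averages
```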
Gradient Descent with Momentum
Momentum (usable with mini-batches) applies the exponentially weighted average described above to the gradients, and almost always runs faster than plain gradient descent:
$$\begin{aligned}
v_{dw} &= \beta v_{dw} + (1-\beta)\,dw \\
v_{db} &= \beta v_{db} + (1-\beta)\,db \\
w &= w - \alpha\, v_{dw} \\
b &= b - \alpha\, v_{db}
\end{aligned}$$
Each update uses an average of the recent gradients. Plain gradient descent oscillates up and down on its way toward the optimum; averaging cancels much of that oscillation while reinforcing the component that points toward the optimum. This both damps the oscillation and speeds up learning.
Rule of thumb: β = 0.9 works well in most cases.
Rule of thumb: bias correction is usually not added unless there is a specific need.
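A sketch of one momentum update for a single layer (dw and db come from backprop on the current mini-batch; v_dw and v_db start at zero):

```python
def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    # exponentially weighted average of the gradients
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    # step in the smoothed direction
    w = w - alpha * v_dw
    b = b - alpha * v_db
    return w, b, v_dw, v_db
```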
RMSprop
$$\begin{aligned}
s_{dw} &= \beta\, s_{dw} + (1-\beta)\,dw^2 \\
s_{db} &= \beta\, s_{db} + (1-\beta)\,db^2 \\
w &= w - \alpha \frac{dw}{\sqrt{s_{dw}}} \\
b &= b - \alpha \frac{db}{\sqrt{s_{db}}}
\end{aligned}$$
RMSprop achieves an effect similar to momentum: it damps the oscillations so a larger learning rate can be used. In practice a tiny ε is added to the denominators to avoid division by zero.
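A matching RMSprop sketch for one layer (s_dw and s_db start at zero; eps is the small stabilizer just mentioned):

```python
import numpy as np

def rmsprop_step(w, b, dw, db, s_dw, s_db, alpha=0.001, beta=0.999, eps=1e-8):
    # exponentially weighted average of the squared gradients
    s_dw = beta * s_dw + (1 - beta) * np.square(dw)
    s_db = beta * s_db + (1 - beta) * np.square(db)
    # divide each gradient by the root of its running second moment
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return w, b, s_dw, s_db
```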
Adam Optimization Algorithm
Combining momentum and RMSprop gives the Adam optimization algorithm:
$$\begin{aligned}
v_{dw} &= \beta_1 v_{dw} + (1-\beta_1)\,dw, & v_{db} &= \beta_1 v_{db} + (1-\beta_1)\,db \\
s_{dw} &= \beta_2 s_{dw} + (1-\beta_2)\,dw^2, & s_{db} &= \beta_2 s_{db} + (1-\beta_2)\,db^2 \\
v_{dw}^{\text{corrected}} &= \frac{v_{dw}}{1-\beta_1^{\,t}}, & v_{db}^{\text{corrected}} &= \frac{v_{db}}{1-\beta_1^{\,t}} \\
s_{dw}^{\text{corrected}} &= \frac{s_{dw}}{1-\beta_2^{\,t}}, & s_{db}^{\text{corrected}} &= \frac{s_{db}}{1-\beta_2^{\,t}} \\
w &= w - \alpha \frac{v_{dw}^{\text{corrected}}}{\sqrt{s_{dw}^{\text{corrected}}}+\varepsilon}, & b &= b - \alpha \frac{v_{db}^{\text{corrected}}}{\sqrt{s_{db}^{\text{corrected}}}+\varepsilon}
\end{aligned}$$
Rule of thumb: α needs to be tuned for the problem; β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸ are good defaults.
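Combining the two, a per-layer Adam update sketch (shown for w only; b is handled the same way; t is the update count starting at 1):

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v_dw = beta1 * v_dw + (1 - beta1) * dw               # momentum term
    s_dw = beta2 * s_dw + (1 - beta2) * np.square(dw)    # RMSprop term
    v_corr = v_dw / (1 - beta1 ** t)                     # bias-corrected first moment
    s_corr = s_dw / (1 - beta2 ** t)                     # bias-corrected second moment
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v_dw, s_dw
```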
Learning Rate Decay
With mini-batches, a fixed learning rate makes the algorithm wander around the minimum; the smaller the steps, the smaller that wandering region and the more precise the result. But a learning rate that is too small early in training makes learning very slow, so the learning rate should decay gradually.
A learning-rate decay formula:
$$\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \times \text{epoch\_num}}$$
where epoch_num is the number of full passes over the training set.
Other formulas:
$$\alpha = 0.95^{\,\text{epoch\_num}}\,\alpha_0$$
$$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\,\alpha_0$$
and so on.
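The decay schedules above written as small functions (alpha0, decay_rate, and k are hyperparameters; epoch_num counts passes over the data, starting at 1):

```python
import numpy as np

def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    return base ** epoch_num * alpha0

def lr_sqrt_decay(alpha0, k, epoch_num):
    return k / np.sqrt(epoch_num) * alpha0
```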
The Problem of Local Optima
- In high-dimensional spaces, what look like local optima are, with high probability, just saddle points.
- When the network is large, it rarely gets stuck in a bad local optimum.
- The real problem is plateaus, which make gradient descent very slow.
Hyperparameter Tuning
Tuning Process
Priority order for tuning hyperparameters:
1. learning rate
2. momentum β, mini-batch size, number of hidden units
3. number of layers, learning-rate decay
4. everything else
- Random sampling: when trying hyperparameter values, pick points at random in the search space rather than on a regular grid.
- Coarse to fine: once a coarse search identifies a promising region, zoom in on it and sample randomly again to pick the next region to refine.
Picking an Appropriate Scale for Hyperparameters
Sampling uniformly on a linear axis distributes the search effort unevenly.
Use a logarithmic axis instead, so that ranges like 0.001-0.01, 0.01-0.1, and 0.1-1 each receive an equal share of samples. The method:
To sample uniformly on a log scale over [m, M], find a and b such that 10^a = m and 10^b = M, draw r uniformly from [a, b], and use 10^r as the hyperparameter value.
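A sketch of sampling a learning rate on a log scale between 0.0001 and 1, and of the related trick of sampling 1−β on a log scale for exponentially weighted averages (function names are illustrative):

```python
import numpy as np

def sample_log_uniform(low=1e-4, high=1.0):
    a, b = np.log10(low), np.log10(high)
    r = np.random.uniform(a, b)     # uniform on the exponent
    return 10 ** r                  # e.g. a learning rate alpha

def sample_beta(low=0.9, high=0.999):
    # sample 1 - beta on a log scale so values close to 1 are explored finely
    one_minus_beta = sample_log_uniform(1 - high, 1 - low)
    return 1 - one_minus_beta
```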
Hyperparameter Tuning in Practice
A few ways to organize the search:
1. **Panda approach:** babysit a single model, checking it every day and patiently nudging the learning rate (when compute is scarce).
2. **Caviar approach:** train many models in parallel and quickly pick the one that works best.
Normalizing Activations in a Network (Batch Norm)
Batch normalization makes the hyperparameter search easier: the network becomes more robust to the choice of hyperparameters, a wider range of values works, and training works better overall.
The idea is to normalize each layer's activations a^[i]; in practice the pre-activations z^[i] are normalized and then fed into the activation function. The formulas are:
$$\begin{aligned}
\mu &= \frac{1}{m}\sum_{i} z^{(i)} \\
\sigma^2 &= \frac{1}{m}\sum_{i} \left(z^{(i)} - \mu\right)^2 \\
z^{(i)}_{\text{norm}} &= \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}} \\
\tilde{z}^{(i)} &= \gamma\, z^{(i)}_{\text{norm}} + \beta
\end{aligned}$$
Here γ and β are learnable parameters (not hyperparameters) that set the variance and mean of the normalized values.
Fitting Batch Norm into a Neural Network
In short, Batch Norm normalizes each unit's pre-activation z^(i) and then hands the result to the activation function. Each layer has normalization parameters γ and β inside it, and parameters w and b between layers. Gradient descent still works by taking partial derivatives, e.g. β := β − α dβ, and optimizers such as Adam also apply.
Batch Norm also works together with mini-batches. Because the normalization subtracts the mean, any constant shift is cancelled, so the bias b has no effect and can be dropped. γ and β have the same dimensions as b, i.e. (n^[l], 1).
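A forward-pass sketch of Batch Norm for one layer's pre-activations Z of shape (n_l, m), with learnable gamma and beta of shape (n_l, 1) (the running averages for test time are left to the caller):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)     # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)     # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta            # learnable scale and shift
    return Z_tilde, mu, var                    # mu/var feed the test-time running averages
```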
Why Does Batch Norm Work?
1. Normalizing values to a common range speeds up learning.
2. It makes the weights of later, deeper layers more robust to changes in earlier layers, reducing the problem of shifting input distributions.
For example, it helps the weights of, say, layer ten withstand change. For deep layers, small changes in the parameters of earlier layers shift the range of their inputs, and these shifts compound layer by layer, so the later layers' parameters have to change a lot in response. Normalizing every layer's inputs keeps them within a stable range, so later layers are much less affected by changes in earlier layers or in the input data. For example, with Batch Norm a network trained to recognize black cats copes better with recognizing cats in general.
Layers become more independent of each other, and each layer gains some ability to learn on its own.
3. It has a slight regularization effect.
Because the mean and variance are computed on each mini-batch rather than on the whole dataset, they are noisy. That noise perturbs the network a little, somewhat like dropout, which helps prevent overfitting and makes the model more stable.
Batch Norm at Test Time
At test time examples are processed one at a time, so μ and σ² cannot be computed from the test data (for a single example, a mean and variance are meaningless). Instead, each layer's μ and σ² are estimated with an exponentially weighted average over the corresponding values from all the mini-batches seen during training.
Other estimation methods exist besides the exponentially weighted average and are not covered here; most reasonable estimates work.
Softmax Regression
Softmax handles classification with more than two classes, e.g. labels (0, 1, 2, 3). The softmax layer of such a network has one unit per class, each giving the probability of that class. The softmax layer computes:
1. The linear output
$$z^{[L]} = w^{[L]} a^{[L-1]} + b^{[L]}$$
2. Exponentiate (this maps negative values to positive ones, which makes normalization straightforward)
$$t = e^{z^{[L]}}$$
3. Normalize to obtain probabilities
$$a_i^{[L]} = \frac{t_i}{\sum_{j} t_j}$$
These steps together form the softmax activation function; what makes it unusual is that it takes a vector as input and produces a vector as output.
Training a Softmax Classifier
Hardmax outputs a hard 0/1 vector; softmax outputs a probability distribution (a gentler mapping).
The loss function is:
$$L(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j$$
Because y is a one-hot vector with a single 1, the only way to decrease this loss is to raise the predicted probability of the true class.
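A sketch of the softmax output and its cross-entropy loss for one-hot labels (Z and Y both have shape (C, m), one column per example; the max-shift is an added numerical-stability trick, not part of the formulas above):

```python
import numpy as np

def softmax(Z):
    t = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # shift for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)

def cross_entropy(Y_hat, Y, eps=1e-12):
    m = Y.shape[1]
    # average of L(y_hat, y) = -sum_j y_j * log(y_hat_j) over the batch
    return -np.sum(Y * np.log(Y_hat + eps)) / m
```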
Deep Learning Frameworks
Some deep learning frameworks:
- Caffe/Caffe2
- CNTK
- DL4J
- Keras
- Lasagne
- mxnet
- PaddlePaddle
- TensorFlow
- Theano
- Torch
Criteria for choosing a framework: ease of programming, running speed, and how truly open it is.
TensorFlow
You only need to implement the forward pass and define the loss function; TensorFlow computes the backward pass and updates the parameters to reduce the loss, which greatly cuts down the amount of deep-learning code.
A code example follows.
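The original screenshot is not reproduced here; below is a hedged reconstruction of the classic example from the lecture (minimizing the cost w² − 10w + 25), written against the TensorFlow 1.x API that the course used:

```python
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)
cost = w**2 - 10*w + 25                      # only the forward pass is defined
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for _ in range(1000):
        session.run(train)                   # TF derives gradients and updates w
    print(session.run(w))                    # approaches 5, the minimizer of (w - 5)^2
```

In TensorFlow 2.x the same idea is written with tf.GradientTape (or an optimizer applied to a callable loss), but the point is unchanged: you define only the forward computation of the cost, and the framework handles backpropagation.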

987

被折叠的 条评论
为什么被折叠?



