Andrew Ng Deep Learning: Course 2

Tips for Training, Optimization, and Validation

Train / Dev / Test Sets

The data is usually split into three parts: a training set, a dev (validation) set, and a test set.

  • The training set is what the learning algorithm is trained on.
  • The dev set is used to compare different models and pick the final one (what earlier machine learning material loosely called the "test set" was often really a dev set).
  • The test set is used for the final evaluation of the chosen model (an unbiased estimate of its performance).

In the big-data era the split does not have to follow the traditional 70/30 ratio strictly; when the dataset is very large, it is enough that the dev and test sets contain sufficiently many examples.

Rule of thumb: to keep the training and dev results from diverging too much, make sure the training and dev sets come from the same distribution.

Bias and Variance

  • Variance measures the gap between the model's performance on the training set and on the dev set.

  • Bias measures the model's performance on the training set itself.

    High bias means underfitting.

    High variance means overfitting.

Basic Recipe for Machine Learning

If the fitted model shows high bias, change the network itself (make it deeper, add units, or change the architecture).

If it shows high variance, use more data or add regularization.

Why deep learning cares less about trading off bias against variance: making the network bigger almost always helps.

Rule of thumb: if the network performs poorly, try a bigger network.

Regularization

Regularization is an effective way to reduce overfitting.

The regularized cost function has the general form:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\left\|w\right\|_2^2$$
The most common choice is L2 regularization (using the squared L2 norm, which for weight matrices is the Frobenius norm):
$$\left\|w\right\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$$
L1 regularization (using the L1 norm) is:
$$\left\|w\right\|_1 = \sum_{j=1}^{n_x} \left|w_j\right|$$
Because $w$ can contain a very large number of parameters while $b$ is just a single scalar with little influence, $b$ is usually left out of the regularization term.

Here $\lambda$ is the regularization parameter; it is a hyperparameter, tuned on the dev set (or by cross-validation).

With L2 regularization, the gradient $dw$ in backpropagation becomes:
$$dw^{[L]} = dw^{[L]} + \frac{\lambda}{m} w^{[L]}$$
This shrinks the weights $w$ a little on every update, which is why L2 regularization is also called weight decay.
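As an illustration only (not code from the course), a minimal NumPy sketch of adding the L2 term to the cost and to one layer's gradient could look like this; `lambd`, `m`, and the shapes involved are assumed for the example:

```python
import numpy as np

def add_l2_regularization(W, dW_data, data_cost, lambd, m):
    """Add the L2 (weight decay) term to the cost and to dW for one layer.

    data_cost : unregularized cost, (1/m) * sum of the per-example losses
    dW_data   : gradient of the unregularized cost with respect to W
    lambd     : regularization strength lambda
    m         : number of training examples
    """
    cost = data_cost + (lambd / (2 * m)) * np.sum(np.square(W))
    dW = dW_data + (lambd / m) * W
    return cost, dW
```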

Why Regularization Reduces Overfitting

With regularization, large weights are penalized and pushed back toward smaller values. Intuitively, when some components of $w$ get close to 0, the corresponding units contribute almost nothing, as if they had been removed, so the network behaves like a simpler one and overfits less. If the regularization is too strong, however, the network becomes too simple and underfits, so a suitable hyperparameter $\lambda$ has to be found.

When $\lambda$ is large, $w$ tends to be small. Taking $\tanh$ as an example, a small $w$ makes $z$ correspondingly small and concentrated in the middle of the activation, where $\tanh$ is nearly linear. If every layer is nearly linear, the whole network is greatly simplified.

Dropout Regularization

Dropout regularization randomly deactivates units during training with a chosen probability, so that the network cannot rely too heavily on any particular unit.

The steps of (inverted) dropout are as follows; a consolidated code sketch is given after the list:

1. Generate a random matrix

d3 = np.random.rand(a3.shape[0], a3.shape[1])

2. Turn the random matrix into a 0/1 mask

With, say, keep_prob = 0.8, entries of the random matrix below 0.8 become 1 and entries above 0.8 become 0.

3. Zero out the dropped units

a3 = np.multiply(a3, d3), which deactivates the corresponding units.

4. Rescale to keep the expected value unchanged

a3 /= keep_prob
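Putting the four steps together, a minimal sketch of inverted dropout for one layer's activations might look like the following (the layer name a3 and the keep_prob value are illustrative):

```python
import numpy as np

def inverted_dropout(a3, keep_prob=0.8):
    """Apply inverted dropout to the activations a3 of one layer (training time only)."""
    d3 = np.random.rand(*a3.shape) < keep_prob  # 0/1 mask, 1 with probability keep_prob
    a3 = np.multiply(a3, d3)                    # zero out the dropped units
    a3 /= keep_prob                             # rescale so the expected value is unchanged
    return a3, d3                               # d3 is reused to mask dA3 during backprop

# example
a3 = np.random.randn(5, 10)
a3_dropped, mask = inverted_dropout(a3, keep_prob=0.8)
```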

Understanding Dropout

Because any of a unit's inputs can be dropped, the network cannot put too much weight on any single input, which spreads the weights out, an effect similar to shrinking the squared norm of the weights. Dropout is widely used in computer vision and less so elsewhere; it is mainly worth reaching for once the model is actually overfitting.

Drawback: training becomes harder to reason about, because the cost function is no longer well defined from iteration to iteration, which makes debugging harder.

Rule of thumb: the larger and more complex a layer's weight matrix $w^{[l]}$, the smaller its keep_prob should be (i.e. the higher its dropout probability).

Rule of thumb: for the input layer, keep_prob is usually above 0.9, and most of the time simply 1.

Other Regularization Methods

1. Data augmentation: for image data, new training examples can be created by flipping images horizontally or taking random (but sensible) crops.

2. Early stopping: while training, plot the cost on the training set and on the dev set. Typically the dev-set cost falls and then rises again, reaching its minimum first; stop training when the dev-set cost reaches that minimum.

Drawback: stopping early means giving up on driving the training cost lower, so reducing overfitting and minimizing the cost get entangled in a single knob, which makes the overall problem harder to reason about.

Normalizing Inputs

Normalizing the inputs is one way to speed up training.

1. Subtract the mean (zero-centering)
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x := x - \mu$$
2. Normalize the variance
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2, \qquad x := \frac{x}{\sigma}$$
(both operations are element-wise; dividing by the standard deviation $\sigma$ gives each feature unit variance)
These two steps spread the features fairly evenly around the origin.

Note: the test set must be normalized with the same $\mu$ and $\sigma^2$ computed on the training set.

Normalization makes the cost surface more symmetric and the gradients more stable, so the optimum is easier to find, optimization is easier, and a larger learning rate can be used.
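As an assumed example (with X of shape (n_x, m), examples stored as columns as in the course), computing the statistics on the training set and reusing them on every other split might look like this:

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mean and standard deviation on the training set (shape (n_x, m))."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    return mu, sigma

def apply_normalizer(X, mu, sigma, eps=1e-8):
    """Apply the training-set statistics to any split (train, dev, or test)."""
    return (X - mu) / (sigma + eps)

X_train = np.random.randn(3, 1000) * 5 + 2
mu, sigma = fit_normalizer(X_train)
X_train_norm = apply_normalizer(X_train, mu, sigma)
```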

Vanishing and Exploding Gradients

In a very deep network the activations, and with them the gradients, can shrink or grow roughly exponentially with the number of layers, ending up close to 0 or extremely large: vanishing or exploding gradients. The scale of the weights $w$ therefore has to be chosen carefully.

Weight Initialization for Deep Networks

To keep $z$ from blowing up, $w$ should be small when the number of inputs $n$ to a unit is large, and can be larger when $n$ is small. For ReLU activations a common initialization is:
w[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1])
This is He initialization, and it mitigates vanishing and exploding gradients.
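A small sketch of initializing all layers this way (with the Xavier-style scale sqrt(1/n) that the course mentions as an alternative for tanh) could be:

```python
import numpy as np

def init_params(layer_dims, relu=True):
    """He initialization sqrt(2/n_prev) for ReLU layers, Xavier-style sqrt(1/n_prev) otherwise."""
    params = {}
    for l in range(1, len(layer_dims)):
        scale = np.sqrt((2.0 if relu else 1.0) / layer_dims[l - 1])
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = init_params([5, 4, 3, 1])   # a 5-input network with two hidden layers
```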

Numerical Approximation of Gradients

A two-sided difference approximates the derivative far more accurately than a one-sided one:
$$\frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$$

Gradient Checking

Gradient checking is used to verify that backpropagation is implemented correctly.

1. Reshape all the $W$ and $b$ matrices into vectors and concatenate them into one big vector $\theta$.

2. Reshape all the $dW$ and $db$ matrices into vectors and concatenate them into $d\theta$.

3. Loop over every component $i$ to compute $d\theta_{\text{approx}}$:
$$d\theta_{\text{approx}}[i] = \frac{J(\theta_1, \theta_2, \ldots, \theta_i + \varepsilon, \ldots) - J(\theta_1, \theta_2, \ldots, \theta_i - \varepsilon, \ldots)}{2\varepsilon}$$
4. Check how close $d\theta$ and $d\theta_{\text{approx}}$ are:
$$\frac{\left\|d\theta_{\text{approx}} - d\theta\right\|_2}{\left\|d\theta_{\text{approx}}\right\|_2 + \left\|d\theta\right\|_2}$$
This ratio should be very small. If it is on the order of $\varepsilon$ (say $10^{-7}$), the analytic gradients are almost certainly correct; if it is much larger, be careful and go looking for a bug.
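A minimal sketch of this check for a generic cost function J(theta), with an assumed analytic gradient function grad(theta):

```python
import numpy as np

def gradient_check(J, grad, theta, eps=1e-7):
    """Compare the analytic gradient grad(theta) with a two-sided numerical approximation."""
    d_theta = grad(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    num = np.linalg.norm(d_theta_approx - d_theta)
    denom = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return num / denom   # around 1e-7 or smaller is reassuring

# toy example: J(theta) = sum(theta^2) has gradient 2*theta
theta = np.random.randn(10)
print(gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, theta))
```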

Notes on Implementing Gradient Checking

  • Do not run gradient checking during training; use it only for debugging.
  • If the check fails, inspect the individual components to try to locate the bug.
  • When using regularization, remember that both the cost and the gradients include the regularization term.
  • Do not use gradient checking together with dropout.
  • Optionally, run the check right after random initialization and then again after some training (less commonly done).

Choosing a Gradient Descent Variant

Mini-batch Gradient Descent

When the training set is extremely large, processing all of it for every update is slow and expensive. Instead, split the data into small subsets called mini-batches, written $X^{\{t\}}$, and split $Y$ the same way. The parameters are then updated continually, each step using a different mini-batch.
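A sketch of splitting (X, Y), stored column-wise as in the course, into shuffled mini-batches (the mini_batch_size of 64 is just an example):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Split (X, Y), with examples as columns, into shuffled mini-batches."""
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, mini_batch_size):
        batches.append((X_shuffled[:, k:k + mini_batch_size],
                        Y_shuffled[:, k:k + mini_batch_size]))
    return batches
```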

Understanding Mini-batch Gradient Descent

(figure: comparison of the cost curves for batch vs. mini-batch gradient descent, not reproduced)

With full-batch gradient descent the cost decreases smoothly from iteration to iteration; with mini-batch gradient descent the curve is noisy but still trends downward.

The mini-batch size is a hyperparameter. Setting it to $m$ recovers batch gradient descent; setting it to 1 gives stochastic gradient descent (one example per update), which is noisier but takes steps much faster and still heads toward the optimum overall. In practice a mini-batch size that is neither too large nor too small works best.

Exponentially Weighted Averages

$$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$$

Here $\beta$ is an adjustable parameter. The larger $\beta$ is, the smoother the curve and the closer it stays to a long-run average; the smaller it is, the faster it adapts to changes.

Understanding Exponentially Weighted Averages

$v_t$ is a running average. It is slightly inexact, but it needs only one line of code and essentially one variable's worth of memory, updated repeatedly as $v := \beta v + (1-\beta)\theta_t$. In short, it is a cheap way to compute and store a moving average.

Bias Correction for Exponentially Weighted Averages

Because $v_0 = 0$, the first few estimates come out much smaller than the true values, so a bias correction is applied: after the usual update $v_t = \beta v_{t-1} + (1-\beta)\theta_t$, use
$$v_t^{\text{corrected}} = \frac{v_t}{1 - \beta^t}$$
Early on, dividing by $1-\beta^t$ scales the estimate up a little; later, $1-\beta^t$ approaches 1 and the correction no longer changes anything. Since the bias only affects the early phase, the correction can be skipped if the early estimates do not matter.
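A small sketch of the exponentially weighted average with optional bias correction (the data and beta = 0.9 are illustrative):

```python
import numpy as np

def ewa(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, with optional bias correction."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

temps = 20 + np.random.randn(100)   # noisy daily values
smoothed = ewa(temps, beta=0.9)
```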

Momentum (Gradient Descent with Momentum)

This method (it works with mini-batches) applies the exponentially weighted average above to the gradients, and in practice it almost always runs faster than plain gradient descent:
$$\begin{aligned}
v_{dw} &= \beta v_{dw} + (1-\beta)\,dw \\
v_{db} &= \beta v_{db} + (1-\beta)\,db \\
w &= w - \alpha\, v_{dw} \\
b &= b - \alpha\, v_{db}
\end{aligned}$$
Each update uses an average of the recent gradients instead of the raw gradient. Plain gradient descent oscillates up and down on its way toward the optimum; averaging makes those oscillations cancel out while the component pointing toward the optimum is reinforced. The result is less oscillation and faster learning.

Rule of thumb: $\beta = 0.9$ works well in most cases.

Rule of thumb: bias correction is usually not added here unless there is a specific reason to.

RMSprop

$$\begin{aligned}
s_{dw} &= \beta s_{dw} + (1-\beta)\,dw^2 \\
s_{db} &= \beta s_{db} + (1-\beta)\,db^2 \\
w &= w - \alpha\, \frac{dw}{\sqrt{s_{dw}}} \\
b &= b - \alpha\, \frac{db}{\sqrt{s_{db}}}
\end{aligned}$$

This achieves an effect similar to momentum: it damps the oscillations, which in turn allows a larger learning rate. (In practice a small $\varepsilon$ is added to the denominator to avoid division by zero.)

Adam Optimization

Combining momentum with RMSprop gives the Adam optimization algorithm:
$$\begin{aligned}
v_{dw} &= \beta_1 v_{dw} + (1-\beta_1)\,dw, & v_{db} &= \beta_1 v_{db} + (1-\beta_1)\,db \\
s_{dw} &= \beta_2 s_{dw} + (1-\beta_2)\,dw^2, & s_{db} &= \beta_2 s_{db} + (1-\beta_2)\,db^2 \\
v_{dw}^{\text{corrected}} &= \frac{v_{dw}}{1-\beta_1^t}, & v_{db}^{\text{corrected}} &= \frac{v_{db}}{1-\beta_1^t} \\
s_{dw}^{\text{corrected}} &= \frac{s_{dw}}{1-\beta_2^t}, & s_{db}^{\text{corrected}} &= \frac{s_{db}}{1-\beta_2^t} \\
w &= w - \alpha\, \frac{v_{dw}^{\text{corrected}}}{\sqrt{s_{dw}^{\text{corrected}}} + \varepsilon}, & b &= b - \alpha\, \frac{v_{db}^{\text{corrected}}}{\sqrt{s_{db}^{\text{corrected}}} + \varepsilon}
\end{aligned}$$
Rule of thumb: $\alpha$ needs to be tuned for the problem; $\beta_1 = 0.9$; $\beta_2 = 0.999$; $\varepsilon = 10^{-8}$.
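A minimal NumPy sketch of one Adam update step for a single parameter array (the state v, s and the step counter t are assumed to be carried between calls):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter w with gradient dw; returns the updated (w, v, s)."""
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    w = w - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return w, v, s

w = np.random.randn(3, 3)
v, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    dw = 2 * w                     # gradient of sum(w**2), just for illustration
    w, v, s = adam_step(w, dw, v, s, t)
```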

Learning Rate Decay

With a fixed learning rate, mini-batch gradient descent ends up wandering around in a neighborhood of the minimum; the smaller the learning rate, the smaller and more precise that neighborhood. But a learning rate that is too small from the start makes early training very slow, so it helps to let the learning rate decay gradually.

One decay formula is:
$$\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \times \text{epoch\_num}}$$
where epoch_num is the number of passes through the training data.

Other formulas include:
$$\alpha = 0.95^{\,\text{epoch\_num}}\;\alpha_0$$

$$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\;\alpha_0$$

and so on.
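A tiny sketch of these schedules (decay_rate, k, and alpha0 are illustrative values):

```python
def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    return base ** epoch_num * alpha0

def lr_sqrt_decay(alpha0, k, epoch_num):
    return k / (epoch_num ** 0.5) * alpha0

for epoch in range(1, 6):
    print(epoch, round(lr_inverse_decay(0.2, 1.0, epoch), 4))
```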

The Problem of Local Optima

  • What looks like a local optimum is, in a high-dimensional space, far more likely to be a saddle point.

  • When the network is large, getting stuck in a genuinely bad local optimum is rare.

  • The real problem is plateaus, regions where the gradient stays close to zero for a long stretch, which make gradient descent very slow.

Hyperparameter Tuning

Tuning Process

A rough priority order for tuning hyperparameters:

1. Learning rate $\alpha$

2. Momentum $\beta$, mini-batch size, number of hidden units

3. Number of layers, learning rate decay

4. Everything else

  • Random sampling: when trying hyperparameter values, pick points at random in the search space rather than on a regular grid.

  • Coarse to fine: once coarse sampling identifies a promising region, zoom in on that region and sample again at a finer scale, then repeat.

Choosing an Appropriate Scale for Hyperparameters

Sampling uniformly on a linear scale spreads the search budget very unevenly for parameters such as the learning rate.

A log scale should be used instead, so that the ranges 0.001-0.01, 0.01-0.1, and 0.1-1 each receive an equal share of the search budget. The implementation (see the sketch below) is:

To sample a value in $[m, M]$ uniformly on a log scale, find $a$ and $b$ with $10^a = m$ and $10^b = M$, draw $r$ uniformly from $[a, b]$, and use $10^r$ as the hyperparameter value.
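For example, sampling a learning rate between 1e-4 and 1 on a log scale (the bounds are illustrative); the same trick applied to 1 - beta gives momentum values roughly between 0.9 and 0.999:

```python
import numpy as np

def sample_log_uniform(low, high, size=None):
    """Sample uniformly on a log10 scale between low and high (both > 0)."""
    a, b = np.log10(low), np.log10(high)
    r = np.random.uniform(a, b, size=size)
    return 10.0 ** r

alphas = sample_log_uniform(1e-4, 1.0, size=5)        # candidate learning rates
betas = 1 - sample_log_uniform(1e-3, 1e-1, size=5)    # beta roughly in [0.9, 0.999]
```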

Hyperparameter Tuning in Practice

Two ways of organizing the tuning work:

1. **Panda approach:** babysit a single model, watch its performance every day, and patiently adjust the learning rate and other settings (when compute is scarce).

2. **Caviar approach:** train many models in parallel with different settings and simply pick the one that works best.

Normalizing Activations in a Network (Batch Norm)

Batch normalization makes the hyperparameter search easier: the network becomes more robust to the choice of hyperparameters, a wider range of values works, and training tends to work better overall.

The idea is to normalize each layer's values. In principle one could normalize the activations $a^{[l]}$, but in practice it is the pre-activations $z^{[l]}$ that are normalized before being fed into the activation function. The formulas are:
$$\begin{aligned}
\mu &= \frac{1}{m}\sum_i z^{(i)} \\
\sigma^2 &= \frac{1}{m}\sum_i \left(z^{(i)} - \mu\right)^2 \\
z_{\text{norm}}^{(i)} &= \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}} \\
\tilde{z}^{(i)} &= \gamma\, z_{\text{norm}}^{(i)} + \beta
\end{aligned}$$
Here $\gamma$ and $\beta$ control the variance and mean of $\tilde{z}$; they are learnable parameters of the network, not hyperparameters.
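A minimal sketch of the batch-norm forward step for one layer at training time (Z has shape (n, m) with examples as columns; gamma and beta have shape (n, 1)):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z across the mini-batch, then rescale and shift with gamma and beta."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, mu, var      # mu and var are also tracked for use at test time

Z = np.random.randn(4, 32) * 3 + 1
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
```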

Fitting Batch Norm into a Neural Network

In short, Batch Norm normalizes each unit's pre-activation $z^{(i)}$ and then hands the result to the activation function. Inside each layer there are normalization parameters $\gamma$ and $\beta$; between layers there are the usual parameters $w$ and $b$. All of them are trained by gradient descent in the usual way, e.g. $\beta = \beta - \alpha\, d\beta$, and optimizers such as Adam work just as well.

Batch Norm also works together with mini-batch gradient descent. Note that because the normalization subtracts the mean, any constant added to $z$ is cancelled out, so the bias $b$ has no effect and can be dropped; $\gamma$ and $\beta$ have the same dimensions that $b$ would have.

Why Batch Norm Works

1. As with input normalization, squashing the values into a common range speeds up learning.

2. It makes the weights of later, deeper layers more robust to changes in the earlier layers, reducing the problem of shifting input distributions.

For example, it helps the weights of, say, layer ten withstand change. Without Batch Norm, a small change in the early layers shifts the range of inputs the later layers see; propagated through many layers those shifts grow, so the later layers' parameters have to keep chasing a moving target, and the early and late layers end up tightly coupled. With every layer's inputs normalized, the inputs stay within a fixed range, and the later layers are much less affected by changes in the earlier layers or in the raw input distribution. (The course's example: with Batch Norm, a network trained to recognize black cats generalizes better to recognizing cats in general.)

The layers become more independent of one another, and each layer can learn somewhat on its own.

3. It has a slight regularizing effect.

Because the mean and variance are computed on each mini-batch rather than on the whole dataset, they carry some noise. That noise perturbs the network a little, similar in spirit to dropout, which helps prevent overfitting and makes the model a bit more stable.

Batch Norm at Test Time

At test time examples may have to be processed one at a time, so $\mu$ and $\sigma^2$ cannot be computed from the test data (the mean and variance of a single example are meaningless). Instead, for each layer, estimates of $\mu$ and $\sigma^2$ are kept as exponentially weighted averages over all the mini-batches seen during training, and those estimates are used at test time.

Estimation schemes other than the exponentially weighted average also exist and are not covered here; most reasonable choices work.

Softmax Regression

For classification with more than two classes (e.g. labels 0, 1, 2, 3), the network's softmax output layer has one unit per class, each giving the probability of that class. The softmax layer works as follows:

1. Compute the linear output
$$z^{[L]} = w^{[L]} a^{[L-1]} + b^{[L]}$$
2. Exponentiate (this maps negative values to positive ones, which makes normalization straightforward)
$$t = e^{z^{[L]}}$$
3. Normalize to obtain probabilities
$$a_i^{[L]} = \frac{t_i}{\sum_j t_j}$$
These steps together make up the softmax activation function; its special feature is that it takes a vector as input and produces a vector as output.
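A small NumPy sketch of the softmax activation (subtracting the maximum first is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of logits z; returns a probability vector that sums to 1."""
    t = np.exp(z - np.max(z))    # subtract max(z) to avoid overflow; the result is unchanged
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])
a = softmax(z)                   # largest probability for the entry with z = 5
```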

Training a Softmax Classifier

A "hardmax" would output a one-hot 0/1 vector; softmax instead outputs a probability distribution (a gentler mapping).

The loss function is:
$$L(\tilde{y}, y) = -\sum_{j=1}^{C} y_j \log \tilde{y}_j$$
Since $y$ is one-hot (only one entry equals 1), the only way to make this loss small is to raise the predicted probability of the class whose entry in $y$ is 1.
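A tiny sketch of this loss for a single example (y one-hot, y_hat from the softmax above; the small constant guards against log(0)):

```python
import numpy as np

def softmax_cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy loss for one example with one-hot label y."""
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0, 1, 0, 0])
y_hat = np.array([0.1, 0.7, 0.1, 0.1])
print(softmax_cross_entropy(y_hat, y))   # -log(0.7), about 0.357
```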

Deep Learning Frameworks

Some deep learning frameworks:

  • Caffe/Caffe2
  • CNTK
  • DL4J
  • Keras
  • Lasagne
  • mxnet
  • PaddlePaddle
  • TensorFlow
  • Theano
  • Torch

Criteria for choosing a framework: ease of programming, running speed, and how truly open it is.

TensorFlow

You only need to implement the forward pass and define the cost function; TensorFlow computes the backward pass and updates the parameters to reduce the cost on its own, which greatly cuts down the amount of deep learning code to write.

A code example follows:


(The original post shows a screenshot of the TensorFlow example here.)
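Since the screenshot is not reproduced, here is a minimal TensorFlow 2 sketch in the spirit of the course's example, minimizing the toy cost $J(w) = w^2 - 10w + 25 = (w-5)^2$, whose optimum is $w = 5$ (the TF2 GradientTape style is an assumption; the original screenshot uses the older TF1 API):

```python
import tensorflow as tf

w = tf.Variable(0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(500):
    with tf.GradientTape() as tape:
        cost = w ** 2 - 10.0 * w + 25.0      # only the forward pass / cost is written by hand
    grads = tape.gradient(cost, [w])         # TensorFlow works out the backward pass
    optimizer.apply_gradients(zip(grads, [w]))

print(w.numpy())   # approximately 5.0
```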
