Deep Learning Related Concepts

This post takes a closer look at core concepts in deep learning training, including the differences between epoch, iteration, and batch size; the role of batch normalization and its use in conditional models such as conditional batch normalization; and how optimisation is handled for non-differentiable operations. It also contrasts translational invariance with translational equivariance in convolutional neural networks.

  • Epoch

One epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE [1].

  • Iteration

The number of iterations is the number of batches needed to complete one epoch [1] (see the short example below).
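As a quick illustration of how epoch, iteration, and batch size relate, the minimal sketch below (with made-up numbers) computes how many iterations make up one epoch.

```python
# Minimal sketch of the epoch / iteration / batch-size relationship.
# The dataset size and batch size here are illustrative, not taken from [1].
import math

num_samples = 2000   # size of the ENTIRE training set
batch_size = 100     # samples processed per forward/backward pass

iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)  # 20 iterations complete one epoch
```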

Batch normalization [3]

Use $B$ to denote a mini-batch of size $m$ of the entire training set. The empirical mean and variance of $B$ could thus be denoted as
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2.$$

For a layer of the network with $d$-dimensional input, $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension of its input is then normalized separately:
$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{{\sigma_B^{(k)}}^2 + \epsilon}}.$$

To restore the representation power of the network, a transformation step then follows:
$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},$$

where the parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are subsequently learned in the optimization process.

Formally, the operation that implements batch normalization is a transform $BN_{\gamma^{(k)},\beta^{(k)}}: x_{1 \ldots m}^{(k)} \rightarrow y_{1 \ldots m}^{(k)}$ called the Batch Normalizing transform. The output of the BN transform, $y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}(x^{(k)})$, is then passed to other network layers, while the normalized output $\hat{x}_i^{(k)}$ remains internal to the current layer.
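The short NumPy sketch below implements the training-time Batch Normalizing transform following the formulas above; it leaves out the running mean/variance that a real BN layer tracks for inference.

```python
# Minimal NumPy sketch of the training-time Batch Normalizing transform above.
# Omits the running statistics a real BN layer keeps for inference.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable parameters."""
    mu = x.mean(axis=0)                    # mu_B, per dimension k
    var = x.var(axis=0)                    # sigma_B^2, per dimension k
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # scale and shift: y = gamma * x_hat + beta

x = np.random.randn(32, 8)                 # mini-batch of size m=32, d=8
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```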

Conditional batch normalization [4]

CBN instead learns to output new BN parameters $\hat{\gamma}_{i,c}$ and $\hat{\beta}_{i,c}$ as a function of some input $\boldsymbol{x}_i$:
$$\hat{\gamma}_{i,c} = f_c(\boldsymbol{x}_i), \quad \hat{\beta}_{i,c} = h_c(\boldsymbol{x}_i),$$

where $f$ and $h$ are arbitrary functions such as neural networks. Thus, $f$ and $h$ can learn to control the distribution of CNN activations based on $\boldsymbol{x}_i$.

Combined with ReLU non-linearities, CBN empowers a conditioning model to manipulate feature maps of a target CNN by scaling them up or down, negating them, shutting them off, selectively thresholding them, and more. Each feature map is modulated independently, giving the conditioning model an exponential (in the number of feature maps) number of ways to affect the feature representation.

Rather than output $\hat{\gamma}_{i,c}$ directly, [4] output $\Delta\hat{\gamma}_{i,c}$, where:
$$\hat{\gamma}_{i,c} = 1 + \Delta\hat{\gamma}_{i,c},$$

since initially zero-centred $\hat{\gamma}_{i,c}$ can zero out CNN feature map activations and thus gradients.
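A minimal PyTorch sketch of this idea follows. The module below is a hypothetical illustration, not the implementation from [4]: a linear layer predicts $\Delta\hat{\gamma}$ and $\hat{\beta}$ from a conditioning vector, and $\hat{\gamma} = 1 + \Delta\hat{\gamma}$ modulates the BN-normalized feature maps.

```python
# Hypothetical sketch of conditional batch normalization in the spirit of [4]:
# gamma and beta are predicted from a conditioning vector, with gamma = 1 + delta_gamma.
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_features, cond_dim):
        super().__init__()
        # affine=False: gamma and beta come from the conditioning input instead
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.to_delta_gamma = nn.Linear(cond_dim, num_features)
        self.to_beta = nn.Linear(cond_dim, num_features)

    def forward(self, x, cond):
        x_hat = self.bn(x)                       # (N, C, H, W), normalized
        gamma = 1.0 + self.to_delta_gamma(cond)  # gamma = 1 + delta_gamma
        beta = self.to_beta(cond)
        return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]

cbn = ConditionalBatchNorm2d(num_features=16, cond_dim=32)
out = cbn(torch.randn(4, 16, 8, 8), torch.randn(4, 32))  # per-sample modulation
```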

Conditional normalization [7,8]
  • Conditional normalization aims to modulate the activation via a learned affine transformation conditioned on external data (e.g., an image of an artwork for capturing a specific style).
  • Conditional normalization methods include Conditional Batch Normalization for general visual question answering on complex scenes such as VQA and GuessWhat, Conditional Instance Normalization (Dumoulin et al., 2017) and Adaptive Instance Normalization (Huang & Belongie, 2017) for image stylization, Dynamic Layer Norm for speech recognition, and SPADE (Park et al., 2019).
  • Conditional normalization methods are widely used in the style transfer and image synthesis tasks, and also applied to align different data distributions for domain adaptation.
  • FiLM [8] can be viewed as a generalization of CN methods.
    CN - “replace the parameters of the feature-wise affine transformation typical in normalization layers”
    FiLM - “not strictly necessary for the affine transformation to occur directly after normalization”
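To contrast with the CBN sketch above, the hypothetical FiLM-style layer below applies the feature-wise affine transformation directly to convolutional feature maps, with no preceding normalization step, which is the generalization noted in the FiLM bullet above.

```python
# Hypothetical sketch of FiLM-style feature-wise modulation [8]: gamma and beta are
# predicted from a conditioning vector and applied per feature map, with no
# normalization layer required beforehand. Shapes and names are illustrative.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_features)

    def forward(self, feats, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma[:, :, None, None] * feats + beta[:, :, None, None]

film = FiLM(num_features=16, cond_dim=32)
out = film(torch.randn(4, 16, 8, 8), torch.randn(4, 32))
```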
Comments on the effectiveness of convolutional networks
  • Local connectivity can greatly reduce the number of parameters in the model, which inherently provides some form of built-in regularization (see the rough parameter-count comparison after this list).
  • The convolution operation has a direct filtering interpretation, where each filter is convolved with the input features to identify patterns over groups of pixels. Thus, the outputs of each convolutional layer correspond to important spatial features in the original input space and offer some robustness to simple transformations.
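To make the parameter-saving point concrete, the rough comparison below (with made-up layer sizes) counts the weights of a small convolutional layer against a fully connected layer mapping between inputs and outputs of the same size.

```python
# Rough parameter-count comparison for the local-connectivity claim above.
# The layer sizes are made up for illustration.
import torch.nn as nn

in_ch, out_ch, h, w = 3, 16, 32, 32

conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # output: (16, 32, 32)
fc = nn.Linear(in_ch * h * w, out_ch * h * w)              # same input/output sizes

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))  # 16*3*3*3 + 16                =        448 parameters
print(count(fc))    # (3*32*32)*(16*32*32) + 16*32*32 ≈ 50 million parameters
```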
Translational invariance vs Translational equivariance [5,6]
  • Translation invariance means that the system produces exactly the same response, regardless of how its input is shifted. Equivariance means that the system works equally well across positions, but its response shifts with the position of the target. For example, a heat map of “face-iness” would have similar bumps at different positions.
  • Convolution provides translational equivariance (rather than translational invariance): if an object in an image is at area A and convolution detects a feature at area B of the output, then the same feature is still detected when the object is translated to a new area A', with the detected feature shifted to a correspondingly translated area B' (see the toy check after this list).
  • Translational Invariance is a result of the pooling operation (not the convolutional operation).
  • Additional info (see [5]): 1. The convolution operator commutes with respect to translation. 2. One approach to translation-invariant object recognition is via template-matching.
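The toy check below (with arbitrary sizes and shift) illustrates both properties: the convolution response shifts along with a shifted input (equivariance), while taking the global maximum over the feature map gives the same value regardless of position (invariance).

```python
# Toy demonstration of equivariance (convolution) vs invariance (global pooling).
# Sizes and the shift amount are arbitrary.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 16, 16)
x[0, 0, 4:7, 4:7] = 1.0                                   # an "object" at area A
x_shifted = torch.roll(x, shifts=(5, 5), dims=(2, 3))     # the same object at A'

y, y_shifted = conv(x), conv(x_shifted)

# Equivariance: the feature response shifts by the same amount (B -> B').
print(torch.allclose(torch.roll(y, (5, 5), dims=(2, 3)), y_shifted, atol=1e-6))  # True

# Invariance: global max pooling gives the same response regardless of position.
print(torch.isclose(y.max(), y_shifted.max()))  # True
```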
Optimisation over non-differentiable kNN operations

In DNNs, there is a common trick for computing the gradient of operations that are non-differentiable at some points but differentiable elsewhere, such as max-pooling (top-1) and top-k. In the forward pass, the index positions of the max (or top-k) values are stored; in the backpropagation pass, the gradient is computed only with respect to these saved positions. This trick is implemented in modern deep learning frameworks such as TensorFlow (tf.nn.top_k()) and PyTorch.
Source: https://openreview.net/forum?id=SyVuRiC5K7 (Response to AnonReviewer 2)
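A small check of this gradient-routing behaviour, using PyTorch's torch.topk under autograd: gradients flow only to the positions selected in the forward pass.

```python
# Gradients propagate only to the positions selected by top-k in the forward pass.
import torch

x = torch.tensor([0.5, 2.0, -1.0, 3.0, 1.0], requires_grad=True)
values, indices = torch.topk(x, k=2)  # forward pass stores the selected indices
values.sum().backward()               # backward pass routes gradients to them

print(indices)  # tensor([3, 1])
print(x.grad)   # tensor([0., 1., 0., 1., 0.])  -> nonzero only at saved positions
```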

References

  1. https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
  2. Siamese Neural Networks for One-shot Image Recognition
  3. https://en.wikipedia.org/wiki/Batch_normalization
  4. Perez, Ethan, et al. “Learning visual reasoning without strong priors.” arXiv preprint arXiv:1707.03017 (2017).
  5. https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo
  6. Translational Invariance Vs Translational Equivariance. https://towardsdatascience.com/translational-invariance-vs-translational-equivariance-f9fbc8fca63a
  7. Tseng, Hung-Yu, et al. “Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation.” ICLR. 2019.
  8. Perez, Ethan, et al. “FiLM: Visual reasoning with a general conditioning layer.” AAAI. 2018.