Deep Learning Related Concepts

This post takes a closer look at core concepts in deep learning training, including the differences between epoch, iteration, and batch size; the role of batch normalization and its use in conditional models such as conditional batch normalization; and how optimisation is handled for non-differentiable operations. It also contrasts translational invariance with translational equivariance in convolutional neural networks.

  • Epoch

One epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE [1].

  • Iteration

The number of iterations is the number of batches needed to complete one epoch [1] (see the short example below).
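As a quick illustration of how epoch, iteration, and batch size relate, the minimal sketch below (with made-up numbers) computes how many iterations make up one epoch.

```python
# Minimal sketch of the epoch / iteration / batch-size relationship.
# The dataset size and batch size here are illustrative, not taken from [1].
import math

num_samples = 2000   # size of the ENTIRE training set
batch_size = 100     # samples processed per forward/backward pass

iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)  # 20 iterations complete one epoch
```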

Batch normalization [3]

Use $B$ to denote a mini-batch of size $m$ of the entire training set. The empirical mean and variance of $B$ could thus be denoted as
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2.$$

For a layer of the network with $d$-dimensional input, $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension of its input is then normalized separately:
$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{{\sigma_B^{(k)}}^2 + \epsilon}}.$$

To restore the representation power of the network, a transformation step then follows:
$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},$$

where the parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are subsequently learned in the optimization process.

Formally, the operation that implements batch normalization is a transform $BN_{\gamma^{(k)},\beta^{(k)}}: x_{1 \ldots m}^{(k)} \rightarrow y_{1 \ldots m}^{(k)}$ called the Batch Normalizing transform. The output of the BN transform, $y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}(x^{(k)})$, is then passed to other network layers, while the normalized output $\hat{x}_i^{(k)}$ remains internal to the current layer.
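The short NumPy sketch below implements the training-time Batch Normalizing transform following the formulas above; it leaves out the running mean/variance that a real BN layer tracks for inference.

```python
# Minimal NumPy sketch of the training-time Batch Normalizing transform above.
# Omits the running statistics a real BN layer keeps for inference.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable parameters."""
    mu = x.mean(axis=0)                    # mu_B, per dimension k
    var = x.var(axis=0)                    # sigma_B^2, per dimension k
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # scale and shift: y = gamma * x_hat + beta

x = np.random.randn(32, 8)                 # mini-batch of size m=32, d=8
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```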

Conditional batch normalization [4]

CBN instead learns to output new BN parameters $\hat{\gamma}_{i,c}$ and $\hat{\beta}_{i,c}$ as a function of some input $\boldsymbol{x}_i$:
$$\hat{\gamma}_{i,c} = f_c(\boldsymbol{x}_i), \quad \hat{\beta}_{i,c} = h_c(\boldsymbol{x}_i),$$

where $f$ and $h$ are arbitrary functions such as neural networks. Thus, $f$ and $h$ can learn to control the distribution of CNN activations based on $\boldsymbol{x}_i$.

Combined with ReLU non-linearities, CBN empowers a conditioning model to manipulate feature maps of a target CNN by scaling them up or down, negating them, shutting them off, selectively thresholding them, and more. Each feature map is modulated independently, giving the conditioning model an exponential (in the number of feature maps) number of ways to affect the feature representation.

Rather than output $\hat{\gamma}_{i,c}$ directly, [4] output $\Delta\hat{\gamma}_{i,c}$, where:
$$\hat{\gamma}_{i,c} = 1 + \Delta\hat{\gamma}_{i,c},$$

since initially zero-centred $\hat{\gamma}_{i,c}$ can zero out CNN feature map activations and thus gradients.
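A minimal PyTorch sketch of this idea follows. The module below is a hypothetical illustration, not the implementation from [4]: a linear layer predicts $\Delta\hat{\gamma}$ and $\hat{\beta}$ from a conditioning vector, and $\hat{\gamma} = 1 + \Delta\hat{\gamma}$ modulates the BN-normalized feature maps.

```python
# Hypothetical sketch of conditional batch normalization in the spirit of [4]:
# gamma and beta are predicted from a conditioning vector, with gamma = 1 + delta_gamma.
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_features, cond_dim):
        super().__init__()
        # affine=False: gamma and beta come from the conditioning input instead
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.to_delta_gamma = nn.Linear(cond_dim, num_features)
        self.to_beta = nn.Linear(cond_dim, num_features)

    def forward(self, x, cond):
        x_hat = self.bn(x)                       # (N, C, H, W), normalized
        gamma = 1.0 + self.to_delta_gamma(cond)  # gamma = 1 + delta_gamma
        beta = self.to_beta(cond)
        return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]

cbn = ConditionalBatchNorm2d(num_features=16, cond_dim=32)
out = cbn(torch.randn(4, 16, 8, 8), torch.randn(4, 32))  # per-sample modulation
```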

Conditional normalization [7,8]
  • Conditional normalization aims to modulate the activation via a learned affine transformation conditioned on external data (e.g., an image of an artwork for capturing a specific style).
  • Conditional normalization methods include Conditional Batch Normalization for general visual question answering on complex scenes such as VQA and GuessWhat, Conditional Instance Normalization (Dumoulin et al., 2017) and Adaptive Instance Normalization (Huang & Belongie, 2017) for image stylization, Dynamic Layer Norm for speech recognition, and SPADE (Park et al., 2019).
  • Conditional normalization methods are widely used in the style transfer and image synthesis tasks, and also applied to align different data distributions for domain adaptation.
  • FiLM [8] can be viewed as a generalization of CN methods.
    CN - “replace the parameters of the feature-wise affine transformation typical in normalization layers”
    FiLM - “not strictly necessary for the affine transformation to occur directly after normalization”
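To contrast with the CBN sketch above, the hypothetical FiLM-style layer below applies the feature-wise affine transformation directly to convolutional feature maps, with no preceding normalization step, which is the generalization noted in the FiLM bullet above.

```python
# Hypothetical sketch of FiLM-style feature-wise modulation [8]: gamma and beta are
# predicted from a conditioning vector and applied per feature map, with no
# normalization layer required beforehand. Shapes and names are illustrative.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_features)

    def forward(self, feats, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma[:, :, None, None] * feats + beta[:, :, None, None]

film = FiLM(num_features=16, cond_dim=32)
out = film(torch.randn(4, 16, 8, 8), torch.randn(4, 32))
```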
Comments on the effectiveness of convolutional networks
  • Local connectivity can greatly reduce the number of parameters in the model, which inherently provides some form of built-in regularization (see the rough parameter-count comparison after this list).
  • The convolution operation has a direct filtering interpretation, where each filter is convolved with the input features to identify patterns over groups of pixels. Thus, the outputs of each convolutional layer correspond to important spatial features in the original input space and offer some robustness to simple transformations.
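To make the parameter-saving point concrete, the rough comparison below (with made-up layer sizes) counts the weights of a small convolutional layer against a fully connected layer mapping between inputs and outputs of the same size.

```python
# Rough parameter-count comparison for the local-connectivity claim above.
# The layer sizes are made up for illustration.
import torch.nn as nn

in_ch, out_ch, h, w = 3, 16, 32, 32

conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # output: (16, 32, 32)
fc = nn.Linear(in_ch * h * w, out_ch * h * w)              # same input/output sizes

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))  # 16*3*3*3 + 16                =        448 parameters
print(count(fc))    # (3*32*32)*(16*32*32) + 16*32*32 ≈ 50 million parameters
```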
Translational invariance vs Translational equivariance [5,6]
  • Translation invariance means that the system produces exactly the same response, regardless of how its input is shifted. Equivariance means that the system works equally well across positions, but its response shifts with the position of the target. For example, a heat map of “face-iness” would have similar bumps at different positions.
  • Convolution provides translational equivariance (rather than translational invariance): if an object in an image is at area A and convolution detects a feature at area B of the output, then the same feature is still detected when the object is translated to a new area A', with the detected feature shifted to a correspondingly translated area B' (see the toy check after this list).
  • Translational Invariance is a result of the pooling operation (not the convolutional operation).
  • Additional info (see [5]): 1. The convolution operator commutes with respect to translation. 2. One approach to translation-invariant object recognition is via template-matching.
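The toy check below (with arbitrary sizes and shift) illustrates both properties: the convolution response shifts along with a shifted input (equivariance), while taking the global maximum over the feature map gives the same value regardless of position (invariance).

```python
# Toy demonstration of equivariance (convolution) vs invariance (global pooling).
# Sizes and the shift amount are arbitrary.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 16, 16)
x[0, 0, 4:7, 4:7] = 1.0                                   # an "object" at area A
x_shifted = torch.roll(x, shifts=(5, 5), dims=(2, 3))     # the same object at A'

y, y_shifted = conv(x), conv(x_shifted)

# Equivariance: the feature response shifts by the same amount (B -> B').
print(torch.allclose(torch.roll(y, (5, 5), dims=(2, 3)), y_shifted, atol=1e-6))  # True

# Invariance: global max pooling gives the same response regardless of position.
print(torch.isclose(y.max(), y_shifted.max()))  # True
```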
Optimisation over non-differentiable kNN operations

In DNNs, there is a common trick for computing the gradient of operations that are non-differentiable at some points but differentiable elsewhere, such as max-pooling (top-1) and top-k. In the forward pass, the index positions of the max (or top-k) values are stored; in the backpropagation pass, the gradient is computed only with respect to these saved positions. This trick is implemented in modern deep learning frameworks such as TensorFlow (tf.nn.top_k()) and PyTorch.
Source: https://openreview.net/forum?id=SyVuRiC5K7 (Response to AnonReviewer 2)
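A small check of this gradient-routing behaviour, using PyTorch's torch.topk under autograd: gradients flow only to the positions selected in the forward pass.

```python
# Gradients propagate only to the positions selected by top-k in the forward pass.
import torch

x = torch.tensor([0.5, 2.0, -1.0, 3.0, 1.0], requires_grad=True)
values, indices = torch.topk(x, k=2)  # forward pass stores the selected indices
values.sum().backward()               # backward pass routes gradients to them

print(indices)  # tensor([3, 1])
print(x.grad)   # tensor([0., 1., 0., 1., 0.])  -> nonzero only at saved positions
```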

References

  1. https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
  2. Siamese Neural Networks for One-shot Image Recognition
  3. https://en.wikipedia.org/wiki/Batch_normalization
  4. Perez, Ethan, et al. “Learning visual reasoning without strong priors.” arXiv preprint arXiv:1707.03017 (2017).
  5. https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo
  6. Translational Invariance Vs Translational Equivariance. https://towardsdatascience.com/translational-invariance-vs-translational-equivariance-f9fbc8fca63a
  7. Tseng, Hung-Yu, et al. “Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation.” ICLR. 2019.
  8. Perez, Ethan, et al. “FiLM: Visual reasoning with a general conditioning layer.” AAAI. 2018.