TensorFlow: Batch Normalization and L1/L2 Regularization

An In-Depth Look at Batch Normalization and Optimization Strategies in TensorFlow


License: CC 4.0 BY-SA


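This article examines the different batch normalization implementations in TensorFlow, including tf.nn.batch_normalization, tf.layers.batch_normalization and tf.keras.layers.BatchNormalization. It focuses on how to use them, their parameters, the difference between training and inference, and how to avoid the common pitfall of forgetting the update ops. It also covers how to apply L2 regularization and how to choose the decay coefficient.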


Batch Normalization

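tf.nn.batch_normalization() is a low-level op; the caller has to compute and manage the tensor's mean and variance.

tf.nn.fused_batch_norm is another low-level op, very similar to the former. The difference is that it is optimized for 4-D input tensors, the common case in convolutional neural networks, whereas tf.nn.batch_normalization accepts tensors of any rank greater than 1.

tf.nn.batch_norm_with_global_normalization is a deprecated op. It currently delegates to tf.nn.batch_normalization and will be removed in a future release.

tf.contrib.layers.batch_norm is an early implementation of batch norm; its successor in the core API is tf.layers.batch_normalization. It is not recommended, because it may disappear in a future release.

tf.layers.batch_normalization is a high-level wrapper around the ops above. The biggest difference is that it creates and manages the running mean and variance tensors itself, and uses the fast fused op whenever possible. This function should usually be your default choice.

tf.keras.layers.BatchNormalization is the Keras implementation of the BN algorithm; with the TensorFlow backend it calls tf.nn.batch_normalization under the hood.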

Note:

1 tf.nn.batch_normalization(), tf.layers.batch_normalization and tensorflow.contrib.layers.batch_norm() are progressively higher-level wrappers. The latter two automatically add their update_ops to the tf.GraphKeys.UPDATE_OPS collection; tf.nn.batch_normalization() keeps no moving statistics, so it has no update ops to add. [A guide to TensorFlow pitfalls]

2 tf.keras.layers.BatchNormalization does not automatically add its update_ops to the tf.GraphKeys.UPDATE_OPS collection. So when you use tf.keras.layers.BatchNormalization in a TensorFlow training session, you have to add the layer's updates to tf.GraphKeys.UPDATE_OPS by hand. [Some pitfalls of the Batch Normalization API in TensorFlow]

[Different implementations of Batch Normalization in TensorFlow]

tf.layers.batch_normalization

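The formula is:

y = γ(x − μ)/σ + β

where x is the input, y is the output, μ is the mean, σ is the standard deviation (in practice the square root of the variance plus epsilon), and γ and β are the scale and offset coefficients.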
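As a minimal sketch of the formula, assuming a single feature observed across a batch of six samples, the normalization can be written in plain Python. The function name batch_norm and the sample values are illustrative only; the epsilon default mirrors the 0.001 used by tf.layers.batch_normalization.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a batch: y = gamma * (x - mu) / sigma + beta."""
    mu = sum(values) / len(values)
    variance = sum((v - mu) ** 2 for v in values) / len(values)  # population variance
    sigma = math.sqrt(variance + epsilon)  # epsilon guards against division by zero
    return [gamma * (v - mu) / sigma + beta for v in values]

batch = [0.3, 0.44, -0.2, 0.4, 0.4, 0.2]  # one feature, six samples (illustrative)
normalized = batch_norm(batch)
print(normalized)
```

With gamma=1 and beta=0 the result has a mean of zero and a variance slightly below one (because of epsilon); learned gamma and beta then rescale and shift it.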

tf.keras.layers.BatchNormalization(...):

With Keras you neither need to nor may wire up the update ops yourself: "In particular, tf.control_dependencies(tf.GraphKeys.UPDATE_OPS) should not be used."

tf.layers.batch_normalization(  # to be superseded by tf.keras.layers.BatchNormalization(...)
    inputs, axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True,
    beta_initializer=tf.zeros_initializer(), gamma_initializer=tf.ones_initializer(),
    moving_mean_initializer=tf.zeros_initializer(),
    moving_variance_initializer=tf.ones_initializer(),
    beta_regularizer=None, gamma_regularizer=None,
    beta_constraint=None, gamma_constraint=None, training=False, trainable=True,
    name=None, reuse=None, renorm=False, renorm_clipping=None,
    renorm_momentum=0.99, fused=None, virtual_batch_size=None, adjustment=None)

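Parameters:

inputs: The input tensor.
axis: An int, the axis that should be normalized (typically the feature axis). For example, after a Convolution2D layer with data_format="channels_first", set axis=1 in BatchNormalization. To understand it, see the examples and the description of the inputs parameter of tf.contrib.layers.layer_norm.
momentum: Momentum for the moving average.
epsilon: A small float added to the variance to avoid division by zero.
center: If True, add the beta offset to the normalized tensor. If False, beta is ignored.
scale: If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (this also holds for e.g. nn.relu), this can be disabled, since the scaling can be done by the next layer.
beta_initializer: Initializer for the beta weight.
gamma_initializer: Initializer for the gamma weight.
moving_mean_initializer: Initializer for the moving mean.
moving_variance_initializer: Initializer for the moving variance.
beta_regularizer: Optional regularizer for the beta weight.
gamma_regularizer: Optional regularizer for the gamma weight.
beta_constraint: Optional projection function applied to the beta weight after it has been updated by an Optimizer (for example, to implement norm constraints or value constraints on layer weights). The function must take the unprojected variable as input and return the projected variable, which must have the same shape. Constraints are not safe to use with asynchronous distributed training.
gamma_constraint: Optional projection function applied to the gamma weight after it has been updated by an Optimizer.
training: Either a Python boolean or a TensorFlow boolean scalar tensor (e.g. a placeholder). Whether to return the output in training mode (normalized with the statistics of the current batch) or in inference mode (normalized with the moving statistics). Note: make sure to set this parameter correctly, or else your training/validation will not work properly.
trainable: Boolean. If True, the variables are also added to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).
name: String, the name of the layer.
reuse: Boolean, whether to reuse the weights of a previous layer with the same name.
renorm: Whether to use Batch Renormalization (https://arxiv.org/abs/1702.03275). This adds extra variables during training. Inference is the same for either value of this parameter.
renorm_clipping: A dictionary that may map the keys "rmax", "rmin" and "dmax" to scalar tensors used to clip the renorm correction. The correction (r, d) is used as corrected_value = normalized_value * r + d, with r clipped to [rmin, rmax] and d clipped to [-dmax, dmax]. Missing rmax, rmin and dmax are set to inf, 0 and inf respectively.
renorm_momentum: Momentum used to update the moving means and standard deviations with renorm. Unlike momentum, this affects training, and it should be neither too small (which would add noise) nor too large (which would give stale estimates). Note that momentum is still applied to get the means and variances for inference.
fused: If None or True, use a faster, fused implementation if possible. If False, use the system recommended implementation.
virtual_batch_size: An int. By default, virtual_batch_size is None, which means batch normalization is performed across the whole batch. When virtual_batch_size is not None, "Ghost Batch Normalization" is performed instead, creating virtual sub-batches that are each normalized separately (with shared gamma, beta and moving statistics). Must divide the actual batch size during execution.
adjustment: A function, used only during training, that takes the tensor containing the (dynamic) shape of the input tensor and returns a pair (scale, bias) to apply to the normalized values (before gamma and beta). For example, if axis=-1, adjustment = lambda shape: (tf.random_uniform(shape[-1:], 0.93, 1.07), tf.random_uniform(shape[-1:], -0.1, 0.1)) will scale the normalized value by up to 7% up or down, then shift the result by up to 0.1 (with independent scaling and bias for each feature, but shared across all examples), and finally apply gamma and/or beta. If None, no adjustment is applied. Cannot be specified if virtual_batch_size is specified.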
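To make the virtual_batch_size description more concrete, here is a rough sketch of Ghost Batch Normalization on a single feature in plain Python. The helper names are made up for illustration, and the way samples are assigned to sub-batches (contiguous slices here) is an assumption; TensorFlow's internal grouping may differ.

```python
import math

def normalize(values, epsilon=1e-3):
    """Plain batch normalization of one feature (gamma=1, beta=0)."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + epsilon) for v in values]

def ghost_batch_norm(values, virtual_batch_size, epsilon=1e-3):
    """Normalize each virtual sub-batch with its own statistics."""
    if len(values) % virtual_batch_size != 0:
        raise ValueError("virtual_batch_size must divide the batch size")
    out = []
    for start in range(0, len(values), virtual_batch_size):
        # Assumption: sub-batches are contiguous slices of the batch.
        out.extend(normalize(values[start:start + virtual_batch_size], epsilon))
    return out

batch = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
whole = normalize(batch)                              # statistics over all 8 samples
ghost = ghost_batch_norm(batch, virtual_batch_size=4)  # statistics per group of 4
print(whole)
print(ghost)
```

Each group of four ends up centered on zero by itself, whereas normalizing the whole batch leaves the first four samples below zero and the last four mostly above.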

The number one BN pitfall: the update_ops never run

Batch Normalization has to maintain moving averages, so a BN layer comes with some update_ops. During training they must be invoked through tf.control_dependencies():

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(lrn_rate).minimize(cost)

Or, equivalently:

x_norm = tf.compat.v1.layers.batch_normalization(x, training=training)
# ...
update_ops = tf.compat.v1.get_collection(tf.compat.v1.GraphKeys.UPDATE_OPS)
train_op = optimizer.minimize(loss)
train_op = tf.group([train_op, update_ops])

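Before update_ops can be used, the BN layer's update ops have to be in the tf.GraphKeys.UPDATE_OPS collection.

tf.layers.BatchNormalization and tf.layers.batch_normalization add their update_ops to the tf.GraphKeys.UPDATE_OPS collection automatically (only when the training argument is True; nothing is added when it is False). That is what allows the moving averages of μ and σ to be computed: during training, moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they must be added as a dependency of train_op. Also make sure every batch_normalization op has been created before you fetch the update_ops collection; otherwise update_ops will be empty and training/validation will not work properly.

To check whether the updates were registered: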

print(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
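Why a forgotten update op hurts can be seen in a small plain-Python simulation. This is only a sketch of the mechanism, not TensorFlow code: the dictionary stands in for the moving_mean and moving_variance variables, the run_update_ops flag stands in for whether the update ops are executed, and the initial values 0 and 1 follow the default zeros and ones initializers.

```python
import math

MOMENTUM, EPSILON = 0.99, 1e-3

def train_step(batch, moving, run_update_ops):
    """Normalize with batch statistics; refresh moving statistics only if asked."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    if run_update_ops:
        moving["mean"] = MOMENTUM * moving["mean"] + (1 - MOMENTUM) * mean
        moving["var"] = MOMENTUM * moving["var"] + (1 - MOMENTUM) * var
    return [(v - mean) / math.sqrt(var + EPSILON) for v in batch]

def infer(x, moving):
    """Inference uses the moving statistics, never the batch statistics."""
    return (x - moving["mean"]) / math.sqrt(moving["var"] + EPSILON)

batch = [9.0, 10.0, 11.0]
forgotten = {"mean": 0.0, "var": 1.0}  # update ops never run
updated = {"mean": 0.0, "var": 1.0}    # update ops run on every step
for _ in range(1000):
    train_step(batch, forgotten, run_update_ops=False)
    train_step(batch, updated, run_update_ops=True)

print(infer(10.0, forgotten))  # stays near the raw input, the statistics never moved
print(infer(10.0, updated))    # near 0, the moving mean has approached 10
```

Training looks fine in both cases because it normalizes with batch statistics; the difference only shows up at inference, which is what makes this pitfall easy to miss.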

Note: for BN to take effect, the optimization step always needs this wrapper (it is omitted from all the examples below):

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.train_op = layers.optimize_loss(self.loss, tf.train.get_global_step(), optimizer=self.optimizer, learning_rate=self.learning_rate, clip_gradients=self.clip_norm)

[Some pitfalls of the Batch Normalization API in TensorFlow]

Variables in BN

After defining a tf.layers.BatchNormalization layer, you get:

Trainable variables:

<tf.Variable 'bn0/gamma:0' shape=(64,) dtype=float32_ref>
<tf.Variable 'bn0/beta:0' shape=(64,) dtype=float32_ref>

Non-trainable global variables:
<tf.Variable 'bn0/moving_mean:0' shape=(64,) dtype=float32_ref>
<tf.Variable 'bn0/moving_variance:0' shape=(64,) dtype=float32_ref>

[Deep learning: Batch Normalization and Layer Normalization]

BN works differently in training and in inference

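During training:

For every neuron we compute, across the samples of the batch, the mean and standard deviation of that layer's output, and then normalize the output with them. The two parameters gamma and beta are learned at the same time. This is costly, and at inference time a batch may hold only a single sample, so the same approach clearly cannot be used there.

The solution is to maintain population estimates of the mean and variance with a moving average during training:

running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var
When the model is saved at the end of training, running_mean, running_var, trained_gamma and trained_beta are all saved with it.

During inference:

output = (input - running_mean) / sqrt(running_var + epsilon)
output = trained_gamma * output + trained_beta
In other words, at inference BN no longer computes the batch statistics described in the formula. It applies a single linear transformation using the population mean and variance saved during training, which is much more efficient. The drawback is just as clear: if the training set and the validation set are distributed differently, validation results will stay poor.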
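How quickly the moving average catches up depends on momentum. The following sketch assumes, purely for illustration, that every batch has the same sample mean of 5.0 and that the running mean starts at 0 (the default zeros initializer); the helper name update_running is hypothetical.

```python
def update_running(running, sample, momentum=0.99):
    """One moving-average step, as in the update formula above."""
    return momentum * running + (1 - momentum) * sample

running_mean = 0.0   # initial value of the moving mean
sample_mean = 5.0    # assumed constant batch mean, for illustration only
history = []
for step in range(500):
    running_mean = update_running(running_mean, sample_mean)
    history.append(running_mean)

# For a constant sample the closed form is sample * (1 - momentum ** n).
print(history[99] / sample_mean)   # roughly 63% of the way after 100 steps
print(history[499] / sample_mean)  # roughly 99% of the way after 500 steps
```

With the default momentum of 0.99, a few hundred training steps are needed before the moving statistics are trustworthy, so evaluating in inference mode very early in training can give misleadingly poor results.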

[Batch normalization in TensorFlow: the tf.layers.batch_normalization() function]

[Thoughts prompted by the training parameter]

TensorFlow Batch Normalization usage examples

Example 1: high-level API, define first and then use

self.bn_layers = [
    tf.layers.BatchNormalization(name='bn' + str(i), trainable=self.is_training, _reuse=tf.AUTO_REUSE)
    for i in range(layers_cnt)]
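# _reuse=tf.AUTO_REUSE shares parameters when the layer is defined again (not when it is applied again); applying the same layer object again should not need it, since the parameters are the same anyway
fc = self.bn_layers[i](fc, training=training)

Note: tf.keras.layers.BatchNormalization should work in much the same way.

Example 2: legacy API, not really recommended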

import tensorflow.contrib.layers as layers
x = layers.batch_norm(
  x, center=True, scale=True, reuse=tf.AUTO_REUSE, scope=scope, is_training=is_training
)

Example 3:

import tensorflow as tf  # TensorFlow 1.x graph mode

output = tf.Variable([[[0.3, 0.0, 0.5, 0.2],
                       [0.44, 0.32, 0.23, 0.01],
                       [-0.2, 0.6, 0.5, 0.1]],
                      [[0.4, 0.0, 0.0, 0.2],
                       [0.4, 0.2, 0.3, 0.01],
                       [0.2, -0.6, -0.5, 0.15]]])
print(output.shape)

axis = [0, 1]
output_n = tf.layers.batch_normalization(output, axis=axis, training=True, trainable=False)
print(output_n.shape)
output_n_mean = tf.reduce_mean(output_n, axis=[2])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print('\n')
    print(output_n.eval())
    print('\n')
    print(output_n_mean.eval())

(2, 3, 4)

(2, 3, 4)
[[[ 0.2731793 -1.365896 1.365896 -0.27317917]
[ 1.1840361 0.43622375 -0.12463534 -1.4956245 ]
[-1.3987572 1.0879223 0.77708733 -0.4662524 ]]

[[ 1.4808724 -0.8885234 -0.8885234 0.29617447]
[ 1.1691556 -0.18638718 0.49138427 -1.4741528 ]
[ 1.0586927 -1.1269956 -0.85378444 0.92208725]]]

[[ 2.9802322e-08 0.0000000e+00 1.4901161e-08]
[ 1.4901161e-08 -2.9802322e-08 -1.4901161e-08]] # zero within floating-point error

If axis in batch_normalization is changed to axis = [2], with output_n_mean = tf.reduce_mean(output_n, axis=[0,1]), the output becomes:

[[[ 0.19564903 -0.23395906 0.9464483 1.0328814 ]
[ 0.8277463 0.6298898 0.16815075 -1.1887878 ]
[-2.061841 1.3857577 0.9464483 -0.13641822]]

[[ 0.64714706 -0.23395906 -0.49484357 1.0328814 ]
[ 0.64714706 0.30594647 0.3699316 -1.1887878 ]
[-0.255849 -1.8536758 -1.9361355 0.4482317 ]]]

[-9.934107e-08 0.000000e+00 0.000000e+00 9.934107e-08] # zero within floating-point error

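In other words, the axis argument of batch_normalization names the dimensions that are kept; the normalization statistics are computed over the remaining dimensions.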
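The axis rule can be checked without TensorFlow. The sketch below is a plain-Python reimplementation written for this purpose (the function batch_norm_3d is hypothetical, gamma=1 and beta=0 as at initialization, epsilon=0.001 as in the default); on the tensor from Example 3 it gives about 0.2732 for the first entry with axis=[0, 1] and about 0.1956 with axis=[2], in line with the outputs printed above.

```python
import itertools
import math

def batch_norm_3d(x, axis, epsilon=1e-3):
    """Normalize a 3-D nested list; `axis` lists the dimensions that keep separate statistics."""
    shape = (len(x), len(x[0]), len(x[0][0]))
    indices = list(itertools.product(*(range(n) for n in shape)))

    # Group values by their index along the kept axes; each group shares one mean/variance.
    groups = {}
    for i, j, k in indices:
        key = tuple((i, j, k)[a] for a in axis)
        groups.setdefault(key, []).append(x[i][j][k])

    stats = {}
    for key, values in groups.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[key] = (mean, math.sqrt(var + epsilon))

    out = [[[0.0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    for i, j, k in indices:
        mean, std = stats[tuple((i, j, k)[a] for a in axis)]
        out[i][j][k] = (x[i][j][k] - mean) / std
    return out

data = [[[0.3, 0.0, 0.5, 0.2], [0.44, 0.32, 0.23, 0.01], [-0.2, 0.6, 0.5, 0.1]],
        [[0.4, 0.0, 0.0, 0.2], [0.4, 0.2, 0.3, 0.01], [0.2, -0.6, -0.5, 0.15]]]

per_row = batch_norm_3d(data, axis=[0, 1])   # statistics over the last dimension
per_channel = batch_norm_3d(data, axis=[2])  # statistics over the first two dimensions
print(per_row[0][0][0], per_channel[0][0][0])
```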

Other examples: adding batch_normalization after a CNN [Building a neural network with BatchNormalization using the high-level tf.layers functions]

-柚子皮-

Layer normalization functions in TensorFlow

tf.contrib.layers.layer_norm(
inputs,
center=True,
scale=True,
activation_fn=None,
reuse=None,
variables_collections=None,
outputs_collections=None,
trainable=True,
begin_norm_axis=1,
begin_params_axis=-1,
scope=None
)

By default, begin_norm_axis = 1 and begin_params_axis = -1, meaning that normalization is performed over all but the first axis (the HWC if inputs is NHWC), while the beta and gamma trainable parameters are calculated for the rightmost axis (the C if inputs is NHWC). Scaling and recentering is performed via broadcast of the beta and gamma parameters with the normalized tensor.

Parameters:

inputs: A tensor having rank R. The normalization is performed over axes begin_norm_axis ... R - 1 and centering and scaling parameters are calculated over begin_params_axis ... R - 1.
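For a 2-D input of shape (batch, features) with the default begin_norm_axis=1, each sample is normalized over its own features, independently of the rest of the batch. A minimal plain-Python sketch follows; the function name is illustrative and the tiny epsilon is an assumption, not a value taken from the text above.

```python
import math

def layer_norm(sample, gamma=None, beta=None, epsilon=1e-12):
    """Normalize one sample over all of its features, then scale and shift per feature."""
    n = len(sample)
    gamma = gamma if gamma is not None else [1.0] * n  # per-feature scale
    beta = beta if beta is not None else [0.0] * n     # per-feature offset
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / n
    std = math.sqrt(var + epsilon)
    return [g * (v - mean) / std + b for v, g, b in zip(sample, gamma, beta)]

batch = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0]]
normalized = [layer_norm(row) for row in batch]
print(normalized)
```

Both rows come out essentially identical although the second is ten times larger, because the statistics are taken per sample rather than per batch. This is also why layer normalization needs no moving averages and behaves the same in training and inference.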

Using layer_norm in an LSTM

[tf.contrib.rnn.LayerNormBasicLSTMCell]

I have not yet worked out how to apply layer_norm or batch_normalization directly after an LSTM.

-柚子皮-

L2 regularization

TensorFlow implementation

Example 1:

from tensorflow.python.keras.regularizers import l2
self.kernels = [self.add_weight(name='kernel' + str(i),
                shape=(hidden_units[i], hidden_units[i + 1]),
                initializer=glorot_normal(seed=self.seed),
                regularizer=l2(self.l2_reg), 
                trainable=self.is_training,
                getter=tf.get_variable)
                for i in range(layers_cnt)]
# self.bias does not need regularization
self.reg_loss = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
self.reg_loss = tf.reduce_sum(self.reg_loss)
self.loss = self.loss + self.reg_loss

Example 2:

from tensorflow.contrib.layers.python.layers import regularizers
x = layers.fully_connected(
  x,
  activation_fn=activation,
  num_outputs=out_size,
  weights_initializer=initializer_map.get(initializer)(),
  weights_regularizer=regularizers.l1_l2_regularizer(scale_l1=l1_reg, scale_l2=l2_reg),
  biases_initializer=init_ops.zeros_initializer(),
  scope=scope
)
The loss part is the same as in Example 1.
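What a combined L1/L2 penalty adds to the loss can be sketched in plain Python. The function below is an illustration, not the contrib implementation; it assumes the L1 part is the scaled sum of absolute values and the L2 part is the scaled sum of squares divided by 2, the same convention as the L2 example in this article.

```python
def l1_l2_penalty(weights, scale_l1, scale_l2):
    """Sketch of an L1 + L2 penalty on a flat list of weights."""
    l1 = scale_l1 * sum(abs(w) for w in weights)       # encourages sparsity
    l2 = scale_l2 * sum(w * w for w in weights) / 2    # encourages small weights
    return l1 + l2

weights = [0.5, -1.0, 2.0]  # illustrative values
penalty = l1_l2_penalty(weights, scale_l1=0.01, scale_l2=0.1)
print(penalty)  # 0.01 * 3.5 + 0.1 * 5.25 / 2
```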

Example 3:

weight_decay=0.1
tmp=tf.constant([0,1,2,3],dtype=tf.float32)
l2_reg=tf.contrib.layers.l2_regularizer(weight_decay)
a=tf.get_variable("I_am_a",regularizer=l2_reg,initializer=tmp) 
# code equivalent to the lines above
'''
a=tf.get_variable("I_am_a",initializer=tmp)
a2=tf.reduce_sum(a*a)*weight_decay/2
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES,a2)
'''
The loss part is the same as in Example 1.

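Explanation of the equivalent code:

First define the variable a: a=tf.get_variable("I_am_a",initializer=tmp)

Then compute the regularization term for a: a2=tf.reduce_sum(a*a)*weight_decay/2. Finally add that term to the tf.GraphKeys.REGULARIZATION_LOSSES collection, which is where the terms of all regularized variables end up: tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES,a2).

Once the L2 terms exist, use tf.get_collection() to gather them into a list: keys = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES). Summing the values in that list and adding the sum to loss completes the regularization: l2_loss=loss + tf.add_n(keys).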
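The same bookkeeping can be traced by hand with the numbers from Example 3. In this sketch a plain list stands in for the REGULARIZATION_LOSSES collection, and main_loss is a hypothetical value for the task loss.

```python
def l2_term(values, weight_decay):
    """sum(a * a) * weight_decay / 2, as in the equivalent code."""
    return sum(v * v for v in values) * weight_decay / 2

regularization_losses = []   # stand-in for tf.GraphKeys.REGULARIZATION_LOSSES
a = [0.0, 1.0, 2.0, 3.0]     # the initial value of variable a in Example 3
regularization_losses.append(l2_term(a, weight_decay=0.1))

main_loss = 1.25             # hypothetical task loss, for illustration only
total_loss = main_loss + sum(regularization_losses)
print(regularization_losses, total_loss)
```

The sum of squares is 14, so the L2 term is 14 * 0.1 / 2 = 0.7, and that amount is added on top of the task loss.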

Explanation: [Implementing regularization in TensorFlow]

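Note: the quantity that L2 regularization computes is the sum of squares divided by 2. The factor 1/2 is there for convenience, because differentiating w² produces a factor of 2. Optimization proceeds the same way with or without the factor: minimizing a and minimizing 10a are the same training objective. If the balance between the regularization term and the main loss needs adjusting, there is still the decay coefficient to tune.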
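The effect of the 1/2 factor on the gradient can be verified numerically. This is a small sketch with a single scalar weight, a finite-difference gradient and an assumed learning rate of 0.5; all names and values are illustrative.

```python
def l2_term(w, weight_decay):
    """L2 term for one scalar weight: weight_decay * w^2 / 2."""
    return weight_decay * w * w / 2

def numeric_grad(f, w, h=1e-6):
    """Central finite difference, used here instead of automatic differentiation."""
    return (f(w + h) - f(w - h)) / (2 * h)

w, weight_decay, lr = 3.0, 0.1, 0.5
grad = numeric_grad(lambda v: l2_term(v, weight_decay), w)  # about weight_decay * w
w_after = w - lr * grad                                     # about (1 - lr * weight_decay) * w
print(grad, w_after)
```

Because of the 1/2, the gradient of the term is simply weight_decay * w, so one gradient step on the regularizer alone shrinks the weight by the factor (1 - lr * weight_decay). Dropping the 1/2 would double that shrinkage, which is equivalent to doubling the decay coefficient.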

[Using L2 regularization in TensorFlow to correct overfitting]