tf.gradients()和tf.stop_gradient()

本文介绍TensorFlow中的tf.gradients()和tf.stop_gradient()函数。tf.gradients()用于计算张量的梯度,支持多种参数设置;tf.stop_gradient()用于停止梯度计算,适用于多种场景。文章还提供了多个代码示例,帮助理解如何使用这两个函数。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

参考 tf.stop_gradient - 云+社区 - 腾讯云

1、tf.gradients()

tf.gradients(
    ys,
    xs,
    grad_ys=None,
    name='gradients',
    colocate_gradients_with_ops=False,
    gate_gradients=False,
    aggregation_method=None,
    stop_gradients=None,
    unconnected_gradients=tf.UnconnectedGradients.NONE
)

参数:

  • ys:  要微分的张量或张量列表。
  • xs:  用于微分的张量或列表。
  • grad_ys:可选。一个张量或张量列表,其大小与y相同,并保持y中每个y的梯度。用于对求和后加权。
  • name:  可选名称,用于将所有渐变操作分组在一起。默认为“梯度”。
  • colocate_gradients_with_ops:  如果为真,请尝试使用相应的op来合并渐变。
  • gate_gradients:  如果为真,在操作返回的渐变周围添加一个元组。这避免了一些竞争条件。
  • aggregation_method:  指定用于合并渐变项的方法。可接受的值是类AggregationMethod中定义的常量。
  • stop_gradients:  可选。一个张量或张量的列表,不需要微分。
  • unconnected_gradients:  可选。指定给定输入张量断开连接时返回的梯度值。可接受的值是在类tf中定义的常量。UnconnectedGradients,默认值为none。

返回值:

  • xs中每个x的sum(dy/dx)的列表。

可能产生的异常:

  • LookupError: if one of the operations between x and y does not have a registered gradient function.
  • ValueError: if the arguments are invalid.
  • RuntimeError: if called in Eager mode.

 ys和xs都是张量或张量列表。grad_ys是一个张量列表,包含了ys接收到的梯度。列表的长度必须与ys相同。gradient()将ops添加到图中,输出ys对xs的导数。它返回一个长度为len(xs)的张量列表,其中每个张量都是y在ys中的和(dy/dx)。grad_ys是一个与ys长度相同的张量列表,它包含ys中每个y的初始梯度。当grad_ys为0时,为y中的每个y填充一个'1'的张量,其形状为y。用户可以提供自己的初始grad_y来计算导数,使用每个y的不同初始梯度(例如,如果想对每个y中的每个值使用不同的梯度权重)。stop_gradients是一个张量或张量列表,它被认为是关于所有xs的常数。这些张量不会反向传播,就像它们已经使用stop_gradient显式断开连接一样。除此之外,这允许计算偏导数而不是全导数。

例:

a = tf.constant(0.)
b = 2 * a
g = tf.gradients(a + b, [a, b], stop_gradients=[a, b])

这里的偏导数g的值为[1.0,1.0],与全导数tf相比。梯度(a + b, [a, b]),考虑a对b的影响,取值为[3.0,1.0]。

请注意,上述内容相当于:

a = tf.stop_gradient(tf.constant(0.))
b = tf.stop_gradient(2 * a)
g = tf.gradients(a + b, [a, b])

 stop_gradients提供了一种在图已经构建好之后停止梯度的方法,而tf.stop_gradient在构建图期间停止梯度。当这两种方法结合时,反向传播在两个tf处都停止。tf.stop_gradient节点和stop_gradients中的节点,以最先遇到的节点为准。所有整数张量对于所有xs都被认为是常量,就像它们包含在stop_gradients中一样。unconnected_gradients确定如果图中每个x未连接到ys,则为xs中的每个x返回的值。默认情况下,这不是用来防止错误的。从数学上讲,这些梯度是零,可以使用“零”选项来请求。tf.UnconnectedGradients提供了以下选项和行为:

unconnected_gradients确定如果图中每个x未连接到ys,则为xs中的每个x返回的值。默认情况下,这不是用来防止错误的。从数学上讲,这些梯度是零,可以使用“零”选项来请求。UnconnectedGradients提供了以下选项和行为:

a = tf.ones([1, 2])
b = tf.ones([3, 1])
g1 = tf.gradients([b], [a], unnconnected_gradients='none')
sess.run(g1)  # [None]

g2 = tf.gradients([b], [a], unconnected_gradients='zero')
sess.run(g2)  # [array([[0., 0.]], dtype=float32)]

2、tf.stop_gradient()

停止梯度计算。当在一个图中执行时,这个op按原样输出它的输入张量。当构建ops来计算梯度时,该op会阻止将其输入的贡献考虑在内。通常情况下,梯度发生器通过递归找出对其计算有贡献的输入,将ops添加到图中,计算指定“损失”的导数。如果将这个op插入到图中,它的输入将被梯度生成器屏蔽。它们没有考虑到计算梯度。当你想用TensorFlow计算一个值,但需要假设该值是常量时,这是非常有用的。一些例子包括:

  • EM算法中m步不应该包含通过e步输出的反向传播。
  • 玻尔兹曼机的对比发散训练,当对能量函数进行微分时,训练不能通过从模型中生成样本的图进行反向传播。
  • 对抗性训练,在对抗性示例生成过程中不应该有任何支持。

参数:

  • input:  一个张量。
  • name:  操作的名称(可选)。

返回值:

  • 一个张量。具有与输入相同的类型。

例:

# 1、计算梯度
import tensorflow as tf

w1 = tf.get_variable('w1', shape=[3])
w2 = tf.get_variable('w2', shape=[3])

w3 = tf.get_variable('w3', shape=[3])
w4 = tf.get_variable('w4', shape=[3])

z1 = w1 + w2+ w3
z2 = w3 + w4

grads = tf.gradients([z1, z2], [w1, w2, w3, w4], grad_ys=[tf.convert_to_tensor([2.,2.,3.]),
                                                          tf.convert_to_tensor([3.,2.,4.])])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(grads))

Outout:
-------------------------------------
[array([2., 2., 3.], dtype=float32), 
array([2., 2., 3.], dtype=float32), 
array([5., 4., 7.], dtype=float32), 
array([3., 2., 4.], dtype=float32)]
-------------------------------------



# 2、阻挡节点BP的梯度

import tensorflow as tf

w1 = tf.Variable(2.0)
w2 = tf.Variable(2.0)

a = tf.multiply(w1, 3.0)
a_stoped = tf.stop_gradient(a)

# b=w1*3.0*w2
b = tf.multiply(a_stoped, w2)
gradients = tf.gradients(b, xs=[w1, w2])
print(gradients)
# 可见,一个节点被 stop之后,这个节点上的梯度,就无法再向前BP了。由于w1变量的梯度只能来自a节点,所
# 以,计算梯度返回的是None。

Output:
--------------------------------------------------------------------------
[None, <tf.Tensor 'gradients/Mul_1_grad/Mul_1:0' shape=() dtype=float32>]
--------------------------------------------------------------------------

import tensorflow as tf

a = tf.Variable(1.0)
b = tf.Variable(1.0)

c = tf.add(a, b)

c_stoped = tf.stop_gradient(c)

d = tf.add(a, b)

e = tf.add(c_stoped, d)

# e = c + d = a + b + a + b

gradients = tf.gradients(e, xs=[a, b])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(gradients))

# 虽然 c节点被stop了,但是a,b还有从d传回的梯度,所以还是可以输出梯度值的。

Output:
-----------
[1.0, 1.0]
-----------

import tensorflow as tf

w1 = tf.Variable(2.0)
w2 = tf.Variable(2.0)
a = tf.multiply(w1, 3.0)
a_stoped = tf.stop_gradient(a)

# b=w1*3.0*w2
b = tf.multiply(a_stoped, w2)

opt = tf.train.GradientDescentOptimizer(0.1)

gradients = tf.gradients(b, xs=tf.trainable_variables())

tf.summary.histogram(gradients[0].name, gradients[0])
# 这里会报错,因为gradients[0]是None
#其它地方都会运行正常,无论是梯度的计算还是变量的更新。总觉着tensorflow这么设计有点不好,
#不如改成流过去的梯度为0
train_op = opt.apply_gradients(zip(gradients, tf.trainable_variables()))

print(gradients)
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(train_op))
    print(sess.run([w1, w2]))

# 3、求高阶导数
import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant(1.)
    b = tf.pow(a, 2)
    grad = tf.gradients(ys=b, xs=a) # 一阶导
    print(grad[0])
    grad_2 = tf.gradients(ys=grad[0], xs=a) # 二阶导
    grad_3 = tf.gradients(ys=grad_2[0], xs=a) # 三阶导
    print(grad_3)

with tf.Session() as sess:
    print(sess.run(grad_3))

Output:
--------------------------------------------------------------------------------------
Tensor("gradients/Pow_grad/Reshape:0", shape=(), dtype=float32, device=/device:CPU:0)
[<tf.Tensor 'gradients_2/gradients_1/gradients/Pow_grad/Pow_grad/Pow_grad/Reshape:0' shape=() dtype=float32>]
[0.0]
---------------------------------------------------------------------------------------

原链接: 

tf.gradients  |  TensorFlow Core v2.7.0

https://tensorflow.google.cn/versions/r1.11/api_docs/python/tf/stop_gradient?hl=en

# Copyright 2018 DeepMind Technologies Limited # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # https://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # ============================================================================ """Code defining LEO inner loop. See "Meta-Learning with Latent Embedding Optimization" by Rusu et al. (https://arxiv.org/pdf/1807.05960.pdf). """ from __future__ import absolute_import from __future__ import division from __future__ import print_function import numpy as np from six.moves import range from six.moves import zip import sonnet as snt import tensorflow as tf import tensorflow_probability as tfp import data as data_module def get_orthogonality_regularizer(orthogonality_penalty_weight): """Returns the orthogonality regularizer.""" def orthogonality(weight): """Calculates the layer-wise penalty encouraging orthogonality.""" with tf.name_scope(None, "orthogonality", [weight]) as name: w2 = tf.matmul(weight, weight, transpose_b=True) wn = tf.norm(weight, ord=2, axis=1, keepdims=True) + 1e-32 correlation_matrix = w2 / tf.matmul(wn, wn, transpose_b=True) matrix_size = correlation_matrix.get_shape().as_list()[0] base_dtype = weight.dtype.base_dtype identity = tf.eye(matrix_size, dtype=base_dtype) weight_corr = tf.reduce_mean( tf.squared_difference(correlation_matrix, identity)) return tf.multiply( tf.cast(orthogonality_penalty_weight, base_dtype), weight_corr, name=name) return orthogonality class LEO(snt.AbstractModule): """Sonnet module implementing the inner loop of LEO.""" def __init__(self, config=None, use_64bits_dtype=True, name="leo"): super(LEO, self).__init__(name=name) self._float_dtype = tf.float64 if use_64bits_dtype else tf.float32 self._int_dtype = tf.int64 if use_64bits_dtype else tf.int32 self._inner_unroll_length = config["inner_unroll_length"] self._finetuning_unroll_length = config["finetuning_unroll_length"] self._inner_lr_init = config["inner_lr_init"] self._finetuning_lr_init = config["finetuning_lr_init"] self._num_latents = config["num_latents"] self._dropout_rate = config["dropout_rate"] self._kl_weight = config["kl_weight"] # beta self._encoder_penalty_weight = config["encoder_penalty_weight"] # gamma self._l2_penalty_weight = config["l2_penalty_weight"] # lambda_1 # lambda_2 self._orthogonality_penalty_weight = config["orthogonality_penalty_weight"] assert self._inner_unroll_length > 0, ("Positive unroll length is necessary" " to create the graph") def _build(self, data, is_meta_training=True): """Connects the LEO module to the graph, creating the variables. Args: data: A data_module.ProblemInstance constaining Tensors with the following shapes: - tr_input: (N, K, dim) - tr_output: (N, K, 1) - tr_info: (N, K) - val_input: (N, K_valid, dim) - val_output: (N, K_valid, 1) - val_info: (N, K_valid) where N is the number of classes (as in N-way) and K and the and K_valid are numbers of training and validation examples within a problem instance correspondingly (as in K-shot), and dim is the dimensionality of the embedding. is_meta_training: A boolean describing whether we run in the training mode. Returns: Tensor with the inner validation loss of LEO (include both adaptation in the latent space and finetuning). """ if isinstance(data, list): data = data_module.ProblemInstance(*data) self.is_meta_training = is_meta_training self.save_problem_instance_stats(data.tr_input) latents, kl = self.forward_encoder(data) tr_loss, adapted_classifier_weights, encoder_penalty = self.leo_inner_loop( data, latents) val_loss, val_accuracy = self.finetuning_inner_loop( data, tr_loss, adapted_classifier_weights) val_loss += self._kl_weight * kl val_loss += self._encoder_penalty_weight * encoder_penalty # The l2 regularization is is already added to the graph when constructing # the snt.Linear modules. We pass the orthogonality regularizer separately, # because it is not used in self.grads_and_vars. regularization_penalty = ( self._l2_regularization + self._decoder_orthogonality_reg) batch_val_loss = tf.reduce_mean(val_loss) batch_val_accuracy = tf.reduce_mean(val_accuracy) return batch_val_loss + regularization_penalty, batch_val_accuracy @snt.reuse_variables def leo_inner_loop(self, data, latents): with tf.variable_scope("leo_inner"): inner_lr = tf.get_variable( "lr", [1, 1, self._num_latents], dtype=self._float_dtype, initializer=tf.constant_initializer(self._inner_lr_init)) starting_latents = latents loss, _ = self.forward_decoder(data, latents) for _ in range(self._inner_unroll_length): loss_grad = tf.gradients(loss, latents) # dLtrain/dz latents -= inner_lr * loss_grad[0] loss, classifier_weights = self.forward_decoder(data, latents) if self.is_meta_training: encoder_penalty = tf.losses.mean_squared_error( labels=tf.stop_gradient(latents), predictions=starting_latents) encoder_penalty = tf.cast(encoder_penalty, self._float_dtype) else: encoder_penalty = tf.constant(0., self._float_dtype) return loss, classifier_weights, encoder_penalty @snt.reuse_variables def finetuning_inner_loop(self, data, leo_loss, classifier_weights): tr_loss = leo_loss with tf.variable_scope("finetuning"): finetuning_lr = tf.get_variable( "lr", [1, 1, self.embedding_dim], dtype=self._float_dtype, initializer=tf.constant_initializer(self._finetuning_lr_init)) for _ in range(self._finetuning_unroll_length): loss_grad = tf.gradients(tr_loss, classifier_weights) classifier_weights -= finetuning_lr * loss_grad[0] tr_loss, _ = self.calculate_inner_loss(data.tr_input, data.tr_output, classifier_weights) val_loss, val_accuracy = self.calculate_inner_loss( data.val_input, data.val_output, classifier_weights) return val_loss, val_accuracy @snt.reuse_variables def forward_encoder(self, data): encoder_outputs = self.encoder(data.tr_input) relation_network_outputs = self.relation_network(encoder_outputs) latent_dist_params = self.average_codes_per_class(relation_network_outputs) latents, kl = self.possibly_sample(latent_dist_params) return latents, kl @snt.reuse_variables def forward_decoder(self, data, latents): weights_dist_params = self.decoder(latents) # Default to glorot_initialization and not stddev=1. fan_in = self.embedding_dim.value fan_out = self.num_classes.value stddev_offset = np.sqrt(2. / (fan_out + fan_in)) classifier_weights, _ = self.possibly_sample(weights_dist_params, stddev_offset=stddev_offset) tr_loss, _ = self.calculate_inner_loss(data.tr_input, data.tr_output, classifier_weights) return tr_loss, classifier_weights @snt.reuse_variables def encoder(self, inputs): with tf.variable_scope("encoder"): after_dropout = tf.nn.dropout(inputs, rate=self.dropout_rate) regularizer = tf.contrib.layers.l2_regularizer(self._l2_penalty_weight) initializer = tf.initializers.glorot_uniform(dtype=self._float_dtype) encoder_module = snt.Linear( self._num_latents, use_bias=False, regularizers={"w": regularizer}, initializers={"w": initializer}, ) outputs = snt.BatchApply(encoder_module)(after_dropout) return outputs @snt.reuse_variables def relation_network(self, inputs): with tf.variable_scope("relation_network"): regularizer = tf.contrib.layers.l2_regularizer(self._l2_penalty_weight) initializer = tf.initializers.glorot_uniform(dtype=self._float_dtype) relation_network_module = snt.nets.MLP( [2 * self._num_latents] * 3, use_bias=False, regularizers={"w": regularizer}, initializers={"w": initializer}, ) total_num_examples = self.num_examples_per_class*self.num_classes inputs = tf.reshape(inputs, [total_num_examples, self._num_latents]) left = tf.tile(tf.expand_dims(inputs, 1), [1, total_num_examples, 1]) right = tf.tile(tf.expand_dims(inputs, 0), [total_num_examples, 1, 1]) concat_codes = tf.concat([left, right], axis=-1) outputs = snt.BatchApply(relation_network_module)(concat_codes) outputs = tf.reduce_mean(outputs, axis=1) # 2 * latents, because we are returning means and variances of a Gaussian outputs = tf.reshape(outputs, [self.num_classes, self.num_examples_per_class, 2 * self._num_latents]) return outputs @snt.reuse_variables def decoder(self, inputs): with tf.variable_scope("decoder"): l2_regularizer = tf.contrib.layers.l2_regularizer(self._l2_penalty_weight) orthogonality_reg = get_orthogonality_regularizer( self._orthogonality_penalty_weight) initializer = tf.initializers.glorot_uniform(dtype=self._float_dtype) # 2 * embedding_dim, because we are returning means and variances decoder_module = snt.Linear( 2 * self.embedding_dim, use_bias=False, regularizers={"w": l2_regularizer}, initializers={"w": initializer}, ) outputs = snt.BatchApply(decoder_module)(inputs) self._orthogonality_reg = orthogonality_reg(decoder_module.w) return outputs def average_codes_per_class(self, codes): codes = tf.reduce_mean(codes, axis=1, keep_dims=True) # K dimension # Keep the shape (N, K, *) codes = tf.tile(codes, [1, self.num_examples_per_class, 1]) return codes def possibly_sample(self, distribution_params, stddev_offset=0.): means, unnormalized_stddev = tf.split(distribution_params, 2, axis=-1) stddev = tf.exp(unnormalized_stddev) stddev -= (1. - stddev_offset) stddev = tf.maximum(stddev, 1e-10) distribution = tfp.distributions.Normal(loc=means, scale=stddev) if not self.is_meta_training: return means, tf.constant(0., dtype=self._float_dtype) samples = distribution.sample() kl_divergence = self.kl_divergence(samples, distribution) return samples, kl_divergence def kl_divergence(self, samples, normal_distribution): random_prior = tfp.distributions.Normal( loc=tf.zeros_like(samples), scale=tf.ones_like(samples)) kl = tf.reduce_mean( normal_distribution.log_prob(samples) - random_prior.log_prob(samples)) return kl def predict(self, inputs, weights): after_dropout = tf.nn.dropout(inputs, rate=self.dropout_rate) # This is 3-dimensional equivalent of a matrix product, where we sum over # the last (embedding_dim) dimension. We get [N, K, N, K] tensor as output. per_image_predictions = tf.einsum("ijk,lmk->ijlm", after_dropout, weights) # Predictions have shape [N, K, N]: for each image ([N, K] of them), what # is the probability of a given class (N)? predictions = tf.reduce_mean(per_image_predictions, axis=-1) return predictions def calculate_inner_loss(self, inputs, true_outputs, classifier_weights): model_outputs = self.predict(inputs, classifier_weights) model_predictions = tf.argmax( model_outputs, -1, output_type=self._int_dtype) accuracy = tf.contrib.metrics.accuracy(model_predictions, tf.squeeze(true_outputs, axis=-1)) return self.loss_fn(model_outputs, true_outputs), accuracy def save_problem_instance_stats(self, instance): num_classes, num_examples_per_class, embedding_dim = instance.get_shape() if hasattr(self, "num_classes"): assert self.num_classes == num_classes, ( "Given different number of classes (N in N-way) in consecutive runs.") if hasattr(self, "num_examples_per_class"): assert self.num_examples_per_class == num_examples_per_class, ( "Given different number of examples (K in K-shot) in consecutive" "runs.") if hasattr(self, "embedding_dim"): assert self.embedding_dim == embedding_dim, ( "Given different embedding dimension in consecutive runs.") self.num_classes = num_classes self.num_examples_per_class = num_examples_per_class self.embedding_dim = embedding_dim @property def dropout_rate(self): return self._dropout_rate if self.is_meta_training else 0.0 def loss_fn(self, model_outputs, original_classes): original_classes = tf.squeeze(original_classes, axis=-1) # Tensorflow doesn't handle second order gradients of a sparse_softmax yet. one_hot_outputs = tf.one_hot(original_classes, depth=self.num_classes) return tf.nn.softmax_cross_entropy_with_logits_v2( labels=one_hot_outputs, logits=model_outputs) def grads_and_vars(self, metatrain_loss): """Computes gradients of metatrain_loss, avoiding NaN. Uses a fixed penalty of 1e-4 to enforce only the l2 regularization (and not minimize the loss) when metatrain_loss or any of its gradients with respect to trainable_vars are NaN. In practice, this approach pulls the variables back into a feasible region of the space when the loss or its gradients are not defined. Args: metatrain_loss: A tensor with the LEO meta-training loss. Returns: A tuple with: metatrain_gradients: A list of gradient tensors. metatrain_variables: A list of variables for this LEO model. """ metatrain_variables = self.trainable_variables metatrain_gradients = tf.gradients(metatrain_loss, metatrain_variables) nan_loss_or_grad = tf.logical_or( tf.is_nan(metatrain_loss), tf.reduce_any([tf.reduce_any(tf.is_nan(g)) for g in metatrain_gradients])) regularization_penalty = ( 1e-4 / self._l2_penalty_weight * self._l2_regularization) zero_or_regularization_gradients = [ g if g is not None else tf.zeros_like(v) for v, g in zip(tf.gradients(regularization_penalty, metatrain_variables), metatrain_variables)] metatrain_gradients = tf.cond(nan_loss_or_grad, lambda: zero_or_regularization_gradients, lambda: metatrain_gradients, strict=True) return metatrain_gradients, metatrain_variables @property def _l2_regularization(self): return tf.cast( tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)), dtype=self._float_dtype) @property def _decoder_orthogonality_reg(self): return self._orthogonality_reg 中解码器将潜在代码映射为分类器参数,然后将分类器参数用于更新分类器,再通过更新后的分类器求损失,这些对应哪段代码?
最新发布
05-15
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Wanderer001

ROIAlign原理

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值