During the short time I have been running experiments with TensorFlow, I have hit quite a few problems. I am writing down the ones I still remember here for future reference.
1. After running sess = tf.Session() or sess = tf.InteractiveSession(), the memory of every GPU on the machine is fully occupied
A: This is TensorFlow's normal behavior: by default it grabs all memory on all visible GPUs. To use resources more sparingly, do the following:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'  # which GPU(s) this process is allowed to see (and use)
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # cap this process at 30% of GPU memory; with only this line and not the next, the program dies with an OOM error once it needs more than the cap
config.gpu_options.allow_growth = True  # allow memory usage to grow: if the current allocation is not enough, more is allocated; with only this line, TensorFlow starts from a minimal allocation and grows as needed
sess = tf.Session(config=config)
# sess = tf.InteractiveSession(config=config)
Note, however, that memory once allocated is not released automatically, even if the program is no longer using that much.
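As a quick sanity check (not required for the fix above), you can list the devices TensorFlow actually sees after setting CUDA_VISIBLE_DEVICES; with the variable set to '0', only one GPU should show up:

from tensorflow.python.client import device_lib
# prints something like ['/cpu:0', '/gpu:0']; exact device naming varies slightly across TF versions
print([d.name for d in device_lib.list_local_devices()])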
2. I wrote a base network, trained it, and saved it as a model. I then added several layers on top of the base network, intending to continue training from the trained base, but restoring the model raised an error
A: This is expected: tf.train.Saver() notices that the newly added Variables do not exist in the saved model, so the model does not match and it raises an error. I spent a whole day hunting for a solution; perhaps my keywords were off, because Baidu turned up nothing and I only solved it after searching Google in English. Look at the constructor of the tf.train.Saver() class:
__init__(
    var_list=None,
    reshape=False,
    sharded=False,
    max_to_keep=5,
    keep_checkpoint_every_n_hours=10000.0,
    name=None,
    restore_sequentially=False,
    saver_def=None,
    builder=None,
    defer_build=False,
    allow_empty=False,
    write_version=tf.train.SaverDef.V2,
    pad_step_number=False,
    save_relative_paths=False,
    filename=None
)
The official documentation describes the first argument, var_list, as follows: it "specifies the variables that will be saved and restored. It can be passed as a dict or a list". So we can explicitly choose which Variables get saved and loaded, passing them as either a dict or a list. The first step, then, is to find out which Variables the graph contains.
all_variables = tf.contrib.framework.get_variables_to_restore()  # returns a list with every Variable in the graph
variables_to_restore = []   # variables present in the pre-trained model
variables_not_restore = []  # newly added variables that must not be restored
for v in all_variables:
    if v.name.split('/')[0] != 'New_layer':
        variables_to_restore.append(v)
    else:
        variables_not_restore.append(v)
saver = tf.train.Saver(var_list=variables_to_restore, max_to_keep=1, write_version=1)
saver.restore(sess, './model-1')
sess.run(tf.variables_initializer(var_list=variables_not_restore))  # variables that were not restored still need to be initialized
# tf.variables_initializer(var_list=variables_not_restore).run()
Once I have all the Variables in the current graph, I need to separate the ones present in the pre-trained model (variables_to_restore) from the ones I newly added that must not be restored (variables_not_restore). Because all my new variables were defined under with tf.variable_scope('New_layer'), the first component of each new variable's name is 'New_layer', which makes old and new variables easy to tell apart. After that, as the code above shows, the pre-trained model loads without trouble. Note that Variables that were not restored still have to be initialized; tf.variables_initializer(var_list) initializes exactly the specified Variables. Incidentally, tf.global_variables_initializer() is just shorthand for tf.variables_initializer(tf.global_variables()), as the official docs say.
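As a side note, the dict form of var_list is useful when the names stored in the checkpoint differ from the names in the current graph, for example after renaming a variable scope. A minimal sketch, where 'old_scope' and 'new_scope' are made-up names for illustration:

# inspect which names the checkpoint actually contains
reader = tf.train.NewCheckpointReader('./model-1')
print(reader.get_variable_to_shape_map())  # {checkpoint_name: shape, ...}

# keys are the names in the checkpoint, values are Variables in the current graph
name_map = {v.op.name.replace('new_scope', 'old_scope'): v for v in variables_to_restore}
saver = tf.train.Saver(var_list=name_map)
saver.restore(sess, './model-1')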
3. tf.nn.rnn_cell.BasicLSTMCell(num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None) implements an LSTM cell whose inputs are the current datum xt and the previous hidden state ht-1, but I need a cell that takes two current data, xt and yt, plus ht-1 as input.
A: I suspect TensorFlow provides such an LSTM class somewhere, but being unfamiliar with the library I could not find it, so I had no choice but to write my own special-purpose LSTM class modeled on the implementation of tf.nn.rnn_cell.BasicLSTMCell(). The work is mostly in reimplementing the _linear() and __call__() functions. Since I had been using state_is_tuple=False from the start, my implementation simply forces state_is_tuple=False (not recommended). The code:
class MyLSTM(tf.contrib.rnn.RNNCell):
    def __init__(self, num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None):
        super(MyLSTM, self).__init__(_reuse=reuse)
        assert state_is_tuple == False, "state_is_tuple must be 'False' in this implementation"
        self._num_units = num_units
        self._forget_bias = forget_bias
        self._state_is_tuple = state_is_tuple
        self._activation = activation or tf.tanh

    def _linear(self, args, output_size, bias, bias_initializer=None, kernel_initializer=None):
        # args: a list of 2-D (batch x n) Tensors
        total_arg_size = 0
        shapes = [a.get_shape() for a in args]
        for shape in shapes:
            if shape.ndims != 2:
                raise ValueError("linear is expecting 2D arguments: %s" % shapes)
            if shape[1].value is None:
                raise ValueError("linear expects shape[1] to be provided for shape %s, "
                                 "but saw %s" % (shape, shape[1]))
            else:
                total_arg_size += shape[1].value
        dtype = [a.dtype for a in args][0]
        scope = tf.get_variable_scope()
        with tf.variable_scope(scope) as outer_scope:
            weights = tf.get_variable("kernel", shape=[total_arg_size, output_size], dtype=dtype,
                                      initializer=kernel_initializer)
            if len(args) == 1:
                res = tf.matmul(args[0], weights)
            else:
                res = tf.matmul(tf.concat(args, axis=1), weights)
            if not bias:
                return res
            with tf.variable_scope(outer_scope) as inner_scope:
                inner_scope.set_partitioner(None)
                if bias_initializer is None:
                    bias_initializer = tf.constant_initializer(value=0.0, dtype=dtype)
                biases = tf.get_variable("bias", [output_size], dtype=dtype, initializer=bias_initializer)
            return tf.nn.bias_add(res, biases)

    @property
    def state_size(self):
        return 2 * self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, input1, input2, state):
        """
        :param input1: `2-D` tensor with shape `[batch_size x input_size]`
        :param input2: same size as input1
        :param state: a `Tensor` shaped `[batch_size x self.state_size]` (i.e. 2 * num_units) with state_is_tuple=False
        :return: new_h, concat([new_c, new_h], 1)
        """
        sigmoid = tf.sigmoid
        c, h = tf.split(value=state, num_or_size_splits=2, axis=1)
        concat = self._linear([input1, input2, h], 4 * self._num_units, True)  # (batch_size, 4*self._num_units)
        i, j, f, o = tf.split(value=concat, num_or_size_splits=4, axis=1)
        new_c = c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j)
        new_h = self._activation(new_c) * sigmoid(o)
        if self._state_is_tuple:
            new_state = tf.nn.rnn_cell.LSTMStateTuple(new_c, new_h)
        else:
            new_state = tf.concat(values=[new_c, new_h], axis=1)
        return new_h, new_state
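Because __call__ takes two inputs rather than one, this cell cannot be handed to tf.nn.dynamic_rnn directly; the straightforward way to use it is to unroll the time loop by hand. A minimal sketch, assuming two input sequences of shape [batch_size, num_steps, input_size] (all names and sizes below are illustrative):

batch_size, num_steps, input_size, num_units = 32, 10, 128, 256
x_seq = tf.placeholder(tf.float32, [batch_size, num_steps, input_size])
y_seq = tf.placeholder(tf.float32, [batch_size, num_steps, input_size])

cell = MyLSTM(num_units, state_is_tuple=False)
state = tf.zeros([batch_size, 2 * num_units])  # concat([c, h], 1), matching state_is_tuple=False
outputs = []
with tf.variable_scope('my_lstm') as scope:
    for t in range(num_steps):
        if t > 0:
            scope.reuse_variables()  # share the kernel/bias variables across time steps
        output, state = cell(x_seq[:, t, :], y_seq[:, t, :], state)
        outputs.append(output)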
4. The network is not large, but GPU memory usage explodes when execution reaches train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss) (allow_growth is already enabled)
A: I originally wrote:
train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss)
After Googling, I changed it to:
train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss,aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
or:
train_op = tf.train.AdamOptimizer(learning_rate).minimize(tf_loss, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)
and that solved it. It is presumably because I compute tf_loss inside a for loop; I never pinned down the exact cause, but the fix works well.
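For context, the pattern that triggered the problem for me looked roughly like this (a simplified sketch; build_step, inputs, and targets stand in for whatever your loop actually computes):

tf_loss = 0.0
for t in range(num_steps):
    pred_t = build_step(inputs[t])  # hypothetical per-step forward pass
    tf_loss += tf.reduce_mean(tf.square(pred_t - targets[t]))

# EXPERIMENTAL_ACCUMULATE_N sums the per-step gradients with accumulate_n
# instead of add_n, which avoids keeping every per-step gradient tensor
# alive in memory at the same time
train_op = tf.train.AdamOptimizer(learning_rate).minimize(
    tf_loss, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)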
References:
Problem 2: https://stackoverflow.com/questions/42217320/restore-variables-that-are-a-subset-of-new-model-in-tensorflow
https://github.com/DrSleep/tensorflow-deeplab-resnet/issues/11
Problem 3: https://stackoverflow.com/questions/45439045/tensorflow-rnn-input-of-two-different-types
Problem 4: https://stackoverflow.com/questions/36194394/how-i-reduce-memory-consumption-in-a-loop-in-tensorflow