TensorFlow: Related Notes

Latest recommended article published on 2022-04-23 10:33:36

qq924178473

License: CC 4.0 BY-SA

Category: python. Tags: TensorFlow

Article link: https://blog.youkuaiyun.com/h_jlwg6688/article/details/65441723

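This article gives an accessible introduction to TensorFlow basics, including how it runs, placeholders, shared variables, GPU configuration, and model training and testing, and explains a number of practical techniques in detail.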

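Mainly based on: http://blog.youkuaiyun.com/u012436149/article/details/53140869

1. How TensorFlow runs:

TensorFlow performs its heavy computation outside Python. Switching back to Python after every single operation would be costly, so instead of running each expensive operation on its own, TensorFlow lets us first describe a graph of interacting operations and then runs the whole graph outside Python.

2. placeholder:

x = tf.placeholder(tf.float32, [None, 784])

Here x is not a specific value but a placeholder; its value is fed in when TensorFlow runs the computation. The shape [None, 784] describes a 2-D tensor: the first dimension, None, can have any length, and the second dimension has length 784.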

//2017/3/27

1. What tf.ConfigProto is for:

tf.ConfigProto is normally used when creating a session, to configure that session. Its options include:

a) Logging device placement: to find out which device your operations and tensors are assigned to, create the session with log_device_placement set to True: sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)). This prints the device (CPU or GPU) on which each operation runs.

Also: you can place operations on a CPU or GPU by hand with "with tf.device('/cpu:0'):". This sets the device context to cpu0, and every operation created inside that context runs on cpu0.

b) To guard against naming a device that does not exist, set allow_soft_placement to True when creating the session; TensorFlow then automatically picks an existing, supported device to run the operation.

c) config = tf.ConfigProto()
config.gpu_options.allow_growth = True: allocate only a small amount of GPU memory at first and grow it on demand. Memory is never released, so this can lead to fragmentation.

d) gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
config = tf.ConfigProto(gpu_options=gpu_options): sets the fraction of each GPU's memory that the process may use; 0.7 means 70%.

References: http://blog.youkuaiyun.com/u012436149/article/details/53837651

http://wiki.jikexueyuan.com/project/tensorflow-zh/how_tos/using_gpu.html

e) intra_op_parallelism_threads and inter_op_parallelism_threads: choose how many cores to use.

Reference: http://stackoverflow.com/questions/41233635/tensorflow-inter-and-intra-op-parallelism-configuration

2. Shared variables in TensorFlow:

An initializer for data: random_uniform_initializer(minval, maxval, seed, dtype) generates values drawn uniformly between minval and maxval.

Reference: http://wiki.jikexueyuan.com/project/tensorflow-zh/how_tos/variable_scope/index.html

//2017/3/28

1. A -1 entry in the shape passed to TensorFlow's reshape:

# tensor 't' is [[[1, 1, 1],
#                 [2, 2, 2]],
#                [[3, 3, 3],
#                 [4, 4, 4]],
#                [[5, 5, 5],
#                 [6, 6, 6]]]

# -1 is inferred to be 2:
reshape(t, [-1, 9]) ==> [[1, 1, 1, 2, 2, 2, 3, 3, 3],
                         [4, 4, 4, 5, 5, 5, 6, 6, 6]]
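
How the -1 gets its value can be sketched in plain Python. This is only an illustration of the arithmetic, not TensorFlow's implementation; the helper name infer_shape is hypothetical.

```python
def infer_shape(num_elements, shape):
    """Resolve a single -1 entry so the sizes multiply to num_elements."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if -1 not in shape:
        if known != num_elements:
            raise ValueError("shape does not match the number of elements")
        return list(shape)
    if known == 0 or num_elements % known != 0:
        raise ValueError("cannot infer the -1 dimension")
    # The missing size is whatever makes the total element count match.
    return [num_elements // known if dim == -1 else dim for dim in shape]


# The tensor above has shape [3, 2, 3], i.e. 18 elements.
resolved = infer_shape(3 * 2 * 3, [-1, 9])
print(resolved)  # [2, 9]
```

The total number of elements stays fixed, so the unknown dimension is the element count divided by the product of the known dimensions.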

//2017/4/1

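1. How to take an already-built RNN model, load trained parameters, and run testing:

with tf.Session(graph=self.graph) as session:
    saver = tf.train.Saver()
    saver.restore(session, ckpt_file)

Here self.graph is the graph holding the already-built RNN network, and ckpt_file is the checkpoint file with the saved model parameters.

2. How to feed test data into the network:

For the test node in the network, call its eval() method and pass the test data in as a feed:

self.sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

This is a node of the network graph: the prediction node.

Running the test itself:

prediction = self.sample_prediction.eval({self.sample_input: [sample_input], self.keep_prob: 1.0})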

//2017/4/12

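1. Command-line parsing in tf:

flags = tf.flags. Here flags refers to a module, flags.py, that handles parsing of command-line arguments.

Call the DEFINE_* functions in flags to declare the arguments to parse:
flags.DEFINE_string("para_name_1", "default_val", "description")
flags.DEFINE_bool("para_name_2", False, "description")

FLAGS is an object that holds the parsed command-line arguments:
FLAGS = flags.FLAGS

Using an argument passed on the command line:

def main(_):
    print(FLAGS.para_name_1)

Parse the command-line arguments and call main (tf.app.run() ends up calling main(sys.argv)):

if __name__ == "__main__":
    tf.app.run()

Invocation:

python script.py --para_name_1=name --para_name_2=True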

2. tf.reduce_**:

3. tf.concat:

t1 = [[1, 2, 3], [4, 5, 6]]
t2 = [[7, 8, 9], [10, 11, 12]]

# Argument order used before TensorFlow 1.0; from 1.0 on it is tf.concat([t1, t2], 0).
tf.concat(0, [t1, t2]) ==> [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

tf.concat(1, [t1, t2]) ==> [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]

4. tf.squeeze:

# 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
shape(squeeze(t)) ==> [2, 3]
# Or, to remove specific size 1 dimensions:
# 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
shape(squeeze(t, [2, 4])) ==> [1, 2, 3, 1]

5. tf.split:

# 'value' is a tensor with shape [5, 30]
# Split 'value' into 3 tensors along dimension 1
# (argument order used before TensorFlow 1.0; from 1.0 on it is tf.split(value, 3, 1))
split0, split1, split2 = tf.split(1, 3, value)
tf.shape(split0) ==> [5, 10]
6. embedding:

mat = np.array([1,2,3,4,5,6,7,8,9]).reshape((3,-1))
ids = [[1,2], [0,1]]
res = tf.nn.embedding_lookup(mat, ids)
res.eval()
array([[[4, 5, 6],
[7, 8, 9]],

[[1, 2, 3],
[4, 5, 6]]])

6. tf.expand_dims:

# 't' is a tensor of shape [2]
# each call adds one dimension
shape(tf.expand_dims(t, 0)) ==> [1, 2]
shape(tf.expand_dims(t, 1)) ==> [2, 1]
shape(tf.expand_dims(t, -1)) ==> [2, 1]
# 't2' is a tensor of shape [2, 3, 5]
shape(tf.expand_dims(t2, 0)) ==> [1, 2, 3, 5]
shape(tf.expand_dims(t2, 2)) ==> [2, 3, 1, 5]
shape(tf.expand_dims(t2, 3)) ==> [2, 3, 5, 1]

7. tf.slice:

x = [[1,2,3],[4,5,6]]
begin_x = [1,0]
size_x = [1,2]
out = tf.slice(x, begin_x, size_x)   # result: [[4 5]]

y = np.arange(24).reshape([2,3,4])
begin_y = [1,0,0]
size_y = [1,2,3]
out = tf.slice(y, begin_y, size_y)   # result: [[[12 13 14] [16 17 18]]]

a = np.array([[1,2,3,4,5],[4,5,6,7,8],[9,10,11,12,13]])
tf.slice(a, [1,2], [-1,2]).eval()    # result: array([[ 6, 7], [11, 12]])
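
The begin/size convention, including a size of -1 meaning "everything from begin to the end of that dimension", can be sketched for the 2-D case in plain Python. The helper slice_2d is hypothetical and only mimics the indexing rule; it is not how TensorFlow implements the op.

```python
def slice_2d(matrix, begin, size):
    """Mimic tf.slice on a 2-D nested list; a size of -1 means 'to the end'."""
    rows = len(matrix)
    cols = len(matrix[0])
    row_count = rows - begin[0] if size[0] == -1 else size[0]
    col_count = cols - begin[1] if size[1] == -1 else size[1]
    return [row[begin[1]:begin[1] + col_count]
            for row in matrix[begin[0]:begin[0] + row_count]]


x = [[1, 2, 3], [4, 5, 6]]
a = [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [9, 10, 11, 12, 13]]
first = slice_2d(x, [1, 0], [1, 2])
second = slice_2d(a, [1, 2], [-1, 2])
print(first)   # [[4, 5]]
print(second)  # [[6, 7], [11, 12]]
```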

8. tf.stack:

# 'x' is [1, 4]
# 'y' is [2, 5]
# 'z' is [3, 6]
stack([x, y, z]) => [[1, 4], [2, 5], [3, 6]] # Pack along first dim.
stack([x, y, z], axis=1) => [[1, 2, 3], [4, 5, 6]]

9. tf.gather:

temp = tf.range(0,10)*10 + tf.constant(1,shape=[10])
temp2 = tf.gather(temp,[1,5,9])

temp: [ 1 11 21 31 41 51 61 71 81 91]
temp2: [11 51 91]

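10. tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None): (http://blog.youkuaiyun.com/mao_xiao_feng/article/details/53382790)

First argument, logits: the output of the network's last layer. With a batch its shape is [batch_size, num_classes]; for a single sample it is [num_classes].
Second argument, labels: the true labels, with the same shape.

The computation has roughly two steps.
Step 1 applies softmax to the last layer's output, which gives the probability of the sample belonging to each class. For a single sample the result is a vector of size num_classes, [Y1, Y2, Y3, ...], where Y1, Y2, Y3, ... are the probabilities of the corresponding classes.

The softmax formula is:

$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

Step 2 takes the cross entropy between the softmax output vector [Y1, Y2, Y3, ...] and the sample's true label:

$$H_{y'}(y) = -\sum_i y'_i \log(y_i)$$

Here $y'_i$ is the i-th value of the true label (taking MNIST as an example, a label of 3 is encoded as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]: the 4th value is 1 and all others are 0),
and $y_i$ is the i-th element of the softmax output vector [Y1, Y2, Y3, ...].
Clearly, the more accurate the prediction, the smaller the result (note the leading minus sign). Averaging over the samples then gives the loss we want.
Note: this function returns a vector with one entry per sample, not a single number. To get the total cross entropy, apply tf.reduce_sum to add up all elements; to get the loss, apply tf.reduce_mean to take the mean of the vector.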

Example:

import tensorflow as tf

# our NN's output
logits = tf.constant([[1.0,2.0,3.0],[1.0,2.0,3.0],[1.0,2.0,3.0]])
# step 1: do softmax
y = tf.nn.softmax(logits)
# true label
y_ = tf.constant([[0.0,0.0,1.0],[0.0,0.0,1.0],[0.0,0.0,1.0]])
# step 2: do cross_entropy
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
# do cross_entropy in just one step; don't forget tf.reduce_sum()!
cross_entropy2 = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))

with tf.Session() as sess:
    softmax = sess.run(y)
    c_e = sess.run(cross_entropy)
    c_e2 = sess.run(cross_entropy2)
    print("step1:softmax result=")
    print(softmax)
    print("step2:cross_entropy result=")
    print(c_e)
    print("Function(softmax_cross_entropy_with_logits) result=")
    print(c_e2)

Result:

step1:softmax result=
[[ 0.09003057 0.24472848 0.66524094]
 [ 0.09003057 0.24472848 0.66524094]
 [ 0.09003057 0.24472848 0.66524094]]
step2:cross_entropy result= 1.22282
Function(softmax_cross_entropy_with_logits) result= 1.2228

Conclusion: you can check that e^1/(e^1+e^2+e^3) is indeed 0.09003057, which confirms that the output follows the formula.
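
The same check can be done by hand without TensorFlow. The sketch below recomputes the softmax and the summed cross entropy for the three identical samples in plain Python; the function names are illustrative only.

```python
import math


def softmax(logits):
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = [v - max(logits) for v in logits]
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(labels, probs):
    # -sum_i y'_i * log(y_i); terms with a zero label contribute nothing.
    return -sum(t * math.log(p) for t, p in zip(labels, probs) if t > 0)


logits = [1.0, 2.0, 3.0]
label = [0.0, 0.0, 1.0]
probs = softmax(logits)
per_sample = cross_entropy(label, probs)
batch_sum = 3 * per_sample   # the batch holds three identical samples
print(probs)
print(batch_sum)
```

The first probability is e^1/(e^1+e^2+e^3), and the batch sum agrees with the 1.2228 reported above up to rounding.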

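11. tf.nn.sigmoid_cross_entropy_with_logits(logits, targets, name=None): (http://dataunion.org/26447.html)

Cross entropy is one kind of loss function (also called a cost function); it measures how far the model's predictions are from the true values.

Cross entropy is defined as:

$$C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right]$$

where y is the true label and a is the prediction. The "a" used here has already been passed through a sigmoid, so it lies between 0 and 1.

Why start with sigmoid_cross_entropy_with_logits? Because its implementation matches the cross-entropy definition above, and it was the first cross-entropy function implemented in TensorFlow. Its inputs are logits and targets. logits is the W * X matrix of the neural network model; note that it must not be passed through sigmoid first. targets has the same shape as logits and holds the correct label values. For example, if the model decides for 100 images at a time whether each contains any of 10 kinds of animals, both inputs have shape [100, 10]. The function's documentation also notes that the 10 classes are independent and need not be mutually exclusive. We call this a multi-label problem: when judging whether an image contains each of 10 animals, a label may contain several 1s or none at all. A different kind of problem is multi-class classification: for example, if an age feature is divided into 5 ranges, exactly one of the 5 values may be 1. Can this function be used directly for such a problem? The source article's answer is no, and it explains why by walking through the implementation of sigmoid_cross_entropy_with_logits.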
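
As a rough illustration of the per-element computation, here is a plain-Python sketch, not the library source. It compares the textbook form with the numerically stable rearrangement max(x, 0) - x*z + log(1 + exp(-|x|)) given in the TensorFlow documentation for this function. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_cross_entropy_naive(logit, target):
    # Direct use of the definition: -[z*ln(a) + (1-z)*ln(1-a)], a = sigmoid(x).
    a = sigmoid(logit)
    return -(target * math.log(a) + (1.0 - target) * math.log(1.0 - a))


def sigmoid_cross_entropy_stable(logit, target):
    # Algebraically the same value, but safe for logits of large magnitude.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]   # multi-label: more than one entry may be 1
losses = [sigmoid_cross_entropy_stable(x, z) for x, z in zip(logits, targets)]
print(losses)
```

Each class gets its own independent sigmoid and its own loss term, which is why the targets may contain several 1s or none; nothing forces the per-class probabilities to sum to 1.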

12. Activation functions in TensorFlow (the element-wise activation functions return a tensor with the same shape as their input)

tf.nn.relu()
tf.nn.sigmoid()
tf.nn.tanh()
tf.nn.elu()
tf.nn.bias_add()
tf.nn.crelu()
tf.nn.relu6()
tf.nn.softplus()
tf.nn.softsign()
tf.nn.dropout()
tf.nn.relu_layer(x, weights, biases,name=None)

//2017/4/13

1. def nce_loss(nce_weights, nce_biases, embed, train_labels, num_sampled, vocabulary_size): (http://stackoverflow.com/questions/41475180/understanding-tf-nn-nce-loss-in-tensorflow)
# this function is used in word2vec

2. def sequence_loss_by_example(logits, targets, weights,
                             average_across_timesteps=True,
                             softmax_loss_function=None, name=None): (http://blog.youkuaiyun.com/appleml/article/details/54017873)

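3. Saving the parameters of a trained TensorFlow model:

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(init_op)
    # train the model ...
    # use the convenience method provided by saver to run the save op
    saver.save(sess, "save_path/file_name.ckpt")  # file_name.ckpt is created automatically if it does not exist
    # the file extension is optional

4. Restoring a subset of the variables, then using the initialization op for the rest
# To do this, start from the arguments to the Saver constructor
saver = tf.train.Saver([sub_set])
init = tf.initialize_all_variables()
with tf.Session() as sess:
    # The restored values replace the initialized values of those variables; the other initialized values are unaffected
    sess.run(init)
    if restore_from_checkpoint:
        saver.restore(sess, "file.ckpt")
    # train
    saver.save(sess, "file.ckpt")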

5. CNN operations in TensorFlow:

http://www.jianshu.com/p/e3a79eac554f

6. dropout:

Dropout works well for preventing overfitting.
Dropout is generally applied to the fully connected part; it is usually not applied to the convolutional part or to the output layer.

Two kinds of dropout:

1. tf.nn.dropout(x, keep_prob, noise_shape=None, seed=None, name=None)
2. tf.nn.rnn_cell.DropoutWrapper(rnn_cell, input_keep_prob=1.0, output_keep_prob=1.0)
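
What keep_prob does can be sketched in plain Python: each element is kept with probability keep_prob and the survivors are scaled by 1/keep_prob so the expected sum is unchanged. This is an illustration with a hypothetical helper, not the TensorFlow op.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each element with probability keep_prob; scale survivors by 1/keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)                       # fixed seed, for repeatability only
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, 0.5, rng)   # training: entries are either 0 or doubled
test_out = dropout(activations, 1.0, rng)    # testing: keep_prob=1.0 leaves the input unchanged
print(train_out)
print(test_out)
```

This is why the evaluation code earlier feeds keep_prob: 1.0 at test time.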

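7. Getting the dimensions of fed data: tf.shape(x) and tensor.get_shape():

1. x = tf.placeholder(tf.float32, shape=[None, 227, 227, 3])
We often feed data this way. To find out at run time what None actually is, the only option is tf.shape(x)[0].

2. tensor.get_shape()
Only tensors have this method. It returns the static shape as a TensorShape, which can be converted to a list with as_list().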

3. local_response_normalization:

local_response_normalization comes from the paper "ImageNet Classification with Deep Convolutional Neural Networks", which reports that this normalization helps generalization:

$$b^i_{x,y} = a^i_{x,y} \Big/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2\right)^{\beta}$$

After a conv2d or pooling layer we have a tensor of shape [batch_size, height, width, channels]. Call the channels "layers" and ignore batch_size:
- i denotes the i-th layer
- $a^i_{x,y}$ is the value at position (x, y) of the i-th layer
- the sum runs over n adjacent feature maps, and N is the total number of layers
- k, α, n and β are hyperparameters
- so the function normalizes $a^i_{x,y}$ using the values at the same position in its neighboring maps
AlexNet uses k=2, n=5, α=10^{-4}, β=0.75.

tf.nn.local_response_normalization(input, depth_radius=None, bias=None, alpha=None, beta=None, name=None):

depth_radius: n/2 in the formula
bias: k in the formula
input: the output of conv2d or pooling, [batch_size, height, width, channels]
return: [batch_size, height, width, channels], normalized
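
For a single pixel, the normalization acts only along the channel axis. The following plain-Python sketch applies the formula to one channel vector. The function name mirrors the TensorFlow op for readability, but this is an illustrative re-implementation with made-up input values.

```python
def local_response_normalization(channels, depth_radius, bias, alpha, beta):
    """Normalize one pixel's channel vector using its neighbouring channels."""
    out = []
    for i, value in enumerate(channels):
        lo = max(0, i - depth_radius)
        hi = min(len(channels), i + depth_radius + 1)
        sqr_sum = sum(c * c for c in channels[lo:hi])   # window includes channel i itself
        out.append(value / (bias + alpha * sqr_sum) ** beta)
    return out


pixel = [1.0, 2.0, 3.0, 4.0]   # values of 4 channels at one (x, y) position
normalized = local_response_normalization(
    pixel, depth_radius=1, bias=2.0, alpha=1e-4, beta=0.75)
print(normalized)
```

With depth_radius=1, each channel is divided by a term built from itself and its immediate neighbours; edge channels simply have a shorter window.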

4. batch_normalization:

As the name suggests, batch_normalization normalizes over one batch at a time.

Input: mini_batch: In = {x1, x2, ..., xm}
- γ, β: parameters to be learned, both vectors
- ϵ: a constant
- Output: Out = {y1, y2, ..., ym}

The algorithm:
(1) mini_batch mean:

$$\mu_{In} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$$

(2) mini_batch variance:

$$\sigma^2_{In} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{In})^2$$

(3) normalize:

$$\hat{x}_i = \frac{x_i - \mu_{In}}{\sqrt{\sigma^2_{In} + \epsilon}}$$

(4) scale and shift:

$$y_i = \gamma \hat{x}_i + \beta$$

After batch_normalization the dimensions of the data are unchanged; only the values change.

Out is the input to the next layer.

def batch_normalization(x, mean, variance, offset, scale, variance_epsilon, name=None):
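
The four steps can be traced for a single feature in plain Python. This is a sketch with scalar γ and β and made-up input values, and its simplified argument list differs from the TensorFlow signature shown above.

```python
import math


def batch_normalization(batch, gamma, beta, epsilon):
    m = len(batch)
    mean = sum(batch) / m                                     # (1) mini-batch mean
    variance = sum((x - mean) ** 2 for x in batch) / m        # (2) mini-batch variance
    normalized = [(x - mean) / math.sqrt(variance + epsilon)  # (3) normalize
                  for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]     # (4) scale and shift


batch = [1.0, 2.0, 3.0, 4.0]
out = batch_normalization(batch, gamma=1.0, beta=0.0, epsilon=1e-5)
print(out)
```

With γ=1 and β=0 the output has mean 0 and variance close to 1, and it has the same number of elements as the input.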

5. Gradient clipping in TensorFlow:

When training deep neural networks we often run into vanishing and exploding gradients. Many methods have been proposed for these problems; this part shows how to use clipping in TensorFlow to address them:

train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

Calling minimize actually does two things under the hood:
- computes the gradients of all trainable variables
- applies them to the variables
Then, when we call sess.run(train_op), the variables are updated.

clip:
What if we want to process the gradients after they have been computed?
The official documentation gives these steps:
1. Compute the gradients with compute_gradients().
2. Process the gradients as you wish.
3. Apply the processed gradients with apply_gradients().
From then on, training updates the variables with the processed gradients.

Skeleton:
# Create an optimizer. The optimizer must be declared on the same device as the variables.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Example:
# return a list of the trainable variables in your model
params = tf.trainable_variables()

# create an optimizer
opt = tf.train.GradientDescentOptimizer(self.learning_rate)

# compute gradients for params
gradients = tf.gradients(loss, params)

# process gradients
clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)

train_op = opt.apply_gradients(zip(clipped_gradients, params))

Now sess.run(train_op) performs a training step.
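
The rule behind tf.clip_by_global_norm is: compute one norm over all gradients together, then scale every gradient by clip_norm / max(global_norm, clip_norm). The plain-Python sketch below is an illustrative re-implementation on nested lists with made-up numbers.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """gradients: list of lists of floats, one inner list per variable."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # The scale is 1 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm


gradients = [[3.0, 4.0], [12.0]]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(gradients, clip_norm=6.5)
print(norm)     # 13.0
print(clipped)  # [[1.5, 2.0], [6.0]]
```

Because all gradients share one scale factor, the direction of the overall update is preserved; only its length is capped.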

//2017/4/14

1. variable scope:

The name of a variable created by get_variable is prefixed with the scope in which it was created:

with tf.variable_scope("tet1"):
    var3 = tf.get_variable("var3", shape=[2], dtype=tf.float32)
    print(var3.name)
    with tf.variable_scope("tet2"):
        var4 = tf.get_variable("var4", shape=[2], dtype=tf.float32)
        print(var4.name)
# Output ****************
# tet1/var3:0
# tet1/tet2/var4:0
# *********************

def te2():
    with tf.variable_scope("te2"):
        var2 = tf.get_variable("var2", shape=[2], dtype=tf.float32)
        print(var2.name)
        def te1():
            with tf.variable_scope("te1"):
                var1 = tf.get_variable("var1", shape=[2], dtype=tf.float32)
            return var1
        return te1()  # called inside scope te2
res = te2()
print(res.name)
# Output *********************
# te2/var2:0
# te2/te1/var1:0
# ************************

Note how this differs from the previous program:

def te2():
    with tf.variable_scope("te2"):
        var2 = tf.get_variable("var2", shape=[2], dtype=tf.float32)
        print(var2.name)
        def te1():
            with tf.variable_scope("te1"):
                var1 = tf.get_variable("var1", shape=[2], dtype=tf.float32)
            return var1
    return te1()  # called outside scope te2
res = te2()
print(res.name)
# Output *********************
# te2/var2:0
# te1/var1:0
# ************************

One thing to note is the difference between tf.variable_scope("name") and tf.variable_scope(scope); see the code below:

import tensorflow as tf
with tf.variable_scope("scope"):
    tf.get_variable("w", shape=[1])  # this variable's name is scope/w
    with tf.variable_scope("scope"):
        tf.get_variable("w", shape=[1])  # this variable's name is scope/scope/w
# The two variables have different names, so there is no conflict

Compare:

import tensorflow as tf
with tf.variable_scope("yin"):
    tf.get_variable("w", shape=[1])  # this variable's name is yin/w
    scope = tf.get_variable_scope()
    with tf.variable_scope(scope):  # a scope passed this way is the outer scope itself
        tf.get_variable("w", shape=[1])  # this variable's name is also yin/w
# The two variables have the same name, so this raises an error

2. Sharing variables
Sharing requires that the variable names be identical. A variable's full name consists of its own name plus its scope prefix. tf.get_variable_scope().reuse_variables() allows all variables under the current scope to be shared. Reused variables can be regarded as the same node.

Example:

def my_image_filter(input_images):
    conv1_weights = tf.Variable(tf.random_normal([5, 5, 32, 32]),
                                name="conv1_weights")
    conv1_biases = tf.Variable(tf.zeros([32]), name="conv1_biases")
    conv1 = tf.nn.conv2d(input_images, conv1_weights,
                         strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(conv1 + conv1_biases)

There are two variables, conv1_weights and conv1_biases, and one op, conv1. Calling the function twice directly causes no error, but it creates two sets of variables:

result1 = my_image_filter(image1)

result2 = my_image_filter(image2)

Those two calls are fine. But if tf.Variable is replaced with tf.get_variable and the function is called in the same way, the second call fails:

result1 = my_image_filter(image1)

result2 = my_image_filter(image2)  # raises ValueError: the variable already exists

The fix is to use tf.variable_scope and share the variables within one scope:

1)

with tf.variable_scope("image_filters") as scope:
    result1 = my_image_filter(image1)
    scope.reuse_variables()  # or
    # tf.get_variable_scope().reuse_variables()
    result2 = my_image_filter(image2)

2)

with tf.variable_scope("image_filters1") as scope1:
    result1 = my_image_filter(image1)
with tf.variable_scope(scope1, reuse=True):
    result2 = my_image_filter(image2)
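
The naming-and-reuse rule can be modelled with a small dictionary. The class below is a toy model written for this note, not TensorFlow's variable store: the full name is the scope prefix plus the variable name, creating an existing name fails, and reuse returns the existing object.

```python
class ToyVariableStore(object):
    """Toy model of get_variable: full name = scope prefix + variable name."""

    def __init__(self):
        self.variables = {}

    def get_variable(self, scope, name, reuse=False):
        full_name = scope + "/" + name if scope else name
        if reuse:
            if full_name not in self.variables:
                raise ValueError("Variable %s does not exist" % full_name)
            return self.variables[full_name]
        if full_name in self.variables:
            raise ValueError("Variable %s already exists" % full_name)
        self.variables[full_name] = object()   # stand-in for the real tensor
        return self.variables[full_name]


store = ToyVariableStore()
first = store.get_variable("image_filters", "conv1_weights")
shared = store.get_variable("image_filters", "conv1_weights", reuse=True)
try:
    store.get_variable("image_filters", "conv1_weights")   # no reuse: conflict
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(first is shared, duplicate_rejected)   # True True
```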

3. name_scope and variable_scope:

The difference between the two scopes:

Program 1:

with tf.name_scope("hello") as name_scope:
    arr1 = tf.get_variable("arr1", shape=[2,10], dtype=tf.float32)

    print(name_scope)
    print(arr1.name)
    print("scope_name:%s " % tf.get_variable_scope().original_name_scope)

Output:

hello/
arr1:0
scope_name:

We can see that:
- tf.name_scope() returns a string, "hello/"
- the name of a variable defined with get_variable inside a name_scope does not get the "hello/" prefix
- the original_name_scope of tf.get_variable_scope() is empty

Program 2:

with tf.variable_scope("hello") as variable_scope:
    arr1 = tf.get_variable("arr1", shape=[2, 10], dtype=tf.float32)

    print(variable_scope)
    print(variable_scope.name)  # prints the variable scope's name
    print(arr1.name)
    print(tf.get_variable_scope().original_name_scope)
    # tf.get_variable_scope() returns variable_scope here

    with tf.variable_scope("xixi") as v_scope2:
        print(tf.get_variable_scope().original_name_scope)
        # tf.get_variable_scope() returns v_scope2 here

Output:

<tensorflow.python.ops.variable_scope.VariableScope object at 0x7f82741e7910>
hello
hello/arr1:0
hello/
hello/xixi/

Program 3:

with tf.name_scope("name1"):
    with tf.variable_scope("var1"):
        w = tf.get_variable("w", shape=[2])
        res = tf.add(w, [3])
print(w.name)
print(res.name)
Output:
var1/w:0
name1/var1/Add:0

Comparing the three programs shows that:
- name_scope has no effect on the names of variables created by get_variable(), but ops created inside it do get the prefix.
- tf.get_variable_scope() returns only the variable_scope and ignores name_scope, so when we later call tf.get_variable_scope().reuse_variables() we can disregard name_scope.
- both variable scope and name scope add a prefix to op names.

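Other cases:

1)

with tf.name_scope("scope1") as scope1:
    with tf.name_scope("scope2") as scope2:
        print(scope2)
# Output: scope1/scope2/

2)

import tensorflow as tf
with tf.variable_scope("scope1") as scope1:
    with tf.variable_scope("scope2") as scope2:
        print(scope2.name)
# Output: scope1/scope2

In short:
1. With tf.Variable(), both tf.name_scope() and tf.variable_scope() add a prefix to the name of the Variable and of ops.
2. With tf.get_variable(), tf.name_scope() does not add a prefix to the Variable created by tf.get_variable().

What name_scope is good for:

A typical TensorFlow graph can have thousands of nodes, far too many to take in at once or even to lay out with standard graph tools. To keep things simple, we scope variable names, and the visualization uses this information to define a hierarchy over the nodes of the graph. By default only the top level of the hierarchy is shown. The following example uses tf.name_scope to define three ops under the hidden name scope: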

import tensorflow as tf

with tf.name_scope('hidden') as scope:
    a = tf.constant(5, name='alpha')
    W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0), name='weights')
    b = tf.Variable(tf.zeros([1]), name='biases')
print(a.name)
print(W.name)
print(b.name)

This yields the following three op names:
hidden/alpha
hidden/weights
hidden/biases
name_scope prefixes op names, and variable_scope prefixes variable names. One might ask: doesn't get_variable() also return an op? One way to look at it is that only the name printed by get_variable().name is a variable name.

tf.variable_scope sometimes also resolves naming conflicts:

import tensorflow as tf
def test(name=None):
    with tf.variable_scope(name, default_name="scope") as scope:
        w = tf.get_variable("w", shape=[2, 10])

test()
test()
ws = tf.trainable_variables()
for w in ws:
    print(w.name)

# scope/w:0
# scope_1/w:0
# As shown, naming conflicts are resolved only when the variable_scope
# is created through the default_name argument

3. Multiple GPUs:

How to set up the training system:

(1) Every GPU holds a replica of the model.
(2) The model parameters are updated synchronously.

Terminology:

The function that computes inference and gradients for a single replica is called a tower. tf.name_scope() is used to prefix the name of every op in a tower.
tf.device('/gpu:0') specifies the device on which a tower's ops run.
Skeleton:

with tf.Graph().as_default(), tf.device('/cpu:0'):
    # Create an optimizer that performs gradient descent.
    opt = tf.train.GradientDescentOptimizer(lr)
    tower_grads = []
    for i in range(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('%s_%d' % (TOWER_NAME, i)) as scope:
                # define your model here
                # ops, variables

                # loss function
                loss = yourloss
                # Reuse variables for the next tower.
                tf.get_variable_scope().reuse_variables()

                # Calculate the gradients for the batch of data on this tower.
                grads = opt.compute_gradients(loss)

                # Keep track of the gradients across all towers.
                tower_grads.append(grads)
    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)

    # Apply the gradients to adjust the shared variables.
    apply_gradient_op = opt.apply_gradients(grads)
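
The skeleton calls average_gradients without defining it. As an assumption about what such a helper does, the plain-Python sketch below averages, variable by variable, the gradients reported by each tower. Gradients are plain floats and variables are name strings here; a real version would operate on tensors.

```python
def average_gradients(tower_grads):
    """tower_grads: one list per tower of (gradient, variable_name) pairs."""
    averaged = []
    # zip(*...) groups the pairs that belong to the same variable across towers.
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        variable = grads_and_vars[0][1]   # the variable is shared across towers
        averaged.append((sum(grads) / len(grads), variable))
    return averaged


tower_grads = [
    [(1.0, "w"), (4.0, "b")],   # gradients computed on GPU 0
    [(3.0, "w"), (2.0, "b")],   # gradients computed on GPU 1
]
result = average_gradients(tower_grads)
print(result)  # [(2.0, 'w'), (3.0, 'b')]
```

Averaging is the synchronization point: no variable is updated until every tower has contributed its gradient.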

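4. gRPC (Google remote procedure call):

Distributed TensorFlow communicates over gRPC. The idea is this: suppose you run num = add(a, b) on your machine. It makes a procedure call and returns a value, num. It looks as if the code ran only locally, but in fact the local add method packs the arguments and sends them to a server, the server runs its own add method, and the result is packed and sent back to the client.

5. Cluster, Job and Task: cluster.Job.Task:

A Job is a set of Tasks.

A Cluster is a set of Jobs.

6. Why divide things into Cluster, Job and Task?
Start with Task: a Task is a process on a host. In most cases a machine runs only one Task.

Why is a Job a set of Tasks? In distributed deep learning frameworks, Jobs are usually divided into parameter (ps) Jobs and worker Jobs. The parameter Job stores and updates the parameters; the worker Job runs the ops. If there are too many parameters for one machine to handle, several Tasks are needed.

A Cluster is a set of Jobs: the Cluster is simply the cluster system we are using.

7. How to create a cluster

From the description above, the basic unit of a Cluster is the Task (dynamically, a process on a host; statically, the code we write). We only need to write the Task code and run it on different hosts; together they form the Cluster.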

8. How to write the Task code
First, a Task needs to know which hosts are in the cluster and which ports they listen on. tf.train.ClusterSpec() describes exactly this:

tf.train.ClusterSpec({
    "worker": [
        "worker_task0.example.com:2222",  # host running /job:worker/task:0
        "worker_task1.example.com:2222",  # host running /job:worker/task:1
        "worker_task2.example.com:2222"   # host running /job:worker/task:2
    ],
    "ps": [
        "ps_task0.example.com:2222",  # host running /job:ps/task:0
        "ps_task1.example.com:2222"   # host running /job:ps/task:1
    ]})

This ClusterSpec tells us that the Cluster has two Jobs (worker and ps), that worker has three Tasks (three Tasks executing TensorFlow ops), and that ps has two Tasks.
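
How list positions turn into task indices can be shown with an ordinary dictionary. The helper task_devices is hypothetical; it only derives the /job:NAME/task:INDEX strings from the same mapping that ClusterSpec receives.

```python
def task_devices(cluster):
    """Map each /job:NAME/task:INDEX device string to its host:port."""
    devices = {}
    for job_name in sorted(cluster):
        # A task's index is its position in the job's host list.
        for task_index, address in enumerate(cluster[job_name]):
            devices["/job:%s/task:%d" % (job_name, task_index)] = address
    return devices


cluster = {
    "worker": ["worker_task0.example.com:2222",
               "worker_task1.example.com:2222",
               "worker_task2.example.com:2222"],
    "ps": ["ps_task0.example.com:2222", "ps_task1.example.com:2222"],
}
devices = task_devices(cluster)
for name in sorted(devices):
    print(name, devices[name])
```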

9. Next, pass the ClusterSpec to tf.train.Server(), together with this Task's job_name and task_index.

# jobName and taskIndex are passed on the command line when the program is started
server = tf.train.Server(cluster, job_name=jobName, task_index=taskIndex)

10. The code below describes a cluster with a single Job called worker. The Job has two Tasks, which run on two hosts:

# On host 10.1.1.1, the code that actually runs is
cluster = tf.train.ClusterSpec({"worker": ["10.1.1.1:2222", "10.1.1.2:3333"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# On host 10.1.1.2, the code that actually runs is
cluster = tf.train.ClusterSpec({"worker": ["10.1.1.1:2222", "10.1.1.2:3333"]})
server = tf.train.Server(cluster, job_name="worker", task_index=1)

11. What does tf.train.Server do?
A tf.train.Server contains a set of local devices (GPUs, CPUs), connections to the ip:port of the other tasks (stored in cluster), and a session target for performing distributed computation. Most importantly, it creates a server that listens on its port; when data arrives, it executes locally (starting the session target and running the computation on local devices) and returns the result to the caller.

12. Continuing with the task code: specify distributed devices in your model

with tf.device("/job:ps/task:0"):
    weights_1 = tf.Variable(...)
    biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
    weights_2 = tf.Variable(...)
    biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:0"):  # mapped to host 10.1.1.1
    input, labels = ...
    layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
    logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
with tf.device("/job:worker/task:1"):  # mapped to host 10.1.1.2
    input, labels = ...
    layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
    logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
    # ...
    train_op = ...
with tf.Session("grpc://10.1.1.2:3333") as sess:  # run on host 10.1.1.2
    for _ in range(10000):
        sess.run(train_op)

Here, with tf.Session("grpc://..") designates grpc://.. as the master; the master distributes the ops to the corresponding tasks.

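13. When writing a distributed program, we need to decide the following:
(1) In-graph replication or between-graph replication

In-graph replication: a single client (the process that explicitly calls tf::Session) assigns the parameters and ops to the corresponding jobs. Data distribution is handled by that one client.

Between-graph replication: the code below takes this form. There are many independent clients, each of which builds the same graph (including the parameters, which tf.train.replica_device_setter maps onto the ps servers).

Another explanation:

TensorFlow likewise has two modes for launching distributed training of a deep learning model. One is in-graph replication: all parameters of the network are kept in a single TensorFlow graph, and only the computation is distributed to different compute servers. The other is between-graph replication: every compute server also creates the parameters, but the parameters are assigned to the parameter servers in a uniform way. Because in-graph replication is somewhat weaker at handling very large amounts of data, between-graph replication is the more commonly used mode.

(2) Synchronous or asynchronous training

Synchronous training: every replica of the graph reads the same parameter values, computes gradients in parallel, and then all the computed gradients are processed together. TensorFlow provides tf.train.SyncReplicasOptimizer for this in the between-graph case; with in-graph replication it is enough to average all the gradients.

Asynchronous training: each replica updates the parameters as soon as it has computed its own gradients; the replicas do not coordinate their progress.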

14. A complete example, from the official site:

import tensorflow as tf
# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS

Because the same code runs on different hosts, job_name and task_index are passed in to tell them apart. ps_hosts and worker_hosts are the same for every host; they describe the cluster.

def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")

    # Create a cluster from the parameter server and worker hosts.
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Create and start a server for the local task.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()

A server process shuts down once it finishes executing. To keep the ps server listening, we call server.join(), which blocks the process at that point. Why join as soon as the ps server is created? Because the code below hands the parameters to the ps server for safekeeping, so the ps server only has to sit and listen:

    elif FLAGS.job_name == "worker":

        # Assigns ops to the local worker by default.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):

Here, tf.train.replica_device_setter(ps_tasks=0, ps_device='/job:ps', worker_device='/job:worker', merge_devices=True, cluster=None, ps_ops=None) returns a value that tf.device can use to decide on which device the variables and ops in the code below are placed.

example:

# To build a cluster with two ps jobs on hosts ps0 and ps1, and 3 worker
# jobs on hosts worker0, worker1 and worker2.
cluster_spec = {
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222"]}
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    # Build your graph
    v1 = tf.Variable(...)  # assigned to /job:ps/task:0
    v2 = tf.Variable(...)  # assigned to /job:ps/task:1
    v3 = tf.Variable(...)  # assigned to /job:ps/task:0
# Run compute

This example does not set worker_device or ps_device; you can set them by hand.
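
The assignments in the comments (task 0, task 1, then task 0 again) follow a round-robin pattern over the ps tasks. A plain-Python sketch of that pattern, with a hypothetical helper name, looks like this:

```python
def round_robin_placement(variable_names, num_ps_tasks):
    """Assign variables to ps tasks in creation order, cycling through the tasks."""
    return dict((name, "/job:ps/task:%d" % (i % num_ps_tasks))
                for i, name in enumerate(variable_names))


placement = round_robin_placement(["v1", "v2", "v3"], num_ps_tasks=2)
for name in sorted(placement):
    print(name, placement[name])
```

With two ps tasks, the third variable wraps around to task 0, which matches the comments in the example above.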

Continuing with the annotated code, next comes the model definition:
            # Build model...variables and ops
            loss = ...
            global_step = tf.Variable(0)

            train_op = tf.train.AdagradOptimizer(0.01).minimize(
                loss, global_step=global_step)

            saver = tf.train.Saver()
            summary_op = tf.merge_all_summaries()
            init_op = tf.initialize_all_variables()

        # Create a "supervisor", which oversees the training process.
        sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                 logdir="/tmp/train_logs",
                                 init_op=init_op,
                                 summary_op=summary_op,
                                 saver=saver,
                                 global_step=global_step,
                                 save_model_secs=600)

        # The supervisor takes care of session initialization, restoring from
        # a checkpoint, and closing when done or an error occurs.
        with sv.managed_session(server.target) as sess:
            # Loop until the supervisor shuts down or 1000000 steps have completed.
            step = 0
            while not sv.should_stop() and step < 1000000:
                # Run a training step asynchronously.
                # See `tf.train.SyncReplicasOptimizer` for additional details on how to
                # perform *synchronous* training.
                _, step = sess.run([train_op, global_step])

        # Ask for all the services to stop.
        sv.stop()

Consider a between-graph scenario: we have one parameter server (holding the parameters) and several worker servers (each holding a replica of the same graph). More concretely, we have 10 machines, one acting as the parameter server and the other nine as worker servers. Because the same program runs on all 10 machines at once (with different job_name and task_index on each), every worker server holds a replica of the graph we built. A Supervisor helps us manage these processes. Its is_chief argument is important: it indicates which task performs parameter initialization. sv.managed_session(server.target) creates a session managed by sv.

if __name__ == "__main__":
    tf.app.run()

To start the trainer with two parameter servers and two workers, use the following command line (assuming the script is called trainer.py):

# On ps0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=1

As this shows, we only need to write one program and run it on the different hosts with different arguments.

15. Notes on distributed TensorFlow:

Version: tensorflow 0.11.0

Applies to: between-graph & synchronous

(1) Always designate a chief task.

(2) The chief task needs two additional ops:

init_token_op = opt.get_init_tokens_op()
chief_queue_runner = opt.get_chief_queue_runner()

(3) The chief task must run those two ops:

sv.start_queue_runners(sess, [chief_queue_runner])
sess.run(init_token_op)

(4) When creating sess with sv.prepare_or_wait_for_session, never use a with block
# wrong:
with sv.prepare_or_wait_for_session(server.target) as sess:
    ...

This leads to an error: only the chief task trains, while the other tasks keep printing start master session... The cause is unknown.

# right
sess = sv.prepare_or_wait_for_session(server.target)

(5) Always pass global_step (used for synchronization) to opt.minimize() or opt.apply_gradients().
(6) Always pass logdir (a shared folder) when creating sv. A simple way: pass log_dir = tempfile.mktemp()

16. Visualization in TensorFlow is done by summaries and TensorBoard working together.

17. First, note that a summary is also an op.

18. Writing out the network structure
with tf.Session() as sess:
    writer = tf.summary.FileWriter(your_dir, sess.graph)

Run tensorboard --logdir=your_dir on the command line, then open 127.0.1.1:6006 in a browser.
You can then see your network's graph in TensorBoard.

19. Visualizing parameters:

# ops
loss = ...
tf.summary.scalar("loss", loss)
merged_summary = tf.summary.merge_all()

init = tf.global_variables_initializer()
with tf.Session() as sess:
    writer = tf.summary.FileWriter(your_dir, sess.graph)
    sess.run(init)
    for i in range(100):
        _, summary = sess.run([train_op, merged_summary], feed_dict)
        writer.add_summary(summary, i)

Now open TensorBoard; under EVENTS you can see how loss changes with i. If nothing shows up, try adding writer.flush() at the end of the code; the reason is explained later.

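20. tf.summary.merge_all: merges all previously defined summary ops into one.

21. FileWriter: creates a file writer for writing summary data to disk.

22. tf.summary.scalar(summary_tags, Tensor/variable): a summary for scalars.

23. tf.summary.image(tag, tensor, max_images=3, collections=None, name=None): tensor must be 4-D with shape [batch_size, height, width, channels]; max_images caps the number of images in the summary (3 by default). This is handy for visualizing convolution kernels. max_images determines that the images generated are [-max_images:, height, width, channels]. This is also why the image summary shown in TensorBoard always comes from the last global step. (The signature shown is that of the older tf.image_summary; tf.summary.image names this argument max_outputs.)

24. tf.summary.histogram(tag, values, collections=None, name=None): values is a tensor of any shape; generates a histogram summary.

25. tf.summary.audio(tag, tensor, sample_rate, max_outputs=3, collections=None, name=None).

26. FileWriter
Note: add_summary only places event data in the FileWriter object's buffer. Writing to disk is controlled by the FileWriter object itself, as the constructor shows:

tf.summary.FileWriter.__init__(logdir, graph=None, max_queue=10, flush_secs=120, graph_def=None)
# Creates a FileWriter and an event file.
# max_queue: maximum number of events buffered before data is written to disk
# flush_secs: how often, in seconds, data is written to disk and the buffer is cleared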
27. If writer.add_summary(summary, global_step) is called without the global_step argument, the scalar summary becomes a straight line.

28. Any summary op on the graph is picked up by merge_all, regardless of variable scope.

29. If no summary data has been saved to disk after a single run, try file_writer.flush().

30. To organize the generated summaries hierarchically, wrap the summary in a name_scope
with tf.name_scope("summary_gradients"):
    tf.summary.histogram("name", gradients)

TensorBoard will then show a top-level summary_gradients group.

//2017/4/17

1. tf.train.Supervisor

Without a Supervisor, our code is usually organized like this:

variables
...
ops
...
summary_op
...
merge_all_summaries
saver
init_op

with tf.Session() as sess:
    writer = tf.train.SummaryWriter()
    sess.run(init_op)
    saver.restore()
    for ...:
        train
        merged_summary = sess.run(merge_all_summaries)
        writer.add_summary(merged_summary, i)
    saver.save

The following shows how to rewrite the program above with a Supervisor:

import tensorflow as tf
a = tf.Variable(1)
b = tf.Variable(2)
c = tf.add(a, b)
update = tf.assign(a, c)
tf.scalar_summary("a", a)
init_op = tf.initialize_all_variables()
merged_summary_op = tf.merge_all_summaries()
sv = tf.train.Supervisor(logdir="/home/keith/tmp/", init_op=init_op)  # logdir stores checkpoints and summaries
saver = sv.saver  # use the supervisor's saver
with sv.managed_session() as sess:  # looks for a checkpoint in logdir automatically; runs initialization if none is found
    for i in range(1000):
        update_ = sess.run(update)
        print(update_)
        if i % 10 == 0:
            merged_summary = sess.run(merged_summary_op)
            sv.summary_computed(sess, merged_summary, global_step=i)
        if i % 100 == 0:
            saver.save(sess, sv.save_path, global_step=i)

Summary

The code above shows that the Supervisor takes care of several things for us:
(1) it automatically loads data from a checkpoint or initializes it
(2) it has its own Saver, which can be used to save checkpoints
(3) it has summary_computed for saving summaries
So we no longer need to:
(1) initialize by hand or load data from a checkpoint
(2) create a Saver; the one inside sv can be used
(3) create a summary writer

2. The difference between tf.Variable and tf.get_variable():

With tf.Variable, the system resolves a naming conflict by itself. With tf.get_variable(), the system does not resolve the conflict and raises an error instead:

import tensorflow as tf
w_1 = tf.Variable(3, name="w_1")
w_2 = tf.Variable(1, name="w_1")
print(w_1.name)
print(w_2.name)
# Output
# w_1:0
# w_1_1:0

import tensorflow as tf

w_1 = tf.get_variable(name="w_1", initializer=1)
w_2 = tf.get_variable(name="w_1", initializer=2)
# Error message
# ValueError: Variable w_1 already exists, disallowed. Did
# you mean to set reuse=True in VarScope?

Given these behaviors, use tf.get_variable() whenever variables need to be shared. In all other cases the two are used in the same way.

3、tf.ConfigProto一般用在创建session的时候。用来对session进行参数配置：

with tf.Session(config = tf.ConfigProto(...),...)：

#tf.ConfigProto()的参数
log_device_placement=True : 是否打印设备分配日志
allow_soft_placement=True ：如果你指定的设备不存在，允许TF自动分配设备
tf.ConfigProto(log_device_placement=True,allow_soft_placement=True)

4. Controlling GPU memory usage:

# allow_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
# With the allow_growth option, only a small amount of GPU memory is allocated at first and
# more is added on demand. Memory is never released, so this can lead to fragmentation.

# per_process_gpu_memory_fraction
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
config = tf.ConfigProto(gpu_options=gpu_options)
session = tf.Session(config=config, ...)
# Sets the fraction of each GPU's memory that the process may use; 0.7 means 70%

5. Controlling which GPUs are used:

~/ CUDA_VISIBLE_DEVICES=0 python your.py  # use GPU 0
~/ CUDA_VISIBLE_DEVICES=0,1 python your.py  # use GPUs 0 and 1
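
The same restriction can also be applied from inside a script by setting the environment variable before TensorFlow is imported. This is a minimal sketch; the device list "0,1" is just an example value.

```python
import os

# Must run before `import tensorflow`; once the GPU runtime has been
# initialized, changing the variable no longer has any effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(visible)  # ['0', '1']
```

Inside the process the visible devices are renumbered from 0, so with `CUDA_VISIBLE_DEVICES=1` the only visible GPU is addressed as `/gpu:0`.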

6. TensorFlow collections provide a global storage mechanism that is not affected by variable name scopes. Store a value in one place and retrieve it anywhere:

# Store data in a collection
tf.Graph.add_to_collection(name, value)

#Stores value in the collection with the given name.
#Note that collections are not sets, so it is possible to add a value to a collection
#several times.
# Note that many values can be stored under one name. After add_to_collection("haha", [a,b]),
# tf.get_collection("haha") returns [[a,b]], not [a,b]
tf.add_to_collection(name, value)
# Functionally the same as the method above, except that this function operates on the default graph

# Retrieve data from a collection
tf.Graph.get_collection(name, scope=None)
# Returns a list of values in the collection with the given name.
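
The list-not-set behavior described in the comments can be imitated in a few lines of plain Python. The module-level `_collections` dict and both functions below are hypothetical stand-ins for a graph's collections, not TensorFlow code.

```python
from collections import defaultdict

# Hypothetical stand-in for a graph's collections: name -> list of values.
_collections = defaultdict(list)

def add_to_collection(name, value):
    # Appends; the same value may be stored several times.
    _collections[name].append(value)

def get_collection(name):
    # Returns a copy of the list, or [] when nothing was stored under name.
    return list(_collections.get(name, []))

a, b = "a", "b"
add_to_collection("haha", [a, b])  # stores ONE item, which is itself a list
add_to_collection("hehe", a)
add_to_collection("hehe", b)
add_to_collection("hehe", a)       # duplicates are kept

print(get_collection("haha"))  # [['a', 'b']]
print(get_collection("hehe"))  # ['a', 'b', 'a']
```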

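7. An error caused by merge_all:

1) When training deep neural networks we often use Dropout, but at test time dropout has to be turned off. To handle this, we usually build two models and let them share variables.

2) To visualize our data in TensorBoard we often use summaries, and in the end we manage them all with a single merge_all call.

Incorrect example:

When situations 1) and 2) meet, a bug appears. See the code: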

import tensorflow as tf
import numpy as np

class Model(object):
    def __init__(self):
        self.graph()
        self.merged_summary = tf.summary.merge_all()  # this is where the trouble starts

    def graph(self):
        self.x = tf.placeholder(dtype=tf.float32,shape=[None,1])
        self.label = tf.placeholder(dtype=tf.float32, shape=[None,1])
        w = tf.get_variable("w",shape=[1,1])
        self.predict = tf.matmul(self.x,w)
        self.loss = tf.reduce_mean(tf.reduce_sum(tf.square(self.label-self.predict),axis=1))
        self.train_op = tf.train.GradientDescentOptimizer(0.01).minimize(self.loss)
        tf.summary.scalar("loss",self.loss)

def run_epoch(session, model):
    x = np.random.rand(1000).reshape(-1,1)
    label = x*3
    feed_dic = {model.x.name:x, model.label:label}
    su = session.run([model.merged_summary], feed_dic)

def main():
    with tf.Graph().as_default():
        with tf.name_scope("train"):
            with tf.variable_scope("var1",dtype=tf.float32):
                model1 = Model()
        with tf.name_scope("test"):
            with tf.variable_scope("var1",reuse=True,dtype=tf.float32):
                model2 = Model()
        with tf.Session() as sess:
            tf.global_variables_initializer().run()
            run_epoch(sess,model1)
            run_epoch(sess,model2)

if __name__ == "__main__":
    main()

What happens at runtime: run_epoch(sess,model1) executes without error, but as soon as run_epoch(sess,model2) is executed, an error is raised.

Error message:

error
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'train/var1/Placeholder' with dtype float
[Node: train/var1/Placeholder = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]]

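Cause of the error:

class Model(object):
    def __init__(self):
        self.graph()
        self.merged_summary = tf.summary.merge_all()  # this is where the trouble starts
...
with tf.name_scope("train"):
    with tf.variable_scope("var1",dtype=tf.float32):
        model1 = Model()  # merge_all here only covers this model's own summaries
with tf.name_scope("test"):
    with tf.variable_scope("var1",reuse=True,dtype=tf.float32):
        model2 = Model()  # merge_all here covers this model's summaries and those of the model above

Computing a summary requires feeding the placeholders it depends on. model2.merged_summary also contains model1's summary, but run_epoch only feeds model2's placeholders, so the error is raised.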

Solution:

Replacing merge_all is enough to fix this. See the code:

class Model(object):
    def __init__(self, scope):
        self.graph()
        self.merged_summary = tf.summary.merge(
            tf.get_collection(tf.GraphKeys.SUMMARIES,scope)
        )
...
with tf.Graph().as_default():
    with tf.name_scope("train") as train_scope:
        with tf.variable_scope("var1",dtype=tf.float32):
            model1 = Model(train_scope)
    with tf.name_scope("test") as test_scope:
        with tf.variable_scope("var1",reuse=True,dtype=tf.float32):
            model2 = Model(test_scope)

Takeaway: when there are several models and an error like this appears, check whether the function you are calling also touches the other models.
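
The difference between collecting every summary and collecting only those under one scope can be sketched in plain Python. The summary names below are illustrative guesses at what the two models would register, and `get_collection` here is a hypothetical imitation that assumes filtering by name prefix.

```python
import re

# Illustrative summary op names, as they might look after building both models.
summaries = ["train/var1/loss", "test/var1/loss"]

def get_collection(items, scope=None):
    # Keep everything when scope is None, otherwise only names that start
    # with the scope (re.match anchors at the beginning of the string).
    if scope is None:
        return list(items)
    return [name for name in items if re.match(scope, name)]

merge_all_inputs = get_collection(summaries)           # what merge_all collects
test_only_inputs = get_collection(summaries, "test/")  # what the scoped version collects
print(merge_all_inputs)  # ['train/var1/loss', 'test/var1/loss']
print(test_only_inputs)  # ['test/var1/loss']
```

With the unscoped list, the test model's merged summary drags in `train/var1/loss`, whose placeholders are never fed at test time. The scoped list contains only the test model's own summary.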

8. TensorFlow has a function for computing gradients, tf.gradients(ys, xs). Note that every x in xs must be related to ys. For an unrelated x, tf.gradients returns None, and fetching that None with sess.run raises an error:

The code defines two variables, w1 and w2, but res depends only on w1

# wrong
import tensorflow as tf
w1 = tf.Variable([[1,2]])
w2 = tf.Variable([[3,4]])
res = tf.matmul(w1, [[2],[1]])
grads = tf.gradients(res,[w1,w2])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    re = sess.run(grads)
    print(re)

Error message
TypeError: Fetch argument None has invalid type

# right
import tensorflow as tf
w1 = tf.Variable([[1,2]])
w2 = tf.Variable([[3,4]])
res = tf.matmul(w1, [[2],[1]])
grads = tf.gradients(res,[w1])
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    re = sess.run(grads)
    print(re)
# [array([[2, 1]], dtype=int32)]
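
When the list of variables is not known in advance, one possible workaround is to drop the None entries before fetching. The helper below is a hypothetical sketch in plain Python; the strings stand in for the tensors and variables of the example above.

```python
def drop_missing_gradients(grads, variables):
    """Keep only the (gradient, variable) pairs whose gradient is not None."""
    return [(g, v) for g, v in zip(grads, variables) if g is not None]

# Stand-in values: in the "wrong" example, tf.gradients(res, [w1, w2])
# yields a tensor for w1 and None for w2.
grads = ["grad_w1", None]
variables = ["w1", "w2"]

pairs = drop_missing_gradients(grads, variables)
fetches = [g for g, _ in pairs]  # contains no None, so it is safe to fetch
print(pairs)    # [('grad_w1', 'w1')]
print(fetches)  # ['grad_w1']
```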

9. tf.stop_gradient() blocks the gradient from backpropagating through a node:

import tensorflow as tf

w1 = tf.Variable(2.0)
w2 = tf.Variable(2.0)

a = tf.multiply(w1, 3.0)
a_stoped = tf.stop_gradient(a)

# b=w1*3.0*w2
b = tf.multiply(a_stoped, w2)
gradients = tf.gradients(b, xs=[w1, w2])
print(gradients)
# Output
#[None, <tf.Tensor 'gradients/Mul_1_grad/Reshape_1:0' shape=() dtype=float32>]

As shown, once a node is stopped, the gradient at that node can no longer propagate further back. Since the gradient for w1 can only arrive through node a, the gradient computed for it is None.

a = tf.Variable(1.0)
b = tf.Variable(1.0)
c = tf.add(a, b)
c_stoped = tf.stop_gradient(c)
d = tf.add(a, b)
e = tf.add(c_stoped, d)
gradients = tf.gradients(e, xs=[a, b])
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(gradients))
# Output: [1.0, 1.0]

Although node c is stopped, a and b still receive gradients flowing back from d, so gradient values are still produced.
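
Both results can be reproduced with a toy scalar autodiff written in plain Python. Everything in this sketch (`Node`, `add`, `multiply`, `stop_gradient`, `gradients`) is a hypothetical re-implementation for illustration only, and the assumption is simply that a stopped node keeps its forward value but has no edge back to its input.

```python
class Node:
    """Scalar value in a toy computation graph (not TensorFlow)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_gradient)

def add(x, y):
    return Node(x.value + y.value, ((x, 1.0), (y, 1.0)))

def multiply(x, y):
    return Node(x.value * y.value, ((x, y.value), (y, x.value)))

def stop_gradient(x):
    # Same forward value, but no edge back to x.
    return Node(x.value)

def gradients(y, xs):
    grads = {}
    def backprop(node, upstream):
        # Sum the contributions of every path that reaches this node.
        grads[id(node)] = grads.get(id(node), 0.0) + upstream
        for parent, local in node.parents:
            backprop(parent, upstream * local)
    backprop(y, 1.0)
    # Nodes that were never reached have no gradient at all: None.
    return [grads.get(id(x)) for x in xs]

# First example: b = stop_gradient(w1 * 3.0) * w2
w1, w2 = Node(2.0), Node(2.0)
b = multiply(stop_gradient(multiply(w1, Node(3.0))), w2)
first = gradients(b, [w1, w2])
print(first)  # [None, 6.0]

# Second example: e = stop_gradient(a + b) + (a + b)
a, b2 = Node(1.0), Node(1.0)
e = add(stop_gradient(add(a, b2)), add(a, b2))
second = gradients(e, [a, b2])
print(second)  # [1.0, 1.0]
```

In the first case no path leads from b back to w1, so it is never visited and its gradient is None. In the second case the path through d still reaches a and b.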

import tensorflow as tf
w1 = tf.Variable(2.0)
w2 = tf.Variable(2.0)
a = tf.multiply(w1, 3.0)
a_stoped = tf.stop_gradient(a)
# b=w1*3.0*w2
b = tf.multiply(a_stoped, w2)
opt = tf.train.GradientDescentOptimizer(0.1)
gradients = tf.gradients(b, xs=tf.trainable_variables())
tf.summary.histogram(gradients[0].name, gradients[0])  # this line raises an error because gradients[0] is None
# Everything else runs normally, both the gradient computation and the variable update.
# This design feels a bit awkward to me; it might be nicer if a blocked gradient were 0 instead.
train_op = opt.apply_gradients(zip(gradients, tf.trainable_variables()))
print(gradients)
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(train_op))
    print(sess.run([w1, w2]))

10. Building multi-GPU code:

How to implement the multi_gpu_model function:

def multi_gpu_model(num_gpus=1):
    grads = []
    for i in range(num_gpus):
        with tf.device("/gpu:%d"%i):
            with tf.name_scope("tower_%d"%i):
                model = Model(is_training, config, scope)
                # Put the model in a collection so it is easy to fetch when feeding
                tf.add_to_collection("train_model", model)
                grads.append(model.grad)  # grad is computed with tf.gradients(loss, vars)
                # The add_to_collection calls below could also be done inside the model.
                # Put loss in a collection for later use
                tf.add_to_collection("loss",model.loss)
                # Put predict in a collection for later use
                tf.add_to_collection("predict", model.predict)
                # Put the summary.merge op in a collection for later use
                tf.add_to_collection("merge_summary", model.merge_summary)
    # ...
    with tf.device("/cpu:0"):
        averaged_gradients = average_gradients(grads)  # average_gradients is explained later
        opt = tf.train.GradientDescentOptimizer(learning_rate)
        train_op = opt.apply_gradients(zip(averaged_gradients,tf.trainable_variables()))
    return train_op

How to feed data:

def generate_feed_dic(model, feed_dict, batch_generator):
    x, y = batch_generator.next_batch()
    feed_dict[model.x] = x
    feed_dict[model.y] = y

How to implement run_epoch:

# scope is used here to distinguish train from test
def run_epoch(session, data_set, scope, train_op=None, is_training=True):
    batch_generator = BatchGenerator(data_set, batch_size)
    ...
    ...
    if is_training and train_op is not None:
        models = tf.get_collection("train_model")
        # Build feed_dict
        feed_dic = {}
        for model in models:
            generate_feed_dic(model, feed_dic, batch_generator)
        # Build fetch_dict
        losses = tf.get_collection("loss", scope)  # ensures the train loss is not fetched at test time
    ...
    ...

The main function does the following:
1. Data processing
2. Build the multi-GPU training model
3. Build the single/multi-GPU test model
4. Create the Saver object and the FileWriter object
5. Create the session
6. run_epoch:

data_process()
with tf.name_scope("train") as train_scope:
    train_op = multi_gpu_model(..)
with tf.name_scope("test") as test_scope:
    model = Model(...)
saver = tf.train.Saver()
# Graph construction is done; start running the computation
with tf.Session() as sess:
    writer = tf.summary.FileWriter(...)
    ...
    run_epoch(...,train_scope)
    run_epoch(...,test_scope)

How to write the average_gradients function:

def average_gradients(grads):  # grads: [[grad0, grad1,..], [grad0,grad1,..]..]
    averaged_grads = []
    for grads_per_var in zip(*grads):
        # grads_per_var holds one variable's gradient from every tower
        expanded_grads = []
        for grad in grads_per_var:
            expanded_grad = tf.expand_dims(grad, 0)
            expanded_grads.append(expanded_grad)
        stacked = tf.concat(expanded_grads, 0)  # TF 1.x argument order: (values, axis)
        averaged_grads.append(tf.reduce_mean(stacked, 0))
    return averaged_grads
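
The regrouping done by `zip(*grads)` is easier to see with plain numbers. This is a pure-Python sketch of the same averaging, assuming scalar gradients; the values are made up for illustration.

```python
def average_gradients(tower_grads):
    """tower_grads: [[g0, g1, ...] for tower 0, [g0, g1, ...] for tower 1, ...]."""
    averaged = []
    # zip(*tower_grads) regroups the values so that each tuple holds one
    # variable's gradient from every tower.
    for grads_per_var in zip(*tower_grads):
        averaged.append(sum(grads_per_var) / float(len(grads_per_var)))
    return averaged

tower_grads = [[1.0, 10.0],  # tower 0: gradients for var0, var1
               [3.0, 20.0]]  # tower 1: gradients for var0, var1
print(average_gradients(tower_grads))  # [2.0, 15.0]
```

The result has one entry per variable, in the same order as the per-tower lists, which is what allows it to be zipped with the list of trainable variables afterwards.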

11. The Saver object in TensorFlow is used to save and restore parameters. How to use it:

v1 = tf.Variable(..., name='v1')
v2 = tf.Variable(..., name='v2')

# Pass the variables as a dict:
saver = tf.train.Saver({'v1': v1, 'v2': v2})

# Or pass them as a list.
saver = tf.train.Saver([v1, v2])
# Passing a list is equivalent to passing a dict with the variable op names
# as keys:
saver = tf.train.Saver({v.op.name: v for v in [v1, v2]})

The saver object is created in three different ways here, but the underlying principle is the same. Parameters are saved to the checkpoint file as key-value pairs. If a dict is passed to the Saver constructor, then on save the checkpoint file stores the corresponding key-value pairs, as follows:

import tensorflow as tf
# Create some variables.
v1 = tf.Variable(1.0, name="v1")
v2 = tf.Variable(2.0, name="v2")

saver = tf.train.Saver({"variable_1":v1, "variable_2": v2})
# Use the saver object normally after that.
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    saver.save(sess, 'test-ckpt/model-2')

Let's use the official tool to look at what the checkpoint contains:

from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file

print_tensors_in_checkpoint_file("test-ckpt/model-2", None, True)
# Output:
#tensor_name: variable_1
#1.0
#tensor_name: variable_2
#2.0

If a list is passed when constructing the saver object, the variable.op.name of each Variable is used as the key:

import tensorflow as tf
# Create some variables.
v1 = tf.Variable(1.0, name="v1")
v2 = tf.Variable(2.0, name="v2")

saver = tf.train.Saver([v1, v2])
# Use the saver object normally after that.
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    saver.save(sess, 'test-ckpt/model-2')

Printing the checkpoint contents with the official tool again gives:

tensor_name: v1
1.0
tensor_name: v2
2.0

Now suppose we want to restore the value saved from v2 into v1, and the value saved from v1 into v2. How do we do that?
In this case we have to use a dict-based saver:

The save code is the dict-based version shown above. The restore code follows, and it differs slightly from the save code:

import tensorflow as tf
# Create some variables.
v1 = tf.Variable(1.0, name="v1")
v2 = tf.Variable(2.0, name="v2")
# On restore, variable_1 maps to v2 and variable_2 maps to v1, which achieves the goal.
saver = tf.train.Saver({"variable_1":v2, "variable_2": v1})
# Use the saver object normally after that.
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    saver.restore(sess, 'test-ckpt/model-2')
    print(sess.run(v1), sess.run(v2))
# The output is 2.0 1.0, as expected

As we can see, the key-value pairs used when creating the saver object simply express a mapping:
- On save: which key in the checkpoint file a variable's value should be stored under
- On restore: which variable the value stored under a key in the checkpoint file should be restored to
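
The mapping idea can be sketched without TensorFlow by treating a checkpoint as a plain dict. `save` and `restore` below are hypothetical helpers written for this illustration, and each "variable" is modeled as a one-element list so that it can be updated in place.

```python
def save(mapping):
    """mapping: checkpoint key -> variable (a one-element list holding the value)."""
    return {key: var[0] for key, var in mapping.items()}

def restore(checkpoint, mapping):
    # Copy the value stored under each key into the variable mapped to that key.
    for key, var in mapping.items():
        var[0] = checkpoint[key]

v1, v2 = [1.0], [2.0]
checkpoint = save({"variable_1": v1, "variable_2": v2})
print(checkpoint)  # {'variable_1': 1.0, 'variable_2': 2.0}

# Fresh variables; the keys are deliberately crossed over on restore.
v1, v2 = [0.0], [0.0]
restore(checkpoint, {"variable_1": v2, "variable_2": v1})
print(v1[0], v2[0])  # 2.0 1.0
```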

12. tf.cond(pred, fn1, fn2, name=None)

Equivalent to: res = fn1() if pred else fn2()

Note: pred must not be a Python bool; pred is a scalar Tensor

Example from the official documentation:

z = tf.multiply(a, b)
result = tf.cond(x < y, lambda: tf.add(x, z), lambda: tf.square(y))

13. tf.case(pred_fn_pairs, default, exclusive=False, name='case'):

pred_fn_pairs: both of the following forms are valid
1. [(pred_1, fn_1), (pred_2, fn_2)]
2. {pred_1:fn_1, pred_2:fn_2}
tf.case() is equivalent to:

if pred_1:
    return fn_1()
elif pred_2:
    return fn_2()
else:
    return default()
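
The if/elif/else chain above can be written as a small eager function in plain Python. This `case` is a hypothetical imitation that works on Python bools rather than tensors; the `exclusive` branch assumes the documented meaning that more than one true predicate is an error.

```python
def case(pred_fn_pairs, default, exclusive=False):
    """Eager, pure-Python imitation of the if/elif/else chain."""
    if exclusive:
        true_count = sum(1 for pred, _ in pred_fn_pairs if pred)
        if true_count > 1:
            raise ValueError("more than one predicate is True")
    # The first pair whose predicate holds wins; later pairs are ignored.
    for pred, fn in pred_fn_pairs:
        if pred:
            return fn()
    return default()

x, y = 1, 2
result = case([(x > y, lambda: "greater"), (x < y, lambda: "less")],
              default=lambda: "equal")
print(result)  # less
```

Only the selected branch function is called, which is why the branches are passed as callables rather than as already computed values.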

14. tf.group() and tf.tuple():

When we have many tensors or ops that we want to run together, these two functions are a great help.

w = tf.Variable(1)
mul = tf.multiply(w, 2)
add = tf.add(w, 2)
group = tf.group(mul, add)
tuple = tf.tuple([mul, add])
# Both sess.run(group) and sess.run(tuple) evaluate Tensor(add) and
# Tensor(mul). The difference is that tf.group() returns an op, while
# tf.tuple() returns a list of tensors.
# As a result, sess.run(tuple) returns the values of Tensor(mul) and Tensor(add),
# whereas sess.run(group) does not.

15. Learning rate decay

When training a neural network, we usually start with a relatively large learning rate and gradually reduce it as training progresses. TensorFlow provides an API for this common strategy, which makes it easier to apply while training a network.

Interface
tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)
Parameters:
learning_rate : the initial learning rate
global_step : the global step; together with decay_steps and decay_rate it determines how the learning rate changes.

Update formula:

decayed_learning_rate = learning_rate *
                        decay_rate ^ (global_step / decay_steps)

Usage:

import tensorflow as tf
global_step = tf.Variable(0, trainable=False)
initial_learning_rate = 0.1  # initial learning rate
learning_rate = tf.train.exponential_decay(initial_learning_rate,
                                           global_step=global_step,
                                           decay_steps=10,decay_rate=0.9)
opt = tf.train.GradientDescentOptimizer(learning_rate)

add_global = global_step.assign_add(1)
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(learning_rate))
    for i in range(10):
        _, rate = sess.run([add_global, learning_rate])
        print(rate)
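
The update formula can be checked by hand with a plain-Python version. This sketch is not the TensorFlow implementation; it only evaluates the formula above, and it assumes that `staircase=True` means the exponent uses integer division, so the rate drops in discrete steps.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    if staircase:
        exponent = global_step // decay_steps  # integer division: stepwise decay
    else:
        exponent = global_step / float(decay_steps)
    return learning_rate * decay_rate ** exponent

# Same settings as the usage example: initial rate 0.1, decay_steps=10, decay_rate=0.9
for step in (0, 5, 10, 20):
    smooth = exponential_decay(0.1, step, decay_steps=10, decay_rate=0.9)
    stepped = exponential_decay(0.1, step, decay_steps=10, decay_rate=0.9,
                                staircase=True)
    print(step, smooth, stepped)
```

After exactly decay_steps steps the rate has been multiplied by decay_rate once, so step 10 gives roughly 0.09 in both modes, while step 5 differs between them.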

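16. Using tf.sparse_to_dense (http://blog.youkuaiyun.com/mao_xiao_feng/article/details/53365889):

tf.sparse_to_dense(sparse_indices, output_shape, sparse_values, default_value, name=None)
Apart from name, which sets the name of the op, there are four parameters that matter:
First parameter, sparse_indices: the indices of the individual elements to set in the matrix.
There are three cases:
sparse_indices is a scalar: it can only select a single element of a 1-D output
sparse_indices is a vector: it can select multiple elements of a 1-D output
sparse_indices is a matrix: it can select multiple elements of a 2-D output
Second parameter, output_shape: the shape of the dense output matrix
Third parameter, sparse_values: the values of the individual elements.
There are two cases:
sparse_values is a scalar: every indexed position is set to this value
sparse_values is a vector: each entry is the value for the corresponding index (so the length of the vector must match the number of indices, otherwise an error is raised)
Fourth parameter, default_value: the value of the unspecified elements, usually 0 for a sparse matrix

Example:

Suppose a batch has 6 samples whose labels are 0, 2, 3, 6, 7, 9:

BATCHSIZE=6

label=tf.expand_dims(tf.constant([0,2,3,6,7,9]),1)

Build an index giving the position of each sample within the batch:

index=tf.expand_dims(tf.range(0, BATCHSIZE),1)

Then concatenate the two matrices. The result looks like this:

concated = tf.concat([index, label], 1)  # TF 1.x argument order: (values, axis)

[[0 0]
[1 2]
[2 3]
[3 6]
[4 7]
[5 9]]

Finally, call tf.sparse_to_dense to produce a one-hot label matrix with BATCHSIZE rows and 10 columns, where the selected elements are 1.0 and all others are 0.0:

onehot_labels = tf.sparse_to_dense(concated, tf.stack([BATCHSIZE,10]), 1.0, 0.0)  # tf.stack was named tf.pack before TF 1.0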
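
To see what the final call produces, here is a pure-Python sketch of the same one-hot construction. This `sparse_to_dense` is a hypothetical, simplified imitation that only handles the 2-D case with a scalar value, which is all the example needs.

```python
def sparse_to_dense(sparse_indices, output_shape, sparse_value, default_value=0.0):
    """2-D only: sparse_indices is a list of [row, col]; sparse_value is a scalar."""
    rows, cols = output_shape
    dense = [[default_value] * cols for _ in range(rows)]
    for row, col in sparse_indices:
        dense[row][col] = sparse_value
    return dense

labels = [0, 2, 3, 6, 7, 9]
batch_size = len(labels)
# Pair each sample's position in the batch with its label: [[0, 0], [1, 2], ...]
concated = [[i, label] for i, label in enumerate(labels)]
onehot_labels = sparse_to_dense(concated, [batch_size, 10], 1.0, 0.0)
print(onehot_labels[1])  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Each row of `concated` names one (row, column) cell to set, so every output row ends up with a single 1.0 at the column given by that sample's label.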