TensorFlow Seq2Seq Model Notes

Latest recommended article published on 2025-06-02 09:06:18

Original

Licensed under CC 4.0 BY-SA

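0. TensorFlow ran without ever using the GPU...

Embarrassingly, I noticed at runtime that the GPU was idle while the CPU was maxed out. It turned out I had installed the wrong package: I should have installed tensorflow-gpu.

Code to check whether the GPU is being used: https://stackoverflow.com/questions/38009682/how-to-tell-if-tensorflow-is-using-gpu-acceleration-from-inside-python-shell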
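
A minimal sketch of such a check, assuming TensorFlow's device_lib module is importable in the environment. The helper name list_gpu_devices is my own, and the ImportError fallback is only there so the snippet also runs where TensorFlow is not installed:

```python
def list_gpu_devices():
    """Return the names of the GPU devices TensorFlow can see, or [] if none."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        # TensorFlow is not installed here, so no device can be reported.
        return []
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = list_gpu_devices()
    print("GPUs visible to TensorFlow:", gpus if gpus else "none")
```

An empty list with the CPU fully loaded would match the symptom described above.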

1. Confusion about tf.app.run()

http://stackoverflow.com/questions/33703624/how-does-tf-app-run-work

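tf.app is similar to argparse in Python: flags are declared through tf.app.flags, and tf.app.run() parses them before calling main().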
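
To make the argparse comparison concrete, here is a rough sketch of the same pattern in plain argparse. It is not TensorFlow's implementation: the run() helper and the two flags are hypothetical, main() receives the parsed flags explicitly (TensorFlow exposes them through a global FLAGS object instead), and the sketch returns main()'s result where tf.app.run() would exit the process.

```python
import argparse
import sys


def run(main, argv=None):
    """Parse the known flags, then hand the leftover arguments to main()."""
    argv = sys.argv if argv is None else argv
    parser = argparse.ArgumentParser()
    # Hypothetical flags, for illustration only.
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--data_dir", default="/tmp")
    flags, unparsed = parser.parse_known_args(argv[1:])
    # tf.app.run() would wrap this call in sys.exit(); the sketch returns instead.
    return main(flags, [argv[0]] + unparsed)


def main(flags, argv):
    return "batch_size=%d data_dir=%s extra=%s" % (
        flags.batch_size, flags.data_dir, argv[1:])


result = run(main, ["prog", "--batch_size", "50", "foo"])
print(result)
```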

2. variable scope 和 name scope

Variable Scope mechanism: https://www.tensorflow.org/programmers_guide/variable_scope

http://stackoverflow.com/questions/35919020/whats-the-difference-of-name-scope-and-a-variable-scope-in-tensorflow

Key point: Name scopes can be opened in addition to a variable scope, and then they will only affect the names of the ops, but not of variables.

with tf.variable_scope("foo"):
    with tf.name_scope("bar"):
        v = tf.get_variable("v", [1])
        x = 1.0 + v
assert v.name == "foo/v:0"
assert x.op.name == "foo/bar/add"

The difference between scope.original_name_scope and scope.name

http://stackoverflow.com/questions/41756054/tensorflow-variablescope-original-name-scope-vs-name

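3. Differences between Python 2 and Python 3
While modifying data_utils.py (https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/data_utils.py):
I overlooked the version differences: Python 3 no longer has the print statement; it is replaced by the print() function.
Also, when running under Python 2: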
with gfile.GFile(data_path, mode="rb") as f:
    counter = 0
    for line in f:

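Here line contains '\n'. After splitting, the '\n' stays attached to the last word and is read into the list with it. As a result, the list written to the file and len(list) disagree in size.
For example, write li=['a', 'b\n'] to a file: the '\n' adds an extra line break, so the file ends up with 3 lines.
On the next line-by-line read, the empty string ('') is counted as a new word.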
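
The effect can be reproduced without TensorFlow. The sketch below uses an in-memory buffer in place of a real file; write_vocab and read_vocab are hypothetical helpers, not functions from data_utils.py. It assumes the line is split on an explicit space, since split() with no argument would already drop the newline.

```python
import io


def write_vocab(words):
    """Write one word per line and return the resulting file content."""
    buf = io.StringIO()
    for w in words:
        buf.write(w + "\n")
    return buf.getvalue()


def read_vocab(text):
    """Read the content back line by line, stripping the line terminator."""
    return [line.rstrip("\n") for line in io.StringIO(text)]


line = "a b\n"
raw = line.split(" ")            # the newline stays glued to the last word
clean = line.strip().split(" ")  # strip first, then split

raw_roundtrip = read_vocab(write_vocab(raw))
clean_roundtrip = read_vocab(write_vocab(clean))

print(raw, "->", raw_roundtrip)
print(clean, "->", clean_roundtrip)
```

Stripping the line before splitting keeps the number of words written equal to the number read back.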

4. The TensorFlow Saver class https://www.tensorflow.org/api_docs/python/tf/train/Saver

http://blog.youkuaiyun.com/u011500062/article/details/51728830

5. Saving the Seq2Seq model

The saved model consists of these files:

translate.ckpt-16.data-00000-of-00001

translate.ckpt-16.index

translate.ckpt-16.meta

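For what each file contains (.data holds the variable values, .index maps variable names to their location in the data files, .meta stores the graph definition), see: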

https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Y4mzbDAUSec

http://stackoverflow.com/questions/36195454/what-is-the-tensorflow-checkpoint-meta-file

6. RNN examples

https://uqer.io/community/share/58a9332bf1973300597ae209

http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html

7. List of tensor to tensor

http://stackoverflow.com/questions/35730161/how-to-convert-a-list-of-tensors-of-dim-n-to-a-tensor-of-dim-n1

http://blog.youkuaiyun.com/sherry_up/article/details/52169318

8. The batch_matmul problem

The operation I wanted: suppose I have a T x n x k tensor and want to multiply it by a k x k2 matrix, then do a max pool over T and then a mean pool over n. To do this now, I think you need to reshape, do the matmul() and then undo the reshape and then do the pooling.

https://github.com/tensorflow/tensorflow/issues/216

https://www.tensorflow.org/versions/r0.10/api_docs/python/math_ops/matrix_math_functions#batch_matmul

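When I used it, I got the error AttributeError: 'module' object has no attribute 'batch_matmul', and only then realized that version 1.0 no longer has this function. Use matmul with the appropriate arguments instead.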
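
The reshape, matmul, undo-reshape, pool sequence quoted above can be traced with plain Python lists standing in for tensors. This is only a sketch of the shape bookkeeping, not TensorFlow code; matmul and project_and_pool are names I made up for the illustration.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x k2) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def project_and_pool(x, w):
    """x: T x n x k nested list, w: k x k2 nested list -> list of length k2."""
    T, n = len(x), len(x[0])
    k2 = len(w[0])
    # Reshape T x n x k -> (T*n) x k so one ordinary matmul covers every position.
    flat = [row for step in x for row in step]
    projected = matmul(flat, w)  # (T*n) x k2
    # Undo the reshape: back to T x n x k2.
    y = [projected[t * n:(t + 1) * n] for t in range(T)]
    # Max pool over T: the result is n x k2.
    max_over_t = [[max(y[t][i][j] for t in range(T)) for j in range(k2)]
                  for i in range(n)]
    # Mean pool over n: the result has length k2.
    return [sum(max_over_t[i][j] for i in range(n)) / n for j in range(k2)]


x = [[[1, 0], [0, 1]],
     [[2, 0], [0, 3]]]   # T=2, n=2, k=2
w = [[1, 1],
     [0, 2]]             # k=2, k2=2
pooled = project_and_pool(x, w)
print(pooled)
```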

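9. Cannot feed value of shape (XX) for Tensor u'target/Y:0', which has shape '(YY)'?

This was the first time I hit this problem. Googling suggested that the shape of the input feed data was inconsistent. But the error was raised for the fifth variable (mask5), which made me think the input feed was not the cause (if it were, why would only the fifth one fail?). After fumbling around for a long time, I found that the problem was in the input data after all.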
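
One way to narrow this kind of error down faster is to compare every fed value against its expected shape before calling session.run, so that all mismatches are reported at once. The sketch below is a hypothetical pre-check in plain Python, not a TensorFlow API; the names mask4 and mask5 and their shapes are made up for the example.

```python
def shape_of(value):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def find_shape_mismatches(expected_shapes, feed):
    """Return {name: (expected, actual)} for each fed value whose shape disagrees.

    None in an expected shape means "any size", like an unknown placeholder
    dimension.
    """
    mismatches = {}
    for name, value in feed.items():
        expected = expected_shapes[name]
        actual = shape_of(value)
        ok = len(expected) == len(actual) and all(
            e is None or e == a for e, a in zip(expected, actual))
        if not ok:
            mismatches[name] = (expected, actual)
    return mismatches


expected_shapes = {"mask4": (None, 3), "mask5": (None, 3)}
feed = {"mask4": [[1, 1, 0]], "mask5": [[1, 1]]}
bad = find_shape_mismatches(expected_shapes, feed)
print(bad)
```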

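10. Loading an existing model

I had always been able to load an existing model from a given directory, but later it stopped working. After a few hours of fumbling I realized that the directory used to contain a checkpoint file, which points the loader to the model through two paths: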

model_checkpoint_path: "translate.ckpt-101000"
all_model_checkpoint_paths: "translate.ckpt-101000"
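
TensorFlow reads this file itself when you call tf.train.latest_checkpoint on the directory. To see what it points to without TensorFlow, a small hand-rolled parser is enough; parse_checkpoint_index below is a hypothetical helper written for this note, and it only handles the two keys shown above.

```python
def parse_checkpoint_index(text):
    """Parse the text of a `checkpoint` file into (latest_path, all_paths)."""
    latest, all_paths = None, []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        key = key.strip()
        value = value.strip().strip('"')
        if key == "model_checkpoint_path":
            latest = value
        elif key == "all_model_checkpoint_paths":
            all_paths.append(value)
    return latest, all_paths


example = (
    'model_checkpoint_path: "translate.ckpt-101000"\n'
    'all_model_checkpoint_paths: "translate.ckpt-101000"\n'
)
latest, all_paths = parse_checkpoint_index(example)
print(latest, all_paths)
```

If the file is missing, there is nothing to resolve the latest checkpoint from, which matches the loading failure described above.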

11. Reading parameter values from a model

https://www.tensorflow.org/programmers_guide/variables#checkpoint_files:

When you create a Saver object, you can optionally choose names for the variables in the checkpoint files. By default, it uses the value of the tf.Variable.name property for each variable.

To understand what variables are in a checkpoint, you can use the inspect_checkpoint library, and in particular, the print_tensors_in_checkpoint_file function.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/inspect_checkpoint.py

Usage:

python inspect_checkpoint.py --file_name=./alpha_easy_nmt/valid_model/translate.ckpt-625000

Output:

Decoder/trg_lookup_table/embedding (DT_FLOAT) [16000,620]
Decoder/trg_lookup_table/embedding/Adadelta (DT_FLOAT) [16000,620]
Decoder/trg_lookup_table/embedding/Adadelta_1 (DT_FLOAT) [16000,620]
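
A listing like this mixes model variables with the extra tensors the optimizer keeps for each of them. The sketch below parses the three lines above to separate the two groups and count elements. summarize_listing is a hypothetical helper, and treating names that end in /Adadelta or /Adadelta_1 as optimizer slots is my assumption based on the naming in this listing.

```python
import re

LINE_RE = re.compile(r"^(\S+) \((\w+)\) \[([\d,]*)\]$")


def summarize_listing(listing):
    """Split a tensor listing into model variables and optimizer slots."""
    model, slots = {}, {}
    for line in listing.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore lines that are not "name (dtype) [dims]"
        name, _dtype, dims = match.groups()
        size = 1
        for d in dims.split(","):
            if d:
                size *= int(d)
        # Assumption: a trailing /Adadelta or /Adadelta_N marks an optimizer slot.
        target = slots if re.search(r"/Adadelta(_\d+)?$", name) else model
        target[name] = size
    return model, slots


listing = """Decoder/trg_lookup_table/embedding (DT_FLOAT) [16000,620]
Decoder/trg_lookup_table/embedding/Adadelta (DT_FLOAT) [16000,620]
Decoder/trg_lookup_table/embedding/Adadelta_1 (DT_FLOAT) [16000,620]"""
model, slots = summarize_listing(listing)
print(model)
print(sorted(slots))
```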

Usage:

python inspect_checkpoint.py --file_name=./alpha_easy_nmt/valid_model/translate.ckpt-625000 --tensor_name=Decoder/W_sf

Output:

tensor_name: Decoder/W_sf
[[ -4.55709170e-07 -9.10816539e-07 4.44753543e-02 ..., -2.58049741e-02
    4.26506670e-03 -3.64431571e-07]
 [ 7.86067460e-07 7.86348721e-07 1.29140466e-02 ..., 7.92008177e-06
    5.49392325e-07 6.99410566e-06]
 [ -5.86683996e-07 5.51591484e-08 9.70983803e-02 ..., 2.75615434e-07
    -4.86231060e-04 1.23817983e-07]
 ...,
 [ -1.40239194e-06 -1.00237912e-06 -1.44313052e-01 ..., -1.33047411e-06
    -1.17946070e-06 -2.41477892e-07]
 [ 1.19242941e-06 -9.48488719e-08 -2.48298571e-02 ..., 1.00101170e-03
    -3.03782895e-03 1.45507602e-06]
 [ -1.27071712e-06 -1.27975386e-06 -2.31240150e-02 ..., -7.33333752e-02
    2.30671745e-03 -5.72958811e-07]]

12. The default initializer of tf.get_variable

https://www.tensorflow.org/api_docs/python/tf/get_variable:

If initializer is None (the default), the default initializer passed in the variable scope will be used. If that one is None too, a glorot_uniform_initializer will be used. The initializer can also be a Tensor, in which case the variable is initialized to this value and shape.

Oddly, I could not find glorot_uniform_initializer mentioned in any documentation. Someone did ask about it on GitHub, but nobody answered: https://github.com/tensorflow/tensorflow/issues/7791.
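
For reference, Glorot (also called Xavier) uniform initialization draws each weight from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The sketch below shows that rule for a 2-D weight matrix in plain Python. It is not TensorFlow's implementation, and the function name and seed argument are my own.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=None):
    """Return a fan_in x fan_out weight matrix and the sampling limit used."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit


weights, limit = glorot_uniform(2, 4, seed=0)
print("limit:", limit)
```

For a matrix as large as the [16000, 620] embedding from section 11, the limit is about 0.019, so the initial weights are small.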

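13. System log

13.1 Previously, my system's BLEU on NIST06 stalled at 20 (both the Theano version and my junior labmate's version reach 34). Printing every beam during beam search revealed a large number of over-translation problems. After some digging I found a small bug in my system: in beam search I had forgotten to apply the mask to the source annotations.

Vocab = 16k, batch = 50. For both optimizers, BLEU is measured every 3000 batches.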
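
To illustrate why the missing mask matters, here is a minimal sketch of a masked softmax over source positions. It is not the code from my system: the function name, the scores, and the mask are made up, and it assumes at least one unmasked position. Without the mask, padded positions receive attention weight they should never get.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over source positions, giving zero weight where mask is 0."""
    # Shift by the max over real positions for numerical stability.
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 5.0]   # the last position is padding in this example
masked = masked_softmax(scores, [1, 1, 0])
unmasked = masked_softmax(scores, [1, 1, 1])
print("with mask:   ", masked)
print("without mask:", unmasked)
```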

Adam:

72000   BLEU score = 0.2412     BEST BLEU is 0
75000   BLEU score = 0.2377     BEST BLEU is 0.2412
78000   BLEU score = 0.2380     BEST BLEU is 0.2412
81000   BLEU score = 0.2513     BEST BLEU is 0.2412
84000   BLEU score = 0.2231     BEST BLEU is 0.2513
87000