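Background
Searching Baidu or Google for the error "Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model" turns up many write-ups of similar problems. Almost all of them, however, propose fixes based on the path of the pretrained checkpoint file being loaded, and overlook the last word of the message, model: "Unsuccessful ... Failed to find any matching files for model".
This article uses a sample program that reproduces the problem, together with the TensorFlow r1.15 source code, to analyze the error in depth.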
Sample Program
import tensorflow as tf

# Checkpoint file of the pretrained VGG16 model
pretrained_model = '/your/path/to/vgg_16.ckpt'
# Build the weight variable for the fc6 conv layer
fc6_conv = tf.get_variable("fc6_conv", [7, 7, 512, 4096], trainable=False)
# Build the Saver ops that restore the weights from the checkpoint and assign them to fc6_conv
restorer_fc = tf.compat.v1.train.Saver({"vgg_16/fc6/weights": fc6_conv})
with tf.Session() as sess:
    graph = tf.get_default_graph()
    fetch_list = []
    # Find the Assign ops whose name contains "save/Assign" and add their outputs to the fetch list
    for op in graph.get_operations():
        if "save/Assign" in op.name:
            fetch_list.extend(op.outputs)
    # Run the Saver restore op to load the weights from the checkpoint into fc6_conv: works fine
    restorer_fc.restore(sess, pretrained_model)
    # Run the "save/Assign" ops in fetch_list directly: fails with "... Failed to find any matching files for model"
    sess.run(fetch_list)
The sample program and its comments are shown above. When "sess.run(fetch_list)" executes, an error like the one below is raised and the sample terminates abnormally:
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model
[[node save/RestoreV2 (defined at /home/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[save/RestoreV2/_7]]
(1) Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model
[[node save/RestoreV2 (defined at /home/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Problem Analysis

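The computation graph of the sample program is shown in the figure above. At first glance it does not match the graph that the sample's logic seems to express: during graph construction, partitioning, and optimization, TF adds many OP nodes on its own initiative, and this is a major reason TF static graphs are hard to debug. For example, the OP that raises the error, save/Assign (red box in the figure), is an Assign-type OP that the sample program never creates explicitly through any API call.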

The input Tensors of the save/Assign OP come from the save/RestoreV2 OP and the fc6_conv OP (blue boxes in the figure above). The former is a RestoreV2-type OP; the latter is a VariableV2-type OP.
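The edges described above can also be read off the graph programmatically. The sketch below is a small helper (the name describe_ops is hypothetical) that relies only on graph.get_operations(), op.name, op.type, op.inputs, and tensor.name; it is meant as an inspection aid, not as part of the original sample.

```python
def describe_ops(graph, name_prefix="save/"):
    """Return one line per op under name_prefix: 'name (type) <- [input tensor names]'."""
    lines = []
    for op in graph.get_operations():
        if not op.name.startswith(name_prefix):
            continue
        # Each input is a Tensor; its name identifies the producing op and output index.
        input_names = [tensor.name for tensor in op.inputs]
        lines.append("{} ({}) <- {}".format(op.name, op.type, input_names))
    return lines
```

In the sample, calling print("\n".join(describe_ops(tf.get_default_graph()))) after the Saver is constructed should list save/Assign together with the Tensors it consumes, which makes it easy to follow the chain back from save/RestoreV2.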
//tensorflow/core/kernels/assign_op.h
void Compute(OpKernelContext* context) override {
  const Tensor& rhs = context->input(1);
  // We always return the input ref.
  context->forward_ref_input_to_ref_output(0, 0);
  ...
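From the TensorFlow Assign kernel source above, the computation of an Assign OP amounts to writing the Tensor at input 1 into the variable referenced by input 0, and returning that reference as the output. In the graph, input 1 of save/Assign is the output Tensor of save/RestoreV2 and input 0 is the output Tensor of fc6_conv. In other words, the value that save/RestoreV2 reads from the pretrained checkpoint file is written into the fc6_conv Variable, which is how the weights are restored. So far so good: nothing here explains why an error would occur.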
//tensorflow/core/kernels/save_restore_v2_ops.cc
class RestoreV2 : public OpKernel {
  ...
  void Compute(OpKernelContext* context) override {
    // Path of the pretrained checkpoint file (may be a pattern with wildcards); a wrong value here triggers the error discussed in this article
    const Tensor& prefix = context->input(0);
    ...
    // Read the Tensor values from the checkpoint file
    RestoreTensor(context, &checkpoint::OpenTableTensorSliceReader,
                  /* preferred_shard */ -1, /* restore_slice */ true,
                  /* restore_index */ i);
//tensorflow/core/kernels/save_restore_tensor.cc
void RestoreTensor(OpKernelContext* context,
                   checkpoint::TensorSliceReader::OpenTableFunction open_func,
                   int preferred_shard, bool restore_slice, int restore_index) {
  // Path of the pretrained checkpoint file (may be a pattern with wildcards)
  const string& file_pattern = file_pattern_t.flat<tstring>()(0);
  ...
  if (!reader) {
    // Build the allocated_reader that reads the checkpoint file
    allocated_reader.reset(new checkpoint::TensorSliceReader(
        file_pattern, open_func, preferred_shard));
    ...
//tensorflow/core/util/tensor_slice_reader.cc
TensorSliceReader::TensorSliceReader(const string& filepattern,
...
  Status s = Env::Default()->GetMatchingPaths(filepattern, &fnames_);
  ...
  // Expand the checkpoint path pattern into the matching checkpoint files; if no file matches, the error discussed in this article is raised
  if (fnames_.empty()) {
    status_ = errors::NotFound(
        "Unsuccessful TensorSliceReader constructor: "
        "Failed to find any matching files for ",
        filepattern);
    return;
  }
The key parts of the save/RestoreV2 kernel implementation are shown above, with comments added. If the Tensor at input 0 of save/RestoreV2 does not hold a path pattern that matches any checkpoint file, the "Unsuccessful TensorSliceReader constructor: Failed to find any matching files for " error is raised while allocated_reader is being constructed. Note one detail: the trailing word model in "Unsuccessful ... Failed to find any matching files for model" comes from the variable filepattern. In most online articles about this error, filepattern holds a checkpoint path rather than "model", so the real question is why the value model shows up in the sample.
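The failure condition can be mimicked outside TensorFlow. The sketch below is only an approximation: Python's glob stands in for Env::GetMatchingPaths (an assumption, since the real matching rules may differ), and the function name find_checkpoint_files is hypothetical. It shows why a bare pattern such as "model" fails whenever no file of that name exists relative to the working directory.

```python
import glob


def find_checkpoint_files(file_pattern):
    """Expand file_pattern; raise if nothing matches, echoing the pattern in the message.

    Assumption: glob.glob approximates TensorFlow's Env::GetMatchingPaths.
    """
    matches = sorted(glob.glob(file_pattern))
    if not matches:
        # Same message layout as the kernel: fixed text followed by the pattern itself.
        raise FileNotFoundError(
            "Unsuccessful TensorSliceReader constructor: "
            "Failed to find any matching files for " + file_pattern)
    return matches
```

Whatever string reaches the reader is echoed at the end of the message, so the last word of the error is a direct view of the pattern the kernel received.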

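As the code analysis above shows, the Tensor at input 0 of the RestoreV2 OP is the all-important checkpoint path pattern. Tracing the Tensor relationships in the graph, input 0 of RestoreV2 comes from the output of the save/Const OP, the input of save/Const comes from the save/filename OP, and the input of save/filename is ultimately a Const string with the value "model". This fully explains what happens when the sample runs sess.run(fetch_list):
- sess.run(fetch_list) amounts to running the save/Assign OP
- Running the save/Assign OP requires running the save/RestoreV2 OP
- Running the save/RestoreV2 OP requires running the save/Const OP to obtain the checkpoint path pattern that input 0 of the save/RestoreV2 kernel depends on
- Running the save/Const OP requires running the save/filename OP to obtain its output, which is the Const string "model". The checkpoint path pattern at input 0 of the save/RestoreV2 kernel therefore becomes "model", and constructing allocated_reader raises the "Unsuccessful ... Failed to find any matching files for model" error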
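The chain in the list above can be sketched as a toy model in plain Python. This is not TensorFlow code: the op names follow the article, while the dictionary layout and both helper functions are hypothetical, and only input 0 of save/RestoreV2 is modelled.

```python
# Toy stand-in for the save/* subgraph: op name -> names of the ops it reads from.
INPUTS = {
    "save/filename": [],
    "save/Const": ["save/filename"],
    "save/RestoreV2": ["save/Const"],
    "fc6_conv": [],
    "save/Assign": ["fc6_conv", "save/RestoreV2"],
}
DEFAULT_FILENAME = "model"  # the constant string behind save/filename


def ops_needed(fetch):
    """Return every op that must run to produce `fetch`, dependencies first."""
    order = []

    def visit(op):
        if op in order:
            return
        for dep in INPUTS[op]:
            visit(dep)
        order.append(op)

    visit(fetch)
    return order


def resolve_file_pattern(feed=None):
    """Value reaching save/RestoreV2 input 0: a fed save/Const wins over the default."""
    feed = feed or {}
    return feed.get("save/Const", DEFAULT_FILENAME)
```

Fetching "save/Assign" pulls in the whole chain down to save/filename, and with no feed the pattern resolves to "model"; supplying a value for save/Const replaces it.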
Analysis of the Saver.restore API
# tensorflow/python/training/saver.py
class Saver:
  ...
  def restore(self, sess, save_path):
    """Restores previously saved variables.
    This method runs the ops added by the constructor for restoring variables.
    It requires a session in which the graph was launched. The variables to
    restore do not have to have been initialized, as restoring is itself a way
    to initialize variables.
    The `save_path` argument is typically a value previously returned from a
    `save()` call, or a call to `latest_checkpoint()`.
    Args:
      sess: A `Session` to use to restore the parameters. None in eager mode.
      save_path: Path where parameters were previously saved.
    Raises:
      ValueError: If save_path is None or not a valid checkpoint.
    """
    if self._is_empty:
      return
    if save_path is None:
      raise ValueError("Can't load save_path when it is None.")
    checkpoint_prefix = compat.as_text(save_path)
    ...
    # The application supplies the checkpoint file path through save_path, which is fed to the filename tensor
    sess.run(self.saver_def.restore_op_name,
             {self.saver_def.filename_tensor_name: save_path})
Attentive readers will have a question: if running the RestoreV2 OP raises the error, why does restorer_fc.restore(sess, pretrained_model) in the sample, which also runs the RestoreV2 OP, not fail? As the Saver.restore source above shows, the key is that the sample passes the checkpoint file path when it calls the restore API, and inside TF that path is fed into the graph through the feed dict of sess.run. The path pattern that reaches input 0 of the save/RestoreV2 kernel is therefore no longer the default "model" but the checkpoint path supplied by the sample, so the checkpoint file is found and read correctly.
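Following the same idea, the Assign ops can be run directly as long as the checkpoint path is fed the way Saver.restore feeds it. The helper below is a minimal sketch under that assumption; the name run_with_checkpoint is hypothetical, and it uses only saver.saver_def.filename_tensor_name and sess.run with a feed_dict, both of which appear in the source above.

```python
def run_with_checkpoint(sess, saver, fetches, checkpoint_path):
    """Run `fetches` while feeding `checkpoint_path` into the Saver's filename tensor.

    `sess` is expected to be a tf.Session and `saver` a tf.compat.v1.train.Saver.
    """
    # Same feed key that Saver.restore uses internally.
    filename_tensor = saver.saver_def.filename_tensor_name
    return sess.run(fetches, feed_dict={filename_tensor: checkpoint_path})
```

In the sample, replacing sess.run(fetch_list) with run_with_checkpoint(sess, restorer_fc, fetch_list, pretrained_model) should let save/RestoreV2 receive the real checkpoint path instead of "model".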