[Problem Analysis] Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model

HaoBBNuanMM

Last modified 2022-03-25 18:09:36

Tags: tensorflow, artificial intelligence, deep learning

First published 2022-03-25 16:20:18

Article link: https://blog.youkuaiyun.com/HaoBBNuanMM/article/details/123735318

Background

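Search Baidu or Google for the error "Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model" and you will find many similar write-ups. Almost all of them, however, propose fixes based on the path of the pretrained checkpoint being loaded, and overlook the last word of the message, model, in "Unsuccessful ... Failed to find any matching files for model".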

This article uses a sample program that reproduces the problem and analyzes it in depth against the TensorFlow r1.15 source code.

Sample

import tensorflow as tf

# Checkpoint file of the pretrained VGG16 model
pretrained_model = '/your/path/to/vgg_16.ckpt'

# Build the weight variable for the fc6 conv layer
fc6_conv = tf.get_variable("fc6_conv", [7, 7, 512, 4096], trainable=False)
# Build the Saver OPs that restore the weight from the checkpoint and assign it to fc6_conv
restorer_fc = tf.compat.v1.train.Saver({"vgg_16/fc6/weights": fc6_conv})

with tf.Session() as sess:
  graph = tf.get_default_graph()
  fetch_list = []
  # Find the Assign OPs whose name contains "save/Assign" and add their outputs to the fetch list
  for op in graph.get_operations():
    if "save/Assign" in op.name:
      for tensor_o in op.outputs:
        fetch_list.append(tensor_o)

  # Run the Saver OPs to restore the weight from the checkpoint and assign it to fc6_conv: works
  restorer_fc.restore(sess, pretrained_model)

  # Run the "save/Assign" OPs in fetch_list directly: fails with
  # "... Failed to find any matching files for model"
  sess.run(fetch_list)

The sample program and its comments are shown above. When "sess.run(fetch_list)" executes, an error like the one below is raised and the sample program terminates abnormally:

tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model
	 [[node save/RestoreV2 (defined at /home/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[save/RestoreV2/_7]]
  (1) Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for model
	 [[node save/RestoreV2 (defined at /home/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Problem Analysis

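The computation graph of the sample program is shown above. The first impression is that it does not match the graph the sample program logically describes: TF adds many OP nodes on its own while building, partitioning, and optimizing the graph, which is a major reason static TF graphs are hard to debug. For example, the OP whose execution triggers the error, the save/Assign OP boxed in red in the graph, is an OP of type Assign that the sample program never creates explicitly through any API call; it is added implicitly when the Saver is constructed.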

The inputs of the save/Assign OP are the save/RestoreV2 OP and the fc6_conv OP, boxed in blue in the graph above. The former is an OP of type RestoreV2 and the latter an OP of type VariableV2.

//tensorflow/core/kernels/assign_op.h

  void Compute(OpKernelContext* context) override {
    const Tensor& rhs = context->input(1);

    // We always return the input ref.
    context->forward_ref_input_to_ref_output(0, 0);
...

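From the Assign kernel source above, the computation of the Assign OP amounts to writing the Tensor at input 1 into the variable referenced by input 0. According to the computation graph, input 1 of save/Assign is the output Tensor of the save/RestoreV2 OP, and input 0 is the output Tensor of the fc6_conv OP. In other words, the value that save/RestoreV2 reads from the pretrained checkpoint file is written into the fc6_conv Variable, which is how the weight gets restored. So far so good: nothing here explains why the error occurs.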

//tensorflow/core/kernels/save_restore_v2_ops.cc

class RestoreV2 : public OpKernel {
...
  void Compute(OpKernelContext* context) override {
    // Path of the pretrained checkpoint file (may be a pattern with wildcards); a wrong value here causes the error discussed in this article
    const Tensor& prefix = context->input(0);
...
        // Read the Tensor value from the checkpoint file
        RestoreTensor(context, &checkpoint::OpenTableTensorSliceReader,
                      /* preferred_shard */ -1, /* restore_slice */ true,
                      /* restore_index */ i);


//tensorflow/core/kernels/save_restore_tensor.cc

void RestoreTensor(OpKernelContext* context,
                   checkpoint::TensorSliceReader::OpenTableFunction open_func,
                   int preferred_shard, bool restore_slice, int restore_index) {
  // Path of the pretrained checkpoint file (may be a pattern with wildcards)
  const string& file_pattern = file_pattern_t.flat<tstring>()(0);
...
  if (!reader) {
    // Construct allocated_reader, which reads the checkpoint file
    allocated_reader.reset(new checkpoint::TensorSliceReader(
        file_pattern, open_func, preferred_shard));
...


//tensorflow/core/util/tensor_slice_reader.cc

TensorSliceReader::TensorSliceReader(const string& filepattern,
 ...
  Status s = Env::Default()->GetMatchingPaths(filepattern, &fnames_);
 ...
  // Expand the checkpoint path pattern into the matching file paths; if no checkpoint file matches, raise the error discussed in this article
  if (fnames_.empty()) {
    status_ = errors::NotFound(
        "Unsuccessful TensorSliceReader constructor: "
        "Failed to find any matching files for ",
        filepattern);
    return;
  }

The key parts of the save/RestoreV2 kernel implementation are shown above, with comments added. If the path pattern in input 0 of save/RestoreV2 matches no checkpoint file, the error analyzed in this article, "Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ", is raised while allocated_reader is being constructed. Note one detail: the last word, model, in "Unsuccessful ... Failed to find any matching files for model" comes from the variable filepattern. In most articles online, filepattern holds a checkpoint path when this error appears, not "model". The real question is therefore why the value model shows up in the sample.
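
The check in the constructor can be imitated in a few lines of plain Python. This is only a rough sketch, not TensorFlow code: `find_checkpoint_files` and `NotFoundError` are hypothetical names, and `glob.glob` merely stands in for `Env::GetMatchingPaths`. It shows that whatever string arrives as the pattern ends up at the tail of the message.

```python
import glob
import os
import tempfile


class NotFoundError(Exception):
    """Stand-in for TensorFlow's NotFoundError (hypothetical, illustration only)."""


def find_checkpoint_files(filepattern):
    # Rough analogue of Env::GetMatchingPaths: expand the pattern into file names.
    fnames = sorted(glob.glob(filepattern))
    if not fnames:
        # Same layout as the C++ message: fixed text followed by the pattern.
        raise NotFoundError(
            "Unsuccessful TensorSliceReader constructor: "
            "Failed to find any matching files for " + filepattern)
    return fnames


with tempfile.TemporaryDirectory() as tmp:
    # Create an empty file that plays the role of the checkpoint.
    ckpt = os.path.join(tmp, "vgg_16.ckpt")
    open(ckpt, "w").close()

    # An existing path matches itself.
    found = find_checkpoint_files(ckpt)

    # No file called "model" exists, so the lookup fails.
    try:
        find_checkpoint_files(os.path.join(tmp, "model"))
        message = None
    except NotFoundError as err:
        message = str(err)

print(found)
print(message)
```

The sketch looks for "model" inside a temporary directory so that it does not depend on the current working directory; in the sample, the bare string "model" would be resolved relative to wherever the program runs.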

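From the code analysis above, input 0 of the RestoreV2 OP is the all-important checkpoint path pattern. Tracing the Tensor relationships in the computation graph shows that input 0 of RestoreV2 comes from the output of the save/Const OP, the input of save/Const comes from the save/filename OP, and the input of save/filename is ultimately a Const string with the value "model". This fully explains what happens when the sample runs sess.run(fetch_list):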

sess.run(fetch_list) amounts to running the save/Assign OP.
Running the save/Assign OP requires running the save/RestoreV2 OP.
Running the save/RestoreV2 OP requires running the save/Const OP, which supplies the checkpoint path pattern that the save/RestoreV2 kernel reads from input 0.
Running the save/Const OP requires running the save/filename OP, whose output is the Const string "model". The checkpoint path pattern at input 0 of the save/RestoreV2 kernel therefore becomes "model", and constructing allocated_reader raises the error "Unsuccessful ... Failed to find any matching files for model".
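
The chain above can be replayed with a toy graph evaluator in plain Python. This is a simplified model rather than TensorFlow itself: the node names are taken from the analysis, while `GRAPH`, `run`, `restore_v2`, and `NotFoundError` are hypothetical helpers, and the toy RestoreV2 simply pretends that only the sample's placeholder checkpoint path exists.

```python
class NotFoundError(Exception):
    """Stand-in for TensorFlow's NotFoundError (hypothetical, illustration only)."""


def restore_v2(prefix):
    # Toy RestoreV2: pretend that only this one checkpoint path exists.
    if prefix != "/your/path/to/vgg_16.ckpt":
        raise NotFoundError(
            "Unsuccessful TensorSliceReader constructor: "
            "Failed to find any matching files for " + prefix)
    return "weights from " + prefix


# node name -> (input node names, function of the input values)
GRAPH = {
    "save/filename": ([], lambda: "model"),
    "save/Const": (["save/filename"], lambda name: name),
    "save/RestoreV2": (["save/Const"], restore_v2),
    "fc6_conv": ([], lambda: "variable ref"),
    "save/Assign": (["fc6_conv", "save/RestoreV2"], lambda ref, value: value),
}


def run(fetch, trace):
    """Evaluate `fetch` after recursively evaluating everything it depends on."""
    inputs, fn = GRAPH[fetch]
    args = [run(name, trace) for name in inputs]
    trace.append(fetch)  # record the order in which nodes start executing
    return fn(*args)


trace = []
try:
    run("save/Assign", trace)
    error = None
except NotFoundError as err:
    error = str(err)

print(trace)
print(error)
```

Fetching save/Assign pulls in save/RestoreV2, which in turn pulls in save/Const and save/filename, so the default string "model" is what reaches the reader.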

Analysis of the Saver.restore API

//tensorflow/python/training/saver.py

class Saver:
  ...

  def restore(self, sess, save_path):
    """Restores previously saved variables.

    This method runs the ops added by the constructor for restoring variables.
    It requires a session in which the graph was launched.  The variables to
    restore do not have to have been initialized, as restoring is itself a way
    to initialize variables.

    The `save_path` argument is typically a value previously returned from a
    `save()` call, or a call to `latest_checkpoint()`.

    Args:
      sess: A `Session` to use to restore the parameters. None in eager mode.
      save_path: Path where parameters were previously saved.

    Raises:
      ValueError: If save_path is None or not a valid checkpoint.
    """
    if self._is_empty:
      return
    if save_path is None:
      raise ValueError("Can't load save_path when it is None.")

    checkpoint_prefix = compat.as_text(save_path)
...
        # The application passes the checkpoint file path in through the save_path argument
        sess.run(self.saver_def.restore_op_name,
                 {self.saver_def.filename_tensor_name: save_path})

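Careful readers will have a question: if running the RestoreV2 OP in the computation graph raises the error, why does restorer_fc.restore(sess, pretrained_model) in the sample, which also runs the RestoreV2 OP, not fail? As the Saver.restore source above shows, the key is that the sample passes the checkpoint file path when it calls the restore API, and the TF source feeds that path into the computation graph in its sess.run call. When the save/RestoreV2 kernel then runs, the checkpoint path pattern at input 0 is no longer the wrong value "model" but the path the sample passed to restore, so the checkpoint file is found and its data is read correctly.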
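
The effect of feeding the filename tensor can be sketched with the same kind of toy evaluator. This is a plain-Python model under simplifying assumptions, not TensorFlow code: `GRAPH`, `run`, and `restore_v2` are hypothetical, save/Const is collapsed into a single node with the default "model", and a fed value simply replaces that node's output.

```python
CKPT = "/your/path/to/vgg_16.ckpt"  # placeholder path taken from the sample


def restore_v2(prefix):
    # Toy RestoreV2: pretend that only CKPT exists.
    if prefix != CKPT:
        raise FileNotFoundError(
            "Failed to find any matching files for " + prefix)
    return "weights from " + prefix


# node name -> (input node names, function of the input values)
GRAPH = {
    "save/Const": ([], lambda: "model"),  # default value when nothing is fed
    "save/RestoreV2": (["save/Const"], restore_v2),
    "save/Assign": (["save/RestoreV2"], lambda value: value),
}


def run(fetch, feeds=None):
    """Evaluate `fetch`; a fed node returns the fed value instead of computing."""
    feeds = feeds or {}
    if fetch in feeds:
        return feeds[fetch]
    inputs, fn = GRAPH[fetch]
    return fn(*[run(name, feeds) for name in inputs])


# Like Saver.restore: the checkpoint path is fed in place of the default.
restored = run("save/Assign", {"save/Const": CKPT})

# Like sess.run(fetch_list) in the sample: nothing is fed, so "model" is used.
try:
    run("save/Assign")
    error = None
except FileNotFoundError as err:
    error = str(err)

print(restored)
print(error)
```

By analogy, the direct sess.run(fetch_list) call in the sample would presumably succeed if it were given the same feed that restore uses, for example sess.run(fetch_list, {restorer_fc.saver_def.filename_tensor_name: pretrained_model}); this mirrors the sess.run call in the Saver.restore source above and has not been verified here.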