NVIDIA errors in TensorFlow


Check GPU memory usage whenever you encounter an NVIDIA or CUDA error.

I’ve been working with tensorflow-gpu for a while. When running code, NVIDIA sometimes reports all kinds of errors without any clear explanation.

As far as I know, when an NVIDIA error occurs, one should first check whether the GPU memory is occupied by another process:

```
nvidia-smi -l
```

As the following screenshot shows, GPU-Util is 0%, but the memory is nearly used up. If you start a new TensorFlow process in this state, errors are likely to be reported.
If you are coding with PyCharm, be aware of its scientific mode: in that mode the process does not exit automatically, so the TensorFlow session keeps consuming GPU memory.

[Screenshot: nvidia-smi output]
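To track down which process is holding the memory, one can parse nvidia-smi's per-process query output. A minimal sketch — the helper names and the 1000 MiB threshold are my own choices, and the sample output at the bottom is fabricated for illustration:

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV lines from nvidia-smi into (pid, MiB) pairs."""
    apps = []
    for line in csv_text.strip().splitlines():
        pid_s, mem_s = [field.strip() for field in line.split(",")]
        apps.append((int(pid_s), int(mem_s.split()[0])))  # "10240 MiB" -> 10240
    return apps


def gpu_hogs(csv_text, threshold_mib=1000):
    """Return PIDs holding more than threshold_mib of GPU memory."""
    return [pid for pid, mem in parse_compute_apps(csv_text) if mem > threshold_mib]


def query_nvidia_smi():
    """Ask the driver for the live per-process list (requires an NVIDIA GPU)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader"], text=True)


# Fabricated nvidia-smi output: one 10 GiB hog, one small process.
sample = "12345, 10240 MiB\n23456, 120 MiB\n"
print(gpu_hogs(sample))  # -> [12345]
```

Once you have confirmed that a PID belongs to a stale session (e.g. a forgotten PyCharm scientific-mode process), `kill <pid>` frees the memory.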

Error example:

```
E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
```

```
Traceback (most recent call last):
  File "train.py", line 185, in <module>
    train()
  File "train.py", line 150, in train
    trainer.start()
  File "/home/nvidia/chenboln/4HDR-GAN-master/tensorkit/train.py", line 302, in start
    self._train_loop(self._sess)
  File "/home/nvidia/chenboln/4HDR-GAN-master/tensorkit/train.py", line 222, in _train_loop
    i(sess)
  File "train.py", line 132, in restore
    Restore().init(ckpt_dir=log_dir, ckpt_file=cf, optimistic=True).restore(sess)
  File "/home/nvidia/chenboln/4HDR-GAN-master/tensorkit/restore.py", line 39, in restore
    if self._restore_vars(sess):
  File "/home/nvidia/chenboln/4HDR-GAN-master/tensorkit/restore.py", line 58, in _restore_vars
    return self._optimistic_restore_model(sess)
  File "/home/nvidia/chenboln/4HDR-GAN-master/tensorkit/restore.py", line 69, in _optimistic_restore_model
    reader = tf.train.NewCheckpointReader(self.restore_ckpt_file)
  File "/home/nvidia/anaconda3/envs/pytorch1/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 636, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/home/nvidia/anaconda3/envs/pytorch1/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 648, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./logs/2025.03.13_16.11.57_56028_unetpps_sphere_sn_lsTP: Failed precondition: logs/2025.03.13_16.11.57_56028_unetpps_sphere_sn_lsTP; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?
```
Judging from the error message, this is a `DataLossError` raised while TensorFlow tries to read a checkpoint. The program cannot open the model weights at the given path because the path actually points to a directory rather than a valid `.ckpt` file.

### Error analysis

1. **Path problem**: the supplied log path `./logs/2025.03.13_16.11.57_56028_unetpps_sphere_sn_lsTP` is a folder and does not include a concrete checkpoint name (e.g. `model.ckpt-xxx`).
2. **Checkpoint format changes across TensorFlow versions**:
   - With newer TensorFlow (v2.x), checkpoints may be stored in the SavedModel or HDF5 format, while the old API expects the traditional binary format.
   - If the training script was written for an older TensorFlow, confirm that the generated checkpoint is compatible.
3. **Optimistic-restore failure**: loading parameters via `tf.train.NewCheckpointReader` inside `_optimistic_restore_model` failed, which hints at lost or corrupted data, or at the expected variable set simply not existing.

---

### Solutions

#### Step 1: verify that a valid .ckpt file exists

First look inside the log directory for actually saved weight files. Typically the structure looks like:

```
/logs/2025.03.13_16.11.57_56028_unetpps_sphere_sn_lsTP/
├── checkpoints
├── model.ckpt.data-00000-of-00001
├── model.ckpt.index
└── model.ckpt.meta
```

If these files exist, pass the full prefix to the loading function, e.g. `checkpoints/model.ckpt`.

#### Step 2: update the code to pass the correct path

Assuming the exact checkpoint prefix is `checkpoints/model.ckpt`, the relevant code can be modified as follows:

```python
import os
import logging
import tensorflow as tf


class Restore:
    def init(self, ckpt_dir=None, ckpt_file=None, optimistic=False):
        # Make sure a concrete checkpoint prefix is used, not a bare directory.
        assert ckpt_file.startswith('checkpoints/')
        self.restore_ckpt_file = os.path.join(ckpt_dir, ckpt_file)

    def _optimistic_restore_model(self, sess):
        try:
            # tf.compat.v1 keeps this working on newer TensorFlow versions.
            reader = tf.compat.v1.train.NewCheckpointReader(self.restore_ckpt_file)
            saved_shapes = reader.get_variable_to_shape_map()
            name2var = {v.name.split(':')[0]: v
                        for v in tf.compat.v1.global_variables()}
            var_names = sorted([(var.name, var.dtype.name)
                                for var in tf.compat.v1.global_variables()
                                if not var.name.startswith("Adam")])
            vars_to_restore = {}
            for vn_name, dtype in var_names:
                stripped_name = vn_name.split(':')[0]
                if stripped_name in saved_shapes:
                    print(f"Matched variable {stripped_name}")
                    # Map the checkpoint name to the live Variable object.
                    vars_to_restore[stripped_name] = name2var[stripped_name]
            saver = tf.compat.v1.train.Saver(vars_to_restore)
            saver.restore(sess, self.restore_ckpt_file)
            return True
        except Exception as e:
            logging.error(e)
            return False
```

In addition, make sure the exact path is passed when creating the instance:

```python
Restore().init(ckpt_dir='./logs/...', ckpt_file='checkpoints/model.ckpt', optimistic=True).restore(sess)
```

#### Step 3: upgrade to newer framework features

When migrating the project to a newer TensorFlow environment, consider gradually replacing the original low-level interfaces with higher-level wrappers — for example, calling `load_weights` directly on a fully built Keras model is more convenient and efficient.
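As the answer above stresses, the reader needs a checkpoint *prefix*, not the log directory itself. A minimal stdlib sketch that resolves a prefix by scanning for `.index` files — `find_ckpt_prefix` is a hypothetical helper of mine, not part of tensorkit or TensorFlow, and it assumes the directory layout shown in step 1:

```python
import os


def find_ckpt_prefix(log_dir):
    """Return the newest checkpoint prefix under log_dir, or None.

    TensorFlow stores a checkpoint as <prefix>.index plus <prefix>.data-*;
    passing the bare directory to NewCheckpointReader raises DataLossError.
    """
    candidates = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if name.endswith(".index"):
                # Strip the ".index" suffix to recover the prefix.
                candidates.append(os.path.join(root, name[:-len(".index")]))
    if not candidates:
        return None
    # Prefer the most recently written checkpoint.
    return max(candidates, key=lambda p: os.path.getmtime(p + ".index"))
```

Usage would be `prefix = find_ckpt_prefix('./logs/2025.03.13_16.11.57_56028_unetpps_sphere_sn_lsTP')`, then pass `prefix` to the restore call instead of the directory.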