GPU failure
2021-09-08 06:53:46.096205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4863 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-09-08 06:53:47.587810: F .\tensorflow/core/kernels/random_op_gpu.h:227] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: invalid device function
Fatal Python error: Aborted
Check the TensorFlow version
(base) PS D:\models-master\research> python
Python 3.7.10 (default, Feb 26 2021, 13:06:18) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-09-08 07:04:27.742112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
>>> tf.__version__
'1.15.5'
>>> exit()
Check CUDA
(base) PS D:\models-master\research> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:48_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28540450_0
(base) PS D:\models-master\research> nvidia-smi
Wed Sep 8 07:13:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 461.40 Driver Version: 461.40 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 53C P8 14W / N/A | 153MiB / 6144MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 19812 C+G ...bbwe\Microsoft.Photos.exe N/A |
+-----------------------------------------------------------------------------+
Possible causes:
The CUDA version is too high; switch to 10.0?
Switched CUDA to 10.0 and tested all 12 cuDNN builds for CUDA 10.0, from 7.3 to 7.6. The failure persists.
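The logs above already show two different CUDA versions: nvcc reports release 11.0, while TensorFlow opens cudart64_100.dll. A minimal sketch for comparing the two by parsing those strings follows. The helper names are hypothetical, and the regular expressions assume the exact formats seen in this note (in particular, that the last digit of the cudart DLL suffix is the minor version).

```python
import re


def cuda_release_from_nvcc(nvcc_output):
    """Return (major, minor) from the 'release X.Y' text of `nvcc -V`, or None."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def cuda_release_from_cudart(log_line):
    """Return (major, minor) from a cudart64_<digits>.dll name, or None.

    Assumption: the final digit is the minor version, so
    cudart64_100.dll is read as CUDA 10.0.
    """
    m = re.search(r"cudart64_(\d+)(\d)\.dll", log_line)
    return (int(m.group(1)), int(m.group(2))) if m else None


# Strings copied from the logs in this note.
toolkit = cuda_release_from_nvcc("Cuda compilation tools, release 11.0, V11.0.194")
runtime = cuda_release_from_cudart(
    "Successfully opened dynamic library cudart64_100.dll"
)

if toolkit != runtime:
    print("nvcc on PATH is CUDA %d.%d, but TensorFlow loaded CUDA %d.%d"
          % (toolkit + runtime))
```

A mismatch here only shows that the toolkit on PATH differs from the runtime TensorFlow loaded. It does not by itself explain the "invalid device function" error.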
The default GPU memory allocation exceeds what the GPU has available, i.e. the default usage is too large, so it has to be limited manually:
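import os
import tensorflow as tf
# Set these before TensorFlow initializes CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # the only GPU, index 0 in nvidia-smi
# Cap this process at 70% of the GPU's memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))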
Changed gpu_memory_fraction from 1 to 0.7; same problem.
Reinstall CUDA and cuDNN?
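When restricting the visible devices, the index given to CUDA_VISIBLE_DEVICES has to exist on the machine. An index that is not present leaves TensorFlow with no GPU at all, and nvidia-smi above lists only GPU 0. A minimal sketch of a guard follows. select_gpu and its device_count parameter are hypothetical, and the count is supplied by hand from the nvidia-smi output rather than queried.

```python
import os


def select_gpu(index, device_count):
    """Expose one GPU to CUDA; call this before importing tensorflow.

    device_count is the number of GPUs nvidia-smi reports
    (1 on the machine in this note).
    """
    if not 0 <= index < device_count:
        raise ValueError(
            "GPU index %d out of range for %d device(s)" % (index, device_count)
        )
    # Order devices by PCI bus id so indices match nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


select_gpu(0, device_count=1)
```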