Environment
- Ubuntu: 16.04
- CUDA: 10.0
- CUDNN: 7.6.5
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
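The three `#define` lines above are what the `CUDNN_VERSION` macro combines into one number. As a quick sanity check, here is a minimal sketch that parses those lines and applies the same arithmetic. The helper names `parse_cudnn_version` and `cudnn_version_number` are made up for illustration and are not part of any library; the header text is copied from the output above.

```python
import re

# Header lines as quoted above (taken from the cudnn.h output in this post).
HEADER_SNIPPET = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
"""


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cudnn.h-style #define lines."""
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(r"#define\s+CUDNN_%s\s+(\d+)" % name, header_text)
        if match is None:
            raise ValueError("CUDNN_%s not found" % name)
        fields[name] = int(match.group(1))
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"]


def cudnn_version_number(major, minor, patch):
    """Same arithmetic as the CUDNN_VERSION macro shown above."""
    return major * 1000 + minor * 100 + patch


major, minor, patch = parse_cudnn_version(HEADER_SNIPPET)
print("cuDNN %d.%d.%d -> CUDNN_VERSION %d"
      % (major, minor, patch, cudnn_version_number(major, minor, patch)))
```

For the values in this post the result is 7605, which matches the 7.6.5 listed in the environment section.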
Execution output:
resnet50 (Model)             (None, 2048)              23587712
_________________________________________________________________
dense (Dense)                (None, 1024)              2098176
_________________________________________________________________
dropout (Dropout)            (None, 1024)              0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 2050
=================================================================
Total params: 25,687,938
Trainable params: 25,634,818
Non-trainable params: 53,120
_________________________________________________________________
Train for 2 steps, validate for 1 steps
Epoch 1/50
2020-10-12 16:46:46.980388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-10-12 16:46:47.348629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-12 16:46:48.700629: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-10-12 16:46:48.720395: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-10-12 16:46:48.720496: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node model/resnet50/conv1_conv/Conv2D}}]]
2020-10-12 16:46:48.733878: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2020-10-12 16:46:48.734055: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcupti.so.10.0'; dlerror: libcupti.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ubuntu/downloads/libpng/lib::/usr/local/cuda/lib64:/usr/local/cuda/lib64
2020-10-12 16:46:48.734080: W tensorflow/core/profiler/lib/profiler_session.cc:192] Encountered error while starting profiler: Unavailable: CUPTI error: CUPTI could not be loaded or symbol could not be found.
1/2 [==============>...............] - ETA: 9s
2020-10-12 16:46:48.735185: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 0 kernel records, 0 memcpy records.
2020-10-12 16:46:48.735327: E tensorflow/core/platform/default/device_tracer.cc:70] CUPTI error: CUPTI could not be loaded or symbol could not be found.
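The CUPTI warnings above say that `libcupti.so.10.0` was not found in any directory on `LD_LIBRARY_PATH`. Judging by the log text, this seems to affect only the profiler and is separate from the cuDNN handle error. The sketch below checks which directories in a search path actually contain the library. `find_library` is a hypothetical helper written for this post, and the `extras/CUPTI/lib64` location is an assumption about a typical CUDA toolkit layout that should be verified on the machine in question.

```python
import os


def find_library(lib_name, search_path):
    """Return the directories in a colon-separated search path that contain lib_name."""
    hits = []
    for directory in search_path.split(":"):
        if not directory:
            # Empty entries come from "::" in the path, as in the log above.
            continue
        candidate = os.path.join(directory, lib_name)
        if os.path.isfile(candidate) and directory not in hits:
            hits.append(directory)
    return hits


# LD_LIBRARY_PATH exactly as printed in the warning above.
LOGGED_PATH = ("/home/ubuntu/downloads/libpng/lib::"
               "/usr/local/cuda/lib64:/usr/local/cuda/lib64")

# Assumption: CUPTI is often installed here, outside the main lib64 directory.
CANDIDATE_DIR = "/usr/local/cuda/extras/CUPTI/lib64"

print("found on logged path:", find_library("libcupti.so.10.0", LOGGED_PATH))
print("found in candidate dir:", find_library("libcupti.so.10.0", CANDIDATE_DIR))
```

If the first list is empty and the second is not, appending the candidate directory to `LD_LIBRARY_PATH` before starting Python would be one thing to try.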
Traceback (most recent call last):
  File "train.py", line 168, in <module>
    trainer.train()
  File "train.py", line 131, in train
    callbacks=callbacks)
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/home/ubuntu/anaconda3/envs/tf_v2/lib/python3.7/site-packages/t