TensorFlow training fails with "Function call stack: distributed_function -> distributed_function"

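This article looks at the "Function call stack" error that can appear while training with TensorFlow and lists several possible fixes: changing the amount of data, enlarging the training set, setting the number of GPUs, and adjusting the number of threads. It also shows how enabling GPU memory growth or disabling eager execution can resolve the problem.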

When training with TensorFlow, the error "Function call stack: distributed_function -> distributed_function" sometimes appears for no obvious reason.
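The "Function call stack" line is normally only the tail of a longer error report. The text printed above it usually names the actual failure, so it is worth reading that part first. The sketch below is a plain-Python helper, not part of TensorFlow, that splits a message at that marker. The name `split_tf_error` and the placeholder message are hypothetical, and the message layout is an assumption based on typical TF 2.x reports.

```python
MARKER = "Function call stack:"


def split_tf_error(message):
    """Return (root_cause, call_stack) from a TensorFlow error message.

    Assumes the usual layout: the real error text first, then a
    "Function call stack:" line followed by the function names.
    """
    head, sep, tail = message.partition(MARKER)
    if not sep:
        # Marker absent: treat the whole message as the root cause.
        return message.strip(), ""
    return head.strip(), tail.strip()


# Placeholder message for illustration; the first line stands in for
# whatever error text TensorFlow actually printed.
sample = (
    "<actual error text>\n"
    "\n"
    "Function call stack:\n"
    "distributed_function -> distributed_function\n"
)
cause, stack = split_tf_error(sample)
print("root cause:", cause)
print("call stack:", stack)
```

In a training script the message could come from `str(e)` inside an `except tf.errors.OpError as e:` block wrapped around `model.fit(...)`. Which of the fixes below is relevant depends on what the root-cause part says.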

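Things to try: change the number of samples in the dataset, enlarge the training set, set the number of GPUs to use, and adjust the number of threads. One of these usually resolves it.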
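As a rough sketch of the last two suggestions, the helpers below limit which GPUs are visible and cap TensorFlow's thread pools. The original does not say which thread setting is meant, so using `tf.config.threading` here is an assumption. The helper names and default values are hypothetical. TensorFlow is imported inside the second function so that the environment variable is set first.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the listed GPU indices to CUDA.

    Call this before TensorFlow is imported, otherwise it may have no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def limit_tf_threads(intra_op=2, inter_op=2):
    """Cap TensorFlow's thread pools (assumed meaning of "number of threads").

    Call this before any op, tensor, or model is created.
    """
    import tensorflow as tf  # lazy import so CUDA_VISIBLE_DEVICES is set first

    tf.config.threading.set_intra_op_parallelism_threads(intra_op)
    tf.config.threading.set_inter_op_parallelism_threads(inter_op)
    return (
        tf.config.threading.get_intra_op_parallelism_threads(),
        tf.config.threading.get_inter_op_parallelism_threads(),
    )
```

A script would call `restrict_visible_gpus([0])` and then `limit_tf_threads()` at the very top, before building the model. Reasonable values depend on the machine, so treat the numbers as starting points.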

If none of these helps, add:

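    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
    for gpu in gpus:
        # Allocate GPU memory on demand; call before the GPUs are initialized
        tf.config.experimental.set_memory_growth(gpu, True)

This makes TensorFlow allocate memory on each GPU as it is needed instead of claiming all of it at startup.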
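If a fixed per-GPU budget is preferred over on-demand growth, a hard cap is another option. The sketch below is an alternative to the memory growth setting, not something the original uses. It assumes a recent TensorFlow 2.x release; older versions expose similar calls under `tf.config.experimental`. The helper name `cap_gpu_memory` and the 4096 MB figure are hypothetical.

```python
def cap_gpu_memory(memory_limit_mb):
    """Give every visible GPU a fixed memory budget in megabytes.

    Returns the number of GPUs configured. Call before the GPUs are
    initialized, i.e. before any tensor or model is created.
    """
    if memory_limit_mb <= 0:
        raise ValueError("memory_limit_mb must be positive")

    import tensorflow as tf  # imported here so the argument check needs no TF

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # One logical device per physical GPU, capped at the given size.
        tf.config.set_logical_device_configuration(
            gpu,
            [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)],
        )
    return len(gpus)
```

Calling `cap_gpu_memory(4096)` at the top of a script would limit each GPU to about 4 GB. Use either this cap or memory growth for a given GPU, not both, since TensorFlow rejects the combination.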

Alternatively, add tf.compat.v1.disable_eager_execution():

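    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'        # hide INFO and WARNING messages
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # order GPUs by PCI bus ID
    import tensorflow as tf  # imported after the environment variables are set

    tf.compat.v1.disable_eager_execution()  # must run at program startup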

