[Docker Runtime Error] [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!

(A record of the error and its causes)

Error

[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

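Analysis

The error output points to two problems: a failure in multi-process task distribution (task_distribute) and a resource leak (leaked semaphore objects). The sections below explain each message and list possible fixes.


1. Analyzing the error messages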

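TBE Subprocess[task_distribute] raise error[], main process disappeared!
  • Meaning
    • TBE most likely refers to the Tensor Boost Engine, the operator development and compilation component of Huawei's Ascend CANN stack. It is not a TensorFlow component.
    • task_distribute indicates that the problem surfaced in the task distribution subprocess.
    • main process disappeared means the main process exited or vanished unexpectedly, leaving the subprocess without its parent.
  • Possible causes
    • The main process crashed or was terminated.
    • Communication between the subprocess and the main process was interrupted.
    • The system killed the process because of a resource shortage (memory, CPU, and so on).
resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  • Meaning
    • A semaphore is a synchronization primitive used to coordinate multiple processes.
    • leaked semaphore objects means that 30 semaphore objects had not been released when the program shut down.
  • Possible causes
    • The program did not clean up its multiprocessing resources.
    • Child processes did not exit normally, so their resources leaked.
    • Python's multiprocessing module failed to manage the resources correctly in some situations.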

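2. Possible causes

Main process disappeared
  • The main process may have disappeared for one of the following reasons:
    1. Uncaught exception: an exception that nothing handled crashed the process.
    2. Resource shortage: the system ran out of memory or CPU and killed the process.
    3. Signal: the main process received a termination signal such as SIGKILL or SIGTERM.
Semaphore leak
  • The semaphore leak may have one of the following causes:
    1. Child processes did not exit normally: a child finished its task without releasing its resources.
    2. Poor resource management: the program misuses the multiprocessing API, so resources leak.
    3. A bug in Python or a library: some Python or library versions have resource management bugs.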
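
The first list above can often be narrowed down from the exit status alone. The following is a minimal sketch, not part of the original troubleshooting steps: `describe_exit` is a hypothetical helper name, and it assumes the exit code comes either from Python's `subprocess` (which reports a negative value when a signal ended the child) or from a shell or Docker (which report 128 plus the signal number, so 137 corresponds to SIGKILL).

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a process exit code into a short human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess convention on POSIX: -N means "terminated by signal N"
        signum = -returncode
    elif returncode > 128:
        # shell / Docker convention: 128 + N means "terminated by signal N"
        signum = returncode - 128
    else:
        return f"exited with error code {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a valid signal number, so treat it as an ordinary error code
        return f"exited with error code {returncode}"
    return f"terminated by {name}"
```

A result of "terminated by SIGKILL" is consistent with the process being killed from outside (for example by the kernel's out-of-memory handling), whereas an ordinary error code suggests the process failed on its own, for instance through an uncaught exception.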

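3. Solutions

Finding out why the main process crashed
  1. Catch exceptions

    • Add exception handling to the main process so that every exception is recorded and handled. For example:
      import traceback

      try:
          pass  # main process code goes here
      except Exception as e:
          print(f"Main process crashed: {e}")
          traceback.print_exc()  # record the full stack trace
          raise  # re-raise so the process still exits with a non-zero status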
      
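  2. Check system resources

    • Use top or htop to check resource usage and confirm that enough memory and CPU are available.
    • If resources are short, optimize the program or add system resources.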
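  3. Check signal handling

    • Make sure the main process handles signals properly. SIGKILL cannot be caught, so only signals such as SIGTERM and SIGINT can be handled this way. For example:
      import signal
      import sys

      def handle_signal(signum, frame):
          # Log which signal arrived, then exit cleanly
          print(f"Received signal {signum}, exiting...")
          sys.exit(0)

      signal.signal(signal.SIGTERM, handle_signal)
      signal.signal(signal.SIGINT, handle_signal)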
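
The three checks above rely on the main process leaving some trace of why it died. As a rough sketch of one way to do that (an addition, not something the original steps prescribe), the standard-library `faulthandler` module can dump Python tracebacks when a fatal signal such as SIGSEGV or SIGABRT arrives from native code, and `sys.excepthook` can route uncaught exceptions into the log. `install_crash_logging` and the log path are hypothetical names chosen for illustration.

```python
import faulthandler
import logging
import sys


def install_crash_logging(fault_log_path):
    """Record uncaught exceptions and fatal native signals for this process."""
    # The file must stay open for the life of the process: faulthandler writes
    # to it directly when SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL arrives.
    fault_file = open(fault_log_path, "w")
    faulthandler.enable(file=fault_file, all_threads=True)

    def log_uncaught(exc_type, exc_value, exc_tb):
        # Send the full traceback of any uncaught exception to the logging system
        logging.critical(
            "Uncaught exception in main process",
            exc_info=(exc_type, exc_value, exc_tb),
        )

    sys.excepthook = log_uncaught
    return fault_file
```

Calling `install_crash_logging("main_crash.log")` once at startup would be enough. This does not help when the process is ended by SIGKILL, which cannot be intercepted; in that case the exit status and the system logs are the only evidence.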
      
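Fixing the semaphore leak
  1. Release resources explicitly

    • Make sure each child process releases what it acquired before it ends. For example:
      from multiprocessing import Process, Semaphore

      def worker(sem):
          # "with" acquires the semaphore and releases it even if the task raises
          with sem:
              pass  # child process task goes here

      if __name__ == "__main__":
          sem = Semaphore(1)
          p = Process(target=worker, args=(sem,))
          p.start()
          p.join()  # wait for the child so it can exit and clean up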
      
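  2. Use the multiprocessing API correctly

    • Use the multiprocessing API as intended to avoid leaks. For example, when using Pool, either call close() and join() yourself or use the pool as a context manager, which calls terminate() on exit:
      from multiprocessing import Pool

      def worker(x):
          return x * x

      if __name__ == "__main__":
          # Leaving the "with" block terminates the pool's worker processes
          with Pool(4) as p:
              results = p.map(worker, range(10))
          print(results)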
      
  3. Upgrade Python or the related libraries

    • If a bug in Python or in a library is causing the problem, try upgrading to the latest version.
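
For the Pool case mentioned in item 2, the explicit close() and join() variant could look roughly like the sketch below. It is an illustration under stated assumptions: `_square` and `square_all` are hypothetical names, and the `start_method` parameter is added here only so the start method can be chosen explicitly (passing None keeps the platform default).

```python
import multiprocessing as mp


def _square(x):
    """Toy task that stands in for the real per-item work."""
    return x * x


def square_all(values, processes=2, start_method=None):
    """Map _square over values, then shut the pool down explicitly."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    pool = ctx.Pool(processes)
    try:
        results = pool.map(_square, values)
        pool.close()      # no more tasks; workers exit once they are idle
    except BaseException:
        pool.terminate()  # stop the workers immediately on failure
        raise
    finally:
        pool.join()       # wait until every worker process has exited
    return results
```

In a script, call `square_all(range(10))` from under an `if __name__ == "__main__":` guard, as in the examples above. The point of the try/finally shape is that join() always runs, so no worker process is left behind holding semaphores when the main process exits.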

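4. Debugging suggestions

  1. Read the full log
    • Go through the complete program log to find the specific reason the main process crashed.
  2. Use a debugger
    • Use a debugger such as gdb or pdb to trace where the main process crashes.
  3. Simplify the problem
    • Reduce the code to a minimal case that still reproduces the problem, which makes the cause easier to locate.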
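
For the first suggestion, the "main process disappeared" line is emitted by a subprocess after the main process has already gone, so the lines logged before it are usually the more informative ones. The sketch below shows one way to pull them out. `errors_before_symptom` is a hypothetical helper, and the patterns it searches for ("[ERROR]", "Traceback", and names ending in "Error:") are assumptions about the log format that may need adjusting.

```python
import re

# The symptom line from the error message discussed above
SYMPTOM = "main process disappeared"

# Assumed markers of an earlier failure; adjust to the actual log format
ERROR_PATTERN = re.compile(
    r"\[ERROR\]|Traceback \(most recent call last\)|Error:"
)


def errors_before_symptom(lines):
    """Return error-like lines that appear before the first symptom line."""
    found = []
    for line in lines:
        if SYMPTOM in line:
            break  # everything after this point is a follow-on effect
        if ERROR_PATTERN.search(line):
            found.append(line.rstrip("\n"))
    return found
```

Passing an open log file works directly, for example `errors_before_symptom(open("run.log", encoding="utf-8"))`, where `run.log` is a placeholder for wherever the program's output was captured.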

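5. Summary

  • Main process disappeared: likely caused by an uncaught exception, a resource shortage, or a termination signal. Catch and log exceptions, check resource usage, and make sure signals are handled correctly.
  • Semaphore leak: likely caused by child processes that did not exit normally or by poor resource management. Release resources explicitly and use the multiprocessing API correctly.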