MindSpore 1.9 + Atlas 9010: training fails after loading a BERT pretrained model for a downstream task

[WARNING] ME(28985:139891715495744,MainProcess):2023-02-21-15:57:40.845.459 [mindspore/train/serialization.py:736] For 'load_param_into_net', remove parameter prefix name: bert., continue to load. 
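This warning is benign: the checkpoint's parameter names carry a `bert.` prefix that the target network's parameters lack, and `load_param_into_net` strips the prefix and retries the match. A minimal plain-Python illustration of that remapping (no MindSpore needed; the dict keys here are made up for the example):

```python
# Toy checkpoint whose keys carry a "bert." prefix, and the parameter
# names the target network actually expects (hypothetical names).
ckpt = {
    "bert.embedding.weight": [0.1, 0.2],
    "bert.encoder.layer0.dense.weight": [0.3],
}
net_params = {"embedding.weight", "encoder.layer0.dense.weight"}

def strip_prefix(params, prefix):
    """Remove `prefix` from every key that starts with it."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in params.items()}

remapped = strip_prefix(ckpt, "bert.")
assert set(remapped) == net_params  # every key now matches the network
```

In the real script this is what MindSpore does internally when you call `mindspore.load_param_into_net(net, mindspore.load_checkpoint(ckpt_path))`, so no user action is needed for this particular warning; the actual failure comes later, during operator compilation.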

terminate called after throwing an instance of 'dmlc::Error' 

  what():  TypeError: tvm_callback_get_cce_output_dir() takes 0 positional arguments but 1 was given 

Stack trace: 

    rv = local_pyfunc(*pyargs) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/_ffi/_ctypes/function.py", line 74, in cfun 

Stack trace: 

  [bt] (0) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x366e327) [0x7fc68d273327] 

  [bt] (1) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::codegen::CompileInfo::rmTmpDir()+0xb5) [0x7fc68c839685] 

  [bt] (2) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::codegen::CompileInfo::~CompileInfo()+0xcd) [0x7fc68c839abd] 

  [bt] (3) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x2c30af9) [0x7fc68c835af9] 

  [bt] (4) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::codegen::BuildCCE(std::vector<tvm::CCELoweredFunc, std::allocator<tvm::CCELoweredFunc> >&)+0xa91) [0x7fc68c8367f1] 

  [bt] (5) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::ir::CCECodegen(tvm::Array<tvm::Stmt, void> const&, std::string const&, tvm::Array<tvm::NodeRef, void> const&, bool, tvm::Expr, tvm::Array<tvm::CCELoweredFunc, void>)+0x13c4) [0x7fc68c8477d4] 

  [bt] (6) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x361adbd) [0x7fc68d21fdbd] 

  [bt] (7) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x361af71) [0x7fc68d21ff71] 

  [bt] (8) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(TVMFuncCall+0x5e) [0x7fc68d27451e] 

 

SystemError: null argument to internal routine 

terminate called after throwing an instance of 'dmlc::Error' 

  what():  TypeError: tvm_callback_get_cce_output_dir() takes 0 positional arguments but 1 was given 

Stack trace: 

    rv = local_pyfunc(*pyargs) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/_ffi/_ctypes/function.py", line 74, in cfun 

Stack trace: 

  [bt] (0) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x366e327) [0x7fc68d273327] 

  [bt] (1) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::codegen::CompileInfo::rmTmpDir()+0xb5) [0x7fc68c839685] 

  [bt] (2) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::codegen::CompileInfo::~CompileInfo()+0xcd) [0x7fc68c839abd] 

  [bt] (3) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x2c30af9) [0x7fc68c835af9] 

  [bt] (4) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::codegen::BuildCCE(std::vector<tvm::CCELoweredFunc, std::allocator<tvm::CCELoweredFunc> >&)+0xa91) [0x7fc68c8367f1] 

  [bt] (5) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::ir::CCECodegen(tvm::Array<tvm::Stmt, void> const&, std::string const&, tvm::Array<tvm::NodeRef, void> const&, bool, tvm::Expr, tvm::Array<tvm::CCELoweredFunc, void>)+0x13c4) [0x7fc68c8477d4] 

  [bt] (6) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x361adbd) [0x7fc68d21fdbd] 

  [bt] (7) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x361af71) [0x7fc68d21ff71] 

  [bt] (8) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(TVMFuncCall+0x5e) [0x7fc68d27451e] 

 

Traceback (most recent call last): 

  File "/mnt/ai1/project_code/ScriptGeneration3x/script_generation_37/mindspore_model/BERT/run_classifier.py", line 94, in <module> 

    run_classifier() 

  File "/mnt/ai1/project_code/ScriptGeneration3x/script_generation_37/mindspore_model/BERT/run_classifier.py", line 88, in run_classifier 

    train() 

  File "/mnt/ai1/project_code/ScriptGeneration3x/script_generation_37/mindspore_model/BERT/run_classifier.py", line 59, in train 

    model.train(config.epoch, train_dataset, callbacks=cb) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/train/model.py", line 1050, in train 

    initial_epoch=initial_epoch) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper 

    func(self, *args, **kwargs) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/train/model.py", line 624, in _train 

    cb_params, sink_size, initial_epoch, valid_infos) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/train/model.py", line 702, in _train_dataset_sink_process 

    outputs = train_network(*inputs) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 596, in __call__ 

    out = self.compile_and_run(*args) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 985, in compile_and_run 

    self.compile(*inputs) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 957, in compile 

    jit_config_dict=self._jit_config_dict) 

  File "/usr/local/lib/python3.7/site-packages/mindspore/common/api.py", line 1131, in compile 

    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode()) 

RuntimeError: Single op compile failed, op: mul_12884858188716146161_0 

 except_msg: 2023-02-21 07:58:12.198713+00:00: Query except_msg:Traceback (most recent call last): 

  File "/usr/local/lib/python3.7/site-packages/te_fusion/parallel_compilation.py", line 1621, in run 

    op_impl_switch=self._op_impl_switch) 

  File "/usr/local/lib/python3.7/site-packages/te_fusion/fusion_manager.py", line 1292, in build_single_op 

    compile_info = call_op() 

  File "/usr/local/lib/python3.7/site-packages/te_fusion/fusion_manager.py", line 1279, in call_op 

    opfunc(*inputs, *outputs, *new_attrs, **kwargs) 

  File "/usr/local/lib/python3.7/site-packages/tbe/common/utils/para_check.py", line 547, in _in_wrapper 

    return func(*args, **kwargs) 

  File "/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe/impl/mul.py", line 138, in mul 

    tbe.cce_build_code(sch, config) 

  File "/usr/local/lib/python3.7/site-packages/te/lang/cce/api.py", line 1300, in cce_build_code 

    return tbe.dsl.build(sch, config_map) 

  File "/usr/local/lib/python3.7/site-packages/tbe/dsl/api.py", line 1077, in build 

    return tbe_build(sch, config_map) 

  File "/usr/local/lib/python3.7/site-packages/tbe/dsl/unify_schedule/build.py", line 75, in build 

    return static_build(sch, config_map) 

  File "/usr/local/lib/python3.7/site-packages/tbe/dsl/static_schedule/cce_schedule.py", line 1481, in cce_build_code 

    _build(sch, tensor_list, local_config_map.get("name")) 

  File "/usr/local/lib/python3.7/site-packages/tbe/dsl/static_schedule/cce_schedule.py", line 1418, in _build 

    tvm.build(sch, tensor_list, device, name=name) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/build_module.py", line 841, in build 

    build_cce(inputs, args, target, target_host, name, rules, binds, evaluates) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/cce_build_module.py", line 1336, in build_cce 

    cce_lower(inputs, args, name, binds=binds, evaluates=evaluates, rule=rules) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/cce_build_module.py", line 72, in wrapper 

    r = fn(*args, **kw) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/cce_build_module.py", line 803, in cce_lower 

    stmt = lower_funcs.get(version)(stmt, ctx) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/cce_build_module.py", line 303, in cce_base_static_lower 

    stmt = _static_lower_phase_emit_insn(stmt, ctx, arg_list) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/cce_build_module.py", line 192, in _static_lower_phase_emit_insn 

    stmt = ir_pass.EmitInsn(stmt) 

  File "/usr/local/lib/python3.7/site-packages/tbe/tvm/_ffi/_ctypes/function.py", line 209, in __call__ 

    raise get_last_ffi_error() 

tvm._ffi.base.TVMError: {'errClass': 'EmitInsn Error', 'errCode': '[EB0002]', 'message': '', 'errPcause': 'N/A', 'errSolution': 'N/A', 'traceback': 'Traceback (most recent call last):\n  [bt] (8) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::ir::IRMutator::Mutate(tvm::Stmt)+0x5d) [0x7fc68bb07c6d]\n  [bt] (7) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::NodeFunctor<tvm::Stmt (tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*)>::operator()(tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*) const+0x62) [0x7fc68bb07af2]\n  [bt] (6) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x36472bc) [0x7fc68d24c2bc]\n  [bt] (5) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::ir::IRMutator::Mutate_(tvm::ir::ProducerConsumer const*, tvm::Stmt const&)+0x4a) [0x7fc68bd5617a]\n  [bt] (4) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::ir::IRMutator::Mutate(tvm::Stmt)+0x5d) [0x7fc68bb07c6d]\n  [bt] (3) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(tvm::NodeFunctor<tvm::Stmt (tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*)>::operator()(tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*) const+0x62) [0x7fc68bb07af2]\n  [bt] (2) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x3645b8c) [0x7fc68d24ab8c]\n  [bt] (1) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(+0x23aad92) [0x7fc68bfafd92]\n  [bt] (0) /usr/local/Ascend/nnae/latest/x86_64-linux/lib64/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x45) [0x7fc68ba8a6d5]\n  File "emit_insn.cc", line 426\n\nTVMError: [EB0002] Check failed: r.defined(): intrinsic rule must always return valid ExprCurrent IR Stmt:broadcast_tensor_5.local.UB[((i0.c*3) + i1.c)] = y.local.UB[i0.c]'} 

 

The function call stack: 

Corresponding code candidate: 

 - In file /usr/local/lib/python3.7/site-packages/mindspore/nn/loss/loss.py:676/                x = self.sparse_softmax_cross_entropy(logits, labels)/ 

   In file /usr/local/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:118/        return self._loss_fn(out, label)/ 

   In file /usr/local/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:380/        loss = self.network(*inputs)/ 

   In file /usr/local/lib/python3.7/site-packages/mindspore/train/dataset_helper.py:107/        return self.network(*outputs)/ 

 - In file /usr/local/lib/python3.7/site-packages/mindspore/ops/_grad/grad_nn_ops.py:927/            grad = grad_op(logits, labels)/ 

Corresponding forward node candidate: 

 - In file /usr/local/lib/python3.7/site-packages/mindspore/nn/loss/loss.py:676/                x = self.sparse_softmax_cross_entropy(logits, labels)/ 

   In file /usr/local/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:118/        return self._loss_fn(out, label)/ 

   In file /usr/local/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:380/        loss = self.network(*inputs)/ 

   In file /usr/local/lib/python3.7/site-packages/mindspore/train/dataset_helper.py:107/        return self.network(*outputs)/ 

 

---------------------------------------------------- 

- C++ Call Stack: (For framework developers) 

---------------------------------------------------- 

mindspore/ccsrc/plugin/device/ascend/kernel/tbe/tbe_kernel_compile.cc:471 QueryProcess 

 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:13.326.436 [mindspore/ccsrc/minddata/dataset/util/task.cc:163] Join] DataQueueOp(ID:0) Thread ID 139881036637952 is not responding. Interrupt again 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:14.326.577 [mindspore/ccsrc/minddata/dataset/util/task.cc:163] Join] DataQueueOp(ID:0) Thread ID 139881036637952 is not responding. Interrupt again 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:15.326.748 [mindspore/ccsrc/minddata/dataset/util/task.cc:163] Join] DataQueueOp(ID:0) Thread ID 139881036637952 is not responding. Interrupt again 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:16.326.921 [mindspore/ccsrc/minddata/dataset/util/task.cc:163] Join] DataQueueOp(ID:0) Thread ID 139881036637952 is not responding. Interrupt again 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:17.327.060 [mindspore/ccsrc/minddata/dataset/util/task.cc:163] Join] DataQueueOp(ID:0) Thread ID 139881036637952 is not responding. Interrupt again 

 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:18.327.210 [mindspore/ccsrc/minddata/dataset/util/task.cc:163] Join] DataQueueOp(ID:0) Thread ID 139881036637952 is not responding. Interrupt again 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:18.327.305 [mindspore/ccsrc/minddata/dataset/util/task.cc:170] Join] Wait 6 seconds, the task: DataQueueOp(ID:0) will be destroyed by TdtHostDestory. 

[WARNING] DEVICE(28985,7f38977fe700,python):2023-02-21-15:58:18.539.276 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_data_queue.cc:256] Push] Device queue thread had been interrupted by TdtHandle::DestroyHandle, you can ignore the above error: 'failed to send...'. In this scenario, the training ends first without using all epoch(s) data, and the data preprocessing is blocked by the data transmission channel on the device side. So we force the data transmission channel to stop. 

[WARNING] MD(28985,7f3b14025740,python):2023-02-21-15:58:18.539.874 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:93] ~DataQueueOp] preprocess_batch: 209; batch_queue: 0, 0, 0, 0, 0, 0, 0, 0, 0, 16; push_start_time: 2023-02-21-15:57:48.437.358, 2023-02-21-15:57:48.440.793, 2023-02-21-15:57:48.444.270, 2023-02-21-15:57:48.447.354, 2023-02-21-15:57:48.450.754, 2023-02-21-15:57:48.453.831, 2023-02-21-15:57:48.456.837, 2023-02-21-15:57:48.460.276, 2023-02-21-15:57:48.463.905, 2023-02-21-15:57:48.470.364; push_end_time: 2023-02-21-15:57:48.437.475, 2023-02-21-15:57:48.440.920, 2023-02-21-15:57:48.444.391, 2023-02-21-15:57:48.447.434, 2023-02-21-15:57:48.450.867, 2023-02-21-15:57:48.453.935, 2023-02-21-15:57:48.456.917, 2023-02-21-15:57:48.460.357, 2023-02-21-15:57:48.464.019, 2023-02-21-15:58:18.539.326. 

****************************************************Answer*****************************************************

This issue exists in the CANN package matched to MindSpore 1.9 because the operator had not yet been migrated to the unified static/dynamic-shape implementation. It should already be fixed in the CANN package matched to MindSpore 1.10 (the Mul operator has been migrated to the unified implementation), so try upgrading to MindSpore 1.10 together with its corresponding CANN package.
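Before retrying the training run, it can help to gate on the installed version. A minimal, MindSpore-free sketch of that check, assuming the fix ships with the release matching MindSpore 1.10 as stated above (in a real script you would pass `mindspore.__version__` as the installed version):

```python
def version_tuple(v):
    """'1.10.1' -> (1, 10, 1) for a numeric, not lexicographic, comparison."""
    return tuple(int(x) for x in v.split(".")[:3])

def is_fixed(installed, fixed_in="1.10.0"):
    """True if `installed` is at or past the release that carries the fix."""
    return version_tuple(installed) >= version_tuple(fixed_in)

# Lexicographic string comparison would get this wrong ("1.9" > "1.10"),
# which is why the tuple conversion matters.
assert not is_fixed("1.9.0")
assert is_fixed("1.10.1")
```

Note that upgrading MindSpore alone is not enough: the CANN package on the Atlas host must also be the version matched to that MindSpore release, since the failing operator implementation lives in CANN.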
