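Problem description:
- The input tokens is a tensor of shape (1, 77) holding the token sequence for the prompt '船' ("boat"). In the FrozenOpenCLIPEmbedder2.pool method, ops.gather_nd(x, indices) tries to extract the features at a specific position from x (shape (1, 77, 1280)), where indices is [[0, 4]]. GatherNd is expected to return a tensor of shape (1, 1280): the features of token 4 in sample 0. Instead, execution raises an index-out-of-range error reporting that index 205 was accessed and that the index must lie in [-2, 2). The error occurs during graph execution (before the debug prints in the construct method), which suggests the problem may come from index computation or memory access in the GatherNd operator on Ascend rather than from the user code logic.
- Key log: Pool: x shape: (1, 77, 1280), tokens shape: (1, 77), dtype: Float32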
Pool: indices shape: (1, 2), indices: [[0 4]]
Pool: max index: 4, x dim1: 77
Traceback (most recent call last):
File "/home/mindspore/work/mindone/examples/pangu_draw_v3/pangu_sampling.py", line 511, in <module>
sample(args)
File "/home/mindspore/work/mindone/examples/pangu_draw_v3/pangu_sampling.py", line 456, in sample
out = run_img2img(
File "/home/mindspore/work/mindone/examples/pangu_draw_v3/pangu_sampling.py", line 285, in run_img2img
out = model.do_img2img(
File "/home/mindspore/work/mindone/examples/pangu_draw_v3/gm/models/diffusion.py", line 415, in do_img2img
c, uc = self.conditioner.get_unconditional_conditioning(
File "/home/mindspore/work/mindone/examples/pangu_draw_v3/gm/modules/embedders/modules.py", line 234, in get_unconditional_conditioning
c = self.tokenize_embedding(batch_c)
File "/home/mindspore/work/mindone/examples/pangu_draw_v3/gm/modules/embedders/modules.py", line 215, in tokenize_embedding
vector, crossattn, concat = self.embedding(*tokens, force_zero_embeddings=force_zero_embeddings)
File "/home/mindspore/work/mindone/examples/pangu_draw_v3/gm/modules/embedders/modules.py", line 168, in embedding
emb_out = embedder(*token)
IndexError: Given index 205 out of range. Please make sure the value of index in [-2, 2), and the type is int32.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/actor/control_flow/switch_actor.cc:51 FetchInput
Solution:
With the latest release, ms2.5.0, I tested the ops.gather_nd operator on its own on a 910A, and it ran without problems.
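A standalone check of this kind could look roughly like the sketch below. It is not the exact script used for the test above: the function name `run_gather_nd_sync` is hypothetical, and the input is an all-ones tensor standing in for the real CLIP features. Only the shapes and the indices are taken from the log in the question.

```python
def run_gather_nd_sync():
    """Run ops.gather_nd on question-sized inputs and return the output shape."""
    # Imported lazily so the file can still be loaded where MindSpore is absent.
    import mindspore as ms
    from mindspore import ops

    # PyNative (dynamic graph) mode with synchronous execution, so that any
    # error is raised at the Python line that actually triggered it.
    # The device target is left at the environment default.
    ms.set_context(mode=ms.PYNATIVE_MODE, pynative_synchronize=True)

    # Stand-in for x in FrozenOpenCLIPEmbedder2.pool: shape (1, 77, 1280).
    x = ops.ones((1, 77, 1280), ms.float32)
    # Same indices as in the log: sample 0, token 4.
    indices = ms.Tensor([[0, 4]], ms.int32)

    out = ops.gather_nd(x, indices)
    return tuple(out.shape)
```

Calling `run_gather_nd_sync()` should return `(1, 1280)` if the operator behaves as described in the question. If it does, the failure in the full pipeline is more likely to come from somewhere other than this call.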
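Are you sure the error is actually raised by ops.gather_nd? Judging from the error message, it does not look like ops.gather_nd caused it. Are you running in static graph mode? If so, first switch to dynamic graph (PyNative) mode with synchronous execution enabled and see whether it still fails; if it does, you will likely get a more accurate error message. Under the default asynchronous execution of dynamic graph mode, or in static graph mode, the stack trace sometimes does not point to the real location of the error. If dynamic graph mode works and static graph mode fails, it may be a bug in graph compilation.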
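One way to see why the message does not look like a gather_nd failure is to compare the reported range with the axes of x. The sketch below is plain Python, not MindSpore code. It assumes the usual convention that an axis of length n accepts indices in [-n, n), which is what the format of the error message suggests. The helper names `valid_ranges` and `check_gather_nd_indices` are made up for illustration.

```python
def valid_ranges(shape):
    """Return the half-open valid index range (-n, n) for each axis of shape."""
    return [(-n, n) for n in shape]


def check_gather_nd_indices(shape, indices):
    """Validate gather_nd-style index rows against shape; return the output shape."""
    depth = len(indices[0])
    if depth > len(shape):
        raise ValueError("each index row is longer than the rank of the input")
    for row in indices:
        for axis, idx in enumerate(row):
            n = shape[axis]
            if not -n <= idx < n:
                raise IndexError(
                    f"index {idx} out of range [{-n}, {n}) on axis {axis}"
                )
    # One output row per index row, followed by the axes that were not indexed.
    return (len(indices),) + tuple(shape[depth:])


x_shape = (1, 77, 1280)   # x shape from the log
indices = [[0, 4]]        # indices from the log

out_shape = check_gather_nd_indices(x_shape, indices)
print(out_shape)          # (1, 1280)

# The error message reports the range [-2, 2), i.e. an axis of length 2.
reported_range = (-2, 2)
matches_x_axis = reported_range in valid_ranges(x_shape)
print(matches_x_axis)     # False
```

Under that assumption, the indices [[0, 4]] are valid for x, and no axis of x has length 2, so the value 205 and the range [-2, 2) probably belong to a different indexing operation. The C++ call stack pointing at switch_actor.cc is consistent with that reading, though it does not prove it.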