yolov7实例分割中出现KeyError: 0

最新推荐文章于 2025-06-28 20:54:04 发布

神笔馬良

最新推荐文章于 2025-06-28 20:54:04 发布

阅读量868

点赞数 15

CC 4.0 BY-SA版权

文章标签： YOLO

本文链接：https://blog.youkuaiyun.com/weixin_43501408/article/details/148934189

问题描述：

yolov7实例分割中出现KeyError: 0

问题解答：

解决完上面的错误之后，又报了第三个错误，如上传的文件所示，如何解决呢？

这次报错的核心信息是：

File "utils/loss.py", line 811, in find_3_positive  
gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain  
KeyError: 0

🔍 错误解释

KeyError: 0 表示 p[i] 中的 i=0 并不合法，即 p 不是一个可以用整数索引的 list、tuple 或 dict。这意味着 p 的结构不对，可能是 None、空列表或维度不对。

🔧 解决方法

你需要 检查 p 的结构。它来自以下调用：

loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)

而 compute_loss_ota 的定义中调用了：

bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)

问题最终出在 p[i] 中的 p 不是预期格式。

通常 p 是模型的输出 pred，应为一个 list，每个元素对应一个检测层（例如有三个 feature map）。但你的模型可能返回的是一个 dict 或 Tensor，而不是一个 list。

✅ 修复建议

你需要在 train.py 中 查看模型输出 pred 是什么类型。例如：

print(type(pred))
print(len(pred) if isinstance(pred, list) else "Not a list")

修改思路：

如果你使用的是 YOLOR / YOLOv7 mask 分支模型，它的输出可能是 dict：

pred = model(imgs)

你可以尝试改成：

pred = model(imgs)[0]  # 只取主输出，忽略 mask

或者

if isinstance(pred, dict):
    pred = pred['bbox_and_cls']  # 具体键名要根据模型结构而定

🧠 建议

在 train.py 第 360 行左右加 print 输出 检查 pred 的类型和结构。
确认使用的是哪种 loss.py：如果你使用 ComputeLossOTA 来训练带 mask 的 YOLOv7，你应该确保 model(imgs) 输出的是用于 OTA Loss 的 bbox 结构，而不是完整的字典。

如果你愿意，我可以：

查看 model(imgs) 的输出结构（你可以加 print 发给我），
或者你也可以上传 loss.py 文件，我可以直接帮你找出是否需要对 p[i] 做结构转换。

##############################

你上传的第三个错误提示文件（错误3.docx）中显示如下报错：

TypeError: full_like(): argument 'fill_value' (position 2) must be Number, not Tensor

该错误发生在 loss.py 中的 SigmoidBin.training_loss() 方法里，具体位置如下这一行：

target_bins = torch.full_like(pred_bin, self.cn, device=device)

这里的 self.cn 是一个 Tensor（在 __init__() 方法中被定义为 torch.Tensor([BCE_weight])），而 torch.full_like() 的第二个参数必须是一个 标量（float 或 int），不能是一个张量。

✅ 解决办法

请将下面这一行：

self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([BCE_weight]))

改为：

self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device='cuda' if torch.cuda.is_available() else 'cpu'))

并将 self.cn 和 self.cp 的赋值方式从：

self.cp = 1.0 - 0.5 * smooth_eps
self.cn = 0.5 * smooth_eps

保持不变（即它们是 float，不要改成 tensor）。

然后这行代码就能正常工作了：

target_bins = torch.full_like(pred_bin, self.cn, device=device)

🚫 不推荐的做法

不要把 self.cn 改成 torch.tensor(self.cn)，否则会引发你现在遇到的 TypeError。

如果你已经改动了 self.cn 为 tensor，那么请立即改回：

self.cn = 0.5 * smooth_eps  # 保持为 float 类型

如需我对你改过的 loss.py 做一次彻底检查，看看是否还有其他不兼容 PyTorch 最新版本的用法，也可以告诉我。

##############################################

我的电脑就是gpu的包含cuda的

非常好，既然你电脑带有 GPU 并且支持 CUDA，我们可以对 loss.py 中的初始化进一步 优化为 GPU 加速 + 错误修复。

✅ 最终推荐修改方式如下：

找到 SigmoidBin.__init__() 方法中的这部分代码：

self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([BCE_weight]))

将其替换为 👇：

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=device))

保持下面两句不变（不要改成 tensor）：

self.cp = 1.0 - 0.5 * smooth_eps
self.cn = 0.5 * smooth_eps

然后你在 forward 中就可以安全地用：

target_bins = torch.full_like(pred_bin, self.cn, device=device)

🔍 原因总结：

torch.full_like() 第二个参数必须是 标量（int/float），不能是 tensor。
pos_weight 需要是一个 tensor，并放在模型的同一设备上（GPU）。
self.cn 是构造 target mask 的标量权重，因此保持 float 是对的。

现在可以重新运行训练，错误应该会解决。如果还有后续报错，可继续发我，我会一步步帮你修复。

###################################

到底是哪句修改成哪句呢？具体一点。

好的，具体修改如下，请一字不差地改👇。

❗问题出在 `loss.py` 中的这行代码：

self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([BCE_weight]))

这会报错：expected pos_weight to be on the same device，并且 torch.Tensor() 会创建在 CPU 上。

✅ 修改为：

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=device))

📌说明：

device = ... 会自动检测 CUDA 是否可用，放在 GPU 上；
torch.tensor(..., device=device) 显式地放在对应设备；
torch.Tensor(...) 是旧写法，默认是 CPU，会导致错误。

🔄 所以你要把原来的：

self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([BCE_weight]))

改成这两行：

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=device))

如果你想更规范一点，还可以把 device 提前设置成 self.device：

self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=self.device))

后面构造 target 的时候记得加上 device=self.device。

需要我帮你把整个 loss.py 补丁修改发出来也可以，告诉我 👍

######################################

self.cp = 1.0 - 0.5 * smooth_eps
self.cn = 0.5 * smooth_eps
target_bins = torch.full_like(pred_bin, self.cn, device=device)

这三行内容不需要修改吗？

非常好的问题，这三行也涉及 device 的一致性问题，我们逐句来看并判断是否需要修改：

✅ 第一二行：

self.cp = 1.0 - 0.5 * smooth_eps
self.cn = 0.5 * smooth_eps

这两行只是浮点数赋值，不涉及 tensor 创建，不需要修改。

⚠️ 第三行：

target_bins = torch.full_like(pred_bin, self.cn, device=device)

如果你在前面定义了 self.cn 是 float（而不是 tensor），那 self.cn 会默认是 CPU 上的 float，虽然不会立刻报错，但为了避免设备不一致或数据类型不兼容的隐患，推荐如下修改方式：

✅ 推荐改法：

将：

target_bins = torch.full_like(pred_bin, self.cn, device=device)

改为：

target_bins = torch.full_like(pred_bin, self.cn, dtype=pred_bin.dtype, device=pred_bin.device)

这样更保险，让 target 的数据类型和 device 与 pred_bin 完全一致。

🔎 为什么不直接用 `device=device`？

因为如果你后面代码中 pred_bin 在别的 GPU（如多卡），那么 device=device（主卡）可能还是错的。所以更保险的是 直接跟随 pred_bin 的设备。

✅ 结论：

原代码	是否需改	推荐修改
`self.cp = 1.0 - 0.5 * smooth_eps`	否	保持不变
`self.cn = 0.5 * smooth_eps`	否	保持不变
`torch.full_like(pred_bin, self.cn, device=device)`	是（建议）	`torch.full_like(pred_bin, self.cn, dtype=pred_bin.dtype, device=pred_bin.device)`

如果你愿意，我可以把这部分 loss.py 全部修改过的代码段发给你做替换。是否需要？

##############################################

我已经按照您的要求修改了，但是仍然还是报了下面的错误
Image sizes 640 train, 640 test
Using 4 dataloader workers
Logging results to runs\train\exp6
Starting training for 2 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
  0%|          | 0/567 [00:00<?, ?it/s]D:\Instance_Segmentation\yolov7-mask\train.py:360: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=cuda):
  0%|          | 0/567 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "D:\Instance_Segmentation\yolov7-mask\train.py", line 613, in <module>
    train(hyp, opt, device, tb_writer)
  File "D:\Instance_Segmentation\yolov7-mask\train.py", line 363, in train
    loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)  # loss scaled by batch_size
  File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 588, in __call__
    bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
  File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 644, in build_targets
    indices, anch = self.find_3_positive(p, targets)
  File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 814, in find_3_positive
    gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain
KeyError: 0

Process finished with exit code 1

你遇到的这个错误：

KeyError: 0

出现在文件：

loss.py 第 814 行：
gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]

✅ 根本原因

你传入的 p[i] 是个字典（或其他不支持索引 [i] 的结构），但是代码中却假设 p 是一个列表（如 List[Tensor]），所以执行 p[i] 时出现 KeyError: 0。

🔍 解决方法（关键点）

请确认 p 是来自这句：

def __call__(self, p, targets, imgs):  # 第588行

而 p 实际是 model(imgs) 的输出。

在 YOLOv7-mask 中的 OTA 分支中，如果你调用的是：

loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)

这里的 pred 就是 model(imgs) 输出的结构，有时是 dict，有时是 list，这取决于模型。

✅ 快速判断 p 是什么

你可以在 __call__ 里加一句调试代码：

print("type(p):", type(p))
print("len(p):", len(p))
print("p[0].shape:", p[0].shape if isinstance(p, list) else "Not a list")

✅ 正确写法一：`p` 是 `dict` 的情况

如果 p 是 dict，如下结构：

p = {
    'cls': ...,
    'box': ...,
    'mask': ...,
    ...
}

你就 不能再对 p[i] 做索引。

修改如下这一行（第814行）：

gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]

替换为正确的写法，比如你需要的是 p['cls'][i] 之类的，就要具体替换成：

gain[2:6] = torch.tensor(p['cls'][i].shape)[[3, 2, 3, 2]]

⚠️ 注意：你必须根据模型输出的结构，正确找到 p[i] 的来源字段。

✅ 正确写法二：你用错了 loss 函数

你可能应该调用的是：

compute_loss(pred, targets.to(device))  # 而不是 compute_loss_ota

因为 compute_loss_ota 是为 带辅助分支（aux）的模型结构准备的，其模型输出结构是复杂字典，而标准 compute_loss 用的是 List[Tensor]。

如果你使用的是普通的 YOLOv7-mask，而非带 aux 的结构，请：

👉 把这句：

loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)

改成：

loss, loss_items = compute_loss(pred, targets.to(device))

即可避免结构不匹配。

✅ 总结

问题	解决方案
`KeyError: 0`	因为 `p[i]` 错误地试图访问 dict，用 `.keys()` 或字段名替代
如果是 `compute_loss_ota()` 报错	改为 `compute_loss()`，除非你用的是 aux 模型
建议调试方式	`print(type(p), p.keys())` 查看结构再决定访问方式

如你方便，可以把 model(imgs) 的返回结构贴出来，我可以直接告诉你如何改。如果你仍然不确定，我也可以为你对 loss.py 做定向修改。是否需要？

#######################################

有个博主提到了两个错误，但是和我报错的不一样，你看看这两个错误和上面的错误有关系吗？

你发的这张图片中提到的两个错误：

错误1：

AttributeError: 'Upsample' object has no attribute 'recompute_scale_factor'

解决方式：

去掉 recompute_scale_factor=self.recompute_scale_factor

错误2：

DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`

解决方式：

把 np.int 改成 int，或者改为 np.int32 / np.int64

✅ 和你遇到的错误是否相关？

我们再看你当前报错：

KeyError: 0
gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

这个错误和上面两个提示没有直接关系。

✅ 它们之间的区别是：

错误	是否与你当前问题相关	说明
`Upsample.recompute_scale_factor`	❌无关	你没在 `loss.py` 中使用 `Upsample`
`np.int 警告`	✅有点关系	你之前已经修复过 `np.int` 的警告问题了
`KeyError: 0`	✅直接相关	是你当前`loss.py`中 `p[i]`结构访问错误导致的，需要解决模型输出结构问题

✅ 你接下来该做的

你应该专注解决 KeyError: 0，也就是 loss.py 中访问 p[i] 出错问题。

📌 快速建议：

在 __call__() 中加入：

print("type(p):", type(p))
print("p.keys() if dict:", p.keys() if isinstance(p, dict) else "Not a dict")

然后你再告诉我 p 的类型或结构，我可以直接告诉你如何正确改 p[i]。

是否需要我帮你修改 loss.py 具体那一行？你可以贴出 model(img) 输出的结构（例如 print 后的结果）。

###########################

ChatGPT 说：
好的，具体修改如下，请一字不差地改👇。

❗问题出在 loss.py 中的这行代码：
python
复制
编辑
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([BCE_weight]))
这会报错：expected pos_weight to be on the same device，并且 torch.Tensor() 会创建在 CPU 上。

✅ 修改为：
python
复制
编辑
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=device))
📌说明：

device = ... 会自动检测 CUDA 是否可用，放在 GPU 上；

torch.tensor(..., device=device) 显式地放在对应设备；

torch.Tensor(...) 是旧写法，默认是 CPU，会导致错误。

🔄 所以你要把原来的：
python
复制
编辑
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([BCE_weight]))
改成这两行：

python
复制
编辑
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=device))
如果你想更规范一点，还可以把 device 提前设置成 self.device：

python
复制
编辑
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=self.device))
后面构造 target 的时候记得加上 device=self.device。

需要我帮你把整个 loss.py 补丁修改发出来也可以，告诉我 👍
这里你之前说的改进没有用处，需要修改回去吗？

问题解答：

不需要改回去。你不应该恢复成原来错误的写法，原因如下：

✅ 当前这段是正确写法：

self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([BCE_weight], device=self.device))

❌ 原来的错误写法：

self.BCEbins = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([BCE_weight]))

🔍 原因说明：

写法	问题	是否推荐
`torch.Tensor([...])`	在 CPU 上创建张量，不指定 device，会导致 device 不匹配（尤其在 CUDA 上训练）	❌不推荐
`torch.tensor([...], device=...)`	显式指定张量设备，保证与模型在同一设备上	✅推荐
`self.device = torch.device(...)`	统一管理 device，后续调用方便清晰	✅推荐

📌 所以你的做法没错，不需要改回去。

这个修改不会引起新问题，反而可以规避多种兼容性错误（尤其是 CUDA 上）。

如果你希望我把你当前的 loss.py 文件按正确方式统一清理一遍，可以告诉我 👍。你已经接近调通了，别急，我们一点点把坑填完。

########################################

但是你这个修改没有解决问题啊，说明错误的原因不在这里吧。

✅ 你说得完全正确：如果你已经照我说的方式把 BCEbins 的 device 问题修复了，但 错误依旧存在，那么说明你遇到的实际报错 并不是这个地方导致的。你提到的当前错误是：

KeyError: 0

🔍 你当前报错的根本原因

来自这行代码（你提供的 traceback）：

gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

📌 报错解释：

错误：KeyError: 0
根源是 p[i] 出现了问题，p 是个字典（dict）而不是 list/tuple
所以 p[i] 不是合法索引，而你用了 [3, 2, 3, 2] 来索引一个不支持的结构

✅ 解决方案

请你打开你项目中文件：

utils/loss.py

找到报错行：

gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]

并先在上面一行加一句打印代码：

print(f"p type: {type(p)}, index: {i}, p[i] = {p.get(i, 'Not Found')}")

然后你再运行一次训练，我们可以进一步确认 p[i] 是不是你以为的张量。

🔁 可能出现的问题

根据你使用的是 ComputeLossOTA 或 ComputeLossAuxOTA，有可能你用了如下结构：

p = model(imgs)

而这个 model(imgs) 返回的是一个字典，而不是一个张量列表。

所以你必须改成：

p = model(imgs)["pred"]  # 或者是 model(imgs)["aux"]

取决于你的模型返回什么 key。

✅ 临时修改建议

你可以临时这样修改来跳过这个问题验证：

# 在 loss.py 中
if isinstance(p, dict):
    p = p['aux']  # 或者是 'pred'，具体 key 看模型返回内容

📌 总结

你的问题	原因	解决方法
`KeyError: 0`	p 是字典，不支持 p[i] 索引	确认模型返回值，使用 `p = p['aux']` 或其他 key
之前的 BCEbins device 改动	不是造成这个报错的原因	不要改回去，它仍是必要的修复

🔧 如果你可以发一下 model(imgs) 的返回结构（print 一下），我可以更快定位正确的 key 该取哪个值。你愿意的话我可以直接修改你上传的 loss.py 给你。是否需要？

#########################################

Traceback (most recent call last):
File "D:\Instance_Segmentation\yolov7-mask\train.py", line 613, in <module>
train(hyp, opt, device, tb_writer)
File "D:\Instance_Segmentation\yolov7-mask\train.py", line 363, in train
loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs) # loss scaled by batch_size
File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 588, in __call__
bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 644, in build_targets
indices, anch = self.find_3_positive(p, targets)
File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 814, in find_3_positive
gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]] # xyxy gain
KeyError: 0
报错的地方不是三处指向loss吗？

✅ 没错，你说得非常准确：三处 traceback 都指向的是 utils/loss.py 文件，说明真正的报错发生在 loss.py 中，并不是 BCEbins 那一行，也不是训练主循环的调用逻辑。

我们来 逐步还原问题的根源 ⬇️

🔍 关键报错：

gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

❌ 报错信息：

KeyError: 0

说明 p[i] 中的 i 为 0，但 p 本身不是可以被整数索引的结构。也就是说：

type(p) == dict

✅ 根本原因

你传入的 p 是来自模型 pred = model(imgs)，而此时：

type(pred) == dict

它并不是列表或元组，而是一个字典，里面可能是这样的结构（举例）：

{
  'test': [...],         # p['test']
  'bbox_and_cls': [...], # p['bbox_and_cls']
  'attn': ...,
  'mask_iou': ...,
  ...
}

所以当你在 loss.py 中继续传入 p[i]，就会触发 KeyError。

✅ 解决方案

① 在 `train.py` 中找到如下代码：

pred = model(imgs)
loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)

✅ 改成：

pred_all = model(imgs)
pred = pred_all['bbox_and_cls']  # 或可能是 pred_all['test']
loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)

这样你传入的 pred 就变成了 list[Tensor]，可以被 p[0] 索引。

② 可选修复方案：直接在 loss.py 顶部判断一次 `p` 的结构

⚠️ 推荐你在 loss.py → __call__ 函数或 find_3_positive() 开头加一句：

if isinstance(p, dict):
    p = p['bbox_and_cls']  # 或者 'test'，视你模型返回结构而定

🧪 如何确认 key 是啥？

你可以在 train.py 里加一句：

pred = model(imgs)
print(pred.keys()); exit()

看看里面有 'bbox_and_cls' 还是 'test'，然后选一个传给 compute_loss_ota() 即可。

🧠 总结一句话

你的错误是：

在 loss.py 中对 p[i] 索引时，p 实际是字典而不是张量列表，导致 KeyError: 0

✅ 只要传入正确的张量列表 p = pred_all['bbox_and_cls'] 或类似字段，就能解决问题。

需要我帮你 直接改好上传的 loss.py 和 train.py 文件，也可以告诉我 👍。

########################################

        lcls, lbox, lobj = torch.zeros(1, device=device), torch.zeros(1, device=device), torch.zeros(1, device=device)
        bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
        pre_gen_gains = [torch.tensor(pp.shape, device=device)[[3, 2, 3, 2]] for pp in p]
上面是558行前后代码


    def build_targets(self, p, targets, imgs):
        
        #indices, anch = self.find_positive(p, targets)
        indices, anch = self.find_3_positive(p, targets)
        #indices, anch = self.find_4_positive(p, targets)
        #indices, anch = self.find_5_positive(p, targets)
        #indices, anch = self.find_9_positive(p, targets)

        matching_bs = [[] for pp in p]
上面是644行代码

        for i in range(self.nl):
            anchors = self.anchors[i]
            gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

            # Match targets to anchors
            t = targets * gain

上面是814行代码，请问报错的这三处如何进行修改呢？

感谢你提供了关键代码位置和上下文，下面是详细说明和修改方法：

🔧 报错根因分析：

你当前的 loss.py 中 p[i] 报 KeyError: 0，说明：

gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]

中 p 不是一个 list 或 tuple，而是一个 dict。

这说明在 train.py 中：

pred = model(imgs)  # forward

返回的是一个 dict（带有多个输出，如 'test', 'bbox_and_cls' 等），而 loss.py 却假设 p 是 list，才会出现这个错误。

✅ 正确的修改方式：

【1】修改 `train.py` 中这一行：

找到：

loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)

修改为：

loss, loss_items = compute_loss_ota(pred['bbox_and_cls'], targets.to(device), imgs)

这样 p 就是一个列表（通常是3层输出，对应3个检测层），而不是一个字典。

📌 补充建议：

如果你是用 ComputeLossOTA（或 ComputeLossAuxOTA），默认就只需要 bbox 和 cls 的输出，不应传整个 dict。

✅ 其他几行是否要改？

你提到的这三处代码：

lcls, lbox, lobj = torch.zeros(1, device=device), ...
bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
pre_gen_gains = [torch.tensor(pp.shape, device=device)[[3, 2, 3, 2]] for pp in p]

不需要改，只要 p 是一个 list，就能正确工作。

🔁 小结：

项目	说明
错误类型	`KeyError: 0` 来自于 `p` 是字典而不是列表
修改位置	`train.py` 中调用 `compute_loss_ota()` 的地方
修改方式	改为传 `pred['bbox_and_cls']` 而不是整个 `pred`

需要我帮你把这处改动所在上下文完整贴出也可以，或者你也可以上传完整的 train.py 文件（如已修改），我帮你精确替换。是否需要？

######################################

loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs) 只需要修改这一行就行了嘛？

是的，当前这个报错只需要修改这一行即可解决，改成如下形式即可：

loss, loss_items = compute_loss_ota(pred['bbox_and_cls'], targets.to(device), imgs)

✅ 原因简要回顾：

pred = model(imgs) 返回的是一个字典；
你传给 compute_loss_ota(pred, ...)，而它的内部逻辑期望 pred 是一个 list；
所以必须明确告诉它只取 pred['bbox_and_cls'] 这个 list；
一旦 p = pred['bbox_and_cls']，那么后面 p[i]、p[0] 等自然就不报错了。

🔁 后续建议：

如果你使用的是 yolov7-mask 或带辅助输出的分支，还要确保：

loss.py 中的 self.build_targets() 和 self.find_3_positive() 这类函数只处理主干输出；
不要误传了 'test'、'mask'、'attn' 等结构。

如还有其他函数中也调用 compute_loss_ota(pred, ...)，记得也要同步改一下。如果你不确定我也可以帮你全文件扫描。是否需要？

###########################################

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

修改之后出现了下面的新的错误。
Starting training for 2 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
  0%|          | 0/567 [00:00<?, ?it/s]D:\Instance_Segmentation\yolov7-mask\train.py:360: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=cuda):
  0%|          | 0/567 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "D:\Instance_Segmentation\yolov7-mask\train.py", line 613, in <module>
    train(hyp, opt, device, tb_writer)
  File "D:\Instance_Segmentation\yolov7-mask\train.py", line 363, in train
    loss, loss_items = compute_loss_ota(pred['bbox_and_cls'], targets.to(device), imgs)  # loss scaled by batch_size
  File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 588, in __call__
    bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
  File "D:\Instance_Segmentation\yolov7-mask\utils\loss.py", line 762, in build_targets
    from_which_layer = from_which_layer[fg_mask_inboxes]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Process finished with exit code 1

问题解答：

这个新报错：

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

说明你在某一行代码中 用一个在 CUDA 上的索引 tensor fg_mask_inboxes 去索引一个在 CPU 上的张量 from_which_layer，这在 PyTorch 中是不允许的。

✅ 解决方法

你需要在 loss.py 中 确保 from_which_layer 和 fg_mask_inboxes 在同一个设备上，修改如下那一行代码（大约在 build_targets 的 762 行）：

原始错误行：

from_which_layer = from_which_layer[fg_mask_inboxes]

✅ 修改为：

from_which_layer = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]

🔍 解释

fg_mask_inboxes 是一个布尔索引，通常在 GPU 上；
from_which_layer 是最开始创建的 torch.tensor(...)，默认在 CPU 上；
.to(fg_mask_inboxes.device) 把它转到 GPU；
然后 [fg_mask_inboxes] 的索引操作就不会报错了。

✅ 建议补充同步检查：

除了这一行，凡是有类似 tensor1[tensor2] 的地方，如果两个张量在不同设备（一个在 CPU 一个在 CUDA），都会出错，建议搜索 [...] 这种索引操作，确认是否有类似问题。

如你愿意，也可以把报错的 build_targets 函数代码粘出来，我帮你彻底排查一下是否还有设备不一致的地方。需要我继续帮你吗？

############################

网上（yolov7的gituhb官方问题）有两个博主这样解答的。

你上传的截图中，两位博主给出的解决方案是完全正确的，它们和你目前遇到的报错：

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

是同一个问题的本质原因：

索引张量 fg_mask_inboxes 和被索引的张量 from_which_layer 不在同一个设备上。

✅ 正确的修改方式如下：

把原来的：

from_which_layer = from_which_layer[fg_mask_inboxes]

修改为：

from_which_layer = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]

这样就确保 from_which_layer 被显式移动到了 fg_mask_inboxes 所在的设备（通常是 GPU），从而避免设备不一致导致的索引错误。

📌 总结

✅ 你应该按照博主建议修改，没有任何问题。
❌ 不需要回退你之前的 device 设置或 BCE 损失函数修改，因为那个报错是另一个不相关的问题。
🚫 不需要使用 train_aux.py，除非你确定用的是带辅助头（aux head）的模型结构（比如 YOLOv7-W6 等）。

如果你现在还卡在这个错误，只需把 loss.py 中相关 fg_mask_inboxes 索引操作都加上 .to(...device) 处理，比如：

from_which_layer = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]
matching_matrix = torch.zeros_like(cost, device="cpu")  # 或者 device=cost.device，根据上下文

你可以继续运行，如果还有报错，我可以再帮你逐个清除。需要我帮你定位所有相关索引并贴出完整补丁代码也可以说一声。

###########################

第二个博主的意思是需要修改三行吗？

是的，你理解得完全正确 ✅。

第二个博主的意思就是：

你需要将 utils/loss.py 文件中出现的三行：

from_which_layer = from_which_layer[fg_mask_inboxes]

全部都修改为如下内容：

from_which_layer = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]

📍修改原因：

这是因为你遇到的错误：

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

说明你在 GPU 上运行，但 from_which_layer 是在 CPU 上创建的（默认），而 fg_mask_inboxes 是在 GPU 上的，这两个张量在设备上冲突了。

🛠 总结你需要做的：

打开 utils/loss.py
搜索 from_which_layer = from_which_layer[fg_mask_inboxes]
无论出现多少次，都改成 👇：

from_which_layer = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]

如果你不确定文件中一共出现了几处，我可以帮你快速列出典型行号或者写个 grep 查找方式，也可以提供自动替换脚本。

是否需要我帮你打补丁？你只需上传 loss.py，我可以直接替你改好。

########################################

我在loss.py中发现4行，那么四行都要修改吗？

是的，你在 loss.py 中发现的 4 行 from_which_layer = from_which_layer[fg_mask_inboxes] 全部都要修改，否则只改一两处仍可能导致：

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

✅ 正确的统一修改方式如下：

把所有类似的代码：

from_which_layer = from_which_layer[fg_mask_inboxes]

都统一修改为：

from_which_layer = from_which_layer.to(fg_mask_inboxes.device)[fg_mask_inboxes]

📌 原因总结：

from_which_layer 默认是在 CPU
fg_mask_inboxes 一般是在 GPU
如果索引时设备不一致，就会抛出 RuntimeError
.to(fg_mask_inboxes.device) 是显式确保两者在相同设备上

如你仍担心，可以把四行贴给我确认是否都是该改的，我来逐行帮你确认。你也可以让我直接给你打补丁上传修改后的 loss.py。是否需要我来改？

yolov7实例分割中出现KeyError: 0

🔍 错误解释

🔧 解决方法

✅ 修复建议

修改思路：

🧠 建议

✅ 解决办法

🚫 不推荐的做法

✅ 最终推荐修改方式如下：

🔍 原因总结：

❗问题出在 loss.py 中的这行代码：

✅ 修改为：

🔄 所以你要把原来的：

✅ 第一二行：

⚠️ 第三行：

✅ 推荐改法：

🔎 为什么不直接用 device=device？

✅ 结论：

✅ 根本原因

🔍 解决方法（关键点）

✅ 快速判断 p 是什么

✅ 正确写法一：p 是 dict 的情况

✅ 正确写法二：你用错了 loss 函数

✅ 总结

错误1：

错误2：

✅ 和你遇到的错误是否相关？

✅ 它们之间的区别是：

✅ 你接下来该做的

✅ 当前这段是正确写法：

❌ 原来的错误写法：

🔍 原因说明：

📌 所以你的做法没错，不需要改回去。

🔍 你当前报错的根本原因

📌 报错解释：

✅ 解决方案

🔁 可能出现的问题

✅ 临时修改建议

📌 总结

🔍 关键报错：

❌ 报错信息：

✅ 根本原因

✅ 解决方案

① 在 train.py 中找到如下代码：

✅ 改成：

② 可选修复方案：直接在 loss.py 顶部判断一次 p 的结构

🧪 如何确认 key 是啥？

🧠 总结一句话

🔧 报错根因分析：

✅ 正确的修改方式：

【1】修改 train.py 中这一行：

📌 补充建议：

✅ 其他几行是否要改？

🔁 小结：

✅ 原因简要回顾：

🔁 后续建议：

✅ 解决方法

原始错误行：

✅ 修改为：

🔍 解释

✅ 建议补充同步检查：

✅ 正确的修改方式如下：

把原来的：

修改为：

📌 总结

第二个博主的意思就是：

全部都修改为如下内容：

📍修改原因：

🛠 总结你需要做的：

✅ 正确的统一修改方式如下：

📌 原因总结：

❗问题出在 `loss.py` 中的这行代码：

🔎 为什么不直接用 `device=device`？

✅ 正确写法一：`p` 是 `dict` 的情况

① 在 `train.py` 中找到如下代码：

② 可选修复方案：直接在 loss.py 顶部判断一次 `p` 的结构

【1】修改 `train.py` 中这一行：