RuntimeError: Expected all tensors to be on the same device, but found at least two devices

panhtt

Last modified: 2025-02-08 10:31:58


First published: 2025-02-08 10:20:23

Article link: https://blog.youkuaiyun.com/panhtt/article/details/145508823


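Problem description

In the era of large models, it is hard to deploy a model on a single low-end GPU because one card's memory is limited. The 16 GB V100 is a typical example. In that case you need to deploy on one machine with multiple GPUs, or even across multiple machines with multiple GPUs.

When deploying on a single machine with multiple GPUs, it is easy to run into this error: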

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

It occurs because, during inference, an operation is applied to tensors that live on different GPUs.

Background

Anyone who works with large models will know that we usually load a model with the transformers library, for example:

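from transformers import AutoModelForCausalLM

# model_path: local directory or Hub ID of the model to load
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",  # spread the weights across the visible GPUs automatically
)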

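Here device_map="auto" tells the library to distribute the model weights across the available GPUs automatically. Other accepted strings include "balanced" and "balanced_low_0", but in practice these often do not fix the problem. In that case, use the other input form, a dict, to decide by hand which layers go on which GPU.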
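To make the dict form concrete, here is a minimal sketch of how such a map can be read: each key is a module-name prefix, and a parameter goes to the device of the nearest enclosing module listed in the map. The helper resolve_device and the sample map are hypothetical illustrations, not part of the transformers API, and the module names assume a Llama-style layout.

```python
def resolve_device(param_name, device_map):
    """Return the device of the closest enclosing module listed in device_map.

    Illustrative helper (not a transformers API): walks up the dotted name,
    e.g. "a.b.c.weight" -> "a.b.c" -> "a.b" -> "a", until a key matches.
    """
    name = param_name
    while name:
        if name in device_map:
            return device_map[name]
        # Drop the last dotted component and try the parent module.
        name = name.rpartition(".")[0]
    # An empty-string key, if present, covers the whole model.
    if "" in device_map:
        return device_map[""]
    raise KeyError(f"{param_name} is not covered by the device map")


# Hypothetical map: each decoder layer is pinned to one GPU as a whole.
example_map = {
    "model.layers.10": "cuda:0",
    "model.layers.11": "cuda:1",
}

print(resolve_device("model.layers.10.mlp.down_proj.weight", example_map))    # cuda:0
print(resolve_device("model.layers.11.input_layernorm.weight", example_map))  # cuda:1
```

Because a key such as "model.layers.10" covers everything beneath it, pinning whole layers keeps all of a layer's weights on one device.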

Locating the problem

To locate it, we can print the GPU that each layer's parameters are placed on:

for name, p in model.named_parameters():
    print(f"{name}: {p.device}")

Looking through the output, you should find something like this:

model.layers.10.self_attn.q_proj.weight: cuda:0
model.layers.10.self_attn.k_proj.weight: cuda:0
model.layers.10.self_attn.v_proj.weight: cuda:0
model.layers.10.self_attn.o_proj.weight: cuda:0
model.layers.10.mlp.up_proj.weight: cuda:0
model.layers.10.mlp.down_proj.weight: cuda:1
model.layers.10.input_layernorm.weight: cuda:1
model.layers.10.post_attention_layernorm.weight: cuda:1

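Here layers.10 is the 11th transformer layer, which contains an attention layer and an MLP layer.

Every transformer layer contains skip connections. If the layer's input and output are not on the same GPU, the add or * operation in the skip connection raises the error. That is the root of the problem.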
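Scanning a long printout by eye is tedious, so the check can be automated. The sketch below groups parameter names by layer prefix and reports every layer whose parameters sit on more than one device. The helper find_split_layers is a hypothetical name, and the regular expression assumes names of the form "...layers.N..." as in the printout above.

```python
import re
from collections import defaultdict

# Captures a per-layer prefix such as "model.layers.10" (naming scheme assumed).
LAYER_PREFIX = re.compile(r"^(.*?\blayers\.\d+)\.")


def find_split_layers(param_devices):
    """Return {layer_prefix: sorted devices} for layers spread over several devices.

    param_devices: iterable of (parameter_name, device_string) pairs, e.g.
    ((n, str(p.device)) for n, p in model.named_parameters()).
    """
    devices_by_layer = defaultdict(set)
    for name, device in param_devices:
        match = LAYER_PREFIX.match(name)
        if match:
            devices_by_layer[match.group(1)].add(device)
    return {
        layer: sorted(devices)
        for layer, devices in devices_by_layer.items()
        if len(devices) > 1
    }


# Entries taken from the printout above, plus one made-up unsplit layer for contrast.
sample = [
    ("model.layers.10.self_attn.q_proj.weight", "cuda:0"),
    ("model.layers.10.mlp.up_proj.weight", "cuda:0"),
    ("model.layers.10.mlp.down_proj.weight", "cuda:1"),
    ("model.layers.10.input_layernorm.weight", "cuda:1"),
    ("model.layers.11.self_attn.q_proj.weight", "cuda:1"),  # hypothetical entry
]

print(find_split_layers(sample))  # {'model.layers.10': ['cuda:0', 'cuda:1']}
```

An empty result means no layer matched by the pattern is split across devices.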

Solution

Set the GPU placement of each layer by hand, so that every transformer layer sits entirely on one GPU:

from transformers import AutoModelForCausalLM

# Assign every layer's weights to an explicit GPU
device_map = {
    # ...
    "model.layers.10": "cuda:0",
    "model.layers.11": "cuda:1",
    # ...
}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map=device_map,
)
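Writing out dozens of entries by hand is error-prone, so the dict can also be generated. The following is a minimal sketch, assuming a Llama-style model whose top-level modules are named model.embed_tokens, model.layers.N, model.norm and lm_head. These names and the helper build_device_map are assumptions for illustration. Check them against the output of model.named_parameters() for your own model, since the map generally needs to cover every module that holds weights.

```python
def build_device_map(num_layers, num_gpus):
    """Assign whole decoder layers to GPUs in contiguous, near-equal chunks."""
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must be positive")

    # Embeddings go with the first layers (module name is an assumption).
    device_map = {"model.embed_tokens": "cuda:0"}

    for layer in range(num_layers):
        # Integer arithmetic maps layer indices 0..num_layers-1 onto GPUs 0..num_gpus-1.
        gpu = layer * num_gpus // num_layers
        device_map[f"model.layers.{layer}"] = f"cuda:{gpu}"

    # Final norm and output head go on the last GPU (names are assumptions).
    last_device = f"cuda:{num_gpus - 1}"
    device_map["model.norm"] = last_device
    device_map["lm_head"] = last_device
    return device_map


# Example: a hypothetical 4-layer model split over 2 GPUs.
for module_name, device in build_device_map(4, 2).items():
    print(f"{module_name}: {device}")
```

The result can be passed as device_map=build_device_map(...) to from_pretrained. An even split is only a starting point: the embeddings and the output head also take memory, so the first or last GPU may need fewer layers.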