LMDeploy多模态部署指南：Qwen2-VL与InternVL全流程-优快云博客

LMDeploy多模态部署指南：Qwen2-VL与InternVL全流程

【免费下载链接】lmdeploy LMDeploy is a toolkit for compressing, deploying, and serving LLMs. 项目地址: https://gitcode.com/gh_mirrors/lm/lmdeploy

引言：多模态大模型部署的痛点与解决方案

你是否还在为多模态模型部署时的环境配置复杂、推理速度慢、内存占用高而烦恼？LMDeploy（Large Model Deployment toolkit）作为一款专注于大模型压缩、部署和服务的工具包，提供了高效的解决方案。本文将以Qwen2-VL和InternVL两大主流多模态模型为例，详细介绍从环境搭建到在线服务的全流程部署方案，帮助你快速掌握多模态模型的高效部署技巧。

读完本文后，你将能够：

熟练搭建支持Qwen2-VL和InternVL的LMDeploy环境
掌握离线推理的多种高级用法，包括多图对话、视频处理等
部署高性能的在线服务，并通过API进行多模态交互
优化模型推理性能，解决实际应用中的常见问题

一、环境准备与安装

1.1 系统要求

LMDeploy支持Linux和Windows平台，最低要求CUDA版本为11.3，兼容以下NVIDIA GPU架构：

GPU架构	具体型号示例
Volta (sm70)	V100
Turing (sm75)	2080Ti, T4
Ampere (sm80, sm86)	3090, A100
Ada Lovelace (sm89)	4090, 4070Ti

1.2 LMDeploy安装

推荐使用conda创建独立环境：

conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

1.2.1 pip安装（推荐）

对于CUDA 12.x环境：

pip install lmdeploy

对于CUDA 11.3+环境：

export LMDEPLOY_VERSION=0.10.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

1.2.2 源码安装

git clone https://gitcode.com/gh_mirrors/lm/lmdeploy.git
cd lmdeploy
pip install -e .

如需禁用Turbomind引擎（仅使用PyTorch）：

DISABLE_TURBOMIND=1 pip install -e .

1.3 模型特定依赖

Qwen2-VL依赖

pip install qwen_vl_utils

InternVL依赖

pip install timm
# 安装flash-attention（推荐）
pip install flash-attn --no-build-isolation

二、Qwen2-VL部署全流程

2.1 模型支持情况

模型	大小	支持的推理引擎
Qwen-VL-Chat	-	TurboMind
Qwen2-VL	2B, 7B	PyTorch

2.2 离线推理

2.2.1 基础用法

from lmdeploy import pipeline
from lmdeploy.vl import load_image

# 初始化pipeline
pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct')

# 加载图片
image = load_image('local_image.jpg')  # 支持本地图片路径或base64编码

# 推理
response = pipe(('描述这张图片', image))
print(response.text)

2.2.2 多图多轮对话

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct', log_level='INFO')

# 准备图片和对话历史
image1 = load_image('image1.jpg')
image2 = load_image('image2.jpg')

messages = [
    dict(role='user', content=[
        dict(type='text', text='详细描述这两张图片'),
        dict(type='image_url', image_url=dict(url=image1)),
        dict(type='image_url', image_url=dict(url=image2))
    ])
]

# 第一轮推理
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
print(out.text)

# 第二轮推理（追问）
messages.append(dict(role='assistant', content=out.text))
messages.append(dict(role='user', content='这两张图片有什么异同点？'))
out = pipe(messages, gen_config=GenerationConfig(top_k=1))
print(out.text)

2.2.3 控制图片分辨率（加速推理）

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct', log_level='INFO')

# 控制图片分辨率（像素数）
min_pixels = 64 * 28 * 28  # 最小像素数
max_pixels = 64 * 28 * 28  # 最大像素数

messages = [
    dict(role='user', content=[
        dict(type='text', text='描述这张图片'),
        dict(type='image_url', image_url=dict(
            min_pixels=min_pixels, 
            max_pixels=max_pixels, 
            url='local_image.jpg'
        ))
    ])
]

response = pipe(messages, gen_config=GenerationConfig(top_k=1))
print(response.text)

2.3 在线服务部署

2.3.1 启动API服务

lmdeploy serve api_server Qwen/Qwen2-VL-7B-Instruct --server-port 23333

2.3.2 Docker部署

构建镜像：

docker build --build-arg CUDA_VERSION=cu12 -t openmmlab/lmdeploy:qwen2vl . -f ./docker/Qwen2VL_Dockerfile

启动容器：

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 23333:23333 \
    --ipc=host \
    openmmlab/lmdeploy:qwen2vl \
    lmdeploy serve api_server Qwen/Qwen2-VL-7B-Instruct

2.3.3 API调用示例

使用Python请求API：

import requests
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("local_image.jpg")

headers = {
    "Content-Type": "application/json"
}

payload = {
    "prompt": "描述这张图片",
    "image": base64_image
}

response = requests.post("http://localhost:23333/generate", headers=headers, json=payload)
print(response.json())

三、InternVL部署全流程

3.1 模型支持情况

模型	大小	支持的推理引擎
InternVL	13B-19B	TurboMind
InternVL1.5	2B-26B	TurboMind, PyTorch
InternVL2	4B	PyTorch
InternVL2	1B-2B, 8B-76B	TurboMind, PyTorch
InternVL2.5/2.5-MPO/3	1B-78B	TurboMind, PyTorch
Mono-InternVL	2B	PyTorch

3.2 离线推理

3.2.1 基础用法

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')

image = load_image('local_image.jpg')
response = pipe(('describe this image', image))
print(response)

3.2.2 多图对话高级用法

from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl.constants import IMAGE_TOKEN

pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')

# 方式1：拼接图像
messages = [
    dict(role='user', content=[
        dict(type='text', text=f'{IMAGE_TOKEN}{IMAGE_TOKEN}\n详细描述这两张图片'),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='image1.jpg')),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='image2.jpg'))
    ])
]

# 方式2：独立图像标记
messages = [
    dict(role='user', content=[
        dict(type='text', text=f'图1: {IMAGE_TOKEN}\n图2: {IMAGE_TOKEN}\n详细描述这两张图片'),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='image1.jpg')),
        dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='image2.jpg'))
    ])
]

out = pipe(messages, gen_config=GenerationConfig(top_k=1))
print(out.text)

3.2.3 视频处理

InternVL支持视频理解，通过抽取关键帧进行处理：

import numpy as np
from lmdeploy import pipeline, GenerationConfig
from decord import VideoReader, cpu
from lmdeploy.vl.constants import IMAGE_TOKEN
from lmdeploy.vl.utils import encode_image_base64
from PIL import Image

pipe = pipeline('OpenGVLab/InternVL2-8B', log_level='INFO')

def load_video(video_path, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    
    # 均匀采样视频帧
    seg_size = float(max_frame) / num_segments
    frame_indices = np.array([
        int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    
    imgs = []
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        imgs.append(img)
    return imgs

# 加载视频并提取帧
video_path = 'video.mp4'
imgs = load_video(video_path, num_segments=8)

# 构建提示
question = ''
for i in range(len(imgs)):
    question += f'帧{i+1}: {IMAGE_TOKEN}\n'
question += '这只小熊猫在做什么？'

# 准备内容
content = [{'type': 'text', 'text': question}]
for img in imgs:
    content.append({
        'type': 'image_url', 
        'image_url': {'max_dynamic_patch': 1, 'url': f'data:image/jpeg;base64,{encode_image_base64(img)}'}
    })

messages = [dict(role='user', content=content)]
response = pipe(messages, gen_config=GenerationConfig(top_k=1))
print(response.text)

3.3 在线服务部署

3.3.1 启动API服务

lmdeploy serve api_server OpenGVLab/InternVL2-8B --server-port 23333

3.3.2 Docker Compose部署

创建docker-compose.yml：

version: '3.5'

services:
  lmdeploy:
    container_name: lmdeploy-internvl
    image: openmmlab/lmdeploy:internvl
    ports:
      - "23333:23333"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    stdin_open: true
    tty: true
    ipc: host
    command: lmdeploy serve api_server OpenGVLab/InternVL2-8B
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: "all"
              capabilities: [gpu]

启动服务：

docker-compose up -d

查看日志：

docker logs -f lmdeploy-internvl

四、高级特性与性能优化

4.1 多卡并行推理

from lmdeploy import pipeline, TurbomindEngineConfig

# 使用2张GPU进行张量并行
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(tp=2)  # tp参数指定张量并行数
)

4.2 上下文窗口调整

from lmdeploy import pipeline, TurbomindEngineConfig

# 设置更大的上下文窗口（默认通常为4096）
pipe = pipeline(
    'Qwen/Qwen2-VL-7B-Instruct',
    backend_config=TurbomindEngineConfig(session_len=8192)
)

4.3 生成参数优化

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct')

# 配置生成参数
gen_config = GenerationConfig(
    top_k=40,       # 候选词数量
    top_p=0.8,      # 核采样概率阈值
    temperature=0.6,# 温度参数，控制随机性
    max_new_tokens=1024  # 最大生成 tokens
)

response = pipe(('详细描述这张图片', image), gen_config=gen_config)

4.4 视觉模型参数调整

from lmdeploy import pipeline, VisionConfig

# 调整视觉模型参数
vision_config = VisionConfig(
    max_batch_size=16,  # 视觉模型批处理大小
    image_size=448      # 输入图像大小
)

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    vision_config=vision_config
)

五、部署架构与工作原理

5.1 LMDeploy多模态推理流程

mermaid

5.2 ImageEncoder工作原理

LMDeploy的ImageEncoder负责处理图像输入并与语言模型交互：

mermaid

5.3 两种推理引擎对比

特性	PyTorch引擎	TurboMind引擎
实现语言	Python	C++/CUDA
灵活性	高，易于调试和修改	中，优化更深入
性能	基础性能	优化更好，吞吐量更高
内存效率	一般	高，支持KV缓存量化
支持模型	广泛	精选模型，优化更好

六、常见问题与解决方案

6.1 环境配置问题

Q: 安装flash-attn失败怎么办？
A: 可以从官方发布页下载对应环境的预编译whl包安装，或使用以下命令：

pip install flash-attn --no-build-isolation

Q: 运行时出现CUDA out of memory错误？
A: 尝试以下解决方案：

减少批处理大小
启用KV缓存量化：--cache-8bit或--cache-4bit
使用更小的模型版本
增加上下文窗口大小限制：--session-len 4096

6.2 推理问题

Q: 如何处理多轮对话中的图像引用？
A: LMDeploy会自动管理对话历史中的图像引用，无需重复传递图像数据，只需在第一轮对话中提供图像即可。

Q: 模型生成速度慢怎么办？
A: 1. 使用TurboMind引擎替代PyTorch引擎
2. 启用量化：lmdeploy lite quantize --model Qwen/Qwen2-VL-7B-Instruct --quant-policy w4a16
3. 调整生成参数：减小max_new_tokens，增大top_k

6.3 服务部署问题

Q: 如何查看API服务的详细文档？
A: 启动服务后，访问http://localhost:23333即可查看Swagger UI文档。

Q: 如何限制API服务的并发请求数？
A: 使用--max-num-batched-tokens参数限制批处理token总数，例如：

lmdeploy serve api_server Qwen/Qwen2-VL-7B-Instruct --max-num-batched-tokens 8192

七、总结与展望

本文详细介绍了使用LMDeploy部署Qwen2-VL和InternVL两大主流多模态模型的完整流程，包括环境搭建、离线推理、在线服务部署以及高级优化技巧。通过LMDeploy的pipeline接口，开发者可以轻松实现多模态模型的高效部署，而TurboMind引擎的优化则为模型推理性能提供了有力保障。

随着多模态模型的不断发展，LMDeploy将持续优化对新模型的支持，并进一步提升推理效率和易用性。未来，我们可以期待更多针对多模态场景的优化，如更高效的视觉特征提取、更低延迟的图像-文本交互等。

希望本文能帮助你顺利部署多模态模型，如有任何问题或建议，欢迎通过LMDeploy的GitHub仓库与我们交流。

收藏本文，关注LMDeploy项目，获取最新多模态部署技术！

【免费下载链接】lmdeploy LMDeploy is a toolkit for compressing, deploying, and serving LLMs. 项目地址: https://gitcode.com/gh_mirrors/lm/lmdeploy

创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考