Qwen3-Omni (Multimodal: Text/Image/Audio/Video): A Quick-Start Guide to Installation and Usage


Latest recommended article published on 2025-11-24 20:11:50


License: CC 4.0 BY-SA


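This is a quick-start guide to installing and using **Qwen3-Omni (multimodal: text/image/audio/video)**. It covers two common routes: local inference with Transformers and serving with vLLM. I have also written down the pitfalls you are likely to hit.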

⸻

Download the model (pick one of the two sources)

Hugging Face:

pip install -U "huggingface_hub[cli]"

# Pick one: Instruct / Thinking / Captioner
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct  --local-dir ./Qwen3-Omni-30B-A3B-Instruct
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Thinking  --local-dir ./Qwen3-Omni-30B-A3B-Thinking
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Captioner --local-dir ./Qwen3-Omni-30B-A3B-Captioner

ModelScope (friendlier on mainland China networks):

pip install -U modelscope
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Instruct  --local_dir ./Qwen3-Omni-30B-A3B-Instruct
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking  --local_dir ./Qwen3-Omni-30B-A3B-Thinking
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner

These commands are taken as-is from the QuickStart section of the official model card.
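
If you would rather script the download than call the CLI, the sketch below does roughly the same thing from Python. It assumes `huggingface_hub` is installed; `repo_id` and `download` are hypothetical helper names introduced here for illustration, and the target directory simply mirrors the `--local-dir` values used above.

```python
from pathlib import Path

# The three variants named in the download commands above.
VARIANTS = ("Instruct", "Thinking", "Captioner")


def repo_id(variant: str) -> str:
    """Map a variant name to its Hugging Face repo id."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"Qwen/Qwen3-Omni-30B-A3B-{variant}"


def download(variant: str, root: str = ".") -> str:
    """Download one variant into <root>/<model name> and return the local path."""
    # Imported lazily so the helper above stays usable without the package.
    from huggingface_hub import snapshot_download

    rid = repo_id(variant)
    local_dir = Path(root) / rid.split("/", 1)[1]
    return snapshot_download(repo_id=rid, local_dir=str(local_dir))
```

Calling `download("Instruct")` would fetch the Instruct weights into `./Qwen3-Omni-30B-A3B-Instruct`, the same location the CLI command above writes to.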

⸻

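Local inference with Transformers (multimodal input, optional audio output)

At the time of writing, Transformers support has been merged into the source tree but not yet released on PyPI, so it must be installed from source. It is also recommended to install qwen-omni-utils (a helper for packaging multimodal inputs) and FlashAttention-2 (lower memory use, faster inference). ffmpeg must be available on the machine.

Environment

# After creating a clean environment: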
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install -U qwen-omni-utils
# Optional: lower memory use and faster inference (needs a GPU that supports BF16/FP16)
pip install -U flash-attn --no-build-isolation
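
Before loading a 30B model, it can save time to confirm that the pieces listed above are actually in place. The following is a minimal preflight sketch using only the standard library; the `preflight` function and the module lists are illustrative names chosen here, not part of any official tooling.

```python
import importlib.util
import shutil

# Import names of the packages installed above (note: qwen-omni-utils imports as qwen_omni_utils).
REQUIRED_MODULES = ("transformers", "accelerate", "qwen_omni_utils")
OPTIONAL_MODULES = ("flash_attn",)


def preflight(required=REQUIRED_MODULES, optional=OPTIONAL_MODULES):
    """Return (report, missing): what was found, and which hard requirements are absent."""
    report = {"ffmpeg": shutil.which("ffmpeg") is not None}
    for name in (*required, *optional):
        # find_spec returns None for a top-level module that is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    missing = [name for name in ("ffmpeg", *required) if not report[name]]
    return report, missing


if __name__ == "__main__":
    report, missing = preflight()
    for name, found in report.items():
        print(f"{name}: {'ok' if found else 'not found'}")
    if missing:
        print("Missing hard requirements:", ", ".join(missing))
```

FlashAttention-2 is treated as optional here because the text above only recommends it; if it is absent, the `attn_implementation="flash_attention_2"` argument in the example would need to be dropped.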

Minimal working example (image + audio input, text + optional speech output)

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # or Thinking
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL, dtype="auto", device_map="auto", attn_implementation="flash_attention_2"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL)

conversation = [{
  "role": "user",
  "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
    {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
    {"type": "text",  "text":  "What can you see and hear? Answer in one short sentence."}
  ],
}]
use_audio_in_video = True
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=use_audio_in_video)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=use_audio_in_video)
inputs = inputs.to(model.device).to(model.dtype)

# Generate text + speech (supported by Instruct; Thinking outputs text only)
text_ids, audio = model.generate(**inputs, speaker="Ethan",
                                 thinker_return_dict_in_generate=True,
                                 use_audio_in_video=use_audio_in_video)

reply = processor.batch_decode(
    text_ids.sequences[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(reply)

if audio is not None:  # write the speech to disk (24 kHz)
    sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)

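The full usage above matches the official model card. If you only need text and want to save roughly 10 GB of GPU memory, call model.disable_talker() and pass return_audio=False to generate.

Three voices are available: Ethan, Chelsie, and Aiden. Switch between them with, for example, speaker="Chelsie".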
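
For the step that writes the speech to disk, the example relies on `soundfile`. If that package is unavailable, a standard-library fallback could look like the sketch below. It assumes the waveform is mono with float samples in [-1.0, 1.0] (out-of-range values are clipped); `write_wav_pcm16` is a hypothetical helper name, and 16-bit PCM is a choice made here, not something the model card prescribes.

```python
import struct
import wave

SAMPLE_RATE = 24000  # same rate the example passes to sf.write


def write_wav_pcm16(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as 16-bit PCM WAV and return the duration in seconds."""
    # Clip to [-1, 1] so the int16 conversion below cannot overflow.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    pcm = [int(round(s * 32767)) for s in clipped]
    frames = struct.pack(f"<{len(pcm)}h", *pcm)  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return len(pcm) / sample_rate
```

With the tensor from the example, a call such as `write_wav_pcm16("output.wav", audio.reshape(-1).detach().cpu().tolist())` would take the place of the `sf.write` line.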

⸻

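Deploying with vLLM (high-throughput serving)

The official docs strongly recommend vLLM for deployment. Current status: vllm serve fully supports only Thinking; audio output for Instruct on the vLLM side is still in progress (direct local Python inference already supports it). It is best to install from the source branch and launch with the example parameters.

Installation (source branch)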

git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "Undefined symbol" error, retry with a pure source build: pip install -e . -v
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install -U qwen-omni-utils
pip install -U flash-attn --no-build-isolation

Inference through the Python SDK (vLLM engine)

import os, torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

os.environ["VLLM_USE_V1"] = "0"  # currently required
MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # or Thinking
llm = LLM(model=MODEL, trust_remote_code=True, gpu_memory_utilization=0.95,
          tensor_parallel_size=torch.cuda.device_count(),
          limit_mm_per_prompt={'image':3,'video':3,'audio':3},
          max_num_seqs=8, max_model_len=32768, seed=1234)

sp = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=16384)
proc = Qwen3OmniMoeProcessor.from_pretrained(MODEL)

messages = [{"role":"user","content":[{"type":"video","video":"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}]}]
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

inputs = {"prompt": text, "multi_modal_data": {}, "mm_processor_kwargs": {"use_audio_in_video": True}}
if images: inputs["multi_modal_data"]["image"]=images
if videos: inputs["multi_modal_data"]["video"]=videos
if audios: inputs["multi_modal_data"]["audio"]=audios

out = llm.generate([inputs], sampling_params=sp)
print(out[0].outputs[0].text)
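
The three `if` lines that fill `multi_modal_data` are easy to get subtly wrong when copied around, so one option is to fold them into a small helper. This is a sketch only: `build_vllm_inputs` is a hypothetical name, and it assumes `process_mm_info` returns either `None` or a list for each modality, as the example above implies.

```python
def build_vllm_inputs(prompt, audios=None, images=None, videos=None,
                      use_audio_in_video=True):
    """Assemble one vLLM request dict, skipping modalities that are empty."""
    multi_modal_data = {}
    for key, value in (("image", images), ("video", videos), ("audio", audios)):
        # Only include a modality when it actually carries data.
        if value is not None and len(value) > 0:
            multi_modal_data[key] = value
    return {
        "prompt": prompt,
        "multi_modal_data": multi_modal_data,
        "mm_processor_kwargs": {"use_audio_in_video": use_audio_in_video},
    }
```

It would be used as `llm.generate([build_vllm_inputs(text, audios, images, videos)], sampling_params=sp)`, which also makes batching several requests a matter of building a longer list.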

OpenAI-compatible server (vllm serve)

# Note: audio output is not yet available through serve; Thinking is fully supported
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 \
  --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1

# Multi-GPU example
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --port 8901 --host 127.0.0.1 \
  --dtype bfloat16 --max-model-len 65536 --allowed-local-media-path / -tp 4

Then send multimodal messages (image_url / audio_url / text) through the Chat Completions API:

curl http://localhost:8901/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages":[
    {"role":"system","content":"You are a helpful assistant."},
    {"role":"user","content":[
      {"type":"image_url","image_url":{"url":"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
      {"type":"audio_url","audio_url":{"url":"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
      {"type":"text","text":"What can you see and hear? Answer in one sentence."}
    ]}
  ]
}'
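
The same request can be sent from Python without extra dependencies. The sketch below mirrors the curl call using only the standard library; `build_payload` and `chat` are hypothetical helper names, the base URL assumes the server was started on port 8901 as above, and the response is assumed to follow the usual Chat Completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8901/v1"  # matches --port 8901 in the serve command


def build_payload(image_url, audio_url, question,
                  system="You are a helpful assistant."):
    """Build the same message structure as the curl example."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "audio_url", "audio_url": {"url": audio_url}},
                {"type": "text", "text": question},
            ]},
        ]
    }


def chat(payload, base_url=BASE_URL, timeout=300):
    """POST the payload and return the assistant's text reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat(build_payload(image, audio, "What can you see and hear?"))` with the two demo URLs from the curl example should return the model's one-sentence answer once the server is up.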

For the serve limitations and the source of these commands, see the vLLM Usage section of the model card.

⸻

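GPU memory and hardware reference

Official theoretical minimum GPU memory (BF16 + Transformers + FA2):
• Instruct (30B): about 78.9 GB (15 s video) → 144.8 GB (120 s video)
• Thinking (30B): about 68.7 GB (15 s video) → 131.7 GB (120 s video)
Note: the longer the video, the higher the memory use. vLLM preallocates GPU memory, so lowering limit_mm_per_prompt / max_num_seqs can help with OOM.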
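
For video lengths between the two published endpoints, a rough planning number can be obtained by interpolating. The sketch below assumes memory grows linearly with video length between 15 s and 120 s; that linearity is an assumption made here for illustration, not a claim from the official figures, and `estimate_min_memory_gb` is a hypothetical helper.

```python
# (15 s, 120 s) minimum-memory figures in GB, copied from the list above.
MEMORY_GB = {
    "Instruct": (78.9, 144.8),
    "Thinking": (68.7, 131.7),
}
SHORT_S, LONG_S = 15.0, 120.0


def estimate_min_memory_gb(variant, video_seconds):
    """Linearly interpolate minimum GPU memory for a video length in [15, 120] s."""
    if not SHORT_S <= video_seconds <= LONG_S:
        raise ValueError("only lengths between 15 s and 120 s are covered")
    low, high = MEMORY_GB[variant]
    fraction = (video_seconds - SHORT_S) / (LONG_S - SHORT_S)
    return low + fraction * (high - low)


if __name__ == "__main__":
    for seconds in (15, 30, 60, 120):
        print(seconds, round(estimate_min_memory_gb("Instruct", seconds), 1))
```

Treat the result as a lower bound for sizing hardware: it says nothing about vLLM preallocation, batch size, or multiple media items per prompt.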

⸻

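Model family and choosing by scenario
• Instruct: outputs text and speech; suited to assistant-style dialogue and real-time voice.
• Thinking: reasoning-enhanced (stronger chain of reasoning); accepts multimodal input but outputs text only.
• Captioner: fine-grained audio description (ASR, sound understanding, music analysis, and so on).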

⸻

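Common pitfalls and suggestions
• Transformers version: for now you need the latest version installed from Git; older transformers releases cannot use the Qwen3OmniMoe* classes directly.
• FlashAttention-2: works only in FP16/BF16 and needs a compatible GPU and driver/CUDA; strongly recommended when you are not using vLLM.
• vLLM serve limits: for now only Thinking is fully supported; audio output for Instruct on the vLLM side is still in progress (direct local Python inference can produce audio).
• Multimodal input: the official qwen-omni-utils package provides process_mm_info (imported from qwen_omni_utils), which packs images, video, and audio given as base64, URLs, or local paths in a uniform way. Use it directly rather than writing your own loaders.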
