Qwen2-VL Model Architecture: A Deep Dive
Table of Contents
- [Model Overview]
- [Overall Architecture]
- [Core Components in Detail]
- [Inference Flow]
- [Parallelism Strategies]
- [Code Implementation Map]
Model Overview

Qwen2-VL is a multimodal large language model that combines a vision encoder (a Vision Transformer) with a language model (Qwen2). It accepts image and text inputs and generates text responses.

Key features
- Vision encoder: processes images with a Vision Transformer (ViT) architecture
- Language model: processes text with a Qwen2-based Transformer architecture
- Multimodal fusion: merges visual and text features through a dedicated embedding mechanism
- Parallel computation: supports tensor parallelism and pipeline parallelism
Overall Architecture
```
┌────────────────────────────────────────────────────────────────┐
│                         Qwen2-VL Model                          │
├────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────────────────┐        ┌──────────────────────┐      │
│   │ Vision Encoder       │        │ Language Model       │      │
│   │                      │        │                      │      │
│   │ - Patch Embedding    │        │ - Token Embedding    │      │
│   │ - Vision Blocks      │  ───>  │ - Qwen2 Layers       │      │
│   │ - Patch Merger       │        │ - RoPE + Attention   │      │
│   │ - RoPE Position      │        │ - MLP                │      │
│   └──────────┬───────────┘        └──────────┬───────────┘      │
│              │                               │                  │
│              └───────────────┬───────────────┘                  │
│                              │                                  │
│                     ┌────────▼────────┐                         │
│                     │ MM Fusion       │                         │
│                     │ (MM Embed)      │                         │
│                     └────────┬────────┘                         │
│                              │                                  │
│                     ┌────────▼────────┐                         │
│                     │ LM Head         │                         │
│                     │ (Logits)        │                         │
│                     └─────────────────┘                         │
└────────────────────────────────────────────────────────────────┘
```
Core Components in Detail

1. Vision Encoder (Qwen2VisionTransformer)

Location: python/sglang/srt/models/qwen2_vl.py:296-418
1.1 Patch Embedding

```python
# File: python/sglang/srt/models/qwen2_vl.py:188-211
class Qwen2VisionPatchEmbed(nn.Module):
    def __init__(self, patch_size=14, temporal_patch_size=2,
                 in_chans=3, embed_dim=1152):
        super().__init__()
        # A 3D convolution handles both images and video frame sequences
        kernel_size = [temporal_patch_size, patch_size, patch_size]
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=kernel_size,
                              stride=kernel_size, bias=False)
```

How it works (see the sketch below):
- A 3D convolution splits the image/video into patches
- The spatial dimensions (H x W) and the temporal dimension (T) are processed together
- Input: (num_patches, channels * patch_size^2 * temporal_patch_size)
- Output: (num_patches, embed_dim)
- Supports dynamic patch counts and temporal depth
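To make the shape bookkeeping concrete, here is a minimal, self-contained sketch (not the SGLang implementation) of pushing a stack of flattened patches through such a Conv3d; the tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
in_chans, temporal_patch_size, patch_size, embed_dim = 3, 2, 14, 1152
num_patches = 16

proj = nn.Conv3d(in_chans, embed_dim,
                 kernel_size=(temporal_patch_size, patch_size, patch_size),
                 stride=(temporal_patch_size, patch_size, patch_size),
                 bias=False)

# Flattened patches, as described above:
# (num_patches, channels * temporal_patch_size * patch_size^2)
x = torch.randn(num_patches, in_chans * temporal_patch_size * patch_size ** 2)

# Un-flatten each patch into a small 3D volume, convolve, then flatten back.
x = x.view(num_patches, in_chans, temporal_patch_size, patch_size, patch_size)
out = proj(x)                      # (num_patches, embed_dim, 1, 1, 1)
out = out.view(num_patches, embed_dim)
print(out.shape)                   # torch.Size([16, 1152])
```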
1.2 Vision Blocks

```python
# File: python/sglang/srt/models/qwen2_vl.py:121-185
class Qwen2VisionBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio, ...):
        self.norm1 = norm_layer(dim)
        self.norm2 = norm_layer(dim)
        self.attn = VisionAttention(...)   # vision attention layer
        self.mlp = Qwen2VisionMLP(...)     # feed-forward MLP

    def forward(self, x, cu_seqlens, position_embeddings):
        # Residual connection + attention
        hidden_states = self.norm1(x)
        hidden_states = rearrange(hidden_states, "s b ... -> b s ...")
        attn = self.attn(hidden_states, cu_seqlens, position_embeddings)
        attn = rearrange(attn, "b s ... -> s b ...")
        x = x + attn
        # Residual connection + MLP
        x = x + self.mlp(self.norm2(x))
        return x
```

Tensor flow:
- Norm1: x -> LayerNorm(x), shape stays (seq_len, batch, embed_dim)
- Rearrange: (seq_len, batch, dim) -> (batch, seq_len, dim)
- Attention:
  - Input: (batch, seq_len, dim)
  - QKV projection: (batch, seq_len, dim) -> (batch, seq_len, 3*dim)
  - Attention scores: QK^T / sqrt(head_dim)
  - Output: (batch, seq_len, dim)
- Rearrange: (batch, seq_len, dim) -> (seq_len, batch, dim)
- Residual connection: x = x + attn
- Norm2 + MLP: same pattern, followed by a two-layer feed-forward transform
1.3 Vision Attention

```python
# File: python/sglang/srt/layers/attention/vision.py:399-688
class VisionAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, ...):
        self.qkv_proj = QKVParallelLinear(...)   # parallel QKV projection
        self.proj = RowParallelLinear(...)       # output projection
        self.qkv_backend = QKV_BACKEND_IMPL[qkv_backend](...)

    def forward(self, x, cu_seqlens, position_embeddings):
        # 1. Produce Q, K, V
        qkv, _ = self.qkv_proj(x)
        q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
        # 2. Apply rotary position embedding (RoPE)
        if position_embeddings is not None:
            cos, sin = position_embeddings
            q, k = apply_rotary_pos_emb(q, k, cos, sin)
        # 3. Compute attention (multiple backends supported)
        output = self.qkv_backend.forward(q, k, v, bsz, seq_len, cu_seqlens)
        # 4. Output projection
        output, _ = self.proj(output)
        return output
```

Supported attention backends:
- sdpa: standard PyTorch scaled dot-product attention
- triton_attn: high-performance attention implemented in Triton
- fa3: Flash Attention 3
- ascend_attn: dedicated implementation for Ascend NPUs

Tensor flow:
Input: (batch, seq_len, embed_dim)
↓
QKV projection: (batch, seq_len, embed_dim) -> (batch, seq_len, 3*embed_dim)
↓
Split: (batch, seq_len, embed_dim) x 3 (Q, K, V)
↓
RoPE: rotary position embedding
↓
Attention: softmax(QK^T / sqrt(head_dim)) @ V
↓
Output projection: (batch, seq_len, embed_dim)
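As a reference point for what any of these backends computes, here is a minimal multi-head attention sketch built directly on `torch.nn.functional.scaled_dot_product_attention` (conceptually the `sdpa` path). It ignores `cu_seqlens` batching and tensor parallelism; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

batch, seq_len, embed_dim, num_heads = 2, 64, 1152, 16
head_dim = embed_dim // num_heads

# Pretend these came out of the QKV projection and split above.
q = torch.randn(batch, seq_len, embed_dim)
k = torch.randn(batch, seq_len, embed_dim)
v = torch.randn(batch, seq_len, embed_dim)

def to_heads(t):
    # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
    return t.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

out = F.scaled_dot_product_attention(to_heads(q), to_heads(k), to_heads(v))
out = out.transpose(1, 2).reshape(batch, seq_len, embed_dim)
print(out.shape)  # torch.Size([2, 64, 1152])
```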
1.4 Patch Merger

```python
# File: python/sglang/srt/models/qwen2_vl.py:214-258
class Qwen2VisionPatchMerger(nn.Module):
    def __init__(self, d_model, context_dim, spatial_merge_size=2):
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = norm_layer(context_dim)
        self.mlp = nn.ModuleList([
            ColumnParallelLinear(self.hidden_size, self.hidden_size, ...),
            nn.GELU(),
            RowParallelLinear(self.hidden_size, d_model, ...),
        ])

    def forward(self, x):
        # Normalize
        x = self.ln_q(x)
        # Reshape so that every group of spatial_merge_size^2 patches
        # becomes one row of size spatial_merge_size^2 * context_dim
        x = x.view(-1, self.hidden_size)
        # MLP: two linear layers with a GELU activation in between
        x_parallel, _ = mlp_fc1(x)
        x_parallel = mlp_act(x_parallel)
        out, _ = mlp_fc2(x_parallel)
        return out
```

How it works (see the sketch below):
- Merges groups of adjacent patches into a single larger token
- Uses an MLP to fuse the features and project to the language model dimension
- Input: (num_patches, batch, context_dim)
- Output: (num_patches / spatial_merge_size^2, batch, d_model)
- Shortens the sequence, making downstream processing cheaper
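A tiny shape-only sketch of the merge step, using plain `nn.Linear` in place of the parallel linear layers; the `spatial_merge_size`, `context_dim`, and `d_model` values are illustrative assumptions.

```python
import torch
import torch.nn as nn

context_dim, d_model, spatial_merge_size = 1280, 3584, 2
hidden_size = context_dim * spatial_merge_size ** 2   # 4 patches -> 1 token

ln_q = nn.LayerNorm(context_dim)
mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                    nn.GELU(),
                    nn.Linear(hidden_size, d_model))

x = torch.randn(64, context_dim)        # 64 patch features from the ViT
x = ln_q(x).view(-1, hidden_size)       # (16, 4 * context_dim)
out = mlp(x)                            # (16, d_model)
print(out.shape)                        # torch.Size([16, 3584])
```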
1.5 Rotary Position Embedding

```python
# File: python/sglang/srt/models/qwen2_vl.py:261-293
class Qwen2VisionRotaryEmbedding(nn.Module):
    def __init__(self, dim, theta=10000.0):
        self.dim = dim
        self.theta = theta
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2) / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seqlen):
        # Compute the rotation frequencies
        seq = torch.arange(seqlen, device=self.inv_freq.device)
        freqs = torch.outer(seq, self.inv_freq)
        return freqs  # later turned into cos/sin terms
```

Position-ID construction:

```python
# Inside the Vision Transformer
def rot_pos_emb(self, grid_thw):
    # grid_thw: (num_images, 3), holding (grid_t, grid_h, grid_w)
    # Generate an (h, w) coordinate for every patch
    pos_ids = []
    for i in range(grid_thw.size(0)):
        t, h, w = grid_thw[i]
        hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
        wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
        pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1))
    # Look up the rotary embedding for each coordinate
    rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
    rotary_pos_emb = rotary_pos_emb_full[pos_ids]
    return rotary_pos_emb
```
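The following standalone sketch (simplified relative to the real `rot_pos_emb`, which also accounts for spatial merging) shows how 2D (h, w) coordinates index into a precomputed frequency table; the per-axis rotary dimension and grid size are assumptions.

```python
import torch

dim, theta = 40, 10000.0                   # per-axis rotary dim (assumed)
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2) / dim))

t, h, w = 1, 4, 6                          # one image split into a 4x6 patch grid
hpos = torch.arange(h).unsqueeze(1).expand(h, w)   # row index of every patch
wpos = torch.arange(w).unsqueeze(0).expand(h, w)   # column index of every patch
pos_ids = torch.stack([hpos, wpos], dim=-1).reshape(-1, 2)   # (h*w, 2)

max_grid = max(h, w)
freqs_full = torch.outer(torch.arange(max_grid), inv_freq)   # (max_grid, dim/2)

# One frequency vector per axis, concatenated -> (h*w, dim)
rotary = freqs_full[pos_ids].flatten(1)
print(rotary.shape)   # torch.Size([24, 40])
```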
2. Language Model (Qwen2Model)

Location: python/sglang/srt/models/qwen2.py:260-384

2.1 Token Embedding

```python
# File: python/sglang/srt/models/qwen2.py:275-289
class Qwen2Model(nn.Module):
    def __init__(self, config, ...):
        if self.pp_group.is_first_rank:
            self.embed_tokens = VocabParallelEmbedding(
                config.vocab_size,
                config.hidden_size,
                # vocabulary-parallel embedding
            )

    def forward(self, input_ids, ...):
        hidden_states = self.embed_tokens(input_ids)
        # (batch, seq_len) -> (batch, seq_len, hidden_size)
```

Vocabulary-parallel strategy (see the sketch below):
- The vocabulary is partitioned across GPUs
- Each GPU holds the embedding rows for its slice of the vocabulary
- Input: (batch, seq_len) token IDs
- Output: (batch, seq_len, hidden_size) embedding vectors
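A minimal sketch of the idea behind vocabulary-parallel embedding (not SGLang's `VocabParallelEmbedding`): each rank masks out tokens outside its vocabulary shard, looks up only its local rows, and an all-reduce sums the partial results. The sizes are assumed, and the all-reduce call is left as a comment because this snippet runs on a single process.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, tp_size, tp_rank = 152064, 3584, 4, 1   # assumed values
shard = vocab_size // tp_size
lo, hi = tp_rank * shard, (tp_rank + 1) * shard

# This rank only stores embedding rows [lo, hi).
local_embed = nn.Embedding(shard, hidden_size)

input_ids = torch.randint(0, vocab_size, (2, 8))         # (batch, seq_len)
in_shard = (input_ids >= lo) & (input_ids < hi)           # tokens this rank owns
local_ids = (input_ids - lo).clamp(0, shard - 1)

out = local_embed(local_ids)                              # (batch, seq_len, hidden)
out = out * in_shard.unsqueeze(-1)                        # zero rows we don't own
# torch.distributed.all_reduce(out)  # summing shards yields the full embeddings
print(out.shape)                                          # torch.Size([2, 8, 3584])
```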
2.2 Qwen2 Decoder Layer

```python
# File: python/sglang/srt/models/qwen2.py:192-257
class Qwen2DecoderLayer(nn.Module):
    def __init__(self, config, layer_id, ...):
        # Self-attention
        self.self_attn = Qwen2Attention(...)
        # Feed-forward MLP
        self.mlp = Qwen2MLP(...)
        # Layer normalization
        self.input_layernorm = RMSNorm(...)
        self.post_attention_layernorm = RMSNorm(...)

    def forward(self, positions, hidden_states, forward_batch, residual):
        # Pre-norm architecture
        if residual is None:
            residual = hidden_states
            hidden_states = self.input_layernorm(hidden_states)
        else:
            hidden_states, residual = self.input_layernorm(hidden_states, residual)
        # Self-attention
        hidden_states = self.self_attn(positions, hidden_states, forward_batch)
        # Post-attention norm
        hidden_states, residual = self.post_attention_layernorm(
            hidden_states, residual
        )
        hidden_states = self.mlp(hidden_states)
        return hidden_states, residual
```

Tensor flow:
Input: (batch, seq_len, hidden_size)
↓
RMSNorm: (batch, seq_len, hidden_size) [with fused residual handling]
↓
QKV projection: (batch, seq_len, hidden_size) -> (batch, seq_len, (num_heads + 2*num_kv_heads) * head_dim)
↓
RoPE: applied to Q and K
↓
Attention: softmax(QK^T / sqrt(head_dim)) @ V (with KV cache)
↓
Output projection: (batch, seq_len, hidden_size)
↓
Residual connection: x = x + attention_output
↓
MLP: (batch, seq_len, hidden_size)
↓
Residual connection: x = x + mlp_output
↓
Output: (batch, seq_len, hidden_size)
2.3 Qwen2 Attention

```python
# File: python/sglang/srt/models/qwen2.py:102-189
class Qwen2Attention(nn.Module):
    def __init__(self, hidden_size, num_heads, num_kv_heads, ...):
        # Parallel QKV projection
        self.qkv_proj = QKVParallelLinear(hidden_size, head_dim,
                                          total_num_heads, total_num_kv_heads)
        # Output projection
        self.o_proj = RowParallelLinear(total_num_heads * head_dim, hidden_size)
        # RoPE position embedding
        self.rotary_emb = get_rope(head_dim, ...)
        # RadixAttention (attention with radix-tree prefix caching)
        self.attn = RadixAttention(num_heads, head_dim, ...)

    def forward(self, positions, hidden_states, forward_batch):
        # Produce Q, K, V
        qkv, _ = self.qkv_proj(hidden_states)
        q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
        # Apply RoPE
        q, k = self.rotary_emb(positions, q, k)
        # Attention (with radix-tree prefix reuse)
        attn_output = self.attn(q, k, v, forward_batch)
        # Output projection
        output, _ = self.o_proj(attn_output)
        return output
```

GQA (Grouped Query Attention):
- Uses different head counts for queries and for keys/values (see the sketch after this list)
- num_heads > num_kv_heads, which shrinks the KV cache footprint
- Example: num_heads=32, num_kv_heads=8 means every 4 query heads share 1 KV head
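A minimal GQA sketch: each KV head is broadcast to its group of query heads before a standard attention call. Head counts are illustrative, and production kernels avoid the explicit `repeat_interleave` copy.

```python
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 128
num_heads, num_kv_heads = 32, 8
group = num_heads // num_kv_heads          # 4 query heads per KV head

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand KV heads so that each group of query heads attends to the same K/V.
k = k.repeat_interleave(group, dim=1)      # (batch, num_heads, seq_len, head_dim)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                           # torch.Size([2, 32, 16, 128])
```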
RoPE (Rotary Position Embedding):

```
# Rotation angle for dimension pair i at position pos
f(pos, i) = pos / (10000 ^ (2i / d))
# 2x2 rotation matrix applied to each dimension pair
R = [[cos(f), -sin(f)],
     [sin(f),  cos(f)]]
# Applied to q and k
q_rotated = R @ q
k_rotated = R @ k
```
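In practice the rotation is usually applied with the `rotate_half` trick rather than explicit 2x2 matrices. A minimal sketch with assumed shapes (real implementations differ in details such as interleaved vs. half-rotated layouts and cached cos/sin tables):

```python
import torch

def rotate_half(x):
    # Swap and negate the second half: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, head_dim, theta=10000.0):
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2) / head_dim))
    freqs = torch.outer(positions.float(), inv_freq)   # (seq_len, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)            # (seq_len, head_dim)
    cos, sin = emb.cos(), emb.sin()
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

seq_len, head_dim = 8, 128
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
q_rot, k_rot = apply_rope(q, k, torch.arange(seq_len), head_dim)
print(q_rot.shape, k_rot.shape)   # torch.Size([8, 128]) torch.Size([8, 128])
```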
2.4 Qwen2 MLP

```python
# File: python/sglang/srt/models/qwen2.py:61-99
class Qwen2MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size, ...):
        # Gate + up projection, merged into one matmul
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size,
            [intermediate_size] * 2,  # [gate, up]
            bias=False
        )
        self.down_proj = RowParallelLinear(
            intermediate_size,
            hidden_size,
            bias=False
        )
        self.act_fn = SiluAndMul()  # gated SiLU activation

    def forward(self, x):
        gate_up, _ = self.gate_up_proj(x)
        x = self.act_fn(gate_up)
        x, _ = self.down_proj(x)
        return x
```

Tensor flow:
Input: (batch, seq_len, hidden_size)
↓
Gate + up projection: (batch, seq_len, hidden_size)
                   -> (batch, seq_len, 2*intermediate_size)
↓
Split: [gate, up], each (batch, seq_len, intermediate_size)
↓
Activation: SiLU(gate) * up
↓
Down projection: (batch, seq_len, intermediate_size)
              -> (batch, seq_len, hidden_size)
↓
Output: (batch, seq_len, hidden_size)
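The same SwiGLU computation in plain PyTorch, for reference (sizes are assumed; `SiluAndMul` fuses the split-activate-multiply step):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, intermediate_size = 3584, 18944    # illustrative sizes
gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

x = torch.randn(2, 16, hidden_size)             # (batch, seq_len, hidden_size)
gate, up = gate_up_proj(x).chunk(2, dim=-1)     # split the merged projection
y = down_proj(F.silu(gate) * up)                # SiLU(gate) * up, then project down
print(y.shape)                                  # torch.Size([2, 16, 3584])
```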
3. Multimodal Fusion

Location: python/sglang/srt/models/qwen2_vl.py:424-564

3.1 Forward Pass

```python
# File: python/sglang/srt/models/qwen2_vl.py:520-564
class Qwen2VLForConditionalGeneration(nn.Module):
    def forward(self, input_ids, positions, forward_batch, get_embedding=False):
        # 1. Handle multimodal inputs (via general_mm_embed_routine)
        # 2. Run the language model
        hidden_states = general_mm_embed_routine(
            input_ids=input_ids,
            forward_batch=forward_batch,
            language_model=self.model,
            multimodal_model=self,
            positions=positions,
        )
        # 3. Produce logits (or pooled embeddings)
        if get_embedding:
            return self.pooler(hidden_states, forward_batch)
        else:
            return self.logits_processor(
                input_ids, hidden_states, self.lm_head, forward_batch
            )
```
3.2 Image Feature Extraction

```python
# File: python/sglang/srt/models/qwen2_vl.py:484-493
def get_image_feature(self, items: List[MultimodalDataItem]) -> torch.Tensor:
    # 1. Concatenate the pixel values of all images
    pixel_values = torch.cat([item.feature for item in items], dim=0)
    # 2. Gather the grid dimensions (grid_t, grid_h, grid_w) of each image
    image_grid_thw = torch.concat([item.image_grid_thw for item in items], dim=0)
    # 3. Run the vision encoder
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
    return image_embeds
```
3.3 general_mm_embed_routine

Location: python/sglang/srt/managers/mm_utils.py:636-709

```python
def general_mm_embed_routine(input_ids, forward_batch, language_model,
                             multimodal_model, positions, **kwargs):
    # 1. Decide whether multimodal inputs need to be embedded
    if (not forward_batch.forward_mode.is_decode()
            and forward_batch.contains_mm_inputs()):
        # 2. Embed the multimodal data
        inputs_embeds, other_info = embed_mm_inputs(
            mm_inputs_list=mm_inputs_list,
            extend_prefix_lens=extend_prefix_lens,
            extend_seq_lens=extend_seq_lens,
            input_ids=input_ids,
            multimodal_model=multimodal_model,
            input_embedding=embed_tokens,
            data_embedding_func_mapping={
                Modality.IMAGE: multimodal_model.get_image_feature,
                Modality.VIDEO: multimodal_model.get_video_feature,
            },
        )
        forward_batch.mm_inputs = None
    else:
        # Text-only case
        inputs_embeds = embed_tokens(input_ids)
    # 3. Run the language model
    hidden_states = language_model(
        input_ids=None,  # use inputs_embeds instead of input_ids
        forward_batch=forward_batch,
        input_embeds=inputs_embeds,
        **kwargs
    )
    return hidden_states
```
Tensor fusion flow:
Image input:
  pixel_values: (num_patches, channels * patch_size^2 * temporal_patch_size)
  grid_thw: (num_images, 3)
↓
Vision encoder output: (num_image_tokens, hidden_size)
↓
Merged into the text embeddings at the image-token positions: (seq_len, hidden_size)
↓
Language model
↓
Logits: (seq_len, vocab_size)
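The merge step itself is conceptually a masked scatter: wherever `input_ids` equals the image placeholder token, the text embedding row is replaced by the next image embedding row. A minimal sketch with assumed sizes and an assumed `image_token_id`:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, image_token_id = 3584, 152064, 151655   # assumed values
embed_tokens = nn.Embedding(vocab_size, hidden_size)

# A short sequence: some text tokens and 4 image placeholder tokens.
input_ids = torch.tensor([9906, image_token_id, image_token_id,
                          image_token_id, image_token_id, 11])
image_embeds = torch.randn(4, hidden_size)        # output of the vision encoder

with torch.no_grad():
    inputs_embeds = embed_tokens(input_ids)       # (seq_len, hidden_size)
    mask = input_ids == image_token_id            # where the image tokens sit
    inputs_embeds[mask] = image_embeds            # scatter image features in place
print(inputs_embeds.shape)                        # torch.Size([6, 3584])
```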
Inference Flow

Full forward pass

```
# 1. Input handling
input_ids = [<text>, <image_token>, <image_token>, ..., <text>]
positions = [0, 1, 2, ..., seq_len-1]
forward_batch = ForwardBatch(...)  # batch metadata plus multimodal data

# 2. Embedding stage
#    a. text tokens -> embeddings
#    b. images -> vision encoder -> embeddings
#    c. merge the multimodal embeddings
inputs_embeds = general_mm_embed_routine(...)
# shape: (batch, seq_len, hidden_size)

# 3. Language model forward pass (simplified: layer norms omitted)
hidden_states = inputs_embeds
for layer in model.layers:
    # 3.1 Self-attention
    attn_output = layer.self_attn(positions, hidden_states, forward_batch)
    hidden_states = hidden_states + attn_output  # residual connection
    # 3.2 FFN
    mlp_output = layer.mlp(hidden_states)
    hidden_states = hidden_states + mlp_output   # residual connection

# 4. Output head
hidden_states = model.norm(hidden_states)
logits = lm_head(hidden_states)

# 5. Post-processing
output = logits_processor(logits, forward_batch)
```
Supported modes

Prefill
- Processes the full prompt in parallel
- Input: (batch, prefix_len)
- KV cache: compute and store KV for the entire prefix
- Output: (batch, prefix_len, vocab_size)

Decode
- Autoregressive generation, one token at a time
- Input: (batch, 1)
- KV cache: reuse the cached prefix KV and append one new entry
- Output: (batch, 1, vocab_size), the distribution over the next token

Extend
- Extends an existing cached prefix with new tokens
- Input: (batch, extend_len)
- KV cache: append the new KV entries to the cache
- Output: (batch, extend_len, vocab_size)
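A toy sketch of the KV-cache bookkeeping across these modes, using one pre-allocated buffer per layer. This only loosely mirrors the real token-level cache pools; the sizes and flat slot layout are assumptions.

```python
import torch

num_layers, max_tokens, num_kv_heads, head_dim = 2, 32, 8, 128

# One flat K and V buffer per layer: (max_tokens, num_kv_heads, head_dim)
k_cache = [torch.zeros(max_tokens, num_kv_heads, head_dim) for _ in range(num_layers)]
v_cache = [torch.zeros(max_tokens, num_kv_heads, head_dim) for _ in range(num_layers)]

def write_kv(layer_id, slots, k, v):
    # slots: indices of the cache rows assigned to these tokens
    k_cache[layer_id][slots] = k
    v_cache[layer_id][slots] = v

# Prefill: write KV for the whole 5-token prefix into slots 0..4.
prefix_slots = torch.arange(5)
write_kv(0, prefix_slots, torch.randn(5, num_kv_heads, head_dim),
                          torch.randn(5, num_kv_heads, head_dim))

# Decode: append the new token's KV at slot 5 and attend over slots 0..5.
write_kv(0, torch.tensor([5]), torch.randn(1, num_kv_heads, head_dim),
                               torch.randn(1, num_kv_heads, head_dim))
print(k_cache[0][:6].shape)   # torch.Size([6, 8, 128])
```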
Parallelism Strategies

1. Tensor Parallelism

In Qwen2Attention:

```python
# File: python/sglang/srt/models/qwen2.py:119-132
class Qwen2Attention(nn.Module):
    def __init__(self, ...):
        tp_size = get_tensor_model_parallel_world_size()
        # Number of heads handled by each TP rank
        self.num_heads = self.total_num_heads // tp_size
        self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size)
        # QKV projection (column-parallel)
        self.qkv_proj = QKVParallelLinear(
            hidden_size,
            self.head_dim,
            self.total_num_heads,
            self.total_num_kv_heads,
        )
        # Output projection (row-parallel)
        self.o_proj = RowParallelLinear(
            self.total_num_heads * self.head_dim,
            hidden_size,
        )
```
QKV projection (column-parallel):

```
# Input: (batch, seq_len, hidden_size), replicated on every TP rank
# The attention heads are split across ranks, e.g. with 2 ranks:
#   GPU 0: Q/K/V for heads 0 .. num_heads/2 - 1
#   GPU 1: Q/K/V for heads num_heads/2 .. num_heads - 1
# Each rank computes attention only for its own heads
```

Output projection (row-parallel):

```
# Input: (batch, seq_len, num_heads_per_tp * head_dim), different on every rank
# Each rank multiplies by its slice of the o_proj weight, producing a partial sum
# All-Reduce: sums the partial outputs across ranks
# Final output: (batch, seq_len, hidden_size), identical on every rank
```
In VisionAttention:

```python
# File: python/sglang/srt/layers/attention/vision.py:412-520
class VisionAttention(nn.Module):
    def __init__(self, ...):
        attn_tp_rank = get_attention_tp_rank()
        attn_tp_size = get_attention_tp_size()
        # Parallel QKV projection
        self.qkv_proj = QKVParallelLinear(
            hidden_size=embed_dim,
            head_size=self.head_size,
            total_num_heads=num_heads,
            total_num_kv_heads=num_heads,
            tp_rank=attn_tp_rank,
            tp_size=attn_tp_size,
        )
        # Parallel output projection
        self.proj = RowParallelLinear(
            input_size=self.dummy_dim,
            output_size=embed_dim,
            tp_rank=attn_tp_rank,
            tp_size=attn_tp_size,
        )
```
In the Linear layers (simplified sketches of the idea, not the actual implementations):

```python
# Column-parallel Linear: the weight is split along the *output* dimension.
class ColumnParallelLinear:
    def forward(self, input):
        # Input: (batch, seq_len, input_size), replicated on every rank
        # Each rank multiplies by its weight slice and produces a slice of
        # the output columns: (batch, seq_len, output_size / tp_size)
        local_output = F.linear(input, local_weight, local_bias)
        # Optionally all-gather to reconstruct the full output
        return local_output, None

# Row-parallel Linear: the weight is split along the *input* dimension.
class RowParallelLinear:
    def forward(self, input):
        # Input: (batch, seq_len, input_size / tp_size), already split across ranks
        # Each rank computes a partial result of the full output size
        local_output = F.linear(input, local_weight)
        # All-Reduce sums the partial results across ranks (in place)
        torch.distributed.all_reduce(local_output)
        return local_output, None
```
2. Pipeline Parallelism

```python
# File: python/sglang/srt/models/qwen2.py:273-305
class Qwen2Model(nn.Module):
    def __init__(self, config, ...):
        self.pp_group = get_pp_group()
        # Only the first PP rank holds the embedding
        if self.pp_group.is_first_rank:
            self.embed_tokens = VocabParallelEmbedding(...)
        else:
            self.embed_tokens = PPMissingLayer()
        # Only the last PP rank holds the final norm
        if self.pp_group.is_last_rank:
            self.norm = RMSNorm(config.hidden_size)
        else:
            self.norm = PPMissingLayer(return_tuple=True)
```

Pipeline stages (example with 12 layers on 4 GPUs):

Stage 0 (GPU 0): Embedding -> Layers 0-2
Stage 1 (GPU 1): Layers 3-5
Stage 2 (GPU 2): Layers 6-8
Stage 3 (GPU 3): Layers 9-11 -> Norm -> Output
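A small helper sketch showing one straightforward way to assign a contiguous block of layers to each pipeline stage (SGLang's actual partitioning logic may differ; this only illustrates the arithmetic behind the layout above):

```python
def partition_layers(num_layers: int, pp_size: int, pp_rank: int) -> range:
    """Return the contiguous range of layer indices owned by one PP rank."""
    base, rem = divmod(num_layers, pp_size)
    # The first `rem` stages each get one extra layer.
    start = pp_rank * base + min(pp_rank, rem)
    end = start + base + (1 if pp_rank < rem else 0)
    return range(start, end)

# Example: 12 layers on 4 stages -> 0-2, 3-5, 6-8, 9-11
for rank in range(4):
    print(rank, list(partition_layers(12, 4, rank)))
```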
3. KV Cache Management

```python
# File: python/sglang/srt/layers/radix_attention.py:43-127
class RadixAttention(nn.Module):
    def forward(self, q, k, v, forward_batch, save_kv_cache=True):
        # Delegated to the configured attention backend
        return forward_batch.attn_backend.forward(
            q, k, v, self, forward_batch, save_kv_cache
        )
```

KV cache layout:

```python
# Each layer maintains its own KV cache
kv_cache[layer_id] = {
    'k': (num_tokens, num_kv_heads, head_dim),
    'v': (num_tokens, num_kv_heads, head_dim),
}
# Prefill: fill the KV for the whole prefix
# Decode:  append the KV of the new token
# Extend:  append the KV of the extended span
```
4. Radix Tree Optimization

```python
# RadixAttention organizes the KV cache with a radix tree, so sequences
# that share a prefix also share that prefix's KV entries.
#
# Sequence 1: [A, B, C, D, E] (prefix) + [F] (new)
# Sequence 2: [A, B, C, D, E] (prefix) + [G] (new)
#
# KV cache:
# - Shared: KV for [A, B, C, D, E] (stored once, used by both sequences)
# - Unique: KV for [F] and [G] respectively
```

Benefits (illustrated by the sketch below):
- Avoids recomputing shared prefixes
- Saves memory by sharing the KV of common prefixes
- Improves throughput for batched requests with overlapping prompts
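A toy illustration of the saving (not of the radix-tree data structure itself): count how many KV entries two requests with a common prompt prefix need with and without sharing. The token values are made up.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

seq1 = [101, 202, 303, 404, 505, 606]   # [A, B, C, D, E] + [F]
seq2 = [101, 202, 303, 404, 505, 707]   # [A, B, C, D, E] + [G]

shared = shared_prefix_len(seq1, seq2)
without_sharing = len(seq1) + len(seq2)
with_sharing = shared + (len(seq1) - shared) + (len(seq2) - shared)
print(shared, without_sharing, with_sharing)   # 5 12 7
```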
Code Implementation Map

1. Key files

| Module | File | Class / Function |
|---|---|---|
| Vision encoder | srt/models/qwen2_vl.py | Qwen2VisionTransformer |
| Vision block | srt/models/qwen2_vl.py | Qwen2VisionBlock |
| Vision attention | srt/layers/attention/vision.py | VisionAttention |
| Language model | srt/models/qwen2.py | Qwen2Model |
| Decoder layer | srt/models/qwen2.py | Qwen2DecoderLayer |
| Language attention | srt/models/qwen2.py | Qwen2Attention |
| Multimodal fusion | srt/managers/mm_utils.py | general_mm_embed_routine |
| Attention backend | srt/layers/radix_attention.py | RadixAttention |
2. Data flow

User input (image + text)
↓
MultimodalProcessor.process()
↓
BaseMultiModalProcessorOutput
↓
MultimodalInputs
↓
ForwardBatch.mm_inputs
↓
forward() -> get_image_feature()
↓
Qwen2VisionTransformer
↓ (produces image embeddings)
Fed into the language_model
↓
Qwen2Model (stack of Transformer layers)
↓
LogitsProcessor
↓
Output logits
3. Key call chain

```
# 1. Forward entry point
Qwen2VLForConditionalGeneration.forward()
↓
# 2. Multimodal embedding
general_mm_embed_routine()
↓
# 3. Image feature extraction
get_image_feature() -> Qwen2VisionTransformer.forward()
↓
# 4. Vision encoder pipeline
Qwen2VisionPatchEmbed -> Qwen2VisionBlock (repeated) -> Qwen2VisionPatchMerger
↓
# 5. Language model pipeline
Qwen2Model.forward()
↓
Qwen2DecoderLayer.forward() (per layer)
↓
Qwen2Attention.forward() -> RadixAttention.forward()
↓
Qwen2MLP.forward()
↓
# 6. Output head
lm_head -> logits
```
4. Parallelism primitives

```python
# Key functions for tensor parallelism
from sglang.srt.distributed import (
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
    tensor_model_parallel_all_gather,
    split_tensor_along_last_dim,
)

# Key function for pipeline parallelism
from sglang.srt.distributed import get_pp_group

# Inside the attention module
class Qwen2Attention(nn.Module):
    def __init__(self, ...):
        tp_size = get_tensor_model_parallel_world_size()
        self.num_heads = num_heads // tp_size
        # TP-sharded QKV projection
        self.qkv_proj = QKVParallelLinear(...)
        # TP-sharded output projection
        self.o_proj = RowParallelLinear(...)
        # RadixAttention (efficient attention with prefix caching)
        self.attn = RadixAttention(...)
```
5. Attention backend selection

```python
# File: python/sglang/srt/layers/attention/vision.py:522-547
def _determine_attention_backend(self, passed_backend):
    # 1. The server-args override takes priority
    override_backend = get_global_server_args().mm_attention_backend
    if override_backend is not None:
        backend = override_backend
    # 2. Then the constructor argument
    elif passed_backend is not None:
        backend = passed_backend
    # 3. Otherwise pick a platform default
    elif is_cuda():
        major, minor = get_device_capability()
        if major == 9:  # Hopper
            backend = "fa3"          # Flash Attention 3
        else:
            backend = "triton_attn"  # Triton attention
    else:
        backend = "sdpa"             # standard PyTorch attention
    return backend
```
Summary

The Qwen2-VL architecture has the following characteristics:
- Two-tower design: the vision encoder and the language model are separate, so each can be optimized independently
- Efficient attention: multiple attention backends (Flash Attention, Triton, SDPA)
- Parallel computation: tensor parallelism and pipeline parallelism make full use of multi-GPU setups
- KV cache: radix-tree optimization enables efficient incremental decoding with prefix reuse
- Multimodal fusion: visual and text features are merged naturally at the embedding layer
- Extensible: the modular design makes it straightforward to add new modalities (audio, video, etc.)