Salesforce LAVIS Project: A Hands-On Guide to Multimodal Feature Extraction with the BLIP2 Model

Article link: https://blog.youkuaiyun.com/gitblog_00786/article/details/148415413

LAVIS: A One-stop Library for Language-Vision Intelligence. Project address: https://gitcode.com/gh_mirrors/la/LAVIS

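1. Project Background and Introduction to BLIP2

Salesforce LAVIS is a library for language-vision (multimodal) AI, and BLIP2 (Bootstrapping Language-Image Pre-training with frozen image encoders and large language models) is one of its core models. BLIP2 learns to relate visual and textual information without retraining the large backbone models it builds on.

The main strengths of BLIP2 are:

A two-stage pre-training strategy: the first stage learns vision-language representations on top of a frozen image encoder, and the second stage learns vision-to-language generation on top of a frozen large language model
A lightweight Querying Transformer (Q-Former) that bridges the visual and language modalities
Support for different combinations of frozen image encoders (ViT variants, for example the one from CLIP) and language models (for example OPT and FlanT5)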
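
To make the bridging idea more concrete, here is a minimal, dependency-free sketch of how a fixed set of query vectors can summarize a variable number of image patch features through cross-attention. It is a toy illustration only, not the actual Q-Former: there are no learned projections, no multiple heads, and no stacked layers, and the names `cross_attend`, `DIM`, and `NUM_QUERIES` are made up for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patches):
    """Single-head cross-attention without learned projections (illustrative only).

    Each query attends over all patch features and returns a weighted
    average of them, so the output has exactly one row per query.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        weights = softmax(scores)
        outputs.append([sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)])
    return outputs


random.seed(0)
DIM = 8          # toy feature size; the real model uses much larger vectors
NUM_QUERIES = 4  # toy stand-in for the model's fixed set of queries

queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
few_patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]
many_patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]

out_few = cross_attend(queries, few_patches)
out_many = cross_attend(queries, many_patches)

# The output size depends on the number of queries, not on the number of patches
print(len(out_few), len(out_few[0]))    # 4 8
print(len(out_many), len(out_many[0]))  # 4 8
```

The point of the sketch is the shape behavior: however many patch features the image encoder produces, the bridge hands a fixed-size summary to the language side.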

2. Environment Setup and Model Loading

Before extracting features, we need to set up the basic environment and load the model:

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Select the compute device (use the GPU when one is available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the BLIP2 feature-extraction model and its preprocessors
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor",
    model_type="pretrain",
    is_eval=True,
    device=device
)

Key parameters:

name: the model name, here "blip2_feature_extractor"
model_type: the pretrained model variant to load, here "pretrain"
is_eval: True puts the model in evaluation mode
device: the device the model runs on

3. Data Preprocessing

BLIP2 expects inputs in a specific format, so both the image and the text are preprocessed first:

# Load the image, preprocess it, add a batch dimension, and move it to the device
raw_image = Image.open("merlion.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Preprocess the text
caption = "a large fountain spewing water into the air"
text_input = txt_processors["eval"](caption)

# Build the input sample for the model
sample = {"image": image, "text_input": [text_input]}

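Preprocessing does the following:

Resizes and normalizes the image and converts it to a PyTorch tensor; unsqueeze(0) then adds the batch dimension and .to(device) moves the tensor to the selected device
Cleans up the caption text, which remains a plain string; tokenization and special tokens are applied later, inside the model
Packs the image tensor and the list of captions into a single sample dictionary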
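
The following dependency-free sketch shows roughly what these two preprocessing steps amount to for a single pixel and a single caption. It is an approximation written for illustration, not the LAVIS processor code: the normalization constants are the CLIP-style values that the LAVIS image processors are assumed to use by default, and `normalize_pixel` and `clean_caption` are hypothetical helper names.

```python
import re

# Assumed CLIP-style per-channel constants; check the processor config of your LAVIS version
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


def clean_caption(caption, max_words=50):
    """Hypothetical caption cleanup: lowercase, drop some punctuation,
    collapse repeated whitespace, and truncate to max_words words."""
    text = re.sub(r"[.!\"()*#:;~]", " ", caption.lower())
    text = re.sub(r"\s{2,}", " ", text).strip()
    return " ".join(text.split(" ")[:max_words])


print(normalize_pixel((255, 255, 255)))
print(clean_caption("A large fountain, spewing  water into the air!"))
# a large fountain, spewing water into the air
```

Note that the cleaned caption is still a string. This matches the sample built above, where "text_input" is a list of strings rather than a tensor of token IDs.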

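4. Multimodal Feature Extraction in Practice

4.1 Extracting joint multimodal features

Multimodal features capture the interaction between the image and the text: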

features_multimodal = model.extract_features(sample)
print(features_multimodal.multimodal_embeds.shape)
# Output: torch.Size([1, 32, 768])

What each dimension means:

1: the batch size
32: the number of query vectors (fixed by the model)
768: the size of each feature vector
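
Downstream code often wants one vector per sample rather than 32. One simple option, which is an assumption of this sketch and not something the original article prescribes, is to average over the query dimension. The toy example below uses plain Python lists (3 queries of size 4 instead of 32 of size 768) so that it runs without loading the model; `mean_pool` is a hypothetical helper name.

```python
def mean_pool(tokens):
    """Average a list of equal-length vectors into a single vector."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


# Toy stand-in for one sample of multimodal_embeds: 3 query vectors of size 4
sample_embeds = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]

pooled = mean_pool(sample_embeds)
print(pooled)  # [2.0, 2.0, 2.0, 2.0]
```

With the real tensor, the equivalent operation would be features_multimodal.multimodal_embeds.mean(dim=1), which yields a tensor of shape [1, 768]. Whether averaging is the right pooling choice depends on the downstream task.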

4.2 Extracting unimodal features

We can also extract image features or text features on their own:

# Extract image features
features_image = model.extract_features(sample, mode="image")
print(features_image.image_embeds.shape)
# Output: torch.Size([1, 32, 768])

# Extract text features
features_text = model.extract_features(sample, mode="text")
print(features_text.text_embeds.shape)
# Output: torch.Size([1, 12, 768])

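Note:

The image features have the same shape as the multimodal features
The 12 in the text features is the number of text tokens, which changes with the length of the input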

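4.3 Low-dimensional projected features and similarity computation

BLIP2 also provides lower-dimensional projected features, which are convenient for computing cross-modal similarity: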

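# Low-dimensional projected features
print(features_image.image_embeds_proj.shape)
# Output: torch.Size([1, 32, 256])
print(features_text.text_embeds_proj.shape)
# Output: torch.Size([1, 12, 256])

# Image-text similarity: score every image query against the first text token and keep the maximum
similarity = (features_image.image_embeds_proj @ features_text.text_embeds_proj[:,0,:].t()).max()
print(similarity)
# Output: tensor(0.3642), a scalar tensor (on GPU the printout also shows the device)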
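
To unpack the similarity expression: it takes the projected feature of the first text token (index 0), computes its dot product with each of the 32 projected image query vectors, and keeps the largest score. The dependency-free sketch below reproduces that logic on toy vectors (3 queries of size 2 instead of 32 of size 256). It normalizes the vectors explicitly so that each dot product is a cosine similarity; this assumes the projected features returned by LAVIS are already L2-normalized, which is worth verifying for your version. The helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def image_text_similarity(image_queries, text_first_token):
    """Score each image query vector against the text vector; return (best, all scores)."""
    text_unit = l2_normalize(text_first_token)
    scores = [dot(l2_normalize(q), text_unit) for q in image_queries]
    return max(scores), scores


# Toy stand-ins for image_embeds_proj[0] and text_embeds_proj[0, 0]
image_queries = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
text_first_token = [6.0, 8.0]

best, scores = image_text_similarity(image_queries, text_first_token)
print(scores)  # one score per query vector
print(best)    # the maximum, used as the image-text similarity
```

Taking the maximum means the image-text score is driven by the single query vector that best matches the text, rather than by an average over all queries.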

Key technical points: