Handling Multimodal Tasks with Vision-Language Models (VLMs) from the HuggingFace Smol-Course

smol-course: A course on aligning smol models. Project repository: https://gitcode.com/gh_mirrors/smo/smol-course

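Vision-Language Models (VLMs) process images and text together, which lets them reason across the two modalities. Drawing on the hands-on examples in the HuggingFace Smol-Course, this article shows how to use a quantized SmolVLM-Instruct model for several vision-language tasks.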

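Environment Setup and Model Loading

Before starting, set up a Python environment and install the required libraries. The core dependencies include transformers, datasets, trl, and bitsandbytes. Because vision-language models tend to be large, we load the model with 4-bit quantization to reduce memory usage.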

import torch
import PIL.Image  # imported explicitly because PIL.Image is used for video frames later
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from transformers.image_utils import load_image

# Pick the best available device: CUDA GPU, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# 4-bit quantization config
# Note: bitsandbytes primarily targets CUDA GPUs, so this config may not load on other devices
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the model and its processor
model_name = "HuggingFaceTB/SmolVLM-Instruct"
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    quantization_config=quantization_config,
).to(device)
processor = AutoProcessor.from_pretrained(model_name)
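
To make the memory argument concrete, the following back-of-the-envelope sketch estimates how much memory the weights alone need at different precisions. The parameter count used here is an assumed round number for illustration, not a figure from the course, and real usage will be higher because activations and quantization metadata are not counted.

```python
def estimate_weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Hypothetical parameter count, chosen only to illustrate the scaling
ASSUMED_NUM_PARAMS = 2_000_000_000

for bits in (32, 16, 8, 4):
    gib = estimate_weight_memory_gib(ASSUMED_NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.2f} GiB")
```

The estimate scales linearly with the bit width, so 4-bit weights take roughly a quarter of the space of 16-bit weights.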

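Single-Image Processing

The most basic use of a vision-language model is understanding and describing a single image. A chat-style prompt template steers the model toward the task we want.

Image Captioning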

# Load the image
image_url = "https://cdn.pixabay.com/photo/2024/11/20/09/14/christmas-9210799_1280.jpg"
image = load_image(image_url)

# Build the chat messages: one image placeholder followed by the question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

# Prepare the inputs and generate the output
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

This captioning ability is useful in scenarios such as content moderation and assistance for visually impaired users.
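
The decoded string may contain the prompt text as well as the model's answer. The helper below is a minimal sketch for keeping only the answer. It assumes the chat template labels the model turn with "Assistant:"; both that marker and the sample string are assumptions for illustration, so check them against the actual output before relying on this.

```python
def extract_assistant_reply(decoded_text: str, marker: str = "Assistant:") -> str:
    """Return the text after the last marker, or the whole text if the marker is absent."""
    _, found, reply = decoded_text.rpartition(marker)
    return reply.strip() if found else decoded_text.strip()


# Hypothetical decoded output, made up for this example
sample = "User: Can you describe the image?\nAssistant: A decorated tree in the snow."
print(extract_assistant_reply(sample))
```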

Comparing Multiple Images

A VLM can also take several images in one prompt and compare them. For example, we can ask the model what theme two images share:

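# Load the two images
# image_url1 and image_url2 are not defined in this article; set them to the
# URLs or local paths of the two images you want to compare
image1 = load_image(image_url1)
image2 = load_image(image_url2)

# Build the comparison prompt: one placeholder per image, then the question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What event do they both represent?"}
        ]
    },
]

# Process the inputs and generate the response
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)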

Multi-image analysis of this kind is useful in commercial settings such as product categorization and content-similarity checks.
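
In the multi-image prompt, each {"type": "image"} entry is paired with one item in the images list, so the two counts need to stay in sync when the prompt is edited. The check below is a small sketch of how that could be verified before calling the processor. The function names are hypothetical helpers, not part of transformers.

```python
def count_image_placeholders(messages):
    """Count {"type": "image"} entries across all messages."""
    return sum(
        1
        for message in messages
        for part in message["content"]
        if part.get("type") == "image"
    )


def check_image_count(messages, images):
    """Raise ValueError when placeholders and supplied images differ in number."""
    expected = count_image_placeholders(messages)
    if expected != len(images):
        raise ValueError(
            f"prompt has {expected} image placeholder(s) "
            f"but {len(images)} image(s) were given"
        )


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What event do they both represent?"},
        ],
    },
]

# Strings stand in for the loaded images here; only the count matters
check_image_count(messages, ["image1", "image2"])
```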

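Text Recognition in Images (OCR)

Beyond understanding visual content, a VLM can also read the text that appears in an image, which gives it OCR capability: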

# Load an image that contains text
document_image = load_image("https://cdn.pixabay.com/photo/2020/11/30/19/23/christmas-5792015_960_720.png")

# Build the OCR prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is written?"}
        ]
    }
]

# Process the inputs and read the text
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[document_image], return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

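Compared with traditional OCR, a VLM can not only recognize the characters but also interpret what the text means in context, which enables smarter document analysis.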
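
The captioning, comparison, and OCR examples all follow the same steps: build the messages, apply the chat template, run the processor, generate, and decode. One possible way to factor that out is sketched below. The name ask_vlm and its parameters are hypothetical; the function only relies on the processor and model calls already used in this article.

```python
def ask_vlm(model, processor, images, question, device="cpu", max_new_tokens=500):
    """Send one user turn (images followed by a question) and return the decoded text."""
    # One image placeholder per supplied image, then the text question
    content = [{"type": "image"} for _ in images]
    content.append({"type": "text", "text": question})
    messages = [{"role": "user", "content": content}]

    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=list(images), return_tensors="pt").to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example call, assuming model, processor, device and document_image exist as in the article:
# print(ask_vlm(model, processor, [document_image], "What is written?", device=device))
```

Keeping the placeholder construction inside the helper means the number of placeholders always matches the number of images passed in.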

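Video Understanding

Although the VLM is not designed specifically for video, we can approximate video understanding by sampling evenly spaced frames and passing them to the model as a sequence of images: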

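import cv2
import numpy as np

def extract_frames(video_path, max_frames=50):
    """Sample up to max_frames evenly spaced frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices from the first frame to the last
    frame_indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)

    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes frames as BGR; convert to RGB before building the PIL image
            frame = PIL.Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(frame)
    cap.release()
    return frames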

# Sample frames from the video
video_link = "https://cdn.pixabay.com/video/2023/10/28/186794-879050032_large.mp4"
frames = extract_frames(video_link, max_frames=15)

# Build the video prompt: one image placeholder per frame, in temporal order
image_tokens = [{"type": "image"} for _ in frames]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Following are the frames of a video in temporal order."},
            *image_tokens,
            {"type": "text", "text": "Describe what the woman is doing."}
        ]
    }
]

# Generate the video description
inputs = processor(
    text=processor.apply_chat_template(messages, add_generation_prompt=True),
    images=frames,
    return_tensors="pt"
).to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(generated_text)

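This approach cannot capture fine-grained temporal dynamics, but it is often sufficient for tasks such as summarizing a video or recognizing key actions.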
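
One detail of the frame sampling is worth spelling out: when a video has fewer frames than max_frames, evenly spaced indices computed with np.linspace will repeat, so the same frame is sent to the model more than once. The sketch below shows one way to compute the indices without duplicates, using only integer arithmetic. The function sample_frame_indices is a hypothetical helper, not part of the course code.

```python
def sample_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Return up to max_frames distinct, evenly spaced frame indices."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    # Never ask for more frames than the video contains
    count = min(total_frames, max_frames)
    if count == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame exact
    return [i * (total_frames - 1) // (count - 1) for i in range(count)]


print(sample_frame_indices(100, 5))
print(sample_frame_indices(3, 15))
```

Capping the count at the number of available frames keeps the spacing between indices at one or more, which is what guarantees that no index appears twice.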