Qwen 3 VL: Plugging Image Preprocessing and Augmentation into the Processor

Background

When training Qwen 3 VL, we want to apply some image preprocessing in the image-handling stage of the data pipeline before handing the images to the Processor. This is standard practice when training vision models, for example random cropping or random brightness jitter.

The goal: a preprocessed numpy array / torch.Tensor / PIL Image, once encoded by the Processor, should give exactly the same result as reading that processed image in directly.

A few pitfalls came up in practice that are worth recording, hence this post.

(Everything here also applies to Qwen 2 VL and Qwen 2.5 VL.)

Basic data-processing flow

First, load the model and the Processor:

from transformers import AutoModelForImageTextToText, AutoProcessor
MODEL_NAME = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_NAME, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

The image-preprocessing pipeline usually lives inside the data_collator, so below is a stripped-down collator. `piece_of_data` mimics a single item from a `datasets` dataset. We only touch the images, and no actual transform is applied; the images stay unchanged, because the point here is to demonstrate the data interfaces. The example covers the string (path), tensor, numpy and PIL image input formats.

import cv2
import numpy as np
import torch
from PIL import Image
piece_of_data = [
    {
        "image": "./sample_img.jpg",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    {"type": "image", "image": "./sample_img.jpg"},
                ],
            },
            # {
            #     "role": "assistant",
            #     "content": [{"type": "text", "text": "The image is a street scene with a car and a person."}]
            # }
        ],
    }
]

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_string)  # baseline: pass the file path straight through

    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)
    # ... other necessary steps
    return batch

print(my_collator(piece_of_data))

 

[Figure: sample_img.jpg, the sample image used throughout this post]

Baseline: path input

When the code above passes the file path straight through, the Processor reads the original image itself. The output is:

{'input_ids': tensor([[151644,    872,    198,  74785,    279,   2168,     11,   2291,   6529,
            311,   1894,     13, 151652, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151653, 151645,
            198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]]), 'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]), 'image_grid_thw': tensor([[ 1, 30, 40]])}

As explained in this related article https://blog.youkuaiyun.com/qq_40672115/article/details/151675269?spm=1001.2014.3001.5502 , pixel_values is the tensor holding the actual encoded image. Keep this pixel_values content in mind as the reference.
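To make the later comparisons concrete rather than visual, it helps to keep this baseline batch around and check each variant against it with torch.allclose. A minimal sketch (my_collator is the path-input collator above; the helper name is introduced just for this post):

baseline_batch = my_collator(piece_of_data)
baseline_pixel_values = baseline_batch["pixel_values"]

def same_as_baseline(batch, atol=1e-6):
    # True if this variant's pixel_values match the path-input baseline
    return torch.allclose(batch["pixel_values"], baseline_pixel_values, atol=atol)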

numpy input

Here the image is a 0.0-1.0 range, RGB, (H, W, C) array. Simply change this line in the code above:

image_inputs.append(image_string)

to:

image_inputs.append(image_np)

With that change, pixel_values turns into:

 'pixel_values': tensor([[-0.9926, -0.9926, -0.9926,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9923, -0.9925, -0.9926],
        ...,
        [-0.9966, -0.9967, -0.9968,  ..., -0.9964, -0.9964, -0.9960],
        [-0.9960, -0.9963, -0.9965,  ..., -0.9960, -0.9961, -0.9960],
        [-0.9963, -0.9962, -0.9962,  ..., -0.9958, -0.9961, -0.9964]])

What happened? Looking at the Qwen2VLImageProcessor source, the expected pixel value range is 0-255 by default. To feed 0.0-1.0 images, do_rescale=False must be set; it can be passed to the processor as images_kwargs={"do_rescale": False}.

Also verified experimentally: the expected channel order is RGB, not OpenCV's default BGR. For the array shape, (H, W, C) is recommended. The relevant signature and docstring of Qwen2VLImageProcessor._preprocess:

    def _preprocess(
        self,
        images: Union[ImageInput, VideoInput],
        do_resize: Optional[bool] = None,
        size: Optional[dict[str, int]] = None,
        resample: Optional[PILImageResampling] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, list[float]]] = None,
        image_std: Optional[Union[float, list[float]]] = None,
        patch_size: Optional[int] = None,
        temporal_patch_size: Optional[int] = None,
        merge_size: Optional[int] = None,
        do_convert_rgb: Optional[bool] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """
        Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.

        Args:
            images (`ImageInput`):
                Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
            vision_info (`list[Dict]`, *optional*):
                Optional list of dictionaries containing additional information about vision inputs.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Scale factor to use if rescaling the image.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
                Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
                Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
            patch_size (`int`, *optional*, defaults to `self.patch_size`):
                The spatial patch size of the vision encoder.
            temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
                The temporal patch size of the vision encoder.
            merge_size (`int`, *optional*, defaults to `self.merge_size`):
                The merge size of the vision encoder to llm encoder.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.   - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """

So the code we should actually use is:

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_np)  # pass the float numpy array in [0, 1]

    # do_rescale=False because the images are already in [0, 1]
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True, images_kwargs={"do_rescale": False})
    # ... other necessary steps
    return batch

The output is now:

'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]

This matches the baseline, as expected.
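To verify this numerically instead of eyeballing the printed tensor, reuse the baseline captured earlier (a sketch, assuming baseline_pixel_values and same_as_baseline from the baseline section are still in scope):

numpy_batch = my_collator(piece_of_data)  # the collator variant that appends image_np
print(same_as_baseline(numpy_batch))      # expected: True

The same check applies unchanged to the Tensor and PIL variants below.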

torch.Tensor input

Again a 0.0-1.0 range, RGB, (H, W, C) image. The code:

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_tensor)  # pass the float tensor in [0, 1], shape (H, W, C)

    # do_rescale=False because the images are already in [0, 1]
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True, images_kwargs={"do_rescale": False})
    # ... other necessary steps
    return batch

Output:

'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]

Matches the baseline as expected.
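With the tensor route confirmed, a real augmentation can now go into the YOUR_TRANSFORM slot. A hedged sketch using torchvision (torchvision is my assumption here, not something the original pipeline requires; any transform that returns a float (C, H, W) tensor in [0, 1] slots in the same way and keeps do_rescale=False valid):

from torchvision import transforms

# Example augmentation operating on a float (C, H, W) tensor in [0, 1].
my_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomHorizontalFlip(p=0.5),
])

# Inside the collator loop, in place of the commented-out line:
# image_tensor = my_transform(image_tensor)

With a real augmentation applied, pixel_values will of course no longer match the untouched baseline; the consistency checks in this post only hold for the identity transform.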

PIL Image input

Here the image is a 0-255 range, RGB, (H, W, C) image, so "do_rescale": False is not needed. Note that converting the float image back to uint8 quantizes pixel values; with the identity transform used here the original pixels are recovered exactly, but with a real augmentation this route may introduce tiny rounding differences. The code:

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_pil)  # pass the uint8 PIL image, so the default do_rescale stays on

    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)
    # ... other necessary steps
    return batch

Output:

 tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]],

Matches the baseline as expected.

Summary

This post addressed the core requirement for image preprocessing when training Qwen 3 VL (and the 2 / 2.5 series): making sure that a preprocessed image, once encoded by the Processor, matches the result of reading that image in directly. Using a simplified data collator, four input routes were tested: file path, numpy array, torch.Tensor and PIL Image, pinning down the key adaptation point for each. The path input serves as the baseline and needs no extra configuration. numpy arrays and Tensors (0.0-1.0 range, RGB, (H, W, C) shape) need images_kwargs={"do_rescale": False} to switch off the default rescale, otherwise the values get rescaled a second time and the encoding goes wrong. PIL Images (0-255 range, RGB) work with the Processor's default logic and need no extra arguments. Along the way, mind the channel order when converting formats (BGR to RGB) and preserve numeric precision (avoid unnecessary dtype casts) so preprocessing does not lose image information. These findings give a reliable reference for plugging augmentations such as random cropping or brightness jitter into the training pipeline of vision-language models, helping developers avoid format-adaptation problems and keep the data pipeline consistent and stable.
