Qwen 3 VL: Plugging Image Preprocessing and Augmentation into the Processor

Background

When training Qwen 3 VL, we want to apply some image preprocessing in the image-handling stage of the data pipeline before handing the images to the Processor. This is standard practice when training vision models, for example random cropping or random brightness jitter.

The goal: a preprocessed numpy array / torch.Tensor / PIL Image, once encoded by the Processor, should give exactly the same result as reading that processed image in directly.

A few pitfalls came up in practice that are worth recording, hence this post.

(Everything here also applies to Qwen 2 VL and Qwen 2.5 VL.)

Basic data-processing flow

First, load the model and the Processor:

from transformers import AutoModelForImageTextToText, AutoProcessor
MODEL_NAME = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_NAME, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

The image-preprocessing pipeline usually lives inside the data_collator, so below is a stripped-down collator. `piece_of_data` mimics a single item from a `datasets` dataset. We only touch the images, and no actual transform is applied; the images stay unchanged, because the point here is to demonstrate the data interfaces. The example covers the string (path), tensor, numpy and PIL image input formats.

import cv2
import numpy as np
import torch
from PIL import Image
piece_of_data = [
    {
        "image": "./sample_img.jpg",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    {"type": "image", "image": "./sample_img.jpg"},
                ],
            },
            # {
            #     "role": "assistant",
            #     "content": [{"type": "text", "text": "The image is a street scene with a car and a person."}]
            # }
        ],
    }
]

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_string)  # baseline: pass the file path straight through

    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)
    # ... other necessary steps
    return batch

print(my_collator(piece_of_data))

 

[Figure: sample_img.jpg, the sample image used throughout this post]

Baseline: path input

When the code above passes the file path straight through, the Processor reads the original image itself. The output is:

{'input_ids': tensor([[151644,    872,    198,  74785,    279,   2168,     11,   2291,   6529,
            311,   1894,     13, 151652, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
         151655, 151655, 151655, 151655, 151655, 151655, 151655, 151653, 151645,
            198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]]), 'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]), 'image_grid_thw': tensor([[ 1, 30, 40]])}

As explained in this related article https://blog.youkuaiyun.com/qq_40672115/article/details/151675269?spm=1001.2014.3001.5502 , pixel_values is the tensor holding the actual encoded image. Keep this pixel_values content in mind as the reference.
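To make the later comparisons concrete rather than visual, it helps to keep this baseline batch around and check each variant against it with torch.allclose. A minimal sketch (my_collator is the path-input collator above; the helper name is introduced just for this post):

baseline_batch = my_collator(piece_of_data)
baseline_pixel_values = baseline_batch["pixel_values"]

def same_as_baseline(batch, atol=1e-6):
    # True if this variant's pixel_values match the path-input baseline
    return torch.allclose(batch["pixel_values"], baseline_pixel_values, atol=atol)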

numpy input

Here the image is a 0.0-1.0 range, RGB, (H, W, C) array. Simply change this line in the code above:

image_inputs.append(image_string)

to:

image_inputs.append(image_np)

With that change, pixel_values turns into:

 'pixel_values': tensor([[-0.9926, -0.9926, -0.9926,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9925, -0.9925, -0.9925],
        [-0.9927, -0.9927, -0.9927,  ..., -0.9923, -0.9925, -0.9926],
        ...,
        [-0.9966, -0.9967, -0.9968,  ..., -0.9964, -0.9964, -0.9960],
        [-0.9960, -0.9963, -0.9965,  ..., -0.9960, -0.9961, -0.9960],
        [-0.9963, -0.9962, -0.9962,  ..., -0.9958, -0.9961, -0.9964]])

What happened? Looking at the Qwen2VLImageProcessor source, the expected pixel value range is 0-255 by default. To feed 0.0-1.0 images, do_rescale=False must be set; it can be passed to the processor as images_kwargs={"do_rescale": False}.

Also verified experimentally: the expected channel order is RGB, not OpenCV's default BGR. For the array shape, (H, W, C) is recommended. The relevant signature and docstring of Qwen2VLImageProcessor._preprocess:

    def _preprocess(
        self,
        images: Union[ImageInput, VideoInput],
        do_resize: Optional[bool] = None,
        size: Optional[dict[str, int]] = None,
        resample: Optional[PILImageResampling] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, list[float]]] = None,
        image_std: Optional[Union[float, list[float]]] = None,
        patch_size: Optional[int] = None,
        temporal_patch_size: Optional[int] = None,
        merge_size: Optional[int] = None,
        do_convert_rgb: Optional[bool] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """
        Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.

        Args:
            images (`ImageInput`):
                Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
            vision_info (`list[Dict]`, *optional*):
                Optional list of dictionaries containing additional information about vision inputs.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Scale factor to use if rescaling the image.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
                Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
                Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
            patch_size (`int`, *optional*, defaults to `self.patch_size`):
                The spatial patch size of the vision encoder.
            temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
                The temporal patch size of the vision encoder.
            merge_size (`int`, *optional*, defaults to `self.merge_size`):
                The merge size of the vision encoder to llm encoder.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.   - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """

So the code we should actually use is:

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_np)  # pass the float numpy array in [0, 1]

    # do_rescale=False because the images are already in [0, 1]
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True, images_kwargs={"do_rescale": False})
    # ... other necessary steps
    return batch

The output is now:

'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]

This matches the baseline, as expected.
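To verify this numerically instead of eyeballing the printed tensor, reuse the baseline captured earlier (a sketch, assuming baseline_pixel_values and same_as_baseline from the baseline section are still in scope):

numpy_batch = my_collator(piece_of_data)  # the collator variant that appends image_np
print(same_as_baseline(numpy_batch))      # expected: True

The same check applies unchanged to the Tensor and PIL variants below.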

torch.Tensor input

Again a 0.0-1.0 range, RGB, (H, W, C) image. The code:

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_tensor)  # pass the float tensor in [0, 1], shape (H, W, C)

    # do_rescale=False because the images are already in [0, 1]
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True, images_kwargs={"do_rescale": False})
    # ... other necessary steps
    return batch

Output:

'pixel_values': tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]]

Matches the baseline as expected.
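With the tensor route confirmed, a real augmentation can now go into the YOUR_TRANSFORM slot. A hedged sketch using torchvision (torchvision is my assumption here, not something the original pipeline requires; any transform that returns a float (C, H, W) tensor in [0, 1] slots in the same way and keeps do_rescale=False valid):

from torchvision import transforms

# Example augmentation operating on a float (C, H, W) tensor in [0, 1].
my_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomHorizontalFlip(p=0.5),
])

# Inside the collator loop, in place of the commented-out line:
# image_tensor = my_transform(image_tensor)

With a real augmentation applied, pixel_values will of course no longer match the untouched baseline; the consistency checks in this post only hold for the identity transform.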

PIL Image input

Here the image is a 0-255 range, RGB, (H, W, C) image, so "do_rescale": False is not needed. Note that converting the float image back to uint8 quantizes pixel values; with the identity transform used here the original pixels are recovered exactly, but with a real augmentation this route may introduce tiny rounding differences. The code:

def my_collator(inputs):
    # ... other necessary steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)    # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # keep the float dtype; casting here would lose information
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_pil)  # pass the uint8 PIL image, so the default do_rescale stays on

    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)
    # ... other necessary steps
    return batch

Output:

 tensor([[ 0.8902,  0.8902,  0.8824,  ...,  0.9216,  0.9216,  0.9137],
        [ 0.8667,  0.8667,  0.8510,  ...,  0.9137,  0.9137,  0.9059],
        [ 0.8588,  0.8588,  0.8588,  ...,  0.9686,  0.9137,  0.8745],
        ...,
        [-0.1373, -0.1608, -0.1765,  ..., -0.0745, -0.0824,  0.0196],
        [ 0.0118, -0.0667, -0.1059,  ...,  0.0196,  0.0039,  0.0118],
        [-0.0588, -0.0431, -0.0353,  ...,  0.0667, -0.0118, -0.0824]],

Matches the baseline as expected.

Summary

This post addressed the core requirement for image preprocessing when training Qwen 3 VL (and the 2 / 2.5 series): making sure that a preprocessed image, once encoded by the Processor, matches the result of reading that image in directly. Using a simplified data collator, four input routes were tested: file path, numpy array, torch.Tensor and PIL Image, pinning down the key adaptation point for each. The path input serves as the baseline and needs no extra configuration. numpy arrays and Tensors (0.0-1.0 range, RGB, (H, W, C) shape) need images_kwargs={"do_rescale": False} to switch off the default rescale, otherwise the values get rescaled a second time and the encoding goes wrong. PIL Images (0-255 range, RGB) work with the Processor's default logic and need no extra arguments. Along the way, mind the channel order when converting formats (BGR to RGB) and preserve numeric precision (avoid unnecessary dtype casts) so preprocessing does not lose image information. These findings give a reliable reference for plugging augmentations such as random cropping or brightness jitter into the training pipeline of vision-language models, helping developers avoid format-adaptation problems and keep the data pipeline consistent and stable.
