Background
When training Qwen 3 VL, we want to apply some image preprocessing in the image-handling part of the data pipeline before handing the images to the Processor. This is a standard technique when training vision models, e.g. random cropping or random brightness jitter.
Our goal: a preprocessed numpy array / torch.Tensor / PIL Image, once encoded by the Processor, must produce exactly the same result as reading that processed image in directly.
In practice there are a few pitfalls worth recording, hence this post.
(This post also applies to Qwen 2 VL and Qwen 2.5 VL.)
Basic data-processing flow
First, load the model and the Processor:
from transformers import AutoModelForImageTextToText, AutoProcessor
MODEL_NAME = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_NAME, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_NAME)
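Before going further, it is worth checking which image processor the Processor wraps and what its defaults are (do_rescale, rescale_factor, image_mean, image_std), since those are exactly the knobs that matter below. A quick sketch:
# Print the wrapped image processor class and its config
# (do_rescale, rescale_factor, image_mean, image_std, ...).
print(type(processor.image_processor).__name__)
print(processor.image_processor)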
The image-preprocessing pipeline usually lives in the data_collator, so we implement a stripped-down collator here. `piece_of_data` mimics one item of a datasets dataset. We only operate on the image and, since this is a demonstration of the data interfaces, apply no actual transform — the image is left unchanged. The example covers feeding images as a path string, a torch tensor, a numpy array, and a PIL Image.
import cv2
import torch
import numpy as np
from PIL import Image  # needed for Image.fromarray in the collator below

piece_of_data = [
    {
        "image": "./sample_img.jpg",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image."},
                    {"type": "image", "image": "./sample_img.jpg"},
                ],
            },
            # {
            #     "role": "assistant",
            #     "content": [{"type": "text", "text": "The image is a street scene with a car and a person."}]
            # }
        ],
    }
]
def my_collator(inputs):
    # ... other required steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])  # BGR, uint8, (H, W, C)
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # do not cast the dtype, to avoid losing precision
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_string)
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)
    # ... other required steps
    return batch

print(my_collator(piece_of_data))

[Figure: sample_img.jpg, the sample image used throughout this post]
Baseline: path input
When the path string is passed straight through as above, the Processor reads the original image itself. The output is:
{'input_ids': tensor([[151644, 872, 198, 74785, 279, 2168, 11, 2291, 6529,
311, 1894, 13, 151652, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655,
151655, 151655, 151655, 151655, 151655, 151655, 151655, 151653, 151645,
198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1]]), 'pixel_values': tensor([[ 0.8902, 0.8902, 0.8824, ..., 0.9216, 0.9216, 0.9137],
[ 0.8667, 0.8667, 0.8510, ..., 0.9137, 0.9137, 0.9059],
[ 0.8588, 0.8588, 0.8588, ..., 0.9686, 0.9137, 0.8745],
...,
[-0.1373, -0.1608, -0.1765, ..., -0.0745, -0.0824, 0.0196],
[ 0.0118, -0.0667, -0.1059, ..., 0.0196, 0.0039, 0.0118],
[-0.0588, -0.0431, -0.0353, ..., 0.0667, -0.0118, -0.0824]]), 'image_grid_thw': tensor([[ 1, 30, 40]])}
As explained in the related article https://blog.youkuaiyun.com/qq_40672115/article/details/151675269?spm=1001.2014.3001.5502, pixel_values is the actual encoded image tensor. Keep this pixel_values in mind — it is our reference.
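As a quick sanity check (a small sketch): the shape of pixel_values, image_grid_thw, and the number of image placeholder tokens (id 151655 in the dump above) are tied together:
batch = my_collator(piece_of_data)
t, h, w = batch["image_grid_thw"][0].tolist()            # here: 1, 30, 40
print(batch["pixel_values"].shape[0], t * h * w)          # one row of pixel_values per vision patch
num_pads = (batch["input_ids"] == 151655).sum().item()    # <|image_pad|> placeholders in the prompt
print(num_pads, (t * h * w) // num_pads)                  # patches per placeholder token (merge_size**2, 4 here)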
numpy input
Here the image is a float array in the 0.0-1.0 range, RGB, with shape (H, W, C). Simply change
image_inputs.append(image_string)
in the code above to
image_inputs.append(image_np)
and pixel_values now becomes:
'pixel_values': tensor([[-0.9926, -0.9926, -0.9926, ..., -0.9925, -0.9925, -0.9925],
[-0.9927, -0.9927, -0.9927, ..., -0.9925, -0.9925, -0.9925],
[-0.9927, -0.9927, -0.9927, ..., -0.9923, -0.9925, -0.9926],
...,
[-0.9966, -0.9967, -0.9968, ..., -0.9964, -0.9964, -0.9960],
[-0.9960, -0.9963, -0.9965, ..., -0.9960, -0.9961, -0.9960],
[-0.9963, -0.9962, -0.9962, ..., -0.9958, -0.9961, -0.9964]])
What is going on? Looking at the Qwen2VLImageProcessor source, images are expected to be in the 0-255 range by default; to feed 0.0-1.0 images you have to set do_rescale=False, which can be passed to the processor call as images_kwargs={"do_rescale": False}.
Also, experiments confirm that the expected channel order is RGB, not OpenCV's default BGR, and for the array layout (H, W, C) is recommended. The relevant signature and docstring:
def _preprocess(
    self,
    images: Union[ImageInput, VideoInput],
    do_resize: Optional[bool] = None,
    size: Optional[dict[str, int]] = None,
    resample: Optional[PILImageResampling] = None,
    do_rescale: Optional[bool] = None,
    rescale_factor: Optional[float] = None,
    do_normalize: Optional[bool] = None,
    image_mean: Optional[Union[float, list[float]]] = None,
    image_std: Optional[Union[float, list[float]]] = None,
    patch_size: Optional[int] = None,
    temporal_patch_size: Optional[int] = None,
    merge_size: Optional[int] = None,
    do_convert_rgb: Optional[bool] = None,
    data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
    """
    Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.

    Args:
        images (`ImageInput`):
            Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
        vision_info (`list[Dict]`, *optional*):
            Optional list of dictionaries containing additional information about vision inputs.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image after resizing. `shortest_edge` and `longest_edge` keys must be present.
        resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Scale factor to use if rescaling the image.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
            Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
        image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
            Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
        patch_size (`int`, *optional*, defaults to `self.patch_size`):
            The spatial patch size of the vision encoder.
        temporal_patch_size (`int`, *optional*, defaults to `self.temporal_patch_size`):
            The temporal patch size of the vision encoder.
        merge_size (`int`, *optional*, defaults to `self.merge_size`):
            The merge size of the vision encoder to llm encoder.
        do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
            Whether to convert the image to RGB.
        data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
So for the 0.0-1.0 numpy input, the code we should be using is:
def my_collator(inputs):
    # ... other required steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])  # BGR, uint8, (H, W, C)
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # do not cast the dtype, to avoid losing precision
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_np)
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True, images_kwargs={"do_rescale": False})
    # ... other required steps
    return batch
The output is now:
'pixel_values': tensor([[ 0.8902, 0.8902, 0.8824, ..., 0.9216, 0.9216, 0.9137],
[ 0.8667, 0.8667, 0.8510, ..., 0.9137, 0.9137, 0.9059],
[ 0.8588, 0.8588, 0.8588, ..., 0.9686, 0.9137, 0.8745],
...,
[-0.1373, -0.1608, -0.1765, ..., -0.0745, -0.0824, 0.0196],
[ 0.0118, -0.0667, -0.1059, ..., 0.0196, 0.0039, 0.0118],
[-0.0588, -0.0431, -0.0353, ..., 0.0667, -0.0118, -0.0824]]
This matches the baseline.
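Rather than eyeballing the truncated tensor dumps, the equivalence can be checked programmatically. A small sketch, where my_collator_path and my_collator_numpy are hypothetical names for the two collator variants shown above (path string without images_kwargs vs. image_np with do_rescale=False); the difference should be zero, or at worst floating-point noise:
batch_ref = my_collator_path(piece_of_data)
batch_np = my_collator_numpy(piece_of_data)
print(torch.allclose(batch_ref["pixel_values"], batch_np["pixel_values"], atol=1e-6))
print((batch_ref["pixel_values"] - batch_np["pixel_values"]).abs().max())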
Tensor input
Here the image is again a 0.0-1.0, RGB, (H, W, C) image, this time passed as a torch.Tensor. The code:
def my_collator(inputs):
    # ... other required steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])  # BGR, uint8, (H, W, C)
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # do not cast the dtype, to avoid losing precision
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_tensor)
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True, images_kwargs={"do_rescale": False})
    # ... other required steps
    return batch
Output:
'pixel_values': tensor([[ 0.8902, 0.8902, 0.8824, ..., 0.9216, 0.9216, 0.9137],
[ 0.8667, 0.8667, 0.8510, ..., 0.9137, 0.9137, 0.9059],
[ 0.8588, 0.8588, 0.8588, ..., 0.9686, 0.9137, 0.8745],
...,
[-0.1373, -0.1608, -0.1765, ..., -0.0745, -0.0824, 0.0196],
[ 0.0118, -0.0667, -0.1059, ..., 0.0196, 0.0039, 0.0118],
[-0.0588, -0.0431, -0.0353, ..., 0.0667, -0.0118, -0.0824]]
This again matches the baseline.
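One caveat for tensors and arrays: the processor infers whether the input is (H, W, C) or (C, H, W) from the shape. If you want to rule that inference out, input_data_format (see the signature quoted earlier) should, as far as I can tell, also be accepted through images_kwargs; the processor call inside my_collator would then read:
# Drop-in replacement for the processor call inside my_collator: state the input layout explicitly.
# "channels_last" corresponds to (H, W, C); "do_rescale": False is still needed for 0.0-1.0 inputs.
batch = processor(
    text=texts,
    images=image_inputs,
    return_tensors="pt",
    padding=True,
    images_kwargs={"do_rescale": False, "input_data_format": "channels_last"},
)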
PIL input
Here the image is in the 0-255 range, RGB, (H, W, C) — i.e. an ordinary PIL Image — so "do_rescale": False is not needed. The code:
def my_collator(inputs):
    # ... other required steps
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in inputs]
    image_inputs = []
    image_inputs_tmp = []
    image_strs = []
    for example in inputs:
        for message in example["messages"]:
            for content in message["content"]:
                if content["type"] == "image":
                    image = cv2.imread(content["image"])  # BGR, uint8, (H, W, C)
                    image_inputs_tmp.append(image)
                    image_strs.append(content["image"])
    for i, image_input in enumerate(image_inputs_tmp):
        image_input = cv2.cvtColor(image_input, cv2.COLOR_BGR2RGB) / 255.0
        image_tensor = torch.from_numpy(image_input).permute(2, 0, 1).float()
        # image_tensor = YOUR_TRANSFORM(image_tensor)  # put your transform / other preprocessing here
        image_string = image_strs[i]
        image_tensor = image_tensor.clamp(0, 1).permute(1, 2, 0)
        image_np = image_tensor.numpy()  # do not cast the dtype, to avoid losing precision
        image_pil = Image.fromarray(np.uint8(image_np * 255))
        image_inputs.append(image_pil)
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)
    # ... other required steps
    return batch
Output:
tensor([[ 0.8902, 0.8902, 0.8824, ..., 0.9216, 0.9216, 0.9137],
[ 0.8667, 0.8667, 0.8510, ..., 0.9137, 0.9137, 0.9059],
[ 0.8588, 0.8588, 0.8588, ..., 0.9686, 0.9137, 0.8745],
...,
[-0.1373, -0.1608, -0.1765, ..., -0.0745, -0.0824, 0.0196],
[ 0.0118, -0.0667, -0.1059, ..., 0.0196, 0.0039, 0.0118],
[-0.0588, -0.0431, -0.0353, ..., 0.0667, -0.0118, -0.0824]],
This matches the baseline as well.
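One thing to keep in mind with the PIL route (and the reason the collator comments say not to cast the numpy array): going back to uint8 via np.uint8(image_np * 255) quantizes (and truncates) values to 1/255 steps, so a transform that produces in-between values is reproduced slightly less faithfully than with the float numpy/tensor inputs. A quick way to see the size of that error, as a sketch:
# Quantization error introduced by the float -> uint8 -> float round trip used for the PIL branch.
image_np = cv2.cvtColor(cv2.imread("./sample_img.jpg"), cv2.COLOR_BGR2RGB) / 255.0
image_np_roundtrip = np.uint8(image_np * 255) / 255.0
print(np.abs(image_np - image_np_roundtrip).max())  # up to ~1/255, from truncation in np.uint8(...)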
Summary
This post revolves around one requirement when training Qwen 3 VL (and the 2 / 2.5 series): a preprocessed image, once encoded by the Processor, must produce exactly the same result as reading the image file directly. Using a simplified data collator, we tested four ways of feeding images — file path, numpy array, torch.Tensor, and PIL Image — and identified the key requirement of each. The path input is the baseline and needs no extra configuration. numpy arrays and tensors in the 0.0-1.0 range (RGB, (H, W, C)) must disable the default rescaling with images_kwargs={"do_rescale": False}, otherwise the values are divided by 255 a second time and the encoding goes wrong. PIL Images (0-255, RGB) work with the Processor's default logic and need no extra arguments. Throughout, pay attention to the channel order (convert BGR to RGB) and to precision (avoid unnecessary dtype conversions) so that no information is lost during preprocessing. These observations make it straightforward to plug random cropping, brightness jitter, and similar transforms into the training pipeline while keeping the data-processing pipeline consistent and stable.