Llama 3.2-11B-Vision Multimodal Model Architecture Explained in Detail (Down to Individual Operators), Part 2: Text Preprocessing Step by Step

Since Meta released its first open-source multimodal model, Llama 3.2-11B-Vision, last year, there have been almost no blog posts dissecting its concrete architecture. That leaves readers confused, to varying degrees, about both its principles and its structure, which gets in the way of learning about large models. This series walks through the model from the very beginning, step by step, without glossing over any stage, and is well suited to newcomers to large models. The series outline is:

  • Image preprocessing step by step (link: see the previous post in this series)
  • Text preprocessing step by step (this post)
  • The vision encoder: structure and steps in detail (coming soon…)
  • The text encoder: structure and steps in detail (coming soon…)
  • Image-text fusion: structure and steps in detail (coming soon…)
  • The output stage: structure and steps in detail (coming soon…)
  • A complete ultra-HD inference flowchart for the Llama 3.2-11B-Vision multimodal model (the series bonus ^_^, coming soon…)

0. Llama 3.2-11B-Vision Multimodal Model Inference Code

Below is the complete inference code based on the official sample. I picked a grayscale image: a 28×28 image from the MNIST dataset with a strip cropped off, leaving a 28×24 image. Read with PIL, it becomes a 28×24 matrix of unsigned 8-bit integers in the range 0~255, which is then passed to the processor function, the algorithm this post examines in detail; the remaining stages will be covered in later posts.


#%%

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from time import time
import numpy as np
import pandas as pd

model_dir = "./models/llama3.2_11b"

model = MllamaForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
model.tie_weights()
# Initialization: this detects the tokenizer type and the image tiling configuration
processor = AutoProcessor.from_pretrained(model_dir)


if __name__ == '__main__':
    url = "https://www.modelscope.cn/models/LLM-Research/Llama-3.2-11B-Vision/resolve/master/rabbit.jpg"
    while 1:
        # image = Image.open("./data/1995.jpg") # 图片路径
        image = Image.open("./data/000000.png")
        # image = Image.open("./data/bicycle_bigger.png")
        image_array = np.array(image) # 这个没什么用,每个数值在0~255之间,8位无符号整数
        query = "图中的数字是几?"
        # query = "图中的交通工具是什么?"
        # query = "图中的人在干嘛?"
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": query} # 填入问题
            ]}
        ]
        s1 = time()
        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        # apply_chat_template above just wraps the query in the chat text template (runs locally)
        # the call below is the processor's __call__, the focus of this post
        inputs = processor(image, input_text, return_tensors="pt").to(model.device)
        for k in inputs.data:  # dump every processor output tensor to disk
            save_tensor_to_txt(inputs.data[k], f'./data/tensor_output_0{k}.txt')  # author's helper; a sketch follows this listing
        print(processor.decode(inputs["input_ids"][0]))
        # the image-text fusion performed above is the tricky part
        output = model.generate(**inputs, max_new_tokens=1000)
        print(time() - s1)
        print(processor.decode(output[0]))

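
The helper save_tensor_to_txt called above is not part of transformers; it is the author's own utility for dumping tensors to text files so intermediate results can be inspected. Its implementation is not shown in the post, so here is a minimal sketch: the name and signature come from the call site, while the output format is an assumption.

import numpy as np
import torch

def save_tensor_to_txt(tensor: torch.Tensor, path: str) -> None:
    # One possible implementation: write the shape on the first line, then the flattened values.
    array = tensor.detach().cpu()
    if array.dtype == torch.bfloat16:  # NumPy has no bfloat16, so upcast first
        array = array.float()
    array = array.numpy()
    with open(path, "w") as f:
        f.write(f"# shape: {tuple(array.shape)}\n")
        np.savetxt(f, array.reshape(1, -1), fmt="%s")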

1. Text Preprocessing Code Overview

The code below is the core of the text preprocessing pipeline and the focus of this post; the sections that follow walk through it piece by piece.

class MllamaProcessor(ProcessorMixin):
    r"""
    Constructs a Mllama processor which wraps [`MllamaImageProcessor`] and
    [`PretrainedTokenizerFast`] into a single processor that inherits both the image processor and
    tokenizer functionalities. See the [`~MllamaProcessor.__call__`] and [`~OwlViTProcessor.decode`] for more
    information.
    The preferred way of passing kwargs is as a dictionary per modality, see usage example below.
        ```python
        from transformers import MllamaProcessor
        from PIL import Image

        processor = MllamaProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")

        processor(
            images=your_pil_image,
            text=["<|image|>If I had to write a haiku for this one"],
            images_kwargs = {"size": {"height": 448, "width": 448}},
            text_kwargs = {"padding": "right"},
            common_kwargs = {"return_tensors": "pt"},
        )
        ```

    Args:
        image_processor ([`MllamaImageProcessor`]):
            The image processor is a required input.
        tokenizer ([`PreTrainedTokenizer`, `PreTrainedTokenizerFast`]):
            The tokenizer is a required input.

    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "MllamaImageProcessor"
    tokenizer_class = "PreTrainedTokenizerFast"

    def __init__(self, image_processor, tokenizer):
        if not hasattr(tokenizer, "image_token"):
            self.image_token = "<|image|>"
            self.image_token_id = tokenizer.convert_tokens_to_ids(self.image_token)
        else:
            self.image_token = tokenizer.image_token
            self.image_token_id = tokenizer.image_token_id

        self.python_token = "<|python_tag|>"
        self.python_token_id = tokenizer.convert_tokens_to_ids(self.python_token)
        self.bos_token = tokenizer.bos_token
        self.chat_template = tokenizer.chat_template
        super().__init__(image_processor, tokenizer)

    def __call__(
        self,
        images: Optional[ImageInput] = None,
        text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        audio=None,
        videos=None,
        **kwargs: Unpack[MllamaProcessorKwargs],
    ) -> BatchFeature:
        """
        Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text`
        arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` arguments to
        MllamaImageProcessor's [`~MllamaImageProcessor.__call__`] if `images` is not `None`. Please refer
        to the docstring of the above two methods for more information.

        Args:
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:
                    - `'tf'`: Return TensorFlow `tf.constant` objects.
                    - `'pt'`: Return PyTorch `torch.Tensor` objects.
                    - `'np'`: Return NumPy `np.ndarray` objects.
                    - `'jax'`: Return JAX `jnp.ndarray` objects.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            TODO: add aspect_ratio_ids and aspect_ratio_mask and cross_attention_mask
        """
        # Validation fails if neither text nor images is provided
        if text is None and images is None:
            raise ValueError("You must specify either text or images.")

        # Merge per-modality kwargs (this is where return_tensors="pt" comes from)
        output_kwargs = self._merge_kwargs(
            MllamaProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )

        text_kwargs = output_kwargs["text_kwargs"]
        images_kwargs = output_kwargs["images_kwargs"]
        common_kwargs = output_kwargs["common_kwargs"]

        data = {}
        # 0. Validate the text input: it must be a str or a list/tuple of str
        if text is not None:
            if isinstance(text, str):
                text = [text]  # wrap the single string into a batch of one
            elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
                raise ValueError("Invalid input text. Please provide a string, or a list of strings")
            # n_images_in_text counts the <|image|> tokens in each text item so that the
            # pairing of images and texts can be validated: every text item that mentions
            # images must be matched by the corresponding number of images.
            n_images_in_text = [t.count(self.image_token) for t in text]
            text = [build_string_from_input(text_item, self.bos_token, self.image_token) for text_item in text]
            _ = text_kwargs.pop("padding_side", None)  # hack until padding-side is an accepted kwarg by tokenizers
            # 1. The PreTrainedTokenizerFast tokenizer turns the strings into token ids
            encoding = self.tokenizer(text, **text_kwargs)
            # The two entries of encoding (input_ids and attention_mask) have the same length.
            # Note the id at index 6 is 128256 = <|image|>, and 128000 (<|begin_of_text|>)
            # appears twice: the chat template contains one and the tokenizer prepends another.
            # {'input_ids': tensor([[128000, 128000, 128006,    882, 128007,    271, 128256,  29129, 105363,
            # 83687,  21043, 104194,  11571, 128009, 128006,  78191, 128007,    271]]),
            # 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
            data.update(encoding)

        n_images_in_images = [0]
        if images is not None:
            images = make_list_of_images(images)  # nested list: one sample with one image
            n_images_in_images = [len(sample) for sample in images]  # [1]: the sample has one image

        if text is not None:
            if any(batch_img == 0 for batch_img in n_images_in_text) and not all(
                batch_img == 0 for batch_img in n_images_in_text
            ):
                raise ValueError(
                    "If a batch of text is provided, there should be either no images or at least one image per sample"
                )
            if sum(n_images_in_images) != sum(n_images_in_text):
                if images is None:
                    raise ValueError("No image were provided, but there are image tokens in the prompt")
                else:
                    raise ValueError(
                        f"The number of image token ({sum(n_images_in_text)}) should be the same as in the number of provided images ({sum(n_images_in_images)})"
                    )

        if images is not None:
            # image_processor preprocesses the images and adds their features to the output.
            # self.image_processor is the image preprocessing pipeline analyzed in the previous post.
            image_features = self.image_processor(images, **images_kwargs)
            num_tiles = image_features.pop("num_tiles")
            data.update(image_features)

        # Create cross attention mask
        if images is not None and text is not None:
            cross_attention_token_mask = [
                get_cross_attention_token_mask(token_ids, self.image_token_id) for token_ids in encoding["input_ids"]
            ]  # which token range each <|image|> token attends over
            # 2. cross_attention_mask has shape (1, 18, 1, 4):
            # 1: the batch contains a single sample
            # 18: length = the longest sequence in the batch, 18 tokens
            # 1: each sample contains one image
            # 4: max_num_tiles = 4; each image may be split into up to 4 tiles for
            #    cross-attention, so each row has 4 entries, here [[1, 0, 0, 0]] (one tile used)
            cross_attention_mask = convert_sparse_cross_attention_mask_to_dense(
                cross_attention_token_mask,  # [[[6, -1]]]: from token index 6 through the last token
                num_tiles=num_tiles, # [[1]]
                max_num_tiles=self.image_processor.max_image_tiles, # 4
                length=max(len(input_ids) for input_ids in encoding["input_ids"]), # 18
            )
            data["cross_attention_mask"] = cross_attention_mask

        return_tensors = common_kwargs.pop("return_tensors", None)
        batch_feature = BatchFeature(data=data, tensor_type=return_tensors)

        return batch_feature  # Text preprocessing is done: we now have the token ids and all the masks

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    def post_process_image_text_to_text(self, generated_outputs):
        """
        Post-process the output of the model to decode the text.

        Args:
            generated_outputs (`torch.Tensor` or `np.ndarray`):
                The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
                or `(sequence_length,)`.

        Returns:
            `List[str]`: The decoded text.
        """
        return self.tokenizer.batch_decode(
            generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(tokenizer_input_names + image_processor_input_names + ["cross_attention_mask"])
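
Before dissecting the pieces, it helps to see what the processor returns end to end. The sketch below prints the output shapes for the run in this post (one 28×24 image plus the Chinese query); the shapes are inferred from this and the previous post's walkthroughs and should be treated as indicative.

inputs = processor(image, input_text, return_tensors="pt")
for name, value in inputs.items():
    print(name, tuple(value.shape))
# input_ids            (1, 18)
# attention_mask       (1, 18)
# pixel_values         (1, 1, 4, 3, 560, 560)
# aspect_ratio_ids     (1, 1)
# aspect_ratio_mask    (1, 1, 4)
# cross_attention_mask (1, 18, 1, 4)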

2. The MllamaProcessor Class

Below is the head of the class, listing the parameters it takes. The docstrings in this part of the source already explain each parameter carefully, and you can ask a large model such as ERNIE Bot (文心一言) to dig further into what any particular parameter does. The initialization also validates the inputs so that the parameters do not conflict with one another.

class MllamaProcessor(ProcessorMixin):
    r"""
    Constructs a Mllama processor which wraps [`MllamaImageProcessor`] and
    [`PretrainedTokenizerFast`] into a single processor that inherits both the image processor and
    tokenizer functionalities. See the [`~MllamaProcessor.__call__`] and [`~OwlViTProcessor.decode`] for more
    information.
    The preferred way of passing kwargs is as a dictionary per modality, see usage example below.
        ```python
        from transformers import MllamaProcessor
        from PIL import Image

        processor = MllamaProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")

        processor(
            images=your_pil_image,
            text=["<|image|>If I had to write a haiku for this one"],
            images_kwargs = {"size": {"height": 448, "width": 448}},
            text_kwargs = {"padding": "right"},
            common_kwargs = {"return_tensors": "pt"},
        )
        ```

    Args:
        image_processor ([`MllamaImageProcessor`]):
            The image processor is a required input.
        tokenizer ([`PreTrainedTokenizer`, `PreTrainedTokenizerFast`]):
            The tokenizer is a required input.

    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "MllamaImageProcessor"
    tokenizer_class = "PreTrainedTokenizerFast"

    def __init__(self, image_processor, tokenizer):
        if not hasattr(tokenizer, "image_token"):
            self.image_token = "<|image|>"
            self.image_token_id = tokenizer.convert_tokens_to_ids(self.image_token)
        else:
            self.image_token = tokenizer.image_token
            self.image_token_id = tokenizer.image_token_id

        self.python_token = "<|python_tag|>"
        self.python_token_id = tokenizer.convert_tokens_to_ids(self.python_token)
        self.bos_token = tokenizer.bos_token
        self.chat_template = tokenizer.chat_template
        super().__init__(image_processor, tokenizer)

    def __call__(
        self,
        images: Optional[ImageInput] = None,
        text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        audio=None,
        videos=None,
        **kwargs: Unpack[MllamaProcessorKwargs],
    ) -> BatchFeature:
        """
        Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text`
        arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` arguments to
        MllamaImageProcessor's [`~MllamaImageProcessor.__call__`] if `images` is not `None`. Please refer
        to the docstring of the above two methods for more information.

        Args:
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:
                    - `'tf'`: Return TensorFlow `tf.constant` objects.
                    - `'pt'`: Return PyTorch `torch.Tensor` objects.
                    - `'np'`: Return NumPy `np.ndarray` objects.
                    - `'jax'`: Return JAX `jnp.ndarray` objects.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            TODO: add aspect_ratio_ids and aspect_ratio_mask and cross_attention_mask
        """

3. Validating Text and Images

Verify that text and images are present in a valid format, then merge the per-modality keyword arguments; an illustration of the merge precedence follows the listing.

# Validation fails if neither text nor images is provided
if text is None and images is None:
    raise ValueError("You must specify either text or images.")

# Merge per-modality kwargs (this is where return_tensors="pt" comes from)
output_kwargs = self._merge_kwargs(
    MllamaProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    **kwargs,
)

text_kwargs = output_kwargs["text_kwargs"]
images_kwargs = output_kwargs["images_kwargs"]
common_kwargs = output_kwargs["common_kwargs"]
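
_merge_kwargs folds together three sources of options, later ones winning: the class-level _defaults, the tokenizer's init kwargs, and whatever the caller passes per modality. The pure-Python sketch below only illustrates that precedence; it is a simplification, not the actual code of ProcessorMixin._merge_kwargs.

# Illustrative only: mimics the precedence applied by ProcessorMixin._merge_kwargs
defaults = {"images_kwargs": {"max_image_tiles": 4}}        # class-level _defaults
tokenizer_init = {"text_kwargs": {"padding": False}}        # from tokenizer.init_kwargs
call_site = {"images_kwargs": {"max_image_tiles": 2},       # **kwargs at the call site
             "common_kwargs": {"return_tensors": "pt"}}

merged = {"text_kwargs": {}, "images_kwargs": {}, "common_kwargs": {}}
for source in (defaults, tokenizer_init, call_site):        # later sources win
    for modality, opts in source.items():
        merged[modality].update(opts)

print(merged["images_kwargs"])  # {'max_image_tiles': 2}: the call site overrides the default
print(merged["common_kwargs"])  # {'return_tensors': 'pt'}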

4. Validate the Strings, Then Tokenize

  • First validate the input text, making sure everything is of type str; then tokenize and merge the encoding into the data dict. A standalone demonstration of the BOS handling follows the listing.
if text is not None:
    if isinstance(text, str):
        text = [text]  # wrap the single string into a batch of one
    elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
        raise ValueError("Invalid input text. Please provide a string, or a list of strings")
    # n_images_in_text counts the <|image|> tokens in each text item so that the
    # pairing of images and texts can be validated: every text item that mentions
    # images must be matched by the corresponding number of images.
    n_images_in_text = [t.count(self.image_token) for t in text]
    text = [build_string_from_input(text_item, self.bos_token, self.image_token) for text_item in text]
    _ = text_kwargs.pop("padding_side", None)  # hack until padding-side is an accepted kwarg by tokenizers
    # 1. The PreTrainedTokenizerFast tokenizer turns the strings into token ids
    encoding = self.tokenizer(text, **text_kwargs)
    # The two entries of encoding (input_ids and attention_mask) have the same length.
    # Note the id at index 6 is 128256 = <|image|>, and 128000 (<|begin_of_text|>)
    # appears twice: the chat template contains one and the tokenizer prepends another.
    # {'input_ids': tensor([[128000, 128000, 128006,    882, 128007,    271, 128256,  29129, 105363,
    # 83687,  21043, 104194,  11571, 128009, 128006,  78191, 128007,    271]]),
    # 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
    data.update(encoding)
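
What did build_string_from_input do to the prompt? Here apply_chat_template has already inserted <|begin_of_text|>, so the prompt passes through unchanged; for a raw prompt it moves the BOS token behind any leading image tokens. A standalone demonstration (the function body is copied from the full source in section 8; the example strings are hypothetical):

def build_string_from_input(prompt: str, bos_token: str, image_token: str) -> str:
    # Copied from processing_mllama.py (see section 8)
    if bos_token in prompt:
        return prompt
    num_image_tokens_on_start = 0
    while prompt.startswith(image_token):
        prompt = prompt[len(image_token):]
        num_image_tokens_on_start += 1
    return f"{image_token * num_image_tokens_on_start}{bos_token}{prompt}"

print(build_string_from_input("<|image|>What digit is this?",
                              "<|begin_of_text|>", "<|image|>"))
# -> <|image|><|begin_of_text|>What digit is this?  (BOS goes after the leading image token)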

5. Checking the Image Count

Check that the number of <|image|> tokens in the text matches the number of images provided; an illustrative reproduction of the checks follows the listing.

n_images_in_images = [0]
if images is not None:
   images = make_list_of_images(images)  # nested list: one sample with one image
   n_images_in_images = [len(sample) for sample in images]  # [1]: the sample has one image

if text is not None:
   if any(batch_img == 0 for batch_img in n_images_in_text) and not all(
       batch_img == 0 for batch_img in n_images_in_text
   ):
       raise ValueError(
           "If a batch of text is provided, there should be either no images or at least one image per sample"
       )
   if sum(n_images_in_images) != sum(n_images_in_text):
       if images is None:
           raise ValueError("No image were provided, but there are image tokens in the prompt")
       else:
           raise ValueError(
               f"The number of image token ({sum(n_images_in_text)}) should be the same as in the number of provided images ({sum(n_images_in_images)})"
           )
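
Both error branches are easy to trigger. Below is an illustrative re-implementation of the two checks with hypothetical prompts (it mirrors the logic above but is not the library code):

image_token = "<|image|>"

def validate(texts, n_images_in_images):
    n_images_in_text = [t.count(image_token) for t in texts]
    # Check 1: a batch must reference images in every sample, or in none
    if any(n == 0 for n in n_images_in_text) and not all(n == 0 for n in n_images_in_text):
        raise ValueError("mixed batch: every sample must reference an image, or none may")
    # Check 2: the total number of <|image|> tokens must match the number of images
    if sum(n_images_in_images) != sum(n_images_in_text):
        raise ValueError(f"{sum(n_images_in_text)} image tokens vs {sum(n_images_in_images)} images")

validate(["<|image|>What digit is this?"], [1])   # passes
# validate(["<|image|>hi", "no image"], [1])      # raises: mixed batch
# validate(["<|image|><|image|>hi"], [1])         # raises: token/image count mismatch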

6. Image Preprocessing (See the Previous Post)

Recapping the previous post: this step produces the three image features, and num_tiles is popped off separately for the mask construction in the next step. Indicative shapes are sketched after the listing.

 if images is not None:
    # image_processor preprocesses the images and adds their features to the output.
    # self.image_processor is the image preprocessing pipeline analyzed in the previous post.
    image_features = self.image_processor(images, **images_kwargs)
    num_tiles = image_features.pop("num_tiles")
    data.update(image_features)
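
For the 28×24 MNIST crop used in this post, the returned image features should look roughly as follows. The shapes are based on the previous post's walkthrough and are indicative rather than measured:

image_features = processor.image_processor(image)
for name, value in image_features.items():
    print(name, getattr(value, "shape", value))
# pixel_values       (1, 1, 4, 3, 560, 560)   batch, images per sample, padded tiles, C, H, W
# aspect_ratio_ids   (1, 1)                   which canonical aspect ratio was selected
# aspect_ratio_mask  (1, 1, 4)                e.g. [[[1, 0, 0, 0]]]: only the first tile is real
# num_tiles          [[1]]                    the small image fits into a single tile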

7. Generating the cross_attention_token_mask

Generate the cross-attention masks, in preparation for the image-text fusion later in the model. A worked numeric example follows the listing.

# Create cross attention mask
if images is not None and text is not None:
    cross_attention_token_mask = [
        get_cross_attention_token_mask(token_ids, self.image_token_id) for token_ids in encoding["input_ids"]
    ]  # which token range each <|image|> token attends over
    # 2. cross_attention_mask has shape (1, 18, 1, 4):
    # 1: the batch contains a single sample
    # 18: length = the longest sequence in the batch, 18 tokens
    # 1: each sample contains one image
    # 4: max_num_tiles = 4; each image may be split into up to 4 tiles for
    #    cross-attention, so each row has 4 entries, here [[1, 0, 0, 0]] (one tile used)
    cross_attention_mask = convert_sparse_cross_attention_mask_to_dense(
        cross_attention_token_mask,  # [[[6, -1]]]: from token index 6 through the last token
        num_tiles=num_tiles, # [[1]]
        max_num_tiles=self.image_processor.max_image_tiles, # 4
        length=max(len(input_ids) for input_ids in encoding["input_ids"]), # 18
    )
    data["cross_attention_mask"] = cross_attention_mask

return_tensors = common_kwargs.pop("return_tensors", None)
batch_feature = BatchFeature(data=data, tensor_type=return_tensors)

return batch_feature  # Text preprocessing is done: we now have the token ids and all the masks
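
To see these numbers fall out, here is a self-contained reproduction using the two helper functions from this file (importable from transformers.models.mllama.processing_mllama in any transformers release that ships Mllama), fed with the exact input_ids from this run:

from transformers.models.mllama.processing_mllama import (
    get_cross_attention_token_mask,
    convert_sparse_cross_attention_mask_to_dense,
)

input_ids = [128000, 128000, 128006, 882, 128007, 271, 128256, 29129, 105363,
             83687, 21043, 104194, 11571, 128009, 128006, 78191, 128007, 271]

token_mask = get_cross_attention_token_mask(input_ids, image_token_id=128256)
print(token_mask)   # [[6, -1]]: the single <|image|> token attends from index 6 to the end

dense = convert_sparse_cross_attention_mask_to_dense(
    [token_mask], num_tiles=[[1]], max_num_tiles=4, length=len(input_ids),
)
print(dense.shape)  # (1, 18, 1, 4)
print(dense[0, 6])  # [[1 0 0 0]]: only the first of the 4 tile slots is active
print(dense[0, 5])  # [[0 0 0 0]]: tokens before <|image|> do not attend to the image

# With several image tokens, the sparse mask splits the sequence between them
# (9 stands in for the image token id in these toy sequences):
print(get_cross_attention_token_mask([9, 1, 2, 9, 3], image_token_id=9))  # [[0, 3], [3, 5]]
print(get_cross_attention_token_mask([9, 9, 1, 2, 3], image_token_id=9))  # [[0, 5], [1, 5]]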

8. Full Source Code of processing_mllama.py

Below is the complete source file as shipped upstream.

# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Processor class for Mllama."""

from typing import List, Optional, Union  # type hints

import numpy as np  # numerical utilities

from ...feature_extraction_utils import BatchFeature  # container for batched model inputs
from ...image_utils import ImageInput  # image input type
from ...processing_utils import ImagesKwargs, ProcessingKwargs, ProcessorMixin, Unpack  # processing helper classes
from ...tokenization_utils_base import (
    PreTokenizedInput,  # pre-tokenized input type
    TextInput,  # text input type
)

# TODO: Can we do it that way or its better include as "Copied from ..."
from .image_processing_mllama import make_list_of_images  # helper that normalizes image inputs to nested lists


# MllamaImagesKwargs extends ImagesKwargs with extra image-processing parameters
class MllamaImagesKwargs(ImagesKwargs, total=False):
    max_image_tiles: Optional[int]  # optional: maximum number of tiles per image


# MllamaProcessorKwargs extends ProcessingKwargs with the processor's extra parameters
class MllamaProcessorKwargs(ProcessingKwargs, total=False):
    images_kwargs: MllamaImagesKwargs  # image-processing parameters

    _defaults = {  # default processing parameters
        "image_kwargs": {
            "max_image_tiles": 4,  # at most 4 tiles per image by default
        },
    }


def get_cross_attention_token_mask(input_ids: List[int], image_token_id: int) -> List[List[int]]:
    """
    Generate a cross-attention token mask for image tokens in the input sequence.

    Args:
        input_ids (List[int]): A list of token ids representing the input sequence.
        image_token_id (int): The id of the token used to represent images in the sequence.

    Returns:
        List[List[int]]: A list of [start, end] pairs, where each pair represents the range
        of tokens an image token should attend to.
    """
    # Find the positions of all image tokens in the input
    image_token_locations = [i for i, token in enumerate(input_ids) if token == image_token_id]

    if len(image_token_locations) == 0:  # no image tokens at all
        return []

    # A single image token attends through to the end of the sequence
    if len(image_token_locations) == 1:
        return [[image_token_locations[0], -1]]

    # Several image tokens: each attends up to the next image token (or the end)
    vision_masks = [[loc1, loc2] for loc1, loc2 in zip(image_token_locations[:-1], image_token_locations[1:])]

    # The last image token attends over all following text tokens
    vision_masks.append([image_token_locations[-1], len(input_ids)])

    # Consecutive image tokens should jointly attend over all following text
    last_mask_end = vision_masks[-1][1]
    for vision_mask in vision_masks[::-1]:
        if vision_mask[0] == vision_mask[1] - 1:
            vision_mask[1] = last_mask_end
        last_mask_end = vision_mask[1]

    return vision_masks


def convert_sparse_cross_attention_mask_to_dense(
    cross_attention_token_mask: List[List[List[int]]],
    num_tiles: List[List[int]],
    max_num_tiles: int,
    length: int,
) -> np.ndarray:
    """
    Convert the cross attention mask indices to a cross attention mask 4D array.

    Args:
        cross_attention_token_mask (List[List[List[int]]]): A nested list structure where:
            - The outer list represents the batch dimension.
            - The middle list represents different images within each batch item.
            - The inner list contains pairs of integers [start, end] representing token ranges for each image.
        num_tiles (List[List[int]]): A nested list structure specifying the number of tiles for each image in each batch item.
        max_num_tiles (int): The maximum possible number of tiles.
        length (int): The total sequence length of the input.

    Returns:
        np.ndarray: A 4D numpy array of shape (batch_size, length, max_num_images, max_num_tiles)
            The array contains `1` where attention is allowed and `0` where it is not.
    """

    batch_size = len(cross_attention_token_mask)  # batch size
    max_num_images = max([len(masks) for masks in cross_attention_token_mask])  # max images per sample

    cross_attention_mask = np.zeros(
        shape=(batch_size, length, max_num_images, max_num_tiles),  # initialize the mask with zeros
        dtype=np.int64,  # 64-bit integer type
    )

    # Walk over each sample and its masks, setting the allowed positions to 1
    for sample_idx, (sample_masks, sample_num_tiles) in enumerate(zip(cross_attention_token_mask, num_tiles)):
        for mask_idx, (locations, mask_num_tiles) in enumerate(zip(sample_masks, sample_num_tiles)):
            if len(locations) == 2:
                start, end = locations
                end = min(end, length)  # clamp the end to the sequence length
                if end == -1:  # -1 means attend through to the end of the sequence
                    end = length
                cross_attention_mask[sample_idx, start:end, mask_idx, :mask_num_tiles] = 1  # fill the mask

    return cross_attention_mask  # the dense 4D cross-attention mask


def build_string_from_input(prompt: str, bos_token: str, image_token: str) -> str:
    """
    Builds a string from the input prompt by adding `bos_token` if not already present.

    Args:
        prompt (`str`): The input prompt string.
        bos_token (`str`): The beginning of sentence token to be added.
        image_token (`str`): The image token used to identify the start of an image sequence.

    Returns:
        str: The modified prompt string with the `bos_token` added if necessary.
    """

    if bos_token in prompt:  # bos_token already present: return the prompt unchanged
        return prompt

    num_image_tokens_on_start = 0
    while prompt.startswith(image_token):  # count the image tokens at the start of the prompt
        prompt = prompt[len(image_token) :]
        num_image_tokens_on_start += 1

    return f"{image_token * num_image_tokens_on_start}{bos_token}{prompt}"  # insert bos_token after the leading image tokens


class MllamaProcessor(ProcessorMixin):
    r"""
    Constructs a Mllama processor which wraps [`MllamaImageProcessor`] and
    [`PretrainedTokenizerFast`] into a single processor that inherits both the image processor and
    tokenizer functionalities. See the [`~MllamaProcessor.__call__`] and [`~OwlViTProcessor.decode`] for more
    information.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "MllamaImageProcessor"
    tokenizer_class = "PreTrainedTokenizerFast"

    def __init__(self, image_processor, tokenizer):
        # Initialization: make sure the tokenizer knows the image token and its id
        if not hasattr(tokenizer, "image_token"):
            self.image_token = "<|image|>"
            self.image_token_id = tokenizer.convert_tokens_to_ids(self.image_token)
        else:
            self.image_token = tokenizer.image_token
            self.image_token_id = tokenizer.image_token_id

        self.python_token = "<|python_tag|>"
        self.python_token_id = tokenizer.convert_tokens_to_ids(self.python_token)
        self.bos_token = tokenizer.bos_token
        self.chat_template = tokenizer.chat_template
        super().__init__(image_processor, tokenizer)  # call the parent constructor

    def __call__(
        self,
        images: Optional[ImageInput] = None,
        text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        audio=None,
        videos=None,
        **kwargs: Unpack[MllamaProcessorKwargs],
    ) -> BatchFeature:
       """
        Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text`
        arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` arguments to
        MllamaImageProcessor's [`~MllamaImageProcessor.__call__`] if `images` is not `None`. Please refer
        to the docstring of the above two methods for more information.

        Args:
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:
                    - `'tf'`: Return TensorFlow `tf.constant` objects.
                    - `'pt'`: Return PyTorch `torch.Tensor` objects.
                    - `'np'`: Return NumPy `np.ndarray` objects.
                    - `'jax'`: Return JAX `jnp.ndarray` objects.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            TODO: add aspect_ratio_ids and aspect_ratio_mask and cross_attention_mask
        """
        # Validation fails if neither text nor images is provided
        if text is None and images is None:
            raise ValueError("You must specify either text or images.")

        # Merge per-modality kwargs (this is where return_tensors="pt" comes from)
        output_kwargs = self._merge_kwargs(
            MllamaProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )

        text_kwargs = output_kwargs["text_kwargs"]
        images_kwargs = output_kwargs["images_kwargs"]
        common_kwargs = output_kwargs["common_kwargs"]

        data = {}
        # 0. Validate the text input: it must be a str or a list/tuple of str
        if text is not None:
            if isinstance(text, str):
                text = [text]  # wrap the single string into a batch of one
            elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
                raise ValueError("Invalid input text. Please provide a string, or a list of strings")
            # n_images_in_text counts the <|image|> tokens in each text item so that the
            # pairing of images and texts can be validated: every text item that mentions
            # images must be matched by the corresponding number of images.
            n_images_in_text = [t.count(self.image_token) for t in text]
            text = [build_string_from_input(text_item, self.bos_token, self.image_token) for text_item in text]
            _ = text_kwargs.pop("padding_side", None)  # hack until padding-side is an accepted kwarg by tokenizers
            # 1. The PreTrainedTokenizerFast tokenizer turns the strings into token ids
            encoding = self.tokenizer(text, **text_kwargs)
            # The two entries of encoding (input_ids and attention_mask) have the same length.
            # Note the id at index 6 is 128256 = <|image|>, and 128000 (<|begin_of_text|>)
            # appears twice: the chat template contains one and the tokenizer prepends another.
            # {'input_ids': tensor([[128000, 128000, 128006,    882, 128007,    271, 128256,  29129, 105363,
            # 83687,  21043, 104194,  11571, 128009, 128006,  78191, 128007,    271]]),
            # 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
            data.update(encoding)

        n_images_in_images = [0]
        if images is not None:
            images = make_list_of_images(images)  # nested list: one sample with one image
            n_images_in_images = [len(sample) for sample in images]  # [1]: the sample has one image

        if text is not None:
            if any(batch_img == 0 for batch_img in n_images_in_text) and not all(
                batch_img == 0 for batch_img in n_images_in_text
            ):
                raise ValueError(
                    "If a batch of text is provided, there should be either no images or at least one image per sample"
                )
            if sum(n_images_in_images) != sum(n_images_in_text):
                if images is None:
                    raise ValueError("No image were provided, but there are image tokens in the prompt")
                else:
                    raise ValueError(
                        f"The number of image token ({sum(n_images_in_text)}) should be the same as in the number of provided images ({sum(n_images_in_images)})"
                    )

        if images is not None:
            # image_processor preprocesses the images and adds their features to the output.
            # self.image_processor is the image preprocessing pipeline analyzed in the previous post.
            image_features = self.image_processor(images, **images_kwargs)
            num_tiles = image_features.pop("num_tiles")
            data.update(image_features)

        # Create cross attention mask
        if images is not None and text is not None:
            cross_attention_token_mask = [
                get_cross_attention_token_mask(token_ids, self.image_token_id) for token_ids in encoding["input_ids"]
            ]  # which token range each <|image|> token attends over
            # 2. cross_attention_mask has shape (1, 18, 1, 4):
            # 1: the batch contains a single sample
            # 18: length = the longest sequence in the batch, 18 tokens
            # 1: each sample contains one image
            # 4: max_num_tiles = 4; each image may be split into up to 4 tiles for
            #    cross-attention, so each row has 4 entries, here [[1, 0, 0, 0]] (one tile used)
            cross_attention_mask = convert_sparse_cross_attention_mask_to_dense(
                cross_attention_token_mask,  # [[[6, -1]]]: from token index 6 through the last token
                num_tiles=num_tiles, # [[1]]
                max_num_tiles=self.image_processor.max_image_tiles, # 4
                length=max(len(input_ids) for input_ids in encoding["input_ids"]), # 18
            )
            data["cross_attention_mask"] = cross_attention_mask

        return_tensors = common_kwargs.pop("return_tensors", None)
        batch_feature = BatchFeature(data=data, tensor_type=return_tensors)

        return batch_feature  # Text preprocessing is done: we now have the token ids and all the masks

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    def post_process_image_text_to_text(self, generated_outputs):
        """
        Post-process the output of the model to decode the text.

        Args:
            generated_outputs (`torch.Tensor` or `np.ndarray`):
                The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
                or `(sequence_length,)`.

        Returns:
            `List[str]`: The decoded text.
        """
        return self.tokenizer.batch_decode(
            generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(tokenizer_input_names + image_processor_input_names + ["cross_attention_mask"])
