了解图像补丁嵌入，从简单的展开到 2D 卷积

本文链接：https://blog.youkuaiyun.com/Java_rich/article/details/148170175

transformer 架构如此强大，因为它不会在文本、图像或任何数据及其组合之间产生任何差异。“Attention” 模型计算序列中每个标记之间的自相似性，允许汇总和生成任何类型的数据。Vision Transformer 通过将图像分解为二次色块来实现这一点，然后将其展平为单个矢量嵌入。此时，可以像处理文本嵌入（或任何其他嵌入）一样处理它们，甚至可以与其他数据类型连接。通常，创建 patchs 的步骤与使用 2D 卷积的第一个可学习的非线性转换相结合，这可能很难解包。本文将深入探讨这一步。本文的编写目的是，您可以通过将代码片段复制并粘贴到 colab 工作表中来进行作。

数据

我们将使用 MNIST 数据集，这是一组手写数字，通常用于训练基本图像分类器。MNIST 镜像在 Torch 中可用，并且可以方便地使用 DataLoader 类加载：

from torchvision.datasets.mnist import MNIST
from torch.utils.data import DataLoader
import torchvision.transforms as T
import torch

torch.manual_seed(42)

img_size = (32,32) # We will resize MNIST images to this size
batch_size = 4

transform = T.Compose([
  T.ToTensor(),
  T.Resize(img_size)
])

train_set = MNIST(
  root="./../datasets", train=True, download=True, transform=transform
  )

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)

batch = next(iter(train_loader)) # loads the first batch

该代码下载 MNIST 数据集，提供 Torch 转换，将图像转换为 Torch 张量并将其大小调整为 32x32。然后，我们使用 DataLoader 类加载一批 batch_size = 4 张图像。我们使用 torch.manual_seed 将随机生成器初始化为相同的值，从而允许您在笔记本中看到与此处相同的图像。

您可以使用 matplotlib 可视化由四个图像和四个标签组成的批处理：

import matplotlib.pyplot as plt

# batch[0] contains the images and batch[1] the labels
images = batch[0]
labels = batch[1]

# Create a figure and axes for the subplots
fig, axes = plt.subplots(1, batch_size, figsize=(12, 4))

# Iterate through the batch of images and labels
for i in range(batch_size):
  # Convert the image tensor to a NumPy array and remove the channel dimension if it's a grayscale image
  image_np = images[i].numpy().squeeze()

  # Display the image in the corresponding subplot
  axes[i].imshow(image_np, cmap='gray')  # Use 'gray' cmap for grayscale images
  axes[i].set_title(f"Class: {labels[i].item()}") # Assuming labels are tensors, use .item() to get the value
  axes[i].axis('off')

# Adjust the spacing between subplots
plt.tight_layout()

# Display the plot
plt.show()

创建映像修补程序

使用 Transformer 神经网络处理图像时，第一步是将其分解为多个块。在这种情况下，我们可以将 32x32 图像分成 64 个 4x4 （16）块、16 个 8x8 （64）块或 8 个 16x16 （256）页面：

虽然我们以二次形式显示这些补丁，但我们也可以将它们存储在维数为 16、64 或 256 的列向量中。此时，它们已经与文本嵌入无法区分，并且它们的序列就像一串字符或单词一样。

以下是使用 Torch 的 unfold 运算符将图像分开的代码：

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Image and patch sizes
img_size = (32, 32)
patch_size = (8, 8)
n_channels = 1

image = batch[0][1].unsqueeze(0)

# Patch Class
class Patch(nn.Module):
    def __init__(self, img_size, patch_size, n_channels):
        super().__init__()
        self.patch_size = patch_size
        self.n_channels = n_channels

    def forward(self, x): # B x C x H X W
        x = x.unfold(
            2, self.patch_size[0], self.patch_size[0]
            ).unfold(
                3,self.patch_size[1],self.patch_size[1]
                )  # (B, C, P_row, P_col, P_height, P_width)
        x = x.flatten(2)  #(B, C, P_row*P_col*P_height*P_width)
        x = x.transpose(1, 2)  # (B,  P_row*P_col*P_height*P_width, C)
        return x

# Instantiate model
patch = Patch(img_size, patch_size, n_channels)

# Extract patches
with torch.no_grad():
    patches = patch(image) 

# Visualize
patches = patches.squeeze(0)  # Remove batch dimension -> (P, d_model)
patches = patches.view(-1, patch_size[0], patch_size[1]) # reshape back into 8x8

npatches = img_size[0] // patch_size[0]
# Plot patches
fig, axs = plt.subplots(npatches, npatches, figsize=(6, 6))  # 4x4 grid for (32x32) -> 16 patches

for i in range(npatches):
    for j in range(npatches):
        patch_idx = i * npatches + j  # Patch index
        axs[i, j].imshow(patches[patch_idx], cmap="gray", vmin=0, vmax=1)
        axs[i, j].axis("off")

plt.show()

该作像往常一样发生在我们从 nn 派生的 Patch 类的 forward 方法中。模块，通过先沿高度维度展开，然后沿宽度维度展开来作。所有作都在注释中显示它们的维度，其中 B 代表批处理，C 代表通道数（在本例中为 1），H 代表高度，W 代表宽度。展开后，我们从存储图像数据的第二维开始展平张量，最后转置它，使颜色通道排在最后。

代码的其余部分实例化该类，转换图像并对其进行可视化。请注意，我们需要删除批量维度，然后将 1D 图像数据转换回 2D 张量才能正确显示它们。

创建 Patch 嵌入

您可能已经注意到，这种方法在某种程度上将嵌入的维度限制为原始图像维度的倍数。这可以通过使用线性投影进行展开作来更改，从而创建可学习的嵌入。

使用单位矩阵进行线性变换后的修补（左）、使用随机权重的线性变换后（中）以及使用随机权重和偏差项进行线性变换后的修补。

此外，这些嵌入已转换回 2D 张量以进行可视化，并说明了线性项目如何按补丁运行。初始化 nn.以单位矩阵为权重的线性类显示原始数据被保留。使用随机权重，我们可以看到图像中值为零的部分保持不变。最后，我们可以添加一个偏差项来表明转换确实对每个 patch 的影响相同 — 所有空 patch 都显示完全相同的偏差。

这是新类，现在称为 PatchEmbedding，并带有一行代码来实例化它。请注意，我们引入了新的变量 d_model，这是输出嵌入的所需维数。现在可以是任何数字。我们在这里选择了 d_model=64，因为这是上图的设置，但不再有限制。

class PatchEmbedding(nn.Module):
    def __init__(self, img_size, patch_size, n_channels, d_model):
        super().__init__()
        self.patch_size = patch_size
        self.n_channels = n_channels
        self.d_model = d_model

        # Linear projection layer to map each patch to d_model
        self.linear_proj = nn.Linear(patch_size[0] * patch_size[1] * n_channels, d_model,bias=False)
        # The next two lines are unnecessary, but help to visualize that the linear 
        # projection operates along the correct dimensions
        #with torch.no_grad():
        #  self.linear_proj.weight.copy_(torch.eye(self.linear_proj.weight.shape[0]))
       
    def forward(self, x): # B x C x H X W

        x = x.unfold(
            2, self.patch_size[0], self.patch_size[0]
            ).unfold(
                3,self.patch_size[1],self.patch_size[1]
                )  # (B, C, P_row, P_col, P_height, P_width)
        
        B, C, P_row, P_col, P_height, P_width = x.shape
        x = x.reshape(B,C,P_row*P_col,P_height*P_width)
        x = self.linear_proj(x)  # (B*N, d_model)
  
        
        x = x.flatten(2)  #(B, C, P_row*P_col*P_height*P_width)
        x = x.transpose(1, 2)  # (B,  P_row*P_col*P_height*P_width, C)

        x = x.view(B, -1, self.d_model)
      
        return x

d_model = 64
# Instantiate model
patch = PatchEmbedding(img_size, patch_size, n_channels, d_model)

事实上，只要维度是二次的，我们仍然可以可视化结果，如下所示 d_model=2 和 d_model=2500 的输出：

我们可以看到，非线性变换，一个完全连接的神经网络，它接受从 8x8 （64）到 d_model 的输入，可以包含相当多的可学习参数，从左侧的 64x4 （256）到右侧的 64x2500 （160k）。您可以使用

def count_parameters（model）：
返回 sum（p.numel（） for p in model.参数 （） 如果 p.requires_grad）count_parameters

（补丁）

使用 2D 卷积创建 Patch Embeddings

您可能已经注意到，展开运算符非常笨拙，如果不是完全令人讨厌的话。有一种更简单的方法可以将展开和线性变换结合起来，即使用与所需补丁大小相对应的内核大小和步幅长度执行 2D 卷积。这样，卷积不会逐个像素地作，而是逐个补丁地作，从而产生与 unfold 与 nn 组合时相同的结果。线性：

使用 2D 卷积将创建补丁和线性转换结合在一个步骤中。16x16 色块嵌入到 4、64 和 2500 维度中（顶行），8x8 色块嵌入到 4、64 和 2500 维度中（底行）。

这是修改后的 PatchEmbedding 类：

类 PatchEmbedding（nn.模块）：
 def __init__（self， img_size， patch_size， n_channels， d_model）：
 super（）.__init__（）
 self.patch_size = patch_size
 self.n_channels = n_channels
 self.d_model = d_model # 扁平化的补丁大小

 # Conv2d 提取补丁
 self.linear_project = nn.Conv2d（
 in_channels=n_channels，
 out_channels=self.d_model， # 每个补丁被展平为d_model
 kernel_size=patch_size，
 stride=patch_size，
 bias=False
 ） def

 forward（self， x）：
 x = self.linear_project（x） # （B， d_model， P_row， P_col）
 x = x.flatten（2） # （B， d_model， P_row * P_col） -> （B， d_model， P）
 x = x.transpose（1， 2） # （B， P， d_model）
 返回 x

请注意，您可以将上述任何 patch 嵌入馈送到 vision transformer。使用 2D 卷积来执行此作是最通用和最紧凑的表示形式。请注意，卷积每个维度使用一个专用内核，而到目前为止，我们一直对每个补丁使用相同的内核。

我们可以说明这一点，并通过初始化其内核权重来测试卷积不会做任何有趣的事情，以便每个内核每个补丁只提取一个像素。下面的代码适用于补丁大小（8,8）和结果d_model=64。将其添加到 PatchEmbedding 类的 __init__ 方法的末尾：

"""Initialize Conv2d to extract patches without transformation."""
        with torch.no_grad():
            identity_kernel = torch.zeros(
                self.d_model, self.n_channels, *self.patch_size
            )  # Shape: (64, 1, 8, 8)

            for i in range(self.d_model):  
                row = i // self.patch_size[1]  # Row index in the patch
                col = i % self.patch_size[1]   # Column index in the patch
                identity_kernel[i, 0, row, col] = 1  # Place a 1 at the correct pixel position

            self.linear_project.weight.copy_(identity_kernel)

如您所见，identity_kernel张量维护d_model条目，每个维度一个条目，并且每个补丁只有一个像素设置为 1，因此仅提取该像素。一种更简单的方法是简单地将 d_model x d_model 单位矩阵转换为 d_model patch_size 矩阵：

identity_matrix = torch.eye(self.d_model)
        identity_kernel = identity_matrix.view(d_model, 1, *patch_size)  # Shape: (64, 1, 8, 8)

        with torch.no_grad():
             self.linear_project.weight.copy_(identity_kernel)

两种方法都有相同的结果，但第一种方法更清楚地说明了实际发生的情况：每个内核都是一个 0 矩阵，只有一个条目是 1。

无论您是使用线性变换还是小内核集合，两者都具有相同数量的参数。您可以通过检查两个 patch 嵌入的数据结构来了解这一点：

PatchEmbedding（
 （linear_proj）： 线性（in_features=64， out_features=64， bias=False）
)

PatchEmbedding（
 （linear_project）： Conv2d（1， 64， kernel_size=（8， 8）， stride=（8， 8）， bias=False）
)

其中一个只是 64x64 矩阵（4096 个参数）。另一个由 64 个 8x8 矩阵组成，也由 4096 个参数组成。

了解了将图像分解为补丁嵌入序列的不同方式后，您现在可以通过 transformer 编码器（也称为 “vision transformer”）运行它们：

有以下论文写作问题的可以扫下方名片加老师详聊

前沿顶会、期刊论文、综述文献浩如烟海，不知道学习路径，无从下手？

没时间读、不敢读、不愿读、读得少、读不懂、读不下去、读不透彻一篇完整的论文？

CVPR、ICCV、ECCV、ICLR、NeurlPS、AAAI……想发表顶会论文，找不到创新点？

读完论文，仍旧无法用代码复现……

然而，导师时常无法抽出时间指导，想写论文却无人指点……