MLP-Mixer Explained

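MLP-Mixer is a computer vision architecture proposed by Google that replaces the convolutions of traditional CNNs and the self-attention of Transformers with multi-layer perceptrons. The model includes a Per-patch Fully-connected layer and Mixer Layers, which use MLPs to mix information across the spatial and channel dimensions. Despite its simple structure, its performance on ImageNet is close to SOTA. However, the Per-patch Fully-connected layer can be viewed as a convolution, and the Mixer Layer resembles a depthwise separable convolution, which has sparked debate over whether the model truly abandons convolution.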


Paper: "MLP-Mixer: An all-MLP Architecture for Vision"

1 Main Idea

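MLP-Mixer is a CV architecture recently proposed by Google's ViT team. It uses multi-layer perceptrons (MLPs) in place of the convolution (Conv) in traditional CNNs and the self-attention mechanism (Self-Attention) in Transformers.

The overall design of MLP-Mixer is simple, and its performance on ImageNet is close to that of several recent SOTA models.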

2 Model Structure

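MLP-Mixer consists of three main parts: Per-patch Fully-connected, Mixer Layer, and a classifier.

The classifier is the conventional combination of global average pooling (GAP), a fully-connected layer (FC), and Softmax, so it is not covered further here. The rest of this section explains the first two parts.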
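For reference, the GAP + FC + Softmax head can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `classify`, the shapes (196 patches, 8 channels, 10 classes), and the random weights are assumptions made for the example, not values from the paper.

```python
import numpy as np

def classify(tokens, w, b):
    """tokens: (S, C) output of the last Mixer Layer (hypothetical shapes).
    w: (C, num_classes), b: (num_classes,)."""
    pooled = tokens.mean(axis=0)       # global average pooling over the S patches
    logits = pooled @ w + b            # fully-connected layer
    logits = logits - logits.max()     # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()             # softmax probabilities

# Illustrative usage with random values
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 8))
w = rng.standard_normal((8, 10))
b = np.zeros(10)
probs = classify(tokens, w, b)
```

Averaging over the patch axis makes the head independent of the number of patches, so only the channel width has to match the FC layer.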

2.1 Per-patch Fully-connected

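Unlike Conv, an FC layer cannot capture information between local regions. To address this, MLP-Mixer uses the Per-patch Fully-connected layer to convert the input image into a 2D table, which makes it easier to mix information between local regions in later layers.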
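The conversion of an image into a 2D table can be sketched as follows. The sketch assumes a channels-last image, a single weight matrix shared by all patches, and no bias; the helper names `per_patch_fc` and `strided_conv` are hypothetical. It also checks numerically that applying the shared FC to each patch gives the same result as a convolution whose kernel size and stride both equal the patch size, which is the basis of the "is it really convolution-free" debate.

```python
import numpy as np

def per_patch_fc(img, weight, patch):
    """img: (H, W, C); weight: (patch*patch*C, D), shared by every patch."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    # Split into a (gh, gw) grid of patches, then flatten each patch to a row
    patches = img.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * C)
    return patches @ weight            # 2D table: (number of patches, D)

def strided_conv(img, weight, patch):
    """The same weights viewed as D kernels of size patch x patch, stride = patch."""
    H, W, C = img.shape
    D = weight.shape[1]
    kernels = weight.reshape(patch, patch, C, D)
    out = np.zeros((H // patch, W // patch, D))
    for i in range(H // patch):
        for j in range(W // patch):
            window = img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch, :]
            out[i, j] = np.einsum('hwc,hwcd->d', window, kernels)
    return out.reshape(-1, D)

# Illustrative sizes: 8x8 RGB image, 4x4 patches, 5 output channels
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8, 3))
weight = rng.standard_normal((4 * 4 * 3, 5))
table = per_patch_fc(img, weight, patch=4)
conv = strided_conv(img, weight, patch=4)
same = np.allclose(table, conv)
```

Each row of `table` corresponds to one patch and each column to one channel, which is the layout the Mixer Layers operate on.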

Specifically, MLP-Mixer divides the input image into S adjacent, non-overlapping patches, each Pat

### MLP-Mixer Model Architecture and Implementation

The **MLP-Mixer** is a neural network architecture that relies solely on multi-layer perceptrons (MLPs) for both spatial mixing and channel mixing, without using convolutional layers or self-attention mechanisms. This design choice makes it an interesting alternative to traditional architectures like Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs). Below are the key aspects of its structure.

#### Key Components of MLP-Mixer

1. **Token Mixing**: The first MLP within each block is applied across all input patches (tokens), performing spatial mixing by treating the patches as a sequence.
2. **Channel Mixing**: After token mixing, a second MLP operates independently on each patch's channels, transforming features at the level of the individual patch.

Alternating between token mixing and channel mixing allows the model to capture relationships among different parts of an image while remaining computationally efficient compared to models that rely heavily on attention-based computations[^2].

#### Mathematical Representation

Let \( x \in \mathbb{R}^{N \times D} \) be the input tensor, where \( N \) is the number of image patches (tokens) and \( D \) is their dimensionality after the embedding projection. Each mixer block can be written as:

\[ y = x + \left(\text{MLP}_{\text{token}}\left(\text{LayerNorm}(x)^{\top}\right)\right)^{\top} \]

\[ z = y + \text{MLP}_{\text{channel}}(\text{LayerNorm}(y)) \]

Here,

- \( \text{MLP}_{\text{token}} \): applies transformations along the token axis (\( N \)), which is why its input is transposed.
- \( \text{MLP}_{\text{channel}} \): operates over the channel dimension (\( D \)).

Both operations include residual connections, which help mitigate vanishing gradients during training.

#### Code Example in Python Using PyTorch

The following shows how such blocks might be implemented in PyTorch.

```python
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange  # layer version of einops.rearrange


class MixerBlock(nn.Module):
    def __init__(self, num_patches, hidden_dim, token_dim, channel_dim):
        super().__init__()
        # Token mixing: transpose so the MLP acts along the patch axis
        self.token_mlp_block = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            Rearrange('b n d -> b d n'),
            nn.Linear(num_patches, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, num_patches),
            Rearrange('b d n -> b n d'),
        )
        # Channel mixing: the MLP acts on each patch's channels
        self.channel_mlp_block = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, channel_dim),
            nn.GELU(),
            nn.Linear(channel_dim, hidden_dim),
        )

    def forward(self, x):
        out = x + self.token_mlp_block(x)        # residual around token mixing
        out = out + self.channel_mlp_block(out)  # residual around channel mixing
        return out


class MLPMixer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768, num_blocks=8,
                 token_dim=384, channel_dim=3072, num_classes=1000):
        super().__init__()
        assert img_size % patch_size == 0, 'Image size must be divisible by patch size'
        # Per-patch FC, written as a convolution with kernel size = stride = patch size
        self.patch_embed = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                                     kernel_size=(patch_size, patch_size),
                                     stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.mixer_blocks = nn.Sequential(*[
            MixerBlock(num_patches=num_patches, hidden_dim=embed_dim,
                       token_dim=token_dim, channel_dim=channel_dim)
            for _ in range(num_blocks)
        ])
        self.layernorm = nn.LayerNorm(embed_dim)
        self.classifier_head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                          # (B, D, H/P, W/P)
        x = x.flatten(start_dim=2).transpose(-1, -2)     # (B, N, D)
        x = self.mixer_blocks(x)
        x = self.layernorm(x)
        x = x.mean(dim=1)                                # global average pooling
        output = self.classifier_head(x)
        return output
```

In the snippet above, the `Rearrange` layer from the `einops` library is used alongside standard PyTorch modules to transpose the tensor between the token axis and the channel axis inside `MixerBlock`.
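The two mixing steps described above can also be traced on a single sample with plain NumPy, which makes the role of the transpose explicit. This is a simplified sketch under stated assumptions: biases and the learnable LayerNorm scale and shift are omitted, GELU uses the tanh approximation, and the names `mixer_block`, `tw1`, `tw2`, `cw1`, `cw2` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise over the channel axis (no learnable scale/shift in this sketch)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, w1, w2):
    # Two linear maps with a GELU in between, applied to the last axis
    return gelu(x @ w1) @ w2

def mixer_block(x, tw1, tw2, cw1, cw2):
    """x: (N, D). tw1: (N, Ht), tw2: (Ht, N), cw1: (D, Hc), cw2: (Hc, D)."""
    # Token mixing: transpose to (D, N) so the MLP mixes across patches
    y = x + mlp(layer_norm(x).T, tw1, tw2).T
    # Channel mixing: the MLP mixes across channels within each patch
    z = y + mlp(layer_norm(y), cw1, cw2)
    return z

# Illustrative sizes: N = 4 patches, D = 6 channels
rng = np.random.default_rng(0)
N, D, Ht, Hc = 4, 6, 8, 12
x = rng.standard_normal((N, D))
z = mixer_block(
    x,
    rng.standard_normal((N, Ht)), rng.standard_normal((Ht, N)),
    rng.standard_normal((D, Hc)), rng.standard_normal((Hc, D)),
)
# With all-zero weights both MLPs output zero, so only the residual path remains
identity = mixer_block(x, np.zeros((N, Ht)), np.zeros((Ht, N)),
                       np.zeros((D, Hc)), np.zeros((Hc, D)))
```

Because the token-mixing weights have shape `(N, Ht)`, the block is tied to a fixed number of patches, whereas the channel-mixing weights depend only on `D`.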