paper: EGE-UNet: an Efficient Group Enhanced UNet for skin lesion segmentation
Institution: Shanghai Jiao Tong University
Abstract
Transformer and its variants have been widely used in medical image segmentation. However, the large number of parameters and the heavy computation of these models make them unsuitable for mobile health applications. To address this, we propose a more efficient approach, the Efficient Group Enhanced UNet (EGE-UNet). We integrate a Group multi-axis Hadamard Product Attention module (GHPA) and a Group Aggregation Bridge module (GAB) in a lightweight manner. GHPA groups the input features and performs the Hadamard Product Attention mechanism (HPA) along different axes to extract pathological information from diverse perspectives. GAB effectively fuses multi-scale information by grouping low-level features, high-level features, and the mask generated by the decoder at each stage. Comprehensive experiments on the ISIC2017 and ISIC2018 datasets demonstrate that EGE-UNet outperforms existing state-of-the-art methods. In short, compared with TransFuse, our model achieves superior segmentation performance while reducing the number of parameters and the computational cost by 494x and 160x, respectively. Moreover, to the best of our knowledge, this is the first model whose parameter count is limited to 50KB. Our code is available at https://github.com/JCruan519/EGE-UNet.
Contributions:
(1) We propose GHPA and GAB: the former efficiently acquires and integrates multi-perspective information, while the latter takes in features of different scales and leverages an auxiliary mask to achieve efficient multi-scale feature fusion.
(2) We propose EGE-UNet, an extremely lightweight model for skin lesion segmentation.
(3) Experiments demonstrate that our method achieves state-of-the-art performance while significantly reducing resource requirements.

Related Works
UNeXt [22] has combined UNet [18] and MLP [21] to develop a lightweight model that attains superior performance while diminishing parameters and computation.
MALUNet [19] has reduced the model size by decreasing the number of model channels and introducing multiple attention modules.
Method
Architecture
The encoder is composed of six stages, with channel numbers of {8, 16, 24, 32, 48, 64}. While the first three stages employ plain convolutions with a kernel size of 3, the last three stages utilize the proposed GHPA to extract representation information from diverse perspectives. In contrast to the simple skip connections in UNet, EGE-UNet incorporates a GAB for each stage between the encoder and decoder.
Furthermore, our model leverages deep supervision [27] to generate mask predictions of varying scales, which are utilized for the loss function and serve as one of the inputs to GAB.
Deep supervision refers to adding extra supervision signals (i.e., auxiliary loss functions) at multiple intermediate layers of the network (typically at different decoder stages) to strengthen gradient propagation, improve training efficiency, and ease the optimization of deep networks.
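To make the stage layout concrete, here is a minimal PyTorch sketch of the encoder configuration described above. It is illustrative only: downsampling, normalization, activations, and the decoder/GAB wiring are omitted, and a 1×1 projection stands in for GHPA so the snippet runs standalone (see the GHPA sketch below).

```python
import torch.nn as nn

def build_encoder(in_ch=3, widths=(8, 16, 24, 32, 48, 64)):
    stages, prev = [], in_ch
    for i, w in enumerate(widths):
        if i < 3:
            # first three stages: plain 3x3 convolutions
            stages.append(nn.Conv2d(prev, w, 3, padding=1))
        else:
            # last three stages use GHPA in the real model; a 1x1
            # projection stands in here so the sketch runs standalone
            stages.append(nn.Conv2d(prev, w, 1))
        prev = w
    return nn.ModuleList(stages)
```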

Group multi-axis Hadamard Product Attention module(GHPA)
Main goal: address the computational complexity of **Multi-Head Self-Attention (MHSA)** by providing an attention mechanism with linear complexity.

Given an input x and a randomly initialized learnable tensor p, bilinear interpolation is first utilized to resize p to match the size of x. Then, we employ depth-wise separable convolution (DW) [10][20] on p, followed by a Hadamard product between x and p to obtain the output. However, simple HPA alone is insufficient to extract information from multiple perspectives, leading to unsatisfactory results.
Motivated by the multi-head mode in MHSA, we introduce GHPA based on HPA, as illustrated in Algorithm 1. We divide the input into four equal groups along the channel dimension and perform HPA on the height-width, channel-height, and channel-width axes for the first three groups, respectively. For the last group, we only use DW on the feature map. Finally, we concatenate the four groups along the channel dimension and apply another DW to integrate the information from different perspectives. Note that all kernel sizes employed in DW are 3.
Advantage: by combining attention along multiple axes, GHPA captures richer features in multiple directions while avoiding the computational complexity of MHSA.
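Below is a minimal PyTorch sketch of HPA and its grouped multi-axis extension, reconstructed from the description above rather than taken from the official code. The initial size of the learnable tensors (8), the omission of normalization and activation layers, and the broadcasting scheme for the channel-height and channel-width groups are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_sep2d(ch, k=3):
    # depth-wise separable conv: depth-wise 3x3 then point-wise 1x1
    return nn.Sequential(
        nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch),
        nn.Conv2d(ch, ch, 1),
    )

def dw_sep1d(ch, k=3):
    # 1D variant, used on the axis-shaped learnable tensors
    return nn.Sequential(
        nn.Conv1d(ch, ch, k, padding=k // 2, groups=ch),
        nn.Conv1d(ch, ch, 1),
    )

class GHPA(nn.Module):
    def __init__(self, channels, init_size=8):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        # group 1: learnable tensor over the height-width plane
        self.p_hw = nn.Parameter(torch.ones(1, c, init_size, init_size))
        self.dw_hw = dw_sep2d(c)
        # groups 2/3: learnable tensors over the channel-height and
        # channel-width planes (broadcast over the remaining axis)
        self.p_ch = nn.Parameter(torch.ones(1, c, init_size))
        self.dw_ch = dw_sep1d(c)
        self.p_cw = nn.Parameter(torch.ones(1, c, init_size))
        self.dw_cw = dw_sep1d(c)
        self.dw4 = dw_sep2d(c)            # group 4: DW only, no attention
        self.dw_out = dw_sep2d(channels)  # fuse the four groups

    def forward(self, x):
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)
        H, W = x.shape[2:]
        # HPA on the height-width axes: resize p, DW on p, Hadamard product
        p = F.interpolate(self.p_hw, size=(H, W), mode='bilinear',
                          align_corners=False)
        x1 = x1 * self.dw_hw(p)
        # HPA on the channel-height axes (p broadcasts over width)
        p = F.interpolate(self.p_ch, size=H, mode='linear', align_corners=False)
        x2 = x2 * self.dw_ch(p).unsqueeze(-1)
        # HPA on the channel-width axes (p broadcasts over height)
        p = F.interpolate(self.p_cw, size=W, mode='linear', align_corners=False)
        x3 = x3 * self.dw_cw(p).unsqueeze(-2)
        x4 = self.dw4(x4)
        return self.dw_out(torch.cat([x1, x2, x3, x4], dim=1))
```

For example, `GHPA(64)` applied to a `(1, 64, 32, 32)` tensor returns a tensor of the same shape.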
Group Aggregation Bridge module(GAB)

Main goal: enhance performance on dense prediction tasks through multi-scale information extraction, which is particularly useful for tasks such as medical image segmentation.
As shown in Figure 3, GAB takes three inputs: low-level features, high-level features, and a mask.
Firstly, depthwise separable convolution (DW) and bilinear interpolation are employed to adjust the size of high-level features, so as to match the size of low-level features.
Secondly, we partition both feature maps into four groups along the channel dimension, and concatenate one group from the low-level features with one from the high-level features to obtain four groups of fused features. For each group of fused features, the mask is concatenated.
Next, dilated convolutions [25] with a kernel size of 3 and different dilation rates of {1, 2, 5, 7} are applied to the different groups, in order to extract information at different scales.
Finally, the four groups are concatenated along the channel dimension, followed by a plain convolution with a kernel size of 1 to enable interaction among features at different scales.
Dilated convolutions are a variant of the standard convolution designed to enlarge the receptive field of the kernel while keeping the number of parameters and the amount of computation unchanged. This is achieved by inserting "holes" between the kernel elements, so that the convolution covers a wider input region at no extra cost.
Plain convolution refers to the standard convolution operation.
Advantage: by extracting information at multiple scales with dilated convolutions and combining low-level with high-level features, GAB improves performance on dense prediction tasks.
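A minimal PyTorch sketch of GAB following the four steps above. The exact placement of normalization layers and the DW used to project the high-level channels are assumptions, and the mask is expected as a one-channel map at the low-level resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAB(nn.Module):
    def __init__(self, ch_high, ch_low, dilations=(1, 2, 5, 7)):
        super().__init__()
        assert ch_low % 4 == 0
        g = ch_low // 4  # channels per group after chunking
        # step 1: DW separable conv projecting high-level channels to ch_low
        self.proj = nn.Sequential(
            nn.Conv2d(ch_high, ch_high, 3, padding=1, groups=ch_high),
            nn.Conv2d(ch_high, ch_low, 1),
        )
        # step 3: one dilated 3x3 conv per fused group
        # (g high-level + g low-level feature channels + 1 mask channel)
        self.branches = nn.ModuleList([
            nn.Conv2d(2 * g + 1, 2 * g + 1, 3, padding=d, dilation=d)
            for d in dilations
        ])
        # step 4: 1x1 conv for interaction across the four scales
        self.tail = nn.Conv2d(4 * (2 * g + 1), ch_low, 1)

    def forward(self, x_high, x_low, mask):
        # step 1: match high-level features to the low-level resolution
        x_high = F.interpolate(self.proj(x_high), size=x_low.shape[2:],
                               mode='bilinear', align_corners=False)
        # step 2: chunk into four groups, pair them up, append the mask
        highs = torch.chunk(x_high, 4, dim=1)
        lows = torch.chunk(x_low, 4, dim=1)
        outs = [conv(torch.cat([h, l, mask], dim=1))
                for h, l, conv in zip(highs, lows, self.branches)]
        # step 4: concatenate and mix
        return self.tail(torch.cat(outs, dim=1))
```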
Loss Function
Since different GABs require mask information at different scales, deep supervision [27] is employed to calculate the loss for the different stages, in order to generate more accurate mask information. Our loss function can be expressed as equations (1) and (2):

$\mathcal{L}_i = \mathrm{Bce}(\hat{y}_i, y) + \mathrm{Dice}(\hat{y}_i, y) \tag{1}$

$\mathcal{L} = \sum_{i=0}^{5} \lambda_i \times \mathcal{L}_i \tag{2}$
Bce and Dice represent the binary cross entropy loss and the dice loss, respectively. λi is the weight for stage i. In this paper, we set λi to 1, 0.5, 0.4, 0.3, 0.2, 0.1 from i = 0 to i = 5 by default.
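A sketch of this objective in PyTorch. The Dice smoothing constant and the choice to upsample the auxiliary logits (rather than downsample the ground-truth mask, which would be equally valid) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BceDiceLoss(nn.Module):
    """Equation (1): Bce + Dice for one prediction scale."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits, target):
        bce = self.bce(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(2, 3))
        union = prob.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
        dice = 1 - (2 * inter + self.smooth) / (union + self.smooth)
        return bce + dice.mean()

def deep_supervision_loss(stage_logits, target,
                          weights=(1.0, 0.5, 0.4, 0.3, 0.2, 0.1)):
    """Equation (2): weighted sum over the six stage predictions."""
    criterion = BceDiceLoss()
    total = 0.0
    for w, logits in zip(weights, stage_logits):
        # bring each auxiliary prediction to the target resolution
        if logits.shape[2:] != target.shape[2:]:
            logits = F.interpolate(logits, size=target.shape[2:],
                                   mode='bilinear', align_corners=False)
        total = total + w * criterion(logits, target)
    return total
```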
Experiment
Datasets: ISIC2017 [1][3] and ISIC2018, split into training and testing sets at a 7:3 ratio.
EGE-UNet is developed with the PyTorch [17] framework. All experiments are performed on a single NVIDIA RTX A6000 GPU. The images are normalized and resized to 256×256.
Data augmentation: horizontal flipping, vertical flipping, and random rotation.
AdamW [13] is utilized as the optimizer, initialized with a learning rate of 0.001; CosineAnnealingLR [12] is employed as the scheduler, with a maximum of 50 iterations and a minimum learning rate of 1e-5.
Epochs: 300; batch size: 8.
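Put together, the training configuration looks roughly like this. The stand-in model and random batch replace EGE-UNet and the ISIC data loaders, plain BCE stands in for the Bce + Dice deep-supervision loss, and stepping the scheduler once per epoch is an assumption.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)    # stand-in for EGE-UNet
criterion = nn.BCEWithLogitsLoss()        # stand-in for Bce + Dice loss
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-5)    # min learning rate 1e-5

for epoch in range(300):                  # 300 epochs
    images = torch.randn(8, 3, 256, 256)  # batch size 8, 256x256 inputs
    masks = torch.randint(0, 2, (8, 1, 256, 256)).float()
    optimizer.zero_grad()
    criterion(model(images), masks).backward()
    optimizer.step()
    scheduler.step()                      # stepped per epoch (assumption)
```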
To evaluate our method, we employ mean Intersection over Union (mIoU) and the Dice similarity score (DSC) as metrics; we run each experiment 5 times and report the mean and standard deviation of the results for each dataset.
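For reference, a common NumPy formulation of the two metrics on binary masks, assuming predictions are thresholded at 0.5; the evaluation code used in the paper may differ in detail.

```python
import numpy as np

def miou_dsc(pred, target, eps=1e-7):
    # binarize both maps at 0.5
    pred = (pred > 0.5).astype(np.float64)
    target = (target > 0.5).astype(np.float64)
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    iou = inter / (union + eps)                          # IoU (binary case)
    dsc = 2 * inter / (pred.sum() + target.sum() + eps)  # Dice score
    return iou, dsc
```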
Result


