Reading Notes on "Efficient Regional Memory Network for Video Object Segmentation"


Paper: https://arxiv.org/pdf/2103.12934.pdf
GitHub / project page: https://haozhexie.com/project/rmnet

1. Abstract

Recently, several space-time memory based networks have shown that object cues from past frames (i.e., the video frames together with their segmented object masks) are useful for segmenting the object in the current frame. However, these methods match the current frame against the past frames in a Global-to-Global manner, which leads to mismatches between similar objects and high computational cost. To address this, the authors propose matching the current frame and the past frames in a Local-to-Local manner for semi-supervised video object segmentation (semi-supervised VOS), and name the resulting model the Regional Memory Network (RMNet).
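To make the Global-to-Global matching concrete, here is a minimal PyTorch sketch of an STM-style memory read. This is my own illustration, not code from either paper; the function name, shapes, and dot-product affinity are assumptions based on the STM formulation:

```python
import torch
import torch.nn.functional as F

def global_to_global_read(mem_key, mem_val, query_key):
    """STM-style global-to-global memory read (sketch, not the paper's code).

    mem_key:   (C_k, T*H*W)  keys of all stored past frames, flattened
    mem_val:   (C_v, T*H*W)  values of all stored past frames, flattened
    query_key: (C_k, H*W)    key of the current frame, flattened
    """
    affinity = mem_key.t() @ query_key        # (T*H*W, H*W)
    affinity = F.softmax(affinity, dim=0)     # normalize over memory locations
    return mem_val @ affinity                 # (C_v, H*W) read-out

# toy usage: two memory frames with 16x16 feature maps
C_k, C_v, T, H, W = 8, 16, 2, 16, 16
out = global_to_global_read(
    torch.randn(C_k, T * H * W),
    torch.randn(C_v, T * H * W),
    torch.randn(C_k, H * W),
)
```

Because the affinity is built over all T·H·W memory locations for every query location, both the cost and the chance of latching onto a similar-looking distractor elsewhere in the frame grow with the number of stored frames.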
The method can be seen as an improvement on "Video Object Segmentation using Space-Time Memory Networks" (STM; notes on that paper's ideas are available via the link). The main improvements are four: 1. the space-time memory module stores only the target region; 2. the matching between the current frame and previous frames is computed only over the region where the target lies, with the region determined by a box, much like a detection box in object detection; 3. the space-time memory module keeps only the previous frame's result (this appears to be the case); 4. a TinyFlowNet is added to produce optical flow, which is used to warp the previous frame's mask into the current frame. A rough sketch of points 2 and 4 follows below.
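Here is a hedged sketch of how points 2 and 4 could fit together: a flow field (standing in for TinyFlowNet's output, which is not reproduced here) warps the previous mask into the current frame, the warped mask yields a padded bounding box, and the memory read is restricted to that box. All function names and the padding value are my own assumptions, not the paper's API:

```python
import torch
import torch.nn.functional as F

def warp_mask(prev_mask, flow):
    """Warp the previous frame's mask into the current frame (sketch).

    prev_mask: (1, 1, H, W) soft mask of frame t-1
    flow:      (1, 2, H, W) flow in pixels; flow[:, 0] = dx, flow[:, 1] = dy
               (assumed to map current-frame pixels back into frame t-1)
    """
    _, _, H, W = prev_mask.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # normalize sampling coordinates to [-1, 1] for grid_sample
    gx = (xs + flow[0, 0]) / (W - 1) * 2 - 1
    gy = (ys + flow[0, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((gx, gy), dim=-1).unsqueeze(0)     # (1, H, W, 2)
    return F.grid_sample(prev_mask, grid, align_corners=True)

def mask_to_box(mask, pad=8):
    """Padded bounding box (y0, y1, x0, x1) of a binary (H, W) mask.
    Assumes the mask is non-empty; pad=8 is an arbitrary margin."""
    ys, xs = torch.nonzero(mask > 0.5, as_tuple=True)
    H, W = mask.shape
    return (max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad, H - 1),
            max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad, W - 1))

def local_to_local_read(mem_key, mem_val, query_key, box):
    """Memory read restricted to the target's region (sketch).

    mem_key, query_key: (C_k, H, W); mem_val: (C_v, H, W)
    Only the h*w locations inside the box take part in the affinity.
    """
    y0, y1, x0, x1 = box
    k = mem_key[:, y0:y1 + 1, x0:x1 + 1].flatten(1)       # (C_k, h*w)
    v = mem_val[:, y0:y1 + 1, x0:x1 + 1].flatten(1)       # (C_v, h*w)
    q = query_key[:, y0:y1 + 1, x0:x1 + 1].flatten(1)     # (C_k, h*w)
    affinity = F.softmax(k.t() @ q, dim=0)                # (h*w, h*w)
    return v @ affinity                                   # (C_v, h*w)

# toy usage with a zero flow field standing in for TinyFlowNet
prev_mask = torch.zeros(1, 1, 64, 64); prev_mask[:, :, 20:40, 20:40] = 1.0
flow = torch.zeros(1, 2, 64, 64)
box = mask_to_box(warp_mask(prev_mask, flow)[0, 0])
feats = torch.randn(8, 64, 64)
read = local_to_local_read(feats, torch.randn(16, 64, 64), feats, box)
```

Restricting the affinity to the box both shrinks the matrix from (H·W)² to (h·w)² and, by construction, excludes similar objects outside the target region.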

Figure (from the paper): an example where Global-to-Global matching wrongly matches similar objects, while Local-to-Local matching matches them correctly.
