Focus Longer to See Better: Recursively Refined Attention for Fine-Grained Image Classification

Code: https://github.com/TAMU-VITA/Focus-Longer-to-See-Better
Paper: https://arxiv.org/abs/2005.10979

Abstract

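The marginal visual difference between classes makes fine-grained classification difficult.

Based on this observation, the authors focus on these marginal differences to extract more representative features.
In addition, a visualization method is used to verify how the model's focus changes from coarse to fine details.
A simple attention model can aggregate (by weighting) the most discriminative parts of an image.
Because the model is simple, it works as an easy plug-n-play module.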

Advantages:

Interpretable.
Compared with the baseline model, accuracy increases by up to 2%.

3. Our Proposal

Image $I$; label $c$

A two-stream feature extractor is used to extract global and object-level feature representations to boost the classification accuracy.

3.1. Two-Stream Architecture

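For each image $I$ we obtain a set of patches $P$, from which one patch $P_i$ is selected at random (the patches $P$ are produced by the method of a separate paper). $P_i$ is represented by a pair of coordinates $[(x_i^{tl},y_i^{tl}),(x_i^{br},y_i^{br})]$, where tl and br denote the top-left and bottom-right corners respectively.

The network input is built from the image $I$ and the patch $P_i$.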
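
The input construction described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the image is a plain 2D grid of pixel values, each patch is stored as a `((x_tl, y_tl), (x_br, y_br))` pair, and the bottom-right corner is treated as exclusive. The helper names `crop_patch` and `sample_input` are hypothetical and are not taken from the authors' repository.

```python
import random

def crop_patch(image, patch):
    """Crop a rectangular patch from an image stored as a list of rows.

    `patch` is ((x_tl, y_tl), (x_br, y_br)); the bottom-right corner is
    treated as exclusive, which is an assumption of this sketch.
    """
    (x_tl, y_tl), (x_br, y_br) = patch
    return [row[x_tl:x_br] for row in image[y_tl:y_br]]

def sample_input(image, patches, rng=random):
    """Pick one patch P_i at random and return (image, cropped patch)."""
    patch = rng.choice(patches)
    return image, crop_patch(image, patch)

# Toy 4x4 "image" whose pixel value encodes its position as 10*row + col.
image = [[10 * r + c for c in range(4)] for r in range(4)]
patches = [((0, 0), (2, 2)), ((1, 1), (4, 3))]

# The full image goes to the global stream, the crop to the second stream.
full, cropped = sample_input(image, patches, random.Random(0))
```

In a real pipeline both outputs would typically be resized to the CNN's input resolution before being fed to their respective streams.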

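The top stream is a standard classification network: the original image is fed to a CNN to extract features, which are then passed to a classification layer + softmax for classification.
The second stream feeds the patch to a CNN to extract features, then passes them to LSTMs to obtain a more refined feature representation. These representations are then combined by weighted fusion into a single, most discriminative feature representation.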
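
The weighted fusion at the end of the second stream can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the LSTMs have already produced one feature vector per step and that a scalar relevance score per step is available. How those scores are computed is not covered in these notes, so they are passed in as plain inputs. The names `softmax` and `fuse_features` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(step_features, scores):
    """Weighted sum of per-step feature vectors, weights = softmax(scores)."""
    weights = softmax(scores)
    dim = len(step_features[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, step_features):
        for d in range(dim):
            fused[d] += w * feat[d]
    return fused, weights

# Three assumed LSTM step outputs (2-dimensional for readability).
step_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores, so every step gets the same weight

fused, weights = fuse_features(step_features, scores)
```

With equal scores the fusion reduces to a plain average of the step features; a step with a much larger score dominates the fused vector.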

Global Stream

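This corresponds to the top stream. The CNN module is a model pretrained on ImageNet. The feature the CNN extracts from the image is written $W_g * I$, where $W_g$ denotes the weights of the whole network and $*$ stands for all the conv, pooling, and nonlinear activation operations. This feature is then passed to softmax to obtain the probability of each class. The formula is as follows: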