READING NOTE: Weakly Supervised Cascaded Convolutional Networks

License: CC 4.0 BY-SA

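Proposes a new cascaded convolutional neural network architecture that learns object detection with little human annotation. The method uses two-stage and three-stage cascaded network structures, trained with image-level labels and object proposals.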

TITLE: Weakly Supervised Cascaded Convolutional Networks

AUTHOR: Ali Diba, Vivek Sharma, Ali Pazandeh, Hamed Pirsiavash, Luc Van Gool

ASSOCIATION: KU Leuven, Sharif Tech., UMBC, ETH Zürich

FROM: arXiv:1611.08258

CONTRIBUTIONS

A new cascaded network architecture is proposed for training a convolutional neural network to detect objects without expensive human annotations.

METHOD

This work trains a CNN to detect objects using only image-level annotations, which state which object categories are present in an image. At training time, the network takes three inputs: 1) the original image, 2) the image-level labels, and 3) object proposals. At inference time, the image-level labels are omitted. The object proposals can be generated by any method, such as Selective Search or EdgeBoxes. Two different cascaded network structures are proposed.

Two-stage Cascade

The two-stage cascade network structure is illustrated in the following figure.

The first stage is a location network, which is a fully convolutional CNN followed by global average pooling or global max pooling. To learn multiple classes for a single image, an independent loss function is used for each class. The resulting class activation maps are used to select candidate boxes.
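The first stage can be illustrated with a minimal pure-Python sketch. It rests on several assumptions that are not in the note: the independent per-class loss is taken to be a sigmoid cross-entropy, boxes are ranked by their mean activation inside the class activation map, and the names `global_pool`, `per_class_loss`, `box_score`, and `select_boxes` are hypothetical. The paper's actual selection rule may differ.

```python
import math

def global_pool(cam, mode="avg"):
    """Collapse one H x W class activation map into a single class score."""
    values = [v for row in cam for v in row]
    return max(values) if mode == "max" else sum(values) / len(values)

def per_class_loss(scores, labels):
    """Independent sigmoid cross-entropy per class (assumed loss form).

    Uses the numerically stable form max(s, 0) - s*y + log(1 + exp(-|s|)).
    """
    total = 0.0
    for s, y in zip(scores, labels):
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total

def box_score(cam, box):
    """Mean activation inside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    vals = [cam[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)

def select_boxes(cam, boxes, top_k=1):
    """Keep the top_k proposals with the highest mean activation (assumed rule)."""
    return sorted(boxes, key=lambda b: box_score(cam, b), reverse=True)[:top_k]

# Toy 4 x 4 activation map for one class, with a hot region in the middle.
cam = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.0],
]
proposals = [(0, 0, 2, 2), (1, 1, 3, 3), (2, 2, 4, 4)]
best = select_boxes(cam, proposals, top_k=1)
print(best)
```

In this toy example the proposal covering the hot region of the map is the one that is kept.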

The second stage is a multiple instance learning network. Given a bag of instances $x_{c}=\{x_{j}|j=1,...,n\}$ and a label set $y_{c}=\{y_{i}|y_{i} \in \{0,1\}, i=1,...,C \}$ , where each $x_{j}$ is one of the candidate boxes, $n$ is the number of candidate boxes, $C$ is the number of categories and $\sum_{i=1}^{C}y_{i}=1$ , the probabilities and loss can be defined as

$Score(I,f_{i})=\max(f_{i1},...,f_{in})$

$P(I, f_{i}) = \frac{\exp(Score(I,f_{i}))}{\sum_{k=1}^{C}\exp(Score(I,f_{k}))}$

$L_{MIL}(P,y) = - \sum_{i=1}^{C}y_{i}\log(P(I, f_{i}))$

In my understanding, because $Score$ takes the maximum over boxes, only the most confident box in each category is penalized when it is wrong. Besides, the equations in the paper contain some mistakes.
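The three MIL equations can be checked with a small pure-Python sketch. The function names `mil_loss` and `argmax_boxes` and the toy scores are hypothetical, and the layout `f[i][j]` (class `i`, candidate box `j`) is an assumption about how the scores are stored.

```python
import math

def mil_loss(f, y):
    """MIL loss for one image.

    f[i][j]: score of class i for candidate box j (assumed layout).
    y[i]:    image-level label of class i, in {0, 1}.
    """
    scores = [max(row) for row in f]              # Score(I, f_i)
    m = max(scores)                               # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]                 # P(I, f_i)
    loss = -sum(yi * math.log(p) for yi, p in zip(y, probs))
    return loss, probs

def argmax_boxes(f):
    """Index of the highest-scoring box for each class."""
    return [max(range(len(row)), key=row.__getitem__) for row in f]

f = [[2.0, 0.5, -1.0],   # class 0 scores for 3 boxes
     [0.0, 1.0, 0.2]]    # class 1 scores for 3 boxes
y = [1, 0]

loss, probs = mil_loss(f, y)

# Changing a non-maximal box score leaves the loss unchanged.
f_changed = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.2]]
loss_changed, _ = mil_loss(f_changed, y)
print(loss, loss_changed, argmax_boxes(f))
```

Because only the per-class maximum enters the softmax, the two losses printed above are identical, which matches the observation that only the most confident box in each category receives a penalty.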

Three-stage Cascade

The three-stage cascade network structure adds a weak segmentation network between the two stages in the two-stage cascade network. It is illustrated in the following figure.

The weak segmentation network uses the results of the first stage as its supervision signal. $s_{ic}$ is defined as the CNN score for pixel $i$ and class $c$ in image $I$ . The score is normalized using softmax:

$S_{ic}= \exp(s_{ic})/\sum_{k=1}^{C}\exp(s_{ik})$

Considering $y$ as the label set for image $I$ , the loss function for the weakly supervised segmentation network is given by

$L_{seg}(S,G,y)=-\sum_{i=1}^{C}y_{i}\log(S_{t_{c}c}) - \sum_{i \in I_{s}} \alpha_{i}\log(S_{t_{c}G_{i}})$

where $t_{c} = \mathop{argmax}_{i \in I} S_{ic}$ and $G_{i}$ is the supervision map for the segmentation from the first stage.
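A rough pure-Python sketch of one possible reading of this loss follows. It relies on assumptions that are not in the note: the first term is summed over the classes $c$ present in the image and evaluated at the most confident pixel $t_{c}$; the second term is read per pixel as $-\alpha_{i}\log(S_{iG_{i}})$ over the supervised pixels $I_{s}$; `G` is stored as a mapping from pixel index to class index; and `alpha` holds per-pixel weights. The names `pixel_softmax` and `seg_loss` are hypothetical.

```python
import math

def pixel_softmax(s):
    """Normalize s[i][c] over classes c for every pixel i."""
    out = []
    for row in s:
        m = max(row)                               # shift for numerical stability
        exps = [math.exp(v - m) for v in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def seg_loss(s, y, G, alpha):
    """Weakly supervised segmentation loss under the assumptions stated above.

    s[i][c]:  raw score of pixel i for class c (pixels flattened).
    y[c]:     image-level label of class c, in {0, 1}.
    G[i]:     class index given by the first-stage supervision map, for i in I_s.
    alpha[i]: weight of supervised pixel i (assumed per-pixel weight).
    """
    S = pixel_softmax(s)
    loss = 0.0
    # Image-level term: for each present class, use its most confident pixel t_c.
    for c in range(len(y)):
        if y[c]:
            t_c = max(range(len(S)), key=lambda i: S[i][c])
            loss -= math.log(S[t_c][c])
    # Pixel-level term: supervised pixels should agree with the supervision map.
    for i, g in G.items():
        loss -= alpha[i] * math.log(S[i][g])
    return loss

# Two pixels, two classes, uniform scores: every softmax output is 0.5.
s = [[0.0, 0.0], [0.0, 0.0]]
value = seg_loss(s, y=[1, 0], G={1: 1}, alpha={1: 0.5})
print(value)
```

With uniform scores the image-level term contributes $\log 2$ and the single supervised pixel contributes $0.5\log 2$, which gives a quick sanity check of the implementation.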

SOME IDEAS

This work requires little annotation: the only annotation is the image-level label. However, this kind of training still needs complete image-level annotation. For example, to detect 20 categories, each image must be annotated with a 20-d vector. What if we only know the status of 10 of the 20 categories in an image?