Paper notes: Detecting Visual Relationships with Deep Relational Networks [DR-Net] (CVPR-17)

chenhch8

License: CC 4.0 BY-SA

Tags: paper reading

Article link: https://blog.youkuaiyun.com/deepinC/article/details/86418635

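This article introduces DR-Net, a new relationship detection method that combines statistical models with deep learning. It targets challenges in visual relationship detection such as long-tail distributions and intra-class diversity. DR-Net consists of three stages: object detection, pair filtering, and joint recognition. It implements the iterative formula of a conditional random field as an RNN, which improves the accuracy of relationship prediction. The experimental results show that DR-Net improves on the state of the art.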

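(These notes record only part of the paper, together with my own limited understanding; I have not yet reproduced the experiments. I am new to this area, so corrections and suggestions are very welcome.)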

Summary

Shortcomings of previous works

They treat VRD as a classification problem, i.e. consider each type of relationship ((1), e.g. "ride") or each distinct visual phrase ((2), e.g. "person-ride-horse") as a category

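For (1), the problem is increased diversity within each class. This formulation avoids the huge number of classes in (2), but the object pairs that fall under a single predicate can differ greatly from one another, and this variation is inconsistent across predicates, which makes the model hard to fit. For (2), there are two problems: ① the number of combinations is huge, so many classifiers have to be trained (one per combination); ② because of the long-tail distribution, the number of samples per class varies widely, and classes with very few samples leave their classifiers under-trained
This paper proposes DR-Net (Deep Relational Network) to exploit the statistical dependencies between objects and their relations/predicates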
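To make the trade-off between the two formulations concrete, here is a minimal counting sketch. The vocabulary sizes (100 object categories, 70 predicates) are illustrative values chosen for this example and are not taken from the notes above; `num_categories` is a hypothetical helper.

```python
def num_categories(num_objects, num_predicates):
    """Count the classes under the two formulations discussed above.

    (1) one class per predicate, e.g. "ride"
    (2) one class per visual phrase, e.g. "person-ride-horse"
    """
    per_predicate = num_predicates
    # Every (subject, predicate, object) combination is its own class.
    per_phrase = num_objects * num_predicates * num_objects
    return per_predicate, per_phrase


# Illustrative vocabulary sizes (assumption, not from the notes).
per_predicate, per_phrase = num_categories(num_objects=100, num_predicates=70)
print(per_predicate)  # 70
print(per_phrase)     # 700000
```

With formulation (2) most of the 700,000 combinations would have few or no training samples, which is the long-tail problem; with formulation (1) each of the 70 classes has to absorb up to 100 × 100 different object pairs, which is the intra-class diversity problem.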
contributions:
- DR-Net, a novel formulation that combines the strengths of statistical models and deep learning
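- an effective framework for visual relationship detection, which brings the state-of-the-art to a new level
Limitation: relationship detection is treated as a multi-class (single-label) classification problem, which ignores cases where several relationships hold at once, i.e. co-occurrence of predicates (the prediction uses a softmax function)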
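The softmax limitation can be illustrated with a small sketch. The predicate names and scores below are hypothetical: two predicates ("ride" and "on") are given equally strong evidence, yet softmax forces them to share the probability mass, and taking the maximum keeps only one of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores: "ride" and "on" both hold for the same object pair.
predicates = ["ride", "on", "next to"]
scores = [3.0, 3.0, -2.0]

probs = softmax(scores)
# Taking the argmax returns a single predicate, so the co-occurring one is dropped.
best = predicates[probs.index(max(probs))]

for name, p in zip(predicates, probs):
    print(f"{name}: {p:.3f}")
print("prediction:", best)
```

Both valid predicates end up just below 0.5, and only the first is reported.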

Model framework

[Figure: model framework]

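Three stages
1. Object Detection: uses Faster RCNN to detect objects; each object comes with a bounding box and an appearance feature
2. Pair Filtering: generates object pairs, then applies a shallow neural network as a filter to discard pairs that clearly cannot be related. The filter combines spatial configurations (e.g. objects too far away are unlikely to be related) and object categories (e.g. certain objects are unlikely to form a meaningful relationship)
3. Joint Recognition: takes each object pair as input, consisting of ① the appearance features of the two objects (from object detection), the masks derived from their bounding boxes, and the image region covering the two objects. The masks pass through three convolutional layers to extract ② the spatial feature (Spatial Module); the process is illustrated in Figure 3. The region image passes through VGG16/ResNet-101 to extract ③ the appearance feature (Appr Module). ② and ③ are then concatenated and fed into two fully connected layers to obtain ④ the compressed pair feature. Finally, ① and ④ are fed into DR-Net, which outputs a probability distribution over relationships in $\mathbb{R}^{|P|}$ (softmax), where $|P|$ is the number of predicate categories; the $r$ with the highest probability is taken as the final prediction
During training, the three stages are trained independently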
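The input preparation for stages 2 and 3 can be sketched as follows. This is a minimal illustration and not the authors' code: it enumerates ordered (subject, object) pairs and builds one binary mask per bounding box. The mask resolution (`mask_size=32`), the choice to place boxes relative to the whole image, and the helper names are assumptions; the learned pair filter and the three convolutional layers are not modelled.

```python
from itertools import permutations

import numpy as np


def box_to_mask(box, image_size, mask_size=32):
    """Rasterise a box (x1, y1, x2, y2) into a binary mask_size x mask_size grid."""
    x1, y1, x2, y2 = box
    width, height = image_size
    mask = np.zeros((mask_size, mask_size), dtype=np.float32)
    # Scale pixel coordinates to grid cells; round outwards so small boxes survive.
    c1 = int(np.floor(x1 / width * mask_size))
    c2 = int(np.ceil(x2 / width * mask_size))
    r1 = int(np.floor(y1 / height * mask_size))
    r2 = int(np.ceil(y2 / height * mask_size))
    mask[r1:r2, c1:c2] = 1.0
    return mask


def dual_spatial_mask(subject_box, object_box, image_size, mask_size=32):
    """Stack the subject mask and the object mask into a 2-channel array."""
    return np.stack([
        box_to_mask(subject_box, image_size, mask_size),
        box_to_mask(object_box, image_size, mask_size),
    ])


# Hypothetical detections on a 320 x 240 image, boxes as (x1, y1, x2, y2).
image_size = (320, 240)
boxes = [(10, 20, 110, 220), (60, 120, 300, 240), (200, 10, 320, 100)]

# Ordered pairs: (subject, object) and (object, subject) are distinct candidates.
pairs = list(permutations(range(len(boxes)), 2))
masks = {(s, o): dual_spatial_mask(boxes[s], boxes[o], image_size) for s, o in pairs}

print(len(pairs))           # 6
print(masks[(0, 1)].shape)  # (2, 32, 32)
```

Each 2-channel array would then be the input to the Spatial Module; the pair filter would remove some of the six candidate pairs before this step.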
DR-Net

The network is derived from a CRF (conditional random field). The final iterative update is:
$$\begin{array}{l} \mathbf{q}_s' = \sigma(\mathbf{W}_a\mathbf{x}_s + \mathbf{W}_{sr}\mathbf{q}_r + \mathbf{W}_{so}\mathbf{q}_o) \\ \mathbf{q}_r' = \sigma(\mathbf{W}_r\mathbf{x}_r + \mathbf{W}_{rs}\mathbf{q}_s + \mathbf{W}_{ro}\mathbf{q}_o) \\ \mathbf{q}_o' = \sigma(\mathbf{W}_a\mathbf{x}_o + \mathbf{W}_{or}\mathbf{q}_r + \mathbf{W}_{os}\mathbf{q}_s) \end{array} \tag{1}$$
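A minimal NumPy sketch of iterating Eq. (1) is given below. It is a sketch under stated assumptions rather than the authors' implementation: $\sigma$ is taken to be a softmax, the beliefs are initialised uniformly, the number of iterations (`num_iters`) is a hypothetical parameter, and the weights and feature dimensions are random placeholders instead of trained values.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def dr_net_inference(x_s, x_r, x_o, W, num_iters=8):
    """Iterate Eq. (1); sigma is assumed to be a softmax."""
    num_obj = W["a"].shape[0]
    num_pred = W["r"].shape[0]
    # Assumption: start from uniform beliefs over classes.
    q_s = np.full(num_obj, 1.0 / num_obj)
    q_o = np.full(num_obj, 1.0 / num_obj)
    q_r = np.full(num_pred, 1.0 / num_pred)
    for _ in range(num_iters):
        # All three updates read the beliefs from the previous iteration.
        new_q_s = softmax(W["a"] @ x_s + W["sr"] @ q_r + W["so"] @ q_o)
        new_q_r = softmax(W["r"] @ x_r + W["rs"] @ q_s + W["ro"] @ q_o)
        new_q_o = softmax(W["a"] @ x_o + W["or"] @ q_r + W["os"] @ q_s)
        q_s, q_r, q_o = new_q_s, new_q_r, new_q_o
    return q_s, q_r, q_o


# Placeholder sizes and random weights (assumptions, for illustration only).
rng = np.random.default_rng(0)
num_obj, num_pred, d_a, d_r = 5, 4, 16, 8
W = {
    "a": rng.normal(scale=0.1, size=(num_obj, d_a)),    # shared by subject and object
    "r": rng.normal(scale=0.1, size=(num_pred, d_r)),
    "sr": rng.normal(scale=0.1, size=(num_obj, num_pred)),
    "so": rng.normal(scale=0.1, size=(num_obj, num_obj)),
    "rs": rng.normal(scale=0.1, size=(num_pred, num_obj)),
    "ro": rng.normal(scale=0.1, size=(num_pred, num_obj)),
    "or": rng.normal(scale=0.1, size=(num_obj, num_pred)),
    "os": rng.normal(scale=0.1, size=(num_obj, num_obj)),
}
x_s, x_o = rng.normal(size=d_a), rng.normal(size=d_a)  # appearance features
x_r = rng.normal(size=d_r)                             # compressed pair feature

q_s, q_r, q_o = dr_net_inference(x_s, x_r, x_o, W)
predicted_r = int(np.argmax(q_r))  # index of the most probable predicate
```

Note that $\mathbf{W}_a$ is shared between the subject and object updates, as in Eq. (1), and that each iteration corresponds to one unrolled step when the update is implemented as a recurrent network.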