Relational Context Learning for Human-Object Interaction Detection

This paper proposes a new method named MUREN, which strengthens context exchange through multiple decoder branches and a multiplex relation embedding module to improve human-object interaction detection. MUREN exploits unary, pairwise, and ternary relations to enrich the contextual information of task-specific tokens, improving its relational reasoning capability. An attentive fusion module further facilitates information exchange among the branches.


Paper Link
Code Link

Abstract

Most one-stage methods for HOI detection build on transformer architectures with two decoder branches, one for human-object pair detection and the other for interaction classification. Such disentangled transformers, however, may suffer from insufficient context exchange between the branches, leaving them short of the context information needed for relational reasoning. This work proposes the multiplex relation network (MUREN), which performs rich context exchange across three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens.

Method

Existing transformer-based methods for HOI detection can be roughly divided into two types: single-branch and two-branch.

  • The single-branch methods update a token set through a single transformer decoder and detect HOI instances directly with subsequent FFNs. Since a single transformer decoder is responsible for all sub-tasks (i.e., human detection, object detection, and interaction classification), these methods are limited in adapting to the different sub-tasks simultaneously under multi-task learning.
  • The two-branch methods adopt two separate transformer decoder branches: one detects human-object pairs from a human-object token set, while the other classifies interaction classes between human-object pairs from an interaction token set. However, insufficient context exchange between the branches prevents the two-branch methods from learning relational contexts (a minimal sketch of this disentangled layout follows below).
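
To make this concrete, here is a minimal PyTorch sketch of the disentangled two-branch layout, assuming learnable token sets and standard transformer decoder layers (all names are illustrative, not taken from any particular implementation). The point is that the two branches decode independently and never exchange context:

```python
import torch
import torch.nn as nn

class TwoBranchSketch(nn.Module):
    """Disentangled two-branch baseline: branches share the image tokens
    but never exchange context with each other."""

    def __init__(self, d=256, nhead=8, num_layers=3, num_tokens=64):
        super().__init__()
        make_layer = lambda: nn.TransformerDecoderLayer(d, nhead, batch_first=True)
        self.ho_branch = nn.ModuleList(make_layer() for _ in range(num_layers))
        self.int_branch = nn.ModuleList(make_layer() for _ in range(num_layers))
        self.ho_tokens = nn.Parameter(torch.randn(num_tokens, d))   # human-object token set
        self.int_tokens = nn.Parameter(torch.randn(num_tokens, d))  # interaction token set

    def forward(self, img_tokens):                 # img_tokens: (B, HW, d)
        B = img_tokens.size(0)
        ho = self.ho_tokens.expand(B, -1, -1)
        it = self.int_tokens.expand(B, -1, -1)
        for ho_layer, int_layer in zip(self.ho_branch, self.int_branch):
            ho = ho_layer(ho, img_tokens)          # human-object pair detection branch
            it = int_layer(it, img_tokens)         # interaction classification branch
        return ho, it                              # decoded independently throughout
```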

The overall architecture of MUREN

The MUltiplex RElation Network (MUREN) adopts three decoder branches that are responsible for three sub-tasks: human detection, object detection, and interaction classification:


  • First, the input image is fed into the CNN backbone followed by the transformer encoder to extract the image tokens.
  • A transformer decoder layer in each branch extracts the task-specific tokens for its sub-task.
  • The MURE takes the task-specific tokens as input and generates the multiplex relation context for relational reasoning.
  • The attentive fusion module propagates the multiplex relation context to each sub-task for context exchange.
  • The outputs at the last layer of each branch are fed into FFNs to predict the HOI instances (a minimal per-layer sketch follows this list).
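
Putting these steps together, a minimal sketch of one MUREN-style layer in PyTorch, assuming standard decoder layers; MURE and the attentive fusion are stubbed with simple placeholders here and sketched separately in the sections below:

```python
import torch
import torch.nn as nn

class MURENLayerSketch(nn.Module):
    """One layer of the three-branch flow: decode per branch, build a
    shared relation context, then propagate it back to every branch."""

    def __init__(self, d=256, nhead=8):
        super().__init__()
        # one transformer decoder layer per sub-task branch
        self.dec = nn.ModuleDict({
            k: nn.TransformerDecoderLayer(d, nhead, batch_first=True)
            for k in ("H", "O", "I")
        })
        # stand-in for MURE: fuses the three token sets into a relation context
        self.mure = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))
        # stand-in for attentive fusion: per-branch gate over the shared context
        self.gate = nn.ModuleDict({k: nn.Linear(2 * d, d) for k in ("H", "O", "I")})

    def forward(self, tokens, img_tokens):
        # tokens: dict of task-specific tokens, each (B, N, d); img_tokens: (B, HW, d)
        tokens = {k: self.dec[k](tokens[k], img_tokens) for k in tokens}
        ctx = self.mure(torch.cat([tokens["H"], tokens["O"], tokens["I"]], dim=-1))
        out = {}
        for k, f in tokens.items():
            g = torch.sigmoid(self.gate[k](torch.cat([f, ctx], dim=-1)))
            out[k] = f + g * ctx                   # context exchange across branches
        return out
```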

Multiplex Relation Embedding Module (MURE)

Since the task-specific tokens are generated from separate branches, they suffer from a lack of relational context information. To mitigate this issue, the multiplex relation embedding module (MURE) generates a multiplex relation context for relational reasoning. The multiplex relation context combines the unary, pairwise, and ternary relation contexts to exploit the useful information in each.
MURE takes the i-th task-specific tokens and the image tokens as input, and embeds the unary ($[f_i^H; f_i^O; f_i^I]$) and pairwise ($[f_i^{HO}; f_i^{HI}; f_i^{OI}]$) relation contexts into the ternary relation context. The multiplex relation context, the output of MURE, is fed into the subsequent attentive fusion module for context exchange.
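
A minimal sketch of this idea, assuming MLP fusion for the pairwise and ternary compositions plus cross-attention to the image tokens; the authors' exact way of embedding the unary and pairwise contexts into the ternary context may differ:

```python
import torch
import torch.nn as nn

class MURESketch(nn.Module):
    """Multiplex relation embedding, sketched: unary tokens, pairwise
    fusions, and a ternary fusion are combined into one relation context."""

    def __init__(self, d=256, nhead=8):
        super().__init__()
        make_fuse = lambda k: nn.Sequential(nn.Linear(k * d, d), nn.ReLU(), nn.Linear(d, d))
        self.pair = nn.ModuleDict({k: make_fuse(2) for k in ("HO", "HI", "OI")})
        self.ternary = make_fuse(3)
        # cross-attention lets the relation context gather evidence from the image
        self.cross_attn = nn.MultiheadAttention(d, nhead, batch_first=True)

    def forward(self, fH, fO, fI, img_tokens):
        # fH, fO, fI: task-specific tokens, each (B, N, d)
        pHO = self.pair["HO"](torch.cat([fH, fO], dim=-1))    # pairwise contexts
        pHI = self.pair["HI"](torch.cat([fH, fI], dim=-1))
        pOI = self.pair["OI"](torch.cat([fO, fI], dim=-1))
        tern = self.ternary(torch.cat([fH, fO, fI], dim=-1))  # ternary context
        # embed unary and pairwise contexts into the ternary relation context
        ctx = tern + (fH + fO + fI) + (pHO + pHI + pOI)
        ctx, _ = self.cross_attn(ctx, img_tokens, img_tokens)
        return ctx  # multiplex relation context, passed on to attentive fusion
```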

Attentive Fusion

TODO
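
Details to be filled in; the overview above only says that this module propagates the multiplex relation context to each sub-task branch. Under that description alone, one plausible form is cross-attention from the task-specific tokens onto the relation context, with a residual connection so each branch keeps its own features (purely an assumption, not the paper's actual module):

```python
import torch.nn as nn

class AttentiveFusionSketch(nn.Module):
    """Guessed attentive fusion: each branch attends over the multiplex
    relation context and absorbs what is relevant to its sub-task."""

    def __init__(self, d=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, task_tokens, relation_ctx):
        # task_tokens: (B, N, d) from one branch; relation_ctx: (B, N, d) from MURE
        fused, _ = self.attn(task_tokens, relation_ctx, relation_ctx)
        return self.norm(task_tokens + fused)  # residual keeps branch-specific features
```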

Comment

  1. The context exchange between the branches is important for the effectiveness of MUREN.
  2. The target of MUREN is limited to a single human, even in multi-person scenarios.