[TPAMI 2025] Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation


Paper link: Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation | IEEE Journals & Magazine | IEEE Xplore

The English here is typed entirely by hand, as a summary and paraphrase of the original paper. Some spelling and grammar mistakes are unavoidable; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with that in mind.

Contents

1. Takeaways

2. Section-by-Section Reading Notes

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. YOLO Series Object Detectors

2.3.2. DETR Series Object Detectors

2.3.3. Hypergraph Learning Methods

2.4. Hypergraph Computation Empowered Semantic Collecting and Scattering Framework

2.5. Methods

2.5.1. Preliminaries

2.5.2. Hyper-YOLO Overview

2.5.3. Mixed Aggregation Network

2.5.4. Hypergraph-Based Cross-Level and Cross-Position Representation Network

2.5.5. Comparison and Analysis

2.6. Experiments

2.6.1. Experimental Setup

2.6.2. Results and Discussions

2.6.3. Ablation Studies on Backbone

2.6.4. Ablation Studies on Neck

2.6.5. More Ablation Studies

2.6.6. More Evaluation on Instance Segmentation Task

2.6.7. Visualization of High-Order Learning in Object Detection

2.7. Conclusion

1. Takeaways

(1) TL;DR: this one is from a Tsinghua group

(2) Please stop putting the important figures in the supplementary material where we can't see them!!!

I really want to see them lol

2. Section-by-Section Reading Notes

2.1. Abstract

        ①Limitations of traditional YOLO: the neck cannot efficiently aggregate cross-level features or exploit the correlations among high-order features

        ②To address this, the authors propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework

2.2. Introduction

        ①Most existing works fail to explore the high-order relationships among features

        ②Performance:

2.3. Related Work

2.3.1. YOLO Series Object Detectors

        ①Lists the different versions of YOLO and mentions that the authors propose an improved one

2.3.2. DETR Series Object Detectors

        ①DETR, which is based on the Transformer, is faster and more accurate than YOLO; however, it has many parameters and performs worse on small-object detection

        ②The Transformer is similar to a graph (this seems to be a bit of a small consensus by now, somehow)

        ③Hypergraphs can address these limitations of the Transformer

2.3.3. Hypergraph Learning Methods

        ①Hypergraphs capture high-order relationships, and yet hypergraph learning is still underexplored in computer vision haha

2.4. Hypergraph Computation Empowered Semantic Collecting and Scattering Framework

        ①For a feature map \mathbf{X}, a hypergraph is constructed via f:\boldsymbol{X}\to\mathcal{G}, yielding the hyper feature map \mathbf{X}_{hyper}. Then \mathbf{X} and \mathbf{X}_{hyper} are fused to obtain the hybrid feature map \mathbf{X}'

        ②Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework:

\begin{cases} \boldsymbol{X}_{mixed}\xleftarrow{\text{Collecting}}\{\boldsymbol{X}_{1},\boldsymbol{X}_{2},\ldots\} \\ \boldsymbol{X}_{hyper}=\text{HyperComputation}(\boldsymbol{X}_{mixed}) & \text{(high-order learning)} \\ \{\boldsymbol{X}_{1}^{\prime},\boldsymbol{X}_{2}^{\prime},\ldots\}\xleftarrow{\text{Scattering}}\{\phi(\boldsymbol{X}_{hyper},\boldsymbol{X}_{1}),\phi(\boldsymbol{X}_{hyper},\boldsymbol{X}_{2}),\ldots\} \end{cases}

where \phi \left ( \cdot \right ) denotes the feature fusion function
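To make the three-phase flow concrete, here is a minimal, framework-agnostic Python sketch. The names collect, hyper_computation, and phi are hypothetical placeholders standing in for the three phases; a concrete instance (HyperC2Net) appears in Section 2.5.4:

```python
def hgc_scs(feature_maps, collect, hyper_computation, phi):
    # Phase 1 -- semantic collecting: merge multi-level features into X_mixed
    x_mixed = collect(feature_maps)
    # Phase 2 -- high-order learning: hypergraph computation on the mixed features
    x_hyper = hyper_computation(x_mixed)
    # Phase 3 -- semantic scattering: fuse X_hyper back into each original feature map
    return [phi(x_hyper, x) for x in feature_maps]
```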

2.5. Methods

2.5.1. Preliminaries

        ①The neck produces outputs at three scales, \{N_3,N_4,N_5\}: the small-scale, medium-scale, and large-scale feature maps

        ②The backbone has five stages, \{B_1,B_2,B_3,B_4,B_5\}; a higher index denotes a deeper layer with higher-level semantic features

2.5.2. Hyper-YOLO Overview

        ①This mostly restates the previous subsection, describing where in the network the features are extracted

2.5.3. Mixed Aggregation Network

        ①The schematic of Mixed Aggregation Network (MANet):

where c in the figure denotes the number of channels

        ②The processes in MANet:

        ③The final output is obtained by fusing all of these features:

\boldsymbol{X}_{out}=\mathrm{Conv}_o(\boldsymbol{X}_1||\boldsymbol{X}_2||\ldots||\boldsymbol{X}_{4+n})
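Since the schematic isn't reproduced here, below is a hedged PyTorch sketch of a block with this output structure. The specific branch types (a 1×1 conv bypass, a depthwise-separable conv, and a C2f-style split followed by n simplified bottleneck blocks) are assumptions for illustration, not taken from the text above; only the final concatenation-and-Conv_o fusion mirrors the formula:

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, g=1):
    # Conv-BN-SiLU unit, the usual YOLO-style building block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, groups=g, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class MANetSketch(nn.Module):
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c = c_out // 2                                    # hidden width (assumption)
        self.branch1 = conv_bn_act(c_in, c, 1)            # X_1: 1x1 conv bypass
        self.branch2 = nn.Sequential(                     # X_2: depthwise-separable conv
            conv_bn_act(c_in, c_in, 3, g=c_in),
            conv_bn_act(c_in, c, 1),
        )
        self.split = conv_bn_act(c_in, 2 * c, 1)          # produces X_3 || X_4 (C2f-style split)
        self.blocks = nn.ModuleList(                      # X_5 ... X_{4+n}, simplified bottlenecks
            conv_bn_act(c, c, 3) for _ in range(n)
        )
        self.conv_o = conv_bn_act((4 + n) * c, c_out, 1)  # final fusion Conv_o

    def forward(self, x):
        feats = [self.branch1(x), self.branch2(x)]
        feats += list(self.split(x).chunk(2, dim=1))      # X_3, X_4
        for blk in self.blocks:
            feats.append(blk(feats[-1]))                  # each block feeds on the previous output
        return self.conv_o(torch.cat(feats, dim=1))       # Conv_o(X_1 || X_2 || ... || X_{4+n})
```

For example, MANetSketch(256, 256, n=2) keeps a (1, 256, 80, 80) tensor at the same shape while mixing 4 + 2 = 6 intermediate features.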

prowess: n. mastery; superb skill; exceptional ability (vocabulary note)

2.5.4. Hypergraph-Based Cross-Level and Cross-Position Representation Network

        ①Pipeline of proposed Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net):

(1)Hypergraph Construction

        ①For a hypergraph \mathcal{G}=\{\mathcal{V},\mathcal{E}\}, \mathcal{V} denotes the vertex set and \mathcal{E} the hyperedge set

        ②How to build hypergraph:

        ③Hyperedges are generated by taking an \epsilon-ball around each feature point:

\mathcal{E}=\{ball(v,\epsilon)\mid v\in\mathcal{V}\}

where ball(v,\epsilon)=\{u\mid||\boldsymbol{x}_u-\boldsymbol{x}_v||_d<\epsilon,u\in\mathcal{V}\}
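As a concrete reading of this construction, here is a minimal sketch that builds the incidence matrix \boldsymbol{H} from flattened feature points; the (N, C) input layout and the dense output format are my assumptions:

```python
import torch

def build_epsilon_ball_hypergraph(x, eps, d=2.0):
    # x: (N, C) features of the N vertices (flattened feature-map positions)
    # Each vertex v spawns one hyperedge ball(v, eps) = {u : ||x_u - x_v||_d < eps}
    dist = torch.cdist(x, x, p=d)   # (N, N) pairwise distances under the d-norm
    H = (dist < eps).float()        # incidence matrix: H[u, e] = 1 iff vertex u is in hyperedge e
    return H
```

H is (N, N) because there is exactly one hyperedge per vertex; column e is the \epsilon-ball of vertex e (which always contains e itself, since the self-distance is 0).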

(2)Hypergraph Convolution

        ①The hypergraph convolution is a spatial-domain hypergraph convolution with a residual connection:

\begin{cases} \boldsymbol{x}_e=\frac{1}{|\mathcal{N}_v(e)|}\sum_{v\in\mathcal{N}_v(e)}\boldsymbol{x}_v\boldsymbol{\Theta} \\ \boldsymbol{x}_v^{\prime}=\boldsymbol{x}_v+\frac{1}{|\mathcal{N}_e(v)|}\sum_{e\in\mathcal{N}_e(v)}\boldsymbol{x}_e \end{cases}

where \mathcal{N}_v(e)=\{v\mid v\in e,v\in\mathcal{V}\}, \mathcal{N}_e(v)=\{e\mid v\in e,e\in\mathcal{E}\}, and \boldsymbol{\Theta} is a trainable parameter matrix
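Read literally, the two equations are a vertex→hyperedge aggregation followed by a hyperedge→vertex aggregation with a residual. A direct, unoptimized transcription (the list-of-index-lists edge format is an assumption):

```python
import torch

def hyperconv_spatial(x, edges, theta):
    # x: (N, C) vertex features; edges: list of vertex-index lists; theta: (C, C) matrix Theta
    out = x.clone()
    # Stage 1: each hyperedge e averages the Theta-transformed features of its member vertices
    edge_feats = [(x[idx] @ theta).mean(dim=0) for idx in edges]
    # Stage 2: each vertex v averages the features of its incident hyperedges, plus a residual
    for v in range(x.shape[0]):
        incident = [f for f, idx in zip(edge_feats, edges) if v in idx]
        if incident:
            out[v] = x[v] + torch.stack(incident).mean(dim=0)
    return out
```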

        ②The matrix formulation of the hypergraph convolution:

\mathrm{HyperConv}(\boldsymbol{X},\boldsymbol{H})=\boldsymbol{X}+\boldsymbol{D}_v^{-1}\boldsymbol{H}\boldsymbol{D}_e^{-1}\boldsymbol{H}^\top\boldsymbol{X}\boldsymbol{\Theta}

where \boldsymbol{D}_v and \boldsymbol{D}_e denote the diagonal degree matrices of the vertices and hyperedges, respectively
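The matrix form is just the vectorized version of the two-stage scheme above. A minimal PyTorch module following the formula (the clamp guarding against division by zero is my addition):

```python
import torch
import torch.nn as nn

class HyperConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Linear(channels, channels, bias=False)  # trainable Theta

    def forward(self, x, H):
        # x: (N, C) vertex features; H: (N, E) incidence matrix
        d_v = H.sum(dim=1, keepdim=True).clamp(min=1.0)  # (N, 1) vertex degrees D_v
        d_e = H.sum(dim=0, keepdim=True).clamp(min=1.0)  # (1, E) hyperedge degrees D_e
        edge = (H / d_e).t() @ self.theta(x)             # D_e^{-1} H^T X Theta: vertices -> edges
        return x + (H @ edge) / d_v                      # X + D_v^{-1} H (edge feats): edges -> vertices
```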

(3)An Instance of HGC-SCS Framework

        ①Hypergraph-based cross-level and cross-position representation network (HyperC2Net):

\begin{cases} \boldsymbol{X}_{mixed}=\boldsymbol{B}_{1}\parallel\boldsymbol{B}_{2}\parallel\boldsymbol{B}_{3}\parallel\boldsymbol{B}_{4}\parallel\boldsymbol{B}_{5} \\ \boldsymbol{X}_{hyper}=\mathrm{HyperConv}(\boldsymbol{X}_{mixed},\boldsymbol{H}) \\ \boldsymbol{N}_{3},\boldsymbol{N}_{4},\boldsymbol{N}_{5}=\phi(\boldsymbol{X}_{hyper},\boldsymbol{B}_{3}),\phi(\boldsymbol{X}_{hyper},\boldsymbol{B}_{4}),\phi(\boldsymbol{X}_{hyper},\boldsymbol{B}_{5}) \end{cases}

where \parallel denotes concatenation and \phi denotes the fusion function
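Putting the pieces together, here is a hedged end-to-end sketch reusing the two helpers above. The choice of B_3's resolution as the common scale, the dense \epsilon-ball over all spatial positions, and the resize-and-add fusion standing in for \phi are all my assumptions (the paper's \phi is likely a learned fusion block):

```python
import torch
import torch.nn.functional as F

def hyperc2net_sketch(B, hyperconv, eps=1.0):
    # B: list [B1..B5] of backbone maps, each (1, C, H_i, W_i) with a shared C (simplification)
    size = B[2].shape[-2:]                                # use B3's resolution as the common scale
    x_mixed = torch.cat([F.interpolate(b, size=size) for b in B], dim=1)  # B1 || ... || B5
    c = x_mixed.shape[1]                                  # 5C mixed channels
    verts = x_mixed.flatten(2).squeeze(0).t()             # (N, 5C): vertices = spatial positions
    H = build_epsilon_ball_hypergraph(verts, eps)         # dense N x N, fine for a sketch
    x_hyper = hyperconv(verts, H)                         # high-order learning
    x_hyper = x_hyper.t().reshape(1, c, *size)
    outs = []
    for b in B[2:]:                                       # scatter into N3, N4, N5
        h = F.interpolate(x_hyper, size=b.shape[-2:])
        outs.append(b + h[:, : b.shape[1]])               # crude phi: resize + channel-slice + add
    return outs
```

Here hyperconv would be a HyperConv(5 * C) instance from the sketch above, so that the residual dimensions match.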

2.5.5. Comparison and Analysis

        ①They replace the PANet / gather-and-distribute neck with HyperC2Net

2.6. Experiments

2.6.1. Experimental Setup

        ①Performance on the Microsoft COCO dataset:

where different numbers of convolutional layers and feature dimensions yield the different model sizes: -T (the last C2f in the bottom-up stage is replaced by a 1×1 Conv), -N, -S, -M, and -L

        ②For a fair comparison, no pretraining or self-distillation strategies are used for any method

        ③All models take inputs of 640×640 pixels

2.6.2. Results and Discussions

        ①Good performance with fewer parameters; the improvement is especially significant for the smaller models

2.6.3. Ablation Studies on Backbone

        ①Ablation studies on backbone:

        ②Ablation studies on kernel size:

2.6.4. Ablation Studies on Neck

        ①Replacing the hypergraph with a traditional GCN:

        ②Ablation on the feature maps:

        ③Ablation on the distance threshold \epsilon:

        ④Ablation on the distance metric:

2.6.5. More Ablation Studies

        ①Model scale ablation:

2.6.6. More Evaluation on Instance Segmentation Task

        ①Performance on instance segmentation:

2.6.7. Visualization of High-Order Learning in Object Detection

        ①Visualization of how the attention changes:

2.7. Conclusion

        ~
