I have been working on pose estimation recently. I had actually read this paper before and taken some notes, but never organized them in one place. I do not have much time today, so I am writing down the important points here first.
The main contributions of this paper are:
- Previous pose estimation models all go through a high-to-low-resolution and then low-to-high-resolution process, and the whole pipeline is serial, so some spatial information is lost. The HRNet proposed in this paper adopts a parallel design and keeps high-resolution feature maps throughout the whole process.
- In addition, the model uses multi-scale fusion: the high- and low-resolution representations of the previous stage are fused before being passed on to the next stage.
The notes below are excerpts and paraphrases from the paper, in English.
Core Ideas and Contributions
- Maintains high-resolution representations through the whole process.
- We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations.
Methods and Approaches
General Description
- We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging the information across the parallel multi-resolution subnetworks over and over through the whole process.
- We perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich for pose estimation.

Detailed Description
- With an input of size $W \times H \times 3$, the problem is transformed into estimating $K$ heatmaps of size $W' \times H'$, $\{H_1, H_2, \ldots, H_K\}$, where each heatmap $H_k$ indicates the location confidence of the $k$-th keypoint (a minimal heatmap-generation sketch follows this list).
- It is composed of a stem consisting of two strided convolutions that decrease the resolution, a main body that outputs feature maps with the same resolution as its input feature maps, and a regressor that estimates the heatmaps, from which the keypoint positions are chosen and transformed back to the full resolution.
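As a concrete illustration of the heatmap representation above, here is a minimal sketch of how the $K$ target heatmaps could be generated from ground-truth keypoints. The function name, the Gaussian form, and the `sigma` value are my own illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_target_heatmaps(keypoints, heatmap_w, heatmap_h, sigma=2.0):
    """Build K Gaussian heatmaps of size W' x H' from K keypoint locations.

    `keypoints` has shape (K, 2) with (x, y) already scaled to the heatmap
    resolution; `sigma` controls the spread of the confidence peak (assumed).
    """
    ys, xs = np.mgrid[0:heatmap_h, 0:heatmap_w]            # pixel grid
    heatmaps = np.zeros((len(keypoints), heatmap_h, heatmap_w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        # confidence is highest at the k-th keypoint and decays with distance
        heatmaps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps
```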
Sequential multi-resolution subnetworks
- Let $\mathcal{N}_{sr}$ be the subnetwork in the $s$-th stage and $r$ the resolution index (its resolution is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork). The high-to-low network with $S$ (e.g., 4) stages can be denoted as (a toy serial sketch follows the equation):
$$\mathcal{N}_{11} \rightarrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{44}$$
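For contrast with the parallel design introduced next, a classical high-to-low chain like the one denoted above can be sketched as a purely serial PyTorch module. The channel widths and input size are illustrative assumptions, and each arrow is modelled here as a single strided convolution stage.

```python
import torch
import torch.nn as nn

def serial_stage(c_in, c_out):
    # each serial stage halves the spatial resolution with a strided 3x3 convolution
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True))

# N_11 -> N_22 -> N_33 -> N_44: the resolution drops at every step,
# which is where a serial design loses spatial detail
high_to_low = nn.Sequential(serial_stage(32, 64), serial_stage(64, 128), serial_stage(128, 256))
out = high_to_low(torch.randn(1, 32, 64, 48))   # -> torch.Size([1, 256, 8, 6])
```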
Parallel multi-resolution subnetworks
- start from a high-resolution subnetwork as the first stage.
- gradually add high-to-low subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel.
- the resolutions of the parallel subnetworks of a later stage consist of the resolutions from the previous stage plus an extra lower one (a toy parallel-stage sketch follows the diagram):
$$\begin{aligned}
\mathcal{N}_{11} &\rightarrow \mathcal{N}_{21} \rightarrow \mathcal{N}_{31} \rightarrow \mathcal{N}_{41} \\
&\searrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{32} \rightarrow \mathcal{N}_{42} \\
&\qquad\searrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{43} \\
&\qquad\qquad\searrow \mathcal{N}_{44}.
\end{aligned}$$
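Below is a toy sketch of one parallel stage: every resolution kept so far gets its own convolutional branch, and a later stage simply adds one more, lower-resolution branch. The widths (32, 64, 128) and the input sizes are illustrative assumptions, not the paper's exact instantiation.

```python
import torch
import torch.nn as nn

class ParallelStage(nn.Module):
    """One stage holding a convolutional branch per resolution, run in parallel."""
    def __init__(self, widths):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(w, w, 3, padding=1), nn.BatchNorm2d(w), nn.ReLU(inplace=True))
            for w in widths
        ])

    def forward(self, xs):
        # each resolution is processed by its own branch; none of them is discarded
        return [branch(x) for branch, x in zip(self.branches, xs)]

# stage 3 keeps the two earlier resolutions and adds a third, lower one
stage3 = ParallelStage([32, 64, 128])
feats = [torch.randn(1, 32, 64, 48),    # r = 1, full resolution
         torch.randn(1, 64, 32, 24),    # r = 2, 1/2 resolution
         torch.randn(1, 128, 16, 12)]   # r = 3, 1/4 resolution
outs = stage3(feats)                    # three feature maps, resolutions preserved
```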
Repeated multi-scale fusion
- We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives information from the other parallel subnetworks.
- We divide the third stage into several (e.g., 3) exchange blocks, and each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, given as follows:
$$\begin{array}{cccccc}
\mathcal{C}_{31}^{1} & \searrow & \nearrow \mathcal{C}_{31}^{2} & \searrow & \nearrow \mathcal{C}_{31}^{3} & \searrow \\
\mathcal{C}_{32}^{1} & \rightarrow \mathcal{E}_{3}^{1} & \rightarrow \mathcal{C}_{32}^{2} & \rightarrow \mathcal{E}_{3}^{2} & \rightarrow \mathcal{C}_{32}^{3} & \rightarrow \mathcal{E}_{3}^{3} \\
\mathcal{C}_{33}^{1} & \nearrow & \searrow \mathcal{C}_{33}^{2} & \nearrow & \searrow \mathcal{C}_{33}^{3} & \nearrow
\end{array}$$

- $\mathcal{C}_{sr}^{b}$ represents the convolution unit in the $r$-th resolution of the $b$-th block in the $s$-th stage, and $\mathcal{E}_{s}^{b}$ is the corresponding exchange unit.
- The inputs are $s$ response maps: $\{\mathbf{X}_{1}, \mathbf{X}_{2}, \ldots, \mathbf{X}_{s}\}$. The outputs are $s$ response maps: $\{\mathbf{Y}_{1}, \mathbf{Y}_{2}, \ldots, \mathbf{Y}_{s}\}$, whose resolutions and widths are the same as the inputs. Each output is an aggregation of the input maps, $\mathbf{Y}_{k}=\sum_{i=1}^{s} a(\mathbf{X}_{i}, k)$. The exchange unit across stages has an extra output map $\mathbf{Y}_{s+1}=a(\mathbf{Y}_{s}, s+1)$.
- The function $a(\mathbf{X}_{i}, k)$ consists of upsampling or downsampling $\mathbf{X}_{i}$ from resolution $i$ to resolution $k$. We adopt strided 3×3 convolutions for downsampling: for instance, one strided 3×3 convolution with stride 2 for 2× downsampling, and two consecutive strided 3×3 convolutions with stride 2 for 4× downsampling. For upsampling, we adopt simple nearest-neighbor sampling following a 1×1 convolution for aligning the number of channels. If $i=k$, $a(\cdot, \cdot)$ is just an identity connection: $a(\mathbf{X}_{i}, k)=\mathbf{X}_{i}$ (a minimal sketch of this exchange unit follows).
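Under the description above, a minimal sketch of a single exchange unit might look like the following. The channel widths, the absence of batch normalization, and the placement of the nearest-neighbor upsampling inside `forward` are my assumptions, and the extra cross-stage output $\mathbf{Y}_{s+1}$ is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def downsample(c_in, c_out, times):
    """2^times downsampling as a chain of strided 3x3 convolutions."""
    layers = []
    for t in range(times):
        layers += [nn.Conv2d(c_in if t == 0 else c_out, c_out, 3, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class ExchangeUnit(nn.Module):
    """Each output resolution k aggregates a(X_i, k) over all input resolutions i."""
    def __init__(self, widths):
        super().__init__()
        s = len(widths)
        self.a = nn.ModuleList()
        for k in range(s):                       # output resolution index
            row = []
            for i in range(s):                   # input resolution index
                if i == k:
                    row.append(nn.Identity())                            # identity connection
                elif i < k:
                    row.append(downsample(widths[i], widths[k], k - i))  # strided 3x3 convs
                else:
                    row.append(nn.Conv2d(widths[i], widths[k], 1))       # 1x1 conv to align channels
            self.a.append(nn.ModuleList(row))

    def forward(self, xs):
        outs = []
        for k, row in enumerate(self.a):
            acc = 0
            for i, transform in enumerate(row):
                y = transform(xs[i])
                if i > k:   # lower-resolution input: nearest-neighbor upsample to resolution k
                    y = F.interpolate(y, size=xs[k].shape[-2:], mode="nearest")
                acc = acc + y                    # Y_k = sum_i a(X_i, k)
            outs.append(acc)
        return outs

# usage: fuse three parallel response maps (widths and sizes are illustrative)
unit = ExchangeUnit([32, 64, 128])
ys = unit([torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24), torch.randn(1, 128, 16, 12)])
```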
Network instantiation
- This part is not clear enough from the paper alone, so I will look into the code and add a link here later describing it.