I have been working on pose estimation recently. I had actually read this paper before and taken some notes, but never organized them in one place. I do not have much time today, so I am writing down the important points here first.
The main contributions of this paper are:
- Previous pose estimation models all go through a high-to-low-resolution and then low-to-high-resolution process, and the whole pipeline is serial, so some spatial information is lost. The HRNet proposed in this paper adopts a parallel design and keeps high-resolution feature maps throughout the whole process.
- In addition, the model uses multi-scale fusion: the high- and low-resolution representations of the previous stage are fused before being passed on to the next stage.
The notes below are excerpts and paraphrases from the paper, in English.
Core Ideas and Contributions
- Maintains high-resolution representations through the whole process.
- We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations.
Methods and Approaches
General Description
- We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging the information across the parallel multi-resolution subnetworks over and over through the whole process.
- We perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich for pose estimation.

Detailed Description
- With an input of size $W \times H \times 3$, the problem is transformed into estimating $K$ heatmaps of size $W' \times H'$, $\{H_1, H_2, \ldots, H_K\}$, where each heatmap $H_k$ indicates the location confidence of the $k$-th keypoint (a minimal heatmap-generation sketch follows this list).
- It is composed of a stem consisting of two strided convolutions that decrease the resolution, a main body that outputs feature maps with the same resolution as its input feature maps, and a regressor that estimates the heatmaps, from which the keypoint positions are chosen and transformed back to the full resolution.
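As a concrete illustration of the heatmap representation above, here is a minimal sketch of how the $K$ target heatmaps could be generated from ground-truth keypoints. The function name, the Gaussian form, and the `sigma` value are my own illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_target_heatmaps(keypoints, heatmap_w, heatmap_h, sigma=2.0):
    """Build K Gaussian heatmaps of size W' x H' from K keypoint locations.

    `keypoints` has shape (K, 2) with (x, y) already scaled to the heatmap
    resolution; `sigma` controls the spread of the confidence peak (assumed).
    """
    ys, xs = np.mgrid[0:heatmap_h, 0:heatmap_w]            # pixel grid
    heatmaps = np.zeros((len(keypoints), heatmap_h, heatmap_w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        # confidence is highest at the k-th keypoint and decays with distance
        heatmaps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps
```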
Sequential multi-resolution subnetworks
- Let $\mathcal{N}_{sr}$ be the subnetwork in the $s$-th stage and $r$ the resolution index (its resolution is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork). The high-to-low network with $S$ (e.g., 4) stages can be denoted as (a toy serial sketch follows the equation):
$$\mathcal{N}_{11} \rightarrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{44}$$
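For contrast with the parallel design introduced next, a classical high-to-low chain like the one denoted above can be sketched as a purely serial PyTorch module. The channel widths and input size are illustrative assumptions, and each arrow is modelled here as a single strided convolution stage.

```python
import torch
import torch.nn as nn

def serial_stage(c_in, c_out):
    # each serial stage halves the spatial resolution with a strided 3x3 convolution
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True))

# N_11 -> N_22 -> N_33 -> N_44: the resolution drops at every step,
# which is where a serial design loses spatial detail
high_to_low = nn.Sequential(serial_stage(32, 64), serial_stage(64, 128), serial_stage(128, 256))
out = high_to_low(torch.randn(1, 32, 64, 48))   # -> torch.Size([1, 256, 8, 6])
```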
Parallel multi-resolution subnetworks
- start from a high-resolution subnetwork as the first stage.
- gradually add high-to-low subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel.
- the resolutions of the parallel subnetworks of a later stage consist of the resolutions from the previous stage plus an extra lower one (a toy parallel-stage sketch follows the diagram):
$$\begin{aligned}
\mathcal{N}_{11} &\rightarrow \mathcal{N}_{21} \rightarrow \mathcal{N}_{31} \rightarrow \mathcal{N}_{41} \\
&\searrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{32} \rightarrow \mathcal{N}_{42} \\
&\qquad\searrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{43} \\
&\qquad\qquad\searrow \mathcal{N}_{44}.
\end{aligned}$$
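Below is a toy sketch of one parallel stage: every resolution kept so far gets its own convolutional branch, and a later stage simply adds one more, lower-resolution branch. The widths (32, 64, 128) and the input sizes are illustrative assumptions, not the paper's exact instantiation.

```python
import torch
import torch.nn as nn

class ParallelStage(nn.Module):
    """One stage holding a convolutional branch per resolution, run in parallel."""
    def __init__(self, widths):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(w, w, 3, padding=1), nn.BatchNorm2d(w), nn.ReLU(inplace=True))
            for w in widths
        ])

    def forward(self, xs):
        # each resolution is processed by its own branch; none of them is discarded
        return [branch(x) for branch, x in zip(self.branches, xs)]

# stage 3 keeps the two earlier resolutions and adds a third, lower one
stage3 = ParallelStage([32, 64, 128])
feats = [torch.randn(1, 32, 64, 48),    # r = 1, full resolution
         torch.randn(1, 64, 32, 24),    # r = 2, 1/2 resolution
         torch.randn(1, 128, 16, 12)]   # r = 3, 1/4 resolution
outs = stage3(feats)                    # three feature maps, resolutions preserved
```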
Repeated multi-scale fusion
- We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives information from the other parallel subnetworks.
- We divide the third stage into several (e.g., 3) exchange blocks, and each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, given as follows:
$$\begin{array}{cccccc}
\mathcal{C}_{31}^{1} & \searrow & \nearrow \mathcal{C}_{31}^{2} & \searrow & \nearrow \mathcal{C}_{31}^{3} & \searrow \\
\mathcal{C}_{32}^{1} & \rightarrow \mathcal{E}_{3}^{1} & \rightarrow \mathcal{C}_{32}^{2} & \rightarrow \mathcal{E}_{3}^{2} & \rightarrow \mathcal{C}_{32}^{3} & \rightarrow \mathcal{E}_{3}^{3} \\
\mathcal{C}_{33}^{1} & \nearrow & \searrow \mathcal{C}_{33}^{2} & \nearrow & \searrow \mathcal{C}_{33}^{3} & \nearrow
\end{array}$$

- $\mathcal{C}_{sr}^{b}$ represents the convolution unit in the $r$-th resolution of the $b$-th block in the $s$-th stage, and $\mathcal{E}_{s}^{b}$ is the corresponding exchange unit.
- The inputs are $s$ response maps: $\{\mathbf{X}_{1}, \mathbf{X}_{2}, \ldots, \mathbf{X}_{s}\}$. The outputs are $s$ response maps: $\{\mathbf{Y}_{1}, \mathbf{Y}_{2}, \ldots, \mathbf{Y}_{s}\}$, whose resolutions and widths are the same as the inputs. Each output is an aggregation of the input maps, $\mathbf{Y}_{k}=\sum_{i=1}^{s} a(\mathbf{X}_{i}, k)$. The exchange unit across stages has an extra output map $\mathbf{Y}_{s+1}=a(\mathbf{Y}_{s}, s+1)$.
- The function $a(\mathbf{X}_{i}, k)$ consists of upsampling or downsampling $\mathbf{X}_{i}$ from resolution $i$ to resolution $k$. We adopt strided 3×3 convolutions for downsampling: for instance, one strided 3×3 convolution with stride 2 for 2× downsampling, and two consecutive strided 3×3 convolutions with stride 2 for 4× downsampling. For upsampling, we adopt simple nearest-neighbor sampling following a 1×1 convolution for aligning the number of channels. If $i=k$, $a(\cdot, \cdot)$ is just an identity connection: $a(\mathbf{X}_{i}, k)=\mathbf{X}_{i}$ (a minimal sketch of this exchange unit follows).
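Under the description above, a minimal sketch of a single exchange unit might look like the following. The channel widths, the absence of batch normalization, and the placement of the nearest-neighbor upsampling inside `forward` are my assumptions, and the extra cross-stage output $\mathbf{Y}_{s+1}$ is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def downsample(c_in, c_out, times):
    """2^times downsampling as a chain of strided 3x3 convolutions."""
    layers = []
    for t in range(times):
        layers += [nn.Conv2d(c_in if t == 0 else c_out, c_out, 3, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class ExchangeUnit(nn.Module):
    """Each output resolution k aggregates a(X_i, k) over all input resolutions i."""
    def __init__(self, widths):
        super().__init__()
        s = len(widths)
        self.a = nn.ModuleList()
        for k in range(s):                       # output resolution index
            row = []
            for i in range(s):                   # input resolution index
                if i == k:
                    row.append(nn.Identity())                            # identity connection
                elif i < k:
                    row.append(downsample(widths[i], widths[k], k - i))  # strided 3x3 convs
                else:
                    row.append(nn.Conv2d(widths[i], widths[k], 1))       # 1x1 conv to align channels
            self.a.append(nn.ModuleList(row))

    def forward(self, xs):
        outs = []
        for k, row in enumerate(self.a):
            acc = 0
            for i, transform in enumerate(row):
                y = transform(xs[i])
                if i > k:   # lower-resolution input: nearest-neighbor upsample to resolution k
                    y = F.interpolate(y, size=xs[k].shape[-2:], mode="nearest")
                acc = acc + y                    # Y_k = sum_i a(X_i, k)
            outs.append(acc)
        return outs

# usage: fuse three parallel response maps (widths and sizes are illustrative)
unit = ExchangeUnit([32, 64, 128])
ys = unit([torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24), torch.randn(1, 128, 16, 12)])
```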
Network instantiation
- This part is not clear enough from the paper alone, so I will look into the code and add a link here later describing it.