Day 2: Deep High-Resolution Representation Learning for Human Pose Estimation

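This post introduces HRNet's contribution to pose estimation: it abandons the traditional serial pipeline that goes from high resolution to low resolution and back to high resolution, keeps high-resolution features throughout the network, and strengthens information exchange through multi-scale fusion. HRNet starts from a high-resolution subnetwork, gradually adds lower-resolution subnetworks to form a parallel structure, and repeatedly performs multi-scale fusion to enrich the high-resolution features, which improves pose estimation accuracy.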

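I have been working on pose estimation recently. I had read this paper before and taken some notes, but never organized them in one place. I don't have much time today, so I'll just write down the important points here.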

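The main contributions of this paper are:

  • Previous pose estimation models go from high resolution to low resolution and then back from low to high, and the whole process is serial, so some spatial information is lost. HRNet, proposed in this paper, uses a parallel design that keeps a high-resolution feature map throughout the whole process.
  • In addition, the model uses multi-scale fusion: the high- and low-resolution representations from the previous stage are fused before being passed on to the next stage.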

The notes below were originally written in English.


Core Ideas and Contributions

  1. Maintains high-resolution representations through the whole process.
  2. Conducts repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations.

Methods and Approaches

General Description

  • We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging information across the parallel multi-resolution subnetworks over and over through the whole process.
  • We perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich for pose estimation.

Detailed Description

  • With an input image of size $W \times H \times 3$, the problem is transformed into estimating $K$ heatmaps of size $W' \times H'$, $\{H_1, H_2, \ldots, H_K\}$, where each heatmap $H_k$ indicates the location confidence of the $k$th keypoint.
  • The network is composed of a stem consisting of two strided convolutions that decrease the resolution, a main body that outputs feature maps with the same resolution as its input feature maps, and a regressor that estimates the heatmaps, from which the keypoint positions are chosen and transformed to the full resolution.
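The last step, choosing keypoint positions from the heatmaps and mapping them back to the full resolution, can be sketched in plain Python. This is only a minimal illustration and not the paper's code: `decode_keypoints` is a hypothetical helper, the heatmaps are nested lists, the position is a plain argmax with no sub-pixel refinement, and the toy sizes are made up.

```python
def decode_keypoints(heatmaps, full_width, full_height):
    """Pick the argmax of each heatmap and rescale it to full resolution.

    heatmaps: list of K heatmaps, each a list of H' rows with W' floats.
    Returns a list of K tuples (x, y, confidence) in input-image pixels.
    """
    keypoints = []
    for heatmap in heatmaps:
        hm_height = len(heatmap)
        hm_width = len(heatmap[0])
        # Location with the highest confidence in this heatmap.
        best_y, best_x = max(
            ((y, x) for y in range(hm_height) for x in range(hm_width)),
            key=lambda yx: heatmap[yx[0]][yx[1]],
        )
        # Map heatmap coordinates (W' x H') back to the W x H input image.
        scale_x = full_width / hm_width
        scale_y = full_height / hm_height
        keypoints.append(
            (best_x * scale_x, best_y * scale_y, heatmap[best_y][best_x])
        )
    return keypoints


# Toy example (assumed sizes): K = 2 heatmaps with W' = 4, H' = 3,
# for an input image with W = 16, H = 12.
toy_heatmaps = [
    [[0.0, 0.1, 0.0, 0.0],
     [0.0, 0.9, 0.2, 0.0],
     [0.0, 0.1, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.3],
     [0.0, 0.0, 0.2, 0.8]],
]
print(decode_keypoints(toy_heatmaps, full_width=16, full_height=12))
```

Each tuple holds the keypoint's x and y in input-image pixels plus the heatmap value at the chosen location, which can serve as a confidence score.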

Sequential multi-resolution subnetworks

  • Let $\mathcal{N}_{sr}$ be the subnetwork in the $s$th stage and $r$ be the resolution index (its resolution is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork). The high-to-low network with $S$ (e.g., 4) stages can be denoted as:

$$\mathcal{N}_{11} \rightarrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{44}$$
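To make the resolution index concrete, here is a small sketch of the $\frac{1}{2^{r-1}}$ rule and of the serial chain in which the stage index and the resolution index advance together. The helper names and the 64×48 size of the first subnetwork's feature map are assumptions chosen for illustration only.

```python
from fractions import Fraction


def relative_resolution(r):
    """Resolution of index r relative to the first subnetwork: 1 / 2^(r-1)."""
    return Fraction(1, 2 ** (r - 1))


def sequential_network(num_stages):
    """Labels N_{sr} of the serial high-to-low chain, where r equals s."""
    return [f"N{s}{s}" for s in range(1, num_stages + 1)]


def feature_map_size(first_width, first_height, r):
    """Spatial size at resolution index r, given the first subnetwork's size."""
    scale = relative_resolution(r)
    return int(first_width * scale), int(first_height * scale)


print(" -> ".join(sequential_network(4)))
for r in range(1, 5):
    # 64 x 48 is an assumed size for the highest-resolution feature map.
    print(r, relative_resolution(r), feature_map_size(64, 48, r))
```

In the serial design every stage lowers the resolution, so by the fourth stage only a map at 1/8 of the first subnetwork's resolution remains.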

Parallel multi-resolution subnetworks

  • Start from a high-resolution subnetwork as the first stage.
  • Gradually add high-to-low subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel.
  • The resolutions of the parallel subnetworks in a later stage consist of the resolutions from the previous stage plus an extra lower one.

$$
\begin{array}{llll}
\mathcal{N}_{11} & \rightarrow \mathcal{N}_{21} & \rightarrow \mathcal{N}_{31} & \rightarrow \mathcal{N}_{41} \\
& \searrow \mathcal{N}_{22} & \rightarrow \mathcal{N}_{32} & \rightarrow \mathcal{N}_{42} \\
& & \searrow \mathcal{N}_{33} & \rightarrow \mathcal{N}_{43} \\
& & & \searrow \mathcal{N}_{44}.
\end{array}
$$
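The parallel layout can be enumerated with a few lines of Python: stage $s$ keeps every resolution from the previous stage and adds one lower resolution, so it holds resolution indices $1$ to $s$. The function names below are hypothetical and only generate the labels, not any actual layers.

```python
def parallel_layout(num_stages):
    """Return {stage: [labels]} where stage s holds resolutions 1..s."""
    return {
        s: [f"N{s}{r}" for r in range(1, s + 1)]
        for s in range(1, num_stages + 1)
    }


def new_branch(stage):
    """Label of the branch that first appears in a stage (its lowest resolution)."""
    return f"N{stage}{stage}"


layout = parallel_layout(4)
for stage, subnets in layout.items():
    print(f"stage {stage}: {', '.join(subnets)}  (new: {new_branch(stage)})")
```

Unlike the serial chain, the first resolution index appears in every stage, which is how the high-resolution representation is maintained through the whole network.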

Repeated multi-scale fusion

  • We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives information from the other parallel subnetworks.
  • We divide the third stage into several (e.g., 3) exchange blocks. Each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, as follows:

$$
\begin{array}{llllll}
\mathcal{C}_{31}^{1} & \searrow & \nearrow \mathcal{C}_{31}^{2} & \searrow & \nearrow \mathcal{C}_{31}^{3} & \searrow \\
\mathcal{C}_{32}^{1} & \rightarrow \mathcal{E}_{3}^{1} & \rightarrow \mathcal{C}_{32}^{2} & \rightarrow \mathcal{E}_{3}^{2} & \rightarrow \mathcal{C}_{32}^{3} & \rightarrow \mathcal{E}_{3}^{3} \\
\mathcal{C}_{33}^{1} & \nearrow & \searrow \mathcal{C}_{33}^{2} & \nearrow & \searrow \mathcal{C}_{33}^{3} & \nearrow
\end{array}
$$


  • $\mathcal{C}_{sr}^{b}$ represents the convolution unit in the $r$th resolution of the $b$th block in the $s$th stage, and $\mathcal{E}_{s}^{b}$ is the corresponding exchange unit.
  • The inputs are $s$ response maps: $\{\mathbf{X}_{1}, \mathbf{X}_{2}, \ldots, \mathbf{X}_{s}\}$. The outputs are $s$ response maps: $\{\mathbf{Y}_{1}, \mathbf{Y}_{2}, \ldots, \mathbf{Y}_{s}\}$, whose resolutions and widths are the same as those of the inputs. Each output is an aggregation of the input maps, $\mathbf{Y}_{k}=\sum_{i=1}^{s} a(\mathbf{X}_{i}, k)$. The exchange unit across stages has an extra output map $\mathbf{Y}_{s+1}$: $\mathbf{Y}_{s+1}=a(\mathbf{Y}_{s}, s+1)$.
  • The function $a(\mathbf{X}_{i}, k)$ consists of upsampling or downsampling $\mathbf{X}_{i}$ from resolution $i$ to resolution $k$. We adopt strided $3 \times 3$ convolutions for downsampling: for instance, one strided $3 \times 3$ convolution with stride 2 for 2× downsampling, and two consecutive strided $3 \times 3$ convolutions with stride 2 for 4× downsampling. For upsampling, we adopt simple nearest neighbor sampling following a $1 \times 1$ convolution for aligning the number of channels. If $i=k$, $a(\cdot, \cdot)$ is just an identity connection: $a(\mathbf{X}_{i}, k)=\mathbf{X}_{i}$.
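The aggregation $\mathbf{Y}_{k}=\sum_{i=1}^{s} a(\mathbf{X}_{i}, k)$ can be sketched on toy data. This is a simplified stand-in and not the network's actual layers: maps are single-channel nested lists, the learned strided $3 \times 3$ convolution is replaced by 2×2 average pooling, and the $1 \times 1$ channel-aligning convolution is omitted because there is only one channel. All function names are hypothetical.

```python
def downsample_2x(x):
    """Halve height and width with 2x2 average pooling.

    Assumption: this replaces the learned strided 3x3 convolution.
    """
    return [
        [
            (x[2 * r][2 * c] + x[2 * r][2 * c + 1]
             + x[2 * r + 1][2 * c] + x[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(len(x[0]) // 2)
        ]
        for r in range(len(x) // 2)
    ]


def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling by an integer factor."""
    return [
        [x[r // factor][c // factor] for c in range(len(x[0]) * factor)]
        for r in range(len(x) * factor)
    ]


def a(x, i, k):
    """Bring map x from resolution index i to resolution index k."""
    if i == k:
        return x  # identity connection
    if i < k:
        for _ in range(k - i):  # one 2x step per resolution level
            x = downsample_2x(x)
        return x
    return upsample_nearest(x, 2 ** (i - k))


def exchange_unit(inputs):
    """Compute Y_k = sum_i a(X_i, k); inputs[0] is the highest resolution."""
    s = len(inputs)
    outputs = []
    for k in range(1, s + 1):
        resampled = [a(inputs[i - 1], i, k) for i in range(1, s + 1)]
        height, width = len(resampled[0]), len(resampled[0][0])
        outputs.append([
            [sum(m[r][c] for m in resampled) for c in range(width)]
            for r in range(height)
        ])
    return outputs


# Three parallel maps at resolution indices 1, 2, 3 (sizes 4x4, 2x2, 1x1).
x1 = [[1.0] * 4 for _ in range(4)]
x2 = [[2.0] * 2 for _ in range(2)]
x3 = [[3.0]]
y1, y2, y3 = exchange_unit([x1, x2, x3])
print(len(y1), len(y2), len(y3))
```

Every output keeps the spatial size of the input at the same index, while its values mix information from all three resolutions.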

Network instantiation

  • This part is not clear enough from reading the paper alone, so I will look into the code and add a link here to a write-up of it in the future.