Convolutional Pose Machines: Based on Pose Machines and Convolutional Networks

Article link: https://blog.youkuaiyun.com/qq_43452156/article/details/103620208

Abstract

Aim: use convolutional networks to learn image features and image-dependent spatial models for pose estimation.
Contribution: implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation.
Method: design a sequential architecture composed of convolutional networks that operate directly on belief maps from previous stages, producing increasingly refined estimates of part locations.
Training: address the difficulty of vanishing gradients with a natural learning objective that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure.

Introduction

Convolutional Pose Machines (CPMs) = pose machines + convolutional architectures:

Pose Machines:
(1) learn long-range dependencies among image and multi-part cues;
(2) tight integration between learning and inference;
(3) a modular sequential design.
CPMs:
(1) learn feature representations for both image and spatial context directly from data;
(2) a differentiable architecture that allows globally joint training with backpropagation;
(3) efficiently handle large training datasets.

In order to capture long-range interactions between parts, the design of the network in each stage of our sequential prediction framework is motivated by the goal of achieving a large receptive field on both the image and the belief maps.

Solution to vanishing gradients:
replenish gradients and produce increasingly accurate belief maps by enforcing intermediate supervision periodically through the network.

Main contributions:
(a) learning implicit spatial models via a sequential composition of convolutional architectures;
(b) a systematic approach to designing and training such an architecture to learn both image features and image-dependent spatial models for structured prediction tasks.

Related Work

We show that the regressed confidence maps are suitable as input to further convolutional networks with large receptive fields, which learn implicit spatial dependencies. We also show how the sequential prediction framework takes advantage of the uncertainty preserved in the confidence maps to encode rich spatial context, while enforcing intermediate local supervision to address the problem of vanishing gradients.

Method

Pose Machines

Convolutional Pose Machines


I. Keypoint Localization Using Local Image Evidence

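The first stage of a CPM predicts part beliefs from local image evidence only, as shown in Figure 2c.

The evidence is local because the receptive field of the first stage of the network is constrained to a small patch around the output pixel location.

The input image is normalized to $368\times368$, and the receptive field of the network is $160\times 160$. The network can be viewed as a deep network slid across the image: for each $160\times 160$ image patch, it regresses a $(P + 1)$-dimensional output vector from the local image evidence, giving a score for each part at that image location.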

II. Sequential Prediction with Learned Spatial Context Features

$\psi$: the receptive field of the predictor on the beliefs from the previous stage.

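Guiding principles of the network design:
(1) Achieve a large receptive field at the output layer of the second stage, so that the network can learn potentially complex, long-range correlations between parts.
(2) Simply supply the features output by the previous stage (as opposed to specifying potential functions, as in a graphical model); the convolutional layers of the subsequent stage then let the classifier freely combine contextual information by picking the most predictive features.
(3) The belief maps of the first stage are produced by a network that examines the image locally, with a small receptive field.
(4) In the second stage, design a network that substantially increases the receptive field.
Ways to achieve large receptive fields:
(1) pooling, at the cost of reduced precision;
(2) increasing the kernel size, at the cost of more parameters;
(3) increasing the number of convolutional layers, at the risk of vanishing gradients during training.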
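
The trade-offs listed above can be checked with a small receptive-field calculator. This is a minimal sketch: it uses the standard recurrence in which each layer adds (kernel - 1) times the current output spacing, and the stage-1 layer list is an assumption, since these notes do not spell out the layers. Under that assumption the result agrees with the $160\times160$ receptive field quoted for the first stage.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of conv/pool layers.

    Each layer is a (kernel_size, stride) pair, listed from input to output.
    """
    rf, jump = 1, 1  # field size, and spacing of adjacent outputs in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


CONV9, CONV5, CONV1, POOL2 = (9, 1), (5, 1), (1, 1), (2, 2)

# Assumed stage-1 stack (not given in the notes above).
stage1 = [CONV9, POOL2, CONV9, POOL2, CONV9, POOL2, CONV5, CONV9, CONV1, CONV1]
print(receptive_field(stage1))  # 160

# Three 9x9 layers: more layers alone grow the field slowly.
print(receptive_field([CONV9, CONV9, CONV9]))  # 25

# The same three layers with pooling in between: larger field, coarser output.
print(receptive_field([CONV9, POOL2, CONV9, POOL2, CONV9]))  # 60

# One 25x25 kernel matches three 9x9 layers in field size but needs more
# weights per input/output channel pair.
weights_one_25x25 = 25 * 25       # 625
weights_three_9x9 = 3 * 9 * 9     # 243
```

Pooling multiplies the spacing term, which is why it enlarges the field so cheaply, and also why it costs spatial precision in the output.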

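For stages $t \geq 2$, the network and its corresponding receptive field are shown in Figure 2d. Figure 4 shows the relationship between receptive field size and accuracy.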
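
The data flow between stages can be sketched as follows. This is only an illustration under loud assumptions: every size is hypothetical, the weights are random, and each stage is replaced by a single 1x1 convolution, whereas the real stages are deep stacks with large receptive fields. What it does show is that each later stage sees the image features concatenated with the previous stage's belief maps.

```python
import numpy as np

P, H, W, C = 14, 46, 46, 32  # hypothetical: parts, map height and width, feature channels
T = 3                        # hypothetical number of stages
rng = np.random.default_rng(0)


def conv1x1(x, weights):
    """Apply a 1x1 convolution: a linear map over channels at every pixel."""
    return np.einsum("oc,chw->ohw", weights, x)


def run_stages(first_features, shared_features, stage_weights):
    """Return the belief maps produced by every stage, in order."""
    beliefs = [conv1x1(first_features, stage_weights[0])]  # stage 1: image only
    for weights in stage_weights[1:]:                      # later stages
        stage_input = np.concatenate([shared_features, beliefs[-1]], axis=0)
        beliefs.append(conv1x1(stage_input, weights))
    return beliefs


x_first = rng.standard_normal((C, H, W))   # stand-in for stage-1 image features
x_shared = rng.standard_normal((C, H, W))  # stand-in for features shared by later stages
stage_weights = [rng.standard_normal((P + 1, C))] + [
    rng.standard_normal((P + 1, C + P + 1)) for _ in range(T - 1)
]
beliefs = run_stages(x_first, x_shared, stage_weights)
print([b.shape for b in beliefs])  # T maps, each of shape (P + 1, H, W)
```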

III. Learning in Convolutional Pose Machines

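Problem:
Training deep convolutional networks is prone to vanishing gradients.
Approach:
Define a loss function at the output of each stage that minimizes the $l_2$ distance between the predicted and ideal belief maps for each part.
Ground truth locations:
The ideal belief map for each part $p$ is written as
$b_*^p(Y_p=z)$
and is created by placing Gaussian peaks at the ground truth locations of part $p$.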
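
A minimal sketch of building such an ideal belief map with NumPy is given here. The map size, the peak location, and the Gaussian width `sigma` are all assumed values chosen for illustration; the notes do not state them.

```python
import numpy as np


def ideal_belief_map(height, width, center_xy, sigma=1.0):
    """Belief map with a single Gaussian peak at a ground-truth location.

    center_xy is (x, y) in pixels; sigma (the peak width) is an assumed value.
    """
    cx, cy = center_xy
    ys, xs = np.mgrid[0:height, 0:width]
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical 46 x 46 map with the part located at x = 10, y = 20.
b_star = ideal_belief_map(46, 46, center_xy=(10, 20), sigma=2.0)
peak_y, peak_x = np.unravel_index(np.argmax(b_star), b_star.shape)
print(peak_x, peak_y)  # 10 20
```

For a part with several ground-truth locations, one option is to take the element-wise maximum of the individual maps so that every peak keeps a height of 1.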
Loss function:
The loss function to be minimized at each stage is
$f_t=\sum_{p=1}^{P+1} \sum_{z\in Z}\|b_t^p(z)-b_*^p(z)\|_2^2 \tag 4$
Overall loss function:
The overall objective is obtained by adding up the losses of all stages:
$F=\sum_{t=1}^T f_t\tag 5$
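
Equations (4) and (5) translate almost directly into NumPy. The sketch below uses random arrays as stand-ins for belief maps; the sizes and the toy predictions are hypothetical, not results from the paper.

```python
import numpy as np


def stage_loss(pred, ideal):
    """Equation (4): squared l2 distance summed over all parts and locations.

    pred and ideal both have shape (P + 1, H, W).
    """
    return float(np.sum((pred - ideal) ** 2))


def total_loss(stage_preds, ideal):
    """Equation (5): sum of the per-stage losses."""
    return sum(stage_loss(pred, ideal) for pred in stage_preds)


rng = np.random.default_rng(0)
ideal = rng.random((15, 46, 46))  # hypothetical: P + 1 = 15 maps of 46 x 46
# Toy predictions whose error shrinks from stage to stage.
stage_preds = [ideal + noise * rng.standard_normal(ideal.shape)
               for noise in (0.3, 0.2, 0.1)]
losses = [stage_loss(pred, ideal) for pred in stage_preds]
print(losses, total_loss(stage_preds, ideal))
```

Because every stage is compared with the same ideal maps, each stage receives its own error signal instead of relying only on the loss at the final output.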
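The network is trained with standard stochastic gradient descent. The image feature $x^{'}$ is shared across the stages $t\geq 2$ by sharing the weights of the corresponding convolutional layers, as shown in Figure 2.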

Evaluation

Analysis

1. Hypothesis:
The intermediate loss functions replenish the gradients at every stage and thereby prevent vanishing gradients.

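2. Verification:
Histograms of gradient magnitudes at different depths, for models with and without intermediate supervision (Figure 5), confirm the hypothesis.
It can also be seen that as training progresses, the variance of the gradient magnitude distribution decreases, indicating that the model is converging.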
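
One way to see why intermediate supervision helps is to write out the gradient of the overall loss. This is a sketch that ignores the weights shared across stages; $\theta_s$, the parameters used only by stage $s$, is a symbol introduced here for illustration. Since $f_t$ depends on $\theta_s$ only when $t \geq s$, differentiating (5) gives
$\frac{\partial F}{\partial \theta_s}=\sum_{t=s}^{T}\frac{\partial f_t}{\partial \theta_s}$
The $t=s$ term reaches $\theta_s$ through the layers of stage $s$ alone, so it is not attenuated by the layers of the later stages. With supervision only at the final output, $F=f_T$ and the sum reduces to the single term $\partial f_T/\partial \theta_s$, which has to be back-propagated through stages $s+1,\dots,T$.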

Benefit of end-to-end learning

As shown in Figure 6a, end-to-end learning significantly improves accuracy.

Comparison on training schemes

As shown in Figure 6b, four training schemes are compared:
(1) training from scratch with a global loss function that enforces intermediate supervision;
(2) stage-wise training, where each stage is trained in a feed-forward fashion and the stages are then stacked;
(3) the same as (1), but initialized with the weights from (2);
(4) the same as (1), but without intermediate supervision.

Performance across stages

Figure 6c compares the performance at each stage.

Discussion

Convolutional pose machines provide an end-to-end architecture. We showed that a sequential architecture composed of convolutional networks is capable of implicitly learning spatial models for pose by communicating increasingly refined uncertainty-preserving beliefs between stages. Problems with spatial dependencies between variables arise in multiple domains of computer vision, such as semantic image labeling, single-image depth prediction, and object detection, and future work will involve extending our architecture to these problems. Our approach achieves state-of-the-art accuracy on all primary benchmarks; however, we do observe failure cases, mainly when multiple people are in close proximity. Handling multiple people in a single end-to-end architecture is also a challenging problem and an interesting avenue for future work.