Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation CVPR 2019

Latest recommended article published on 2022-06-23 15:13:57


License: CC 4.0 BY-SA


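The paper proposes an image-based 3D human pose estimation method that learns a geometry-aware 3D representation from pairs of images taken from different viewpoints. Its key steps are image-skeleton mapping, skeleton-based view synthesis, and a representation consistency constraint.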


goal:

learn a geometry-aware 3D representation $\mathcal{G}$ for the human pose

discover the geometry relation between paired images $(I^i_t, I^j_t)$
which are acquired from synchronized and calibrated cameras

$i, j$: different viewpoints
$t$: acquisition time

main components

image skeleton mapping
skeleton-based view synthesis
representation consistency constraint

image-skeleton mapping

input: raw image pair $(I^i_t, I^j_t)$ of size $W \times H$

$i, j$: different viewpoints (synchronized and calibrated cameras $i, j$)

$C^i_t, C^j_t$: $K$ keypoint heatmaps from a pre-trained 2D human pose estimator

We follow previous works [45, 20, 18] to train the 2D estimator on MPII dataset.

construct 2D skeleton maps with a width of 8 pixels from the heatmaps

binary skeleton map pair $(S^i_t, S^j_t)$, $S_t^{(\cdot)}\in\{0,1\}^{(K-1)\times W \times H}$
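The mapping from heatmaps to binary skeleton maps can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes keypoints are read off as the argmax of each heatmap, that `bones` is a hypothetical list of $K-1$ keypoint index pairs forming the skeleton tree, and that arrays are laid out as (channel, height, width).

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """Argmax of each of the K heatmaps, shape (K, H, W) -> (K, 2) array of (x, y)."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, (H, W))
    return np.stack([xs, ys], axis=1).astype(np.float64)

def skeleton_maps(keypoints, bones, height, width, thickness=8):
    """Rasterize each bone as a binary map; output shape (len(bones), H, W)."""
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2) pixel coords
    maps = np.zeros((len(bones), height, width), dtype=np.uint8)
    for b, (p, q) in enumerate(bones):
        start, end = keypoints[p], keypoints[q]
        seg = end - start
        denom = max(float(seg @ seg), 1e-12)  # guard against zero-length bones
        # Projection parameter of every pixel onto the segment, clamped to [0, 1].
        t = np.clip(((pix - start) @ seg) / denom, 0.0, 1.0)
        nearest = start + t[..., None] * seg
        dist = np.linalg.norm(pix - nearest, axis=-1)
        # A pixel belongs to the bone if it lies within half the stroke width.
        maps[b] = dist <= thickness / 2.0
    return maps
```

With $K$ keypoints and a tree-shaped skeleton, `bones` has $K-1$ entries, which matches the $(K-1)$ channels of $S_t^{(\cdot)}$.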

Geometry representation via view synthesis

training set $\mathcal T = \{(S^i_t,S^j_t,R_{i\to j})\}$

pairs of two views of the projection of the same 3D skeleton, $(S^i_t,S^j_t)$
relative rotation matrix $R_{i\to j}$ from the coordinate system of camera $i$ to that of camera $j$
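For calibrated cameras, the relative rotation can be derived from the extrinsics. The sketch below assumes each camera has a world-to-camera rotation (`R_i`, `R_j` are hypothetical names) and ignores translation, since only the rotation enters the training set above.

```python
import numpy as np

def rotation_z(theta):
    """Rotation by theta radians about the z axis (helper for the example)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def relative_rotation(R_i, R_j):
    """R_{i->j} = R_j R_i^T, assuming R_i and R_j are world-to-camera rotations."""
    return R_j @ R_i.T

# Usage: directions expressed in camera i are carried into camera j.
R_i, R_j = rotation_z(0.2), rotation_z(1.1)
R_ij = relative_rotation(R_i, R_j)
x_world = np.array([1.0, 2.0, 3.0])
x_cam_i = R_i @ x_world
x_cam_j = R_ij @ x_cam_i  # equals R_j @ x_world up to rounding
```

Because rotation matrices are orthogonal, the inverse direction is simply the transpose, $R_{j\to i} = R_{i\to j}^T$.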

A straightforward way to learn a representation in an unsupervised or weakly-supervised manner is an auto-encoding mechanism that reconstructs the input image. However, such a representation

captures no geometric structure information
provides no more useful information for 3D pose estimation than 2D coordinates

Instead, a novel 'skeleton-based view synthesis' generates the image under a new viewpoint,
given an image under a known viewpoint as input.

source domain (input images): $\mathcal S^i = \{S^i_t\}^V_{i=1}$

$V$: number of viewpoints

target domain (generated images): $\mathcal S^j = \{S^j_t\}^V_{j=1}$

$j \neq i$

encoder $\phi:\mathcal S^i \to \mathcal G$

maps the source skeleton $S^i_t \in \mathcal S^i$ into a latent space, $G_i\in \mathcal G$

decoder $\psi:\mathcal R_{i\to j} \times \mathcal G \to \mathcal S^j$

rotation matrix $R_{i \to j}$
$G_i$ is a set of $M$ discrete points in 3-dimensional space
in practice: $G = [g_1,g_2,\dots,g_M]^T$, $g_m = (x_m,y_m,z_m)$

$L_{\ell_2}(\phi \cdot\psi,\theta)=\frac{1}{N^T}\sum\lVert\psi(R_{i\to j}\times\phi(S^i_t))-S^j_t\rVert$
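The view-synthesis objective can be illustrated with a toy forward pass. This is only a sketch under strong assumptions: the encoder and decoder are stand-in linear maps (`W_enc`, `W_dec` are hypothetical weights, not the paper's networks), and the rotation is applied to each latent point $g_m$ before decoding.

```python
import numpy as np

def encode(S, W_enc, M):
    """Toy linear encoder phi: flattened skeleton maps -> M latent points in 3D."""
    return (S.reshape(-1) @ W_enc).reshape(M, 3)

def rotate(G, R):
    """Apply R_{i->j} to every latent point g_m (the rows of G)."""
    return G @ R.T

def decode(G_rot, W_dec, shape):
    """Toy linear decoder psi: rotated latent points -> target skeleton maps."""
    return (G_rot.reshape(-1) @ W_dec).reshape(shape)

def view_synthesis_loss(pairs, W_enc, W_dec, M):
    """Mean L2 distance between synthesized and real target skeleton maps."""
    total = 0.0
    for S_i, S_j, R_ij in pairs:
        pred = decode(rotate(encode(S_i, W_enc, M), R_ij), W_dec, S_j.shape)
        total += np.linalg.norm(pred - S_j)
    return total / len(pairs)
```

Here `pairs` plays the role of the training set $\mathcal T$, with one `(S_i, S_j, R_ij)` triple per sample. In a real setup the linear maps would be replaced by trainable networks and the loss minimized by gradient descent.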

Representation consistency constraint

Image-skeleton mapping plus view synthesis (the previous two steps) can lead to unrealistic generation of the target pose when there are large self-occlusions in the source view,

since there is no explicit constraint on the latent space that encourages $\mathcal G$ to be semantic.

We assume there exists an inverse mapping (one-to-one) between source domain and target domain, on the condition of the known relative rotation matrix. We could find:

an encoder $\mu:\mathcal S^j \to \mathcal G$

maps the target skeleton $S^j_t$ to the latent space, $\tilde G_j \in \mathcal G$

a decoder $\nu:\mathcal R_{j\to i}\times \mathcal G \to \mathcal S^i$

maps the representation $\tilde G_j$ back to the source skeleton $S^i_t$
$\tilde G_j$ and $G_i$ should be the same shared representation on $\mathcal G$ with different rotation-related coefficients, namely representation consistency

$l_{rc}=\sum_{m=1}^M\lVert f \times G_i -\tilde G_j \rVert^2$

$f = R_{i \to j}$: rotation-related transformation that maps $G_i$ to $\tilde G_j$
a bidirectional encoder-decoder framework
- $generator(\phi,\psi)$, $G_{ij}$: rotated $G_i$
- $generator(\mu,\nu)$
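The representation consistency term can be sketched in a few lines. The example assumes both latent codes are stored as $M \times 3$ arrays, with `G_src` coming from the source-view encoder $\phi$ and `G_tgt` from the target-view encoder $\mu$ (both names are illustrative), and that the transformation $f$ is the relative rotation $R_{i\to j}$.

```python
import numpy as np

def consistency_loss(G_src, G_tgt, R_ij):
    """Sum over the M latent points of || R_{i->j} g_m - g~_m ||^2."""
    diff = G_src @ R_ij.T - G_tgt  # rotate the source code into the target frame
    return float(np.sum(diff ** 2))

# Usage: a target code that is exactly the rotated source code gives ~0 loss.
theta = 0.5
R_ij = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0, 0.0, 1.0]])
G_src = np.arange(12, dtype=np.float64).reshape(4, 3)
G_tgt = G_src @ R_ij.T
loss_consistent = consistency_loss(G_src, G_tgt, R_ij)

# Shifting one target point by a unit vector raises the loss by about 1.
G_shifted = G_tgt.copy()
G_shifted[0] += np.array([1.0, 0.0, 0.0])
loss_shifted = consistency_loss(G_src, G_shifted, R_ij)
```

In the bidirectional framework this term would be added to the two view-synthesis losses of the generators, so that both encoders are pushed toward one shared latent space.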