Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation (CVPR 2019)

This paper proposes an image-based 3D human pose estimation method that learns a geometry-aware 3D representation from image pairs captured under different viewpoints. The approach consists of three key steps: image-skeleton mapping, skeleton-based view synthesis, and a representation consistency constraint.



Goal:

learn a geometry-aware 3D representation $\mathcal{G}$ for the human pose

discover the geometry relation between paired images $(I^i_t, I^j_t)$, which are acquired from synchronized and calibrated cameras

  • $i, j$: different viewpoints
  • $t$: acquisition time

main components

  • image skeleton mapping
  • skeleton-based view synthesis
  • representation consistency constraint

image-skeleton mapping

input: raw image pair $(I^i_t, I^j_t)$ of size $W \times H$

  • $i, j$: different viewpoints (synchronized and calibrated cameras $i, j$)

$C^i_t, C^j_t$: $K$ keypoint heatmaps from a pre-trained 2D human pose estimator

We follow previous works [45, 20, 18] to train the 2D estimator on the MPII dataset.

construct 2D skeleton maps of 8-pixel width from the heatmaps

binary skeleton map pair $(S^i_t, S^j_t)$, with $S_t^{(\cdot)} \in \{0,1\}^{(K-1)\times W \times H}$
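To make the mapping concrete, here is a minimal sketch of the skeleton-map construction, assuming each joint is taken as the per-heatmap argmax and that `LIMBS` is a hypothetical list of the $(K-1)$ parent-child joint pairs (the paper's exact connectivity is not given in these notes):

```python
import numpy as np
import cv2

# Hypothetical limb connectivity: (K-1) parent-child joint index pairs.
LIMBS = [(0, 1), (1, 2), (2, 3)]  # ...extend to the full (K-1)-limb skeleton

def heatmaps_to_skeleton_maps(heatmaps, width=8):
    """Convert K keypoint heatmaps (K, H, W) into (K-1) binary skeleton maps."""
    K, H, W = heatmaps.shape
    # Take each joint location as the argmax of its heatmap, in (x, y) order.
    joints = [np.unravel_index(h.argmax(), h.shape)[::-1] for h in heatmaps]
    skel = np.zeros((len(LIMBS), H, W), dtype=np.uint8)
    for c, (a, b) in enumerate(LIMBS):
        # Draw one 8-pixel-wide limb segment into its own binary channel.
        cv2.line(skel[c], tuple(map(int, joints[a])), tuple(map(int, joints[b])),
                 color=1, thickness=width)
    return skel  # values in {0, 1}, shape (K-1, H, W)
```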

Geometry representation via view synthesis

training set $\mathcal{T} = \{(S^i_t, S^j_t, R_{i\to j})\}$

  • pairs $(S^i_t, S^j_t)$: projections of the same 3D skeleton into two views
  • relative rotation matrix $R_{i\to j}$ from the coordinate system of camera $i$ to that of camera $j$ (computable from the calibration, as sketched below)
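As a side note (my derivation, not stated in these notes): given the calibrated world-to-camera rotations $R_i$ and $R_j$, the relative rotation follows directly:

```python
import numpy as np

def relative_rotation(R_i, R_j):
    """R_{i->j} = R_j R_i^T: maps camera-i coordinates into camera-j coordinates.

    Assumes R_i, R_j are the 3x3 world-to-camera rotations from calibration.
    """
    return R_j @ R_i.T
```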

A straightforward way to learn a representation in an unsupervised or weakly-supervised manner is to use an auto-encoding mechanism that reconstructs the input image. However, such a representation:

  • carries no geometric structure information
  • provides no more useful information for 3D pose estimation than 2D coordinates do

Instead, a novel skeleton-based view synthesis generates the skeleton image under a new viewpoint, given an image under a known viewpoint as input.

source domain (input images): $\mathcal{S}^i = \{S^i_t\}_{i=1}^V$

  • $V$: number of viewpoints

target domain (generated images): $\mathcal{S}^j = \{S^j_t\}_{j=1}^V$

  • $j \neq i$

encoder $\phi: \mathcal{S}^i \to \mathcal{G}$

  • maps a source skeleton $S^i_t \in \mathcal{S}^i$ into a latent representation $G_i \in \mathcal{G}$

decoder $\psi: \mathcal{R}_{i\to j} \times \mathcal{G} \to \mathcal{S}^j$

  • rotation matrix $R_{i\to j}$
  • $G_i$ is treated as a set of $M$ discrete points in 3D space;
    in practice: $G = [g_1, g_2, \dots, g_M]^T$, $g_m = (x_m, y_m, z_m)$

$$L_{\ell 2}(\phi \cdot \psi, \theta) = \frac{1}{N^T} \sum \lVert \psi(R_{i\to j} \times \phi(S^i_t)) - S^j_t \rVert$$
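A minimal PyTorch sketch of this view-synthesis branch, assuming 64x64 skeleton maps and illustrative layer sizes (not the paper's architecture): the encoder $\phi$ produces $M$ latent 3D points, which are rotated by $R_{i\to j}$ and decoded by $\psi$ into the target-view skeleton maps.

```python
import torch
import torch.nn as nn

class ViewSynthesisNet(nn.Module):
    """phi: skeleton maps -> M latent 3D points; psi: rotated points -> skeleton maps."""

    def __init__(self, in_ch, M=64, img_size=64):
        super().__init__()
        self.M = M
        feat = 64 * (img_size // 4) ** 2
        self.phi = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(feat, 3 * M),                         # G: M points in R^3
        )
        self.psi = nn.Sequential(
            nn.Linear(3 * M, feat), nn.ReLU(),
            nn.Unflatten(1, (64, img_size // 4, img_size // 4)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, S_i, R_ij):
        B = S_i.shape[0]
        G_i = self.phi(S_i).view(B, self.M, 3)              # (B, M, 3)
        G_ij = G_i @ R_ij.transpose(1, 2)                   # rotate: g_m -> R_{i->j} g_m
        return self.psi(G_ij.reshape(B, -1)), G_i
```

The $\ell_2$ loss above then compares the decoder output against the ground-truth target skeleton maps $S^j_t$.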

Representation consistency constraint

The previous two steps (image-skeleton mapping + view synthesis) can lead to unrealistic generation of the target pose when there are large self-occlusions in the source view, since there is no explicit constraint on the latent space that encourages $\mathcal{G}$ to be semantic.

Assume there exists a one-to-one inverse mapping between the source and target domains, conditioned on the known relative rotation matrix. We can then define:

an encoder $\mu: \mathcal{S}^j \to \mathcal{G}$

  • maps a target skeleton $S^j_t$ to a latent representation $\tilde{G}_j \in \mathcal{G}$

a decoder $\nu: \mathcal{R}_{j\to i} \times \mathcal{G} \to \mathcal{S}^i$

  • maps the representation $\tilde{G}_j$ back to the source skeleton $S^i_t$
  • $\tilde{G}_j$ and $G_i$ should be the same shared representation on $\mathcal{G}$, differing only in rotation-related coefficients: this is the representation consistency

$$l_{rc} = \sum_{m=1}^{M} \lVert f \times G_i - \tilde{G}_j \rVert^2$$

  • $f = R_{i\to j}$: the rotation-related transformation that maps $G_i$ to $\tilde{G}_j$
  • together this forms a bidirectional encoder-decoder framework:
    • generator $(\phi, \psi)$, with $G_{ij}$: the rotated $G_i$
    • generator $(\mu, \nu)$


The total loss of the bidirectional model combines the two view-synthesis reconstruction losses with the representation consistency term, where $\theta$ and $\zeta$ denote the parameters of the two encoder-decoder networks, respectively.
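A sketch of one training step of the bidirectional model under the same assumptions, with two `ViewSynthesisNet` instances playing the roles of $(\phi, \psi)$ and $(\mu, \nu)$; the weight `lambda_rc` on the consistency term is an assumption, not a value from the paper:

```python
import torch.nn.functional as F

def training_step(net_fwd, net_bwd, S_i, S_j, R_ij, R_ji, lambda_rc=1.0):
    S_j_hat, G_i = net_fwd(S_i, R_ij)        # (phi, psi): view i -> view j
    S_i_hat, G_j = net_bwd(S_j, R_ji)        # (mu, nu):  view j -> view i

    loss_fwd = F.mse_loss(S_j_hat, S_j)      # L_l2 for (phi, psi), parameters theta
    loss_bwd = F.mse_loss(S_i_hat, S_i)      # L_l2 for (mu, nu), parameters zeta

    # Representation consistency: the rotated G_i should match G~_j point-wise.
    G_ij = G_i @ R_ij.transpose(1, 2)
    loss_rc = ((G_ij - G_j) ** 2).sum(dim=-1).mean()

    return loss_fwd + loss_bwd + lambda_rc * loss_rc
```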

3D human pose estimation by learnt representation

given: a monocular image $I$
goal: $\mathbf{b} = \{(x^p, y^p, z^p)\}_{p=1}^P$ for the $P$ body joints, $\mathbf{b} \in \mathcal{B}$

learn a pose regression function $\mathcal{F}: \mathcal{I} \to \mathcal{B}$

the regressor is implemented with 2 fully-connected layers
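A minimal sketch of this regressor, assuming the representation comes from the frozen encoder $\phi$ and that the hidden width (1024) and joint count ($P = 17$, e.g. Human3.6M) are illustrative choices:

```python
import torch.nn as nn

class PoseRegressor(nn.Module):
    """F: learned representation G (M x 3) -> P 3D body joints, via 2 FC layers."""

    def __init__(self, M=64, P=17, hidden=1024):
        super().__init__()
        self.P = P
        self.fc = nn.Sequential(
            nn.Linear(3 * M, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * P),
        )

    def forward(self, G):                    # G: (B, M, 3) from the frozen encoder
        return self.fc(G.flatten(1)).view(-1, self.P, 3)
```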
