ACMMM-2024: Three Papers on 3D Human Pose Estimation

Geometry-Guided Diffusion Model with Masked Transformer for Robust Multi-View 3D Human Pose Estimation

Paper analysis: http://www.studyai.com/xueshu/paper/detail/0be538bdf1
Paper link: https://openreview.net/forum?id=z9nEV02Ujx

Abstract

Recent research on Diffusion Models and Transformers has brought significant advancements to 3D Human Pose Estimation (HPE).
Nonetheless, existing methods often fail to concurrently address the issues of accuracy and generalization.
In this paper, we propose a Geometry-guided Diffusion Model with Masked Transformer (Masked Gifformer) for robust multi-view 3D HPE.
Within the framework of the diffusion model, a hierarchical multi-view transformer-based denoiser is exploited to fit the 3D pose distribution by systematically integrating joint and view information.
To address the long-standing problem of poor generalization, we introduce a fully random mask mechanism without any additional learnable modules or parameters.
Furthermore, we incorporate geometric guidance into the diffusion model to enhance the accuracy of the model.
This is achieved by optimizing the sampling process to minimize reprojection errors through modeling a conditional guidance distribution.
Extensive experiments on two benchmarks demonstrate that Masked Gifformer effectively achieves a trade-off between accuracy and generalization.
Specifically, our method outperforms other probabilistic methods by >40% and achieves comparable results with state-of-the-art deterministic methods.
In addition, our method exhibits robustness to varying camera numbers, spatial arrangements, and datasets…
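The geometric guidance above steers the diffusion sampling toward 3D poses with low reprojection error against the observed 2D detections. As an illustration of that error term only (not the paper's denoiser or guidance distribution, which are not reproduced here), a minimal NumPy sketch, assuming known 3×4 projection matrices per view:

```python
import numpy as np

def reproject(pose_3d, P):
    """Project J x 3 joints with a 3 x 4 camera matrix into pixel coordinates."""
    homo = np.hstack([pose_3d, np.ones((pose_3d.shape[0], 1))])  # J x 4 homogeneous
    proj = homo @ P.T                                            # J x 3
    return proj[:, :2] / proj[:, 2:3]                            # divide by depth

def reprojection_error(pose_3d, views):
    """Mean pixel error over all views; `views` is a list of (P, joints_2d) pairs."""
    errs = [np.linalg.norm(reproject(pose_3d, P) - j2d, axis=1).mean()
            for P, j2d in views]
    return float(np.mean(errs))
```

In guided sampling, the gradient of such an error with respect to the candidate pose would nudge each denoising step, which is the general idea behind conditioning the sampling process on geometric consistency.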

APP: Adaptive Pose Pooling for 3D Human Pose Estimation from Videos

Paper analysis: http://www.studyai.com/xueshu/paper/detail/c65c9da3c1
Paper link: https://openreview.net/forum?id=mq5Kg0XBWd

Abstract

Current advancements in 3D human pose estimation have attained notable success by converting 2D poses into their 3D counterparts.
However, this approach is inherently influenced by the errors introduced by 2D pose detectors and overlooks the intrinsic spatial information embedded within RGB images.
To address these challenges, we introduce a versatile module called Adaptive Pose Pooling (APP), compatible with many existing 2D-to-3D lifting models.
The APP module includes three novel sub-modules: Pose-Aware Offsets Generation (PAOG), Pose-Aware Sampling (PAS), and Spatial Temporal Information Fusion (STIF).
First, we extract latent features of the multi-frame lifting model.
Then, a 2D pose detector is utilized to extract multi-level feature maps from the image.
After that, PAOG generates offsets from the feature maps, and PAS uses these offsets to sample the feature maps.
STIF then fuses the PAS-sampled features with the latent features.
This innovative design allows the APP module to simultaneously capture spatial and temporal information.
We conduct comprehensive experiments on two widely used datasets: Human3.6M and MPI-INF-3DHP.
Meanwhile, we employ various lifting models to demonstrate the efficacy of the APP module.
Our results show that the proposed APP module consistently enhances the performance of lifting models, achieving state-of-the-art results.
Significantly, our module achieves these performance boosts without necessitating alterations to the architecture of the lifting model…
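The offset-guided sampling in PAS can be pictured as bilinear lookups into a feature map at joint locations shifted by learned offsets. The sketch below illustrates only that sampling step in plain NumPy; the function names and shapes are assumptions, and the actual module would operate on batched tensors with learned, differentiable offsets:

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly sample a C x H x W feature map at a continuous, in-bounds (x, y)."""
    C, H, W = fmap.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * fmap[:, y0, x0] + wx * fmap[:, y0, x1]
    bot = (1 - wx) * fmap[:, y1, x0] + wx * fmap[:, y1, x1]
    return (1 - wy) * top + wy * bot

def sample_with_offsets(fmap, joints_2d, offsets):
    """Sample features at each 2D joint location shifted by its predicted offset."""
    return np.stack([bilinear_sample(fmap, x + dx, y + dy)
                     for (x, y), (dx, dy) in zip(joints_2d, offsets)])
```

Because the sample points follow the detected pose rather than a fixed grid, the extracted features stay aligned with the person even when the 2D detector's heatmap peaks are slightly off.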

3D Human Pose Estimation from Multiple Dynamic Views via Single-view Pretraining with Procrustes Alignment

Paper analysis: http://www.studyai.com/xueshu/paper/detail/d5fd8b842b
Paper link: https://openreview.net/forum?id=5SelUL07QL

Abstract

3D human pose estimation from multiple cameras with unknown calibration has received less attention than it should.
The few existing data-driven solutions do not fully exploit 3D training data that are available on the market, and typically train from scratch for every novel multi-view scene, which impedes both accuracy and efficiency.
We show how to exploit 3D training data to the fullest and associate multiple dynamic views efficiently to achieve high precision on novel scenes using a simple yet effective framework, dubbed Multiple Dynamic View Pose estimation (MDVPose).
MDVPose uses data from novel scenarios to fine-tune a single-view pretrained motion encoder in a multi-view setting, aligns an arbitrary number of views in a unified coordinate frame via Procrustes alignment, and imposes multi-view consistency.
The proposed method achieves 22.1 mm P-MPJPE or 34.2 mm MPJPE on the challenging in-the-wild Ski-Pose PTZ dataset, which outperforms the state-of-the-art method by 24.8% P-MPJPE (-7.3 mm) and 19.0% MPJPE (-8.0 mm).
It also outperforms the state-of-the-art methods by a large margin (-18.2mm P-MPJPE and -28.3mm MPJPE) on the EgoBody dataset.
In addition, MDVPose achieves robust performance on the Human3.6M dataset, which features multiple static cameras.
Code will be released upon acceptance…
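Procrustes alignment, which MDVPose uses to bring an arbitrary number of views into a unified coordinate frame, has a closed-form solution via SVD. Below is a minimal sketch of the similarity variant (scale, rotation, translation); this is a generic textbook implementation, not the authors' released code:

```python
import numpy as np

def procrustes_align(X, Y):
    """Similarity Procrustes: find scale s, rotation R, and translation that
    best map point set X (N x 3) onto Y (N x 3); returns the aligned copy of X."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y                  # center both point sets
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)          # SVD of the cross-covariance
    d = np.sign(np.linalg.det(U @ Vt))           # guard against reflections
    D = np.array([1.0, 1.0, d])
    R = U @ np.diag(D) @ Vt                      # optimal rotation
    s = (S * D).sum() / (Xc ** 2).sum()          # optimal scale
    return s * Xc @ R + mu_y                     # scale, rotate, re-center
```

The same construction underlies the P-MPJPE metric quoted above: each predicted pose is Procrustes-aligned to the ground truth before the per-joint error is measured, so the metric ignores global scale, rotation, and translation.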
