Geometry-Guided Diffusion Model with Masked Transformer for Robust Multi-View 3D Human Pose Estimation

Xinyi Zhang, Qinpeng Cui, Qiqi Bao, Wenming Yang, Qingmin Liao
DOI: https://doi.org/10.1145/3664647.3681265
2024-01-01
Abstract: Recent research on Diffusion Models and Transformers has brought significant advancements to 3D Human Pose Estimation (HPE). Nonetheless, existing methods often fail to address accuracy and generalization at the same time. In this paper, we propose a Geometry-guided Diffusion Model with Masked Transformer (Masked Gifformer) for robust multi-view 3D HPE. Within the framework of the diffusion model, a hierarchical multi-view transformer-based denoiser is exploited to fit the 3D pose distribution by systematically integrating joint and view information. To address the long-standing problem of poor generalization, we introduce a fully random mask mechanism without any additional learnable modules or parameters. Furthermore, we incorporate geometric guidance into the diffusion model to enhance the accuracy of the model. This is achieved by optimizing the sampling process to minimize reprojection errors through modeling a conditional guidance distribution. Extensive experiments on two benchmarks demonstrate that Masked Gifformer effectively achieves a trade-off between accuracy and generalization. Specifically, our method outperforms other probabilistic methods by more than 40% and achieves comparable results with state-of-the-art deterministic methods. In addition, our method exhibits robustness to varying camera numbers, spatial arrangements, and datasets.
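The abstract does not spell out how the geometric guidance is computed, so the following is only a rough illustration of the general idea of nudging a 3D pose estimate toward lower multi-view reprojection error. It is not the authors' implementation: the function names (`project`, `reprojection_loss_and_grad`, `guided_step`), the per-view boolean `mask`, the 3x4 camera matrices, and the plain gradient step are all assumptions made for the sketch.

```python
import numpy as np


def project(P, X):
    """Project (J, 3) points with a 3x4 camera matrix P.

    Returns (J, 2) image coordinates and (J,) depths.
    """
    Xh = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)  # homogeneous
    p = Xh @ P.T                                                # (J, 3)
    return p[:, :2] / p[:, 2:3], p[:, 2]


def reprojection_loss_and_grad(X, cams, obs, mask):
    """Sum of squared reprojection errors over the views kept by `mask`,
    together with its analytic gradient with respect to the 3D joints X.

    `mask` is a hypothetical per-view keep flag; the paper's actual
    masking scheme may operate on different units.
    """
    loss = 0.0
    grad = np.zeros_like(X)
    for P, x_obs, keep in zip(cams, obs, mask):
        if not keep:
            continue
        x, w = project(P, X)
        r = x - x_obs                                           # (J, 2) residuals
        loss += float(np.sum(r ** 2))
        # Rows of the projection Jacobian d(x)/d(X) for each joint.
        Ju = (P[0, :3][None, :] - x[:, 0:1] * P[2, :3][None, :]) / w[:, None]
        Jv = (P[1, :3][None, :] - x[:, 1:2] * P[2, :3][None, :]) / w[:, None]
        grad += 2.0 * (r[:, 0:1] * Ju + r[:, 1:2] * Jv)
    return loss, grad


def guided_step(X, cams, obs, mask, step=0.5):
    """One plain gradient step that lowers the reprojection error.

    In a diffusion sampler, a correction of this kind would be applied to
    the intermediate pose estimate at each denoising step (an assumption).
    """
    _, grad = reprojection_loss_and_grad(X, cams, obs, mask)
    return X - step * grad
```

A step size such as `0.5` is only sensible for normalized camera coordinates; with pixel-scale intrinsics the gradient is far larger and the step would need to be scaled down accordingly.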