Joint Multi-Scale Transformers and Pose Equivalence Constraints for 3D Human Pose Estimation
Yongpeng Wu,Dehui Kong,Junna Gao,Jinghua Li,Baocai Yin
DOI: https://doi.org/10.1016/j.jvcir.2024.104247
IF: 2.887
2024-01-01
Journal of Visual Communication and Image Representation
Abstract: Unlike image-based 3D pose estimation, video-based 3D pose estimation improves performance by exploiting temporal information. However, existing video-based methods still generalize poorly across variations in human motion speed, body shape, and camera distance. To address these problems, we propose a novel approach, referred to as joint Spatial-temporal Multi-scale Transformers and Pose Transformation Equivalence Constraints (SMT-PTEC), for 3D human pose estimation from videos. We design a more general spatial-temporal multi-scale feature extraction strategy and introduce optimization constraints that adapt to the diversity of data to improve the accuracy of pose estimation. Specifically, we first introduce a spatial multi-scale transformer that extracts multi-scale pose features and establishes a cross-scale information transfer mechanism, which effectively explores the underlying knowledge of human motion. Then, we present a temporal multi-scale transformer that explores multi-scale dependencies between frames, enhances the adaptability of the network to human motion speed, and improves estimation accuracy through a context-aware fusion of multi-scale predictions. Moreover, we add pose transformation equivalence constraints by applying horizontal flipping, scaling, and body shape transformation to the training samples, which effectively reduces the influence of camera distance and body shape on prediction accuracy. Extensive experimental results demonstrate that our approach achieves superior performance with less computational complexity than previous state-of-the-art methods. Code is available at https://github.com/JNGao123/SMT-PTEC.
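
The idea behind a pose transformation equivalence constraint can be illustrated with horizontal flipping alone: lifting a flipped 2D pose should give the same result as flipping the lifted 3D pose, and any disagreement can be penalized as a loss. The following is a minimal sketch under stated assumptions, not the SMT-PTEC implementation. The 5-joint skeleton, the names `flip_pose` and `equivalence_loss`, and both toy lifters are hypothetical and chosen only for illustration; the abstract does not specify how the constraints are formulated.

```python
import math

# Hypothetical 5-joint skeleton (an assumption, not the paper's skeleton):
# 0 = root, 1 = left hip, 2 = right hip, 3 = left hand, 4 = right hand.
# Each pair lists the (left, right) joints that trade places under a flip.
LEFT_RIGHT_PAIRS = [(1, 2), (3, 4)]


def flip_pose(pose):
    """Mirror a 2D or 3D pose: negate x, then swap left/right joints."""
    flipped = [(-joint[0],) + tuple(joint[1:]) for joint in pose]
    for left, right in LEFT_RIGHT_PAIRS:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


def equivalence_loss(lifter, pose_2d):
    """Mean per-joint distance between lifter(flip(x)) and flip(lifter(x))."""
    pred_of_flipped = lifter(flip_pose(pose_2d))
    flipped_pred = flip_pose(lifter(pose_2d))
    dists = [math.dist(a, b) for a, b in zip(pred_of_flipped, flipped_pred)]
    return sum(dists) / len(dists)


def symmetric_lifter(pose_2d):
    """Toy 2D-to-3D lifter that commutes with flipping (depth uses |x|)."""
    return [(x, y, 0.5 * abs(x) + 0.1 * y) for x, y in pose_2d]


def biased_lifter(pose_2d):
    """Toy lifter that violates the constraint (depth uses the sign of x)."""
    return [(x, y, 0.5 * x) for x, y in pose_2d]


pose = [(0.0, 0.0), (-0.2, 0.1), (0.2, 0.1), (-0.5, -0.4), (0.6, -0.3)]
loss_symmetric = equivalence_loss(symmetric_lifter, pose)  # no violation
loss_biased = equivalence_loss(biased_lifter, pose)        # positive penalty
```

In this sketch the symmetric lifter incurs zero loss, while the biased lifter is penalized in proportion to how far each joint lies from the flip axis. Analogous terms could in principle be written for scaling and body shape transformation, but their exact form would depend on details the abstract does not give.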