Frame-Padded Multiscale Transformer for Monocular 3D Human Pose Estimation

Yuanhong Zhong, Guangxia Yang, Daidi Zhong, Xun Yang, Shanshan Wang
DOI: https://doi.org/10.1109/tmm.2023.3347095
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract: Monocular 3D human pose estimation is an ill-posed problem in computer vision due to its depth ambiguity. Most existing works supplement the depth information by extracting temporal pose features from video frames, and they have made notable progress. However, these approaches divide a long sequence of video frames into multiple short sequences for separate processing, which leads to the loss of complementary information between sequences. Furthermore, the short-term temporal correlation among frames in a sequence is often not fully exploited. To model temporal dependencies efficiently, we propose the frame-padded multiscale transformer approach, which includes a frame-padded video sequence preprocessing step and a multiscale temporal transformer backbone. Our approach addresses the omission of the temporal features of edge frames in existing approaches by padding video frames in the shallow layer. In addition, we extract the temporal information of 3D human poses using a multiscale transformer to enhance the short-term correlation of human pose skeleton keypoints. Extensive experiments validate the effectiveness of our approach on two popular datasets: Human3.6M and MPI-INF-3DHP. The results show that our approach achieves state-of-the-art performance.
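The abstract does not specify how the frame padding is carried out, so the following is only an illustrative sketch of one common way to give edge frames a full temporal context: replicating the first and last frames before cutting sliding windows. The function names `pad_sequence_edges` and `sliding_windows`, the `receptive_field` parameter, and the choice of replicate padding are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")  # e.g. a list of 2D keypoints for one video frame


def pad_sequence_edges(frames: Sequence[Frame], receptive_field: int) -> List[Frame]:
    """Replicate the first and last frames (an assumed padding scheme) so that
    every original frame can sit at the centre of a full-length window."""
    if receptive_field < 1 or receptive_field % 2 == 0:
        raise ValueError("receptive_field must be a positive odd number")
    if not frames:
        raise ValueError("frames must not be empty")
    half = receptive_field // 2
    return [frames[0]] * half + list(frames) + [frames[-1]] * half


def sliding_windows(frames: Sequence[Frame], receptive_field: int) -> List[List[Frame]]:
    """Return one window per original frame, centred on that frame."""
    padded = pad_sequence_edges(frames, receptive_field)
    return [padded[i:i + receptive_field] for i in range(len(frames))]


# Hypothetical example: frame indices stand in for per-frame keypoint arrays.
windows = sliding_windows([0, 1, 2, 3], receptive_field=3)
```

With this scheme, the first and last frames receive windows of the same length as interior frames, so a temporal model sees every frame with context on both sides rather than dropping or truncating the sequence edges.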