Cross-Space-Time 3D Human Body Pose Estimation Based on Transformer

Yuhao Zhang, Huang Hua, Qishen Li, Penghui Chen
DOI: https://doi.org/10.1109/icctit60726.2023.10435911
2023-01-01
Abstract: Accurately estimating 3D human pose from monocular video is challenging, mainly because of depth ambiguity and self-occlusion. We observe that the motion of the same joint differs significantly over time, yet previous methods could not effectively model the correspondence of the same joint across different times. To address this, we propose a Transformer-based design that we call the Cross Connection Transformer (CCT). CCT learns the dependencies between joints across time steps and stages, then draws on information from multiple stages for cross attention and feature fusion. The method has two stages. In the first, a spatio-temporal encoder captures the temporal motion patterns of individual joints and learns the spatial correlations between joints. In the second, the model learns cross-space interactions and fuses features from the different spaces to produce the final 3D pose. Two novel interaction modules explicitly encode local and global dependencies between body joints, yielding a rich joint representation. This is crucial for capturing small changes across frames, that is, the relationships between features. In experiments on the challenging Human3.6M and MPI-INF-3DHP datasets, CCT performs well and outperforms the advanced pure Transformer method MixSTE. Notably, it improves on the best previously reported results on Human3.6M by 2%.
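The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' CCT implementation: the function names, the feature width `dim`, the random weights, the omission of learned query/key/value projections, and the choice to fuse by summation are all assumptions made for illustration. It only shows the general pattern of spatial attention over joints, temporal attention over frames, and cross attention between the two feature sets.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; leading axes act as batch axes.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_space_time_sketch(pose_2d, dim=16, seed=0):
    """Hypothetical two-stage sketch: (T, J, 2) 2D poses -> (T, J, 3) 3D poses."""
    rng = np.random.default_rng(seed)
    # Random, untrained weights (assumption): an embedding and a regression head.
    w_embed = rng.normal(scale=0.1, size=(2, dim))
    w_head = rng.normal(scale=0.1, size=(dim, 3))
    x = pose_2d @ w_embed                                  # (T, J, dim)

    # Stage 1a: spatial attention among the joints of each frame.
    spatial = attention(x, x, x)                           # (T, J, dim)
    # Stage 1b: temporal attention over the frames of each joint.
    xt = np.swapaxes(x, 0, 1)                              # (J, T, dim)
    temporal = np.swapaxes(attention(xt, xt, xt), 0, 1)    # (T, J, dim)

    # Stage 2: cross attention. Spatial features act as queries and
    # temporal features as keys/values, per joint across frames.
    q = np.swapaxes(spatial, 0, 1)                         # (J, T, dim)
    kv = np.swapaxes(temporal, 0, 1)                       # (J, T, dim)
    cross = np.swapaxes(attention(q, kv, kv), 0, 1)        # (T, J, dim)

    # Feature fusion by simple summation (assumption), then regress 3D joints.
    fused = spatial + temporal + cross
    return fused @ w_head                                  # (T, J, 3)

# Example input: 9 frames of 17 joints with random 2D coordinates.
example_pose_2d = np.random.default_rng(1).normal(size=(9, 17, 2))
example_pose_3d = cross_space_time_sketch(example_pose_2d)
```

In a trained model, each attention call would use learned projections, multiple heads, residual connections, and normalization; the sketch leaves these out to keep the data flow between the spatial, temporal, and cross-attention steps visible.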