DSTFormer: 3D Human Pose Estimation with a Dual-scale Spatial and Temporal Transformer Network

Shaokun Zhang, Xinde Li, Chuanfei Hu, Jianping Xu, Huaping Liu
DOI: https://doi.org/10.1109/icarm62033.2024.10715863
2024-01-01
Abstract: Recent transformer-based methods for estimating 3D human pose have gained widespread attention and achieved state-of-the-art results. Previous methods have primarily focused on capturing motion patterns of the human body either at a single scale or by cascading multiple scales, such as joints, bones, and body parts. However, because human motion patterns are complex, these methods struggle to capture spatial-temporal motion patterns at different scales simultaneously. To address this issue, we propose the Dual-scale Spatial and Temporal transFormer (DSTFormer), which concurrently explores the spatial dependencies and temporal motion patterns of human joints and bones. Additionally, we introduce a Gcn-Spatial Transformer Block (GSTB), which integrates Graph Convolutional Networks (GCN) into the transformer to enhance the exploitation of local relationships and global information between adjacent joints or bones. Extensive experiments on the Human3.6M benchmark dataset show superior results compared with other state-of-the-art methods. More remarkably, our model achieves the best published performance to date, with P1 errors of 37.9 mm and 15.6 mm, respectively.
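The P1 errors quoted above are given in millimetres. Assuming "P1" refers to the mean per-joint position error (MPJPE) commonly reported under Protocol 1 on Human3.6M (the abstract does not define it), the metric can be sketched as below. The function name `mpjpe` and the toy coordinates are illustrative assumptions, not taken from the paper or its code.

```python
import math

def mpjpe(predicted, target):
    """Mean per-joint position error between two 3D poses.

    Both arguments are sequences of (x, y, z) joint coordinates in
    millimetres, listed in the same joint order.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance for each joint, then the average over all joints.
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)

# Toy example with three hypothetical joints (not Human3.6M data).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (10.0, 0.0, 12.0)]
error_mm = mpjpe(pred, gt)  # per-joint distances are 0, 5 and 12 mm
```

In a benchmark evaluation this per-pose value would typically be averaged over all frames of the test set; that aggregation is omitted here.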