Spatial-temporal-spectral Transformer for 3D Human Pose Estimation.

Yongpeng Wu, Dehui Kong, Shaofan Wang, Jinghua Li, Baocai Yin
DOI: https://doi.org/10.1109/hpcc-dss-smartcity-dependsys53884.2021.00194
2021-01-01
Abstract: Human motion exhibits high spatial-temporal correlation, and exploiting the intrinsic correlation among joint motion trajectories can improve the performance of 3D pose estimation. We therefore propose a novel spatial-temporal-spectral transformer for high-quality 3D human pose estimation in videos, which consists mainly of a spatial-temporal transformer at the joint level and a spectral transformer at the joint trajectory level. The former explores joint-level dependencies, drawn from both the skeleton graph structure and the frame sequence, to obtain a richer feature representation. The latter explores the dependencies among joint motion trajectories in the spectral domain. To obtain a more accurate 3D pose estimate for the center frame, a multi-layer strided convolution module reduces the full frame sequence to the center frame. In addition, since the 2D and 3D pose sequences have the same motion trajectory in the $xy$ plane, we add a consistency constraint to obtain more accurate estimation results. Extensive experiments show that the proposed framework achieves state-of-the-art performance on Human3.6M.
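The two trajectory-level ideas in the abstract can be illustrated with a minimal sketch. The abstract does not say which spectral transform or which form of consistency constraint the authors use, so everything below is an assumption made for illustration: the discrete cosine transform (DCT) stands in for the spectral-domain mapping, the consistency term compares frame-to-frame displacements, and the function names `dct_basis`, `to_spectral`, and `xy_consistency` are hypothetical. The sketch also assumes the 2D and 3D poses share the same normalized coordinate scale.

```python
import numpy as np


def dct_basis(num_frames: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (num_frames, num_frames).

    ASSUMPTION: the DCT is used here only as one plausible spectral
    transform; the abstract does not name the transform.
    """
    n = np.arange(num_frames)
    k = n[:, None]
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * k / (2 * num_frames))
    basis *= np.sqrt(2.0 / num_frames)
    basis[0] /= np.sqrt(2.0)  # rescale the constant (k = 0) row
    return basis


def to_spectral(poses: np.ndarray) -> np.ndarray:
    """Map joint trajectories to spectral coefficients along time.

    poses: array of shape (T, J, C) with T frames, J joints, C coordinates.
    Returns an array of the same shape indexed by frequency instead of time.
    """
    return np.einsum("kt,tjc->kjc", dct_basis(poses.shape[0]), poses)


def xy_consistency(pose_2d: np.ndarray, pose_3d: np.ndarray) -> float:
    """Hypothetical xy-plane trajectory consistency term.

    Compares frame-to-frame displacements of the 2D input, shape (T, J, 2),
    with those of the xy part of the 3D estimate, shape (T, J, 3).
    ASSUMPTION: both sequences use the same normalized coordinate scale.
    """
    motion_2d = np.diff(pose_2d, axis=0)
    motion_3d_xy = np.diff(pose_3d[..., :2], axis=0)
    return float(np.mean(np.abs(motion_2d - motion_3d_xy)))


# Toy data: 9 frames, 17 joints (the joint count is an illustrative choice).
rng = np.random.default_rng(0)
pose_3d = rng.normal(size=(9, 17, 3))
pose_2d = pose_3d[..., :2] + 0.01 * rng.normal(size=(9, 17, 2))

coeffs = to_spectral(pose_3d)              # trajectory-level spectral features
penalty = xy_consistency(pose_2d, pose_3d)  # small when xy motion agrees
```

Because the displacement-based term ignores constant offsets, it penalizes only disagreement in motion, which matches the abstract's statement about shared trajectories rather than shared absolute positions.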