Mixed Transformer for Temporal 3D Human Pose and Shape Estimation from Monocular Video

Cheng He, Xiaoliang Ma, Lei Wang, Gongbin Chen, Longhua Hu, Jun Cheng
DOI: https://doi.org/10.1109/ACAIT60137.2023.10528513
2023-01-01
Abstract: Accurate and smooth reconstruction of human motion sequences is critical for 3D human pose and shape estimation from monocular video. Image-based human reconstruction methods struggle to produce smooth results on videos. Existing video-based methods either use recurrent or convolutional neural networks to model temporal information, or use attention mechanisms to capture global human-related information. Although these methods achieve better results, they do not model temporal information and extract global human-related static information at the same time. In this paper, we propose a Mixed transformer-based Human Mesh Recovery network (MixformerHMR), which models temporal information and extracts global static information simultaneously. Specifically, we design a GRUformer, which uses a GRU as the token mixer of MetaFormer, to model temporal information, and we use a transformer-based feature extractor to extract global static features from the input video. MixformerHMR outperforms state-of-the-art methods in accuracy and achieves competitive results in smoothness on the Human3.6M, 3DPW, and MPI-INF-3DHP benchmark datasets.
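The abstract describes the GRUformer only at a high level: a MetaFormer block whose token mixer is a GRU. The NumPy sketch below illustrates that general idea and is not the authors' implementation. It assumes the usual MetaFormer layout (normalization, token mixer, residual connection, then a channel MLP with a second residual). All class names, the single-layer unidirectional GRU, the ReLU activation, the MLP ratio of 4, the parameter-free layer normalization, and the random weight initialization are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance
    # (no learnable scale or shift, to keep the sketch short).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUTokenMixer:
    """Hypothetical single-layer, unidirectional GRU that mixes tokens over time."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.uniform(-scale, scale, size=(3, dim, dim))  # input weights
        self.U = rng.uniform(-scale, scale, size=(3, dim, dim))  # recurrent weights
        self.b = np.zeros((3, dim))

    def __call__(self, x):
        # x has shape (T, dim): one feature token per video frame.
        h = np.zeros(x.shape[1])
        outputs = []
        for x_t in x:
            z = sigmoid(x_t @ self.W[0] + h @ self.U[0] + self.b[0])
            r = sigmoid(x_t @ self.W[1] + h @ self.U[1] + self.b[1])
            n = np.tanh(x_t @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
            h = (1.0 - z) * n + z * h  # blend candidate and previous state
            outputs.append(h)
        return np.stack(outputs)


class ChannelMLP:
    """Two-layer per-token MLP with ReLU (activation choice is an assumption)."""

    def __init__(self, dim, hidden_dim, rng):
        self.W1 = rng.normal(0.0, 0.02, size=(dim, hidden_dim))
        self.W2 = rng.normal(0.0, 0.02, size=(hidden_dim, dim))

    def __call__(self, x):
        return np.maximum(x @ self.W1, 0.0) @ self.W2


class GRUformerBlock:
    """MetaFormer-style block with a GRU in the token-mixer slot (illustrative)."""

    def __init__(self, dim, mlp_ratio=4, seed=0):
        rng = np.random.default_rng(seed)
        self.mixer = GRUTokenMixer(dim, rng)
        self.mlp = ChannelMLP(dim, dim * mlp_ratio, rng)

    def __call__(self, x):
        x = x + self.mixer(layer_norm(x))  # temporal mixing + residual
        x = x + self.mlp(layer_norm(x))    # per-token channel mixing + residual
        return x


# Usage: 16 frames, each with a 32-dimensional feature vector (made-up sizes).
frames = np.random.default_rng(1).normal(size=(16, 32))
out = GRUformerBlock(dim=32)(frames)
print(out.shape)  # (16, 32)
```

Because the GRU in this sketch runs forward in time only, each output token depends on the current and earlier frames. A bidirectional variant would be an equally plausible reading of the abstract.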