STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion

Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang
2024-01-03
Abstract: The recovery of 3D human meshes from monocular images has advanced significantly in recent years. However, existing models usually ignore spatial and temporal information, which can lead to misalignment between the mesh and the image and to temporal discontinuity. To address this, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence cues from human motion through an attention-based Temporal Coherence Fusion Module (TCFM). For spatial mesh-alignment evidence, we extract fine-grained local information by projecting the predicted mesh onto the feature maps. Building on these spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition, we propose an Average Pooling Module (APM) that lets the model attend to the entire input sequence rather than just the target frame, which markedly improves the smoothness of results recovered from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF, which achieves a state-of-the-art trade-off between precision and smoothness. Our code and more video results are on the project page.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a weakness of existing approaches to recovering 3D human meshes from video: they usually ignore spatial and temporal information, which can lead to misalignment between the mesh and the image as well as temporal discontinuity. Specifically, the paper points out:

1. **Spatial Alignment Problem**: Existing models may not align 3D meshes and images well when processing a single image, resulting in inaccurate recovery results.
2. **Temporal Coherence Problem**: When dealing with video sequences, existing models often fail to capture temporal coherence, so the recovery results are jittery or non-smooth over time.
3. **Trade-off between Precision and Smoothness**: Existing methods often sacrifice smoothness to improve recovery precision, and vice versa, so achieving a better balance between the two is a challenge.

To solve these problems, the authors propose a new Spatio-Temporal Alignment Fusion (STAF) model. STAF uses the spatial and temporal information of the input image sequence through the following three modules:

1. **Temporal Coherence Fusion Module (TCFM)**: Captures temporal coherence through a self-attention mechanism while retaining the original spatial location information, so that temporal information is learned more effectively.
2. **Spatial Alignment Fusion Module (SAFM)**: Extracts spatial features of the human body through projection sampling and uses a multi-stage adjacent feature fusion mechanism to enhance the spatial representation of the target frame.
3. **Average Pooling Module (APM)**: Enables the model to focus on the entire input sequence, not just the target frame, which significantly improves the smoothness of the recovery results.

With these modules, STAF achieves state-of-the-art performance on multiple standard benchmark datasets and a better trade-off between precision and smoothness.
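
To make the ideas behind the TCFM and APM descriptions above more concrete, here is a minimal Python sketch of self-attention across frames followed by average pooling over the sequence. It is not the authors' implementation. It assumes each frame has already been reduced to a single feature vector, uses a single attention head with identity query/key/value projections (no learned weights), and the function names `temporal_self_attention` and `average_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """Fuse per-frame features with scaled dot-product self-attention.

    frames: list of T feature vectors, each a list of D floats.
    Assumption: identity Q/K/V projections and a single head, so every
    output is a weighted average of the input frame features.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    fused = []
    for query in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frames]
        weights = softmax(scores)
        # Weighted sum of all frame features (the values).
        fused.append([sum(w * value[i] for w, value in zip(weights, frames))
                      for i in range(dim)])
    return fused


def average_pool(frames):
    """Average the features of all frames into one sequence-level vector."""
    count = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / count for i in range(dim)]


# Toy sequence of three frames with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = temporal_self_attention(features)
pooled = average_pool(fused)
```

In this sketch, `fused` keeps one vector per frame, each informed by the whole sequence, while `pooled` summarizes the sequence in a single vector. Averaging over all frames rather than reading only the target frame is the intuition the summary gives for why pooling helps smoothness; how STAF actually combines these features is specified in the paper, not here.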