Abstract: The recovery of 3D human meshes from monocular images has advanced significantly in recent years. However, existing models usually ignore spatial and temporal information, which can lead to misalignment between mesh and image and to temporal discontinuity. To address this, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence cues from human motion through an attention-based Temporal Coherence Fusion Module (TCFM). For spatial mesh-alignment evidence, we extract fine-grained local information by projecting the predicted mesh onto the feature maps. Based on these spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition, we propose an Average Pooling Module (APM) that lets the model attend to the entire input sequence rather than just the target frame, which markedly improves the smoothness of results recovered from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF. We achieve a state-of-the-art trade-off between precision and smoothness. Our code and more video results are on the project page.
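The abstract describes extracting local information by projecting the predicted mesh onto the feature maps. A minimal sketch of that idea follows. It is not the authors' implementation: the weak-perspective camera (`scale`, `tx`, `ty`), the `[H][W][C]` nested-list feature map, and all function names are assumptions made for illustration. Each 3D vertex is projected to normalized image coordinates, and a feature vector is read from the map at that location by bilinear interpolation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def project_weak_perspective(
    vertices: Sequence[Vec3], scale: float, tx: float, ty: float
) -> List[Tuple[float, float]]:
    """Project 3D vertices to normalized image coords in [-1, 1].

    Assumed camera model (hypothetical): depth is dropped, then the
    x/y coordinates are translated and scaled.
    """
    return [(scale * (x + tx), scale * (y + ty)) for x, y, _z in vertices]


def bilinear_sample(fmap: FeatureMap, u: float, v: float) -> List[float]:
    """Read a C-dim feature at normalized (u, v), clamped to the map."""
    h, w = len(fmap), len(fmap[0])
    # Normalized [-1, 1] -> pixel coordinates, corners aligned to pixel centers.
    px = min(max((u + 1.0) * 0.5 * (w - 1), 0.0), w - 1.0)
    py = min(max((v + 1.0) * 0.5 * (h - 1), 0.0), h - 1.0)
    x0, y0 = int(px), int(py)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = px - x0, py - y0
    return [
        (1 - ax) * (1 - ay) * fmap[y0][x0][c]
        + ax * (1 - ay) * fmap[y0][x1][c]
        + (1 - ax) * ay * fmap[y1][x0][c]
        + ax * ay * fmap[y1][x1][c]
        for c in range(len(fmap[0][0]))
    ]


def mesh_aligned_features(
    fmap: FeatureMap, vertices: Sequence[Vec3], scale: float, tx: float, ty: float
) -> List[List[float]]:
    """One sampled feature vector per projected mesh vertex."""
    points = project_weak_perspective(vertices, scale, tx, ty)
    return [bilinear_sample(fmap, u, v) for u, v in points]


if __name__ == "__main__":
    # Toy 2x2 single-channel feature map and three toy vertices.
    toy_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
    toy_vertices = [(0.0, 0.0, 5.0), (-1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]
    print(mesh_aligned_features(toy_map, toy_vertices, 1.0, 0.0, 0.0))
```

In a real pipeline the feature map would come from a CNN backbone and the sampling would typically be done on tensors in a deep-learning framework; the nested lists here only keep the sketch dependency-free.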

What problem does this paper attempt to address?

This paper addresses a weakness of existing methods for recovering 3D human meshes from video: they usually ignore spatial and temporal information, which can cause misalignment between the mesh and the image as well as temporal discontinuity. Specifically, the paper points out:

1. **Spatial Alignment Problem**: Existing models may fail to align the 3D mesh with the image when processing a single frame, producing inaccurate recovery results.
2. **Temporal Coherence Problem**: On video sequences, existing models often fail to capture temporal coherence, so the recovered results jitter or are not smooth over time.
3. **Trade-off between Precision and Smoothness**: Existing methods often sacrifice smoothness to improve recovery precision, and vice versa, so achieving a better balance between the two remains a challenge.

To solve these problems, the authors propose a new Spatio-Temporal Alignment Fusion (STAF) model. STAF exploits the spatial and temporal information in the input image sequence through three modules:

1. **Temporal Coherence Fusion Module (TCFM)**: Captures temporal coherence with a self-attention mechanism while retaining the original spatial location information, so that temporal information is learned more effectively.
2. **Spatial Alignment Fusion Module (SAFM)**: Extracts spatial features of the human body through projection sampling and uses a multi-stage adjacent feature fusion mechanism to enhance the spatial representation of the target frame.
3. **Average Pooling Module (APM)**: Lets the model attend to the entire input sequence, not just the target frame, which significantly improves the smoothness of the recovered results.

With these contributions, STAF achieves state-of-the-art performance on multiple standard benchmark datasets and a better trade-off between precision and smoothness.
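The two temporal ideas in the summary, self-attention across frames and average pooling over the whole sequence, can be illustrated with a small dependency-free sketch. This is a hedged illustration rather than the STAF architecture: it uses identity query/key/value projections instead of learned ones, adds a residual connection as an assumption, and treats each frame as a flat feature vector. The names `temporal_self_attention` and `average_pool` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # T frames, each a D-dim feature vector


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames: Matrix) -> Matrix:
    """Scaled dot-product self-attention across the time axis.

    Simplifications (assumptions, not from the paper): identity Q/K/V
    projections, a single head, and a residual connection.
    """
    dim = len(frames[0])
    fused = []
    for query in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        # Weighted sum of all frame features.
        attended = [
            sum(w * value[i] for w, value in zip(weights, frames))
            for i in range(dim)
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


def average_pool(frames: Matrix) -> List[float]:
    """Mean feature over all T frames, so every frame contributes equally."""
    count = len(frames)
    return [sum(f[i] for f in frames) / count for i in range(len(frames[0]))]


if __name__ == "__main__":
    sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused = temporal_self_attention(sequence)
    print(average_pool(fused))
```

Averaging over the sequence gives every frame the same weight in the pooled feature, which is one plausible reason such pooling damps frame-to-frame jitter; how STAF combines the pooled feature with the target-frame feature is not specified in this summary.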

STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion
