Abstract: Capturing cross-pose correlations from a sequence of frame-level 2D poses is essential for 3D human pose estimation (3D-HPE) in video. Recent studies have shown the promise of modeling pose relations with feature-mixing operations in the temporal domain. However, they seldom consider the interaction across poses in the frequency domain. This paper studies a Frequency-Temporal Collaborative Module (FTCM) to explore the feasibility of encoding cross-pose correlations in both the frequency and temporal domains. FTCM aims to jointly capture global and local cross-pose correlations with a more lightweight network model. Specifically, FTCM splits the pose features into two groups along the channel dimension and models the frequency and temporal interactions across poses in parallel, using a different feature-mixing operation for each group. To this end, we design two pose-mixing units: frequency pose-mixing (FPM) and temporal pose-mixing (TPM). FPM captures the global correlations among different pose frequencies using the representation obtained by applying the fast Fourier transform (FFT) to the original pose signals. Unlike the pose-mixing used by previous methods such as Transformers, in which each pose is influenced by all other poses, TPM locally calibrates each pose with dynamics aggregated from several adjacent poses in the temporal domain, explicitly weighting neighboring poses more heavily than distant ones so as to enforce a strict locality constraint. In addition, the grouping strategy significantly reduces model complexity. To verify the effectiveness of FTCM, we conduct extensive experiments on two benchmarks, Human3.6M and MPI-INF-3DHP. The results show that FTCM offers a favorable accuracy/complexity trade-off and achieves performance superior or comparable to state-of-the-art methods on both datasets.
The code and model are publicly available at: https://github.com/zhenhuat/FTCM.
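
The channel split and the two parallel mixing paths described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names, the per-frequency complex filter `freq_weight`, the depthwise temporal kernel `kernel`, and the even channel split are all assumptions chosen for illustration, and the actual FPM and TPM units may differ in their details.

```python
import numpy as np


def frequency_pose_mixing(x, freq_weight):
    """Hypothetical FPM stand-in: mix poses globally in the frequency domain.

    x:           (T, C) real-valued pose features over T frames.
    freq_weight: (T // 2 + 1, C) complex filter (assumed learnable).
    """
    num_frames = x.shape[0]
    spectrum = np.fft.rfft(x, axis=0)      # FFT along the temporal axis
    spectrum = spectrum * freq_weight      # every bin depends on all frames
    return np.fft.irfft(spectrum, n=num_frames, axis=0)


def temporal_pose_mixing(x, kernel):
    """Hypothetical TPM stand-in: depthwise mixing over adjacent frames only.

    x:      (T, C) real-valued pose features.
    kernel: (K, C) per-channel weights over a window of K frames (K odd).
    """
    num_frames = x.shape[0]
    window = kernel.shape[0]
    pad = window // 2
    # Repeat the boundary frames so the output keeps the same length.
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros(x.shape, dtype=float)
    for k in range(window):
        out += kernel[k] * padded[k:k + num_frames]
    return out


def ftcm_sketch(x, freq_weight, kernel):
    """Split channels into two groups, mix each in its own domain, re-join."""
    half = x.shape[1] // 2                 # assumed even split
    x_freq, x_temp = x[:, :half], x[:, half:]
    y_freq = frequency_pose_mixing(x_freq, freq_weight)
    y_temp = temporal_pose_mixing(x_temp, kernel)
    return np.concatenate([y_freq, y_temp], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, channels = 27, 8               # illustrative sizes only
    poses = rng.standard_normal((frames, channels))
    w = rng.standard_normal((frames // 2 + 1, channels // 2)) + 0j
    k = np.full((3, channels // 2), 1.0 / 3.0)  # average over 3 frames
    print(ftcm_sketch(poses, w, k).shape)
```

In this sketch each unit processes only half of the channels, which is one way to see why the grouping reduces cost compared with running both operations on the full feature width. The frequency path lets a change in one frame reach every output frame, while the temporal path confines it to the kernel window.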


FTCM: Frequency-Temporal Collaborative Module for Efficient 3D Human Pose Estimation in Video
