Abstract: Transformer-based methods have become the gold standard for 2D-to-3D human pose estimation from video sequences, largely thanks to their powerful spatial–temporal feature encoders. Previous work has concentrated on building spatial and temporal encoders from transformer blocks, reshaping the input from per-frame joint information into joint trajectories over time. Even so, the inherent limitations of the spatial–temporal structure mean that temporal information is neither fully acquired nor fully exploited. To address this issue, we propose a new model, the Spatial–Temporal–ReTemporal Transformer (STRFormer). The model uses two separate temporal transformer blocks to extract temporal motion information from video sequences: one block processes the original video sequence, while the other processes the same sequence in reversed order. The two temporal blocks are alternated with the spatial block so as to maximize the extraction of temporal-domain information, which allows a more thorough use of the temporal information in the video and a more complete picture of how the pose evolves over time. Furthermore, we introduce a new error metric, the Mean Per-Joint Position Acceleration Error (MPJAE). This metric takes into account the velocity of body parts across adjacent predicted frames, allowing a more detailed evaluation of the predicted poses. We conduct extensive experiments on several open benchmarks to evaluate the effectiveness of the proposed model. The results show that STRFormer, coupled with the MPJAE loss, achieves highly competitive results compared with other state-of-the-art models, which illustrates its potential and practical applicability in 2D-to-3D human pose estimation tasks. We plan to release our code publicly for further research.
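The abstract does not give the exact formula for MPJAE, so the following is only a minimal sketch of one common way to build an acceleration-style error between consecutive frames, not the authors' definition. It assumes that acceleration is approximated by a second finite difference of each joint's 3D position, and that the error is the mean Euclidean distance between predicted and ground-truth accelerations. The function name `mean_accel_error` and the nested-list input layout (frames, then joints, then x/y/z) are hypothetical choices made for illustration.

```python
import math


def mean_accel_error(pred, gt):
    """Mean per-joint acceleration error (illustrative sketch only).

    pred, gt: sequences of frames; each frame is a sequence of joints;
    each joint is an (x, y, z) triple. Both must have the same shape.
    Assumption: acceleration is the second finite difference over time.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of frames")
    if len(pred) < 3:
        raise ValueError("at least 3 frames are needed for a second difference")

    def accel(seq, t, j):
        # Central second difference: x[t-1] - 2*x[t] + x[t+1], per coordinate.
        return tuple(
            seq[t - 1][j][k] - 2.0 * seq[t][j][k] + seq[t + 1][j][k]
            for k in range(3)
        )

    total, count = 0.0, 0
    for t in range(1, len(pred) - 1):
        for j in range(len(pred[t])):
            # Euclidean distance between predicted and true acceleration.
            total += math.dist(accel(pred, t, j), accel(gt, t, j))
            count += 1
    return total / count


# One joint moving at constant velocity along x (true acceleration is zero).
gt_seq = [[(float(t), 0.0, 0.0)] for t in range(4)]
# Same motion, but frame 1 jitters forward by 0.5 along x.
jitter_seq = [[(0.0, 0.0, 0.0)], [(1.5, 0.0, 0.0)], [(2.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
# Same motion shifted by a constant offset in every frame.
offset_seq = [[(t + 10.0, 5.0, -2.0)] for t in range(4)]

print(mean_accel_error(jitter_seq, gt_seq))
print(mean_accel_error(offset_seq, gt_seq))
```

Under these assumptions, a prediction that is wrong by a constant offset scores zero, while frame-to-frame jitter is penalized. A measure of this kind therefore complements a per-frame position error rather than replacing it.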
