Abstract: Transformer-based methods have emerged as the gold standard for 2D-to-3D human pose estimation from video sequences, largely thanks to their powerful spatial–temporal feature encoders. Previous work has engineered spatial and temporal encoders from transformer blocks, reshaping the input from per-frame joint information into joint trajectories over time. Even so, the inherent limitations of the spatial–temporal structure leave temporal information inadequately acquired and used. To address this issue, we propose a new model, the Spatial–Temporal–ReTemporal Transformer (STRFormer). The model employs two separate temporal transformer blocks to extract temporal motion information from video sequences: one block processes the original video sequence, while the other processes the same sequence in reversed order. This allows temporal information in the video sequences to be explored and used more thoroughly. We alternate these two temporal blocks with the spatial block to maximize the extraction of temporal-domain information, which leads to a more comprehensive representation of the pose and its evolution over time. Furthermore, we introduce a novel error metric, the Mean Per-Joint Position Acceleration Error (MPJAE). This metric takes into account body-part velocity across adjacent predicted frames, allowing a more detailed evaluation of the predicted poses. We conduct extensive experiments on several open benchmarks to evaluate the effectiveness of the proposed model. The results show that STRFormer, coupled with the MPJAE loss, achieves highly competitive results compared with other state-of-the-art models, which illustrates its potential and practical applicability in 2D-to-3D human pose estimation tasks. We plan to release our code publicly for further research.
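To make the idea of an error term based on joint motion between adjacent frames concrete, here is a minimal Python sketch. The abstract does not give the MPJAE formula, so the function names (`frame_velocities`, `mean_velocity_error`) and the formulation itself (first-order frame differences, Euclidean norm, uniform mean over joints and frame pairs) are assumptions made for illustration, not the authors' definition.

```python
import math


def frame_velocities(poses):
    """Per-joint displacement between adjacent frames.

    poses: list of frames, each frame a list of (x, y, z) joint positions.
    Returns a list with len(poses) - 1 entries, one per adjacent frame pair.
    """
    return [
        [tuple(c1 - c0 for c0, c1 in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(poses, poses[1:])
    ]


def mean_velocity_error(pred, target):
    """Hypothetical velocity-based error (an assumed formulation, not the paper's).

    Averages the Euclidean distance between predicted and ground-truth
    joint velocities over all joints and all adjacent frame pairs.
    """
    pred_vel = frame_velocities(pred)
    target_vel = frame_velocities(target)
    dists = [
        math.dist(p, t)
        for pred_frame, target_frame in zip(pred_vel, target_vel)
        for p, t in zip(pred_frame, target_frame)
    ]
    if not dists:
        raise ValueError("need at least two frames and one joint")
    return sum(dists) / len(dists)


# One joint, three frames: the ground truth is static, the prediction jumps once.
target = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)], [(3.0, 4.0, 0.0)]]
error = mean_velocity_error(pred, target)
```

A term of this kind penalizes temporal jitter that a purely per-frame position error would not capture: two predictions with the same per-frame error can differ greatly in how smoothly they move from frame to frame.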

EgoFormer: Transformer-Based Motion Context Learning for Ego-Pose Estimation

Ego-Body Pose Estimation via Ego-Head Pose Estimation

PoseConvGRU: A Monocular Approach for Visual Ego-motion Estimation by Learning

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

Motion Imitation of a Humanoid Robot Via Pose Estimation

3D Human Pose Perception from Egocentric Stereo Videos

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video

Kinematics-Guided Reinforcement Learning for Object-Aware 3D Ego-Pose Estimation

SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera

Scene-aware Egocentric 3D Human Pose Estimation

xR-EgoPose: Egocentric 3D Human Pose from an HMD Camera

Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions

SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras

3D Human Pose Estimation with Spatial and Temporal Transformers

Ego+X: an Egocentric Vision System for Global 3D Human Pose Estimation and Social Interaction Characterization

STRFormer: Spatial–Temporal–ReTemporal Transformer for 3D Human Pose Estimation

4D Human Body Capture from Egocentric Video via 3D Scene Grounding

Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views

EgoCap and EgoFormer: First-person image captioning with context fusion