Abstract: The capability to accurately estimate 3D human poses is crucial for diverse fields such as action recognition, gait recognition, and virtual/augmented reality. However, a persistent and significant challenge within this field is the accurate prediction of human poses under conditions of severe occlusion. Traditional image-based estimators struggle with heavy occlusions due to a lack of temporal context, resulting in inconsistent predictions. While video-based models benefit from processing temporal data, they encounter limitations when faced with prolonged occlusions that extend over multiple frames. This challenge arises because these models struggle to generalize beyond their training datasets, and the variety of occlusions is hard to capture in the training data. Addressing these challenges, we propose STRIDE (Single-video based TempoRally contInuous occlusion Robust 3D Pose Estimation), a novel Test-Time Training (TTT) approach to fit a human motion prior for each video. This approach specifically handles occlusions that were not encountered during the model's training. By employing STRIDE, we can refine a sequence of noisy initial pose estimates into accurate, temporally coherent poses during test time, effectively overcoming the limitations of prior methods. Our framework demonstrates flexibility by being model-agnostic, allowing us to use any off-the-shelf 3D pose estimation method for improving robustness and temporal consistency. We validate STRIDE's efficacy through comprehensive experiments on challenging datasets like Occluded Human3.6M, Human3.6M, and OCMotion, where it not only outperforms existing single-image and video-based pose estimation models but also showcases superior handling of substantial occlusions, achieving fast, robust, accurate, and temporally consistent 3D pose estimates.
Code is made publicly available at https://github.com/take2rohit/stride

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation" addresses the challenge of accurately estimating 3D human poses under severe occlusion. It focuses on three key issues:

1. **Poor performance of image-based 3D pose estimation methods under severe occlusions**: Image-based methods lack temporal context, so they tend to produce inconsistent predictions under severe occlusion. For example, when the human body is partially or completely occluded, these methods often fail to provide accurate 3D pose estimates.
2. **Limitations of video-based 3D pose estimation methods under long-term occlusions**: Video-based methods can alleviate the problems caused by partial occlusions by exploiting temporal continuity, but they still perform poorly on long-term occlusions that span multiple frames. Training data usually does not include such long-term occlusion scenarios, so the model cannot generalize to them.
3. **Insufficient generalization of existing algorithms to unseen videos**: When the occlusion patterns and imaging conditions of an unseen video differ from the training data, the performance of existing algorithms drops significantly. This limits their effectiveness in practical applications.

### Solutions

To address these challenges, the paper proposes STRIDE (Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation), a new method based on Test-Time Training (TTT). Its main contributions are:

1. **Test-Time Training (TTT)**: STRIDE adapts a pre-trained human motion prior by performing test-time training on each new video. This lets the model adapt to the occlusion patterns and distribution shift of that specific video, which improves generalization.
2. **Model-agnosticism**: STRIDE is a model-agnostic framework that can be combined with any existing 3D pose estimation method to improve temporal and spatial consistency. It can therefore enhance the output of many different estimators, not just one specific model.
3. **Efficiency and robustness**: STRIDE achieves state-of-the-art results on multiple challenging benchmark datasets and performs especially well under severe occlusion. It is also computationally efficient, running more than 2 times faster than comparable existing methods, and it does not need access to any labeled training data during inference, which makes it more privacy-friendly and storage-friendly.

### Specific technical details

1. **Learning the motion prior**: A self-attention-based motion prior is first pre-trained on a large-scale 3D pose dataset. During pre-training, noise and occlusions are introduced synthetically to simulate real-world occlusion, and the model is trained to recover temporally coherent 3D pose sequences from these noisy inputs.
2. **Test-time alignment**: For a given test video, initial noisy pose estimates for each frame are obtained with an existing 3D pose estimator (such as BEDLAM). The motion prior is then fine-tuned in an unsupervised manner to adapt to the specific motion patterns in the video. This step uses several geometric and physical constraint losses, including Limb Loss, Mean Position of Joints Point Loss (MPJP Loss), Normalized-Mean Position of Joints Point Loss (N-MPJP Loss), and Velocity Loss, to keep the generated poses consistent and plausible in both time and space.

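
The loss terms named under test-time alignment are not defined in this summary, so the following is only a minimal NumPy sketch under assumed, commonly used formulations, not the paper's implementation. The toy skeleton `BONES`, the function names, and the weights in `total_loss` are all hypothetical. Here `pred` stands for the motion prior's output and `target` for the initial noisy estimates from the off-the-shelf estimator, both shaped `(frames, joints, 3)`.

```python
import numpy as np

# Hypothetical toy skeleton: (parent, child) joint index pairs for a 5-joint chain.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def mpjp_loss(pred, target):
    """Mean Euclidean distance between corresponding joints (assumed form)."""
    return np.linalg.norm(pred - target, axis=-1).mean()


def n_mpjp_loss(pred, target, eps=1e-8):
    """Scale-normalized variant (assumed form): rescale each predicted frame by
    the least-squares optimal scalar before measuring joint distances."""
    num = (pred * target).sum(axis=(1, 2), keepdims=True)
    den = (pred * pred).sum(axis=(1, 2), keepdims=True) + eps
    return mpjp_loss(pred * (num / den), target)


def velocity_loss(pred, target):
    """Mean distance between frame-to-frame joint displacements (assumed form)."""
    return np.linalg.norm(np.diff(pred, axis=0) - np.diff(target, axis=0), axis=-1).mean()


def limb_lengths(poses, bones=BONES):
    """Bone lengths per frame, shape (frames, num_bones)."""
    parents = [p for p, _ in bones]
    children = [c for _, c in bones]
    return np.linalg.norm(poses[:, children] - poses[:, parents], axis=-1)


def limb_loss(pred, bones=BONES):
    """Variance of each bone length over time (assumed form): a rigid skeleton
    should keep constant bone lengths throughout the video."""
    return limb_lengths(pred, bones).var(axis=0).mean()


def total_loss(pred, target, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights are placeholders, not tuned values."""
    w_pos, w_norm, w_vel, w_limb = weights
    return (w_pos * mpjp_loss(pred, target)
            + w_norm * n_mpjp_loss(pred, target)
            + w_vel * velocity_loss(pred, target)
            + w_limb * limb_loss(pred))
```

In a real test-time training loop these terms would be written in an autodiff framework and minimized with respect to the motion prior's parameters; the NumPy version only illustrates what each term measures.
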
### Experimental results

The paper reports experimental validation on multiple datasets, including Human3.6M, OCMotion, and Occluded Human3.6M. The results show that STRIDE significantly outperforms existing image- and video-based 3D pose estimation methods under severe occlusion, especially on Occluded Human3.6M.

STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation

Temporal Consistent Object Pose Estimation from Monocular Videos

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

POISE: Pose Guided Human Silhouette Extraction under Occlusions

Occlusion Robust 3D Human Pose Estimation with StridedPoseGraphFormer and Data Augmentation

Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild

Occlusion Resilient 3D Human Pose Estimation

OCR-Pose: Occlusion-aware Contrastive Representation for Unsupervised 3D Human Pose Estimation

Robust 3D Human Pose Estimation from Single Images or Video Sequences

Out of the Box: A combined approach for handling occlusion in Human Pose Estimation

Multi-view Pose Fusion for Occlusion-Aware 3D Human Pose Estimation

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion

Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos

RSB-Pose: Robust Short-Baseline Binocular 3D Human Pose Estimation with Occlusion Handling

Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation

3D Human pose estimation from video via multi-scale multi-level spatial temporal features

Hybrid 3D Human Pose Estimation with Monocular Video and Sparse IMUs

Spatio-temporal Tendency Reasoning for Human Body Pose and Shape Estimation from Videos

Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation

Improving Robustness and Accuracy via Relative Information Encoding in 3D Human Pose Estimation