STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation

Rohit Lal, Saketh Bachu, Yash Garg, Arindam Dutta, Calvin-Khang Ta, Dripta S. Raychaudhuri, Hannah Dela Cruz, M. Salman Asif, Amit K. Roy-Chowdhury
2024-12-04
Abstract: The capability to accurately estimate 3D human poses is crucial for diverse fields such as action recognition, gait recognition, and virtual/augmented reality. However, a persistent and significant challenge within this field is the accurate prediction of human poses under conditions of severe occlusion. Traditional image-based estimators struggle with heavy occlusions due to a lack of temporal context, resulting in inconsistent predictions. While video-based models benefit from processing temporal data, they encounter limitations when faced with prolonged occlusions that extend over multiple frames. This challenge arises because these models struggle to generalize beyond their training datasets, and the variety of occlusions is hard to capture in the training data. Addressing these challenges, we propose STRIDE (Single-video based TempoRally contInuous occlusion Robust 3D Pose Estimation), a novel Test-Time Training (TTT) approach to fit a human motion prior for each video. This approach specifically handles occlusions that were not encountered during the model's training. By employing STRIDE, we can refine a sequence of noisy initial pose estimates into accurate, temporally coherent poses during test time, effectively overcoming the limitations of prior methods. Our framework demonstrates flexibility by being model-agnostic, allowing us to use any off-the-shelf 3D pose estimation method for improving robustness and temporal consistency. We validate STRIDE's efficacy through comprehensive experiments on challenging datasets like Occluded Human3.6M, Human3.6M, and OCMotion, where it not only outperforms existing single-image and video-based pose estimation models but also showcases superior handling of substantial occlusions, achieving fast, robust, accurate, and temporally consistent 3D pose estimates.
Code is made publicly available at https://github.com/take2rohit/stride
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation" addresses the challenge of accurately estimating 3D human poses under severe occlusion. Specifically, it focuses on the following key issues:

1. **Poor performance of image-based 3D pose estimation methods under severe occlusions**:
   - Existing image-based 3D pose estimation methods lack temporal context, so they tend to produce inconsistent predictions under severe occlusion. For example, when the human body is partially or completely occluded, these methods often fail to provide accurate 3D pose estimates.
2. **Limitations of video-based 3D pose estimation methods under long-term occlusions**:
   - Video-based methods can alleviate the problems caused by partial occlusions by exploiting temporal continuity, but they still perform poorly on long-term occlusions that span multiple frames. Training data usually does not include such long-term occlusion scenarios, so the models cannot generalize to them.
3. **Insufficient generalization ability of existing algorithms on unseen videos**:
   - On unseen videos, especially when the occlusion patterns and imaging conditions differ from the training data, the performance of existing algorithms drops significantly. This limits their effectiveness in practical applications.

### Solutions

To address the above challenges, the paper proposes STRIDE (Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation), a new method based on Test-Time Training (TTT). The main contributions of STRIDE include:

1. **Test-Time Training (TTT)**:
   - STRIDE adapts a pre-trained human motion prior by performing test-time training on each new video. This process enables the model to adapt to the occlusion patterns and distribution shift of a specific video, thereby improving generalization.
2. **Model-agnosticism**:
   - STRIDE is a model-agnostic framework that can be combined with any existing 3D pose estimation method to improve temporal and spatial consistency. It can therefore enhance the performance of many different methods, not just one specific model.
3. **Efficiency and robustness**:
   - STRIDE achieves state-of-the-art results on multiple challenging benchmark datasets and performs especially well under severe occlusion. It is also computationally efficient, being more than 2 times faster than existing similar methods, and it does not need access to any labeled training data during inference, which makes it more privacy-friendly and storage-friendly.

### Specific technical details

1. **Learning motion prior**:
   - First, a self-attention-based motion prior is pre-trained on a large-scale 3D pose dataset. During pre-training, noise and occlusions are synthetically introduced to simulate real-world occlusion, and the model is trained to recover temporally coherent 3D pose sequences from these noisy inputs.
2. **Test-time alignment**:
   - For a given test video, initial noisy pose estimates for each frame are first obtained using an existing 3D pose estimator (such as BEDLAM). Then, the motion prior is fine-tuned in an unsupervised manner to adapt to the specific motion patterns in the video. This process uses several geometric and physical constraint loss functions, including Limb Loss, Mean Position of Joints Point Loss (MPJP Loss), Normalized-Mean Position of Joints Point Loss (N-MPJP Loss), and Velocity Loss, to ensure that the generated poses are consistent and plausible in both time and space.
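
The four loss terms named above can be illustrated with a minimal NumPy sketch. The summary does not give their exact formulas, so the definitions below are assumptions based on how such terms are commonly defined: bone-length variance over time for the limb term, mean per-joint distance for MPJP, the same distance after an optimal per-frame rescaling for N-MPJP, and the distance between frame-to-frame displacements for velocity. The skeleton `BONES`, the function names, the equal weights, and the choice to compare refined poses against the initial noisy estimates are all hypothetical and are not taken from the paper or its code.

```python
import numpy as np

# Hypothetical 5-joint chain skeleton given as (parent, child) index pairs.
# A real skeleton (e.g. 17 joints) would use its own bone list.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def limb_loss(poses, bones=BONES):
    """Temporal variance of each bone length, averaged over bones.

    poses has shape (T, J, 3). A rigid skeleton yields a value of ~0.
    """
    parents = poses[:, [p for p, _ in bones]]   # (T, B, 3)
    children = poses[:, [c for _, c in bones]]  # (T, B, 3)
    lengths = np.linalg.norm(children - parents, axis=-1)  # (T, B)
    return float(lengths.var(axis=0).mean())


def mpjp_loss(pred, ref):
    """Mean Euclidean distance between corresponding joints."""
    return float(np.linalg.norm(pred - ref, axis=-1).mean())


def n_mpjp_loss(pred, ref, eps=1e-8):
    """MPJP after rescaling each predicted frame by its least-squares scale."""
    num = (pred * ref).sum(axis=(1, 2), keepdims=True)
    den = (pred * pred).sum(axis=(1, 2), keepdims=True) + eps
    return mpjp_loss(pred * (num / den), ref)


def velocity_loss(pred, ref):
    """Mean distance between frame-to-frame joint displacements."""
    return mpjp_loss(np.diff(pred, axis=0), np.diff(ref, axis=0))


def total_loss(pred, ref, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights here are placeholders."""
    terms = (
        limb_loss(pred),
        mpjp_loss(pred, ref),
        n_mpjp_loss(pred, ref),
        velocity_loss(pred, ref),
    )
    return float(sum(w * t for w, t in zip(weights, terms)))
```

In a test-time training loop of this kind, `pred` would be the motion prior's output for one video and `ref` the initial per-frame estimates, with the combined loss minimized by gradient descent in an autodiff framework. NumPy is used here only to make the arithmetic of each term explicit.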

### Experimental results

The paper reports experiments on multiple datasets, including Human3.6M, OCMotion, and Occluded Human3.6M. The results show that STRIDE significantly outperforms existing image- and video-based 3D pose estimation methods under severe occlusion, especially on Occluded Human3.6M.