EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling

Songpengcheng Xia, Yu Zhang, Zhuo Su, Xiaozheng Zheng, Zheng Lv, Guidong Wang, Yongjie Zhang, Qi Wu, Lei Chu, Ling Pei
2024-12-14
Abstract: Estimating full-body motion using the tracking signals of head and hands from VR devices holds great potential for various applications. However, the sparsity and unique distribution of observations present a significant challenge, resulting in an ill-posed problem with multiple feasible solutions (i.e., hypotheses). This amplifies uncertainty and ambiguity in full-body motion estimation, especially for the lower-body joints. Therefore, we propose a new method, EnvPoser, that employs a two-stage framework to perform full-body motion estimation using sparse tracking signals and pre-scanned environment from VR devices. EnvPoser models the multi-hypothesis nature of human motion through an uncertainty-aware estimation module in the first stage. In the second stage, we refine these multi-hypothesis estimates by integrating semantic and geometric environmental constraints, ensuring that the final motion estimation aligns realistically with both the environmental context and physical interactions. Qualitative and quantitative experiments on two public datasets demonstrate that our method achieves state-of-the-art performance, highlighting significant improvements in human motion estimation within motion-environment interaction scenarios.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problems that this paper attempts to solve are the multi-solution and uncertainty problems in estimating whole-body motion from sparse tracking signals (such as head and hand signals provided by VR devices), especially in the estimation of lower-limb joints. Specifically:

1. **Challenges Brought by Sparse Observation Data**:
   - VR devices (such as PICO and Quest) usually only provide sparse tracking signals of the head and hands.
   - These sparse input signals may lead to multiple reasonable motion hypotheses corresponding to the same input, making motion estimation uncertain and ambiguous.
2. **Utilization of Environmental Information**:
   - Human motion is highly correlated with the surrounding environment. Existing methods often simplify the human-environment interaction and ignore the complex interaction details.
   - The paper proposes to reduce the uncertainty in the estimation by combining pre-scanned environmental information and guide the motion estimation results to be more in line with the actual scene.
3. **Multi-Hypothesis Motion Estimation**:
   - The multi-solution problem caused by sparse observation data requires a method that can explicitly model multiple hypotheses.
   - By introducing an uncertainty estimation module, this multi-solution problem can be better handled, thereby improving the accuracy of motion estimation.

To solve these problems, the paper proposes a new framework named EnvPoser, which consists of two stages:

- **First Stage**: Use an uncertainty-aware initial motion estimation module to explicitly model multi-hypothesis motion estimation.
- **Second Stage**: Refine the multi-hypothesis estimation by combining semantic and geometric environmental constraints to ensure that the final motion estimation result is in line with both the environmental context and physical interaction.
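As a rough illustration of the first stage's uncertainty-aware objective (the loss $L_{S-I}$ listed in the formula summary), the sketch below evaluates that loss for a single pose vector. It is not the authors' implementation: the function name `stage_one_loss`, the default weights, and in particular the reading of the $\delta$-weighted norm as a per-dimension scaling of the error are assumptions made here for illustration.

```python
import math
from typing import Sequence


def stage_one_loss(
    theta_hat: Sequence[float],
    theta: Sequence[float],
    delta: Sequence[float],
    lambda_m: float = 1.0,      # hypothetical default weight
    lambda_delta: float = 1.0,  # hypothetical default weight
) -> float:
    """Sketch of L_{S-I} = lambda_M * L_M + lambda_delta * L_delta."""
    errors = [p - t for p, t in zip(theta_hat, theta)]

    # L_M: plain squared L2 error between predicted and true pose parameters.
    l_m = sum(e * e for e in errors)

    # ASSUMPTION: the delta-weighted norm is read as sum((e_i / delta_i)^2);
    # the summary does not spell out how delta enters the norm.
    weighted = sum((e / d) ** 2 for e, d in zip(errors, delta))

    # log(||delta||_2): grows with the predicted uncertainty, so the model
    # cannot shrink the weighted term for free by inflating delta.
    log_term = math.log(math.sqrt(sum(d * d for d in delta)))

    return lambda_m * l_m + lambda_delta * (weighted + log_term)
```

Under this reading, a larger predicted uncertainty `delta` shrinks the weighted error term but raises the logarithmic term, which is the usual trade-off in uncertainty-aware regression losses.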
Through this method, EnvPoser can more accurately estimate whole-body motion based on sparse observation data and perform well in scenarios involving environmental interaction.

### Formula Summary

1. **Loss Function**:
   - Loss function in the initial stage:
     \[ L_{S-I}=\lambda_M L_M+\lambda_\delta L_\delta \]
     where:
     \[ L_M = \|\hat{\theta}-\theta\|_2^2 \]
     \[ L_\delta=\|\hat{\theta}-\theta\|_{\delta}^2+\log(\|\delta\|_2) \]
2. **Loss Function of the Environment-Aware Refinement Module**:
   - The final second-stage loss function:
     \[ L_{S-II}=L_{S-I}+L_M'+ \lambda_1 L_{posi}+\lambda_2 L_{hAL}+\lambda_3 L_{fc}+\lambda_4 L_{contact}+\lambda_5 L_{gfh}+\lambda_6 L_{gp}+\lambda_7 L_{coap} \]
     where:
     \[ L_M'=\|\hat{\theta}_{RM}-\theta\|_2^2 \]
     \[ L_{posi}=\|\hat{P}_{RM}-P\|_2^2 \]
     \[ L_{hAL}=\|\hat{P}_{hand, RM}-P_{hand}\|_1 \]
     \[ L_{fc}=\|(\hat{P}_{feet, RM}-P_{feet})\cdot C\|_1 \]
     \[ L_{gfh}=\|\hat{z}_{feet, PRM}-z_{ground}\|_1 \]
     \[ L_{gp}=\|(\hat{