Ego-Body Pose Estimation via Ego-Head Pose Estimation

Jiaman Li, C. Karen Liu, Jiajun Wu
2023-08-28
Abstract: Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera placed on the head of the user. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric video and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Subsequently, leveraging the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper addresses the problem of estimating 3D human motion from egocentric video. Learning a direct mapping from egocentric video to human motion is difficult because the user's body is often outside the field of view of the front-facing camera mounted on the head. In addition, collecting large-scale, high-quality datasets of paired egocentric videos and 3D human motion requires accurate motion capture devices, which usually restrict the recorded scenes to lab-like environments and make such data hard to acquire. To address these problems, the paper proposes a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages connected by head motion as an intermediate representation:

1. **Head Pose Estimation**: First, monocular SLAM (Simultaneous Localization and Mapping) is combined with learning-based models to predict accurate head motion from the input egocentric video. The key challenges here are the unknown direction of gravity, the scale difference between the SLAM-estimated space and the real 3D world, and the limited accuracy of monocular SLAM in estimating relative head rotation. The paper's hybrid solution pairs SLAM with learned models (such as Transformers) to significantly improve the accuracy of head motion estimated from egocentric video.
2. **Full-Body Pose Estimation**: Then, with the estimated head pose as input, a conditional diffusion model generates multiple plausible full-body motions.

By decoupling head and body pose, this method eliminates the need for training datasets of paired egocentric videos and 3D human motion, so large-scale single-modality datasets (for example, datasets containing only egocentric videos or only 3D human motion) can be used separately for learning. For systematic evaluation, the researchers also developed a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, EgoEgo performs significantly better than the current state-of-the-art methods.