Abstract: Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera placed on the head of the user. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric video and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Subsequently, leveraging the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.

What problem does this paper attempt to address?

This paper addresses the problem of estimating 3D human motion from egocentric videos. Learning a direct mapping from egocentric video to human motion is hard because the user's body is often outside the field of view of the front-facing, head-mounted camera. In addition, collecting large-scale, high-quality datasets of paired egocentric videos and 3D human motion requires precise motion-capture devices, which usually restrict the recorded scenes to laboratory-like environments and make data acquisition harder still.

To solve these problems, the paper proposes a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (abbreviated as EgoEgo). The method decomposes the problem into two stages connected by the head motion as an intermediate representation:

1. **Head Pose Estimation**: Accurate head motion is predicted from the input egocentric video by combining monocular SLAM (Simultaneous Localization and Mapping) with learning methods. The key challenges here are the unknown direction of gravity, the scale difference between the estimated space and the real 3D world, and the low accuracy of monocular SLAM in estimating relative head rotation. To this end, the paper proposes a hybrid solution that uses SLAM together with learning-based models (such as a Transformer) to significantly improve the accuracy of head motion estimation from egocentric videos.
2. **Full-Body Pose Estimation**: Using the estimated head pose as input, a conditional diffusion model generates multiple plausible full-body motions.

By decoupling head pose from body pose, the method removes the need for training datasets that pair egocentric videos with 3D human poses, so large-scale single-modality datasets (for example, datasets containing only egocentric videos or only 3D human poses) can be used for learning. To evaluate the method systematically, the researchers also developed a synthetic dataset, ARES (AMASS-Replica-Ego-Syn), containing paired egocentric videos and human motion. The experimental results show that, on both ARES and real data, EgoEgo performs significantly better than the current state-of-the-art methods.
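Two of the head-pose challenges named above, the unknown gravity direction and the unknown metric scale of a monocular SLAM trajectory, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names, the choice of -Z as the world "down" axis, and the assumption that a gravity direction and a scale factor are already available (for instance from learned models) are all hypothetical and used only for illustration.

```python
import numpy as np


def rotation_aligning(a, b):
    """Return the 3x3 rotation matrix that rotates direction a onto direction b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)          # rotation axis scaled by sin(angle)
    c = float(np.dot(a, b))     # cos(angle)
    if np.isclose(c, -1.0):
        # Antiparallel vectors: rotate 180 degrees about any axis orthogonal to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    # Rodrigues-style closed form for the rotation taking a to b.
    return np.eye(3) + vx + vx @ vx / (1.0 + c)


def align_slam_trajectory(slam_positions, gravity_in_slam, scale):
    """Rotate a SLAM head trajectory so gravity points along -Z, then rescale it.

    slam_positions : (T, 3) head positions in the arbitrary SLAM frame.
    gravity_in_slam: gravity direction expressed in the SLAM frame (assumed given).
    scale          : metric units per SLAM unit (assumed given).
    """
    positions = np.asarray(slam_positions, dtype=float)
    R = rotation_aligning(gravity_in_slam, [0.0, 0.0, -1.0])
    centered = positions - positions[0]   # start the trajectory at the origin
    return scale * centered @ R.T


if __name__ == "__main__":
    # Toy trajectory moving along +Y in a SLAM frame whose gravity is -Y.
    slam = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
    print(align_slam_trajectory(slam, gravity_in_slam=[0.0, -1.0, 0.0], scale=0.5))
```

In this toy case the trajectory climbs against gravity, so after alignment it moves along +Z, and the scale factor halves the step length. The relative head rotation, which the summary says monocular SLAM estimates poorly, is deliberately left out of this sketch.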

Ego-Body Pose Estimation via Ego-Head Pose Estimation

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

Kinematics-Guided Reinforcement Learning for Object-Aware 3D Ego-Pose Estimation

SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera

SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras

xR-EgoPose: Egocentric 3D Human Pose from an HMD Camera

EgoFormer: Transformer-Based Motion Context Learning for Ego-Pose Estimation

Estimating Body and Hand Motion in an Ego-sensed World

Scene-aware Egocentric 3D Human Pose Estimation

Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video

Social EgoMesh Estimation

4D Human Body Capture from Egocentric Video via 3D Scene Grounding

3D Human Pose Perception from Egocentric Stereo Videos

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views

Estimating Egocentric 3D Human Pose in Global Space

EgoHumans: An Egocentric 3D Multi-Human Benchmark

Ego+X: an Egocentric Vision System for Global 3D Human Pose Estimation and Social Interaction Characterization

EgoHDM: An Online Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping System