Abstract: We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve the hands: the resulting kinematic and temporal constraints result in over 40% lower hand estimation errors compared to noisy monocular estimates. Project page: https://egoallo.github.io/

### What problem does this paper attempt to solve?

This paper addresses the problem of estimating a person's 3D body pose, height, and hand parameters from the egocentric view of a head-worn device such as smart glasses. The authors propose a system named **EgoAllo**, which estimates the wearer's motion from the egocentric SLAM poses and images provided by the device and expresses that motion in the allocentric coordinate frame of the scene.

#### Main challenges:

1. **Limited observability**: Most body parts (such as the hands) appear only occasionally in the egocentric image frame, while other body parameters (such as height) are never directly observed.
2. **Global consistency**: Body pose and height must be consistent with the device's ego-motion and the scene scale; for example, the estimated feet should touch the ground and the head should align with the camera height.
3. **Spatio-temporal invariance**: To improve model performance, the head motion conditioning needs a parameterization that is invariant in both space and time.

#### Solutions:

- **Conditional diffusion model**: EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters. A head motion conditioning parameterization with spatial and temporal invariance improves the model's robustness and generalization.
- **Local and global alignment**: The estimated local body parameters are transformed into the global coordinate system through global alignment, so that the estimated motion is expressed in the scene's coordinate frame.
- **Guidance losses**: Guidance losses based on physical constraints and visual hand observations further improve the accuracy of hand estimation.
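The spatial and temporal invariance criteria described above can be illustrated with a toy sketch. It is not EgoAllo's actual parameterization, which this summary does not spell out. It assumes, for illustration only, that head poses are 4x4 rigid transforms in a z-up world frame and that the conditioning consists of the relative motion between consecutive head poses plus the head height above the ground. The names `make_pose`, `head_conditioning`, and `max_abs_diff` are hypothetical helpers. The point is that such features do not change when the whole trajectory is rotated about the vertical axis and shifted along the floor (spatial invariance), nor when the time window starts at a different frame (temporal invariance).

```python
import math


def make_pose(yaw, x, y, z):
    """4x4 rigid transform: rotation by `yaw` about the vertical (z) axis plus a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def inverse_rigid(t):
    """Inverse of a rigid transform: rotation R^T, translation -R^T p."""
    rot_t = [[t[j][i] for j in range(3)] for i in range(3)]
    pos = [t[i][3] for i in range(3)]
    new_pos = [-sum(rot_t[i][k] * pos[k] for k in range(3)) for i in range(3)]
    return [rot_t[i] + [new_pos[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def head_conditioning(world_from_head):
    """Assumed toy features: per-step relative head motion and head height."""
    feats = []
    for prev, curr in zip(world_from_head, world_from_head[1:]):
        delta = matmul(inverse_rigid(prev), curr)  # motion expressed in the previous head frame
        feats.append({"delta": delta, "height": curr[2][3]})
    return feats


def max_abs_diff(feats_a, feats_b):
    """Largest element-wise difference between two feature sequences."""
    worst = 0.0
    for fa, fb in zip(feats_a, feats_b):
        worst = max(worst, abs(fa["height"] - fb["height"]))
        for row_a, row_b in zip(fa["delta"], fb["delta"]):
            for va, vb in zip(row_a, row_b):
                worst = max(worst, abs(va - vb))
    return worst


# Toy head trajectory: turning while walking, at roughly 1.6 m height.
poses = [make_pose(0.1 * t, 0.3 * t, 0.05 * t * t, 1.6 + 0.01 * t) for t in range(6)]

# Spatial check: rotate the whole world about z and shift it along the floor.
world_shift = make_pose(1.2, 4.0, -2.0, 0.0)
moved = [matmul(world_shift, p) for p in poses]
spatial_gap = max_abs_diff(head_conditioning(poses), head_conditioning(moved))

# Temporal check: a window starting two frames later yields the same per-step features.
temporal_gap = max_abs_diff(head_conditioning(poses[2:]), head_conditioning(poses)[2:])

print(spatial_gap, temporal_gap)
```

By contrast, feeding raw world-frame head poses to the model would make the conditioning depend on where the recording happened to start and which way the scene axes point, which is the kind of nuisance variation the invariance criteria are meant to remove.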
### Summary

The core problem of this paper is accurately estimating human body pose, height, and hand parameters from the egocentric view of a head-worn device and placing the results in the global coordinate system of the scene. By combining a conditional diffusion model with a spatially and temporally invariant head motion conditioning parameterization, EgoAllo improves estimation by up to 18%, and the kinematic and temporal constraints from the estimated bodies yield over 40% lower hand estimation errors compared to noisy monocular estimates.

Estimating Body and Hand Motion in an Ego-sensed World