Abstract: Understanding social interactions from egocentric views is crucial for many applications, ranging from assistive robotics to AR/VR. Key to reasoning about interactions is to understand the body pose and motion of the interaction partner from the egocentric view. However, research in this area is severely hindered by the lack of datasets. Existing datasets are limited in terms of either size, capture/annotation modalities, ground-truth quality, or interaction diversity. We fill this gap by proposing EgoBody, a novel large-scale dataset for human pose, shape and motion estimation from egocentric views, during interactions in complex 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human shapes and poses relative to the scene, over time. We collect 125 sequences, spanning diverse interaction scenarios, and propose the first benchmark for 3D full-body pose and shape estimation of the social partner from egocentric views. We extensively evaluate state-of-the-art methods, highlight their limitations in the egocentric scenario, and address such limitations leveraging our high-quality annotations. Data and code are available at https://sanweiliti.github.io/egobody/egobody.html.

What problem does this paper attempt to address?

The problem this paper addresses is estimating the 3D human pose, shape and motion of an interaction partner from the first-person (egocentric) view of a head-mounted device (HMD). The paper points out that understanding social interactions is crucial for many applications, such as assistive robotics, augmented reality (AR) and virtual reality (VR). However, research in this area is limited by existing datasets in terms of size, capture/annotation modalities, ground-truth quality or interaction diversity. The paper therefore proposes EgoBody, a novel large-scale dataset that aims to fill these gaps and provide the high-quality 3D ground truth required for human pose, shape and motion estimation from the first-person perspective.

### Specific problems solved by the paper:

1. **Insufficient datasets**: Existing datasets are either small, lack high-quality 3D ground truth, or cover too few interaction scenarios. These shortcomings impede research on 3D human pose and motion estimation from the first-person perspective.
2. **Technical challenges**: Data captured from the first-person view of a head-mounted device poses unique challenges, such as severe body truncation, motion blur (aggravated by the physical movement of the HMD), and people entering and leaving the field of view. Existing methods handle these challenges poorly.
3. **Lack of benchmarking**: There is currently no benchmark dedicated to 3D human pose and shape estimation from the first-person perspective, which makes it difficult to evaluate how existing methods perform in practical applications.

### Main contributions of the paper:

1. **EgoBody dataset**: Propose EgoBody, a large-scale first-person dataset that contains high-quality 3D human pose, shape and motion ground truth, as well as rich multi-modal data (RGB, depth, eye tracking, etc.).
2. **Benchmarking**: Establish the first benchmark for 3D human pose and shape estimation from the first-person perspective, and conduct extensive evaluations of existing methods, revealing their limitations in this specific setting.
3. **Performance improvement**: By fine-tuning existing methods on the EgoBody training set, significantly improve their performance and robustness on the EgoBody test set and on other first-person datasets.

### Key technologies in the solution:

- **Multi-modal data acquisition**: Use Microsoft HoloLens2 headsets and multiple Azure Kinect cameras to synchronously acquire multi-modal data, including RGB images, depth maps and eye tracking.
- **High-quality 3D ground-truth generation**: Use markerless motion capture, combining multi-view RGB-D data with the SMPL-X body model, to reconstruct accurate 3D human pose and shape.
- **Calibration and synchronization**: Refine the calibration between HoloLens2 and Kinect through a keypoint-based optimization scheme to ensure data consistency and accuracy.
- **Temporal consistency optimization**: Obtain natural and coherent human motion by temporally smoothing the preliminary per-frame estimates.

In conclusion, by constructing the EgoBody dataset and its benchmark, this paper provides important resources for 3D human pose and shape estimation from the first-person perspective and promotes research progress in this area.
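The temporal consistency step mentioned above can be illustrated with a small sketch. This is not the paper's optimization, which operates on SMPL-X parameters; it is a generic least-squares smoother that assumes each per-frame estimate is a single scalar (for example one coordinate of one joint). The function name `smooth_track` and the weight `lam` are hypothetical and chosen only for illustration.

```python
def smooth_track(values, lam=5.0):
    """Smooth a per-frame scalar signal y by minimising
        sum_t (x_t - y_t)^2 + lam * sum_t (x_{t+1} - x_t)^2.
    Illustrative only: a generic smoother, not the EgoBody pipeline.
    """
    n = len(values)
    if n < 2:
        return list(values)

    # Normal equations form a tridiagonal system (I + lam * D^T D) x = y,
    # where D takes differences between consecutive frames.
    diag = [1.0 + lam * (1.0 if t in (0, n - 1) else 2.0) for t in range(n)]
    off = -lam  # constant sub- and super-diagonal entry

    # Thomas algorithm, forward sweep.
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag[0]
    d[0] = values[0] / diag[0]
    for t in range(1, n):
        denom = diag[t] - off * c[t - 1]
        c[t] = off / denom
        d[t] = (values[t] - off * d[t - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for t in range(n - 2, -1, -1):
        x[t] = d[t] - c[t] * x[t + 1]
    return x


def roughness(signal):
    """Sum of squared frame-to-frame differences (lower means smoother)."""
    return sum((b - a) ** 2 for a, b in zip(signal, signal[1:]))
```

Calling `smooth_track` on a jittery sequence such as `[0.0, 1.0, 0.0, 1.0, 0.0, 1.0]` returns a sequence with lower `roughness` than the input, while a constant sequence is returned unchanged. A larger `lam` trades fidelity to the per-frame estimates for smoother motion.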

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

Ego-Body Pose Estimation via Ego-Head Pose Estimation

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

EgoHumans: An Egocentric 3D Multi-Human Benchmark

SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras

Social EgoMesh Estimation

4D Human Body Capture from Egocentric Video via 3D Scene Grounding

SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera

xR-EgoPose: Egocentric 3D Human Pose from an HMD Camera

Ego+X: an Egocentric Vision System for Global 3D Human Pose Estimation and Social Interaction Characterization

EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs

You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions

Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

Estimating Body and Hand Motion in an Ego-sensed World

BEHAVE: Dataset and Method for Tracking Human Object Interactions

MEEV: Body Mesh Estimation On Egocentric Video

EgoAvatar: Egocentric View-Driven and Photorealistic Full-body Avatars

EgoHDM: An Online Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping System

Egok360: A 360 Egocentric Kinetic Human Activity Video Dataset

3D Human Pose Perception from Egocentric Stereo Videos