Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo, Siyu Tang
Abstract: Understanding social interactions from egocentric views is crucial for many applications, ranging from assistive robotics to AR/VR. Key to reasoning about interactions is to understand the body pose and motion of the interaction partner from the egocentric view. However, research in this area is severely hindered by the lack of datasets. Existing datasets are limited in terms of either size, capture/annotation modalities, ground-truth quality, or interaction diversity. We fill this gap by proposing EgoBody, a novel large-scale dataset for human pose, shape and motion estimation from egocentric views, during interactions in complex 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human shapes and poses relative to the scene, over time. We collect 125 sequences, spanning diverse interaction scenarios, and propose the first benchmark for 3D full-body pose and shape estimation of the social partner from egocentric views. We extensively evaluate state-of-the-art methods, highlight their limitations in the egocentric scenario, and address such limitations leveraging our high-quality annotations. Data and code are available at https://sanweiliti.github.io/egobody/egobody.html.
What problem does this paper attempt to address?
The paper addresses the estimation of the 3D pose, shape and motion of an interaction partner from the first-person (egocentric) view of a head-mounted device (HMD). It argues that understanding social interactions is crucial for applications such as assistive robotics, augmented reality (AR) and virtual reality (VR), but that research in this area is held back by existing datasets, which are limited in size, capture/annotation modalities, ground-truth quality, or interaction diversity. To fill this gap, the paper proposes EgoBody, a novel large-scale dataset that provides the high-quality 3D ground truth needed for human pose, shape and motion estimation from the first-person perspective.
### Specific problems solved by the paper:
1. **Insufficient datasets**: Existing datasets are limited in scale, lack high-quality 3D ground truth, or cover too few interaction scenarios. These limitations impede progress on 3D human pose and motion estimation from the first-person perspective.
2. **Technical challenges**: Data captured from the first-person view of a head-mounted device poses unique challenges, such as severe body truncation, motion blur (aggravated by the physical movement of the HMD), and people entering and leaving the field of view. Existing methods handle these challenges poorly.
3. **Lack of benchmarking**: No benchmark currently exists specifically for 3D human pose and shape estimation from the first-person perspective, which makes it difficult to evaluate how existing methods perform in this setting.
### Main contributions of the paper:
1. **EgoBody dataset**: Propose EgoBody, a large-scale first-person dataset of 125 sequences that contains high-quality 3D human pose, shape and motion ground truth, together with rich multi-modal data (RGB, depth, eye gaze, head and hand tracking).
2. **Benchmarking**: Establish the first benchmark for 3D full-body pose and shape estimation from the first-person perspective, and evaluate existing methods extensively, revealing their limitations in this setting.
3. **Performance improvement**: Fine-tune existing methods on the EgoBody training set, which significantly improves their performance and robustness on the EgoBody test set and on other first-person datasets.
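This summary does not state which metrics the benchmark reports. As an illustration only, 3D pose benchmarks commonly use the mean per-joint position error (MPJPE): the average Euclidean distance between predicted and ground-truth joints. The sketch below assumes both poses use the same joint order and units; the function name `mpjpe` and the toy joint values are hypothetical and are not taken from the EgoBody code.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error between two 3D poses.

    pred, gt: sequences of (x, y, z) joint positions, in the same
    joint order and the same units (an assumption of this sketch).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Average the per-joint Euclidean distances.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy two-joint pose (made-up values, in meters).
gt_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_pose = [(0.0, 0.0, 0.1), (1.0, 0.2, 0.0)]
error = mpjpe(pred_pose, gt_pose)
print(f"MPJPE: {error:.3f} m")
```

Comparing this value before and after fine-tuning is one simple way to quantify the kind of improvement described in the third contribution.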
### Key technologies in the solution:
- **Multi-modal data acquisition**: Use Microsoft HoloLens2 headsets and multiple Azure Kinect cameras to capture synchronized multi-modal data, including RGB images, depth maps, eye gaze, and head and hand tracking.
- **High-quality 3D ground-truth generation**: Use markerless motion capture, fitting the SMPL-X body model to multi-view RGB-D data, to reconstruct accurate 3D human pose and shape relative to the scene.
- **Calibration and synchronization**: Refine the calibration between the HoloLens2 and the Kinect rig with a keypoint-based optimization scheme to keep the data streams consistent and accurate.
- **Temporal consistency optimization**: Apply a temporal smoothing optimization to the initial per-frame estimates to obtain natural and coherent human motion.
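The temporal consistency step can be illustrated with a minimal sketch. It assumes a simple quadratic objective, a data term that keeps each frame close to its per-frame estimate plus a smoothness term that penalizes frame-to-frame changes, minimized by plain gradient descent on a single scalar coordinate. This is not the paper's actual optimization; the function names, the weight `lam` and the toy trajectory are hypothetical.

```python
def smooth_trajectory(frames, lam=5.0, iters=500):
    """Smooth a 1D trajectory of per-frame estimates.

    Minimizes  sum_t (x_t - y_t)^2 + lam * sum_t (x_{t+1} - x_t)^2
    where y_t are the per-frame estimates (frames) and x_t the result.
    """
    x = list(frames)
    n = len(x)
    # The Hessian's largest eigenvalue is at most 2 + 8 * lam,
    # so this step size keeps gradient descent stable.
    step = 1.0 / (2.0 + 8.0 * lam)
    for _ in range(iters):
        grad = []
        for t in range(n):
            g = 2.0 * (x[t] - frames[t])              # data term
            if t > 0:
                g += 2.0 * lam * (x[t] - x[t - 1])    # smoothness w.r.t. previous frame
            if t < n - 1:
                g += 2.0 * lam * (x[t] - x[t + 1])    # smoothness w.r.t. next frame
            grad.append(g)
        x = [xt - step * g for xt, g in zip(x, grad)]
    return x


def roughness(x):
    """Sum of squared frame-to-frame differences."""
    return sum((b - a) ** 2 for a, b in zip(x, x[1:]))


# Toy jittery per-frame estimates of one joint coordinate (made-up values).
noisy = [0.0, 1.0, 0.0, 1.0, 0.0]
smoothed = smooth_trajectory(noisy)
print(roughness(noisy), roughness(smoothed))
```

A larger `lam` favors smoother motion at the cost of drifting further from the per-frame estimates. A real pipeline would apply a term of this kind to body model parameters or joint positions rather than to a single scalar.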
In conclusion, by building the EgoBody dataset and its accompanying benchmark, the paper provides the data and evaluation protocol needed to advance 3D human pose and shape estimation from the first-person perspective.