HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos

Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, Tomas Hodan
2024-11-28
Abstract: We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The paper addresses the following problem: **how to accurately track hands and objects in 3D from egocentric (first-person) multi-view videos**. To that end, it introduces HOT3D, a public dataset designed specifically for egocentric 3D hand and object tracking.

### Main problems:

1. **Lack of high-quality first-person multi-view data**: Most existing datasets are single-view or exocentric (external-view), so they cannot fully exploit multi-view information to improve the accuracy of hand and object tracking.
2. **Complexity of hand-object interaction**: Interaction between hands and objects involves complex, dynamic grasping actions, which challenge existing methods.
3. **Limitations of existing methods**: Existing hand and object tracking methods perform poorly in complex scenes, especially during hand-object interaction.

### Characteristics of the HOT3D dataset:

- **Multi-view synchronized data streams**: The dataset contains multi-view RGB/grayscale image streams from the Project Aria and Quest 3 devices, captured synchronously using hardware triggers.
- **High-quality annotations**: It provides high-precision 3D pose annotations of hands and objects, as well as camera pose annotations.
- **Diverse scenes and objects**: The dataset includes 19 participants interacting with 33 diverse rigid objects, covering typical scenarios in kitchens, offices, and living rooms.
- **Rich multi-modal signals**: In addition to the image streams, it provides eye-tracking information and 3D point clouds generated by SLAM.

### Main contributions of the paper:

1. **Release of the HOT3D dataset**: This is the first large-scale first-person multi-view dataset, providing high-quality hand and object pose annotations and supporting research on multiple 2D/3D tasks.
2. **Development of strong baseline models**: Strong multi-view baselines have been developed for 6DoF object pose estimation and 3D lifting of unknown in-hand objects.
3. **Verification of the effectiveness of multi-view methods**: Experimental results show that multi-view methods significantly outperform single-view methods in 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects.

Through these contributions, the paper aims to advance research on 3D hand and object tracking and to provide a strong benchmark for future work.