Abstract: We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.

What problem does this paper attempt to address?

The paper addresses the problem of **accurately tracking hands and objects in 3D from egocentric (first-person) multi-view video**. To that end, it introduces HOT3D, a public dataset designed specifically for egocentric 3D hand and object tracking.

### Main problems:

1. **Lack of high-quality egocentric multi-view data**: Most existing datasets are single-view or exocentric (external-view), so methods trained and evaluated on them cannot exploit multi-view information to improve hand and object tracking accuracy.
2. **Complexity of hand-object interaction**: Interactions between hands and objects involve complex, dynamic grasping actions, which are challenging for existing methods.
3. **Limitations of existing methods**: Existing hand and object tracking methods perform poorly in complex scenes, especially during hand-object interaction.

### Characteristics of the HOT3D dataset:

- **Multi-view synchronized data streams**: The dataset contains multi-view RGB/monochrome image streams from Project Aria and Quest 3 devices, captured synchronously using hardware triggers.
- **High-quality annotations**: It provides high-precision 3D pose annotations for hands and objects, as well as camera pose annotations.
- **Diverse scenes and objects**: The dataset includes 19 participants interacting with 33 different rigid objects, covering typical scenes such as kitchens, offices, and living rooms.
- **Rich multi-modal signals**: In addition to the image streams, it provides eye-gaze information and 3D point clouds generated by SLAM.

### Main contributions of the paper:

1. **Release of the HOT3D dataset**: This is the first large-scale egocentric multi-view dataset of its kind, providing high-quality hand and object pose annotations and supporting research on multiple 2D/3D tasks.
2. **Development of strong baselines**: Strong multi-view baselines have been developed for 6DoF object pose estimation and 3D lifting of unknown in-hand objects.
3. **Verification of the effectiveness of multi-view methods**: Experimental results show that multi-view methods significantly outperform single-view methods in 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects.

Through these contributions, the paper aims to advance research on 3D hand and object tracking and to provide a strong benchmark for future work.
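To give some intuition for why multiple calibrated views can help, the sketch below triangulates a 3D point from two viewing rays using the midpoint method. It is a generic textbook illustration, not the method evaluated in the paper: the function name `triangulate_midpoint`, the 10 cm baseline, and the example coordinates are all assumptions made for this example. A single ray leaves depth unconstrained, while a second ray from a different camera centre pins the point down.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _scale(a: Vec3, s: float) -> Vec3:
    return (a[0] * s, a[1] * s, a[2] * s)


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangulate_midpoint(o1: Vec3, d1: Vec3, o2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2.

    Hypothetical helper for illustration only. Origins are camera centres and
    directions are viewing rays, all expressed in one shared world frame.
    """
    w = _sub(o1, o2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays carry no depth information.
        raise ValueError("rays are parallel; depth is not observable")
    # Closed-form minimiser of |(o1 + s*d1) - (o2 + t*d2)|^2 over s and t.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = _add(o1, _scale(d1, s))
    p2 = _add(o2, _scale(d2, t))
    return _scale(_add(p1, p2), 0.5)


if __name__ == "__main__":
    # Assumed setup: two cameras 10 cm apart observing a point about 40 cm away.
    point = (0.1, -0.05, 0.4)
    cam1, cam2 = (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)
    ray1, ray2 = _sub(point, cam1), _sub(point, cam2)
    print(triangulate_midpoint(cam1, ray1, cam2, ray2))
```

With noise-free rays the two closest points coincide and the original point is recovered. With noisy rays the midpoint is only a simple estimate, and practical systems typically refine it by minimising reprojection error.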

HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
