Abstract: We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis of 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future studies on egocentric hand-object interactions.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses 3D pose estimation and reconstruction of hand-object interactions from egocentric views. Specifically:

1. **3D Hand Pose Estimation**: Based on the AssemblyHands dataset, the challenge poses the task of estimating 3D hand poses from single-view images.
2. **Consistent Motion Reconstruction**: Based on the ARCTIC dataset, the challenge poses the task of estimating the poses of hands and articulated objects from RGB images in order to reconstruct their 3D surfaces during interaction.

### Specific Challenges

- **Motion Blur and Viewpoint Bias**: Motion blur from head movement, viewpoint bias, camera distortion, and heavy occlusion between hands and objects make accurate 3D reconstruction difficult.
- **Dataset Limitations**: Earlier datasets lack scale and diversity in bimanual manipulation, which limits realistic evaluation of real-world interactions.
- **Method Limitations**: Existing methods still struggle with fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects.

### Main Contributions

- Introduction of the HANDS23 challenge, with training and testing splits designed on top of the AssemblyHands and ARCTIC datasets.
- Analysis of the top submitted methods and more recent leaderboard baselines, showing the effectiveness of explicit or implicit handling of egocentric camera distortion, high-capacity Transformer models, and multi-view prediction fusion.
- In-depth analysis of the scenarios that remain challenging: fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects.
- Insights for future research drawn from the analysis of the two benchmarks.
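The summary mentions multi-view prediction fusion without describing a specific mechanism. As a minimal sketch, assuming each view yields 3D keypoints in its own camera frame and that camera-to-world extrinsics are known, one simple option is a confidence-weighted average in a shared world frame. The function name `fuse_multiview_keypoints`, its parameters, and the weighting scheme are illustrative assumptions, not the method used by any challenge entry.

```python
import numpy as np


def fuse_multiview_keypoints(keypoints_cam, cam_to_world, weights=None):
    """Fuse per-view 3D keypoints into one world-frame estimate.

    Hypothetical helper for illustration only.

    keypoints_cam: (V, J, 3) keypoints predicted in each camera's own frame.
    cam_to_world:  (V, 4, 4) homogeneous camera-to-world transforms (assumed known).
    weights:       optional (V, J) non-negative confidences; uniform if None.
    Returns:       (J, 3) fused keypoints in the world frame.
    """
    keypoints_cam = np.asarray(keypoints_cam, dtype=float)
    cam_to_world = np.asarray(cam_to_world, dtype=float)
    num_views, num_joints, _ = keypoints_cam.shape

    rotation = cam_to_world[:, :3, :3]     # (V, 3, 3)
    translation = cam_to_world[:, :3, 3]   # (V, 3)

    # x_world = R @ x_cam + t, applied to every joint of every view.
    keypoints_world = (
        np.einsum("vij,vkj->vki", rotation, keypoints_cam)
        + translation[:, None, :]
    )

    if weights is None:
        weights = np.ones((num_views, num_joints))
    weights = np.asarray(weights, dtype=float)

    # Normalize confidences per joint across views; a joint whose weights are
    # all zero is returned as the zero vector.
    total = weights.sum(axis=0, keepdims=True)
    total = np.where(total > 0, total, 1.0)
    normalized = weights / total

    return (normalized[..., None] * keypoints_world).sum(axis=0)
```

A plain weighted mean keeps the sketch short; a robust alternative such as a per-joint median would be less sensitive to a single view that is badly occluded, at the cost of ignoring confidences.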

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
