Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Zicong Fan, Takehiko Ohkawa, Linlin Yang, Nie Lin, Zhishan Zhou, Shihao Zhou, Jiajun Liang, Zhong Gao, Xuanyang Zhang, Xue Zhang, Fei Li, Zheng Liu, Feng Lu, Karim Abou Zeid, Bastian Leibe, Jeongwan On, Seungryul Baek, Aditya Prakash, Saurabh Gupta, Kun He, Yoichi Sato, Otmar Hilliges, Hyung Jin Chang, Angela Yao
2024-08-06
Abstract: We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future hand studies on egocentric hand-object interactions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses 3D pose estimation of hand-object interactions from an egocentric view. Specifically:

1. **3D Hand Pose Estimation**: Based on the AssemblyHands dataset, the paper sets up the task of estimating 3D hand poses from single-view images.
2. **Consistent Motion Reconstruction**: Based on the ARCTIC dataset, the paper sets up the task of estimating the poses of hands and movable objects from RGB images to reconstruct their 3D surfaces during interaction.

### Specific Challenges

- **Occlusion and Viewpoint Bias**: Heavy occlusion between hands and objects, viewpoints that change rapidly with head movement, camera distortion, and motion blur make accurate 3D reconstruction of hands and objects very challenging.
- **Dataset Limitations**: Earlier datasets lack scale and diversity in bimanual manipulation, which limits realistic evaluation of real-world interactions.
- **Method Limitations**: Existing methods still struggle with fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects.

### Main Contributions

- Introduction of the HANDS23 challenge, with training and test splits designed on top of the AssemblyHands and ARCTIC datasets.
- Analysis of the top submitted methods and more recent leaderboard baselines, showing the effectiveness of explicitly or implicitly handling camera distortion, adopting high-capacity Transformer models, and fusing predictions from multiple views.
- In-depth analysis of the scenarios that remain challenging, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between hands and objects.
- Insights for future research drawn from a comprehensive analysis of the two benchmarks.
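
To make the idea of fusing predictions from different views more concrete, here is a minimal sketch in Python. It is not the method of any challenge entry: it assumes each view outputs 3D joints in its own camera frame, that camera-to-world transforms are known, and that fusion is a simple confidence-weighted average. The function name `fuse_multiview_joints` and all parameter names are hypothetical.

```python
import numpy as np


def fuse_multiview_joints(joints_cam, cam_to_world, weights=None):
    """Fuse per-view 3D joints by weighted averaging in a shared world frame.

    joints_cam:   (V, J, 3) joints predicted in each camera's own frame.
    cam_to_world: (V, 4, 4) homogeneous camera-to-world transforms
                  (assumed known from calibration).
    weights:      optional (V, J) positive per-view, per-joint confidences;
                  uniform weights are used when omitted.
    Returns a (J, 3) array of fused joints in world coordinates.
    """
    joints_cam = np.asarray(joints_cam, dtype=float)
    cam_to_world = np.asarray(cam_to_world, dtype=float)
    num_views, num_joints, _ = joints_cam.shape

    rotation = cam_to_world[:, :3, :3]    # (V, 3, 3)
    translation = cam_to_world[:, :3, 3]  # (V, 3)

    # Map every view's joints into the world frame: p_world = R @ p_cam + t.
    joints_world = (
        np.einsum("vij,vkj->vki", rotation, joints_cam)
        + translation[:, None, :]
    )

    if weights is None:
        weights = np.ones((num_views, num_joints))
    weights = np.asarray(weights, dtype=float)

    # Normalise so the weights of each joint sum to one across views.
    weights = weights / weights.sum(axis=0, keepdims=True)

    return (weights[..., None] * joints_world).sum(axis=0)
```

A weighted average is only one possible fusion rule. Per-joint confidences would let a view in which a joint is occluded contribute less, which is the kind of situation the occlusion challenge above describes.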