1st Place Solution of Multiview Egocentric Hand Tracking Challenge ECCV2024

Minqiang Zou, Zhi Lv, Riqiang Jin, Tian Zhan, Mochen Yu, Yao Tang, Jiajun Liang
2024-10-08
Abstract: Multi-view egocentric hand tracking is a challenging task and plays a critical role in VR interaction. In this report, we present a method that uses multi-view input images and camera extrinsic parameters to estimate both hand shape and pose. To reduce overfitting to the camera layout, we apply crop jittering and extrinsic parameter noise augmentation. Additionally, we propose an offline neural smoothing post-processing method to further improve the accuracy of hand position and pose. Our method achieves 13.92 mm MPJPE on the Umetrack dataset and 21.66 mm MPJPE on the HOT3D dataset.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses multi-view egocentric hand tracking, a challenging task that is critical for virtual reality (VR) interaction. The proposed method takes multi-view input images and camera extrinsic parameters and estimates both the shape and the pose of the hand. To reduce overfitting to a specific camera layout, the authors apply crop jittering and extrinsic parameter noise augmentation, and they propose an offline neural smoothing post-processing step that further improves the accuracy of hand position and pose.

The main contribution is a unified architecture that handles both single-view and multi-view inputs, built from a feature extraction stage, a feature fusion module, and a regression part. The method achieves an MPJPE (Mean Per Joint Position Error) of 13.92 mm on the Umetrack dataset and 21.66 mm on the HOT3D dataset. Robustness to the camera layout matters because accurate, seamless interaction has to hold across different camera setups.
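To make the terms above concrete, here is a minimal NumPy sketch of the MPJPE metric and of the two augmentations named in the summary. It is an illustration under stated assumptions, not the authors' implementation: the function names (`mpjpe`, `jitter_crop`, `perturb_extrinsics`), the square-crop representation `(cx, cy, size)`, the 4x4 extrinsic matrix layout, and all noise magnitudes are hypothetical choices that the report summary does not specify.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Per Joint Position Error: mean Euclidean distance between
    predicted and ground-truth joints, in the units of the inputs."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def jitter_crop(box, rng, max_shift=0.1, max_scale=0.1):
    """Randomly shift and rescale a square crop given as (cx, cy, size).
    The shift and scale ranges are assumed values, not from the report."""
    cx, cy, size = box
    dx, dy = rng.uniform(-max_shift, max_shift, size=2) * size
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    return (cx + dx, cy + dy, size * scale)


def perturb_extrinsics(T, rng, rot_std_deg=1.0, trans_std_mm=2.0):
    """Add a small random rotation and translation to a 4x4 extrinsic matrix.
    The noise levels are assumed values, not from the report."""
    # Sample an axis-angle vector and convert it with Rodrigues' formula.
    rotvec = rng.normal(0.0, np.deg2rad(rot_std_deg), size=3)
    angle = np.linalg.norm(rotvec)
    K = np.zeros((3, 3))
    if angle > 0:
        k = rotvec / angle
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
    R_noise = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

    T_noisy = np.array(T, dtype=float, copy=True)
    T_noisy[:3, :3] = R_noise @ T_noisy[:3, :3]
    T_noisy[:3, 3] += rng.normal(0.0, trans_std_mm, size=3)
    return T_noisy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = np.zeros((21, 3))                   # 21 hand joints, assumed layout
    pred = gt + np.array([3.0, 4.0, 0.0])    # every joint off by the same offset
    print("MPJPE:", mpjpe(pred, gt))
    print("jittered crop:", jitter_crop((100.0, 100.0, 50.0), rng))
    print("noisy extrinsics:\n", perturb_extrinsics(np.eye(4), rng))
```

Both augmentations would be applied only at training time, so that the network sees slightly different crops and camera poses on each pass and relies less on one fixed camera layout.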