Decoupling Heterogeneous Features for Robust 3D Interacting Hand Poses Estimation

Huan Yao, Changxing Ding, Xuanda Xu, Zhifeng Lin
DOI: https://doi.org/10.1145/3664647.3681068
2024-01-01
Abstract: Estimating the 3D poses of interacting hands from a monocular image is challenging due to the similarity in appearance between hand parts. Therefore, utilizing the appearance features alone tends to result in unreliable pose estimation. Existing approaches directly fuse the appearance features with position features, ignoring that the two types of features are heterogeneous. Here, the appearance features are derived from the RGB values of pixels, while the position features are mapped from the coordinates of pixels or joints. To address this problem, we present a novel framework called Decoupled Feature Learning (DFL) for 3D pose estimation of interacting hands. By decoupling the appearance and position features, we facilitate the interactions within each feature type and those between both types of features. First, we compute the appearance relationships between the joint queries and the image feature maps; we utilize these relationships to aggregate each joint's appearance and position features. Second, we compute the 3D spatial relationships between hand joints using their position features; we utilize these relationships to guide the feature enhancement of joints. Third, we calculate appearance relationships and spatial relationships between the joints and the image using the appearance and position features, respectively; we utilize these complementary relationships to improve the localization of the joints in the image. The two processes mentioned above are conducted iteratively. Finally, only the refined position features are used for hand pose estimation. This strategy avoids the step of mapping heterogeneous appearance features to hand-joint positions. Our method significantly outperforms state-of-the-art methods on the large-scale InterHand2.6M dataset. More impressively, our method exhibits strong generalization ability on in-the-wild images.
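The first two steps of the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the scaled dot-product form of the relationships, the residual update, the feature sizes, and the count of 42 joints are all assumptions made for illustration. The only property taken from the abstract is that appearance and position features are kept in separate streams, with appearance relationships driving aggregation and position relationships driving enhancement.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def appearance_aggregate(query_app, img_app, img_pos):
    """Step 1 (assumed form): joint-to-image relationships are computed from
    appearance features only, then reused to aggregate BOTH streams."""
    scale = np.sqrt(query_app.shape[-1])
    attn = softmax(query_app @ img_app.T / scale)  # (J, N), rows sum to 1
    joint_app = attn @ img_app                     # (J, D_APP)
    joint_pos = attn @ img_pos                     # (J, D_POS)
    return joint_app, joint_pos, attn

def spatial_enhance(joint_app, joint_pos):
    """Step 2 (assumed form): joint-to-joint relationships are computed from
    position features only and guide a residual update of the joints."""
    scale = np.sqrt(joint_pos.shape[-1])
    attn = softmax(joint_pos @ joint_pos.T / scale)  # (J, J)
    return joint_app + attn @ joint_app, joint_pos + attn @ joint_pos

# Hypothetical sizes: 42 joint queries, 64 image tokens.
rng = np.random.default_rng(0)
J, N, D_APP, D_POS = 42, 64, 32, 16
query_app = rng.normal(size=(J, D_APP))  # joint queries (appearance stream)
img_app = rng.normal(size=(N, D_APP))    # image appearance features
img_pos = rng.normal(size=(N, D_POS))    # image position features

joint_app, joint_pos, attn = appearance_aggregate(query_app, img_app, img_pos)
joint_app, joint_pos = spatial_enhance(joint_app, joint_pos)
```

In this sketch the two streams never mix through concatenation or addition; a pose head following the abstract's final step would read only `joint_pos`.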