Geometric-aware RGB-D Representation Learning for Hand-Object Reconstruction

Jiajun Ma, Yanmin Zhou, Zhipeng Wang, Hongrui Sang, Rong Jiang, Bin He
DOI: https://doi.org/10.1016/j.eswa.2024.124995
IF: 8.5
2024-01-01
Expert Systems with Applications
Abstract: Reconstructing hand–object interaction from single images, crucial for interactive applications, is challenging due to the diversity of hand–object poses and shapes. Depth maps effectively complement RGB data by providing geometric cues about these interactions in challenging scenes. However, most existing methods do not fully exploit RGB-D information fused at different feature levels for reconstruction, which limits their ability to capture the geometric details of hand–object interaction. To address this, we propose an implicit geometric-aware RGB-D representation learning approach using adaptive bidirectional RGB-D feature fusion (ABF) and geometric Fourier feature encoding (GFFE). The method leverages the hierarchical and complementary high-level features of RGB-D information to enhance the neural implicit representations of hand–object interaction during the feature extraction and encoding stages. First, a two-stream RGB-point cloud encoder, combining CNN and Transformer architectures, extracts appearance and geometric information from RGB and point cloud data. Next, ABF fuses this information, generating dense RGB-D features by exploiting the complementarity of appearance and geometric features along with the varying sensitivities of different layers. Finally, GFFE introduces frequency-domain information so that high-frequency and low-frequency features of objects are represented equally, allowing the geometric details of hand–object interaction to be captured precisely. Experiments conducted on the real-world DexYCB and the synthetic ObMan benchmarks demonstrate that our approach significantly outperforms existing approaches.
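To make the idea of Fourier feature encoding concrete, the following is a minimal sketch of the generic technique applied to a single 3D point. It is not the paper's GFFE: the function name `fourier_features`, the `num_bands` parameter, and the power-of-two frequency schedule are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math


def fourier_features(point, num_bands=4):
    """Encode a 3D point with sine/cosine features at several frequencies.

    Hypothetical illustration of generic Fourier feature encoding;
    the frequency schedule (2**k * pi) is an assumption, not taken
    from the paper.
    """
    # Keep the raw coordinates so low-frequency structure is preserved.
    features = list(point)
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        for coord in point:
            # Higher bands oscillate faster and expose finer geometric detail.
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


if __name__ == "__main__":
    encoded = fourier_features((0.1, -0.2, 0.3), num_bands=4)
    print(len(encoded))  # 3 raw coordinates + 3 * 2 * 4 sinusoidal features
```

In a sketch like this, the encoded vector would typically be concatenated with image or point cloud features before being passed to an implicit decoder; how the paper combines them is not stated in the abstract.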