Cross-modal Attention and Geometric Contextual Aggregation Network for 6DoF Object Pose Estimation

Yi Guo, Fei Wang, Hao Chu, Shiguang Wen
DOI: https://doi.org/10.1016/j.neucom.2024.128891
IF: 6
2024-01-01
Neurocomputing
Abstract: Affordable RGB-D sensors have made RGB-D images a practical input for accurate 6D pose estimation at a reasonable cost. A crucial research challenge is to adaptively extract and fuse features from the appearance information of RGB images and the geometric information of depth images. Moreover, previous methods have neglected the spatial geometric relationships among local positions and the properties of point features, both of which are beneficial for pose estimation in occluded scenes. In this work, we propose a cross-attention fusion framework for learning 6D pose estimation from RGB-D images. In the feature extraction stage, we design a geometry-aware context network that encodes the local geometric properties of objects in point clouds using two criteria: distance and geometric angles. We further propose a cross-attention framework that combines spatial and channel attention in a cross-modal manner. This framework captures the correlation between RGB and depth features and their importance, improving pose estimation accuracy, particularly in complex scenes. Experimental results show that the proposed method outperforms state-of-the-art methods on four challenging benchmark datasets: YCB-Video, LineMOD, Occlusion LineMOD, and MP6D. A video is available at https://youtu.be/4mgdbQKaHOc.