CMA: Cross-modal Attention for 6D Object Pose Estimation

Lu Zou, Zhangjin Huang, Fangjun Wang, Zhouwang Yang, Guoping Wang
DOI: https://doi.org/10.1016/j.cag.2021.04.018
2021-01-01
Abstract: Deep learning methods for 6D object pose estimation based on RGB and depth (RGB-D) images have been successfully applied to robotic manipulation and grasping. Among these approaches, the fusion of the RGB and depth modalities is one of the most critical issues. Most existing works perform fusion via either simple concatenation or element-wise multiplication of the features generated by the two modalities. Despite achieving impressive progress, such fusion strategies do not explicitly consider the different contributions of the RGB and depth modalities, leaving room for performance improvement. In this paper, we present a Cross-Modal Attention (CMA) component for the problem of 6D object pose estimation. With the attention mechanism, features of the two modalities are aggregated adaptively through the attention weights, so that powerful representations can be efficiently extracted from the RGB-D images. Comprehensive experiments on both the LINEMOD and YCB-Video datasets demonstrate that the proposed approach achieves state-of-the-art performance. (c) 2021 Elsevier Ltd. All rights reserved.
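The contrast the abstract draws between simple concatenation and attention-weighted aggregation can be illustrated with a small sketch. This is a generic toy example and not the CMA component from the paper, whose architecture the abstract does not describe. The function names, the per-point feature shapes, and the scoring vectors `w_rgb` and `w_depth` (stand-ins for learned parameters) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concat_fusion(rgb_feat, depth_feat):
    # Baseline: both modalities are stacked side by side and treated equally.
    return np.concatenate([rgb_feat, depth_feat], axis=-1)

def attention_fusion(rgb_feat, depth_feat, w_rgb, w_depth):
    # Hypothetical scoring: one scalar score per point and per modality.
    scores = np.stack([rgb_feat @ w_rgb, depth_feat @ w_depth], axis=-1)  # (N, 2)
    # Normalize across the two modalities so the weights sum to 1 per point.
    weights = softmax(scores, axis=-1)
    # Weighted sum: each point decides how much to rely on RGB versus depth.
    fused = weights[:, :1] * rgb_feat + weights[:, 1:] * depth_feat
    return fused, weights

# Random stand-ins for per-point features and learned parameters (assumed shapes).
rng = np.random.default_rng(0)
n_points, dim = 5, 8
rgb_feat = rng.normal(size=(n_points, dim))
depth_feat = rng.normal(size=(n_points, dim))
w_rgb = rng.normal(size=dim)
w_depth = rng.normal(size=dim)

fused, weights = attention_fusion(rgb_feat, depth_feat, w_rgb, w_depth)
concat = concat_fusion(rgb_feat, depth_feat)
print(fused.shape, concat.shape)
print(weights.round(3))
```

In this sketch the attention weights vary per point, so a point with an unreliable depth reading could in principle lean on its RGB feature, whereas concatenation leaves any such weighting to later layers.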