Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

Zhibin Xiao, Pengwei Xie, Guijin Wang
DOI: https://doi.org/10.1007/978-3-030-98358-1_28
2022-01-01
Abstract: RGB-D object detection is a fundamental yet challenging task because of the inherent difference between RGB and depth information. In this paper, we propose a Multi-scale Cross-modal Transformer Network (MCTNet) consisting of two components: the Multi-modal Feature Pyramid module (MFP) and the Cross-Modal Transformer (CMTrans). Specifically, we introduce the MFP to enrich high-level semantic features with geometric information and to enhance low-level geometric cues with semantic features, which is shown to facilitate the subsequent cross-modal feature fusion. Furthermore, we develop the CMTrans to exploit long-range attention between the enhanced RGB and depth features, enabling the network to focus on regions of interest. Extensive experiments show that MCTNet surpasses state-of-the-art detectors by 1.6% mAP on SUN RGB-D and 1.0% mAP on NYU Depth v2, which demonstrates the effectiveness of the proposed method.
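The abstract does not give the internals of CMTrans, so the following is only a minimal sketch of generic scaled dot-product cross-attention between two modalities, not the authors' implementation. RGB tokens act as queries and depth tokens supply keys and values, so each RGB location can attend to every depth location. The function name `cross_modal_attention`, the token shapes, the random projection matrices, and the residual fusion step are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Hypothetical helper: single-head cross-attention.

    query_tokens:   (N_q, d) features from one modality (here RGB)
    context_tokens: (N_c, d) features from the other modality (here depth)
    Returns the attended features (N_q, d) and the attention map (N_q, N_c).
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    # Every query location is scored against every context location,
    # which is what gives the attention its long-range reach.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8  # assumed channel dimension
rgb_tokens = rng.standard_normal((16, d))    # e.g. a flattened 4x4 RGB feature map
depth_tokens = rng.standard_normal((16, d))  # e.g. a flattened 4x4 depth feature map
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

rgb_attended, attn = cross_modal_attention(rgb_tokens, depth_tokens, w_q, w_k, w_v)
fused_rgb = rgb_tokens + rgb_attended  # assumed residual fusion, not taken from the paper
```

Swapping the roles of `rgb_tokens` and `depth_tokens` gives the depth-to-RGB direction. A multi-scale variant would repeat the same operation at each pyramid level.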