Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes

Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai
2024-09-12
Abstract: RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. First, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Second, Depth Linear Cross-Attention (Depth LCA) learns the similarity between RGB and depth in feature space to clarify spatial differences at the pixel level. Finally, an MLP Decoder fuses multi-scale features efficiently to meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer substantially mitigates attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses semantic segmentation and road detection in traffic scenes for assisted driving, and in particular the failure of existing RGB-D methods to fully exploit the intrinsic spatial properties of depth maps. Neglecting these properties causes attention shift (attention placed on the wrong regions), which in turn produces prediction errors. To this end, the authors propose a learnable Depth interaction Pyramid Transformer (DiPFormer). It introduces Depth Spatial-Aware Optimization (Depth SAO), which acts as an offset representing real-world spatial relationships, and Depth Linear Cross-Attention (Depth LCA), which learns the similarity between RGB and depth in feature space to clarify pixel-level spatial differences. An MLP decoder then fuses the multi-scale features efficiently enough to meet real-time requirements. According to the abstract, DiPFormer improves road detection by 7.5% and semantic segmentation by 4.9% / 1.5%, reaching a 97.57% F-score on KITTI road, 68.74% mIoU on KITTI-360, and 83.4% mIoU on Cityscapes.
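
To make the idea of linear cross-attention between RGB and depth features more concrete, the sketch below shows a generic kernelized (linear-complexity) cross-attention in NumPy, with queries taken from RGB tokens and keys and values taken from depth tokens. This is not the authors' Depth LCA: the abstract does not give its formulation, so the feature map (elu(x) + 1), the choice of which modality supplies queries, and all names (feature_map, linear_cross_attention, w_q, w_k, w_v) are assumptions made purely for illustration.

```python
import numpy as np


def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumed choice, common in
    # linear-attention variants). np.minimum avoids overflow in exp.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_cross_attention(rgb_tokens, depth_tokens, w_q, w_k, w_v):
    """Generic linear cross-attention (illustrative, not the paper's Depth LCA).

    rgb_tokens:   (N, C) flattened RGB features, used as queries (assumption)
    depth_tokens: (M, C) flattened depth features, used as keys and values
    w_q, w_k:     (C, d) projection matrices
    w_v:          (C, d_v) projection matrix
    returns:      (N, d_v) depth-informed features for each RGB token
    """
    q = feature_map(rgb_tokens @ w_q)    # (N, d)
    k = feature_map(depth_tokens @ w_k)  # (M, d)
    v = depth_tokens @ w_v               # (M, d_v)

    # Aggregate keys and values once: cost O(M * d * d_v) instead of
    # building the full (N, M) attention matrix.
    kv = k.T @ v                         # (d, d_v)
    normalizer = q @ k.sum(axis=0)       # (N,), strictly positive

    return (q @ kv) / normalizer[:, None]
```

With rgb_tokens and depth_tokens of shape (N, C) and projections of shape (C, d), the function returns an (N, d) array that matches the result of explicitly forming the row-normalized (N, N) similarity matrix and multiplying it by the values, but without materializing that matrix. That reordering is what keeps the cost linear in the number of pixels, which is the usual motivation for linear attention in real-time settings.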