Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes

Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai

2024-09-12

Abstract: RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi-scale features to meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses semantic segmentation and road detection in driving scenes, and in particular the failure of existing methods to fully exploit the intrinsic spatial properties of depth maps when processing RGB-D data. This oversight shifts the attention representation away from the correct regions, which causes prediction errors. To counter it, the authors propose a learnable Depth interaction Pyramid Transformer (DiPFormer). It introduces Depth Spatial-Aware Optimization (Depth SAO) as an offset that represents real-world spatial relationships, and it uses Depth Linear Cross-Attention (Depth LCA) to learn the similarity between RGB and depth feature spaces, clarifying spatial differences at the pixel level. An MLP decoder then fuses multi-scale features to meet real-time requirements. The reported results are a 97.57% F-score on KITTI road, 68.74% mIoU on KITTI-360, and 83.4% mIoU on Cityscapes, with stated gains of +7.5% in road detection and +4.9% / +1.5% in semantic segmentation.
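To make the idea of linear cross-attention between the two modalities more concrete, here is a minimal sketch in plain Python. It is not the paper's Depth LCA: it assumes the generic kernelized linear-attention formulation with the feature map elu(x) + 1, it assumes queries come from RGB tokens while keys and values come from depth tokens, and it omits learned projections entirely. The function names `feature_map` and `linear_cross_attention` are hypothetical.

```python
import math


def feature_map(x):
    # Assumed kernel feature map: elu(v) + 1, which is strictly positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_cross_attention(rgb_tokens, depth_tokens):
    """Generic linear cross-attention sketch (not the paper's exact method).

    rgb_tokens:   N x d list of lists, used as queries (assumption).
    depth_tokens: M x d list of lists, used as keys and values (assumption).
    Returns an N x d list of lists.
    """
    d = len(depth_tokens[0])

    # Summarize all depth tokens once:
    #   kv = sum_j phi(k_j) v_j^T   (d x d matrix)
    #   z  = sum_j phi(k_j)         (d vector)
    kv = [[0.0] * d for _ in range(d)]
    z = [0.0] * d
    for token in depth_tokens:
        k = feature_map(token)
        for a in range(d):
            z[a] += k[a]
            for b in range(d):
                kv[a][b] += k[a] * token[b]

    # Each RGB query reads from the summary instead of every depth token.
    out = []
    for token in rgb_tokens:
        q = feature_map(token)
        denom = sum(q[a] * z[a] for a in range(d))
        out.append(
            [sum(q[a] * kv[a][b] for a in range(d)) / denom for b in range(d)]
        )
    return out
```

Because the depth tokens are summarized once, the cost of this sketch grows as (N + M) · d² rather than N · M · d, which is the usual motivation for linear attention when real-time inference matters. Each output row is a positively weighted average of the depth tokens, so if all depth tokens are identical the output reproduces that token for every query.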

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Depth Estimation of Traffic Scenes from Image Sequence Using Deep Learning.

RGB×D: Learning Depth-Weighted RGB Patches for RGB-D Indoor Semantic Segmentation

Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection

Depth-Induced Gap-Reducing Network for RGB-D Salient Object Detection: an Interaction, Guidance and Refinement Approach

Salient Object Detection for RGBD Video Via Spatial Interaction and Depth-Based Boundary Refinement

Integrating 3D structure into traffic scene understanding with RGB-D data.

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

Depth Injection Framework for RGBD Salient Object Detection.

DBCNet: Dynamic Bilateral Cross-Fusion Network for RGB-T Urban Scene Understanding in Intelligent Vehicles

Learning Cross-Modality Interaction for Robust Depth Perception of Autonomous Driving

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation

HDBFormer: Efficient RGB-D Semantic Segmentation With a Heterogeneous Dual-Branch Framework

Adaptive Semantic Fusion Framework for Unsupervised Monocular Depth Estimation

TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation

DDNet: Depth Dominant Network for Semantic Segmentation of RGB-D Images

A Fusion Network for Semantic Segmentation Using RGB-D Data

Traffic Scene Segmentation Based on RGB-D Image and Deep Learning

OccDepth: A Depth-Aware Method for 3D Semantic Scene Completion

DepthRefiner: Adapting RGB Trackers to RGBD Scenes Via Depth-Fused Refinement